Advances in Document Intelligence

Document intelligence is advancing on the strength of new models and datasets for document layout analysis, formula recognition, and bibliographic metadata extraction. Researchers are aiming for robust, efficient solutions that generalize across diverse document types and formats. There is a growing emphasis on models that balance accuracy and efficiency so that they can be integrated into large-scale data processing environments. New datasets and benchmarks are also supporting progress by giving the research community shared resources. Noteworthy papers include the following.

PP-DocLayout presents a unified document layout detection model that achieves high precision and efficiency.

PP-FormulaNet introduces a state-of-the-art formula recognition model that excels in both accuracy and efficiency.

TextBite provides a historical Czech document dataset for logical page segmentation.

BiblioPage offers a dataset of scanned title pages for bibliographic metadata extraction.

AnnoPage Dataset focuses on non-textual elements in documents, with fine-grained categorization.

Sources

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition

BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization
