Document Understanding and OCR Research

Report on Current Developments in Document Understanding and OCR Research

General Direction of the Field

Document understanding and Optical Character Recognition (OCR) research is shifting towards unified, high-capacity models that handle a diverse range of tasks beyond traditional text recognition. The shift is driven by the need to process complex, multi-modal document content, including mathematical formulas, tables, charts, and geometric shapes. Recent work favors end-to-end models that process entire documents holistically rather than focusing on isolated text elements.

One key trend is the integration of Vision Transformers (ViTs) into document processing pipelines, which enables more sophisticated feature extraction and aggregation. These models are increasingly trained in a self-supervised fashion, which reduces the dependency on labeled data and improves how well features generalize across document types and historical datasets. There is also a growing emphasis on high-resolution document processing, where compression techniques are used to balance computational cost against performance.
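
To make the high-resolution trade-off concrete, the following is a minimal sketch of why compression matters for ViT-style models: the number of patch tokens grows with image area, so a full-page scan yields far more tokens than a thumbnail. The patch size of 16, the example resolutions, and the helper names `patch_token_count` and `pool_tokens` are illustrative assumptions, and simple average pooling stands in for the learned compression modules used in the work surveyed here.

```python
import math
from typing import List


def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Count the non-overlapping patches a ViT-style model would see.

    Assumes the image is padded up to a multiple of the patch size.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)


def pool_tokens(tokens: List[List[float]], group: int) -> List[List[float]]:
    """Average every run of `group` consecutive tokens into one token.

    This is a naive stand-in for a learned compression module: it only
    shows how the token sequence gets shorter, not how quality is kept.
    """
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Token counts for a small image versus a (hypothetical) page scan.
low_res_tokens = patch_token_count(224, 224)
high_res_tokens = patch_token_count(1792, 1344)

# Toy 2-dimensional tokens, compressed 4x by average pooling.
toy_tokens = [[float(i), float(i) * 2] for i in range(8)]
compressed = pool_tokens(toy_tokens, group=4)

print(low_res_tokens, high_res_tokens, len(toy_tokens), len(compressed))
```

Under these assumptions the page scan produces 48 times as many tokens as the small image, which is why reducing the token count before the language model stage can lower memory use and latency.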

Another notable trend is the unification of image and text recognition within a single model architecture. The approach aims to mimic human visual recognition by integrating lightweight language decoders with vision encoders, so that one model can recognize and interpret visual and textual content simultaneously. This is particularly beneficial for document-related tasks, where the context provided by both images and text is crucial for accurate understanding.

Noteworthy Innovations

  1. Self-Supervised Vision Transformers for Writer Retrieval: This work introduces a ViT-based method for writer retrieval that is reported to outperform prior approaches on historical document datasets and to generalize to modern datasets without fine-tuning.

  2. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model: The proposed GOT model is a step towards OCR-2.0. It handles a wide range of "characters" (plain text as well as content such as formulas, tables, charts, and geometric shapes) and supports interactive OCR features, which makes it versatile for practical applications.

  3. mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding: This model tackles high-resolution document processing with a compression module that reduces GPU memory usage and inference time, and it reports state-of-the-art results on multi-page document understanding benchmarks.

  4. UNIT: Unifying Image and Text Recognition in One Vision Encoder: UNIT's training framework integrates text recognition into a vision encoder, improving performance on document-related tasks while maintaining strong image recognition capabilities.
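
As an illustration of the writer retrieval task in the first item above, the following is a minimal sketch of the general retrieve-by-similarity pattern: per-patch features are aggregated into one descriptor per page, and gallery pages are ranked by cosine similarity to a query page. Mean pooling, the toy 2-dimensional features, and the names `page_descriptor` and `rank_pages` are assumptions made for illustration; the paper's own feature extractor and aggregation scheme may differ.

```python
import math
from typing import Dict, List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def page_descriptor(patch_features: List[List[float]]) -> List[float]:
    """Mean-pool per-patch feature vectors into one unit-length page descriptor."""
    dim = len(patch_features[0])
    count = len(patch_features)
    mean = [sum(f[d] for f in patch_features) / count for d in range(dim)]
    return l2_normalize(mean)


def rank_pages(query: List[float], gallery: Dict[str, List[float]]) -> List[str]:
    """Return gallery page ids ordered from most to least similar to the query.

    Descriptors are unit length, so the dot product equals cosine similarity.
    """
    def cosine(a: List[float], b: List[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sorted(gallery, key=lambda pid: cosine(query, gallery[pid]), reverse=True)


# Hypothetical patch features; a real system would take them from a trained encoder.
gallery = {
    "writer_a_page_1": page_descriptor([[1.0, 0.1], [0.9, 0.2]]),
    "writer_b_page_1": page_descriptor([[0.1, 1.0], [0.2, 0.8]]),
}
query = page_descriptor([[0.8, 0.1], [1.0, 0.3]])

ranking = rank_pages(query, gallery)
print(ranking)
```

In this toy setup the query features point in nearly the same direction as writer A's page, so that page is ranked first. Retrieval quality in practice depends on the encoder and the aggregation step, which is where the self-supervised ViT features come in.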

Sources

Self-Supervised Vision Transformers for Writer Retrieval

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

UNIT: Unifying Image and Text Recognition in One Vision Encoder