Advancements in OCR and LLM Robustness to Noise

Recent developments in Optical Character Recognition (OCR) and in the robustness of Large Language Models (LLMs) to noise highlight significant advances and open challenges. In OCR, there is a notable push toward improving accuracy for low-resource languages such as Sámi and Yiddish by fine-tuning existing models and leveraging synthetic training data. This approach not only improves access to historical and cultural documents but also sets a precedent for handling other low-resource languages. On the LLM front, research is increasingly focused on robustness to noisy inputs, a critical requirement for real-world applications. Studies show that while LLMs are vulnerable to noise, model architecture and size play pivotal roles in determining robustness, underscoring the need for continued innovation in model training and evaluation to ensure reliability across diverse and noisy datasets.

Noteworthy Papers

  • Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway: Demonstrates that fine-tuning pre-trained models with synthetic data significantly improves OCR accuracy for Sámi languages.
  • ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving: Highlights the vulnerability of LLMs to punctuation noise in math problem-solving tasks, suggesting a need for more robust training methods.
  • Exploring Robustness of Multilingual LLMs on Real-World Noisy Data: Shows that mT5 models exhibit superior robustness to real-world spelling errors across multiple languages and tasks.
  • Jochre 3 and the Yiddish OCR corpus: Introduces a highly accurate OCR tool for Yiddish, significantly outperforming existing models and enhancing the accessibility of Yiddish texts.
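The punctuation-noise probing described above can be illustrated with a minimal sketch. The snippet below inserts random punctuation marks between the words of a math word problem; it is an illustrative approximation of this style of perturbation, not the exact procedure used in the ArithmAttack paper, and the function name and parameters are our own.

```python
import random

# Punctuation marks to inject as noise (illustrative choice).
PUNCTUATION = list("!?;:,.")

def add_punctuation_noise(text: str, noise_ratio: float = 0.2, seed: int = 0) -> str:
    """Insert a random punctuation mark after a fraction of the words in `text`.

    `noise_ratio` is the probability of inserting a mark after each word;
    a fixed `seed` makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    noisy = []
    for word in text.split():
        noisy.append(word)
        if rng.random() < noise_ratio:
            noisy.append(rng.choice(PUNCTUATION))
    return " ".join(noisy)

clean = "If Sam has 3 apples and buys 4 more, how many apples does he have?"
noisy = add_punctuation_noise(clean, noise_ratio=0.3, seed=42)
```

Comparing a model's accuracy on `clean` versus `noisy` variants of a benchmark is one simple way to quantify the robustness gap these papers study.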

Sources

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving

Exploring Robustness of Multilingual LLMs on Real-World Noisy Data

Jochre 3 and the Yiddish OCR corpus
