The fields of computer vision and natural language processing are moving towards more efficient and accurate methods for text recognition and facial expression analysis. Recent work has focused on self-supervised learning, masked image modeling, and cascaded transformers to improve performance while reducing computational demands. Notably, integrating linguistic information into visual models has shown promising results, enabling more robust text recognition when visual quality is degraded. Fine-tuning autoregressive models on limited handwritten texts has also yielded significant improvements in OCR applications. Noteworthy papers include:
- SIT-FER, which proposes a novel semi-supervised facial expression recognition framework that incorporates semantic, instance, and text-level information to generate high-quality pseudo-labels.
- Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition, which channels linguistic information into the decoding process of masked image modeling through a separate branch.
- Efficient and Accurate Scene Text Recognition with Cascaded-Transformers, which introduces a cascaded-transformers structure to improve the efficiency of encoder models.
- Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts, which shows that transformer-based OCR models can be adapted effectively using only small amounts of annotated handwriting.
- Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets, which explores masked self-supervised pre-training for text recognition transformers and reports significant reductions in character error rate.
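Several of the papers above build on the same masked-image-modeling idea: hide a random subset of image patches and train the model to reconstruct them, with the loss computed only over the masked regions. The sketch below illustrates just the masking and masked-only loss in plain Python; the patch size, mask ratio, and function names are illustrative choices for this example, not taken from any of the papers.

```python
import random

def mask_patches(image, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    image: H x W grid as a list of lists of floats.
    Returns the masked copy and the set of masked patch indices.
    """
    h, w = len(image), len(image[0])
    patches_per_row = w // patch
    n_patches = (h // patch) * patches_per_row
    rng = random.Random(seed)
    masked_idx = set(rng.sample(range(n_patches), int(round(mask_ratio * n_patches))))
    out = [row[:] for row in image]
    for idx in masked_idx:
        r, c = divmod(idx, patches_per_row)
        for i in range(r * patch, (r + 1) * patch):
            for j in range(c * patch, (c + 1) * patch):
                out[i][j] = 0.0
    return out, masked_idx

def masked_mse(pred, target, masked_idx, patch=4):
    """Mean squared error restricted to the masked patches, as in
    typical masked-image-modeling reconstruction objectives."""
    patches_per_row = len(target[0]) // patch
    total, count = 0.0, 0
    for idx in masked_idx:
        r, c = divmod(idx, patches_per_row)
        for i in range(r * patch, (r + 1) * patch):
            for j in range(c * patch, (c + 1) * patch):
                total += (pred[i][j] - target[i][j]) ** 2
                count += 1
    return total / count

# A 16x16 image of ones: 16 patches of size 4x4, of which 12 are masked.
img = [[1.0] * 16 for _ in range(16)]
masked, idx = mask_patches(img)
# Predicting the masked input itself leaves an error of 1.0 per masked pixel.
loss = masked_mse(masked, img, idx)  # → 1.0
```

In an actual pre-training setup the prediction would come from an encoder-decoder network rather than being the masked input, but the masked-only loss keeps the same shape: gradients flow only from regions the model could not see.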