Advances in Low-Resource Language NLP

The field of natural language processing (NLP) is moving toward greater inclusivity of low-resource languages, with tailored approaches emerging to address the challenges these languages pose. Recent work has produced dedicated language models such as ParsiPy for historical Persian texts, SomBERTa for Somali, and LakotaBERT for Lakota, which have shown promising results in tasks like fake news detection, sentiment analysis, and language modeling. Domain-adaptive pretraining and task-adaptive fine-tuning have also been explored to improve performance on specific tasks such as emotion classification and machine translation. Notably, new datasets and models, including AfriSocial, ClinText-SP, and RigoBERTa Clinical, have expanded NLP coverage for African languages and for clinical text in Spanish. Noteworthy papers include ParsiPy, which introduces a comprehensive NLP toolkit for historical Persian, and SomBERTa, a monolingual Somali model that outperforms multilingual baselines on fake news and toxic content classification.
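The classification tasks mentioned above (fake news and toxic-message detection) are binary text classification at heart. As a toy illustration of that task framing only, and not the transformer-based approach the papers actually use, here is a minimal stdlib-only sketch of a bag-of-words logistic regression classifier; all example messages and the vocabulary are made up for illustration:

```python
import math
from collections import Counter

# Illustrative labeled messages (1 = toxic, 0 = benign); not real data
# from any of the cited papers.
DOCS = [
    ("you are an idiot and a fraud", 1),
    ("total idiot spreading fraud lies", 1),
    ("thank you for the helpful news update", 0),
    ("great helpful article thank you", 0),
]
VOCAB = sorted({w for text, _ in DOCS for w in text.split()})

def featurize(text):
    """Bag-of-words count vector over the fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def train(docs, lr=0.5, epochs=200):
    """Fit logistic-regression weights with plain gradient descent."""
    w = [0.0] * len(VOCAB)
    b = 0.0
    for _ in range(epochs):
        for text, y in docs:
            x = featurize(text)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of log-loss w.r.t. z
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict(text, w, b):
    """Return 1 (toxic) or 0 (benign) for a message."""
    x = featurize(text)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

if __name__ == "__main__":
    w, b = train(DOCS)
    for text, label in DOCS:
        assert predict(text, w, b) == label  # fits the training messages
```

In the surveyed work this classical baseline is replaced by fine-tuning a pretrained monolingual transformer (e.g. SomBERTa) with a classification head, which is what lets the models generalize far beyond a fixed word list.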
Sources
Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language Models
Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish
HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation