Enhancing LLM Adaptability for Low-Resource Languages

Recent developments in large language models (LLMs) have significantly advanced the capabilities and adaptability of these models, particularly in addressing the challenges posed by low-resource and underrepresented languages. A notable trend is the emphasis on continual pre-training and fine-tuning strategies to improve performance in specific languages, often leveraging multilingual capabilities and transfer learning. Innovations in vocabulary expansion, dataset curation, and model architecture optimization have yielded substantial gains in language understanding and generation tasks. The integration of cultural and linguistic adjustments, along with the creation of new benchmarks, has contributed to more inclusive and effective language technologies. Modular architectures for task-oriented dialog systems and curriculum learning for cross-lingual data-to-text generation with noisy data have also shown promising results. Together, these advances broaden the applicability of LLMs and mark significant progress toward democratizing AI across diverse linguistic and cultural contexts. A minimal sketch of one of these recurring techniques, vocabulary expansion before continual pre-training, appears below.
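
The sketch below illustrates the general vocabulary-expansion step described above, not the exact procedure of any listed paper: new target-language tokens are added to a tokenizer and the embedding matrix is resized before continual pre-training. It assumes the Hugging Face transformers API; the base checkpoint name and the token list are placeholders.

```python
# Minimal sketch of vocabulary expansion prior to continual pre-training.
# Assumes Hugging Face transformers; model name and new tokens are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# New subword tokens mined from target-language text (illustrative only).
new_tokens = ["salaam", "jërëjëf", "teranga"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token IDs have trainable rows.
model.resize_token_embeddings(len(tokenizer))

# The added embedding rows start randomly initialized; continual pre-training
# on target-language text then adapts them alongside the rest of the model.
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```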

Sources

Efficient Continual Pre-training of LLMs for Low-resource Languages

BgGPT 1.0: Extending English-centric LLMs to other languages

Task-Oriented Dialog Systems for the Senegalese Wolof Language

Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models

Vocabulary Expansion of Chat Models with Unlabeled Target Language Data

Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks

Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

XTransplant: A Probe into the Upper Bound Performance of Multilingual Capability and Culture Adaptability in LLMs via Mutual Cross-lingual Feed-forward Transplantation

SnakModel: Lessons Learned from Training an Open Danish Large Language Model

Syntactic Transfer to Kyrgyz Using the Treebank Translation Method

Experience of Training a 1.7B-Parameter LLaMa Model From Scratch

Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation

Curriculum Learning for Cross-Lingual Data-to-Text Generation With Noisy Data

Open Universal Arabic ASR Leaderboard

Language verY Rare for All

Understanding and Analyzing Model Robustness and Knowledge-Transfer in Multilingual Neural Machine Translation using TX-Ray

Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs
