Multilingual Advances in NLP

The field of natural language processing (NLP) is witnessing significant advances in multilingual capabilities, with a focus on closing the performance gaps in under-resourced languages. Researchers are evaluating and improving large language models (LLMs) in diverse linguistic environments, particularly low-resource ones. New benchmarks and evaluation frameworks, such as GlotEval and Kaleidoscope, enable the assessment of LLMs in multilingual and multicultural contexts. Furthermore, the development of multilingual LLMs like SEA-LION, together with investigations into continual pretraining strategies, highlights the complexity of multilingual representation learning. Notable papers in this area include SEA-LION, which introduces a multilingual LLM designed for Southeast Asian languages that achieves state-of-the-art performance among LLMs supporting these languages, and Kaleidoscope, a large-scale, in-language multimodal benchmark that evaluates vision-language models (VLMs) across diverse languages and visual inputs, revealing significant gaps in multilingual and multicultural coverage.
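
To make the data-mixing idea behind multilingual continual pretraining concrete, the sketch below shows temperature-based language sampling, a common recipe for up-weighting low-resource languages: each language's sampling probability is proportional to its corpus size raised to a power alpha < 1. This is a minimal illustration of the general technique, not the specific method of any paper listed here; the token counts and the alpha value are hypothetical.

```python
# Minimal sketch of temperature-based language sampling for
# multilingual continual pretraining. Token counts and alpha are
# illustrative assumptions, not values from any cited paper.

corpus_tokens = {  # tokens per language (hypothetical, in millions)
    "en": 500_000,
    "id": 20_000,
    "vi": 12_000,
    "th": 8_000,
    "tl": 1_500,
}

def mixing_weights(token_counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    """Sampling probability p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces the raw corpus distribution; smaller alpha
    flattens it, up-sampling low-resource languages.
    """
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

if __name__ == "__main__":
    for lang, p in sorted(mixing_weights(corpus_tokens).items()):
        print(f"{lang}: {p:.3f}")
```

With these illustrative numbers and alpha = 0.3, English's share drops from roughly 92% of raw tokens to about 46% of sampled batches, while Tagalog rises from under 0.3% to around 8%, showing how the temperature trades off raw-data fidelity against low-resource coverage.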

Sources

Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices

MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

SEA-LION: Southeast Asian Languages in One Network

Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation
