Advances in Language Model Training and Distillation

The field of language modeling is moving toward more efficient and effective training methods, with a focus on uncertainty-aware training and knowledge distillation. Researchers are exploring techniques such as token-level uncertainty-aware objectives and sparse logit sampling to improve model performance. There is also growing interest in methods that bridge the gap between different models and tokenizers, enabling more flexible and adaptable language modeling. Notable papers include Efficient Knowledge Distillation via Curriculum Extraction, which extracts a training curriculum from a fully trained teacher network; Cross-Tokenizer Distillation via Approximate Likelihood Matching, which enables distillation across different tokenizers; and Vocabulary-agnostic Teacher Guided Language Modeling, which shows promising results in overcoming vocabulary mismatches between teacher and student models.
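The common core of the distillation work surveyed here is training a student to match a teacher's output distribution. As background, a minimal sketch of the classic temperature-scaled distillation loss is shown below; this is the generic formulation, not the specific objective of any paper listed under Sources, and the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher temperature softens the
    # distribution, exposing more of the teacher's "dark knowledge".
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student's logits match the teacher's and grows as the distributions diverge; in practice it is combined with a standard cross-entropy term on the gold labels. Cross-tokenizer and vocabulary-agnostic methods must go further, since this per-token comparison assumes both models share a vocabulary.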

Sources

Token-Level Uncertainty-Aware Objective for Language Model Post-Training

Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Efficient Knowledge Distillation via Curriculum Extraction

Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures

Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Cross-Tokenizer Distillation via Approximate Likelihood Matching
