Knowledge Distillation in Large Language Models

The field of large language models is moving toward more efficient and scalable architectures, with knowledge distillation as a central tool. Recent work improves student-model quality, making it possible to compress large language models while largely preserving their accuracy; the emphasis is on optimizing knowledge transfer, reducing computational overhead, and speeding up inference. Current directions include stochastic self-distillation training strategies, systematic evaluations of how well knowledge distillation transfers to subquadratic architectures, and new distillation methods aimed at both performance and explainability.

Notable papers: the work on a stochastic self-distillation training strategy filters and weights teacher representations so that the student distills only from task-relevant representations, while the empirical evaluation of knowledge distillation from transformers to subquadratic language models analyzes the trade-offs between efficiency and performance in distilled models.
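For context, most of these methods build on the standard distillation recipe: a temperature-scaled KL term against the teacher's output distribution combined with ordinary cross-entropy on the labels. The sketch below is a minimal, generic illustration of that baseline in plain PyTorch (hyperparameter values and function names are illustrative, not taken from any of the papers listed here).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic logit-distillation loss: soft-target KL plus hard-label cross-entropy.

    `temperature` softens both distributions; `alpha` weights the distillation term.
    """
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

Representation-level approaches, such as the teacher-representation filtering and weighting discussed above, replace or augment the logit term with losses on intermediate features; the exact formulations are described in the respective papers.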

Sources

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation

Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability

Representation Learning via Non-Contrastive Mutual Information

Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?
