Efficient Training and Inference of Large Language Models

The field of large language models is moving toward more efficient training and inference. Recent work optimizes parallelism strategies, reduces communication overhead, and improves resource utilization, with notable advances in dynamic hybrid-parallelism selection, layer-wise and phase-wise strategy optimization, and runtime adaptation. There is also growing interest in heterogeneous GPU training, with systems that put older GPUs to productive use and minimize idle time. Noteworthy papers include: Galvatron, a framework for automatic distributed training of large transformer models; HeterMoE, a system for efficient training of mixture-of-experts (MoE) models on heterogeneous GPUs; Dion, a communication-efficient optimizer for large models; TAGC, an optimized gradient compression algorithm for distributed transformer training; HybriMoE, a hybrid CPU-GPU scheduling and cache-management framework for efficient MoE inference; and Nonuniform-Tensor-Parallelism, which mitigates the impact of GPU failures on scaled-up LLM training.
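To make the gradient-compression idea concrete, here is a minimal top-k sparsification sketch: only the largest-magnitude entries of a gradient tensor are transmitted, shrinking communication volume at the cost of a lossy update. This is a generic illustration of the technique family, not TAGC's actual algorithm; the function names, the `ratio` parameter, and the dense-reconstruction step are assumptions for this sketch.

```python
import numpy as np

def topk_compress(grad, ratio=0.01):
    # Keep only the largest-magnitude `ratio` fraction of gradient entries.
    # Returns the kept indices, their values, and the original shape so the
    # receiver can rebuild a (sparse) dense tensor.
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of top-k magnitudes
    return idx, flat[idx], grad.shape

def topk_decompress(idx, values, shape):
    # Rebuild a dense gradient: zeros everywhere except the kept entries.
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

# Example: keep 5% of a 10x10 gradient, then reconstruct it.
g = np.arange(100, dtype=np.float32).reshape(10, 10)
idx, vals, shape = topk_compress(g, ratio=0.05)
restored = topk_decompress(idx, vals, shape)
```

Practical systems typically combine this with error feedback (accumulating the dropped residual locally for the next step) so the lossy updates do not bias training.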

Sources

Galvatron: Automatic Distributed Training for Large Transformer Models

HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

Dion: A Communication-Efficient Optimizer for Large Models

TAGC: Optimizing Gradient Communication in Distributed Transformer Training

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
