Efficient Training and Inference of Large Language Models

The field of large language models is moving toward more efficient training and inference. Recent work optimizes parallelism strategies, reduces communication overhead, and improves resource utilization, with notable advances in dynamic hybrid-parallelism selection, layer-wise and phase-wise strategy optimization, and runtime adaptation. There is also growing interest in heterogeneous GPU training, with systems that put older GPUs to productive use and minimize idle time. Noteworthy papers include: Galvatron, a framework for automatic distributed training of large transformer models; HeterMoE, a system for efficient training of mixture-of-experts (MoE) models on heterogeneous GPUs; Dion, a communication-efficient optimizer for large models; TAGC, an optimized gradient compression algorithm for distributed transformer training; HybriMoE, a hybrid CPU-GPU scheduling and cache-management framework for efficient MoE inference; and Nonuniform-Tensor-Parallelism, which mitigates the impact of GPU failures on scaled-up LLM training.
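To make the gradient-compression idea concrete, here is a minimal top-k sparsification sketch: only the largest-magnitude entries of a gradient tensor are transmitted, shrinking communication volume at the cost of a lossy update. This is a generic illustration of the technique family, not TAGC's actual algorithm; the function names, the `ratio` parameter, and the dense-reconstruction step are assumptions for this sketch.

```python
import numpy as np

def topk_compress(grad, ratio=0.01):
    # Keep only the largest-magnitude `ratio` fraction of gradient entries.
    # Returns the kept indices, their values, and the original shape so the
    # receiver can rebuild a (sparse) dense tensor.
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of top-k magnitudes
    return idx, flat[idx], grad.shape

def topk_decompress(idx, values, shape):
    # Rebuild a dense gradient: zeros everywhere except the kept entries.
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

# Example: keep 5% of a 10x10 gradient, then reconstruct it.
g = np.arange(100, dtype=np.float32).reshape(10, 10)
idx, vals, shape = topk_compress(g, ratio=0.05)
restored = topk_decompress(idx, vals, shape)
```

Practical systems typically combine this with error feedback (accumulating the dropped residual locally for the next step) so the lossy updates do not bias training.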

Sources

Galvatron: Automatic Distributed Training for Large Transformer Models

HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

Dion: A Communication-Efficient Optimizer for Large Models

TAGC: Optimizing Gradient Communication in Distributed Transformer Training

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
