Advances in Distributed Training and Parallelism

The field of distributed training and parallelism is advancing rapidly, with a focus on improving the efficiency and scalability of large-scale deep learning models. Researchers are exploring methods to mitigate data dependencies, balance workloads across devices, and optimize communication schedules. These efforts aim to reduce communication bottlenecks, raise training throughput, and improve the overall performance of distributed training systems. Notable directions include flexible communication scheduling, workload-aware parallelism, and memory-parallelism co-optimization, which have demonstrated significant speedups in training large language models.

Some noteworthy papers: DeFT proposes a communication scheduling scheme that mitigates data dependencies to enable flexible scheduling, achieving speedups of 29% to 115% on representative benchmarks. Mist introduces a memory-, overlap-, and imbalance-aware automatic distributed training system that comprehensively co-optimizes memory footprint reduction techniques alongside parallelism, achieving an average speedup of 1.28x over state-of-the-art systems.
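To make the shared idea behind these systems concrete, the sketch below overlaps an asynchronous gradient all-reduce with independent computation using standard torch.distributed primitives. This is only a minimal illustration of communication-computation overlap, not the actual scheduling algorithms of DeFT, Mist, or TileLink; the function and variable names are hypothetical.

```python
# Minimal sketch of communication-computation overlap, the general idea
# behind communication scheduling in distributed training. Assumes launch
# via torchrun so that rank/world-size environment variables are set.
import torch
import torch.distributed as dist


def overlapped_step(grad_bucket: torch.Tensor,
                    next_layer_input: torch.Tensor,
                    weight: torch.Tensor) -> torch.Tensor:
    # Kick off the gradient all-reduce asynchronously so the collective
    # runs while we keep computing on data that does not depend on it.
    work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # Independent computation (e.g., the next layer's work) proceeds
    # while the gradients are in flight on the interconnect.
    independent_result = next_layer_input @ weight

    # Block only when the reduced gradients are actually needed,
    # e.g. right before the optimizer step for this bucket.
    work.wait()
    grad_bucket /= dist.get_world_size()

    return independent_result


if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # use "nccl" on GPUs
    torch.manual_seed(dist.get_rank())
    grads = torch.randn(1024)
    x = torch.randn(8, 64)
    w = torch.randn(64, 64)
    out = overlapped_step(grads, x, w)
    print(f"rank {dist.get_rank()}: output norm {out.norm():.3f}")
    dist.destroy_process_group()
```

Run with, for example, `torchrun --nproc_per_node=2 overlap_sketch.py`. Systems such as DeFT and TileLink go well beyond this pattern by deciding, per operator or tile, which collectives can be reordered or fused to hide communication behind computation.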

Sources

DeFT: Mitigating Data Dependencies for Flexible Communication Scheduling in Distributed Training

WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
