Efficiency and Scalability in Large-Scale Model Training and Deployment

Advances in Large-Scale Model Training and Deployment

Recent developments in the field of large-scale model training and deployment have focused on enhancing efficiency, scalability, and resource utilization. Key innovations include novel load-balancing methods for parallel training of Mixture of Experts (MoE) models, which aim to reduce communication costs and improve throughput. In parallel, work on post-training after model pruning has introduced scaling laws that predict how much post-training data a pruned model needs, significantly reducing resource demands while preserving model performance.
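To make the load-balancing problem concrete, the sketch below simulates skewed top-1 routing across experts and then greedily moves experts from overloaded to underloaded devices. This is a toy illustration only, not Pro-Prophet's algorithm; the expert counts, routing skew, and max/mean imbalance metric are all invented assumptions.

```python
import random
from collections import Counter

# Toy setup: 8 experts placed round-robin across 4 devices, tokens routed top-1.
NUM_EXPERTS, NUM_DEVICES, NUM_TOKENS = 8, 4, 10_000
placement = {e: e % NUM_DEVICES for e in range(NUM_EXPERTS)}

# Skewed routing: some experts are "hot", so some devices receive far more tokens.
weights = [2 ** (e % 4) for e in range(NUM_EXPERTS)]
routed = random.choices(range(NUM_EXPERTS), weights=weights, k=NUM_TOKENS)
expert_load = Counter(routed)

def device_loads(placement):
    loads = Counter({d: 0 for d in range(NUM_DEVICES)})
    for e, d in placement.items():
        loads[d] += expert_load[e]
    return loads

def imbalance(loads):
    return max(loads.values()) / (sum(loads.values()) / NUM_DEVICES)  # 1.0 = balanced

print("imbalance before:", round(imbalance(device_loads(placement)), 2))

# Greedy rebalancing: move the largest expert that narrows the gap between the
# most and least loaded devices; stop when no such move exists.
for _ in range(NUM_EXPERTS):
    loads = device_loads(placement)
    hot, cold = max(loads, key=loads.get), min(loads, key=loads.get)
    gap = loads[hot] - loads[cold]
    movable = [e for e, d in placement.items() if d == hot and 0 < expert_load[e] < gap]
    if not movable:
        break
    placement[max(movable, key=lambda e: expert_load[e])] = cold

print("imbalance after:", round(imbalance(device_loads(placement)), 2))
```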

In the realm of inference, high-throughput systems for MoE models on memory-constrained GPUs have been proposed, leveraging pipelining and hierarchical resource management to push throughput well beyond existing systems. These designs not only make better use of limited GPU memory but also scale to larger models across multiple GPUs, making advanced models more accessible.
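The core idea behind such pipelining can be sketched as a double-buffered prefetch loop: while the current layer's experts are being computed, the next layer's weights are already being loaded. This is a generic overlap pattern, not MoE-Lightning's implementation; the timings, function names, and buffer depth are assumptions for illustration.

```python
import queue
import threading
import time

# Toy two-stage pipeline: while the "GPU" computes layer i, the host prefetches
# the expert weights for layer i + 1. All timings are fake; the point is that
# total time approaches max(load, compute) per layer instead of load + compute.
NUM_LAYERS = 8
LOAD_TIME, COMPUTE_TIME = 0.05, 0.04  # seconds, invented numbers

def load_weights(layer):           # stands in for a CPU->GPU weight transfer
    time.sleep(LOAD_TIME)
    return f"weights[{layer}]"

def compute(layer, weights):       # stands in for the actual expert GEMMs
    time.sleep(COMPUTE_TIME)
    return f"activations[{layer}]"

def prefetcher(out_q):
    for layer in range(NUM_LAYERS):
        out_q.put((layer, load_weights(layer)))   # blocks when the buffer is full
    out_q.put(None)                               # sentinel: no more layers

start = time.perf_counter()
buf = queue.Queue(maxsize=2)                      # double buffer
threading.Thread(target=prefetcher, args=(buf,), daemon=True).start()

while (item := buf.get()) is not None:
    layer, weights = item
    compute(layer, weights)

elapsed = time.perf_counter() - start
print(f"pipelined: {elapsed:.2f}s vs serial: {NUM_LAYERS * (LOAD_TIME + COMPUTE_TIME):.2f}s")
```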

Another notable trend is the integration of topology-aware scheduling for co-located large language model (LLM) workloads, which ensures efficient resource allocation and improves overall system performance. This approach addresses the challenges of heterogeneous workload priorities and varying resource requirements, enhancing the efficiency of preemptive scheduling.
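A topology-aware, priority-based placement policy can be illustrated with a small toy scheduler: candidate GPU sets are scored by how few nodes they span, and a lower-priority job is preempted only when the free pool cannot satisfy the request. This is an illustrative sketch, not the paper's method, which works at a much finer topology granularity (NUMA domains, PCIe/NVLink paths); the cluster layout, names, and scoring function here are assumptions.

```python
from itertools import combinations

# Toy cluster: 2 nodes with 4 GPUs each. Jobs request a number of GPUs and
# carry a priority (higher number = more important).
gpus = {f"node{n}-gpu{g}": {"node": n, "job": None} for n in range(2) for g in range(4)}

def topology_score(gpu_ids):
    # Fewer distinct nodes -> better locality -> higher score.
    return -len({gpus[g]["node"] for g in gpu_ids})

def schedule(job, num_gpus, priority):
    free = [g for g, info in gpus.items() if info["job"] is None]
    if len(free) < num_gpus:
        # Preempt strictly lower-priority jobs, lowest priority first, until
        # enough GPUs are free. (A real scheduler would requeue the victims.)
        running = {}
        for g, info in gpus.items():
            if info["job"] is not None:
                running.setdefault(info["job"], []).append(g)
        for victim in sorted(running, key=lambda j: j[1]):   # j = (name, priority)
            if victim[1] < priority:
                for g in running[victim]:
                    gpus[g]["job"] = None
                free += running[victim]
            if len(free) >= num_gpus:
                break
    if len(free) < num_gpus:
        return None
    best = max(combinations(free, num_gpus), key=topology_score)
    for g in best:
        gpus[g]["job"] = (job, priority)
    return best

print(schedule("train-A", 4, priority=1))   # fills one node
print(schedule("train-B", 4, priority=2))   # fills the other node
print(schedule("infer-C", 2, priority=3))   # preempts the lowest-priority job
```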

Noteworthy Papers:

  • Pro-Prophet: Introduces a systematic load-balancing method for efficient parallel training of large-scale MoE models, delivering significant training speedup and markedly better load balance across devices.
  • MoE-Lightning: Proposes a high-throughput MoE batch inference system that significantly outperforms existing methods, achieving up to 10.3x higher throughput on resource-constrained GPUs.
  • Topology-aware Preemptive Scheduling: Develops a fine-grained topology-aware method for preemptive scheduling of hybrid workloads, improving overall scheduling performance by 55%.

Sources

Pro-Prophet: Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models

Scaling Law for Post-training after Model Pruning

The Jevons Paradox In Cloud Computing: A Thermodynamics Perspective

gpuPairHMM: High-speed Pair-HMM Forward Algorithm for DNA Variant Calling on GPUs

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Topology-aware Preemptive Scheduling for Co-located LLM Workloads

Lorentz: Learned SKU Recommendation Using Profile Data

LSRAM: A Lightweight Autoscaling and SLO Resource Allocation Framework for Microservices Based on Gradient Descent

Intelligent Pooling: Proactive Resource Provisioning in Large-scale Cloud Service

Scaling Deep Learning Research with Kubernetes on the NRP Nautilus HyperCluster

Optimizing Airline Reservation Systems with Edge-Enabled Microservices: A Framework for Real-Time Data Processing and Enhanced User Responsiveness

Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning

Fast and Efficient Memory Reclamation For Serverless MicroVMs

Loss-to-Loss Prediction: Scaling Laws for All Datasets

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

IC Mechanisms for Risk-Averse Advertisers in the Online Advertising System

Scaling Laws for Online Advertisement Retrieval
