Large Language Model (LLM) Inference and Training

Report on Current Developments in Large Language Model (LLM) Inference and Training

General Direction of the Field

Recent advances in Large Language Models (LLMs) are focused primarily on optimizing inference and training to meet growing demands for performance, scalability, and cost-efficiency. The field is shifting toward more sophisticated systems that handle the complexities of deploying LLMs at scale while also addressing resource utilization and performance isolation.

  1. Performance Optimization and Hardware Characterization: There is significant emphasis on tools and systems that characterize and optimize the performance of LLM inference services across hardware configurations. This includes benchmarking and predictive modeling to recommend the most cost-effective hardware for a given LLM workload, so that services meet their performance requirements at minimal cost (a selection sketch follows this list).

  2. Fine-Grained Task Parallelism and Resource Utilization: Fine-grained task parallelism on simultaneous multithreading (SMT) cores is gaining traction as a way to improve the performance of latency-critical applications. The focus is on specialized parallel programming frameworks and runtimes that make tasking on SMT cores efficient, improving overall system performance.

  3. GPU Harvesting and Co-Serving: A notable trend is systems that harvest stranded GPU resources for offline LLM inference tasks. These systems improve GPU utilization by safely preempting offline tasks when online requests arrive, reducing underutilization and increasing throughput. The challenge is achieving high GPU utilization without degrading latency-sensitive online tasks (a preemption sketch follows this list).

  4. Memory-Centric Profiling and ARM Processors: With the rise of ARM processors in data centers and HPC systems, there is growing interest in memory-centric profiling tools for ARM architectures, for example built on the ARM Statistical Profiling Extension (SPE). Such tools are crucial for identifying memory access bottlenecks and guiding optimizations as ARM processors become more prevalent in high-performance computing environments.

  5. Unified and Modular Training Systems: The field is moving toward unified, modular training systems that streamline the integration of state-of-the-art techniques for LLM pre-training. These systems reduce the complexity and overhead of curating and comparing training recipes, enabling more efficient and scalable training of LLMs (a configuration sketch follows this list).

  6. Improving GPU Utilization in Pipeline-Parallel Training: There is a concerted effort to address inefficiencies in pipeline-parallel training of large models, particularly the idle GPU time caused by pipeline bubbles. Work in this area fills those bubbles with other pending jobs to maximize GPU utilization and improve the overall efficiency of large-scale LLM training (a bubble-filling sketch follows this list).

  7. Non-Intrusive Performance Isolation for Concurrent Workloads: GPU underutilization in deep learning clusters is being tackled with non-intrusive performance isolation mechanisms. These mechanisms aim to provide robust performance isolation and broad workload compatibility, enabling GPU sharing without compromising the performance of individual workloads (a priority-dispatch sketch follows this list).
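To make the hardware-recommendation idea in item 1 concrete, here is a minimal sketch: given benchmark results for one workload on several GPU configurations, pick the cheapest configuration that still meets the latency and throughput requirements. The configurations, numbers, and the `recommend` helper are hypothetical and are not taken from LLM-Pilot.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    gpu: str                 # hardware configuration, e.g. "A100-80GB x1"
    p95_latency_ms: float    # measured 95th-percentile latency for the workload
    throughput_tok_s: float  # measured generation throughput
    cost_per_hour: float     # on-demand price (hypothetical numbers)

def recommend(results: list[BenchmarkResult],
              latency_slo_ms: float,
              min_throughput_tok_s: float) -> BenchmarkResult | None:
    """Return the cheapest configuration that satisfies both requirements."""
    feasible = [r for r in results
                if r.p95_latency_ms <= latency_slo_ms
                and r.throughput_tok_s >= min_throughput_tok_s]
    return min(feasible, key=lambda r: r.cost_per_hour, default=None)

# Hypothetical benchmark table for one model and workload.
results = [
    BenchmarkResult("L4 x1",        180.0,  950.0, 0.80),
    BenchmarkResult("A100-80GB x1",  95.0, 2100.0, 3.70),
    BenchmarkResult("H100 x1",       60.0, 3400.0, 6.90),
]
best = recommend(results, latency_slo_ms=100.0, min_throughput_tok_s=1500.0)
print(best.gpu if best else "no configuration meets the requirements")
```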
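The co-serving idea in item 3 can be sketched as a scheduler that always drains latency-critical online requests before advancing offline work, and advances offline jobs only in small, preemptible chunks so that preemption latency stays bounded. This is an illustrative toy loop, not ConServe's implementation; the job names, chunk counts, and sleep times are made up.

```python
import queue, time

online_q: "queue.Queue[str]" = queue.Queue()           # latency-critical requests
offline_jobs = {f"offline-{i}": 20 for i in range(3)}  # job -> remaining chunks (hypothetical)

def serve_online(req: str) -> None:
    time.sleep(0.005)   # placeholder for the online inference step
    print("served", req)

def run_offline_chunk(job: str) -> None:
    time.sleep(0.01)    # placeholder for one short, preemptible slice of GPU work
    offline_jobs[job] -= 1
    if offline_jobs[job] == 0:
        del offline_jobs[job]
        print("finished", job)

def scheduler_loop() -> None:
    while offline_jobs or not online_q.empty():
        # Online requests always win: drain them before touching offline work.
        while not online_q.empty():
            serve_online(online_q.get())
        if offline_jobs:
            # Advance exactly one small chunk, so preemption latency stays bounded.
            run_offline_chunk(next(iter(offline_jobs)))

online_q.put("chat-request-1")
scheduler_loop()
```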
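One way to picture the modular training systems in item 5 is as composable configuration: 1D, 2D, or 3D parallelism is just a choice of data-, tensor-, and pipeline-parallel degrees whose product must match the available GPUs. The `ParallelPlan` class below is a hypothetical illustration of that idea, not TorchTitan's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelPlan:
    data_parallel: int = 1
    tensor_parallel: int = 1
    pipeline_parallel: int = 1

    @property
    def world_size(self) -> int:
        # Each GPU holds exactly one (dp, tp, pp) coordinate.
        return self.data_parallel * self.tensor_parallel * self.pipeline_parallel

def validate(plan: ParallelPlan, available_gpus: int) -> None:
    if plan.world_size != available_gpus:
        raise ValueError(
            f"plan needs {plan.world_size} GPUs (dp*tp*pp) but {available_gpus} are available"
        )

# 3D parallelism on 64 GPUs: 4-way data, 8-way tensor, 2-way pipeline.
plan = ParallelPlan(data_parallel=4, tensor_parallel=8, pipeline_parallel=2)
validate(plan, available_gpus=64)
print(plan, "world size:", plan.world_size)
```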
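The bubble-filling idea in item 6 can be illustrated with a simple GPipe-style forward schedule: with p stages and m microbatches, each stage is idle for p-1 of the m+p-1 timesteps, and those idle slots can be handed to other pending jobs. The schedule and filler job names below are illustrative; this is not PipeFill's algorithm.

```python
def pipeline_schedule(stages: int, microbatches: int) -> list[list[str]]:
    steps = microbatches + stages - 1             # total timesteps in the forward pass
    grid = [["bubble"] * steps for _ in range(stages)]
    for s in range(stages):
        for t in range(steps):
            mb = t - s                             # microbatch handled by stage s at step t
            if 0 <= mb < microbatches:
                grid[s][t] = f"mb{mb}"
    return grid

def fill_bubbles(grid: list[list[str]], pending_jobs: list[str]) -> None:
    jobs = iter(pending_jobs)
    for row in grid:
        for t, slot in enumerate(row):
            if slot == "bubble":
                row[t] = next(jobs, "idle")        # fill with other work while any remains

grid = pipeline_schedule(stages=4, microbatches=6)
bubbles = sum(row.count("bubble") for row in grid)
print(f"bubble fraction: {bubbles / (len(grid) * len(grid[0])):.2f}")  # (p-1)/(m+p-1) = 3/9
fill_bubbles(grid, pending_jobs=["eval-job", "small-finetune"])
for s, row in enumerate(grid):
    print(f"stage {s}: {row}")
```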
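Finally, the non-intrusive isolation in item 7 can be pictured as a dispatcher sitting between applications and the GPU that launches best-effort kernels only when they fit into the slack left by latency-critical work. The duration estimates and slack heuristic below are assumptions made for illustration and do not reflect Tally's actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    est_duration_ms: float   # hypothetical per-kernel duration estimate

def dispatch(best_effort: list[Kernel],
             now_ms: float,
             next_critical_start_ms: float) -> list[Kernel]:
    """Return the best-effort kernels that fit in the current slack window."""
    launched = []
    t = now_ms
    for k in best_effort:
        if t + k.est_duration_ms <= next_critical_start_ms:
            launched.append(k)       # fits entirely before critical work resumes
            t += k.est_duration_ms
        else:
            break                    # defer the rest; critical kernels keep priority
    return launched

pending = [Kernel("be-gemm", 1.2), Kernel("be-softmax", 0.4), Kernel("be-gemm-2", 2.5)]
print([k.name for k in dispatch(pending, now_ms=0.0, next_critical_start_ms=2.0)])
```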

Noteworthy Papers

  • LLM-Pilot: Introduces a system for characterizing and optimizing LLM inference services, delivering 33% more performance while reducing costs by 60%.
  • ConServe: Proposes a system for harvesting stranded GPU resources, achieving 2.35x higher throughput and 84x lower latency than existing co-serving systems.
  • TorchTitan: Offers a PyTorch-native training system for LLMs, demonstrating training speedups of up to 65.08% with 1D parallelism and further gains with 2D and 3D parallelism.
  • Tally: Introduces a non-intrusive GPU sharing mechanism, reducing 99th-percentile latency overhead by 96% compared to state-of-the-art systems while maintaining high throughput.

Sources

LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving

Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE

TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
