The field of large language models (LLMs) is moving toward more efficient inference, with a focus on optimizing throughput, latency, and memory usage. Researchers are exploring novel architectures, scheduling algorithms, and hardware designs, with notable advances in phase-decoupled compute partitioning, fine-grained pipeline parallelism, and heterogeneous memory management. These techniques deliver substantial gains in throughput, latency, and memory efficiency, making LLMs more practical for real-world serving workloads. Noteworthy papers include: Optimizing SLO-oriented LLM Serving with PD-Multiplexing, which presents an LLM serving framework that achieves an average $5.1\times$ throughput improvement over state-of-the-art baselines; SlimPipe, a fine-grained pipeline-parallelism approach that reduces accumulated activations and achieves near-zero memory overhead with minimal pipeline bubbles; and L3, a hardware-software co-designed system that integrates DIMM-PIM and GPU devices to achieve up to $6.1\times$ speedup over state-of-the-art HBM-PIM solutions. A toy scheduling sketch of the phase-decoupled idea follows.
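To make the phase-decoupled compute partitioning idea concrete, here is a minimal, hypothetical sketch (not taken from PD-Multiplexing or any cited system): it assumes an abstract per-step compute budget and illustrates how compute-bound prefill requests and memory-bound decode requests might be multiplexed onto disjoint partitions of a single GPU rather than time-sharing it. All names and numbers are illustrative assumptions.

```python
# Hypothetical sketch of phase-decoupled compute partitioning for LLM serving.
# Prefill (compute-bound) and decode (memory-bound) requests sit in separate
# queues and run each step on disjoint shares of an abstract compute budget.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    req_id: int
    prompt_tokens: int          # tokens still to prefill
    decode_tokens_left: int     # tokens still to generate

@dataclass
class PhaseDecoupledScheduler:
    total_budget: int = 100     # abstract compute units available per step
    prefill_share: float = 0.7  # fraction of the budget reserved for prefill
    prefill_q: deque = field(default_factory=deque)
    decode_q: deque = field(default_factory=deque)

    def submit(self, req: Request) -> None:
        self.prefill_q.append(req)

    def step(self) -> None:
        """One scheduling step: prefill and decode proceed concurrently on
        disjoint compute partitions instead of alternating on the whole GPU."""
        prefill_budget = int(self.total_budget * self.prefill_share)
        decode_budget = self.total_budget - prefill_budget

        # Prefill partition: consume prompt tokens, then hand off to decode.
        while self.prefill_q and prefill_budget > 0:
            req = self.prefill_q[0]
            done = min(req.prompt_tokens, prefill_budget)
            req.prompt_tokens -= done
            prefill_budget -= done
            if req.prompt_tokens == 0:
                self.decode_q.append(self.prefill_q.popleft())

        # Decode partition: one token per active request, bounded by its share.
        for _ in range(min(decode_budget, len(self.decode_q))):
            req = self.decode_q.popleft()
            req.decode_tokens_left -= 1
            if req.decode_tokens_left > 0:
                self.decode_q.append(req)

if __name__ == "__main__":
    sched = PhaseDecoupledScheduler()
    sched.submit(Request(req_id=0, prompt_tokens=120, decode_tokens_left=16))
    sched.submit(Request(req_id=1, prompt_tokens=40, decode_tokens_left=8))
    for step in range(5):
        sched.step()
        print(f"step {step}: prefill_q={len(sched.prefill_q)} decode_q={len(sched.decode_q)}")
```

In a real system the two partitions would map to, e.g., disjoint SM groups or separate streams, and the split would adapt to load and SLOs; the fixed 70/30 split here is purely for illustration.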
Efficient Large Language Model Inference
Sources
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Evaluating Learned Query Performance Prediction Models at LinkedIn: Challenges, Opportunities, and Findings