Efficient Large Language Model Inference

The field of large language models (LLMs) is moving toward more efficient inference, with a focus on optimizing throughput, latency, and memory usage. Researchers are exploring novel architectures, scheduling algorithms, and hardware designs to improve LLM serving performance. Notable advancements include phase-decoupled compute partitioning, fine-grained pipeline parallelism, and heterogeneous memory management. These innovations yield substantial gains in throughput, latency, and memory efficiency, making LLMs more practical for real-world deployment. Noteworthy papers include: Optimizing SLO-oriented LLM Serving with PD-Multiplexing, a new LLM serving framework that achieves an average $5.1\times$ throughput improvement over state-of-the-art baselines; SlimPipe, a novel approach to pipeline parallelism that reduces accumulated activations, achieving near-zero memory overhead and minimal pipeline bubbles; and L3, a hardware-software co-designed system that integrates DIMM-PIM and GPU devices to achieve up to $6.1\times$ speedup over state-of-the-art HBM-PIM solutions.
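
To make the phase-decoupling idea concrete, the sketch below keeps prefill (compute-bound) and decode (memory-bound) work in separate queues so each phase can be scheduled on its own compute partition. This is a minimal, hypothetical illustration of the general technique under assumed interfaces, not the PD-Multiplexing system described in the paper; the class and parameter names (`PhaseDecoupledScheduler`, `prefill_budget_tokens`, `decode_batch`) are invented for the example.

```python
# Conceptual sketch of phase-decoupled scheduling for LLM inference.
# Not the actual PD-Multiplexing implementation; all names are hypothetical.
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    prompt_tokens: int          # tokens to process in the prefill phase
    generated: int = 0          # tokens produced so far in the decode phase
    max_new_tokens: int = 64


class PhaseDecoupledScheduler:
    """Keeps prefill and decode requests in separate queues so each phase
    can be dispatched to its own compute partition independently."""

    def __init__(self, prefill_budget_tokens: int = 4096, decode_batch: int = 32):
        self.prefill_q: deque = deque()
        self.decode_q: deque = deque()
        self.prefill_budget_tokens = prefill_budget_tokens
        self.decode_batch = decode_batch

    def submit(self, req: Request) -> None:
        # New requests always start in the prefill queue.
        self.prefill_q.append(req)

    def next_prefill_batch(self) -> list:
        # Pack prefill requests up to a token budget for the prefill partition.
        batch, used = [], 0
        while self.prefill_q and used + self.prefill_q[0].prompt_tokens <= self.prefill_budget_tokens:
            req = self.prefill_q.popleft()
            used += req.prompt_tokens
            batch.append(req)
        if not batch and self.prefill_q:
            # Admit an oversized prompt on its own rather than stalling.
            batch.append(self.prefill_q.popleft())
        return batch

    def finish_prefill(self, reqs: list) -> None:
        # Once prefilled, requests move to the decode partition's queue.
        self.decode_q.extend(reqs)

    def next_decode_batch(self) -> list:
        # Decode steps are batched by request count on the decode partition.
        n = min(self.decode_batch, len(self.decode_q))
        return [self.decode_q.popleft() for _ in range(n)]

    def step_decode(self, reqs: list) -> None:
        # Each decode step emits one token; unfinished requests are re-queued.
        for r in reqs:
            r.generated += 1
            if r.generated < r.max_new_tokens:
                self.decode_q.append(r)
```

In a serving loop, the prefill and decode partitions would each repeatedly pull their own batches (`next_prefill_batch` / `next_decode_batch`) and run concurrently, which is the multiplexing benefit the summary above refers to.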

Sources

Optimizing SLO-oriented LLM Serving with PD-Multiplexing

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Hardware-based Heterogeneous Memory Management for Large Language Model Inference

SLO-Aware Scheduling for Large Language Model Inferences

High-Throughput LLM inference on Heterogeneous Clusters

SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

Circinus: Efficient Query Planner for Compound ML Serving

Evaluating Learned Query Performance Prediction Models at LinkedIn: Challenges, Opportunities, and Findings

HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
