The field of large language model (LLM) serving is evolving rapidly, with a focus on improving efficiency, reducing latency, and increasing throughput. Recent work explores novel memory management frameworks, attention disaggregation techniques, and QoS-driven inference serving systems, all aimed at the core challenges of serving LLMs: high computational cost, large memory footprints, and tight latency constraints. Proposed solutions improve resource utilization, prevent performance interference between co-located workloads, and enable fine-grained QoS differentiation, moving the field toward more efficient, scalable, and flexible serving architectures. Noteworthy papers include:
- PipeBoost, which introduces a resilient pipelined architecture for fast serverless LLM scaling, reducing inference latency by 31–49.8%.
- Jenga, a novel memory allocation framework that improves GPU memory utilization by up to 79.6% and increases serving throughput by up to 4.92x.
- Adrenaline, an attention disaggregation and offloading mechanism that enhances resource utilization and performance in LLM serving systems, achieving 2.28x higher memory capacity and 2.07x better memory bandwidth utilization.
- Niyama, a QoS-driven inference serving system that enables efficient co-scheduling of diverse workloads on shared infrastructure, increasing serving capacity by 32% and reducing SLO violations by an order of magnitude.
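To make the memory-allocation theme concrete, the sketch below shows a minimal fixed-size block pool for KV-cache memory, the general paged-allocation idea that frameworks like Jenga build on. This is an illustrative toy, not Jenga's actual design or API; all class and method names are invented for the example.

```python
class BlockAllocator:
    """Toy block-pool allocator for KV-cache memory (illustrative sketch).

    Real systems like Jenga handle heterogeneous embedding sizes and
    layer-specific layouts; here we only show the core paged idea:
    sequences get whole fixed-size blocks from a shared free list,
    so freed memory is immediately reusable without fragmentation.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per block
        self.free = list(range(num_blocks))     # free-list of block ids

    def alloc_for_tokens(self, num_tokens: int):
        # Round up to whole blocks; return block ids, or None if out of memory.
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            return None
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        # Returned blocks become available to any other sequence.
        self.free.extend(blocks)


pool = BlockAllocator(num_blocks=4, block_size=16)
seq_a = pool.alloc_for_tokens(20)   # 20 tokens -> 2 blocks
print(len(seq_a), len(pool.free))   # 2 blocks held, 2 blocks free
print(pool.alloc_for_tokens(40))    # needs 3 blocks, only 2 free -> None
pool.release(seq_a)                 # blocks go back to the shared pool
```

The block granularity is the key trade-off: larger blocks mean less bookkeeping but more internal fragmentation at the tail of each sequence.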
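QoS-driven co-scheduling of the kind Niyama targets hinges on ordering requests by their service-level deadlines rather than by arrival time. The sketch below is a minimal earliest-deadline-first queue, one standard building block for such policies; it is not Niyama's actual algorithm, and the class and request names are hypothetical.

```python
import heapq
import itertools


class EDFScheduler:
    """Toy earliest-deadline-first request queue (illustrative sketch).

    QoS-driven serving systems layer richer mechanisms on top
    (admission control, dynamic batching, priority relegation); this
    only shows the core idea of dispatching by SLO deadline so that
    interactive and batch workloads can share one queue.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal deadlines

    def submit(self, request_id: str, deadline: float):
        # deadline: absolute time (seconds) by which the SLO must be met
        heapq.heappush(self._heap, (deadline, next(self._counter), request_id))

    def next_request(self):
        # Dispatch the request whose deadline is soonest; None if idle.
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id


sched = EDFScheduler()
sched.submit("batch-summarize", deadline=30.0)  # latency-tolerant workload
sched.submit("chat-turn", deadline=0.5)         # interactive, tight SLO
sched.submit("doc-qa", deadline=5.0)
print(sched.next_request())  # tightest deadline dispatched first
```

Deadline ordering is what lets tight-SLO interactive requests overtake batch work on shared infrastructure instead of requiring statically partitioned capacity.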