Recent work on Large Language Model (LLM) serving has focused on improving user experience and resource utilization. Researchers increasingly call for more accurate, unified metrics for evaluating serving performance, particularly service level objectives (SLOs) and goodput. These metrics aim to reflect user experience more faithfully by accounting for problems such as delayed token delivery and dropped requests. There is also growing interest in practical scheduling techniques that can be adopted in existing systems with little effort to raise throughput and reduce latency. This includes novel scheduling algorithms that account for both the computational and memory constraints of LLM inference, especially in augmented LLMs, where integrating external data adds complexity. The field is moving toward frameworks that both improve current strategies and give future research on LLM serving optimization a coherent direction.
Noteworthy papers include one that proposes a unified metric framework to better reflect user experience in LLM serving, and another that introduces new scheduling techniques that outperform current methods on production workload traces.
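
To make the SLO and goodput terminology above concrete, here is a minimal sketch of one common way goodput is computed: the number of requests per second that satisfy both a time-to-first-token (TTFT) limit and a per-token gap limit. This is an illustration only, not the unified metric proposed in the paper mentioned above. The names `RequestTrace`, `meets_slo`, and `goodput`, and the threshold values, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestTrace:
    """Timing record for one request (hypothetical structure, times in seconds)."""
    arrival: float
    # Timestamp at which each output token was delivered; empty if the request was dropped.
    token_times: List[float] = field(default_factory=list)


def meets_slo(req: RequestTrace, ttft_slo: float, tbt_slo: float) -> bool:
    """Return True if the request met both latency targets.

    Assumed SLO definition: the first token arrives within `ttft_slo` of the
    request's arrival, and no gap between consecutive tokens exceeds `tbt_slo`.
    """
    if not req.token_times:
        return False  # dropped request: no tokens were delivered
    if req.token_times[0] - req.arrival > ttft_slo:
        return False  # first token was too slow
    gaps = (later - earlier
            for earlier, later in zip(req.token_times, req.token_times[1:]))
    return all(gap <= tbt_slo for gap in gaps)  # any late token violates the SLO


def goodput(traces: List[RequestTrace], window_s: float,
            ttft_slo: float, tbt_slo: float) -> float:
    """SLO-compliant requests per second over an observation window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    compliant = sum(meets_slo(r, ttft_slo, tbt_slo) for r in traces)
    return compliant / window_s


traces = [
    RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0]),  # meets both targets
    RequestTrace(arrival=0.0, token_times=[2.0, 2.25]),       # slow first token
    RequestTrace(arrival=1.0, token_times=[1.5, 3.0]),        # delayed second token
    RequestTrace(arrival=2.0),                                # dropped request
]
print(goodput(traces, window_s=4.0, ttft_slo=1.0, tbt_slo=0.5))  # 0.25
```

Raw throughput for this trace would count the three requests that produced any tokens, while this definition counts only the single request that met both targets. That gap is the reason goodput is often preferred as a proxy for user experience. A stricter or more lenient rule, such as tolerating a bounded number of late tokens, would change only `meets_slo`.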