Optimizing Large Language Model Serving

The field of Large Language Models (LLMs) is moving towards optimizing serving efficiency, with a focus on reducing inference latency, lowering memory overhead, and increasing throughput. Researchers are exploring new approaches to key-value (KV) cache management, mixed-precision quantization, and dynamic chunking to achieve these goals. Notably, sentence-level semantic caching and chunk-adaptive mixed-precision quantization have shown promising results in reducing memory usage and improving computational efficiency. Furthermore, the development of QoS-driven inference serving systems is enabling more efficient co-scheduling of diverse workloads on shared infrastructure. Noteworthy papers include Niyama, which introduces a QoS-driven inference serving system that increases serving capacity by 32% while maintaining QoS guarantees, and SentenceKV, which proposes a sentence-level semantic KV caching approach that reports better efficiency and lower memory usage than state-of-the-art methods.
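To make the chunk-adaptive mixed-precision idea concrete, the following is a minimal sketch, not the method from Cocktail or any other listed paper: it splits a cached key/value tensor into fixed-size token chunks, scores each chunk with a simple per-chunk mean-absolute-value saliency proxy, and keeps more bits for higher-saliency chunks. All function names, the saliency heuristic, and the bit-width choices are illustrative assumptions.

```python
import numpy as np

def quantize_chunk(chunk: np.ndarray, bits: int):
    """Uniformly quantize one KV-cache chunk to the given bit-width.

    Returns integer codes plus the (scale, zero_point) needed to
    dequantize. Illustrative only; real systems pack codes tightly.
    """
    qmax = (1 << bits) - 1
    lo, hi = float(chunk.min()), float(chunk.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = np.round((chunk - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_chunk(codes: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + zero

def chunk_adaptive_quantize(kv: np.ndarray, chunk_size: int = 64,
                            hi_bits: int = 8, lo_bits: int = 4,
                            saliency_quantile: float = 0.75):
    """Split the cache along the token axis into chunks and assign each
    chunk a bit-width based on its mean absolute value (a stand-in for
    whatever saliency signal a real system would use)."""
    n_tokens = kv.shape[0]
    chunks = [kv[i:i + chunk_size] for i in range(0, n_tokens, chunk_size)]
    saliency = np.array([np.abs(c).mean() for c in chunks])
    threshold = np.quantile(saliency, saliency_quantile)
    packed = []
    for chunk, score in zip(chunks, saliency):
        bits = hi_bits if score >= threshold else lo_bits
        packed.append((quantize_chunk(chunk, bits), bits))
    return packed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv_cache = rng.normal(size=(512, 128)).astype(np.float32)  # [tokens, head_dim]
    packed = chunk_adaptive_quantize(kv_cache)
    # Reconstruct and report the error introduced by mixed-precision storage.
    recon = np.concatenate([dequantize_chunk(*q) for q, _ in packed])
    print("mean abs error:", np.abs(recon - kv_cache).mean())
```

The design choice being illustrated is the trade-off these papers target: spending the memory budget unevenly, so that chunks that matter more for attention keep higher precision while the rest of the KV cache is stored more aggressively compressed.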

Sources

Long-Tail Crisis in Nearest Neighbor Language Models

Niyama: Breaking the Silos of LLM Inference Serving

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Fundamentals of Caching Layered Data objects

Comparative Analysis of Distributed Caching Algorithms: Performance Metrics and Implementation Considerations
