The field of Large Language Models (LLMs) is advancing rapidly, with significant effort devoted to making inference faster, more efficient, and more environmentally sustainable. Recent work has introduced new approaches to speculative decoding, a technique in which a small draft model proposes tokens that the large target model then verifies, accelerating inference without changing the output. These advances include explicitly modeling adaptive draft structures, exploiting heterogeneous CPU/GPU hardware, and adding hardware-level support for speculative decoding. In parallel, there is growing attention to reducing the environmental impact of LLM serving by reusing older, lower-performing GPUs, addressing the high computational intensity and resource demands of these workloads.
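The draft-then-verify loop at the heart of speculative decoding can be sketched in a few lines. This is a minimal greedy illustration, not any of the systems above: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, and the verification pass is shown sequentially even though real systems score all drafted positions in one parallel forward pass.

```python
# Minimal sketch of greedy speculative decoding (draft-then-verify).
# target_next / draft_next are toy stand-ins for real model calls.

def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    """Generate num_new tokens: the cheap draft model proposes k tokens,
    the target model verifies them and keeps the longest agreeing prefix."""
    seq = list(prompt)
    generated = 0
    while generated < num_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model checks each proposed position (done as one
        #    parallel pass in practice; sequential here for clarity).
        accepted, ctx = [], list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3. Keep the accepted prefix, then take one token from the
        #    target so every iteration makes progress.
        seq += accepted
        seq.append(target_next(seq))
        generated += len(accepted) + 1
    return seq[len(prompt):len(prompt) + num_new]

# Toy models: the target depends only on context length; the draft
# agrees with it except at every fifth position.
target_next = lambda ctx: (len(ctx) * 7) % 13
draft_next = lambda ctx: target_next(ctx) if len(ctx) % 5 else -1
```

Because rejected drafts are always replaced by the target's own token, the output is token-for-token identical to plain greedy decoding with the target model; the draft only affects how many target calls are needed, which is where the speedup comes from.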
Noteworthy papers in this area include:
- AdaEAGLE: Introduces a framework that explicitly models adaptive draft structures, achieving significant speedup without manual thresholds.
- Dovetail: Proposes a CPU/GPU heterogeneous speculative decoding approach, improving hardware resource utilization and inference speed.
- HADES: Presents a hardware-accelerated speculative decoding design, improving LLM performance and energy efficiency.
- GreenLLM: Focuses on reducing carbon emissions by disaggregating LLM serving on heterogeneous GPUs, demonstrating substantial environmental benefits.