Efficient Deployment of Large Language Models on Edge Devices

The field of large language models (LLMs) is moving toward efficient deployment on edge devices, with a focus on reducing computational overhead and memory demands. Researchers are exploring techniques such as model compression, quantization, and hardware acceleration to run LLMs on resource-constrained hardware. Notable papers include D$^{2}$MoE, which combines dual routing with a dynamic scheduling algorithm and matryoshka weight quantization to raise inference throughput and shrink the memory footprint of on-device MoE serving. TeLLMe presents a ternary LLM accelerator for edge FPGAs that handles both prefilling and decoding, delivering notable energy-efficiency gains. COBRA introduces a binary Transformer accelerator built on true 1-bit binary multiplication, surpassing ternary methods in energy efficiency and throughput. On-Device Qwen2.5 rounds out the group, pairing model compression with hardware acceleration for efficient on-device inference.
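To make the ternary idea behind accelerators like TeLLMe concrete, here is a minimal NumPy sketch of absmean ternary quantization, a common heuristic from the ternary-LLM literature. The function names and the specific threshold rule are illustrative assumptions, not the exact schemes used in the cited papers.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} with a per-tensor scale.

    Uses the absmean rounding heuristic common in ternary-LLM work;
    the actual quantizers in TeLLMe/COBRA differ in detail.
    """
    scale = np.mean(np.abs(w))  # per-tensor scale factor
    q = np.clip(np.round(w / (scale + 1e-8)), -1, 1).astype(np.int8)
    return q, scale

def ternary_matmul(x: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    """Matmul against ternary weights: only additions and subtractions are
    needed on the weight side; the single scale is applied once at the end."""
    return (x @ q.astype(x.dtype)) * scale

# Example: quantize a small random projection and compare outputs.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
x = rng.normal(size=(4, 64)).astype(np.float32)
q, s = ternary_quantize(w)
err = np.abs(x @ w - ternary_matmul(x, q, s)).mean()
print(f"mean abs error vs. full precision: {err:.3f}")
```

Because every weight is -1, 0, or +1, the inner products reduce to additions and subtractions, which is what lets FPGA and binary/ternary accelerators avoid costly floating-point multipliers.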

Sources

D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving

TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference

On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration
