Advances in Efficient Large Language Model Inference

The field of large language models (LLMs) is moving toward more efficient inference, with a focus on reducing memory usage and computational overhead. Recent work centers on quantization techniques, such as activation and weight quantization, which reduce the precision of model weights and activations while maintaining accuracy. Another line of research explores novel sparsity paradigms, including flexible N:M sparsity and transitive sparsity, to cut the cost of General Matrix Multiplication (GEMM) operations. There is also growing interest in hardware accelerators and simulators that can efficiently support these techniques, such as digital compute-in-memory architectures and cycle-accurate systolic accelerator simulators.

Noteworthy papers include Gradual Binary Search and Dimension Expansion, which demonstrates a 40% increase in accuracy on common benchmarks compared to state-of-the-art methods, and FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization, which achieves less than 1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference.
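To make the two recurring ideas above concrete, here is a minimal NumPy sketch of symmetric per-row weight quantization and of an N:M structured-sparsity mask that keeps the N largest-magnitude weights in every group of M. It is not taken from any of the listed papers; the function names, the int8 format, and the 2:4 group size are illustrative assumptions.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-row int8 quantization: scale each output row so its
    largest-magnitude weight maps to 127, then round to integers."""
    max_abs = np.max(np.abs(w), axis=-1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 127.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float matrix from integers and per-row scales."""
    return q.astype(np.float32) * scale

def nm_sparsity_mask(w, n=2, m=4):
    """Build an N:M structured-sparsity mask: in every group of m consecutive
    weights along the last axis, keep only the n largest-magnitude entries."""
    rows, cols = w.shape
    assert cols % m == 0, "last dimension must be divisible by m"
    groups = np.abs(w).reshape(rows, cols // m, m)
    # The (m - n) smallest entries of each group are zeroed out.
    drop = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16)).astype(np.float32)

    q, scale = quantize_weights_int8(w)
    err = np.max(np.abs(w - dequantize(q, scale)))
    print(f"max int8 quantization error: {err:.4f}")

    mask = nm_sparsity_mask(w, n=2, m=4)
    print(f"kept fraction under 2:4 sparsity: {mask.mean():.2f}")  # 0.50
```

The papers below build far more refined versions of these primitives (fine-grained mixed precision, flexible N:M patterns) together with hardware support, but the same basic trade-off applies: lower precision and higher sparsity reduce memory and GEMM cost at some accuracy risk.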
Sources
Gradual Binary Search and Dimension Expansion: A general method for activation quantization in LLMs
FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator