Advances in Efficient Large Language Model Inference

The field of large language models (LLMs) is moving toward more efficient inference methods, with a focus on reducing memory usage and computational overhead. Recent developments have centered on quantization of both weights and activations, which reduces numerical precision while preserving accuracy. Another line of research explores novel sparsity paradigms, including flexible N:M sparsity and transitive sparsity, to cut the cost of General Matrix Multiplication (GEMM) operations. There is also growing interest in hardware accelerators and simulators that can efficiently support these techniques, such as digital compute-in-memory architectures and cycle-accurate systolic accelerator simulators.

Noteworthy papers include Gradual Binary Search and Dimension Expansion, which demonstrates a 40% increase in accuracy on common benchmarks compared to state-of-the-art methods, and FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization, which achieves less than 1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference.
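To make the two recurring ideas concrete, below is a minimal NumPy sketch, not drawn from any of the listed papers, of (a) symmetric per-channel int8 weight quantization and (b) an N:M (here 2:4) sparsity mask of the kind GEMM accelerators exploit. Function names and the per-channel granularity are illustrative assumptions; the papers above use finer-grained, mixed-precision, or hardware-specific variants.

```python
import numpy as np


def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization (illustrative only).

    Returns int8 weights and per-channel scales so that w_q * scale ~= w.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero rows
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale


def nm_sparsity_mask(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m along each row."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = np.abs(w).reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group get zeroed.
    drop = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)

    w_q, scale = quantize_per_channel_int8(w)
    print("max abs quantization error:", np.abs(w_q * scale - w).max())

    mask = nm_sparsity_mask(w, n=2, m=4)
    print("kept fraction:", mask.mean())              # 0.5 for 2:4 sparsity
```

In practice the quantization granularity (per-tensor, per-channel, or per-block) and the choice of which blocks stay in higher precision are exactly the design points the quantization papers above study, while N:M patterns matter because structured sparsity maps directly onto accelerator datapaths.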

Sources

Gradual Binary Search and Dimension Expansion: A general method for activation quantization in LLMs

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

SCALE-Sim v3: A modular cycle-accurate systolic accelerator simulator for end-to-end system analysis

BBAL: A Bidirectional Block Floating Point-Based Quantisation Accelerator for Large Language Models

Transitive Array: An Efficient GEMM Accelerator with Result Reuse
