Advances in Large Language Model Efficiency

The field of large language models (LLMs) is moving toward greater efficiency and lower computational cost. Recent research has focused on novel compression techniques, quantization methods, and caching strategies that enable practical deployment of LLMs. These advances have led to significant reductions in memory consumption and improvements in inference speed, making LLMs more accessible for real-world applications. Notably, approaches such as nested activation-aware decomposition, task-adaptive group-wise KV cache window selection, and log-distributed quantization have demonstrated strong performance and efficiency. Notable papers include "Large Language Model Compression via the Nested Activation-Aware Decomposition", which proposes a post-training compression paradigm for LLMs, and "LogQuant", a 2-bit quantization technique for the KV cache in LLM inference that delivers substantial memory savings while preserving accuracy.
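To make the idea of log-distributed low-bit quantization concrete, here is a minimal sketch of 2-bit quantization with geometrically spaced (log-uniform) levels, the general principle behind techniques like LogQuant. This is an illustrative toy, not the paper's actual algorithm: the function names, per-tensor scaling, and level placement are assumptions for the sketch; real KV-cache schemes typically quantize per group or per channel and handle outliers separately.

```python
import numpy as np

def log_quantize_2bit(x, eps=1e-8):
    """Toy 2-bit quantizer with log-distributed levels (illustrative only).

    Magnitudes are mapped to log space and bucketed into 2**2 = 4 codes,
    so the reconstruction levels are geometrically spaced in linear space.
    """
    sign = np.sign(x)
    log_mag = np.log2(np.abs(x) + eps)       # work in log space
    lo, hi = log_mag.min(), log_mag.max()    # per-tensor range (assumption)
    # 4 uniformly spaced levels in log space
    codes = np.clip(np.round((log_mag - lo) / (hi - lo + eps) * 3), 0, 3)
    return codes.astype(np.uint8), sign, lo, hi

def log_dequantize_2bit(codes, sign, lo, hi):
    """Reconstruct values from 2-bit codes and the stored log-range."""
    log_mag = lo + codes.astype(np.float64) / 3.0 * (hi - lo)
    return sign * np.exp2(log_mag)
```

Because the levels are spaced uniformly in log space, values spanning several orders of magnitude (e.g. 0.001 to 1.0) each land near their own level, whereas uniform linear quantization at 2 bits would collapse all the small values into a single bucket.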

Sources

Large Language Model Compression via the Nested Activation-Aware Decomposition

Variance Control via Weight Rescaling in LLM Pre-training

Improving Quantization with Post-Training Model Expansion

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

xKV: Cross-Layer SVD for KV-Cache Compression

Rank-Based Modeling for Universal Packets Compression in Multi-Modal Communications

Understanding and Improving Information Preservation in Prompt Compression for LLMs

QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

A Refined Analysis of Massive Activations in LLMs
