Efficient Computing Paradigms for Large Language Models

The field of large language models is shifting toward more efficient computing paradigms in response to rising energy demands and the slowdown of Moore's law. Researchers are exploring novel architectures and techniques that reduce computational cost while preserving performance. One key direction is quantization: compressing model weights and inference-time state to low bitwidths without sacrificing accuracy. Another is the role of memory, from associative memory mechanisms to in-memory computing hardware. Noteworthy papers include MILLION, which achieves a low-bitwidth KV cache through outlier-immunized product quantization, and RaanA, a unified post-training quantization (PTQ) framework that addresses key limitations of existing PTQ methods. Together, these advances are expected to substantially improve the sustainability and efficiency of large language models.
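
The KV-cache quantization direction is easiest to see with a concrete toy. Below is a minimal, illustrative sketch of product quantization applied to cached key vectors, in the spirit of the compression MILLION targets; it is not that paper's implementation, and the helper names (`train_pq_codebooks`, `pq_encode`, `pq_decode`), the sub-vector count, and the 256-entry codebooks are assumptions chosen for the example.

```python
# Illustrative product-quantization (PQ) sketch for KV-cache keys.
# Not MILLION's implementation: sub-vector count, codebook size, and the
# plain k-means training loop are assumptions made for this toy example.
import numpy as np

def train_pq_codebooks(keys, num_subvectors=8, codebook_size=256, iters=10, seed=0):
    """Learn one k-means codebook per sub-vector slice of the key dimension."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    sub_dim = d // num_subvectors
    codebooks = []
    for s in range(num_subvectors):
        sub = keys[:, s * sub_dim:(s + 1) * sub_dim]
        # Initialize centroids from random samples, then run Lloyd's iterations.
        centroids = sub[rng.choice(n, codebook_size, replace=False)].copy()
        for _ in range(iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for c in range(codebook_size):
                members = sub[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(keys, codebooks):
    """Store each sub-vector as a one-byte centroid index instead of floats."""
    sub_dim = keys.shape[1] // len(codebooks)
    codes = np.empty((keys.shape[0], len(codebooks)), dtype=np.uint8)
    for s, cb in enumerate(codebooks):
        sub = keys[:, s * sub_dim:(s + 1) * sub_dim]
        codes[:, s] = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
    return codes

def pq_decode(codes, codebooks):
    """Approximately reconstruct keys by looking up one centroid per sub-vector."""
    return np.concatenate([cb[codes[:, s]] for s, cb in enumerate(codebooks)], axis=1)

if __name__ == "__main__":
    keys = np.random.randn(4096, 128).astype(np.float32)  # toy cached keys
    cbs = train_pq_codebooks(keys)
    codes = pq_encode(keys, cbs)                           # 8 bytes/key vs. 512
    approx = pq_decode(codes, cbs)
    print("compression:", keys.nbytes / codes.nbytes,
          "rel. error:", np.linalg.norm(keys - approx) / np.linalg.norm(keys))
```

With eight one-byte codes replacing each 128-dimensional float32 key, the cache shrinks by roughly 64x at the cost of approximation error; production methods additionally deal with outliers and quantize values, which is where the papers below differ.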

Sources

Oscillatory Associative Memory with Exponential Capacity

Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency

MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization

RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm

Reconfigurable Time-Domain In-Memory Computing Macro using CAM FeFET with Multilevel Delay Calibration in 28 nm CMOS

Enhancing Biologically Inspired Hierarchical Temporal Memory with Hardware-Accelerated Reflex Memory

Efficient Calibration for RRAM-based In-Memory Computing using DoRA

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Achieving binary weight and activation for LLMs using Post-Training Quantization

Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights