Efficient Large Language Models

The field of large language models is moving toward improving efficiency and reducing computational cost. Researchers are exploring techniques such as funneling, pruning, and compression to this end. Notable papers include 'Revisiting Funnel Transformers for Modern LLM Architectures', which investigates the impact of funneling in contemporary transformer architectures; 'Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models', which introduces a family of hybrid models with accuracy on par with or better than other state-of-the-art models while being up to 3x faster at inference; and 'Entropy-Based Block Pruning for Efficient Large Language Models', which proposes an entropy-based pruning strategy that improves efficiency while maintaining performance.
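To make the block-pruning idea concrete, the sketch below scores each transformer block by the entropy of its activations on a calibration batch and drops the lowest-scoring blocks. This is only an illustrative toy, not the criterion from the cited paper: the entropy proxy (a histogram over hidden states), the keep ratio, and the simulated activations are all assumptions made for the example.

```python
# Illustrative sketch of entropy-scored block pruning (assumed heuristic,
# not the exact method of the cited paper). Block outputs are simulated
# with random activations; in practice they would come from a calibration
# forward pass through the model.
import numpy as np

rng = np.random.default_rng(0)
num_blocks, hidden = 12, 64
# Simulated per-block hidden-state activations for a calibration batch.
block_outputs = [rng.normal(size=(256, hidden)) * (i % 4 + 1) for i in range(num_blocks)]

def activation_entropy(x: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy of a histogram over a block's activations (one simple proxy)."""
    hist, _ = np.histogram(x, bins=bins, density=True)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

scores = [activation_entropy(h) for h in block_outputs]
keep_ratio = 0.75                       # assumed target compression level
num_keep = int(num_blocks * keep_ratio)
# Keep the highest-entropy blocks; prune the rest, on the assumption that
# low-entropy blocks carry less information.
kept = sorted(np.argsort(scores)[-num_keep:])
print("blocks kept:", kept)
```

In a real pipeline the retained block indices would be used to rebuild a shallower model, which is then optionally fine-tuned to recover any lost accuracy.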

Sources

Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Task-Aware Parameter-Efficient Fine-Tuning of Large Pre-Trained Models at the Edge

Entropy-Based Block Pruning for Efficient Large Language Models

STEP: Staged Parameter-Efficient Pre-training for Large Language Models

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning

Compression Laws for Large Language Models

Spatial-Geometry Enhanced 3D Dynamic Snake Convolutional Neural Network for Hyperspectral Image Classification

Saliency-driven Dynamic Token Pruning for Large Language Models

Dynamic Vision Mamba

Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

Find A Winning Sign: Sign Is All We Need to Win the Lottery

Lattice: Learning to Efficiently Compress the Memory

DefMamba: Deformable Visual State Space Model

Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation

Mosaic: Composite Projection Pruning for Resource-efficient LLMs

The Method for Storing Patterns in Neural Networks-Memorization and Recall of QR code Patterns-

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Adaptive Computation Pruning for the Forgetting Transformer
