Large Language Model Research

Current Developments in Large Language Model Research

The field of large language models (LLMs) is advancing rapidly, with much of the effort aimed at improving efficiency, scalability, and performance. Recent work concentrates on the computational challenges of training and deploying LLMs, particularly memory usage, training time, and inference latency.

General Direction of the Field

  1. Efficient Training Techniques: There is a growing emphasis on reducing the memory footprint and computational requirements of LLM training. Techniques such as activation offloading, modular decomposition, and hybrid parallelism are being explored to make training feasible on resource-constrained hardware (a minimal offloading sketch appears after this list).

  2. Model Compression and Pruning: Innovations in model compression and pruning aim to reduce the size of LLMs without compromising performance. These methods include structured and unstructured pruning, distillation, and low-rank factorization, which can substantially cut parameter counts and computational costs (a pruning and low-rank sketch follows the list).

  3. Long-Context Handling: Extending the context length that LLMs can process effectively is another major focus. Approaches such as parallel decoding, information bottleneck-based compression, and state space models are being developed so that LLMs can handle longer sequences more efficiently (a toy state-space recurrence is sketched after the list).

  4. Dynamic Activation and Sparsity: Research is exploring dynamic activation and sparsity to improve inference efficiency. These methods exploit the inherent sparsity in LLMs and adjust activations based on the input sequence, accelerating generation (see the activation-sparsity example after the list).

  5. Edge AI and Collaborative Frameworks: With the rise of edge computing, there is a growing interest in developing frameworks that allow for efficient fine-tuning and deployment of LLMs on edge devices. Collaborative edge AI frameworks are being designed to leverage distributed resources and optimize resource utilization.
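
To make the activation-offloading idea in item 1 concrete, below is a minimal PyTorch sketch that uses the stock torch.autograd.graph.saved_tensors_hooks API to park saved activations in host (CPU) memory between the forward and backward passes. It offloads to host RAM rather than to NVMe SSDs as TBA does, and the toy model and tensor sizes are illustrative assumptions rather than anything taken from the papers above.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def pack_to_cpu(tensor):
    # Pack hook: called when autograd saves an activation for backward;
    # move it off the accelerator into host memory.
    return tensor.to("cpu")

def unpack_to_device(tensor):
    # Unpack hook: called when backward needs the activation again;
    # bring it back to the compute device.
    return tensor.to(device)

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.GELU(),
    nn.Linear(1024, 1024), nn.GELU(),
).to(device)

x = torch.randn(8, 1024, device=device, requires_grad=True)

# Every tensor saved for backward inside this context is offloaded right
# after the forward pass and streamed back on demand during backward.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_device):
    loss = model(x).sum()

loss.backward()
print(x.grad.shape)
```

The pack/unpack pair is also where an SSD-backed implementation would serialize tensors to fast storage and reload them, trading GPU memory for I/O bandwidth.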
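
For item 2, the sketch below applies two generic compression primitives, unstructured magnitude pruning and low-rank factorization, to a random weight matrix. It illustrates the general techniques only and is not the specific procedure of MoDeGPT, LLM-Barber, or Minitron; the matrix shape, sparsity level, and rank are arbitrary choices for the example.

```python
import torch

def magnitude_prune(weight, sparsity):
    # Unstructured magnitude pruning: zero out the smallest-|w| entries.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def low_rank_factors(weight, rank):
    # Replace W (d_out x d_in) with factors A @ B of the given rank via truncated SVD.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), columns scaled by singular values
    B = Vh[:rank, :]             # (rank, d_in)
    return A, B

W = torch.randn(1024, 4096)
W_sparse = magnitude_prune(W, sparsity=0.5)   # half of the entries zeroed
A, B = low_rank_factors(W, rank=128)          # 655,360 params vs. 4,194,304 for W

print((W_sparse == 0).float().mean().item())  # ~0.5
print((torch.linalg.matrix_norm(W - A @ B) / torch.linalg.matrix_norm(W)).item())
```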
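
For the state-space direction in item 3, the toy recurrence below shows why SSMs scale linearly with sequence length: each step updates a fixed-size state rather than attending over every previous token. This is a bare diagonal linear SSM for illustration only, not the selective-scan machinery of Mamba-style models such as Jamba-1.5; all dimensions and initializations are made up for the example.

```python
import torch

def diagonal_ssm(u, A, B, C):
    # x_t = A * x_{t-1} + u_t @ B   (elementwise A, i.e. a diagonal transition)
    # y_t = x_t @ C
    # Per-step cost is constant, so total cost grows linearly in sequence length.
    batch, length, _ = u.shape
    x = torch.zeros(batch, A.shape[0])
    outputs = []
    for t in range(length):
        x = A * x + u[:, t] @ B   # update the fixed-size recurrent state
        outputs.append(x @ C)     # linear readout at each step
    return torch.stack(outputs, dim=1)

d_in, d_state, d_out, length = 16, 32, 16, 2048
u = torch.randn(4, length, d_in)
A = torch.rand(d_state) * 0.99               # stable diagonal transition (|A_i| < 1)
B = torch.randn(d_in, d_state) / d_in ** 0.5
C = torch.randn(d_state, d_out) / d_state ** 0.5

y = diagonal_ssm(u, A, B, C)
print(y.shape)   # torch.Size([4, 2048, 16])
```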
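
For item 4, here is a minimal sketch of dynamic, input-dependent activation sparsity: a feed-forward block zeroes hidden units whose magnitude falls below a threshold and reports how sparse the hidden layer was for that input. The module, threshold, and sizes are hypothetical, and in practice the mask only yields speedups when paired with sparse kernels that actually skip the masked work, which this toy example does not do.

```python
import torch
import torch.nn as nn

class ThresholdSparseFFN(nn.Module):
    """Feed-forward block that zeroes hidden activations below a magnitude
    threshold, producing an input-dependent sparsity pattern."""

    def __init__(self, d_model=512, d_hidden=2048, threshold=0.1):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.threshold = threshold

    def forward(self, x):
        h = torch.relu(self.up(x))
        mask = h.abs() > self.threshold   # dynamic, per-token mask
        h = h * mask                      # drop weakly activated neurons
        sparsity = 1.0 - mask.float().mean().item()
        return self.down(h), sparsity

ffn = ThresholdSparseFFN()
x = torch.randn(2, 128, 512)
y, sparsity = ffn(x)
print(y.shape, f"hidden sparsity ~ {sparsity:.2f}")
```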

Noteworthy Papers

  1. TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: This paper introduces a novel approach to offload activations to high-capacity NVMe SSDs, significantly reducing GPU memory usage and improving training efficiency.

  2. Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models: The MOHAWK method presented in this paper distills Transformer architectures into more efficient subquadratic models, offering a way to reuse the computational resources already invested in training attention-based models.

  3. MoDeGPT: Modular Decomposition for Large Language Model Compression: MoDeGPT offers a novel structured compression framework that does not require recovery fine-tuning, achieving significant compute cost savings and maintaining high performance.

  4. Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning: This framework introduces techniques to break the resource wall of personal LLM fine-tuning on edge devices, achieving substantial speedups and memory reductions.

  5. LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models: LLM-Barber presents a novel one-shot pruning framework that rebuilds the sparsity mask without retraining, achieving state-of-the-art results in perplexity and zero-shot performance.

These developments underscore the dynamic and innovative nature of the field, with researchers continuously pushing the boundaries of what is possible in terms of efficiency, scalability, and performance of large language models.

Sources

TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading

Attention is a smoothed cubic spline

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

MoDeGPT: Modular Decomposition for Large Language Model Compression

Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

LLM Pruning and Distillation in Practice: The Minitron Approach

FocusLLM: Scaling LLM's Context by Parallel Decoding

First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

Practical token pruning for foundation models in few-shot conversational virtual assistant systems

Macformer: Transformer with Random Maclaurin Feature Attention

Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining

Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

ssProp: Energy-Efficient Training for Convolutional Neural Networks with Scheduled Sparse Back Propagation

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning