The field of large language models (LLMs) is rapidly advancing toward more efficient training and fine-tuning methods that address the substantial computational and memory demands these models entail. A notable trend is the development of parameter-efficient fine-tuning (PEFT) techniques that reduce resource usage without compromising performance. Innovations in this area include novel approaches to low-rank approximation and tensor decomposition, which sharply cut memory and computational requirements. These methods not only make it feasible to train large models on consumer-grade hardware but also maintain, and in some cases enhance, performance across a range of tasks. There is also a growing focus on optimizing the architecture of LLMs themselves, for example by converting multi-head attention (MHA) models into grouped-query attention (GQA) models to improve inference speed without substantial performance degradation.
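To make the low-rank idea concrete, here is a minimal PyTorch sketch of a LoRA-style adapter: the pretrained weight is frozen and only a rank-r product B·A is trained, so the trainable parameter count drops from d_out·d_in to r·(d_in + d_out). The class name, rank, and scaling below are illustrative choices, not taken from any of the papers that follow.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update (LoRA-style sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # Low-rank factors: A projects down to r dims, B projects back up.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Usage: wrap an existing projection; only r * (d_in + d_out) parameters train.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))
```

Zero-initializing B keeps the adapted layer identical to the pretrained one at the start of fine-tuning; the papers below push further by also compressing gradients, optimizer states, or the weights themselves.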
Noteworthy Papers
- Gradient Weight-normalized Low-rank Projection for Efficient LLM Training: Introduces GradNormLoRP, a method that significantly reduces optimizer memory usage and enables efficient pre-training of large LLMs on consumer-level GPUs, outperforming existing low-rank methods in fine-tuning tasks.
- GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection: Proposes GaLore+, which improves fine-tuning speed and performance by using cross-head low-rank projection to cut the time spent estimating projections and by employing randomized subspace iteration for fast SVD (a generic sketch of this gradient-projection pattern follows the list).
- Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA: Presents a low-cost method for pruning MHA models into GQA models that substantially compresses the key-value heads with little performance degradation and remains compatible with rotary position embedding (a simplified KV-head merging sketch follows the list).
- DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models: Introduces DoTA and its quantized version QDoTA, which leverage matrix product operator (MPO) decomposition of pretrained weights for effective initialization when fine-tuning LLMs, showing superior performance with fewer trainable parameters and reduced memory consumption (a minimal MPO sketch follows the list).
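The first two entries build on the same underlying pattern: project each gradient into a low-rank subspace so the optimizer states occupy r dimensions instead of the full matrix width, then project the update back. The sketch below shows that generic pattern, using randomized subspace iteration in place of a full SVD in the spirit of GaLore+; the weight normalization of GradNormLoRP and the cross-head sharing of GaLore+ are omitted, and all function and variable names are assumptions.

```python
import torch

def randomized_range(G: torch.Tensor, r: int, n_iter: int = 2) -> torch.Tensor:
    """Approximate an orthonormal basis for the top-r column space of G
    via randomized subspace iteration."""
    m, n = G.shape
    Q = torch.randn(n, r, device=G.device, dtype=G.dtype)
    Y = G @ Q                               # (m, r) random sketch of the range
    Q, _ = torch.linalg.qr(Y)
    for _ in range(n_iter):                 # power iterations sharpen the basis
        Q, _ = torch.linalg.qr(G.T @ Q)     # (n, r)
        Q, _ = torch.linalg.qr(G @ Q)       # (m, r)
    return Q                                # orthonormal (m, r) projector

def projected_step(W: torch.Tensor, G: torch.Tensor, state: dict, lr=1e-3, r=64):
    """One update step with the gradient compressed into an r-dim subspace."""
    if state.get("P") is None:              # refresh the projector periodically
        state["P"] = randomized_range(G, r)
    P = state["P"]                          # (m, r)
    g_low = P.T @ G                         # optimizer only ever sees (r, n)
    state["m"] = 0.9 * state.get("m", 0.0) + 0.1 * g_low  # e.g. momentum in low rank
    W -= lr * (P @ state["m"])              # project the update back to full size

state = {}
W = torch.randn(1024, 1024)
projected_step(W, torch.randn_like(W), state)
```

GaLore-style methods typically refresh the projector only every few hundred steps, which is exactly where replacing a full SVD with the randomized estimate pays off.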
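For the MHA-to-GQA conversion, the mechanical step is merging the key/value projections of several attention heads into one shared head per group. The sketch below performs only the naive merge, mean-pooling per-head K/V weights within each group; the paper's contribution is the alignment applied to the heads before this merge, which is not reproduced here, and the weight layout is an assumption.

```python
import torch

def merge_kv_heads(w_kv: torch.Tensor, n_heads: int, n_groups: int) -> torch.Tensor:
    """Mean-pool per-head K or V projection weights into n_groups shared heads.

    w_kv: (n_heads * head_dim, d_model) weight of the K or V projection.
    Returns a (n_groups * head_dim, d_model) weight for the GQA model.
    """
    assert n_heads % n_groups == 0
    head_dim = w_kv.shape[0] // n_heads
    d_model = w_kv.shape[1]
    per_head = w_kv.view(n_heads, head_dim, d_model)
    grouped = per_head.view(n_groups, n_heads // n_groups, head_dim, d_model)
    return grouped.mean(dim=1).reshape(n_groups * head_dim, d_model)

# Example: compress 32 KV heads down to 8 groups (4x fewer KV heads to cache).
w_k = torch.randn(32 * 128, 4096)
w_k_gqa = merge_kv_heads(w_k, n_heads=32, n_groups=8)  # (8 * 128, 4096)
```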
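DoTA's initialization rests on a matrix product operator (MPO) decomposition of the pretrained weights. The sketch below shows a minimal two-core version obtained by reshaping the matrix and truncating an SVD to a chosen bond dimension; DoTA's actual factorization, its training recipe, and the quantization in QDoTA go beyond this, and the reshape sizes, function names, and bond dimension are illustrative assumptions.

```python
import torch

def mpo_two_core(W: torch.Tensor, m_dims, n_dims, bond: int):
    """Split W of shape (m1*m2, n1*n2) into two MPO cores with bond dimension `bond`."""
    (m1, m2), (n1, n2) = m_dims, n_dims
    # Pair row factor m1 with column factor n1 (and m2 with n2), then SVD.
    T = W.reshape(m1, m2, n1, n2).permute(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vh = torch.linalg.svd(T, full_matrices=False)
    core1 = (U[:, :bond] * S[:bond]).reshape(m1, n1, bond)  # (m1, n1, d)
    core2 = Vh[:bond].reshape(bond, m2, n2)                 # (d, m2, n2)
    return core1, core2

def mpo_reconstruct(core1, core2):
    """Contract the two cores back into a (m1*m2, n1*n2) matrix."""
    m1, n1, _ = core1.shape
    _, m2, n2 = core2.shape
    full = torch.einsum("aid,dbj->abij", core1, core2)      # (m1, m2, n1, n2)
    return full.reshape(m1 * m2, n1 * n2)

# Example: factor a 1024x1024 weight with a small bond dimension.
W = torch.randn(1024, 1024)
c1, c2 = mpo_two_core(W, m_dims=(32, 32), n_dims=(32, 32), bond=16)
W_approx = mpo_reconstruct(c1, c2)
```

The bond dimension plays the same role as the rank in LoRA-style methods: it caps both the number of trainable parameters and how faithfully the decomposition reconstructs the original weight.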