Enhancing Data Efficiency and Quality in LLM Instruction Tuning

Recent work on LLM instruction tuning has focused on improving data efficiency and quality control. Researchers are increasingly pruning and curating datasets, leveraging both synthetic and real data to optimize model performance. Key innovations include random matrix theory for data pruning, diversity-aware score curation, and decomposed difficulty data selection frameworks. These approaches address the inherent challenges of data redundancy, noise, and the mismatch between the selected data and the model's learning tasks. Notably, federated learning is being combined with data-efficient tuning strategies to reduce communication overhead and improve responsiveness to unseen tasks. The balance between label quantity and quality in scalable elicitation methods is also being examined rigorously, with a focus on reducing reliance on costly human annotations. Overall, the field is moving toward more sophisticated data selection and quality assurance techniques that promise to significantly enhance the adaptability and performance of LLMs in both specialized domains and general tasks.

Noteworthy papers include one introducing a diversity-aware score curation method for data selection, which demonstrates that a carefully curated subset can outperform larger datasets, and another proposing a two-stage, model-centric data selection framework for optimized domain adaptation, which shows superior accuracy in medical-domain experiments.
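To make the data-selection idea concrete, below is a minimal, illustrative sketch of greedy subset selection that trades a per-example quality score off against redundancy among instruction embeddings. It is not the method of any paper listed under Sources; the function name, the diversity weight, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def select_subset(embeddings, quality_scores, k, diversity_weight=0.5):
    """Greedily pick k examples, trading off quality against redundancy.

    embeddings: (n, d) array of unit-normalized instruction embeddings.
    quality_scores: (n,) array of per-example quality ratings.
    Returns the indices of the selected subset.
    """
    n = embeddings.shape[0]
    selected = []
    max_sim = np.zeros(n)  # similarity of each candidate to the chosen set
    for _ in range(min(k, n)):
        # Utility = quality minus a penalty for being close to already-kept examples.
        utility = quality_scores - diversity_weight * max_sim
        utility[selected] = -np.inf  # never re-pick an example
        best = int(np.argmax(utility))
        selected.append(best)
        # Update each candidate's redundancy estimate with the new pick.
        sims = embeddings @ embeddings[best]
        max_sim = np.maximum(max_sim, sims)
    return selected

# Toy usage: 1,000 synthetic examples with random embeddings and scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
scores = rng.uniform(size=1000)
subset = select_subset(emb, scores, k=100)
print(len(subset), "examples selected")
```

The greedy max-similarity penalty is just one simple way to encode diversity; the works cited below use considerably more elaborate scoring, rating, and curation pipelines.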

Sources

Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory

Improving Data Efficiency via Curating LLM-Driven Rating Systems

3DS: Decomposed Difficulty Data Selection's Case Study on LLM Medical Domain Adaptation

Federated Data-Efficient Instruction Tuning for Large Language Models

Data Quality Control in Federated Instruction-tuning of Large Language Models

The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models

A Little Human Data Goes A Long Way

Balancing Label Quantity and Quality for Scalable Elicitation

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
