Enhancing Dataset Quality and Diversity in Machine Learning

Recent developments in machine learning and natural language processing show a marked shift toward improving the quality and diversity of training datasets. Researchers are increasingly focused on building large-scale, high-quality datasets enriched with fine-grained information to improve the capabilities and reliability of large language models (LLMs). This trend is evident in datasets such as ChineseWebText 2.0, which incorporates multi-dimensional, fine-grained information to better serve the training needs of evolving LLMs. There is also growing emphasis on understanding and mitigating bias in both visual and textual datasets, as highlighted by studies of ImageNet and of LLM pretraining corpora. The field is likewise seeing new tools and frameworks for analyzing and characterizing dataset bias, which is crucial for building more diverse and representative datasets. Furthermore, synthetic data and classifier-ensembling techniques are being explored to optimize the trade-off between data quantity and quality, as in the Nemotron-CC dataset. Overall, the field is moving toward more sophisticated data curation and analysis methods that enhance both the performance and the fairness of machine learning models.
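To make the quantity-versus-quality trade-off concrete, the sketch below shows one common pattern behind classifier-ensembling for data curation: score each document with several quality signals, average the scores, and keep only the top-ranked fraction of the corpus. The scorers here (`length_score`, `alpha_score`) are simple illustrative stand-ins, not the actual classifiers used by Nemotron-CC or any other pipeline named above.

```python
# Hedged sketch: ensemble of toy quality scorers for corpus filtering.
# Both scorers and the equal-weight averaging are assumptions for
# illustration, not a real pipeline's configuration.

def length_score(doc: str) -> float:
    """Crude proxy: longer documents (up to ~100 words) score higher."""
    return min(len(doc.split()) / 100.0, 1.0)

def alpha_score(doc: str) -> float:
    """Crude proxy: fraction of alphabetic/whitespace characters."""
    if not doc:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in doc) / len(doc)

def ensemble_score(doc: str, scorers=(length_score, alpha_score)) -> float:
    """Average the individual quality scores (equal weights assumed)."""
    return sum(s(doc) for s in scorers) / len(scorers)

def filter_corpus(docs, keep_fraction=0.5):
    """Rank documents by ensemble score and keep the top fraction."""
    ranked = sorted(docs, key=ensemble_score, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]
```

Lowering `keep_fraction` yields a smaller but cleaner corpus; raising it keeps more data at the cost of letting lower-scoring documents through, which is precisely the trade-off such pipelines tune.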

Sources

Perception of Visual Content: Differences Between Humans and Foundation Models

ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

ICPR 2024 Competition on Multilingual Claim-Span Identification

Flaws of ImageNet, Computer Vision's Favourite Dataset

Understanding Bias in Large-Scale Visual Datasets

MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines

Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training

RedStone: Curating General, Code, Math, and QA Data for Large Language Models
