Training Data Attribution for Large Language Models

Report on Current Developments in Training Data Attribution for Large Language Models

General Direction of the Field

The field of training data attribution (TDA) for large language models (LLMs) is rapidly evolving, with a strong focus on enhancing the interpretability, reliability, and scalability of attribution methods. Recent advancements are driven by the need to address critical issues such as data intellectual property protection, hallucination tracing, and the overall transparency of LLMs. The current research direction is characterized by the integration of novel techniques from both traditional machine learning and emerging fields like neurosymbolic AI, as well as the development of comprehensive toolkits and libraries to facilitate the benchmarking and deployment of TDA methods.

One of the key trends is the refinement of influence functions, which are foundational to many TDA methods. Influence functions assume a well-fit model, so fitting errors incurred during training can bias and add noise to the resulting scores. Recent work addresses this limitation directly, developing methods that debias and denoise influence scores and thereby improve overall sourcing accuracy.
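
For orientation, the classical influence-function estimate that these methods refine scores a training example $z$ against a test example $z_{\text{test}}$ as follows (generic notation, not taken verbatim from any of the cited papers):

$$
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),
$$

where $\hat{\theta}$ is assumed to be an exact minimizer of the training loss. Deviations from that assumption in practice are precisely the fitting errors that debias-and-denoise corrections aim to account for.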

Another significant trend is the adoption of neurosymbolic AI approaches, which aim to combine the strengths of neural networks with structured symbolic reasoning. This integration is seen as a way to enhance the reliability and interpretability of attribution methods, particularly in scenarios where traditional approaches struggle with issues like hallucinations and biases.

The field is also witnessing the creation of open-source libraries and toolkits that streamline the development and evaluation of TDA methods. These resources provide standardized APIs, modular utility functions, and comprehensive benchmark frameworks, making it easier for researchers and practitioners to compare and deploy different TDA techniques.
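
As a purely illustrative sketch of the kind of standardized interface such toolkits converge on, the snippet below defines a minimal attributor protocol and one naive gradient-similarity implementation. All names here (Attributor, GradDotAttributor, cache, attribute) are hypothetical and are not the actual APIs of $\texttt{dattri}$ or Quanda.

```python
import torch
from typing import Protocol


class Attributor(Protocol):
    """Hypothetical unified TDA interface; illustrative only."""

    def cache(self, train_data: list[tuple[torch.Tensor, torch.Tensor]]) -> None:
        """Precompute per-example statistics over the training data."""

    def attribute(self, test_example: tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
        """Return one influence score per cached training example."""


class GradDotAttributor:
    """Naive baseline: score = <grad of train loss, grad of test loss>."""

    def __init__(self, model: torch.nn.Module, loss_fn) -> None:
        self.model, self.loss_fn = model, loss_fn
        self.train_grads: list[torch.Tensor] = []

    def _flat_grad(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        loss = self.loss_fn(self.model(x), y)
        params = [p for p in self.model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    def cache(self, train_data):
        self.train_grads = [self._flat_grad(x, y).detach() for x, y in train_data]

    def attribute(self, test_example):
        g_test = self._flat_grad(*test_example).detach()
        return torch.stack([g @ g_test for g in self.train_grads])
```

A benchmark harness can then swap different attributors behind one such interface and compare the returned score matrices against retraining-based ground truth, which is the kind of systematic evaluation these toolkits aim to make routine.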

Noteworthy Innovations

  • Debias and Denoise Attribution (DDA): This method significantly enhances influence functions by addressing fitting errors, achieving an average AUC of 91.64% and demonstrating strong generality across models of different scales.

  • Neurosymbolic AI for Attribution: The integration of neurosymbolic AI offers a promising solution to the challenges of factual accuracy and reliability in LLM attribution, providing more interpretable and adaptable systems.

  • $\texttt{dattri}$ Library: This open-source library provides a unified API and modular utility functions, facilitating the development and benchmarking of TDA methods, particularly for large-scale neural network models.

  • HyperINF: Leveraging Schulz's method, this approach offers efficient and accurate influence-function approximation, with superior performance on LoRA-tuned models (see the sketch after this list).

  • Quanda Toolkit: This interpretability toolkit facilitates systematic evaluation of TDA methods, providing a comprehensive set of metrics and a uniform interface for seamless integration with existing implementations.
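
To make the reference to Schulz's method in the HyperINF bullet concrete, the sketch below shows the basic order-2 Schulz (hyperpower) iteration for approximating a matrix inverse, applied to a damped Hessian-style matrix of the kind used to turn gradients into influence scores. The initialization, damping, and iteration count are illustrative assumptions, not the settings used by HyperINF.

```python
import numpy as np


def schulz_inverse(a: np.ndarray, num_iters: int = 20) -> np.ndarray:
    """Approximate A^{-1} with the Schulz (hyperpower, order-2) iteration
    X_{k+1} = X_k (2I - A X_k). Converges quadratically once the spectral
    radius of (I - A X_0) is below 1. Illustrative sketch, not HyperINF itself."""
    n = a.shape[0]
    # A standard initialization that guarantees convergence:
    # X_0 = A^T / (||A||_1 * ||A||_inf)
    x = a.T / (np.linalg.norm(a, 1) * np.linalg.norm(a, np.inf))
    eye2 = 2.0 * np.eye(n)
    for _ in range(num_iters):
        x = x @ (eye2 - a @ x)
    return x


# Usage: approximate the inverse of a damped Gauss-Newton/Fisher-style matrix,
# as needed when converting gradients into influence scores.
rng = np.random.default_rng(0)
g = rng.standard_normal((64, 16))          # e.g. stacked per-example gradients
hessian = g.T @ g / 64 + 0.1 * np.eye(16)  # curvature estimate + damping
h_inv = schulz_inverse(hessian)
print(np.max(np.abs(h_inv @ hessian - np.eye(16))))  # should be close to 0
```

The appeal of this family of iterations is that they need only matrix multiplications, which map well onto accelerators, rather than explicit factorizations of the curvature matrix.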

These innovations collectively push the boundaries of training data attribution, making significant strides towards more transparent, reliable, and scalable LLMs.

Sources

Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration

Neurosymbolic AI approach to Attribution in Large Language Models

$\texttt{dattri}$: A Library for Efficient Data Attribution

HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation

Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond
