Enhancing Transparency and Control in Language Models

Recent work in language modeling and interpretability shows a clear shift toward making models more transparent and controllable. Researchers are increasingly developing methods that improve performance while also making models more interpretable and robust. This trend is evident in the use of causal learning to mitigate spurious correlations in text classification, dictionary learning for transparent automated medical coding, and induction-head ngram models for efficient, interpretable language modeling. There is also growing interest in steering vectors for controlling model behavior, particularly by targeting sparse autoencoder features to improve coherence and effectiveness. The field is likewise reevaluating the metrics traditionally used for acceptability judgments, with new theories such as MORCELA proposed to better align model scores with human judgments. In parallel, methods such as BrainBits quantify how much of the available neural recordings generative reconstruction methods actually use when reconstructing stimuli. Intervention-based evaluation aims to unify interpretability and control, bridging the gap between understanding model behavior and steering it. Finally, techniques such as literal pruning in Tsetlin Machines pursue more efficient word-level explainability without compromising model performance. Together, these developments reflect a concerted effort to make language models not only more powerful but also more transparent, interpretable, and controllable.
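To make the steering-vector idea concrete, here is a minimal sketch of the generic activation-steering technique (not the specific method of the cited paper): a fixed direction is added to one transformer block's residual stream via a PyTorch forward hook. The layer index, steering strength, and the random placeholder vector are illustrative assumptions; in practice the direction would be derived, for example, from contrastive activations or a sparse autoencoder feature.

```python
# Minimal activation-steering sketch: add a fixed "steering vector" to the
# residual stream of one GPT-2 block during generation. The vector here is
# random and only illustrative; real steering vectors come from contrastive
# prompts or sparse autoencoder feature directions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

layer_idx = 6            # which block to intervene on (illustrative assumption)
alpha = 4.0              # steering strength (illustrative assumption)
d_model = model.config.n_embd
steering_vector = torch.randn(d_model)
steering_vector = steering_vector / steering_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tokenizer("The weather today is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()      # detach the hook so later calls run unmodified
```

The same hook-based pattern underlies intervention-style evaluation more broadly: a hypothesized feature direction is added (or ablated) at a chosen layer, and the change in model output is measured.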

Sources

Interpretable Language Modeling via Induction-head Ngram Models

Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning

Towards Robust Text Classification: Mitigating Spurious Correlations with Causal Learning

Improving Steering Vectors by Targeting Sparse Autoencoder Features

What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length

BrainBits: How Much of the Brain are Generative Reconstruction Methods Using?

Towards Unifying Interpretability and Control: Evaluation via Intervention

Pruning Literals for Highly Efficient Explainability at Word Level

DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models
