Advancing Interpretability and Robustness in LLMs

Recent research on Large Language Models (LLMs) and their applications shows clear progress in interpretability, robustness, and usability. A notable trend is the integration of LLMs into diverse domains to provide human-readable explanations and to support decision-making, exemplified by tools such as Explingo, which uses LLMs to turn machine learning explanations into narrative form, making them more accessible and usable. Another direction is strengthening LLM resistance to adversarial prompts, as in the recursive framework proposed to simplify inputs and detect malicious ones. There is also growing interest in interpreting model activations with Sparse Autoencoders, including innovations such as BatchTopK SAEs, which allocate latents adaptively across a batch to improve reconstruction. The field is likewise advancing visual analytics for specialized domains such as jurisprudence, where interactive visualization helps articulate tacit domain knowledge, and the idea of using LLMs as visual explainers is emerging as a new way to interpret vision models through linguistic explanations. Overall, these developments point toward more transparent, robust, and interactive applications of LLMs across diverse fields.
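
To make the BatchTopK idea mentioned above concrete, the sketch below shows one plausible way such an encoder step could look in PyTorch: rather than keeping the top-k latents for each sample independently, the top (k × batch size) activations are kept across the whole batch, so some samples can use more latents and others fewer. This is a minimal, hypothetical sketch under those assumptions, not the authors' implementation; all names and shapes are illustrative.

```python
# Minimal, hypothetical sketch of a BatchTopK sparse autoencoder step (PyTorch).
# Names, shapes, and initialization are illustrative, not the paper's code.
import torch
import torch.nn as nn


class BatchTopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k  # average number of active latents per sample
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model) activations from the model being interpreted
        pre_acts = (x - self.b_dec) @ self.W_enc + self.b_enc
        acts = torch.relu(pre_acts)

        # BatchTopK: keep the top (k * batch) activations across the whole
        # batch instead of the top k per sample, so the latent budget is
        # allocated adaptively across samples.
        n_keep = self.k * x.shape[0]
        threshold = torch.topk(acts.flatten(), n_keep).values.min()
        sparse_acts = torch.where(acts >= threshold, acts, torch.zeros_like(acts))

        # Reconstruct the original activation from the sparse latent code.
        return sparse_acts @ self.W_dec + self.b_dec
```

Note that the activation threshold above depends on the current batch; at inference time a fixed threshold estimated during training would presumably be used instead, a detail this sketch omits.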

Sources

Explingo: Explaining AI Predictions using Large Language Models

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Enhancing Adversarial Resistance in LLMs with Recursion

BatchTopK Sparse Autoencoders

Challenges and Opportunities for Visual Analytics in Jurisprudence

Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

Ask Humans or AI? Exploring Their Roles in Visualization Troubleshooting

Can LLMs faithfully generate their layperson-understandable 'self'?: A Case Study in High-Stakes Domains

Language Model as Visual Explainer

Concept Bottleneck Large Language Models

Evil twins are not that evil: Qualitative insights into machine-generated prompts

Distinguishing Scams and Fraud with Ensemble Learning

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Obfuscated Activations Bypass LLM Latent-Space Defenses
