Interpretability and Explainability in AI Models

The field of artificial intelligence is moving toward more interpretable and explainable models. Recent research has focused on understanding how large language models and vision-language models process information and represent knowledge. A key direction is the analysis of token embeddings and internal representations, with studies revealing how these representations are structured and how they evolve across layers.
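
As a concrete illustration of this kind of layer-wise analysis, the sketch below extracts per-layer hidden states from a small Hugging Face Transformer and computes a crude effective-dimension statistic for each layer. The model name ("gpt2"), the example sentence, and the participation-ratio measure are illustrative assumptions, not the method of any particular paper cited here.

```python
# Minimal sketch: track how token representations change across layers,
# assuming a small Hugging Face Transformer ("gpt2" as a stand-in).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "Interpretability studies how models represent knowledge."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, d_model]
for layer_idx, hidden in enumerate(outputs.hidden_states):
    tokens = hidden[0]                       # [seq_len, d_model]
    tokens = tokens - tokens.mean(dim=0)     # center before measuring spread
    # Participation-ratio "effective dimension" from the singular value spectrum,
    # a crude proxy for the working space a layer uses. With one short sentence
    # it is bounded by the token count; real analyses pool many tokens.
    s = torch.linalg.svdvals(tokens)
    var = s ** 2
    eff_dim = var.sum() ** 2 / (var ** 2).sum()
    print(f"layer {layer_idx:2d}: effective dimension ~ {eff_dim.item():.1f}")
```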

Another important area of research is the development of methods for identifying and editing knowledge in multimodal models, which is crucial for improving their transparency and control. The use of techniques such as representation engineering and sparse autoencoders has shown promise in enhancing the interpretability and steerability of these models.
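
To make the sparse-autoencoder idea concrete, here is a minimal sketch that trains an overcomplete autoencoder with an L1 sparsity penalty on activation vectors. The dimensions, hyperparameters, and synthetic activations are illustrative assumptions rather than the setup used in the cited work.

```python
# Minimal sparse-autoencoder sketch for dissecting model activations,
# assuming activations arrive as a [N, d_model] tensor.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # overcomplete dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))      # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def train_step(sae, acts, optimizer, l1_coeff=1e-3):
    reconstruction, features = sae(acts)
    recon_loss = torch.mean((reconstruction - acts) ** 2)
    sparsity_loss = l1_coeff * features.abs().mean()  # pushes features toward sparsity
    loss = recon_loss + sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Synthetic activations standing in for a model's residual stream.
acts = torch.randn(4096, 768)
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    train_step(sae, batch, optimizer)
```

After training, each dictionary feature can be inspected via the inputs that activate it most strongly, which is how candidate monosemantic features are typically identified.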

Some studies have also explored the application of geometric and theoretical frameworks to understand the behavior of AI models, providing new tools for model diagnostics and analysis.

Noteworthy papers include the following.

Bridging the Dimensional Chasm develops a geometric framework to track token dynamics across Transformer layers, revealing an expansion-contraction pattern and implying a negative correlation between the dimension of a layer's working space and parameter-sensitive performance.

Why Representation Engineering Works extends representation engineering to vision-language models and develops a theoretical framework explaining the stability of neural activity across layers.
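
For readers unfamiliar with representation engineering, the sketch below builds a simple steering vector from contrasting prompts and adds it to one block's output during generation. The prompts, block index, and scale are assumptions for illustration, and a small text-only model stands in for the vision-language models studied in the cited paper.

```python
# Hedged sketch of representation engineering with a steering vector: the
# direction is the mean difference between hidden states elicited by two
# contrasting prompt sets, added back at one block during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()
block_idx, scale = 5, 4.0   # illustrative choices

def mean_hidden(prompts):
    # Residual-stream state after block `block_idx`, averaged over tokens and prompts.
    vecs = []
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"))
        vecs.append(out.hidden_states[block_idx + 1][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

with torch.no_grad():
    direction = (mean_hidden(["I am very happy.", "What a wonderful day."])
                 - mean_hidden(["I am very sad.", "What a terrible day."]))

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + scale * direction,) + output[1:]

handle = model.transformer.h[block_idx].register_forward_hook(steer)
with torch.no_grad():
    prompt = tok("Today I feel", return_tensors="pt")
    generated = model.generate(**prompt, max_new_tokens=20, do_sample=False)
print(tok.decode(generated[0], skip_special_tokens=True))
handle.remove()
```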

Sources

Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation

Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models

Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering

Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

From Colors to Classes: Emergence of Concepts in Vision Transformers

The Axiom-Based Atlas: A Structural Mapping of Theorems via Foundational Proof Vectors

Token embeddings violate the manifold hypothesis

Automated Feature Labeling with Token-Space Gradient Descent

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
