Advances in Interpretability and Security of Diffusion Models

The field of diffusion models is moving towards improved interpretability and security. Researchers are developing new methods for analyzing the internal workings of these models, including mechanistic interpretability techniques and novel visualization approaches. These advances have the potential to increase trust in diffusion models and to enable more effective steering of the generative process. In parallel, the community is paying growing attention to the security of text-to-image diffusion models, with a focus on detecting and mitigating backdoor poisoning attacks. Noteworthy papers in this area include:

  • The introduction of the Diffusion Steering Lens, a novel approach for interpreting vision transformers.
  • The proposal of REDEditing, a relationship-driven precise backdoor poisoning method for text-to-image diffusion models.
  • The development of Prompt-Agnostic Image-Free Auditing, a scalable and practical solution for pre-deployment concept auditing of diffusion models.
  • The application of Sparse Autoencoders to uncover human-interpretable concepts in diffusion models and to demonstrate their causal effect on the model output; a minimal sketch follows this list.
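
As a rough illustration of the sparse-autoencoder approach, the sketch below trains a small overcomplete autoencoder on diffusion-model activations and then amplifies a single latent unit to probe its causal effect. This is a minimal sketch only, not the cited paper's setup: the layer choice, dimensions, L1 coefficient, the random stand-in activations, and the chosen concept unit are placeholder assumptions.

```python
# Minimal sparse-autoencoder (SAE) sketch for diffusion-model activations.
# Assumptions (not from the cited paper): activation width d_model=512,
# dictionary size d_dict=4096, L1 coefficient 1e-3, random stand-in data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict >> d_model, so individual latent
        # units can specialize to single human-interpretable concepts.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = F.relu(self.encoder(acts))   # non-negative, sparse codes
        recon = self.decoder(codes)          # reconstruct the activations
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff=1e-3):
    # Reconstruction keeps the dictionary faithful to the model;
    # the L1 penalty drives most latent units to zero per example.
    return F.mse_loss(recon, acts) + l1_coeff * codes.abs().mean()

d_model, d_dict = 512, 4096
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(200):
    # Stand-in batch; in practice, capture activations from a U-Net or DiT
    # block with a forward hook and stream them here.
    acts = torch.randn(256, d_model)
    recon, codes = sae(acts)
    loss = sae_loss(recon, acts, codes)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Causal probe: amplify one latent unit and decode. Writing the edited
# activations back into the model (via the same hook) tests whether the
# concept actually steers the generated image.
with torch.no_grad():
    acts = torch.randn(1, d_model)
    _, codes = sae(acts)
    codes[:, 123] *= 5.0                     # boost a chosen concept unit
    steered_acts = sae.decoder(codes)
```

The interesting design choices in practice are which layer's activations are encoded, how the edited activations are written back during sampling, and the reconstruction-versus-sparsity trade-off, which controls how cleanly individual latent units map to single concepts.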

Sources

Decoding Vision Transformers: the Diffusion Steering Lens

REDEditing: Relationship-Driven Precise Backdoor Poisoning on Text-to-Image Diffusion Models

What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Emergence and Evolution of Interpretable Concepts in Diffusion Models
