Advancing Model Robustness and Explainability in Deep Learning

Recent research on deep learning robustness and explainability has seen significant advances, particularly in evaluating and strengthening model stability and explanations under adversarial conditions. A notable trend is the development of new metrics and frameworks for understanding and quantifying the robustness of deep learning models, especially in high-stakes applications. These efforts include the introduction of complementary metrics to traditional robust accuracy, the meta-evaluation of stability measures, and the unification of attribution-based explanation methods through functional decomposition. There is also growing attention to the impact of adversarial attacks on model explainability, with studies revealing the limitations of current explanation metrics in detecting adversarial perturbations. The field is likewise addressing the shortcomings of popular explainability tools such as SHAP scores, showing that they can fail even for models that satisfy Lipschitz continuity. Finally, there is progress in constructing more sensible baselines for integrated gradients, improving the interpretability of machine learning models in scientific applications. Together, these developments push the boundaries of model robustness and explainability, supporting safer and more trustworthy deployment of deep learning technologies.
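To make the integrated-gradients discussion concrete, here is a minimal sketch of the standard integrated-gradients computation with a configurable baseline. The `grad_fn`, the toy quadratic model, and the two candidate baselines are illustrative assumptions for this sketch, not the specific baseline construction proposed in the cited paper; the point is only to show how the baseline choice enters the attribution.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients of a scalar model output.

    grad_fn(z) must return the gradient of the model output with
    respect to an input of the same shape as x. The attribution is
        (x - baseline) * integral_0^1 grad_fn(baseline + a*(x - baseline)) da,
    approximated here with a simple Riemann sum over `steps` points.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    total_grad = np.zeros_like(x)
    for a in alphas:
        point = baseline + a * (x - baseline)  # point on the straight-line path
        total_grad += grad_fn(point)
    avg_grad = total_grad / steps
    return (x - baseline) * avg_grad

# Toy example: f(x) = sum(x**2), whose gradient is 2x.
grad_fn = lambda z: 2.0 * z
x = np.array([1.0, -2.0, 0.5])

# Two hypothetical baselines: the conventional all-zeros input and a
# domain-informed reference point, to show how the choice shifts attributions.
for baseline in (np.zeros_like(x), np.full_like(x, 0.5)):
    print(baseline, integrated_gradients(grad_fn, x, baseline))
```

With either baseline the attributions approximately sum to f(x) minus f(baseline) (the completeness property), but the per-feature scores differ, which is why a sensible, domain-appropriate baseline matters for interpretation.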

Noteworthy papers include one that introduces the robust ratio as a complementary metric to robust accuracy, highlighting its potential for quantifying model robustness under varying perturbation levels. Another notable contribution is the meta-evaluation of stability measures, which shows that existing metrics can be unreliable at identifying erroneous explanations. Finally, the study on the impact of adversarial attacks on model explainability underscores the need for more sensitive explanation metrics.
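For readers unfamiliar with the stability measures being meta-evaluated, the following is a minimal Monte Carlo sketch of MAX-Sensitivity: the largest change in an explanation when the input is perturbed within a small ball. The sampling scheme, radius, norm, and the toy gradient-based `explain_fn` below are assumptions for illustration and may differ from the estimator examined in the cited paper (AVG-Sensitivity would average the changes instead of taking the maximum).

```python
import numpy as np

def max_sensitivity(explain_fn, x, radius=0.1, n_samples=20, seed=None):
    """Monte Carlo estimate of MAX-Sensitivity.

    Approximates max_{||x' - x||_inf <= radius} ||explain_fn(x') - explain_fn(x)||
    by sampling random perturbations inside the ball. A small value means
    the explanation is stable around x; a large value means it is brittle.
    """
    rng = np.random.default_rng(seed)
    base_expl = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        noise = rng.uniform(-radius, radius, size=x.shape)
        perturbed_expl = explain_fn(x + noise)
        worst = max(worst, np.linalg.norm(perturbed_expl - base_expl))
    return worst

# Toy example: the "explanation" is the input gradient of f(x) = sum(x**2), i.e. 2x.
explain_fn = lambda z: 2.0 * z
x = np.array([1.0, -2.0, 0.5])
print(max_sensitivity(explain_fn, x, radius=0.1, n_samples=50, seed=0))
```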

Sources

Is it the model or the metric -- On robustness measures of deep learning models

Meta-evaluating stability measures: MAX-Sensitivity & AVG-Sensitivity

Impact of Adversarial Attacks on Deep Learning Model Explainability

Unifying Attribution-Based Explanations Using Functional Decomposition

SHAP scores fail pervasively even when Lipschitz succeeds

Constructing sensible baselines for Integrated Gradients
