Bridging security and interpretability in AI: a SHAP-centric framework for adversarial attack detection

Master Thesis
Author
Petihakis, Georgios (Πετυχάκης, Γεώργιος)
Date
2025-05
Advisor
Xenakis, Christos (Ξενάκης, Χρήστος)
Keywords
Adversarial AI; Explainable AI; AI; SHAP; Adversarial attack detection
Abstract
Adversarial machine learning has emerged as a critical area of research due to the growing vulnerability of machine learning models to adversarial attacks: even small perturbations to input data can cause models to produce incorrect or potentially harmful predictions. This issue is particularly alarming in sensitive domains such as healthcare, where decisions informed by AI systems can significantly impact patient outcomes. This thesis addresses the detection of adversarial examples in the context of breast cancer diagnosis, combining conventional supervised learning methods with explainability-driven anomaly detection techniques.
The study begins with the development of a baseline model trained on the Breast Cancer Wisconsin (Diagnostic) Dataset (BCWD), achieving an accuracy of 94.74% on clean test samples. Upon exposure to adversarial perturbations generated using the Fast Gradient Sign Method (FGSM), the model’s performance deteriorated sharply, with accuracy dropping to 50.54%, revealing a significant weakness in adversarial robustness.
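As an illustration of this attack setting, the sketch below applies FGSM to a small classifier trained on the same BCWD features. The thesis’s actual architecture, training configuration, and perturbation budget are not stated in this abstract, so the PyTorch network and the value eps=0.2 are assumptions made only for the example.

```python
# Minimal FGSM sketch on the Breast Cancer Wisconsin (Diagnostic) data.
# The model, training setup, and epsilon below are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Small stand-in classifier (the thesis's actual model is not given here).
model = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X_tr_t = torch.tensor(X_tr, dtype=torch.float32)
y_tr_t = torch.tensor(y_tr, dtype=torch.long)
for _ in range(300):  # short full-batch training loop
    opt.zero_grad()
    loss_fn(model(X_tr_t), y_tr_t).backward()
    opt.step()

def fgsm(x, y, eps):
    """Perturb x by eps in the direction of the sign of the loss gradient."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

X_te_t = torch.tensor(X_te, dtype=torch.float32)
y_te_t = torch.tensor(y_te, dtype=torch.long)
X_adv = fgsm(X_te_t, y_te_t, eps=0.2)  # eps is an illustrative choice

clean_acc = (model(X_te_t).argmax(1) == y_te_t).float().mean().item()
adv_acc = (model(X_adv).argmax(1) == y_te_t).float().mean().item()
print(f"clean accuracy: {clean_acc:.2%}   FGSM accuracy: {adv_acc:.2%}")
```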
To mitigate this vulnerability, we investigate two main adversarial detection strategies: (1) supervised classifiers trained on raw features, SHAP (SHapley Additive exPlanations) values, and a combination of the two; and (2) unsupervised anomaly detection based on SHAP distance metrics. Among the supervised models, Random Forest and neural network classifiers achieved the highest performance, particularly when trained on the fusion of raw and SHAP features, attaining over 95% accuracy and high sensitivity. In parallel, SHAP-based anomaly detection using cosine and correlation distance metrics proved effective, offering a scalable and interpretable detection solution.
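The sketch below illustrates the second strategy under stated assumptions: SHAP vectors computed for clean samples define a benign reference profile, and a new sample is flagged when the cosine distance between its SHAP vector and that profile exceeds a threshold. The tree model, the mean-profile reference, the 95th-percentile threshold rule, and the random perturbation standing in for an FGSM example are all illustrative choices not specified in this abstract.

```python
# SHAP cosine-distance anomaly detection sketch (assumed detector design).
import numpy as np
import shap
from scipy.spatial.distance import cosine
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in tree model; TreeExplainer yields one SHAP vector per sample.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
explainer = shap.TreeExplainer(clf)

# Benign reference profile: mean SHAP vector over clean training samples.
ref_mean = explainer.shap_values(X_tr).mean(axis=0)

# Score held-out clean samples by cosine distance to the benign profile.
te_shap = explainer.shap_values(X_te)
benign_scores = np.array([cosine(sv, ref_mean) for sv in te_shap])

# Flag anything above a high benign quantile (threshold rule is assumed).
threshold = np.quantile(benign_scores, 0.95)

# Crude random perturbation standing in for an FGSM adversarial example.
rng = np.random.RandomState(0)
x_adv = X_te[0] + rng.normal(0, X_tr.std(axis=0), size=X_te[0].shape)
adv_score = cosine(explainer.shap_values(x_adv.reshape(1, -1))[0], ref_mean)
print(f"score={adv_score:.3f}  threshold={threshold:.3f}  "
      f"flagged={adv_score > threshold}")
```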
A detailed SHAP interpretability analysis was also conducted to understand how adversarial perturbations manipulate feature importance. By comparing SHAP value distributions between benign and adversarial samples, we identified systematic shifts in attribution: adversarial inputs cause the model to emphasize misleading features while downplaying diagnostically relevant ones. This analysis not only aids detection but also deepens our understanding of adversarial behavior from a model-reasoning perspective.
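A minimal version of that comparison is sketched below: mean absolute SHAP attributions per feature are computed for benign and perturbed inputs, and the largest shifts are listed. The perturbation here is random noise standing in for the FGSM examples, and the model is an assumed tree classifier, so the output only illustrates the form of the analysis, not the thesis’s results.

```python
# Attribution-shift sketch: mean |SHAP| per feature, benign vs. perturbed.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
explainer = shap.TreeExplainer(clf)

# Placeholder perturbation standing in for FGSM adversarial examples.
rng = np.random.RandomState(0)
X_pert = X_te + rng.normal(0, X_te.std(axis=0) * 0.2, size=X_te.shape)

# Mean absolute SHAP attribution per feature for each group.
benign_attr = np.abs(explainer.shap_values(X_te)).mean(axis=0)
pert_attr = np.abs(explainer.shap_values(X_pert)).mean(axis=0)
shift = pert_attr - benign_attr

# Features whose importance the perturbation inflates or suppresses most.
for i in np.argsort(-np.abs(shift))[:5]:
    print(f"{data.feature_names[i]:>25s}  benign={benign_attr[i]:.3f}  "
          f"perturbed={pert_attr[i]:.3f}  shift={shift[i]:+.3f}")
```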
In conclusion, this thesis proposes a hybrid detection framework that combines the strengths of supervised learning and explainable anomaly detection. The findings underscore the value of integrating XAI tools such as SHAP to enhance robustness and foster trust in AI-driven decision-making. The proposed methodologies offer a promising direction for deploying secure, interpretable, and resilient machine learning models in critical real-world applications, particularly in healthcare.