
Using Segmentation to Detect Adversarial Examples

Supervisor(s): Ching-Yu Kao
Status: finished
Topic: Others
Author: Dang Khoa Nguyen
Submission: 2022-02-15
Type of Thesis: Bachelor's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Image classification is finding more and more use cases in everyday life, and deep neural network models have proven especially effective at it. However, it has been discovered that classification models based on deep learning are vulnerable to maliciously altered images. These so-called adversarial examples (AEs) cause misclassifications in the target model by applying imperceptibly small perturbations to an input. Such misclassifications are particularly concerning for safety- and security-critical applications, such as autonomous driving and facial recognition. Hence, it is important to find countermeasures that minimize the damage of such attacks.

This thesis introduces a novel approach to detecting AEs by means of image segmentation. In particular, we compute adversarial segmentation maps from given inputs that highlight suspected adversarial perturbations. The resulting map can then be used to decide whether the input was altered by an adversarial attack. The purpose of this thesis is not to surpass existing solutions, but rather to test the applicability of this novel approach.

For this feasibility study, we employ a U-Net model and evaluate its performance on the MNIST and CIFAR-10 datasets. The AEs are generated using the adversarial attacks FGSM, BIM, PGD, JSMA, and C&W. We perform various experiments to give a broad overview of our model's abilities. Firstly, we test its detection performance in single-attack cases, where the model is trained and evaluated on a single attack type. Secondly, to represent real-world attack scenarios, we test our model on datasets containing all selected attack methods, so that the model has no prior knowledge of the attack type. Our results show that the U-Net model was able to successfully segment adversarial noise from images: all single-attack and multi-attack models achieved a mean detection accuracy of at least 90%. The next step is to create a more refined segmentation model to maximize the capabilities of our novel approach and to extend the test cases with further attack methods and more challenging images.
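The detection idea can be illustrated with a short PyTorch sketch (not the thesis code; the segmentation model seg_model, assumed here to be a U-Net-style network with a one-channel output, and all threshold values are illustrative assumptions): an FGSM adversarial example is generated, the perturbed pixels yield a ground-truth segmentation mask for training, and at test time an input is flagged as adversarial if the predicted map covers a sufficient fraction of the image.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    # Fast Gradient Sign Method: step in the sign of the input gradient of the loss.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

def perturbation_mask(x_clean, x_adv, tau=1e-3):
    # Binary segmentation target: pixels whose value changed noticeably.
    return ((x_adv - x_clean).abs().amax(dim=1, keepdim=True) > tau).float()

def is_adversarial(seg_model, x, pixel_thresh=0.5, area_thresh=0.01):
    # Flag an input as adversarial if the predicted adversarial map
    # covers a sufficient fraction of the image.
    with torch.no_grad():
        seg_map = torch.sigmoid(seg_model(x))  # (B, 1, H, W) scores in [0, 1]
    flagged_area = (seg_map > pixel_thresh).float().mean(dim=(1, 2, 3))
    return flagged_area > area_thresh

The same construction carries over to BIM, PGD, JSMA, and C&W by swapping out the attack used to generate the training masks.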