Adversarial Image Detection and Defense
Fake media fooled the public long before the age of computers; notorious examples include the "Loch Ness Monster" and "Bigfoot" sightings. More recently, widespread adoption of the personal computer, media-editing software, and the internet has made fake media easy to produce and distribute. As media-processing software continues to advance, fake images, audio, and video will become more and more believable; the well-known edited video of Obama giving a fake speech is one example.
Recent technological innovations have led to a new form of fake media: the “adversarial example”. The main difference between an adversarial example and previous forgery methods is that the adversarial example is designed to fool a computer instead of a human. This has far-reaching effects on machine learning-based systems ranging from content filters to self-driving cars. Adversarial examples can even be carried into the physical world and still fool models that observe them there. With the increasingly global reach of the media and real-world deployment of machine learning systems, there will be growing incentives to create fake content. We need a way to protect our society from the negative effects of malicious adversarial examples.
Adversarial examples can be generated in several ways, but most rely on a common principle: maximize the model's cost function with respect to the input while keeping the change to the input as small as possible. For an image, this means changing the pixel values just enough to make the model misclassify it, without visibly changing how the image looks to a human. Because generating an adversarial example requires access to a model’s cost function, the attacker needs a model to work with. However, almost any model will do, because adversarial examples tend to transfer: an example generated against one model often fools other models as well. Feeding an adversarial example to a model other than the one used to generate it is called a “black-box attack”.
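The principle above can be illustrated with the fast gradient sign method (FGSM), one of the attacks used later in this post. The following is a minimal sketch under stated assumptions, not the code used for the experiments: a toy logistic-regression "model" with random weights (`w`, `b`) stands in for a real image classifier, and the names `fgsm`, `input_gradient`, and `eps` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = sigmoid(w @ x + b)      # predicted probability of class 1
    return (p - y) * w          # d(loss)/dx for logistic regression

def fgsm(x, y, w, b, eps):
    """Shift every pixel by +/- eps in the direction that increases the loss."""
    x_adv = x + eps * np.sign(input_gradient(x, y, w, b))
    return np.clip(x_adv, 0.0, 1.0)   # keep pixel values in the valid range

# Assumption: random weights over a 64-pixel "image" stand in for a classifier.
rng = np.random.default_rng(0)
w, b = rng.normal(size=64), 0.0
x = rng.uniform(0.2, 0.8, size=64)
y = float(sigmoid(w @ x + b) >= 0.5)  # use the model's own prediction as the label

x_adv = fgsm(x, y, w, b, eps=0.1)
print("max pixel change:", np.max(np.abs(x_adv - x)))
print("P(class 1) before:", sigmoid(w @ x + b), "after:", sigmoid(w @ x_adv + b))
```

No pixel moves by more than `eps`, yet every pixel moves in the direction that raises the loss, which is why a small, hard-to-see change can shift the prediction.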
There are two main strategies to combat adversarial examples:
- Adversarial detection: classify an input image as benign or adversarial.
- Adversarial defense: classify an input image as its true class.
In the example of the adversarial image of the ostrich from Figure 2, an adversarial detection algorithm should classify the image as adversarial, whereas an adversarial defense algorithm should classify the modified image as an ostrich.
Because of the controversies surrounding fake media during the 2016 United States presidential campaign, we decided to implement adversarial detection and adversarial defense in a political context. We built a custom dataset by combining the "PIM" and "HARRISON" datasets, which consist of political and nonpolitical images. Two models were trained on 22,500 distinct images and tested on 5,000 images, reaching accuracy scores of around 85-90%. Next, we generated adversarial examples from some of the images using the "Cleverhans" library. We focused on two main types of adversarial attacks: "FGSM" and "Momentum iterative FGSM".
Figure 4 shows how a model can be fooled into choosing the wrong class for an image without any change that a human would notice. We used the political image dataset and the adversarial examples to test different detection and defense techniques.
We focused our testing on black-box attacks. These attacks are arguably more common in the real world, since an attacker rarely has access to the underlying architecture of the model they are trying to fool. The accuracy of classifying the images as political or nonpolitical is shown below:
- Benign images: 88.5%
- FGSM images: 54.5%
- Momentum iterative FGSM images: 36.0%
The momentum iterative FGSM images are stronger adversarial examples, as they fool the model into misclassifying the image at a higher rate than FGSM images.
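To show how the momentum iterative variant differs from single-step FGSM, here is a rough sketch of its update rule: the attack takes several small steps and accumulates a running sum of L1-normalized gradients before taking the sign. It is an illustration only, assuming a tiny one-hidden-layer network with random weights as the model. The names `mi_fgsm`, `steps`, and `mu` are hypothetical and do not reflect the "Cleverhans" API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, y, params):
    """Gradient of binary cross-entropy w.r.t. x for a one-hidden-layer tanh network."""
    W1, b1, w2, b2 = params
    h = np.tanh(W1 @ x + b1)
    p = sigmoid(w2 @ h + b2)
    d_pre = (p - y) * w2 * (1.0 - h ** 2)   # backpropagate through the tanh layer
    return W1.T @ d_pre

def mi_fgsm(x, y, params, eps, steps=10, mu=1.0):
    """Iterative FGSM that accumulates a decayed sum of normalized gradients."""
    alpha = eps / steps                     # per-step size, so the total stays within eps
    g = np.zeros_like(x)
    x_adv = x.copy()
    for _ in range(steps):
        grad = input_gradient(x_adv, y, params)
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # momentum on L1-normalized gradient
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)            # stay inside the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)                    # stay a valid image
    return x_adv

# Assumption: small random weights stand in for a trained classifier.
rng = np.random.default_rng(0)
params = (rng.normal(scale=0.1, size=(16, 64)), np.zeros(16), rng.normal(size=16), 0.0)
x = rng.uniform(0.2, 0.8, size=64)
x_adv = mi_fgsm(x, y=1.0, params=params, eps=0.1)
print("max pixel change:", np.max(np.abs(x_adv - x)))
```

The total perturbation budget `eps` is the same as for FGSM; only the way the direction is chosen changes.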
The detection algorithm we implemented combines techniques from two recent research papers:
- "Detecting Adversarial Image Examples in Deep Networks with Adaptive Noise Reduction"
- "Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks"
Both papers build on the idea that image processing techniques can remove some of the adversarial perturbation from an input image while having little effect on benign images. As a result, the model's output for a processed adversarial image differs from its output for the unprocessed version much more than is the case for a benign image. Specifically, the detector compares the softmax-layer probability outputs for the original image and the processed image. If the difference exceeds a chosen threshold, the original image is classified as adversarial.
We modified and optimized the algorithm from Figure 5 to include four image processing techniques, instead of two. We also chose to use the mean of the difference scores, instead of the maximum. The four image processing techniques we used are:
- Median smoothing filter: A sliding window moves over each pixel of the image and replaces the center pixel with the median value of the neighboring pixels in a 2x2 area.
- Bit-depth reduction: The color depth of each pixel in the image is reduced from the original 8 bits down to 6 bits.
- Image cropping: Removes 15% of the peripheral area of an image.
- Scalar quantization: A lossy compression technique that maps a range of pixel intensities to a single representative value.
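The four techniques can be sketched in a few lines of NumPy. This is an approximation under assumptions, not the code from the original scripts: images are float arrays in [0, 1] with shape (height, width, channels), the crop is interpreted as trimming an even border, and the quantizer uses equal-width intervals. The function names and the `n_intervals` parameter are hypothetical.

```python
import numpy as np

def median_smooth(img, k=2):
    """Replace each pixel with the median of a k x k window (edges are padded)."""
    before, after = k // 2, k - 1 - k // 2
    padded = np.pad(img, ((before, after), (before, after), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    return np.median(np.stack(windows), axis=0)

def reduce_bit_depth(img, bits=6):
    """Round each channel value to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def crop_border(img, fraction=0.15):
    """Trim a border so that `fraction` of each side length is removed in total."""
    h, w = img.shape[:2]
    dy, dx = int(round(h * fraction / 2)), int(round(w * fraction / 2))
    return img[dy:h - dy, dx:w - dx]

def quantize(img, n_intervals=8):
    """Map each intensity to the midpoint of the equal-width interval it falls in."""
    idx = np.minimum(np.floor(img * n_intervals), n_intervals - 1)
    return (idx + 0.5) / n_intervals

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(40, 40, 3))   # random stand-in for a real photo
processed = {
    "median": median_smooth(image),
    "bit_depth": reduce_bit_depth(image),
    "crop": crop_border(image),
    "quantize": quantize(image),
}
for name, out in processed.items():
    print(name, out.shape)
```

Cropping changes the image size, so a real pipeline would resize the cropped image back to the model's input size before classifying it.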
The algorithm takes an input image and creates four new processed versions of it. The difference between the softmax outputs for each processed image and the original image is stored. The mean of these differences is then calculated, and if it exceeds a certain threshold, the image is classified as adversarial. The results are shown below in Figure 8.
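The scoring step can be sketched as follows. Several parts are assumptions made for illustration: the difference between softmax outputs is measured with the L1 distance, the "model" is a random linear softmax classifier, only two shape-preserving transforms are used, and the threshold of 0.5 is arbitrary. In practice the threshold would be tuned on held-out benign and adversarial images.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def detection_score(image, predict_proba, transforms):
    """Mean L1 distance between softmax outputs on the original and each processed image."""
    p_orig = predict_proba(image)
    diffs = [np.sum(np.abs(predict_proba(t(image)) - p_orig)) for t in transforms]
    return float(np.mean(diffs))

def is_adversarial(image, predict_proba, transforms, threshold):
    """Flag the image when the mean difference exceeds the threshold."""
    return detection_score(image, predict_proba, transforms) > threshold

# Hypothetical stand-ins: a random two-class linear model and two simple transforms.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8 * 8 * 3))
predict_proba = lambda img: softmax(W @ img.ravel())
transforms = [
    lambda img: np.round(img * 63) / 63,                          # 6-bit depth reduction
    lambda img: (np.minimum(np.floor(img * 8), 7) + 0.5) / 8,     # coarse quantization
]

image = rng.uniform(0.0, 1.0, size=(8, 8, 3))
score = detection_score(image, predict_proba, transforms)
print("score:", score)
print("flagged:", is_adversarial(image, predict_proba, transforms, threshold=0.5))
```

With the L1 distance, each individual difference lies between 0 (identical predictions) and 2 (completely opposite predictions), so the mean score lies in the same range.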
We deployed this algorithm as a Heroku application. To use it, upload any image to the front-end system built with Flask. The image will be processed in the back-end and the mean of differences between the processed images and original image will be calculated. If this score exceeds a certain threshold, the algorithm will return that the image is adversarial.
We next sought to create an adversarial defense system. The main paper we used for this was "Deflecting Adversarial Attacks with Pixel Deflection". The central idea of this technique is to apply pixel deflection and denoising to an image before it is classified. This should essentially reverse any adversarial perturbations that were added and allow the image to be classified as its true class, without affecting benign images.
Before the model predicts the class for an image, pixel deflection and denoising are applied:
- Pixel deflection: Redistribute 2,000 of the pixels in the image.
- Denoiser: Total variation minimization.
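A simplified sketch of the pixel deflection step is shown below: each deflection picks a random pixel and overwrites it with a pixel chosen at random from a small window around it. The window size of 10, the clamping of neighbors to the image edge, and the function signature are assumptions made for illustration. The denoising step is not shown.

```python
import numpy as np

def pixel_deflection(img, n_deflections=2000, window=10, seed=None):
    """Replace randomly chosen pixels with a random pixel from their local neighborhood."""
    rng = np.random.default_rng(seed)
    out = img.copy()                      # leave the input image untouched
    h, w = out.shape[:2]
    for _ in range(n_deflections):
        y, x = rng.integers(0, h), rng.integers(0, w)         # pixel to overwrite
        dy, dx = rng.integers(-window, window + 1, size=2)    # random offset within the window
        ny = np.clip(y + dy, 0, h - 1)                        # clamp the neighbor to the image
        nx = np.clip(x + dx, 0, w - 1)
        out[y, x] = out[ny, nx]                               # copy the neighbor's value
    return out

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(64, 64, 3))   # random stand-in for a real photo
deflected = pixel_deflection(image, n_deflections=200, window=10, seed=1)
print("pixels changed:", int(np.any(deflected != image, axis=-1).sum()))
```

Because every replacement value is copied from a nearby pixel, the image keeps its overall appearance while the exact pixel pattern that an adversarial perturbation relies on is disturbed.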
We found that pixel deflection did not disrupt the classification of benign images, and it allowed us to classify adversarial images correctly far more often than without it. The results for classifying FGSM adversarial examples as political or nonpolitical with and without pixel deflection are shown in Figure 10. The results for the momentum iterative FGSM adversarial examples are shown in Figure 11.
Adversarial example generation, detection, and defense are active areas of research. The field is changing rapidly, and new advances appear all the time. Intriguing recent papers include:
- "Detecting Adversarial Examples via Neural Fingerprinting"
- "Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality"
- "Detecting Adversarial Samples from Artifacts"
- "Countering Adversarial Images Using Input Transformations"
- "Towards Evaluating the Robustness of Neural Networks"
Adversarial Detection Code
Most of the code for adversarial detection was adapted from these scripts:
Adversarial Defense Code
Most of the code for adversarial defense was adapted from these scripts: