Adversarial Image Detection and Defense

Faked media fooled the public long before the age of computers, as in the notorious "Loch Ness Monster" and "Bigfoot" photographs. In recent decades, the widespread adoption of the personal computer, media-editing software, and the internet has made fake media easy to produce and distribute. As media-processing software continues to advance, fake images, audio, and video will become more and more believable, as in the famous doctored video of Obama giving a speech he never delivered.

Recent technological innovations have led to a new form of fake media: the "adversarial example". The main difference between an adversarial example and previous forgery methods is that the adversarial example is designed to fool a computer rather than a human. This has far-reaching consequences for machine learning-based systems ranging from content filters to self-driving cars. Adversarial examples can even be constructed as physical objects that fool models deployed in the real world. With the increasingly global reach of the media and the real-world deployment of machine learning systems, the incentives to create fake content will only grow. We need a way to protect our society from the negative effects of malicious adversarial examples.

Figure 1. From "Explaining and Harnessing Adversarial Examples". A small change that is essentially imperceptible to a human fools an advanced image detection model to incorrectly classify an image of a panda as a gibbon.

Adversarial examples can be generated in several different ways, but most methods share a common principle: maximize the cost function of a model with respect to the input while minimizing how much the input changes. For an image, this means changing the pixel values just enough to cause the model to misclassify the image without changing its appearance to a human. Since generating an adversarial example requires access to a model’s cost function, the attacker needs a model of their own. However, almost any model will do, because adversarial examples tend to transfer well and fool other models. Feeding an adversarial example to a model other than the one used to generate it is known as a “black-box attack”.
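
As a concrete illustration of this principle, here is a minimal PyTorch sketch of the fast gradient sign method (FGSM). The model, labels, perturbation size, and the assumption that pixels are scaled to [0, 1] are placeholders rather than the exact setup used in this project.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: nudge every pixel in the direction that increases the
    model's loss, with the perturbation size bounded by eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # cost of the true labels y
    loss.backward()                        # gradient of the cost w.r.t. the input
    x_adv = x + eps * x.grad.sign()        # maximize cost, minimal per-pixel change
    return x_adv.clamp(0.0, 1.0).detach()  # assumes pixels are scaled to [0, 1]
```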

Figure 2. From "A CLEVER Way to Resist Adversarial Attack". Representation of how an image can be modulated just enough to cross a model’s decision boundary to be misclassified.

There are two main strategies to combat adversarial examples:

  • Adversarial detection: classify an input image as benign or adversarial.

  • Adversarial defense: classify an input image as its true class.

In the example of the adversarial image of the ostrich from Figure 2, an adversarial detection algorithm should flag the image as adversarial, whereas an adversarial defense algorithm should classify the modified image as an ostrich.

Figure 3. Examples of adversarial detection and adversarial defense methods. A) Detection algorithm from "Detecting Adversarial Image Examples in Deep Networks with Adaptive Noise Reduction". An input image returns a classification of “adversarial” or “benign”. B) Defense algorithm - an input image returns its true class regardless of whether the image has adversarial perturbations or not.

Because of the controversies surrounding fake media during the 2016 United States presidential campaign, we decided to implement adversarial detection and adversarial defense in a political context. We built a custom dataset by combining the "PIM" and "HARRISON" datasets, consisting of political and nonpolitical images. Two models were trained on 22,500 distinct images and tested on 5,000 images, resulting in accuracy scores of around 85-90%. Next, we generated adversarial examples from some of the images using the "Cleverhans" library. We focused on two main types of adversarial attack: "FGSM" and "Momentum iterative FGSM".
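
FGSM was sketched above; the momentum iterative variant repeats that signed-gradient step several times while accumulating a momentum term over the normalized gradients. We used the Cleverhans implementations of both attacks, so the PyTorch sketch below is purely illustrative (the step count, momentum factor, and [0, 1] pixel range are assumptions, not the project's exact settings).

```python
import torch
import torch.nn.functional as F

def momentum_iterative_fgsm(model, x, y, eps=0.03, steps=10, mu=1.0):
    """Momentum iterative FGSM: take several small signed-gradient steps,
    accumulating a momentum term so the attack direction stays stable."""
    x = x.clone().detach()
    alpha = eps / steps                    # per-step perturbation budget
    g = torch.zeros_like(x)                # accumulated (momentum) gradient
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / (grad.abs().mean() + 1e-12)   # normalized gradient + momentum
        x_adv = x_adv.detach() + alpha * g.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv.detach()
```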

Figure 4. A) Benign political image correctly classified as political with high probability. B) FGSM-attacked political image incorrectly classified as non-political with high probability. The two images are essentially indistinguishable to the human eye.

Figure 4 shows how a model can be fooled into choosing the wrong class for an image without any change perceptible to a human viewer. We used the political image dataset and these adversarial examples to test different detection and defense techniques for counteracting such attacks.

We focused our testing on black-box attacks. These attacks are arguably more common in the real world, since it is rare that an attacker will have access to the underlying architecture of the model they are trying to fool. The scores for correctly classifying the images as political or nonpolitical are below:

  • Benign images: 88.5%

  • FGSM images: 54.5%

  • Momentum iterative FGSM images: 36.0%

The momentum iterative FGSM images are stronger adversarial examples, as they fool the model into misclassifying the image at a higher rate than FGSM images.

Adversarial Detection

The detection algorithm we implemented combines techniques from two cutting-edge research papers.

Both papers are based on the idea that some of the adversarial perturbations in an input image can be removed through image processing techniques, while the same techniques have little effect on benign images. As a result, an adversarial image changes much more when processed than a benign image does. Specifically, we use the difference between the softmax probability outputs for the processed image and the original image; if this difference exceeds a certain threshold, we classify the original image as adversarial.

Figure 5. A) The formula for calculating the distance between an image and its processed version: x is the original image, x_squeezed is a processed version of the original image, and g(x) is the model's probability output. B) Different processing techniques are applied to a given image, and the differences between the processed and original images are stored as d1 and d2. If the maximum of these scores exceeds a certain threshold, the original image is classified as adversarial.
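
Our reading of the formula in Figure 5A is an L1 distance between the two softmax probability vectors; a minimal sketch of that computation is below (the function and variable names are ours).

```python
import numpy as np

def squeeze_distance(p_original, p_squeezed):
    """L1 distance between the model's softmax output g(x) for the original
    image and g(x_squeezed) for its processed ("squeezed") version."""
    return float(np.sum(np.abs(np.asarray(p_original) - np.asarray(p_squeezed))))
```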

We modified and optimized the algorithm from Figure 5 to include four image processing techniques instead of two, and we used the mean of the difference scores instead of the maximum. The four image processing techniques we used are listed below, with a rough code sketch after the list:

  • Median smoothing filter: A sliding window moves along each pixel of the image and replaces the center pixel with the median value of the neighboring pixels in a 2x2 area.

  • Bit-depth reduction: The color depth of each pixel is reduced from the original 8 bits down to 6 bits.

  • Image cropping: Removes 15% of the peripheral area of an image.

  • Scalar quantization: A lossy compression technique that maps a range of pixel intensities to a single representative value.
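
Rough sketches of these four transforms are shown below, assuming an (H, W, 3) uint8 image array; the window size, bin count, and exact crop geometry are illustrative and may differ from our actual implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_smoothing(img, size=2):
    """Replace each pixel with the median of its 2x2 neighborhood,
    applied independently to each color channel."""
    return median_filter(img, size=(size, size, 1))

def bit_depth_reduction(img, bits=6):
    """Reduce each 8-bit channel value down to the given bit depth."""
    levels = 2 ** bits - 1
    return (np.round(img.astype(np.float32) / 255.0 * levels) / levels * 255.0).astype(np.uint8)

def crop(img, area_fraction=0.15):
    """Remove roughly `area_fraction` of the image area from the border."""
    h, w = img.shape[:2]
    keep = np.sqrt(1.0 - area_fraction)          # keep ~85% of the area
    dh, dw = int(h * (1 - keep) / 2), int(w * (1 - keep) / 2)
    return img[dh:h - dh, dw:w - dw]

def scalar_quantization(img, n_bins=8):
    """Map each range of pixel intensities to a single representative value."""
    bin_size = 256.0 / n_bins
    return (np.floor(img / bin_size) * bin_size + bin_size / 2).astype(np.uint8)
```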

Figure 6. Original benign political image and its four image-processed versions.

The algorithm takes an input image and creates four new processed versions of it. The difference between the softmax outputs for each processed image and the original image is stored. The mean of these differences is then calculated, and if it exceeds a certain threshold, the image is classified as adversarial. The results are shown below in Figure 7.
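
Putting the pieces together, a simplified sketch of the detection logic is below. It assumes a Keras-style `model.predict` that returns softmax probabilities, a list `squeeze_fns` holding the four transforms, and that each transform returns an image the model accepts (in practice the cropped image would be resized back to the model's input size); the threshold is tuned on a validation set.

```python
import numpy as np

def is_adversarial(model, image, squeeze_fns, threshold):
    """Flag an image as adversarial if its softmax output shifts too much,
    on average, when the image is run through each processing technique."""
    p_original = model.predict(image[np.newaxis])[0]          # softmax probabilities
    distances = []
    for squeeze in squeeze_fns:
        p_squeezed = model.predict(squeeze(image)[np.newaxis])[0]
        distances.append(np.sum(np.abs(p_original - p_squeezed)))   # L1 distance
    return float(np.mean(distances)) > threshold
```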

Figure 7. Classifying whether an image is benign or adversarial. ROC curve and AUC score for detecting black-box A) FGSM adversarial examples and B) Momentum iterative FGSM adversarial examples.

We deployed this algorithm as a Heroku application. To use it, upload any image to the front-end, which is built with Flask. The image is processed in the back-end, where the mean of the differences between the processed images and the original image is calculated. If this score exceeds a certain threshold, the application reports that the image is adversarial.
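
A stripped-down sketch of what the Flask endpoint behind the application might look like is below; the route, the placeholder scoring function, and the threshold value are all illustrative rather than the deployed code.

```python
from flask import Flask, request
import numpy as np
from PIL import Image

app = Flask(__name__)
THRESHOLD = 0.5   # illustrative value; the real threshold is tuned on validation data

def mean_squeeze_distance(image):
    """Placeholder for the scoring logic sketched earlier: process the image
    four ways and return the mean L1 distance between softmax outputs."""
    raise NotImplementedError("plug in the trained model and the four transforms")

@app.route("/", methods=["POST"])
def detect():
    # Read the uploaded image and normalize it to [0, 1].
    img = Image.open(request.files["image"]).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    score = mean_squeeze_distance(x)
    verdict = "adversarial" if score > THRESHOLD else "benign"
    return {"verdict": verdict, "score": float(score)}
```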

Figure 8. Example use of the Heroku application for classifying A) a benign image and B) an FGSM adversarial image.

Adversarial Defense

We next sought to create an adversarial defense system. The main paper we used for this was "Deflecting Adversarial Attacks with Pixel Deflection". The central idea of this technique is to apply pixel deflection and denoising to an image before it is classified. This should essentially reverse any adversarial perturbations that were added and allow the image to be classified as its true class, without affecting benign images.

Before the model predicts the class for an image, pixel deflection and denoising are applied (a rough sketch follows the list):

  • Pixel deflection: Randomly select 2,000 pixels in the image and replace each one with a pixel drawn from its local neighborhood.
  • Denoiser: Total variance minimization.
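
A rough sketch of this preprocessing step is below; the neighborhood window size is an assumption, and total variation (Chambolle) denoising from scikit-image stands in for the total variance minimization step (the exact `denoise_tv_chambolle` arguments may differ between scikit-image versions).

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def pixel_deflection(img, n_deflections=2000, window=10, rng=None):
    """Randomly pick pixels and replace each with another pixel drawn from a
    small local neighborhood, which disrupts adversarial perturbations."""
    rng = rng if rng is not None else np.random.default_rng()
    img = img.copy()
    h, w = img.shape[:2]
    for _ in range(n_deflections):
        r, c = int(rng.integers(h)), int(rng.integers(w))
        r2 = int(np.clip(r + rng.integers(-window, window + 1), 0, h - 1))
        c2 = int(np.clip(c + rng.integers(-window, window + 1), 0, w - 1))
        img[r, c] = img[r2, c2]
    return img

def defend(img):
    """Pixel deflection followed by denoising, applied before classification."""
    deflected = pixel_deflection(img)
    # Total variation denoising as the "total variance minimization" step.
    return denoise_tv_chambolle(deflected.astype(np.float64) / 255.0, weight=0.1)
```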

Figure 9. Original benign political image with its pixel deflected and denoised counterparts.

We found that pixel deflection did not disrupt the classification of benign images, while allowing us to classify adversarial images correctly at a much higher rate than without it. The results for classifying FGSM adversarial examples as political or nonpolitical with and without pixel deflection are shown in Figure 10. The results for the momentum iterative FGSM adversarial examples are shown in Figure 11.

Figure 10. Classifying whether an image is political or non-political. ROC curve and AUC score for classifying black-box A) FGSM adversarial examples without pixel deflection and B) FGSM adversarial examples with pixel deflection.

Figure 11. Classifying whether an image is political or non-political. ROC curve and AUC score for classifying black-box A) momentum iterative FGSM adversarial examples without pixel deflection and B) momentum iterative FGSM adversarial examples with pixel deflection.

Adversarial example generation, detection, and defense is an active field of research. The area is changing rapidly, and exciting new advances are made all the time, with many intriguing papers published recently.

Adversarial Detection Code

Most of the code for adversarial detection was adapted from these scripts:

Adversarial Defense Code

Most of the code for adversarial defense was adapted from these scripts:

Andrew Eaton