This article was originally published by AI Frontier.
Fooling neural networks: Create your own adversarial examples


Author | Daniel Geng and Rishi Veerapaneni


Translator | Sun Hao


Editor | Emily

AI Frontline introduction: Assassination by neural network. Sounds crazy? Maybe one day it will happen in a way you never expected. Of course, neural networks could be used to pilot drones or operate other weapons of mass destruction, but even a harmless (for now) network trained to drive a car could be turned against its owner's intentions. This is because neural networks are susceptible to "adversarial examples".


An "adversarial example" is an input to a neural network that causes the network to produce an incorrect output. An example illustrates the situation better. We can start with the picture of a panda on the left, which one network classifies as "panda" with 57.7% confidence. The panda category is also the most confident of all the classes, so the network concludes that the object in the image is a panda. But by adding a small amount of carefully constructed noise, we can get an image that looks exactly the same to a human, yet the network classifies it as "gibbon" with 99.3% confidence. It's crazy!

From Goodfellow et al., "Explaining and Harnessing Adversarial Examples"

So, how would you carry out an assassination with an adversarial example? Imagine replacing a stop sign with an adversarial one: a sign that humans recognize immediately, but that a neural network might not even notice. Now imagine placing that adversarial stop sign at a busy intersection. As a self-driving car approaches, the onboard neural network would fail to see the stop sign and continue into oncoming traffic, bringing its passengers (in theory) to the brink of death.

This may be a convoluted and slightly sensational example, but it illustrates how people could use adversarial examples to harm others, and there are many more. The iPhone X's "Face ID" unlocking feature, for example, relies on a neural network to recognize faces and is therefore vulnerable to adversarial attacks: adversarial images could be constructed to bypass the face-recognition security feature. Other biometric security systems would also be at risk, and illegal or objectionable content could use adversarial examples to slip past neural-network-based content filters. The existence of these adversarial examples means that systems incorporating deep learning models actually carry significant security risks.


You can think of adversarial examples as optical illusions for neural networks. Just as optical illusions can trick the human brain, adversarial examples can trick neural networks.

The panda adversarial example above is a targeted example. A small amount of carefully constructed noise was added to the image, causing the neural network to misclassify it even though it looks unchanged to a human. There are also untargeted examples, which simply try to find any input that fools the neural network. Such an input may look like white noise to a human, but because we are not constrained to find an input that resembles anything to a human, the problem is much easier.

It's a little disconcerting that we can find adversarial examples for any neural network, even state-of-the-art models described as having "superhuman" abilities. In fact, it's easy to create adversarial examples, and in this article we'll show you how. All the code and dependencies you need can be found in this GitHub repo.

A hands-on demonstration: adversarial examples on MNIST

The code for the adversarial examples on MNIST in this section can be found in this GitHub repo (downloading the code is not required to understand the article): GitHub repo

We will work with a feedforward neural network trained on the MNIST dataset. MNIST is a dataset of 28 by 28 pixel images of handwritten digits. They look like this:

Six MNIST images side by side

Before we start, we should first import the libraries we need.

import network.network as network
import network.mnist_loader as mnist_loader
import pickle
import matplotlib.pyplot as plt
import numpy as np

MNIST has 50,000 training images and 10,000 test images. We start by loading a pre-trained neural network (which we shamelessly borrowed from this great article on neural networks):

with open('trained_network.pkl', 'rb') as f:  
    net = pickle.load(f)  

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

For those unfamiliar with pickle, it is Python's way of serializing data (that is, writing it to disk), essentially saving classes and objects. pickle.load() opens the saved version of the network.
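As a quick orientation (this is our reading of Nielsen's loader, which this repo appears to wrap, rather than something stated in the article), each entry of training_data is an (image, label) pair, and the shapes can be checked directly:

# Sanity-check the loaded data. The shapes below are what Nielsen's mnist_loader
# returns; treat them as an assumption about the repo rather than documentation.
x, y = training_data[0]
print(x.shape)          # (784, 1) -- a flattened 28x28 image
print(y.shape)          # (10, 1)  -- a one-hot label (training data only)
print(test_data[0][1])  # for test data the label is just an integer digit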

A word about this pre-trained neural network. It has 784 input neurons (one for each of the 28*28 = 784 pixels), one hidden layer with 30 neurons, and 10 output neurons (one for each digit). All of its activations are sigmoid; its output is a one-hot vector representing the network's prediction, and it was trained by minimizing the mean squared error loss.
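If you preferred to train the same architecture from scratch rather than load the pickled network, it would presumably look something like the sketch below (assuming network.Network is the Nielsen-style class with its SGD method; the training hyperparameters here are illustrative, not the ones used for the pickled net):

# Sketch only: build and train a 784-30-10 sigmoid network ourselves, assuming the
# Nielsen-style Network class and its SGD(training_data, epochs, mini_batch_size, eta)
# method. Hyperparameter values are illustrative.
fresh_net = network.Network([784, 30, 10])
fresh_net.SGD(training_data, 30, 10, 3.0, test_data=test_data)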

To verify that the neural network is indeed trained, we can write a simple test function:

def predict(n):
    # Get the data from the test set
    x = test_data[n][0]

    # Get output of network and prediction
    activations = net.feedforward(x)
    prediction = np.argmax(activations)

    # Print the prediction of the network
    print('Network output: ')
    print(activations)

    print('Network prediction: ')
    print(prediction)

    print('Actual image: ')

    # Draw the image
    plt.imshow(x.reshape((28, 28)), cmap='Greys')

This method picks a sample from the test set, displays it, and then runs it through the neural network using the net.feedforward(x) method. Here is the output for a few images:







On the left are MNIST images. On the right are the corresponding outputs of the neural network, called activations. The larger an output, the more strongly the network believes the image is that digit.

Now we have a well-trained network, but how do we fool it? We'll start with a simple, untargeted approach, and once we get that working, we'll use a cool trick to turn it into a targeted attack.

The idea of the untargeted attack is to generate an image that makes the neural network produce a specific output. For example, say our goal label/output is

$$ y_{goal} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}^{T} $$

In other words, we want to find an image such that the neural network's output is the vector above; that is, an image that the neural network thinks is a 5 (mind you, we're zero-indexed). It turns out we can treat this as an optimization problem, in much the same way that we train the network. We call the image we want to generate $\vec{x}$ (a 784-dimensional vector; we've flattened the 28 by 28 pixel image to keep the calculation simple). We define a cost function

$$ C = \frac{1}{2} \left\lVert\, y_{goal} - \hat{y}(\vec{x}) \,\right\rVert_2^2 $$

where $\lVert\cdot\rVert_2^2$ is the squared L2 norm, $y_{goal}$ is the target label from above, and $\hat{y}(\vec{x})$ is the output of the neural network for our image. If the network's output for our image is very close to our target label, the corresponding cost is low; if the output is far from our goal, the cost is high. Therefore, finding a vector $\vec{x}$ that minimizes the cost C gives us an image that the neural network predicts as our target label. Our problem is now to find this vector. Note that this problem is very similar to how we train a neural network, where we define a cost function and then adjust the weights and biases (aka the parameters) to minimize it. In the case of generating adversarial examples, instead of minimizing the cost with respect to the weights and biases, we hold the weights and biases fixed (essentially keeping the entire network fixed) and choose an input $\vec{x}$ that minimizes the cost.
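To make the parallel explicit, here is the same idea written as two optimization problems (our own restatement in the notation above, not a formula from the original article):

$$ \text{training:}\;\; \min_{W,\,b}\; C(W, b) \qquad\qquad \text{adversarial example:}\;\; \min_{\vec{x}}\; C(\vec{x}) $$

During training the inputs and labels are fixed while the parameters move; here the parameters are fixed while the input moves.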

To do this, we will use the same method we use to train neural networks: gradient descent! We can use backpropagation to find the derivative of the cost function with respect to the input, $\partial C / \partial \vec{x}$, and then take gradient descent steps to find the $\vec{x}$ that minimizes the cost.

Backpropagation is usually used to compute the gradients of the cost with respect to the weights and biases, but in general backpropagation is just an algorithm that efficiently computes gradients on a computational graph (which is what a neural network is). Therefore, it can also be used to compute the gradient of the cost function with respect to the neural network's input.
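The article leaves input_derivative to the notebook, but for concreteness here is a minimal sketch of how it could be implemented for the sigmoid network described above (a Nielsen-style Network object with weights, biases, and sizes attributes, quadratic cost). This is an assumption about the repo's implementation, not a copy of it:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def input_derivative(net, x, y):
    """Return dC/dx for C = 0.5 * ||y - yhat(x)||^2, holding weights and biases fixed."""
    # Forward pass, remembering every pre-activation z and activation a
    activation = x
    activations = [x]
    zs = []
    for b, w in zip(net.biases, net.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Backward pass: delta at the output layer for the quadratic cost
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # Propagate delta back through the hidden layers
    for l in range(2, len(net.sizes)):
        delta = np.dot(net.weights[-l + 1].transpose(), delta) * sigmoid_prime(zs[-l])

    # One more step back through the first weight matrix gives the gradient w.r.t. the input
    return np.dot(net.weights[0].transpose(), delta)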

Now let’s look at the code that generates the adversarial sample:

def adversarial(net, n, steps, eta):
    """
    net   : network object
        neural network instance to use
    n     : integer
        our goal label (just an int, the function transforms it into a one-hot vector)
    steps : integer
        number of steps for gradient descent
    eta   : integer
        step size for gradient descent
    """
    # Set the goal output
    goal = np.zeros((10, 1))
    goal[n] = 1

    # Create a random image to initialize gradient descent with
    x = np.random.normal(.5, .3, (784, 1))

    # Gradient descent on the input
    for i in range(steps):
        # Calculate the derivative
        d = input_derivative(net, x, goal)

        # The GD update on x
        x -= eta * d

    return x


First we create $y_{goal}$, called goal in the code. Next, we initialize $\vec{x}$ as a random 784-dimensional vector. With this vector we can start gradient descent, which is really only two lines of code. The first line, d = input_derivative(net, x, goal), computes $\partial C / \partial \vec{x}$ using backpropagation. (The full code is in the notebook for those who want it, but we won't walk through it here because it's really just a bunch of math. If you want a better description of what input_derivative does, check out this website, which is also where we got the neural network implementation.) The second and last line of the gradient descent loop, x -= eta * d, is the GD update: we step along the negative gradient with step size eta.
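As a quick illustration of how this function might be used (the step count and step size here are made up for the sketch, not the notebook's values):

# Sketch only: generate an untargeted image the network should read as a "5"
# and check the prediction. Hyperparameters are illustrative.
x_adv = adversarial(net, 5, steps=1000, eta=1)
print('Network prediction:', np.argmax(net.feedforward(x_adv)))
plt.imshow(x_adv.reshape(28, 28), cmap='Greys')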

Below are the untargeted adversarial samples for each class and the prediction of the neural network:

Untargeted “0”



Untargeted “3”



Untargeted “5”



On the left is the untargeted adversarial example (a 28 x 28 pixel image). On the right are the network's activations when it is given that image.

Incredibly, the neural network classifies some of these images as certain digits with very high confidence; "3" and "5" are good examples. For most of the others, the network's activations are very low for every digit, suggesting it is merely confused. The results look pretty good!

At this point, something might be bothering you. If we want to make an adversarial example corresponding to "5", we want to find an $\vec{x}$ such that, when fed to the neural network, the output is as close as possible to the one-hot vector representing "5". However, why doesn't gradient descent simply find an image of a "5"? After all, the network would almost certainly believe that an image of a "5" actually is a "5" (because it is). One possible theory for why this happens:

The space of all possible 28 by 28 pixel images is enormous. There are $2^{784} \approx 10^{236}$ different 28 by 28 black-and-white images. For comparison, a common estimate of the number of atoms in the observable universe is $10^{80}$. If every atom in the universe contained another universe, we would have $10^{160}$ atoms. If every atom contained another universe, and every atom in those universes contained yet another universe, and so on roughly three levels deep, we would just about reach $10^{236}$ atoms. So basically, the number of possible images is staggering.

Of all these images, only a tiny fraction are recognizable as digits to the human eye. Yet the neural network will classify a great many of them as digits (partly because our network has never been trained on images that don't look like digits, so when shown such an image its output is nearly arbitrary). So when we go looking for an image that the neural network reads as a digit, we are far more likely, just by probability, to find one that looks like noise or static to us than one that a human would also recognize as a digit.

Targeted attack

These adversarial examples are cool, but to a human they look like noise. Wouldn't it be cooler if we could have an adversarial example that looked like something? Maybe an image of a "2" that the network thinks is a "5"? It turns out that's possible, and it requires only a very minor change to the original code: we add a term to the cost function we are minimizing. The new cost function is

$$ C = \frac{1}{2} \left\lVert\, y_{goal} - \hat{y}(\vec{x}) \,\right\rVert_2^2 \;+\; \frac{\lambda}{2} \left\lVert\, \vec{x} - x_{target} \,\right\rVert_2^2 $$

where $x_{target}$ is what we want our adversarial example to look like (a 784-dimensional vector, the same dimensions as our input). So what we want now is to minimize both terms simultaneously. We've already seen the left-hand term: minimizing it pushes the network's output towards $y_{goal}$. Minimizing the second term forces our adversarial image $\vec{x}$ to be as close as possible to $x_{target}$ (because the norm shrinks as the two vectors get closer), which is exactly what we want! The λ in front is a hyperparameter that determines which of the two terms matters more. As with most hyperparameters, after some trial and error we found .05 to be a good value for λ.
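Taking the gradient of this cost with respect to $\vec{x}$ (our own derivation in the same notation, not shown in the original article) makes clear where the extra term in the code update below comes from:

$$ \nabla_{\vec{x}}\, C \;=\; \frac{\partial}{\partial \vec{x}} \left( \frac{1}{2} \left\lVert y_{goal} - \hat{y}(\vec{x}) \right\rVert_2^2 \right) \;+\; \lambda \left( \vec{x} - x_{target} \right) $$

The first piece is the d that input_derivative already computes; the second piece is the lam * (x - x_target) that gets added in the code.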

If you know about ridge regression, the cost function above will probably look familiar. In fact, we can interpret it as placing a prior on our adversarial examples.

If you are not familiar with regularization, you can easily find out more about it with a search engine.

The code that minimizes the new cost function is almost identical to the original (we call this function sneaky_adversarial() because using targeted attacks is sneaky; naming things is always the hardest part of programming).

def sneaky_adversarial(net, n, x_target, steps, eta, lam=.05):
    """ net : network object neural network instance to use n : integer our goal label (just an int, the function transforms it into a one-hot vector) x_target : numpy vector our goal image for the adversarial example steps : integer number of steps for gradient descent eta : integer step size for gradient descent lam : float lambda, our regularization parameter. Default is .05 """

    # Set the goal output
    goal = np.zeros((10, 1))
    goal[n] = 1

    # Create a random image to initialize gradient descent with
    x = np.random.normal(.5, .3, (784, 1))

    # Gradient descent on the input
    for i in range(steps):
        # Calculate the derivative
        d = input_derivative(net,x,goal)

        # The GD update on x, with an added penalty 
        # to the cost function
        # ONLY CHANGE IS RIGHT HERE!!!
        x -= eta * (d + lam * (x - x_target))

    return x

The only change we made was the gradient descent update:

x -= eta * (d + lam * (x - x_target))

The lam * (x - x_target) term is the derivative of the new term in our cost function. A usage sketch follows, and then the results of the new attack:
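As a concrete (and purely illustrative) example of calling the function, assuming net, test_data, and sneaky_adversarial are defined as above; the hyperparameters are made up:

# Sketch only: take a test image whose true label is 2 as x_target and push the
# network towards predicting "5". Hyperparameter values are illustrative.
x_target = next(x for (x, y) in test_data if y == 2)
x_adv = sneaky_adversarial(net, 5, x_target, steps=1000, eta=1, lam=.05)
print('Network prediction:', np.argmax(net.feedforward(x_adv)))
plt.imshow(x_adv.reshape(28, 28), cmap='Greys')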

Targeted “7” [x_target = 3]



Targeted “9” [x_target = 5]



Targeted “8” [x_target = 2]



On the left is the targeted adversarial example (a 28 x 28 pixel image). On the right are the network's activations when it is given that image.

It's important to note that, as with the untargeted attack, there are two kinds of behavior. Either the neural network is completely fooled and the activation for the desired digit is very high (e.g., the "targeted 5" image), or the network is merely confused and all the activations are low (e.g., the "targeted 7" image). Interestingly, far more images now fall into the former category, completely fooling the network rather than just confusing it. It seems that adversarial examples regularized towards real digit images tend to converge better during gradient descent.

Protecting against adversarial attacks

It's incredible! We've just created images that trick neural networks. The next question is whether we can protect against these kinds of attacks. If you look closely at the original images and the adversarial examples, you'll see that the adversarial examples have a gray-tinged, noisy background.

Target image



The original image



An adversarial example with noise in the background.

We can start with a simple approach that uses binary thresholding to make the background completely white:

def binary_thresholding(n, m):
    """
    n : int 0-9, the target number to match
    m : index of example image to use (from the test set)
    """
    # Generate adversarial example
    x = sneaky_generate(n, m)

    # Binarize image
    x = (x > .5).astype(float)

    print("With binary thresholding: ")

    plt.imshow(x.reshape(28, 28), cmap="Greys")
    plt.show()

    # Get binarized output and prediction
    binary_activations = net.feedforward(x)
    binary_prediction = np.argmax(net.feedforward(x))

    print("Prediction with binary thresholding: ")
    print(binary_prediction)

    print("Network output: ")
    print(binary_activations)

Here are the results:

Adversarial image:



Binary image



The effect of binary thresholding on an MNIST adversarial image. On the left is the image; on the right is the output of the neural network.

It turns out that binary thresholding works! But this defense is not very good. Not all images have a white background; look, for example, at the picture of the panda at the beginning of this article. Binary thresholding that image might remove the noise, but it would also disturb the picture of the panda considerably, perhaps to the point where neither the network nor a human could recognize it as a panda.

Binary thresholding of the panda produces an image full of blotches.

Another, more general approach is to train a new neural network on both the original training set and correctly labeled adversarial examples. The implementation is in the IPython notebook (note that it takes about 15 minutes to run). Doing this gives about 94% accuracy on a test set of adversarial images, which is pretty good. However, this approach has its own limitations: in real life you are unlikely to know how your attacker generates adversarial examples.
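Here is a minimal sketch of what that retraining could look like, assuming net, training_data, test_data, and sneaky_adversarial are defined as above and that the Network class exposes Nielsen's SGD(training_data, epochs, mini_batch_size, eta) method; the counts and hyperparameters are made up, and this is not the notebook's actual code:

import random

# Build a set of adversarial examples labeled with what a human would see.
adv_training_data = []
for _ in range(1000):                                # number of examples is illustrative
    x_target, y_true = random.choice(training_data)  # a real digit and its one-hot label
    wrong = random.choice([n for n in range(10) if n != np.argmax(y_true)])
    x_adv = sneaky_adversarial(net, wrong, x_target, steps=100, eta=1)
    adv_training_data.append((x_adv, y_true))        # the human-visible label stays correct

# Retrain a fresh network on the original data plus the adversarial examples.
robust_net = network.Network([784, 30, 10])
robust_net.SGD(training_data + adv_training_data, 30, 10, 3.0, test_data=test_data)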

There are many other ways to protect ourselves from adversarial attacks that we haven’t touched on in this introductory article, but the issue remains an open research topic, and there are many papers on the subject if you’re interested.

Black box attack

An interesting and important observation about adversarial examples is that they are generally not specific to a particular model or architecture. Adversarial examples generated for one neural network architecture transfer surprisingly well to another. In other words, if you want to fool a model, you can build your own model, generate adversarial examples against it, and those same examples will likely fool the other model as well.
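To make the idea concrete, here is a minimal sketch (purely illustrative, not from the article or the Berkeley work): generate an adversarial example against a substitute network you trained yourself, then feed the very same image to a separately trained "victim" network and see whether it is fooled too.

# Sketch only: transferability between two independently trained networks.
# Architectures and hyperparameters are illustrative.
substitute_net = network.Network([784, 30, 10])
substitute_net.SGD(training_data, 30, 10, 3.0)

victim_net = network.Network([784, 100, 10])      # a different architecture
victim_net.SGD(training_data, 30, 10, 3.0)

x_target = next(x for (x, y) in test_data if y == 2)
x_adv = sneaky_adversarial(substitute_net, 5, x_target, steps=1000, eta=1)
print('Victim prediction:', np.argmax(victim_net.feedforward(x_adv)))  # is it fooled too?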

This has important implications, because it means it is possible to craft adversarial examples for a completely black-box model without any prior knowledge of its internals. In fact, a team at Berkeley successfully used this approach to attack a commercial AI classification system.

Conclusion

As we move into a future where neural networks and deep learning algorithms are embedded ever more deeply in our daily lives, we must be very careful to remember that these models can be fooled easily. Although neural networks take some of their cues from biology and approach (or even exceed) human performance on a wide variety of tasks, adversarial examples tell us that they do not operate the way real organisms do. As we've seen, neural networks can fail easily and catastrophically, in ways completely alien to us humans.

We don't fully understand neural networks, so it's unwise to use our human intuitions to describe them. For example, you'll often hear people say, "The neural network thinks this image is of a cat because it's orange." The problem is that neural networks don't think the way humans do. They are essentially just a series of matrix multiplications with some added nonlinearities. As adversarial examples demonstrate, the outputs of these models are very fragile. We must be careful not to attribute human qualities to neural networks despite their human-like capabilities; that is, we must not anthropomorphize machine learning models.



A neural network trained to detect dumbbells "believes" that dumbbells sometimes come paired with a disembodied arm. This is clearly not what we intended. From Google Research.

All in all, the existence of adversarial examples should keep us humble. They show us that despite the great progress we've made, there is still much we don't understand.

Link to original English text:

Ml.berkeley.edu/blog/2018/0…
