Editor’s note: In the tech world, if you don’t understand machine learning, you’re left out. What do you do when colleagues are talking about machine learning and all you can do is nod along without getting a word in? Let’s change that! Adam Geitgey wrote a simple, easy-to-understand series, “Machine Learning is Fun!”, in five parts, aimed at anyone who is interested in machine learning but doesn’t know where to start. We hope it helps more people learn about machine learning and sparks their interest in it. This is the third article.

Machine Learning Isn’t That Deep, It’s Fun (Part 1)

Machine Learning Isn’t That Deep, It’s Fun (Part 2)

Are you tired of reading long-winded explanations of deep learning and still feeling confused? Let’s change that!

Google now lets you search your own photo albums by description alone, without you ever having to tag the photos manually. How does that work?

As mentioned in the previous two parts, this tutorial is for anyone who is interested in machine learning but doesn’t know where to start. Because it is written for a general audience, it is broad and far from complete, but we still hope it sparks interest in machine learning and helps more people learn about it.

If you haven’t read the first two parts of this tutorial, do so!

Recognizing objects with deep learning

I’m sure you’ve seen the famous XKCD webcomic above.

The comic plays on the fact that any three-year-old can recognize the bird in a photo, while figuring out how to make a computer recognize different objects has puzzled our best computer scientists for more than 50 years.

In the last few years, we’ve finally found a pretty good approach to object recognition: the “deep convolutional neural network.” The name sounds like something out of a William Gibson science fiction novel, but if we break it down piece by piece, it’s much easier to understand.

So let’s get to work and write a program that can recognize pictures of birds!

Start simple

Before we learn to recognize pictures of birds, let’s learn to recognize a relatively simple object — the handwritten number “8.”

In the second part of this guide, we learned how neural networks solve complex problems by linking many simple neurons together. We built a small neural network that estimates the price of a house based on the number of rooms, the square footage, and the neighborhood.

We also know that machine learning works by applying the same generic algorithm to different data to solve different problems, so we could modify and adjust the same neural network to recognize handwritten text. But to keep things simple, let’s first try to recognize a single handwritten digit: the “8.”

Machine learning only works well when you have data, ideally a lot of data. So to get started, we need lots and lots of handwritten “8”s. Fortunately, researchers have built the MNIST dataset of handwritten digits for exactly this purpose. MNIST provides 60,000 images of handwritten digits, each 18×18 pixels in size. Here are some of the handwritten “8”s in the dataset:

Some of the handwritten “8”s in the MNIST dataset

When you think about it, everything is just numbers.

The neural network we built in Part 2 accepts only three numbers as input, such as “3” rooms, “2,000” square feet, and so on. But now we want to use our neural network to process images, so how do we actually feed images, rather than just a few numbers, into the network?

The answer is quite simple. A neural network takes numbers as input, and to a computer an image really is just a grid of numbers representing how dark each pixel is:

To feed the image into our neural network, we simply treat the 18×18 pixel image as a list of 324 numbers.
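As a rough illustration (not code from the original article), here is how you might flatten an 18×18 grayscale image into 324 numbers using NumPy and Pillow; the file name “eight.png” is just a placeholder:

```python
# Illustrative only: turn an 18x18 grayscale image into 324 numbers.
import numpy as np
from PIL import Image

# Load a handwritten "8" (placeholder file name) as a grayscale image.
image = Image.open("eight.png").convert("L").resize((18, 18))

# The image is now an 18x18 grid of pixel intensities (0-255).
pixels = np.array(image)
print(pixels.shape)                # (18, 18)

# Flatten the grid into a single row of 324 numbers for the neural network.
inputs = pixels.flatten() / 255.0  # scale the values to the 0..1 range
print(inputs.shape)                # (324,)
```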

To handle those 324 inputs, we simply enlarge our neural network so it has 324 input nodes:

Notice that our neural network now has two outputs instead of one. The first output predicts the likelihood that the image is an “8,” and the second predicts the likelihood that it is not an “8.” By having a separate output for each type of object we want to recognize, we can use a neural network to sort objects into groups.

This time, we built a much larger neural network than the previous one (324 inputs compared to 3!). But any modern computer can process neural networks with hundreds of nodes in the blink of an eye, and it works fine even on your phone.

All that’s left is to train the neural network with lots of images of “8”s and non-“8”s so it learns to tell them apart. When we feed in an image of an “8,” we tell the network there is a 100% chance the image is an “8” and a 0% chance it is not. We do the reverse for the non-“8” images.
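Here is a hedged sketch of what that small network could look like in TFLearn, the library used later in this article. The layer sizes are my own guesses, and `X` and `Y` are assumed to be the flattened training images and their one-hot labels:

```python
# Illustrative sketch: an "8" / "not 8" classifier with 324 inputs.
import tflearn

net = tflearn.input_data(shape=[None, 324])                  # 18x18 = 324 pixels
net = tflearn.fully_connected(net, 128, activation='relu')   # hidden layer (size is a guess)
net = tflearn.fully_connected(net, 2, activation='softmax')  # [P(is an 8), P(not an 8)]
net = tflearn.regression(net)

model = tflearn.DNN(net)

# X: flattened training images, Y: matching one-hot labels (loaded with your
# own helper). [1, 0] means "100% an 8", [0, 1] means "100% not an 8".
model.fit(X, Y, n_epoch=10, show_metric=True)
```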

Here is some of our training data:

Mmm… that’s some great training data!

We can train this neural network on a laptop in just a few minutes. When it’s done, we have a neural network that can recognize handwritten “8”s with fairly high accuracy. Welcome to the world of (late-1980s) image recognition!

Tunnel vision

It’s really neat that image recognition can be as simple as feeding raw pixels into a neural network. Machine learning is magic, right?

Well, of course it’s not that easy.

First, the good news: our “8” recognizer really does work well on simple images, as long as the digit sits right in the middle of the image:

The bad news: our recognizer fails completely when the digit isn’t perfectly centered. Even a small shift in position is enough to fool our neural network.

That’s because our network only ever saw training images with the “8” in the center, so it has no idea what an off-center “8” looks like. It learned exactly one pattern: an “8” in the middle of the image.

In the real world, the data our neural network sees is rarely that tidy; it’s usually messy and variable. So we need a way for the network to recognize an “8” anywhere in the image, not just in the center.

Brute-force idea #1: search the entire image with a sliding window

We already have a program that’s good at finding an “8” in the center of an image, so can’t we just scan the whole picture, region by region, until we either find a region containing an “8” or run out of image?

This approach, called a sliding window, is the brute-force solution. It works reasonably well in some cases, but it’s very inefficient: we have to check the same image over and over again to find “8”s at different positions and sizes.
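As a sketch of the idea (not the article’s code), a sliding-window search might look like the loop below. `looks_like_an_8` is a hypothetical stand-in for the digit recognizer we just trained, and scanning at multiple window sizes would simply add another loop around this one:

```python
# Illustrative sliding-window search for an "8" anywhere in a larger image.
def find_eight(image, window=18, stride=4):
    height, width = image.shape
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            if looks_like_an_8(crop):   # stand-in for the trained recognizer
                return (top, left)      # found one, report where it is
    return None                         # scanned the whole image, no "8"
```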

Of course, we have a better way!

Brute-force idea #2: more data and a deeper network

When we trained our neural network earlier, we only showed it “8”s that were perfectly centered. What if we train it on more varied data, with “8”s in all sorts of positions and sizes?

We don’t even need to collect new training data. We can just write a simple script that places “8”s of different sizes at different positions in new images.
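A rough sketch of such a script, using Pillow and NumPy (the canvas and digit sizes are illustrative, not taken from the article):

```python
# Illustrative script: paste the same handwritten "8" at random sizes and
# positions on a blank canvas to create new training examples.
import random
import numpy as np
from PIL import Image

def make_synthetic_example(digit_image, canvas_size=64):
    canvas = Image.new("L", (canvas_size, canvas_size), color=0)
    # Randomly rescale the digit...
    size = random.randint(12, 32)
    digit = digit_image.resize((size, size))
    # ...and paste it at a random position on the canvas.
    left = random.randint(0, canvas_size - size)
    top = random.randint(0, canvas_size - size)
    canvas.paste(digit, (left, top))
    return np.array(canvas)
```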

By creating different versions of our existing training images, we create “synthetic training data.” This is a very useful technique!

With this approach, we have an endless supply of training data. More data makes the problem harder for our neural network to solve, but we can compensate by making the network bigger so it can learn more complicated patterns.

To make the network bigger, we just stack up layer upon layer of nodes:

We call this a “deep neural network” because it has many more layers than a traditional neural network.
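“Stacking layers” is literal in code. Here is an illustrative TFLearn sketch of the same “8” detector with extra hidden layers between the 324 inputs and the 2 outputs (the layer widths are my own guesses):

```python
# Illustrative sketch: the same "8" detector, but deeper, with extra hidden
# layers stacked between the 324 inputs and the 2 outputs.
import tflearn

net = tflearn.input_data(shape=[None, 324])
net = tflearn.fully_connected(net, 256, activation='relu')
net = tflearn.fully_connected(net, 256, activation='relu')
net = tflearn.fully_connected(net, 128, activation='relu')
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
```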

The idea has been around since the late 1960s, but until a few years ago, training such large neural networks was too slow to be useful. Once we figured out how to use 3D graphics cards (designed to do matrix multiplication really fast) instead of ordinary CPUs, working with large neural networks suddenly became practical. In fact, the same NVIDIA GeForce GTX 1080 graphics card you use to play “Overwatch” can train neural networks astonishingly quickly.

But even though we can make our neural networks huge and train them quickly on graphics cards, that alone isn’t a complete solution. We need to be smarter about how we feed images into the network.

It makes no sense to train our network to recognize the “8” at the top and the “8” at the bottom of an image as two different objects.

So we need to figure out a way to make this neural network smart enough to understand that the number eight is the same object no matter where it’s placed in the picture, without any extra training. Fortunately, such a solution does exist.

Convolution is the answer

As a human, you intuitively know that pictures have a hierarchy, a conceptual structure. Take a look at this picture:

A gratuitous picture of my son

As a human, you can immediately recognize the hierarchy in this photo:

· The ground in the photo is covered with grass and cement;

· There is a child in the photo;

· The child is sitting on a rubber jumping horse;

· The jumping horse is on top of the grass.

Most importantly, we recognize the child no matter what surface the child is on. We don’t have to re-learn what a “child” looks like for every possible surface it could appear on.

But right now, our neural network can’t do that. It treats an “8” in a different position as a completely different object, and it doesn’t understand that moving an object around a picture doesn’t make it a new object. That means it would have to re-learn how to identify each object in every possible position, which is hopelessly inefficient.

We need to give our neural network a sense of translation invariance: an “8” is an “8” no matter where it appears in the picture.

We’ll do this using a process called “convolution.” The idea of convolution is inspired partly by computer science and partly by biology (mad scientists literally poking cats’ brains with strange probes to figure out how they process images).

How convolution works

Previously, we fed all the images into our neural network as a grid of numbers. This time, we’re going to do something smarter than we’ve done before, using the idea that an object is the same object no matter where it is in the picture.

Here’s how it works, step by step:

Step 1: Decompose the image into overlapping image tiles

Similar to our sliding-window search above, we pass a sliding window over the entire original image and save each result as a separate, small square tile.

After this step, we have turned our original image into 77 equally-sized square tiles.
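Here is a minimal sketch of the tiling step (the tile size and stride are illustrative; with the particular sizes used in the article’s figure, the result happens to be 77 tiles):

```python
# Illustrative tiling step: slide a window over the image and collect the
# overlapping tiles it produces.
import numpy as np

def extract_tiles(image, tile_size=32, stride=16):
    tiles = []
    height, width = image.shape
    for top in range(0, height - tile_size + 1, stride):
        for left in range(0, width - tile_size + 1, stride):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return np.array(tiles)
```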

Step 2: Feed each image tile into a small neural network

Previously, we fed an entire image into a large neural network to recognize the “8.” This step is the same idea, but here we feed each small tile into the network separately:

We repeat this 77 times, once per tile.

However, there is one big twist: we keep the same neural network weights for every tile in the same original image. In other words, we treat every tile exactly the same way. If something interesting appears in any given tile, we mark that tile as interesting.

Step 3: Save the output from each tile into a new array

We don’t want to lose track of where each tile came from, so we save the result for each tile into a grid laid out the same way as the original image. It looks something like this:

In other words, we started with a large image and ended up with a slightly smaller array that records which parts of the original image were the most interesting.
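To make Steps 2 and 3 concrete, here is a small sketch. `tiny_network` is a hypothetical stand-in for the small neural network with fixed weights, and the 7×11 tile layout is just one illustrative way to arrange 77 tiles:

```python
import numpy as np

# Step 2 (sketch): run the *same* small network, with the same weights,
# over every tile. `tiny_network` scores how "interesting" a tile looks.
scores = [tiny_network(tile) for tile in tiles]

# Step 3 (sketch): arrange the per-tile scores into a grid with the same
# layout as the original image, e.g. 7 rows x 11 columns = 77 tiles.
rows, cols = 7, 11
grid = np.array(scores).reshape(rows, cols)
```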

Step 4: Downsampling

The result of step 3 is an array that shows the most interesting parts of the original image, but the array is quite large:

To reduce the size of this array, we downsample it using an algorithm called max pooling.

We’ll look at each 2×2 square of the array and keep only the largest number in it:

The idea is that if we found something interesting in any of the four tiles that make up a 2×2 square, we keep just the most interesting bit. This shrinks the array while preserving the most important parts.
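Max pooling itself is easy to sketch in NumPy. This illustrative helper keeps only the largest value in every 2×2 block (it assumes the array’s height and width are even):

```python
# Illustrative max pooling: keep only the largest value in every 2x2 block.
import numpy as np

def max_pool_2x2(array):
    rows, cols = array.shape               # assumes rows and cols are even
    blocks = array.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))         # the biggest number in each block
```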

The final step: Make a prediction

So far, we’ve reduced a large image to a relatively small array.

Since this array is just a bunch of numbers, we can feed it into another neural network, which makes the final decision about whether the image is a match. To distinguish it from the convolution step, we call this the “fully connected” network.

So, from start to finish, our five-step pipeline looks like this:

Add more steps

Our image processing pipeline consists of a series of steps: “convolution,” “maximum pooling,” and finally a “fully connected network.”

These steps can be combined and stacked as many times as you need to solve a real-world problem. You can have two, three, even ten convolution layers, and you can add max pooling wherever you want to shrink the data.

The basic idea is to start with a large image and compress it step by step until you end up with a single output. The more convolutional steps you have, the more complex image features your network will be able to recognize.

For example, the first convolution step might learn to recognize sharp edges, the second might use its knowledge of sharp edges to recognize beaks, and the third might use its knowledge of beaks to recognize entire birds, and so on.

Here is what a more realistic deep convolutional network looks like (similar to the diagrams you’d find in research papers):

In this case, the network starts with a 224×224-pixel image, applies convolution and max pooling twice, then three more convolution steps, one more max pooling step, and finally two fully connected layers. The final output classifies the image into one of 1,000 categories.
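As a rough illustration only, a network shaped like that description could be written in TFLearn roughly as follows. The filter counts, kernel sizes, and strides are guesses for the sake of the example, not values read off the diagram:

```python
# Illustrative TFLearn sketch of a network shaped like the one described:
# 224x224 input, conv + pool twice, three more convs, one more pool,
# two fully connected layers, and a 1000-way output.
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

net = input_data(shape=[None, 224, 224, 3])

# Convolution + max pooling, twice.
net = conv_2d(net, 64, 11, strides=4, activation='relu')
net = max_pool_2d(net, 3, strides=2)
net = conv_2d(net, 192, 5, activation='relu')
net = max_pool_2d(net, 3, strides=2)

# Three more convolution steps, then one more max pooling step.
net = conv_2d(net, 384, 3, activation='relu')
net = conv_2d(net, 256, 3, activation='relu')
net = conv_2d(net, 256, 3, activation='relu')
net = max_pool_2d(net, 3, strides=2)

# Two fully connected layers, then the 1000-way softmax output.
net = fully_connected(net, 4096, activation='relu')
net = fully_connected(net, 4096, activation='relu')
net = fully_connected(net, 1000, activation='softmax')
net = regression(net)
```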

Build the right network

So how do you know which steps you need to combine and repeat to make your image classifier perform best?

To be honest, it takes a lot of experimentation and testing to answer that question. You may have to train and test 100 networks before you find the structure and parameters that work best for your problem. Machine learning involves a lot of trial and error.

Building our bird classifier

Now we finally know enough to write a program that can tell whether an image contains a bird.

As usual, we need some data to get started. The free CIFAR10 dataset contains 6,000 bird images and 52,000 images of things that aren’t birds. To get even more data, we also added 12,000 bird images from the Caltech-UCSD Birds-200-2011 dataset.

Here is a selection of birds from our combined dataset:

The image below is part of a collection of 52,000 “non-bird” images:

This dataset works fine for our purposes, but 72,000 low-resolution images is still quite small for real-world applications. If you want Google-level results, you need millions of large, high-resolution images. In machine learning, having more data is almost always more important than having a better algorithm. Now you know why Google is happy to give you unlimited storage for your photos: they want your image data, lots and lots of it.

To build our classifier, we’ll use TFLearn, a wrapper around Google’s TensorFlow deep learning framework that exposes a simplified API. It lets us define a convolutional neural network in just a few lines of code.

Here is roughly what the code to define and train the network looks like:
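This is a hedged sketch rather than the article’s exact listing: the dataset file name, layer sizes, and training settings below are illustrative assumptions, but the overall shape (convolution layers with max pooling, a fully connected layer with dropout, and a two-way softmax output) follows the approach described above.

```python
# Hedged sketch of the bird classifier; file names and sizes are illustrative.
import pickle
import tflearn
from tflearn.data_utils import shuffle
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# Load the 32x32 training images and their "bird" / "not bird" labels.
X, Y, X_test, Y_test = pickle.load(open("full_dataset.pkl", "rb"))
X, Y = shuffle(X, Y)

# Three convolution layers with max pooling, then a fully connected layer,
# dropout, and a two-way softmax output.
network = input_data(shape=[None, 32, 32, 3])
network = conv_2d(network, 32, 3, activation='relu')
network = max_pool_2d(network, 2)
network = conv_2d(network, 64, 3, activation='relu')
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 512, activation='relu')
network = dropout(network, 0.5)
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam',
                     loss='categorical_crossentropy',
                     learning_rate=0.001)

# Train, check against the held-out test images, and save the result.
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(X, Y, n_epoch=100, shuffle=True,
          validation_set=(X_test, Y_test),
          show_metric=True, batch_size=96,
          run_id='bird-classifier')
model.save("bird-classifier.tfl")
```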

If you’re training with a good graphics card that has enough video memory (such as an Nvidia GeForce GTX 980 Ti or better), training will finish in less than an hour. If you’re training with a normal CPU, it may take quite a bit longer.

Congratulations! Our program can now recognize birds in images!

Testing our network

Now that we have a trained neural network, let’s use it! Here is a simple script that takes an image file as input and predicts whether or not it is a bird.
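A hedged sketch of such a script is below. It assumes the network was defined and saved exactly as in the training sketch above, and that the first softmax output corresponds to “bird”:

```python
# Hedged sketch of a prediction script for the bird classifier above.
import sys
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression
from PIL import Image

# Rebuild the same network structure that was used for training.
network = input_data(shape=[None, 32, 32, 3])
network = conv_2d(network, 32, 3, activation='relu')
network = max_pool_2d(network, 2)
network = conv_2d(network, 64, 3, activation='relu')
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 512, activation='relu')
network = dropout(network, 0.5)
network = fully_connected(network, 2, activation='softmax')
network = regression(network)

# Load the weights saved after training.
model = tflearn.DNN(network)
model.load("bird-classifier.tfl")

# Read the image passed on the command line and scale it to 32x32 pixels.
img = Image.open(sys.argv[1]).convert("RGB").resize((32, 32))
data = np.asarray(img, dtype=np.float32) / 255.0

# The model returns one probability per class; the order is assumed to be
# [bird, not bird] here.
bird_prob, not_bird_prob = model.predict([data])[0]
if bird_prob > not_bird_prob:
    print("That's a bird!")
else:
    print("That's not a bird.")
```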

But to really measure how well our network performs, we need to test it on many images. The dataset I created held back 15,000 images as a validation set, and when I ran those 15,000 images through the network, it predicted the right answer 95% of the time.

Accurate 95% of the time! That sounds pretty good, right? Well, it depends.

How accurate is 95% accuracy?

Our network is 95% accurate, but “the devil is in the details,” and 95% can mean a lot of different things.

For example, what if 5% of our training images were birds and the other 95% weren’t? A program that guessed “not a bird” every single time would be 95% accurate, and also 100% useless.

We need to look more closely at the numbers, not just the overall accuracy. To judge how good a classification system really is, we need to look at how it fails, not just at the percentage of the time it fails.

Instead of thinking of our predictions as simply “right” or “wrong,” let’s break them down into four separate categories:

  • First, here are pictures of birds that our network correctly identified as birds. We call these “true positives.”

Wow! Our network can successfully identify many different kinds of birds!

  • Second, here are pictures of “non-birds” that our network correctly identified as non-birds. We call these “true negatives.”

  • Third, there are pictures that our network thought were birds but that were really something else. We call these “false positives.”

  • Finally, there are pictures of real birds that our network failed to recognize. We call these “false negatives.”

Using the 15,000 images in our validation set, here is how many times our predictions fell into each category:

Why break our predictions down like this? Because not all mistakes are created equal; different kinds of mistakes matter in different ways.

Imagine we were writing a program to detect cancer from MRI images. When detecting cancer, we would much rather have “false positives” than “false negatives.” A false negative would be the worst possible outcome: the program would tell someone they don’t have cancer when in fact they do.

Instead of focusing only on overall accuracy, we calculate precision and recall. These metrics give us a much clearer picture of how well we are doing.

The chart above shows that when we predicted “bird,” we were right 97% of the time. But it also shows that we only found 90% of the actual birds in the dataset. In other words, we may not find every single bird, but when we do say something is a bird, we can be pretty sure it really is one!
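For reference, precision and recall are computed directly from the category counts above. Here is a tiny sketch with illustrative numbers (not the article’s exact counts):

```python
# Precision and recall from the four prediction categories.
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# e.g. precision ~ 0.97: when we say "bird", we're right 97% of the time;
#      recall    ~ 0.90: we find 90% of the actual birds.
print(precision_recall(true_positives=450, false_positives=14, false_negatives=50))
```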

Where to go from here?

Now that you know the basics of deep convolutional networks, you can try out some of the examples that come with TFLearn to get hands-on experience with different neural network architectures. TFLearn even has built-in datasets, so you don’t have to use your own images.

You also know enough now to branch out and learn about other areas of machine learning. Why not learn how to use algorithms to train computers to play Atari games next?

 

Note: This article was compiled by TupuTech. You can follow the WeChat official account TupuTech to get the latest and best AI news.

