• Video address: www.youtube.com/embed/FmpDI…
  • Document reference: PDF [2MB] & PPT [6MB] & Web View
  • Supplementary knowledge: Deep learning – Theoretical derivation of back propagation





When you hear about deep learning breaking through a new technical barrier, nine times out of ten convolutional neural networks are involved. Also known as CNNs or ConvNets, they are the workhorses of the deep neural network field. They have learned to classify images, in some cases even better than humans. If there is one method that justifies the hype around deep learning, it is CNNs. What’s really cool about them is that they’re easy to understand, at least when you break them down into their basic modules. Here’s a video that walks through these images in greater detail.





LeNet-5





Classification

Prior work

  • ICML09 – Convolutional Deep Belief Networks [PDF]












  • Playing Atari with Deep Reinforcement Learning








  • Robot Learning Manipulation Action Plans













A toy ConvNet: X’s and O’s

Can a computer recognize whether a picture contains the letter “X” or the letter “O”?

To help guide your understanding of convolutional neural networks, let’s take a very simplified example: Does an image contain an “X” or an “O”?





This example is enough to illustrate the ideas behind CNNs, while staying simple enough to avoid getting bogged down in unnecessary detail. The task for our CNN is this: each time it is given an image, it must decide whether the image contains an “X” or an “O”. And suppose it has to choose one or the other, either “X” or “O”. The ideal situation would look something like this:





The standard “X” and “O” are located in the center of the image, correctly proportioned and undistorted.

The problem is not so easy for a computer to solve when the image deviates slightly from this standard:





A naive way for a computer to solve this problem is to store one standard image of an “X” and one of an “O” (as in the example above), then compare each new image against the two stored ones and decide which letter fits better. But this approach is actually very unreliable, because computers are quite literal-minded. To a computer’s “vision”, an image is just a two-dimensional array of pixels (think of it as a checkerboard), with a number at each position. In our example, a pixel value of 1 represents white and a pixel value of -1 represents black.





When comparing two images, if any of the pixel values fail to match, then, at least as far as the computer is concerned, the two images do not match. In this example, the computer finds that the two images agree in the central 3 × 3 block but differ near the four corners:





So, on the face of it, the computer decides that the picture on the right is not an “X”. The two pictures are different, and it concludes:





But a conclusion like that just doesn’t make sense. Ideally, we want the computer to still recognize the “X” and the “O” in images that have undergone simple transformations such as translation, scaling, rotation, or slight distortion. In cases like the following, we still expect the computer to be quick and accurate:





This is the problem that CNN emerged to solve.

Features





A CNN instead compares images piece by piece. The small patches it compares are called features. By finding rough feature matches at roughly the same positions in two images, a CNN does a much better job of judging similarity than the traditional whole-image, pixel-by-pixel comparison.

Each feature is like a miniature image (that is, a small two-dimensional array of values). Different features match different aspects of an image. In the case of the letter “X”, features made up of diagonals and a crossing are enough to capture most of the important properties an “X” has.





These features will likely match the four corners and the center of the letter “X” in any image that contains one. So how exactly does the matching work? Like this:





















Are you starting to see the idea? But this is really just the first step: you have seen how features are matched against positions in the original image, yet you don’t know the math behind it. For example, what exactly happens with the 3 × 3 patch below?





Next I’ll walk through the math behind the matching, an operation called convolution.

Convolution










When a CNN is given a new image, it doesn’t know in advance where in the image each feature is supposed to match, so it tries every possible position. Computing the match at every position across the whole image effectively turns the feature into a filter. This matching process is called a convolution, and it is where convolutional neural networks get their name.

The math behind the convolution operation is actually quite simple. To compute the match between a feature and the small patch of the image under it, you simply multiply each pixel value in the feature by the pixel value at the corresponding position in the patch, add up all the products across the patch, and finally divide by the total number of pixels in the patch. If both pixels are white (both 1), then 1 × 1 = 1; if both are black, then (-1) × (-1) = 1. Either way, every matching pair of pixels multiplies to 1, while every mismatched pair multiplies to -1. If all the pixels of an n × n feature match the corresponding n × n patch of the image, the products sum to n², and dividing by the number of pixels, n², gives 1. Likewise, if every pixel mismatches, the result is -1. The full process looks like this:
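Here is a minimal NumPy sketch of that single-patch calculation (the function name match_score is my own, for illustration; it is not from the original post):

```python
import numpy as np

def match_score(patch, feature):
    """Multiply corresponding pixels, sum the products, and divide by
    the pixel count: 1.0 is a perfect match, -1.0 a perfect mismatch."""
    return np.sum(patch * feature) / feature.size

# A 3 x 3 diagonal feature from the "X" example (1 = white, -1 = black).
feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

print(match_score(feature, feature))  # 1.0: a patch always matches itself
```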









































For the middle part, do the same:

















When the feature has been swept across the entire image, the result looks something like this:





Then perform the same operation with other features, and the final result is like this:





To complete the convolution, we repeat this process for every feature, convolving each one with every patch of the image. The convolution of each feature yields a new two-dimensional array, which can be understood as a filtered version of the original image. This array is called a feature map: it is what that feature “extracts” from the original image. The closer a value is to 1, the more completely the feature matches at that position; the closer to -1, the more completely the position matches the feature’s negative; and values near 0 mean no match or correlation at all.
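A sliding-window sweep that produces one feature map might look like the following sketch, reusing the hypothetical match_score helper from above (a “valid” convolution, with no padding; real frameworks implement this far more efficiently, but the arithmetic is the same):

```python
def feature_map(image, feature):
    """Slide the feature over every valid position of the image and
    record the match score at each one."""
    fh, fw = feature.shape
    h = image.shape[0] - fh + 1
    w = image.shape[1] - fw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = match_score(image[i:i + fh, j:j + fw], feature)
    return out
```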





In this way, after convolution with each of the different features, the original image becomes a stack of feature maps. It is convenient to view this whole operation as a single step, which in a CNN is called the convolution layer. (You might immediately suspect there must be other layers to follow, and that’s right; we’ll get to them later.) We can picture the convolution layer like this:





So the operations inside a CNN are not complicated. But although we can describe a CNN in this little space, the number of additions, multiplications, and divisions adds up very quickly. In math speak, the count grows linearly with the number of pixels in the image, with the number of pixels in each filter, and with the number of filters. With so many factors at play, the computation can easily become enormous, and it’s no wonder that many microprocessor manufacturers now produce specialized chips to keep up with the computational demands of CNNs.

Pooling






Another useful tool in CNNs is called pooling. Pooling shrinks a large image while preserving the important information in it: the input is downsampled so that only the important information remains. The math behind it is second-grade level at worst. Pooling typically uses a 2 × 2 window. With max-pooling, the maximum value within each 2 × 2 block of the input becomes one pixel of the result, shrinking the image to a quarter of its original area. (Note: average-pooling works the same way but takes the average of each 2 × 2 block instead.)
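A minimal NumPy sketch of 2 × 2 max-pooling, padding with 0 at the edges exactly as described below (max_pool is an illustrative name, not an API from any particular library):

```python
def max_pool(fmap, size=2):
    """Keep only the maximum value in each size x size window.
    The bottom/right edges are padded with 0 when the dimensions
    don't divide evenly."""
    ph = -fmap.shape[0] % size
    pw = -fmap.shape[1] % size
    padded = np.pad(fmap, ((0, ph), (0, pw)), constant_values=0)
    h, w = padded.shape
    return padded.reshape(h // size, size, w // size, size).max(axis=(1, 3))
```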

For this example, the pooling operation is as follows:













Where the window runs past the edge of the feature map, the missing values are filled with 0:









After max-pooling (here with a 2 × 2 window), a feature map shrinks to a quarter of its original size:





Then perform the same operation on all feature maps and obtain the following results:





Because max-pooling keeps the maximum value of each block, it preserves the best match found within that block (remember, the closer a value is to 1, the better the match). This means the network doesn’t care exactly where within the window a match occurred, only whether a match occurred somewhere. So a CNN can detect whether a feature appears in an image regardless of where it appears, which helps overcome the rigid pixel-by-pixel matching described earlier.

After pooling all the feature maps, the series of large maps produced from the input becomes a series of smaller ones. As before, the whole operation can be treated as a single step: the pooling layer of the CNN, shown below:





Adding pooling layers greatly reduces the amount of computation and the load on the machine.

Normalization





1. The activation function ReLU (Rectified Linear Units)

This is a small but important operation called ReLU (Rectified Linear Units). The math is simple:





For negative inputs, the output is 0; for positive inputs, the output is the input unchanged. In other words, f(x) = max(0, x).
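In code, the whole operation is one line (relu here is an illustrative helper in the same NumPy sketch as above, not a library call):

```python
def relu(fmap):
    """Zero out negative values; pass positive values through unchanged."""
    return np.maximum(fmap, 0)
```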

Let’s look at how the ReLU activation acts on the running example in this article:













Finally, after applying the operation across the whole feature map, the result looks like this:





As before, in a CNN we treat this series of operations as a single step, giving us the ReLU layer:





Deep Learning

Finally, we chain together the convolution, pooling, and activation operations described above, which looks like this:





Then we increase the depth of the network by adding more layers, and we get a deep neural network:
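Stacking the pieces in code is just function composition. Here is a toy sketch of one conv → ReLU → pool stage built from the hypothetical helpers above (a real deep network would repeat such stages, with later stages treating the stack of pooled maps as a multi-channel input, which this simplification glosses over):

```python
def conv_relu_pool(image, features):
    """One stage of the pipeline: convolve the image with every feature,
    rectify each resulting map, then downsample it."""
    return [max_pool(relu(feature_map(image, f))) for f in features]
```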





Then, at the different layers, we can visualize what the network has learned, as in the prior work cited at the beginning of this article:





Fully connected layer

































In a fully connected layer, every value in the stack of pooled feature maps casts a weighted vote for each category, and the category with the strongest support wins. Based on the result here, the image is judged to be an “X”:





We define this series of operations as the “fully connected layer”:
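As a rough sketch under the same toy assumptions, a fully connected layer flattens the pooled maps into one vector and scores each category with a weight matrix (fully_connected and its weights argument are illustrative, not from the original post):

```python
def fully_connected(fmaps, weights):
    """Flatten all pooled maps into one vector; every value casts a
    weighted vote for each category ("X" and "O" in our example)."""
    votes = np.concatenate([m.ravel() for m in fmaps])
    return weights @ votes  # one score per category
```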





There can also be several fully connected layers in a row, as follows:





[Putting all of the above structures together]





This whole process, run from front to back, is called “forward propagation”. It produces a set of outputs, and the errors in those outputs are then propagated backward through the network so that it can keep correcting itself and learning.

Backpropagation

The mathematics here is covered in: Deep Learning – Theoretical Derivation of Backpropagation.





























Gradient Descent Optimizer
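As a minimal sketch of the idea behind this heading: gradient descent repeatedly nudges each weight a small step downhill on the error surface (the function name and learning rate below are illustrative, not from the original post):

```python
def gradient_descent_step(weights, grad, learning_rate=0.01):
    """Move each weight a small step against its error gradient."""
    return weights - learning_rate * grad
```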









Hyperparameters





Application





Images





Sound





Text









Learn more





If you’d like to dig deeper into deep learning, check out my Demystifying Deep Learning post. I also recommend the notes from the Stanford CS 231 course by Justin Johnson and Andrej Karpathy that provided inspiration for this post, as well as the writings of Christopher Olah, an exceptionally clear writer on the subject of neural networks.

If you are one who loves to learn by doing, there are a number of popular deep learning tools available. Try them all! And then tell us what you think.

Caffe

CNTK

Deeplearning4j

TensorFlow

Theano

Torch

Many others

I hope you’ve enjoyed our walk through the neighborhood of Convolutional Neural Networks. Feel free to start up a conversation.





Brandon Rohrer

The focus of this article is the concrete operations described above, so the remaining sections are only touched on rather than covered in detail; they may be fleshed out gradually later. Thank you!

Reference:

  • brohrer.github.io/how_convolu…
  • Deep learning — Theoretical derivation of back propagation

(Note: Thank you for reading, and I hope you found this article helpful. If you liked it, you are welcome to share it, but please click here to obtain authorization first. This article is under copyright protection; any unauthorized reprinting is prohibited. Thank you!)