Original: An Intuitive Explanation of Convolutional Neural Networks

By Ujjwal Karn

Translation: Kaiser (Stu Wang)

Convolution: How to Make a Great Neural Network








Preface

  • If you are completely new to neural networks, it is recommended that you first read Build a Neural Network in Nine Lines of Python Code to pick up some basic concepts.
  • This article contains several GIFs. If they do not play here, please view them via the column link above.
  • The Battle.net point cards offered as prizes have been sent out. Congratulations to the following winners:
    • Liu Yang – Zhihu
    • Zhang Zhenyu – Zhihu
    • Bamboo Stone – Zhihu
    • Sanada Yukimura – Zhihu
    • Andras – Weibo








What is a convolutional neural network? And why is it important?

Convolutional Neural Networks (ConvNets or CNNs) are a class of neural networks that have proved particularly effective in image recognition and classification. Convolutional networks have been used successfully to recognize faces, objects, and traffic signs, powering applications such as robots and self-driving cars.


Figure 1





In Figure 1 above, a convolutional network can identify the scene, and the system automatically suggests relevant labels such as “bridge”, “railway” and “tennis ball”. Figure 2 shows examples of convolutional networks recognizing everyday objects such as people and animals. More recently, convolutional networks have also proven effective in natural language processing tasks (such as sentence classification).


Figure 2





Convolutional networks are an important tool in most machine learning applications today, but understanding them and learning to use them for the first time can sometimes be an unfriendly experience. The main purpose of this article is to help readers build an understanding of how convolutional neural networks work on images.

In this article, multi-layer perceptrons (MLPs) are also referred to as fully connected layers.





LeNet Architecture (1990s)

LeNet was one of the earliest convolutional neural networks and helped propel the field of deep learning. Yann LeCun’s pioneering LeNet5 takes its name from a series of successful iterations dating back to 1988. At the time, the LeNet architecture was used mainly for tasks such as reading zip codes and digits.

Let’s take a look at how LeNet learns to recognize images. Many new architectures have been built on top of LeNet in recent years, but the basic concepts still come from LeNet, and understanding it makes the others easier to learn.


Figure 3: Simple ConvNet





The convolutional neural network in Figure 3 is very similar to the original LeNet architecture and classifies an input image into one of four categories: dog, cat, boat, or bird (the original LeNet was used mainly for character recognition tasks). As shown above, when the network receives a boat image as input, it correctly assigns the highest probability (0.94) to the boat class. The probabilities in the output layer sum to 1.

The convolutional neural network in Figure 3 performs four main operations:

  1. Convolution
  2. Non-linearity (ReLU)
  3. Pooling or sub-sampling
  4. Classification (fully connected layer)

These operations are the basic building blocks of every convolutional neural network, so understanding them is crucial to understanding the whole network. Next, we will try to build an intuition for each of these operations.











An image is a matrix of pixel values

Essentially, each image can be represented as a matrix of pixel values:


Figure 4: Pixel value matrix





“Channel” is a conventional term that refers to a particular component of an image. A photo from a typical digital camera has three channels (red, green, and blue), which you can imagine as three 2D matrices stacked on top of each other (one for each color), each holding pixel values between 0 and 255.

A grayscale image, on the other hand, has only one channel. For simplicity, this article considers only grayscale images, so an image is a single 2D matrix. Each pixel value in the matrix still ranges from 0 to 255, with 0 representing black and 255 representing white.
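For illustration, here is a minimal NumPy sketch of how a grayscale image can be held as a single 2D matrix and a color photo as three stacked channels. The array sizes and pixel values are made up for the example:

```python
import numpy as np

# A tiny 4x4 grayscale "image": one 2D matrix, values 0 (black) to 255 (white)
gray = np.array([
    [  0,  50, 100, 150],
    [ 30,  80, 130, 180],
    [ 60, 110, 160, 210],
    [ 90, 140, 190, 255],
], dtype=np.uint8)
print(gray.shape)        # (4, 4) -> height x width, a single channel

# A color photo adds a channel axis: height x width x 3 (red, green, blue)
color = np.zeros((4, 4, 3), dtype=np.uint8)
print(color.shape)       # (4, 4, 3)
```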











Convolution

Convolutional networks take their name from the “convolution” operation. The fundamental purpose of convolution is to extract features from the input image. Convolution preserves the spatial relationships between pixels by learning image features using small squares of input data. We will not go into the mathematical details of convolution here, but will try to understand how it works over images.

As mentioned above, every image can be viewed as a matrix of pixel values. Consider a 5×5 image whose pixel values are only 0 and 1 (the green matrix below is a special case; pixel values in an ordinary grayscale image range from 0 to 255). Also consider the following 3×3 matrix:

Then, the convolution calculation between the 5×5 image and the 3×3 matrix can be represented by the animation below:



Figure 5: The convolution operation. The output matrix is called the Convolved Feature or Feature Map.





Take a moment to see how the computation is done: we slide the orange matrix over the original (green) image by 1 pixel at a time (this step size is called the “stride”), and at every position we multiply the corresponding elements of the two matrices and add the products to obtain an integer, which becomes one element of the output (pink) matrix. Note that the 3×3 matrix only “sees” part of the input image at each step.

The 3×3 matrix is also called a “filter”, “kernel” or “feature detector”, and the matrix obtained by sliding the filter over the original image and computing these sums of products is called the “Convolved Feature”, “Activation Map” or “Feature Map”. The important point is that the filter acts as a feature detector for the original input image.
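To make the sliding-window computation concrete, here is a minimal NumPy sketch of the operation described above (stride 1, no padding; strictly speaking this is cross-correlation, which is what CNNs compute in practice). The 5×5 binary image and 3×3 filter values are illustrative, not necessarily the ones shown in the animation:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # the part the filter "sees"
            out[i, j] = np.sum(patch * kernel)
    return out

# Example 5x5 binary image and 3x3 filter (illustrative values)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # 3x3 feature map
```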


Different filters produce different feature maps for the same input image. For example, consider the following input image:

The table below shows the effect of different convolution kernels on the image above. Just by adjusting the numeric values of the filter we can perform operations such as edge detection, sharpening, and blurring; this means that different filters detect different features in an image, such as edges, curves, and so on.
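The original table of kernels is not reproduced here, but commonly cited 3×3 examples of such filters (not taken from the article) look like the following. Applying each of them with the convolution sketch above produces the corresponding effect:

```python
import numpy as np

# Textbook examples of 3x3 kernels and the effects they produce
kernels = {
    "identity": np.array([[0, 0, 0],
                          [0, 1, 0],
                          [0, 0, 0]]),
    "edge_detect": np.array([[-1, -1, -1],
                             [-1,  8, -1],
                             [-1, -1, -1]]),
    "sharpen": np.array([[ 0, -1,  0],
                         [-1,  5, -1],
                         [ 0, -1,  0]]),
    "box_blur": np.ones((3, 3)) / 9.0,
}
```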






Another good way to understand the convolution operation is to look at the animation in Figure 6:

(The GIF for Figure 6 is too large to embed here.)





A filter (red outline) slides over the image (the convolution operation) to produce a feature map. Convolving another filter (green outline) over the same image yields a different feature map. Note that the convolution operation captures local dependencies in the original image. Also notice how two different filters produce different feature maps from the same image. In the end, the image and the two filters are just the numeric matrices we discussed above.

In practice, a convolutional neural network learns the values of its filters during training. We still need to specify some parameters before training, such as the number of filters, the filter sizes, and the network architecture. The more filters we have, the more image features are extracted and the better the network becomes at recognizing patterns.


The size of the feature map is controlled by three parameters, which we need to decide before the convolution step is performed:

  • Depth: Depth is the number of filters used in the convolution operation. As shown in Figure 7, we convolve the original boat image with three distinct filters, producing three different feature maps. You can think of these three feature maps as stacked 2D matrices, so the “depth” of the feature map here is 3.


Figure 7.





  • Stride: Stride is the number of pixels by which we slide the filter. When the stride is 1, we move the filter one pixel at a time; when the stride is 2, the filter jumps 2 pixels at a time. A larger stride produces a smaller feature map.

  • Zero-padding: Sometimes it is convenient to pad the input matrix with zeros around the border, so that the filter can also be applied to the border pixels of the image matrix. A nice property of zero-padding is that it lets us control the size of the feature maps. Adding zero-padding is also called wide convolution, while not using zero-padding is called narrow convolution. (A small sketch of how these parameters determine the feature map size follows below.)
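Taken together, these parameters determine the spatial size of the feature map. A commonly used relationship (not stated explicitly in the article) for a square input of width W, filter size F, zero-padding P and stride S is output width = (W - F + 2P) / S + 1. A minimal sketch:

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial size of the feature map for input width w,
    filter size f, stride s and zero-padding p."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(5, 3))          # 3: the 5x5 image and 3x3 filter above
print(conv_output_size(5, 3, s=2))     # 2: a larger stride gives a smaller feature map
print(conv_output_size(5, 3, p=1))     # 5: padding keeps the original size
```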











Non-linearity (ReLU)

As shown in Figure 3, an additional operation called ReLU is applied after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation whose output is given by Output = max(0, Input):


Figure 8: ReLU





ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into the convolutional network, because most of the real-world problems we want the network to learn are non-linear (convolution itself is a linear operation of element-wise matrix multiplication and addition, so non-linearity has to be added with a function such as ReLU).

Figure 9 helps make this clear: ReLU is applied to one of the feature maps obtained in Figure 6, and the resulting output feature map is also called the “rectified” feature map (here black is rendered as gray).


Figure 9: ReLU


Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
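As a small illustration, element-wise ReLU on a feature map is just a comparison with zero. A minimal NumPy sketch (the feature map values are made up):

```python
import numpy as np

feature_map = np.array([[ 4, -3,  0],
                        [-1,  2, -5],
                        [ 7, -2,  1]])

rectified = np.maximum(feature_map, 0)   # every negative value becomes 0
print(rectified)
# [[4 0 0]
#  [0 2 0]
#  [7 0 1]]
```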








Pooling

Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map while retaining the most important information. Spatial pooling can be of several types: max, average, sum, and so on.

In the case of max pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element, we could also take the average (average pooling) or the sum of all elements in the window. In practice, max pooling has been shown to work best.

Figure 10 shows max pooling applied to a rectified feature map (obtained after convolution + ReLU), using a 2×2 window.


Figure 10: Maximum pooling





We slide our 2×2 window by 2 cells (a stride of 2) and take the maximum value within each region. As Figure 10 shows, pooling reduces the dimensionality of the feature map.
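A minimal NumPy sketch of 2×2 max pooling with stride 2, matching the description above (the input values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, moving by `stride`."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])
print(max_pool(rectified))   # [[6. 8.]
                             #  [3. 4.]]
```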

In the network shown in Figure 11, pooling is applied to each feature map separately (note that because of this, we get three output maps from three input maps).


Figure 11: Pooling applied to each rectified feature map





Figure 12 shows the effect of the pooling operation on the rectified feature map obtained in Figure 9.


Figure 12: Pooling





The function of pooling is to progressively reduce the spatial size of the input representation. In particular, pooling:

  • makes the input representation (feature dimension) smaller and more manageable
  • reduces the number of parameters and computations in the network, thereby helping to control overfitting
  • makes the network robust to small transformations, distortions, and translations in the input image (a small distortion in the input will not change the pooled output, because we take the maximum/average value in a local neighborhood)
  • helps us arrive at an almost scale-invariant representation of the image. This is very powerful, because we can then detect objects in an image no matter where they are located.

The story so far:


Figure 13





So far we have seen how convolution, ReLU, and pooling work. These layers are the basic building blocks of any convolutional neural network. As shown in Figure 13, we have two sets of “convolution + ReLU + pooling” layers: the second set applies six filters to the output of the first set, producing six feature maps. ReLU is then applied to each of the six feature maps individually, and max pooling is performed on each of the resulting rectified feature maps.

Together, these layers extract useful features from the image, introduce non-linearity into the network, and reduce the feature dimension, while keeping the features somewhat invariant to scale and translation.

The output of the second pooling layer serves as the input to the fully connected layer, which we will explore in the next section.








The fully connected layer

The Fully Connected layer is a multi-layer perceptron that uses a Softmax activation function in its output layer (other classifiers, such as support vector machines, could be used instead, but this article sticks with Softmax). “Fully connected” means that every neuron in one layer is connected to every neuron in the next layer.

The output of the convolutional and pooling layers represents high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into one of the classes defined by the training set. For example, the image classification task shown in Figure 14 has four possible classes. (Note that Figure 14 does not show all the neuron nodes.)



Figure 14: Fully connected layer, where each node is connected to every node in the adjacent layers

Besides classification, adding a fully connected layer is also an effective (and usually cheap) way of learning non-linear combinations of these features. The features extracted by the convolution and pooling layers are good on their own, but combinations of those features may be even better.

The output probabilities of the fully connected layer sum to 1; this is guaranteed by the Softmax activation function. The Softmax function takes a vector of arbitrary real-valued scores and squashes it into a vector of values between 0 and 1 that sum to 1.
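A minimal NumPy sketch of the Softmax computation described above (the score vector is an arbitrary example, and subtracting the maximum is a common numerical-stability trick not mentioned in the article):

```python
import numpy as np

def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(probs)          # roughly [0.64, 0.23, 0.10, 0.03]
print(probs.sum())    # 1.0
```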











Putting it all together – training with backpropagation

To sum up, the convolution + pooling layers act as a feature extractor, while the fully connected layer acts as a classifier.

In Figure 15, since the input image is a boat, the target probability is 1 for the boat class and 0 for the other three classes.

  • Input image = boat
  • Target vector = [0, 0, 1, 0]



Figure 15: Training convolutional neural network





The training process of a convolutional network can be summarized as follows:


  • Step 1: Initialize all filters and parameters/weights with random values.
  • Step 2: The network takes a training image as input, performs the forward pass (convolution, ReLU, pooling and the forward propagation of the fully connected layer) and computes the output probability for each class.
    • Suppose the output probabilities for the boat image are [0.2, 0.4, 0.1, 0.3].
    • Since the weights are random for the first training example, the output probabilities are also essentially random.
  • Step 3: Calculate the total error at the output layer (summed over the 4 classes):
    • Total Error = Σ ½ (target probability − output probability)²
  • Step 4: Use backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameters so as to minimize the output error.
    • Each weight is adjusted in proportion to its contribution to the total error.
    • When the same image is fed in again, the output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target [0, 0, 1, 0].
    • This means the network has learned to classify this particular image correctly by adjusting its weights/filters so that the output error is reduced.
    • Parameters such as the number of filters, the filter sizes and the network architecture are fixed before Step 1 and do not change during training; only the values of the filter matrices and connection weights get updated.


The steps above train the convolutional network: essentially, all the weights and parameters are optimized so that the network correctly classifies the images in the training set.
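For illustration, the error term from Step 3 and the weight update from Step 4 can be sketched as follows. This is a minimal sketch using the example probabilities above; in a real network the gradients come from backpropagation rather than being written out by hand:

```python
import numpy as np

target = np.array([0.0, 0.0, 1.0, 0.0])     # boat
output = np.array([0.2, 0.4, 0.1, 0.3])     # untrained network, nearly random

# Step 3: total error = sum of 1/2 * (target - output)^2 over the classes
total_error = 0.5 * np.sum((target - output) ** 2)
print(total_error)                            # 0.55

# Step 4 (schematic): move each weight against its error gradient,
# w = w - learning_rate * d(error)/d(w), where the gradient comes from backprop
learning_rate = 0.01
# weights -= learning_rate * gradients        # applied to every filter / weight
```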


When a new (previously unseen) image is fed into the convolutional network, the network performs the forward propagation step and outputs a probability for each class, computed using the weights learned during training. If our training set is large enough, the network should generalize well to new images and classify them correctly.


Note 1: The steps above have been greatly simplified and the mathematical details omitted in order to keep the training process intuitive.

Note 2: In the example above we used two sets of convolution + pooling layers. In practice, these operations can be repeated any number of times within a single convolutional network; some of today's best convolutional networks contain tens of convolution + pooling layers. Also, a pooling layer is not required after every convolution layer. As can be seen from Figure 16, we can have several consecutive convolution + ReLU layers followed by a single pooling layer.




    Figure 16











Visualizing convolutional neural networks

In general, the more convolution layers there are, the more complex the features that can be learned. In image classification, for example, the first layer of a convolutional neural network may learn to detect edges from raw pixels, the second layer may use those edges to detect simple shapes, and higher layers may use those shapes to detect higher-level features, such as facial shapes. Figure 17 shows features learned by a Convolutional Deep Belief Network. This is just a simple example; in practice, convolution filters may detect features that are hard for humans to interpret.


Figure 17: Features learned by a Convolutional Deep Belief Network





Adam Harley created an amazing visualization of a convolutional neural network trained on the MNIST database of handwritten digits. I highly recommend playing with it to understand the details of how convolutional neural networks work.

Here we will see how the network recognizes the input digit “8”. Note that Figure 18 does not show the ReLU step separately.


    Figure 18: Visual convolutional neural network





The input image has 1024 pixels (a 32×32 image), and the first convolution layer (Convolution Layer 1) is formed by convolving six distinct 5×5 filters (stride 1) with the input image. As can be seen, the six different filters produce a feature map of depth 6.

Convolution Layer 1 is followed by Pooling Layer 1, which performs 2×2 max pooling (stride 2) separately over each of the six feature maps. On the interactive page you can move your mouse pointer over any pixel in the pooling layer and observe the 4×4 grid it corresponds to in the previous convolution layer (Figure 19). Notice that the brightest pixel (the one with the maximum value) in each 4×4 grid makes it into the pooling layer.

Figure 19: Visualizing the pooling operation





    Then we have three fully connected (FC) layers:

• FC 1: 120 neurons
• FC 2: 100 neurons
• FC 3: 10 neurons, corresponding to the 10 digits (this is the output layer)
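Putting together the layers described above, a rough Keras sketch of this network might look as follows. Only the layers mentioned in the text are included (one convolution layer with six 5×5 filters, one 2×2 max pooling layer, and the three fully connected layers); the activation choices are assumptions, and the actual visualization network may contain additional convolution and pooling stages:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution Layer 1: six 5x5 filters, stride 1, on a 32x32 grayscale input
    Conv2D(6, (5, 5), strides=1, activation='relu', input_shape=(32, 32, 1)),
    # Pooling Layer 1: 2x2 max pooling with stride 2
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Flatten(),
    Dense(120, activation='relu'),    # FC 1
    Dense(100, activation='relu'),    # FC 2
    Dense(10, activation='softmax'),  # FC 3: one output per digit
])
model.summary()
```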

In Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the second fully connected layer (hence “fully connected”).

Note that the only bright node in the output layer corresponds to the digit 8 (brighter nodes represent higher probabilities, and here 8 has the highest probability), indicating that the network has correctly recognized the handwritten digit.


Figure 20: Visualizing the fully connected layers





    A 3D version of this visualization can be seen here.





    Other convolutional network architectures

Convolutional neural networks date back to the 1990s; we have already seen LeNet, one of the earliest. Some other influential architectures are listed below:

• 1990s to 2012: From the 1990s to the early 2010s, convolutional neural networks were in an incubation phase. As more data became available and computing power grew, the problems that convolutional neural networks could tackle became more and more interesting.

• AlexNet (2012): In 2012, Alex Krizhevsky released AlexNet, a deeper and wider version of LeNet, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin. This was a major breakthrough, and much of the current popularity of convolutional network applications can be traced back to this feat.

• ZF Net (2013): The ILSVRC 2013 winner was a convolutional network from Matthew Zeiler and Rob Fergus, known as ZF Net, an improved version of AlexNet obtained by tuning the architectural hyperparameters.

• GoogLeNet (2014): The ILSVRC 2014 winner was a convolutional network from Szegedy et al. at Google. Its main contribution was the development of the Inception module, which dramatically reduced the number of parameters in the network (4 million, compared with AlexNet's 60 million).

• VGGNet (2014): The runner-up of ILSVRC that year was VGGNet, whose main contribution was to show that the depth (number of layers) of a network is a critical factor in good performance.

• ResNet (2015): Residual Networks, developed by Kaiming He et al., won ILSVRC 2015. ResNets represent the state of the art in convolutional neural networks and are the default choice in practice (as of May 2016).

• DenseNet (August 2016): Published by Gao Huang et al., the Densely Connected Convolutional Network connects each layer directly to every other layer in a feed-forward fashion. DenseNet has shown impressive improvements over previous state-of-the-art results on five highly competitive object recognition benchmarks.

    (End of translation)





After reading this article and the previous ones in this series, you should have a good grasp of the basic principles of convolutional neural networks. Next, let's try to build a LeNet using Keras, a popular Python deep learning library.


    Practice is the only criterion for neural networks


We have reviewed the structure of LeNet; now build the classical LeNet network by completing the second group of convolution + activation + pooling layers.

Tips:

• The second convolution layer has 16 filters, each of size 8×8.
• The second activation layer uses the tanh function.
• The max pooling layer in the second group has a 3×3 window and a stride of 3.

Please complete the code in the Python development environment below and click the blue [Run] button to check whether your answer is correct.
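Since the interactive environment is not reproduced here, a rough Keras sketch following the tips above might look like the code below. The parameters of the first group (6 filters of size 5×5 with 2×2 pooling) and the 32×32 grayscale input shape are assumptions based on the classical LeNet layout; only the second group follows the tips exactly:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense

model = Sequential()

# First group: convolution + activation + pooling (assumed classical LeNet values)
model.add(Conv2D(6, (5, 5), padding='same', input_shape=(32, 32, 1)))
model.add(Activation('tanh'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Second group, following the tips above
model.add(Conv2D(16, (8, 8), padding='same'))               # 16 filters of size 8x8
model.add(Activation('tanh'))                                # tanh activation
model.add(MaxPooling2D(pool_size=(3, 3), strides=(3, 3)))    # 3x3 pooling, stride 3

# Classifier
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.summary()
```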

(To try it, visit: Convolution: How to Make a Great Neural Network – Intelligence-gathering Column)


As shown in the figure below, if the parameters of the second convolution + activation + pooling group are set incorrectly, a red prompt will appear pointing out the specific error.