
Summary of Classical Deep Learning Network Models (1)

Let’s take a look at some of the classic network models we’re going to cover, as follows:

  • LeNet: a CNN for handwritten digit recognition
  • AlexNet: ILSVRC 2012 champion, much deeper than LeNet, a historic breakthrough.
  • ZFNet: strong performer in ILSVRC 2013, structurally similar to AlexNet.
  • VGGNet: runner-up in ILSVRC 2014 classification and champion in localization
  • GoogLeNet: ILSVRC 2014 classification champion
  • ResNet: ILSVRC 2015 champion, far outperforming all previous networks


LeNet

In 1989, Yann LeCun et al. proposed the LeNet network, one of the earliest convolutional neural networks, which greatly promoted the development of deep learning. LeCun is also known as the father of convolutional networks. A pioneer like him deserves to be remembered by everyone doing research in this field; here is a photo of LeCun, who certainly looks the part of a scholar 👨🏼‍🎓

OK, let's take a look at the network structure this master designed, as shown below:

By today's standards the network is very simple: it has five layers (only layers with parameters are counted; the pooling layers, which have no parameters, are not included in the layer count here or in the networks that follow):

  • Input size: 32 x 32
  • Convolution layers: 2
  • Pooling layers: 2
  • Fully connected layers: 2
  • Output layer: 1, size 10 x 1

Take a look at each of these layers in the following figure (a code sketch of the full network follows this list):

  • Input layer: LeNet's input is a 32*32*1 grayscale image (only one color channel).
  • CONV1 (the first convolution layer): the 32*32*1 grayscale image is convolved with 6 convolution kernels of size 5*5 and stride s=1, producing a feature map of size 28*28*6. (The output size follows the usual formula: output = (input − kernel + 2*padding)/stride + 1; here (32 − 5)/1 + 1 = 28.)
  • The first pooling layer: LeNet uses average pooling. The 28*28*6 feature map from the previous convolution is pooled with a 2*2 kernel and stride s=2, giving a feature map of size 14*14*6. (The size calculation for pooling uses essentially the same formula as for convolution: (28 − 2)/2 + 1 = 14; it will not be repeated for the later layers.)
  • CONV2 (the second convolution layer): the pooled 14*14*6 feature map is convolved a second time with 16 kernels of size 5*5 and stride s=1, giving a feature map of size 10*10*16.
  • The second pooling layer: the pooling kernel size is 2*2 and the stride is s=2; after the second pooling the feature map is 5*5*16.
  • The first fully connected layer: the feature map from the previous step is 5*5*16, and the output of this layer is 120*1. This step can be confusing at first: how does a three-dimensional tensor suddenly become one-dimensional? Readers familiar with these network models will already know, but for completeness and rigor, here is a brief explanation: after the 5*5*16 feature map is obtained, a flatten operation is applied, i.e. the 5*5*16 feature map is unrolled into a vector of size (5x5x16)*1 = 400*1, which then enters the fully connected layer. So in the first fully connected layer, 400 neurons go in and 120 neurons come out.
  • The second fully connected layer: similar to the previous layer; the input is 120 neurons and the output is 84 neurons.
  • Output layer: from the 84 neurons, the output is obtained through one more fully connected layer, with a size of 10*1.
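To make the layer-by-layer walkthrough concrete, here is a minimal PyTorch sketch of the structure described above. It is only an illustration, not a reproduction of the original paper: the activation choice (Tanh here) and initialization details are simplified assumptions.

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    """Minimal sketch of LeNet: 2 conv + 2 pool + 2 FC + output layer."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5, stride=1),  # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                # 5x5x16 -> 400
            nn.Linear(400, 120),         # first fully connected layer
            nn.Tanh(),
            nn.Linear(120, 84),          # second fully connected layer
            nn.Tanh(),
            nn.Linear(84, num_classes),  # output layer: 10x1
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# quick shape check with a dummy grayscale image
x = torch.randn(1, 1, 32, 32)
print(LeNet()(x).shape)  # torch.Size([1, 10])
```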


AlexNet

Let's look at AlexNet's network architecture directly, as shown in the figure below. It can be seen that AlexNet and LeNet are very similar in overall structure: both are a series of convolution and pooling operations followed by fully connected layers. As before, we will explain each layer in detail, i.e. how the corresponding feature map size is obtained from the convolution and pooling kernels. The network has 8 layers (not counting pooling), as follows:

  • Input size: 227*227*3
  • Convolution layers: 5
  • Pooling layers: 3
  • Fully connected layers: 2
  • Output layer: 1, size 1000 x 1

  • Input layer: AlexNet's input is a 227*227*3 color image (three color channels).
  • The first convolution layer: the 227*227*3 color image is convolved with 96 kernels of size 11*11, stride s=4, padding=0; after convolution the feature map is 55*55*96.
  • The first pooling layer: AlexNet uses max pooling. The 55*55*96 feature map from the previous convolution is max-pooled with a 3*3 kernel, stride s=2, padding=0, giving a 27*27*96 feature map.
  • The second convolution layer: the input is 27*27*96, the kernel size is 5*5, stride s=1, padding=2, and the number of kernels is 256; after convolution the feature map is 27*27*256. [Note: by choosing suitable s and padding, the first two dimensions of the feature map can be kept the same before and after convolution. In the figure above this convolution is labeled "same", meaning the spatial size of the feature map is unchanged by the convolution.]
  • The second pooling layer: the input is 27*27*256, the pooling kernel is 3*3, stride s=2, padding=0; after pooling the feature map is 13*13*256.
  • The third convolution layer: the input is 13*13*256, the kernel size is 3*3, stride s=1, padding=1, and the number of kernels is 384; after convolution the feature map is 13*13*384.
  • The fourth convolution layer: the input is 13*13*384, the kernel size is 3*3, stride s=1, padding=1, and the number of kernels is 384; after convolution the feature map is 13*13*384.
  • The fifth convolution layer: the input is 13*13*384, the kernel size is 3*3, stride s=1, padding=1, and the number of kernels is 256; after convolution the feature map is 13*13*256.
  • The third pooling layer: the input is 13*13*256, the pooling kernel is 3*3, stride s=2, padding=0; after pooling the feature map is 6*6*256.
  • The first fully connected layer: the input is 6*6*256 = 9216 neurons, the output is 4096 neurons.
  • The second fully connected layer: the input is 4096 neurons, the output is 4096 neurons.
  • Output layer: from the 4096 neurons, the output is obtained through one more fully connected layer, with a size of 1000*1 (see the shape-check sketch right after this list).
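As a sanity check on the sizes listed above, here is a small PyTorch sketch that traces the feature-map shapes through the convolution/pooling stack. It only checks shapes and deliberately omits the activations, LRN, and Dropout discussed next.

```python
import torch
import torch.nn as nn

# AlexNet convolution/pooling stack, parameters as in the walkthrough above
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),   # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                   # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),  # -> 27x27x256 ("same")
    nn.MaxPool2d(kernel_size=3, stride=2),                   # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), # -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), # -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), # -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                   # -> 6x6x256
)

x = torch.randn(1, 3, 227, 227)
for layer in features:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))  # trace each feature-map size
# final shape: (1, 256, 6, 6) -> flatten to 9216 -> FC 4096 -> FC 4096 -> FC 1000
```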

All layers of AlexNet have now been described, but some details of the model have not, such as the ReLU activation function, local response normalization (LRN, later found in the VGG work to bring little benefit), and the Dropout layers. The full network structure is shown below. [Note: the figure below shows the structure split across two GPUs, a technique used at the time to cope with limited compute; it is no longer necessary today.]
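For reference, these pieces map to standard PyTorch modules. The hyperparameters below follow the commonly cited AlexNet settings and are meant only as an illustration:

```python
import torch.nn as nn

relu = nn.ReLU(inplace=True)                    # non-saturating activation after conv/FC layers
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4,  # local response normalization across channels
                           beta=0.75, k=2.0)    # (later found to add little benefit)
drop = nn.Dropout(p=0.5)                        # used in the fully connected layers to reduce overfitting
```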

Here’s a summary of AlexNet’s innovations:

  • ReLU is used as the activation function of the CNN, and its effect was verified to exceed Sigmoid in deeper networks, successfully alleviating the vanishing-gradient problem that Sigmoid suffers from in deep networks. It also speeds up training: since the network is trained with gradient descent, non-saturating nonlinearities train faster than saturating ones. Although the ReLU activation function had been proposed long before, it was not widely adopted until AlexNet.
  • Dropout is used during training to avoid overfitting by randomly dropping some neurons. Dropout was discussed in a separate paper, but AlexNet put it to practical use and demonstrated its effectiveness. In AlexNet, Dropout is mainly used in the last few fully connected layers.
  • Overlapping max pooling is used in the CNN. Previously, average pooling was widely used, while AlexNet uses max pooling to avoid the blurring effect of average pooling. In addition, AlexNet makes the stride smaller than the pooling kernel size, so the outputs of adjacent pooling windows overlap, which improves the richness of the features.
  • The LRN layer is proposed, creating a competition mechanism among the activities of local neurons: values with larger responses are amplified and neurons with smaller responses are suppressed, which is intended to enhance the generalization ability of the model. [This method was later considered to bring no benefit in the VGG paper.]
  • CUDA is used to accelerate the training of the deep convolutional network, exploiting the powerful parallel computing capability of GPUs to handle the large number of matrix operations in training. AlexNet is trained on two GTX 580 GPUs; a single GTX 580 has only 3GB of video memory, which limits the maximum size of the trainable network, so AlexNet is split across the two GPUs, with the parameters of half of the neurons stored in each GPU's memory. [With today's improved computing power, dual-GPU splitting like this is rarely needed.]
  • Data augmentation: 224*224 regions (and their horizontal mirrors) are randomly cropped from the 256*256 original images, which amounts to (256 − 224)^2 * 2 = 2048 times more data. Without data augmentation, a CNN with this many parameters would overfit on the original data alone; with augmentation, overfitting is greatly reduced and generalization improves. At prediction time, crops are taken at the four corners and the center of the image (5 positions), each with its horizontal flip, giving 10 images in total; the network predicts on all 10 and the results are averaged. (A small code sketch of this augmentation follows the list.)
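Here is a minimal sketch of this style of augmentation using torchvision transforms. It is an approximation: the original AlexNet pipeline also included PCA-based color jitter, which is omitted here.

```python
import torch
from torchvision import transforms

# Training-time augmentation: random 224x224 crop from a 256-pixel image plus horizontal flip.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Test-time augmentation: four corners + center, each with its mirror = 10 crops.
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])
```

At test time the model would be run on all 10 crops and the outputs averaged, matching the prediction scheme described above.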


ZFNet

The network structure of ZFNet is basically the same as that of AlexNet. The main change is at the first layer: the convolution kernel size is reduced from 11*11 to 7*7, and the stride s is reduced from 4 to 2. Since ZFNet changes so little compared with AlexNet, why talk about it at all? What I find most valuable is that ZFNet proposed a deconvolution-based technique for visualizing what a neural network has learned. The smaller first-layer kernel is itself a result of this visualization: the visualizations suggested that a smaller kernel and stride preserve more information in the early layers and make the network perform better.
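As a sketch, the change amounts to swapping AlexNet's first convolution for a smaller, denser one (the padding value below is illustrative, not taken from the paper):

```python
import torch.nn as nn

# AlexNet's first convolution layer: large kernel, large stride
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0)

# ZFNet's modified first layer: smaller 7x7 kernel and stride 2, which the
# deconvolution visualizations suggested keeps more low-level detail
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)
```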