
Summary of Classical Network Models for Deep Learning (II)

VGG-16 and VGG-19 are both excellent networks (16 and 19 refer to the number of weight layers). The following takes VGG-16 as an example to introduce the VGG network in detail. First, take a look at the overall structure of VGG-16, shown in the figure below. As can be seen, VGG-16 is similar to AlexNet, just with more layers: both apply pooling after a series of convolution operations and finish with several fully connected layers.

The figure below is used to walk through the transformation at each layer one by one. There are 16 weight layers in total (a code sketch of the whole network follows the list), broken down as follows:

Input size: 224*224*3

Convolutional layers: 13

Pooling layers: 5

Fully connected layers: 3 (the output layer is included here: the last fully connected layer can also be regarded as the output layer, and the final Softmax layer is only used for classification)
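
Every spatial size in the walkthrough below follows the standard convolution/pooling output-size formula, out = (in - kernel + 2*padding) / stride + 1. A minimal Python check (the helper name conv_out is just for illustration):

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula: (in - k + 2p) // s + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# 3x3 convolution, stride 1, padding 1 keeps the spatial size: 224 -> 224
print(conv_out(224, kernel=3, stride=1, padding=1))  # 224
# 2x2 max pooling, stride 2, padding 0 halves it: 224 -> 112
print(conv_out(224, kernel=2, stride=2, padding=0))  # 112
```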

  • Input layer: the input to VGG-16 is a 224*224*3 color image.
  • The first convolution layer: the 224*224*3 input is convolved with 64 kernels of size 3*3, stride s=1, padding=1. After convolution, a feature map of size 224*224*64 is obtained.
  • The second convolution layer: the 224*224*64 feature map is convolved with 64 kernels of size 3*3, stride s=1, padding=1, giving a feature map of size 224*224*64. [Note: in the figure above, the first and second convolutions are drawn together, because the second convolution does not change the dimensions of the feature map produced by the first.]
  • The first pooling layer: the input dimension is 224*224*64, the pooling kernel size is 2*2, stride s=2, padding=0; pooling produces a feature map of size 112*112*64.
  • The third and fourth convolution layers: the 112*112*64 input is convolved with 128 kernels of size 3*3, stride s=1, padding=1, giving a feature map of size 112*112*128. [Note: as with the first and second convolutions, the fourth convolution does not change the dimensions of the feature map produced by the third, so they are written together.]
  • The second pooling layer: the input dimension is 112*112*128, the pooling kernel size is 2*2, stride s=2, padding=0; pooling produces a feature map of size 56*56*128.
  • The fifth, sixth and seventh convolution layers: the 56*56*128 input is convolved with 256 kernels of size 3*3, stride s=1, padding=1, giving a feature map of size 56*56*256. Likewise, the sixth and seventh convolutions do not change the size of the feature map produced by the fifth, so they are put together.
  • The third pooling layer: the input dimension is 56*56*256, the pooling kernel size is 2*2, stride s=2, padding=0; pooling produces a feature map of size 28*28*256.
  • The eighth, ninth and tenth convolution layers: the 28*28*256 input is convolved with 512 kernels of size 3*3, stride s=1, padding=1, giving a feature map of size 28*28*512.
  • The fourth pooling layer: the input dimension is 28*28*512, the pooling kernel size is 2*2, stride s=2, padding=0; pooling produces a feature map of size 14*14*512.
  • The 11th, 12th and 13th convolution layers: the 14*14*512 input is convolved with 512 kernels of size 3*3, stride s=1, padding=1, giving a feature map of size 14*14*512.
  • The fifth pooling layer: the input dimension is 14*14*512, the pooling kernel size is 2*2, stride s=2, padding=0; pooling produces a feature map of size 7*7*512.
  • The first fully connected layer: the input is 7*7*512=25088 neurons, the output is 4096 neurons.
  • The second fully connected layer: the input is 4096 neurons, the output is 4096 neurons.
  • The third fully connected layer: the input is 4096 neurons, the output is 1000 neurons.
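
The layer-by-layer walkthrough above maps directly onto a stack of convolution blocks followed by three fully connected layers. Below is a minimal sketch in PyTorch, assuming the layer sizes listed above; it is an illustrative reimplementation, not the original authors' code (torchvision.models.vgg16 provides a reference implementation, which additionally uses dropout between the fully connected layers).

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """num_convs 3x3 convolutions (stride 1, padding 1) followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(3,   64,  2),   # 224*224*3   -> 112*112*64
            vgg_block(64,  128, 2),   # 112*112*64  -> 56*56*128
            vgg_block(128, 256, 3),   # 56*56*128   -> 28*28*256
            vgg_block(256, 512, 3),   # 28*28*256   -> 14*14*512
            vgg_block(512, 512, 3),   # 14*14*512   -> 7*7*512
        )
        self.classifier = nn.Sequential(
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096),        nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)       # (N, 512, 7, 7)
        x = torch.flatten(x, 1)    # (N, 25088)
        return self.classifier(x)  # (N, num_classes) class scores

x = torch.randn(1, 3, 224, 224)
print(VGG16()(x).shape)  # torch.Size([1, 1000])
```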

Compared with AlexNet, VGG has the following improvements:

  1. The main difference: it is deeper, increasing the number of weight layers to 16-19 (not counting pooling and Softmax layers), whereas AlexNet has an 8-layer structure.
  2. It raises the convolution layer to the concept of a convolution block, composed of 2-3 consecutive convolution layers. This gives the network a larger receptive field while reducing the number of parameters, and the repeated use of the ReLU activation function adds more non-linear transformations and stronger learning ability.
  3. The convolution layers at the front consume most of the memory, while the fully connected layers at the back hold most of the parameters, concentrated above all in the first fully connected layer; the total comes to about 138M parameters (see the quick calculation after this list).
  4. It drops LRN, because experiments showed that LRN does not improve performance and only adds memory and computation overhead.
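
A quick back-of-the-envelope check of points 2 and 3, using the layer sizes listed above (weights plus biases for the fully connected layers):

```python
# Point 3: the three fully connected layers, 7*7*512 -> 4096 -> 4096 -> 1000.
fc1 = 7 * 7 * 512 * 4096 + 4096   # ~102.8M, the single largest layer
fc2 = 4096 * 4096 + 4096          # ~16.8M
fc3 = 4096 * 1000 + 1000          # ~4.1M
print(fc1 + fc2 + fc3)            # ~123.6M of the ~138M total parameters

# Point 2: two stacked 3x3 convolutions cover a 5x5 receptive field
# but need fewer weights than a single 5x5 convolution (C channels in and out).
C = 256
print(2 * 3 * 3 * C * C)  # 18*C^2 = 1,179,648
print(5 * 5 * C * C)      # 25*C^2 = 1,638,400
```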