VGG network

Design ideas

1. Deeper networks improve performance; the paper evaluates networks with 11, 13, 16, and 19 weight layers.

2. Deeper networks are harder to train and overfit more easily, so VGG uses small convolution kernels.

3. The architecture extends easily and generalizes well when transferred to other image datasets.

Core structure

The common variants are VGG11, VGG13, VGG16, and VGG19 (configurations A, B, D, and E in the paper); VGG16 and VGG19 are the most widely used. Taking configuration D as an example: the convolution kernels are kept small so that the network can be made deep enough, the network ends with three fully connected layers followed by softmax, and a ReLU activation follows every convolution layer and each of the first two fully connected layers.
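The four variants can be written down as simple layer lists. The sketch below is only illustrative: the dictionary keys and the helper name `count_weight_layers` are chosen here, integers stand for the output channels of a 3×3 convolution, and "M" stands for a 2×2 max-pooling layer.

```python
# Integers: output channels of a 3x3 convolution. "M": 2x2 max pooling.
# The names used here are illustrative, not taken from any library.
VGG_CONFIGS = {
    "VGG11": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG13": [64, 64, "M", 128, 128, "M", 256, 256, "M",
              512, 512, "M", 512, 512, "M"],
    "VGG16": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"],
    "VGG19": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
NUM_FC_LAYERS = 3  # every variant ends with three fully connected layers


def count_weight_layers(config):
    """Count layers with learnable weights: convolutions plus the FC layers."""
    conv_layers = sum(1 for item in config if item != "M")
    return conv_layers + NUM_FC_LAYERS


for name, config in VGG_CONFIGS.items():
    print(name, count_weight_layers(config))
```

The number in each model name is the count of weight layers; pooling layers are not counted.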

1. The training input is an RGB image of size 224×224.

2. The only preprocessing applied to the input image is mean subtraction. Generally speaking, when training a network, one subtracts either a fixed value of 128 or the mean computed over all pixels. Trained model parameters are usually floating-point numbers, and floating-point arithmetic is comparatively expensive, so a model that runs on a chip is usually converted to fixed-point arithmetic. Subtracting 128 from the RGB input (0 to 255) gives values from −128 to 127, which fit in an 8-bit signed integer (int8); this helps with model compression and fixed-point computation.
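As a small illustration of the fixed-offset case, the sketch below shifts 8-bit pixel values by 128 and checks that the result fits in a signed 8-bit range. The function names are made up for this example.

```python
def center_pixel(value, offset=128):
    """Shift an 8-bit pixel value (0..255) so that it is centered around zero."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit pixel value")
    return value - offset


def fits_int8(value):
    """True if the value is representable as a signed 8-bit integer."""
    return -128 <= value <= 127


shifted = [center_pixel(v) for v in (0, 128, 255)]
print(shifted)                              # [-128, 0, 127]
print(all(fits_int8(v) for v in shifted))   # True
```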

3. All hidden layers use ReLU, whose gradient is either 0 or 1 and is therefore cheap to compute. The LRN layer (Local Response Normalization) is dropped because it costs extra memory and time without improving performance much. ReLU remained the most common activation in later architectures.

4. Only 3×3 and 1×1 convolution kernels are used, each with a very small receptive field. Stacking several small kernels deepens the network while using fewer parameters than a single large kernel, which also reduces computation. The benefit is twofold: fewer parameters, and more nonlinear mappings, which increase the fitting/expressive capacity of the network.

5. A 1×1 convolution layer increases the nonlinearity of the decision function. A 1×1 convolution acts like a fully connected layer applied across the channels at each spatial position, so 1×1 convolutions can also be used to replace fully connected layers.

6. Two stacked 3×3 convolutions have a 5×5 receptive field, and three stacked 3×3 convolutions have a 7×7 receptive field.

6.1 The stack of three layers contains three ReLU layers instead of one, making the decision function more discriminative and more nonlinear.

6.2 Fewer parameters.

6.2.1 For example, if the input and output both have C channels, three stacked 3×3 convolution layers need 3×(3×3×C×C) = 27C² weights, whereas a single 7×7 convolution layer needs 7×7×C×C = 49C² weights.

6.2.2 This is equivalent to imposing a regularization (to prevent overfitting) on the 7×7 convolution by forcing it to decompose into three 3×3 convolutions. Machine learning is mainly about fitting the distribution of the samples; other things being equal, a simpler structure with fewer parameters tends to generalize better.
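The receptive-field and parameter arithmetic in point 6 can be checked with a few lines. This is a minimal sketch that assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the function names are illustrative.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 k x k convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weights (biases ignored) of num_layers k x k convolutions, C channels in and out."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 512
print(stacked_receptive_field(3, 2))    # 5
print(stacked_receptive_field(3, 3))    # 7
print(conv_weights(3, C, num_layers=3)) # 27 * C^2 = 7077888
print(conv_weights(7, C))               # 49 * C^2 = 12845056
```

The single 7×7 layer needs 49/27, or roughly 1.8 times, the weights of the three-layer 3×3 stack while covering the same receptive field.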

During training, the shallow configuration A of VGGNet is trained first, and its weights are then used to initialize the deeper, more complex configurations, which speeds up convergence.

7. A large number of channels

The first layer of the VGG network has 64 channels, and the number doubles after each max-pooling layer until it reaches a maximum of 512. As the number of channels increases, more information can be extracted.

8. Deeper layers and wider feature maps

Convolutions mainly expand the number of channels, while pooling (in essence a dimensionality reduction that removes redundant information) reduces the width and height. The architecture can therefore become deeper and wider while the growth in computation stays under control.
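To make the trade-off concrete, the sketch below traces the spatial size and channel count through the VGG16 layer list for a 224×224 RGB input. It assumes 3×3 convolutions with padding 1 (so they keep the spatial size) and 2×2 max pooling with stride 2; the function name `trace_shapes` is illustrative.

```python
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]


def trace_shapes(config, size=224, channels=3):
    """Return (spatial size, channels) after each max-pooling layer."""
    shapes = []
    for item in config:
        if item == "M":
            size //= 2        # 2x2 pooling with stride 2 halves width and height
            shapes.append((size, channels))
        else:
            channels = item   # padded 3x3 convolution changes only the channel count
    return shapes


print(trace_shapes(VGG16))
# [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```

The final 7×7×512 feature map is the input to the first fully connected layer.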

9. Converting fully connected layers to convolutions (test phase)

This is another characteristic of VGG. At test time, the three fully connected layers used in training are replaced by three convolution layers. The resulting fully convolutional network is no longer constrained by fixed-size fully connected layers, so it can accept input of any width or height, which is very useful in the test phase.

For example, where the 7×7×512 feature map is fully connected to a layer of 4096 neurons, the fully connected layer is replaced by a convolution with a 7×7 kernel and 4096 output channels; the two fully connected layers after it become 1×1 convolutions. This conversion follows the approach of OverFeat. As shown in the figure below, the network is applied to a 14×14 image. Once the 5×5 feature map is obtained, a fully connected layer would flatten it into a plain feature vector before connecting it, which destroys the spatial relationships in the feature map. With the fully convolutional form, the final result is instead a 1×1×C feature map, where C is the number of channels and equals the number of classes. If a 16×16 image is presented instead, the fully convolutional network produces a 2×2×C feature map, and the four values in each 2×2 channel can be reduced to a single value by taking the maximum or the average. Larger images give 3×3×C, 4×4×C, 5×5×C, and so on: the output size depends on the input size, but the output can always be reduced (for example by taking the maximum) to one score per class.
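The relation between input size and score-map size can be sketched for the VGG case. The code assumes square inputs, five 2×2 poolings, and a first fully connected layer converted to a 7×7 convolution without padding; both function names are made up here, and the score map is represented as nested Python lists purely for illustration.

```python
def dense_output_size(input_size, num_pools=5, first_fc_kernel=7):
    """Side length of the class score map for a square input."""
    feature_size = input_size // (2 ** num_pools)  # each pooling halves the size
    return feature_size - first_fc_kernel + 1      # 7x7 convolution without padding


def average_score_map(score_map):
    """Spatially average an H x W x C score map into C class scores."""
    cells = [cell for row in score_map for cell in row]
    num_classes = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(num_classes)]


print(dense_output_size(224))   # 1
print(dense_output_size(256))   # 2
print(dense_output_size(384))   # 6

# A 1 x 2 score map with two classes, averaged to one score per class.
print(average_score_map([[[1.0, 0.0], [3.0, 2.0]]]))   # [2.0, 1.0]
```

Inputs smaller than 224 give a non-positive size under these assumptions, which reflects that the converted network still needs at least a 7×7 feature map.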

Training methods

  • Optimization uses stochastic gradient descent with momentum (SGD with momentum 0.9), which simulates inertia by accumulating past gradients.
  • The batch size is 256; in practice the feasible batch size depends on GPU memory.
  • Regularization: L2 regularization (weight decay) with a coefficient of 5e-4.
  • Dropout: p = 0.5 on the first two fully connected layers.
  • Weight initialization matters, because a bad initialization can stall learning. Network A, which is shallow enough to be trained from random initialization, is trained first. For the deeper networks, the first four convolutional layers and the last three fully connected layers are then initialized with the weights of A, and the remaining layers are initialized randomly from a Gaussian distribution with mean 0 and standard deviation 0.01, with biases set to 0.
  • The learning rate starts at 10^−2 and is divided by 10 whenever the validation accuracy stops improving. It was decreased 3 times in total, and training stopped after 370,000 iterations (74 epochs).
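The update rule implied by these settings can be sketched for a single scalar weight. This is one common formulation of SGD with momentum and L2 weight decay, not necessarily the exact form used by the authors. The epoch check assumes a training set of 1,281,167 images (the ILSVRC-2012 training set), a number that does not appear in the text above.

```python
def sgd_momentum_step(weight, velocity, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One SGD step with momentum; weight decay is added to the gradient."""
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


def epochs_seen(iterations, batch_size, dataset_size):
    """Number of passes over the dataset after the given number of iterations."""
    return iterations * batch_size / dataset_size


w, v = sgd_momentum_step(weight=1.0, velocity=0.0, grad=0.5)
print(w, v)

# Assumed dataset size: 1,281,167 training images.
print(round(epochs_seen(370_000, 256, 1_281_167)))   # 74
```

With a batch size of 256, 370,000 iterations correspond to about 74 passes over a dataset of that size, which matches the epoch count quoted above.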

Training image size selection

S is the length of the shorter side of the rescaled training image (the training scale), and Q is the length of the shorter side of the test image (the test scale). The original image is rescaled isotropically so that S is at least 224, and 224×224 windows are then cropped at random from the rescaled image for training.

Single-scale training: S is fixed. Two classifiers are trained, one with S = 256 and one with S = 384; the S = 384 classifier is initialized with the weights of the S = 256 classifier.

Multi-scale training (scale jittering): a single classifier is trained. Each time an image is fed in, it is rescaled with the shorter side S sampled at random from [256, 512]. Scale jittering can be viewed as a form of training-set augmentation: objects in images appear at different sizes, so exposing the network to a range of scales is useful during training.
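The sampling procedure for multi-scale training can be sketched without any image library by computing only the geometry: the jittered scale, the rescaled image size, and the crop position. The function name, its return format, and the rounding of the longer side are assumptions made for this example.

```python
import random


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a jittered scale S and a random crop window for one image.

    Returns (S, new_width, new_height, left, top).
    """
    s = rng.randint(s_min, s_max)        # inclusive on both ends
    scale = s / min(width, height)       # isotropic rescale: shorter side becomes S
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)  # random position of the 224x224 window
    top = rng.randint(0, new_h - crop)
    return s, new_w, new_h, left, top


print(sample_training_crop(500, 375, rng=random.Random(0)))
```

For single-scale training, the same function can be called with s_min equal to s_max.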

Model evaluation

VGG points out that the test scale Q does not have to equal the training scale S, and uses the fully convolutional form of the network to perform dense evaluation.

Single-scale evaluation: the test image size Q is fixed. If S is fixed, Q = S; if S is jittered, Q = 0.5(Smin + Smax).

As can be seen from the table:

  • With network A-LRN, the authors found that the Local Response Normalization (LRN) layer used by AlexNet did not improve performance and only added memory and computation cost. The other configurations therefore contain no LRN layer.
  • Classification performance improves steadily as network depth increases (A, B, C, D, E). From the 11-layer A to the 19-layer E, the top-1 and top-5 error rates decrease clearly with depth.
  • Several small convolution kernels perform better than a single large one. The authors compared B with a shallower network outside the main set of configurations, in which each pair of 3×3 convolutions in B was replaced by a single 5×5 convolution; the shallower network performed worse.
  • Scale jitter (S ∈ [256; 512]) gives better results than training with a fixed shorter side (S = 256 or S = 384). This confirms that training-set augmentation through scale jitter does help capture multi-scale image statistics.

Multi-scale evaluation: the test image size Q is not fixed. For a model trained with scale jitter, Q = {Smin, 0.5(Smin + Smax), Smax}.

For a model trained with a fixed S, the authors evaluated over Q = {S − 32, S, S + 32}, values close to the training scale, because a large gap between training and test scales can degrade performance. The experimental results show that multi-scale evaluation is better than evaluating the same model at a single scale, and that training with scale jitter is better than training with a fixed shorter side S.
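The choice of test scales for both evaluation modes can be summarized in a small helper. This is a sketch: the function names are illustrative, a fixed training scale is passed as an integer, a jittered one as a (Smin, Smax) tuple, and integer division is used on the assumption that Smin + Smax is even.

```python
def single_scale_q(train_scale):
    """Test scale Q for single-scale evaluation."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return (s_min + s_max) // 2
    return train_scale


def evaluation_scales(train_scale):
    """Set of test scales Q for multi-scale evaluation."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return [s_min, (s_min + s_max) // 2, s_max]
    return [train_scale - 32, train_scale, train_scale + 32]


print(single_scale_q(384))             # 384
print(single_scale_q((256, 512)))      # 384
print(evaluation_scales(384))          # [352, 384, 416]
print(evaluation_scales((256, 512)))   # [256, 384, 512]
```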

Dense and multi-crop evaluation

Dense evaluation: the fully connected layers are replaced by convolution layers (the first FC layer becomes a 7×7 convolution layer, and the last two FC layers become 1×1 convolution layers). The network then produces a class score map, which is spatially averaged to give the prediction.

Multi-crop evaluation: multiple crops are taken from each image. Each of the three scales contributes 50 crops (a 5×5 regular grid with 2 flips), for a total of 150 crops. Each crop is passed through the network, and all results are averaged.

Multi-crop evaluation performs slightly better than dense evaluation, and the two methods are complementary, because their combination is better than either alone. The complementarity comes from different boundary conditions for the convolutions: when the ConvNet is applied to a crop, the feature maps are padded with zeros, whereas in dense evaluation the padding for the same crop comes naturally from the neighbouring parts of the image (through both the convolutions and the spatial pooling), which substantially increases the receptive field of the whole network and captures more context.

Because the fully convolutional network is applied to the whole image, there is no need to sample multiple crops at test time; multi-crop evaluation is less efficient because the network has to be recomputed for each crop. On the other hand, using a large number of crops can improve accuracy, because it samples the input image more finely than the fully convolutional network does.
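The crop count can be reproduced with a small sketch that lays out a 5×5 regular grid of crop positions and adds a horizontally flipped copy of each. The even spacing of the grid and the function name are assumptions; the text above does not specify how the grid positions are chosen.

```python
def grid_crops(width, height, crop=224, grid=5):
    """Top-left corners of a grid x grid lattice of crops, each with and without flip."""
    xs = [round(i * (width - crop) / (grid - 1)) for i in range(grid)]
    ys = [round(j * (height - crop) / (grid - 1)) for j in range(grid)]
    return [(x, y, flipped) for y in ys for x in xs for flipped in (False, True)]


crops = grid_crops(256, 256)
print(len(crops))       # 50 crops at one scale

# Three test scales, assuming square rescaled images for simplicity.
total = sum(len(grid_crops(q, q)) for q in (256, 384, 512))
print(total)            # 150
```

Each of the 150 crops requires a separate forward pass, which is the efficiency cost of multi-crop evaluation compared with a single dense pass per scale.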

Experimental conclusions

  • Local Response Normalization (A-LRN) does not improve the performance of network A
  • The classification error decreases as depth increases; features from deeper networks are less affected by variation in the samples and are more robust
  • Data augmentation: scale jitter during training and multi-scale evaluation at test time. Further training-time operations such as scale changes, jitter, changes to individual color channels, occlusion, and cropping can also be added to make the model more robust
  • Multi-crop evaluation is slightly better than dense evaluation, and combining the two is better still
  • A multi-scale approach is used for both training and prediction. It effectively increases the amount of training data, helps prevent overfitting, and improves prediction accuracy
  • Multiple small convolution kernels perform better than a single large one

Summary

1. Increasing depth effectively improves performance.

2. One of the best models, VGG16, uses only 3×3 convolutions and 2×2 pooling from beginning to end, which is simple and elegant.

3. Convolutions can replace fully connected layers, which lets the network adapt to images of various sizes.
