Computer Vision - GoogLeNet V3

A cloud drifts into the sky, quietly watching you from outside the window.

gossip

I will organize the relevant materials and put them on my GitHub. Stars, questions, and corrections are welcome, and I look forward to your participation.

preface

Rethinking the Inception architecture in computer vision.

review

  1. GoogLeNet-V1 mainly adopts multi-scale convolution kernels, 1×1 convolutions, and auxiliary loss functions.
  2. GoogLeNet-V2 adds BN layers on top of V1 and replaces large convolution kernels with small ones.

GoogLeNet-V1 adopts multi-scale convolution kernels, 1×1 convolutions, and auxiliary loss functions to build a deeper, 22-layer convolutional neural network, which won the ILSVRC-2014 classification and detection championships and was runner-up in localization.

GoogLeNet-V2, building on V1, adds BN layers and completely replaces the 5×5 convolution with a stack of two 3×3 convolutions, further improving the performance of the model.

The VGG network is large, has many parameters, and requires heavy computation, so it is poorly suited to real-world deployment.

GoogLeNet requires much less computation than VGG and can be used in resource-constrained scenarios.

Rethinking the Inception Architecture for Computer Vision

Research Significance:

  1. The model design criteria are summarized to provide reference for convolutional neural network model design.
  2. Three techniques are proposed, establishing the most commonly used model of the Inception series: Inception V3.

Paper details

The highlights of this paper are as follows: 1. Convolution decomposition is proposed to improve efficiency.

GoogLeNet’s Inception architecture is also designed to perform well even under tight constraints on memory and computing budgets. GoogLeNet, for example, uses only 5 million parameters, compared with the 60 million used by its predecessor AlexNet, which represents a 12-fold reduction. In addition, VGGNet uses 3 times more parameters than AlexNet.

Abstract:

  1. Background: Since 2014, deep convolutional neural networks have become mainstream and achieved excellent results on multiple tasks;
  2. Problem: Current high-accuracy convolutional neural networks have many parameters and a large computational cost, making them difficult to deploy in practice;
  3. Solution: Factorized convolutions and regularization strategies are proposed to improve the speed and accuracy of deep convolutional neural networks;
  4. Results: single-crop, 5.6% top-5 error; multi-crop, 3.5% top-5 error.

Large convolution kernels are decomposed into stacks of small convolution kernels: a small network replaces the 5×5 convolution.

Decoupling:

  • Accelerate training;
  • With fewer parameters, you can use more convolution kernels;

Decomposing into smaller convolutions:

  1. Small convolution kernels require little computation;
  2. Large convolution kernels have a large receptive field and can capture more information;
  3. Small convolution kernels may reduce expressive ability.

Convolutions with large spatial filters (e.g. 5×5 or 7×7) tend to be disproportionately expensive computationally. For example, a 5×5 convolution is 25/9 ≈ 2.78 times more expensive than a 3×3 convolution with the same number of filters. Of course, a 5×5 filter can capture dependencies between activations of units that are farther apart in the earlier layers, so reducing the geometric size of the filter comes at a cost in expressiveness.

If we zoom in on the computation of a 5×5 convolution, each output looks like a small fully connected network sliding over 5×5 tiles of its input (Figure 1 above). Since we are building a vision network, it seems natural to exploit translation invariance again and replace the fully connected component with a two-layer convolutional architecture: the first layer is a 3×3 convolution, and the second layer is a fully connected layer on top of the 3×3 output grid of the first layer (Figure 1 above). Sliding this small network over the input activation grid boils down to replacing the 5×5 convolution with two layers of 3×3 convolution (Figures 4 and 5 above).
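
To make the saving concrete, here is a minimal PyTorch sketch (PyTorch and the 64-channel sizes are assumptions for illustration, not from this post) comparing a 5×5 convolution with the stack of two 3×3 convolutions covering the same receptive field:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64  # illustrative channel counts

# Single 5x5 convolution: 5*5*in_ch*out_ch weights.
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2, bias=False)

# Factorized version: two stacked 3x3 convolutions cover the same 5x5 receptive
# field with 2*3*3*in_ch*out_ch weights, i.e. 18/25 = 72% of the cost.
conv3x3_stack = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, in_ch, 17, 17)
print(conv5x5(x).shape, conv3x3_stack(x).shape)           # same spatial size: 17x17
print(sum(p.numel() for p in conv5x5.parameters()),       # 102400
      sum(p.numel() for p in conv3x3_stack.parameters())) # 73728
```

With equal channel counts the factorized stack uses about 28% fewer weights, in line with the 25/9 comparison above.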

  1. Can a 3×3 convolution be decomposed further, for example into 2×2 convolutions? In practice, 3×1 followed by 1×3 works better;
  2. The asymmetric decomposition and the 2×2 decomposition reduce computation by 33% and 11%, respectively.

By using asymmetric convolutions such as n×1, we can do better than 2×2. For example, a 3×1 convolution followed by a 1×3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3×3 convolution (see Figure 3). If the number of input and output filters is equal, the two-layer solution is 33% cheaper for the same number of output filters. By comparison, decomposing the 3×3 convolution into two 2×2 convolutions saves only 11% of the computation.
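
A hedged PyTorch sketch of this asymmetric factorization (the channel counts and the 17×17 input are illustrative): an n×1 convolution followed by a 1×n convolution covers the same receptive field as an n×n convolution, and for n = 3 with equal channel counts the per-position weight cost drops from 9 to 6, which is the 33% saving mentioned above.

```python
import torch
import torch.nn as nn

def factorized_nxn(in_ch: int, out_ch: int, n: int = 7) -> nn.Sequential:
    """Replace an nxn convolution with nx1 followed by 1xn (same receptive field)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0), bias=False),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2), bias=False),
        nn.ReLU(inplace=True),
    )

# 17x17 grid: the range of sizes where the 1x7 / 7x1 factorization is applied.
x = torch.randn(1, 192, 17, 17)
print(factorized_nxn(192, 192, n=7)(x).shape)  # torch.Size([1, 192, 17, 17])
```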

  1. Decomposing on the early layers does not work well;
  2. It works well on medium grid sizes, with feature maps between 12×12 and 20×20;
  3. The best setting there is 1×7 followed by 7×1 convolutions.

A control experiment between two Inception models: one uses factorization into linear + ReLU layers, the other uses two ReLU layers. After 3.86 million operations, the former settles at 76.2%, while the latter reaches 77.2% top-1 accuracy on the validation set.

A mini-network replacing the 3×3 convolution. The lower layer of this network consists of a 3×1 convolution with 3 output units.

Utility of auxiliary classifiers

  1. Auxiliary classifiers do not accelerate convergence early in training;
  2. Before the models approach convergence, training proceeds at the same speed with or without the auxiliary classifier;
  3. Near the end of training, the network with the auxiliary classifier overtakes the one without it and reaches slightly higher accuracy.

This architecture is used on the coarsest (8×8) grids to promote high-dimensional representations. We use this solution only on the coarsest grid, since that is where producing a high-dimensional sparse representation is most critical, as the ratio of local processing (1×1 convolutions) is increased relative to spatial aggregation.

The hypothesis from V1 that the auxiliary classifiers help low-level feature extraction turns out to be incorrect.

This paper argues that the auxiliary classifier instead plays a regularizing role. If the side branch is batch-normalized or has a dropout layer, the network's main classifier performs better. This also provides weak support for the hypothesis that batch normalization acts as a regularizer.
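
A rough sketch of such a BN-equipped auxiliary head is shown below; the layer sizes (768 input channels on a 17×17 grid, a 128-channel 1×1 reduction, a 1024-unit hidden layer) follow the usual Inception-v3 layout but are assumptions of this sketch, not details quoted from this post.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary classification head with BatchNorm, acting mainly as a regularizer."""

    def __init__(self, in_ch: int = 768, num_classes: int = 1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(128)                      # BN inside the side branch
        self.fc1 = nn.Linear(128 * 5 * 5, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(x)
        x = torch.relu(self.bn(self.conv(x)))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# The auxiliary logits are used only during training, added to the main loss with a small weight.
print(AuxClassifier()(torch.randn(2, 768, 17, 17)).shape)  # torch.Size([2, 1000])
```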

The left figure shows the traditional pooling approach, which loses information from the feature maps. The right figure expands the feature maps before pooling; the problem there is that the computational cost is too high.

Solution: half of the feature maps are obtained by stride-2 convolution and half by pooling, and the two halves are then concatenated.

Note: this Inception module is used for the reductions from 35×35 down to 17×17 and from 17×17 down to 8×8.

An Inception module that reduces the grid size while expanding the filter bank. It is both cheap and avoids the representational bottleneck. The right figure shows the same solution, but from the perspective of grid sizes rather than operations.
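
A minimal sketch of this parallel reduction (the channel sizes are illustrative, chosen to match the 35×35 to 17×17 step): a stride-2 convolution branch and a stride-2 pooling branch run in parallel and their outputs are concatenated along the channel axis, so the spatial size halves while the filter bank grows.

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """Grid-size reduction: stride-2 convolution and stride-2 pooling in parallel, then concat."""

    def __init__(self, in_ch: int = 288, conv_ch: int = 384):
        super().__init__()
        self.conv_branch = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2, bias=False)
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output has in_ch + conv_ch channels at half the spatial resolution.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 288, 35, 35)
print(ReductionBlock()(x).shape)  # torch.Size([1, 672, 17, 17])
```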

Experiments

Comparison of recognition performance with different receptive field sizes but constant computational cost.

  1. 299×299 receptive field, with stride 2 and max pooling after the first layer;
  2. 151×151 receptive field, with stride 1 and max pooling after the first layer;
  3. 79×79 receptive field, with stride 1 and no pooling after the first layer.

Starting from V2, each variant adds one new trick on top of the previous model; the final variant, with all the tricks combined, is called Inception-v3.

Single-model, multi-crop results showing the cumulative effect of the different factors. The numbers are compared with the best published single-model inference results on the ILSVRC 2012 classification benchmark.

Ensemble evaluation comparing multi-model, multi-crop reported results. The numbers are compared with the best published ensemble inference results on the ILSVRC 2012 classification benchmark. All results except the reported top-5 ensemble result are on the validation set; the ensemble yielded a 3.46% top-5 error on the validation set.

Paper summary

Major improvements of Inception-V3:

  1. The RMSProp optimizer is adopted (a training-setup sketch follows this list);
  2. Label smoothing regularization is used;
  3. Asymmetric convolutions are used to extract features on the 17×17 feature maps;
  4. The auxiliary classifier uses BN.
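
For reference, here is a hedged sketch of what this training setup could look like with PyTorch/torchvision. The specific hyperparameter values (learning rate 0.045, RMSProp decay 0.9, epsilon 1.0, learning-rate decay of 0.94 every two epochs, gradient clipping at 2.0) are the values usually cited for the paper's setup; they are not stated in this post, so treat them as assumptions.

```python
import torch
import torchvision

# Placeholder model: torchvision's Inception-v3 with the auxiliary head enabled.
model = torchvision.models.inception_v3(weights=None, aux_logits=True)

# RMSProp optimizer; alpha is the moving-average (decay) coefficient.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045, alpha=0.9, eps=1.0)
# Decay the learning rate by a factor of 0.94 every two epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)

# Inside the training loop (sketch only):
#   loss = main_loss + aux_weight * aux_loss   # aux_weight is illustrative
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
#   optimizer.step(); optimizer.zero_grad()
# and once per epoch: scheduler.step()
```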

Key points:

  1. Asymmetric convolution decomposition: reduces parameters and computation, and provides a new idea for convolutional structure design;
  2. Efficient feature-map downsampling strategy: use stride-2 convolution and pooling in parallel to avoid an information-representation bottleneck;
  3. Label smoothing: prevents the network from becoming over-confident and reduces overfitting (a minimal sketch follows this list).
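
A minimal sketch of label smoothing, shown here with PyTorch's built-in option (the framework choice and the batch of random logits are assumptions for illustration); the manual construction below just makes the smoothed target distribution explicit.

```python
import torch
import torch.nn as nn

num_classes, eps = 1000, 0.1
criterion = nn.CrossEntropyLoss(label_smoothing=eps)

logits = torch.randn(4, num_classes)
targets = torch.randint(0, num_classes, (4,))
loss = criterion(logits, targets)

# Equivalent smoothed target distribution: the true class gets 1 - eps + eps/K,
# every other class gets eps/K = 0.1 / 1000 = 1e-4 (the "probability of non-target labels").
smoothed = torch.full((4, num_classes), eps / num_classes)
smoothed[torch.arange(4), targets] = 1 - eps + eps / num_classes
print(loss.item(), smoothed.sum(dim=1))  # each smoothed row sums to 1
```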

Inspiration:

  1. CNN classification is the basis of CNN vision tasks: a CNN that performs well on classification usually performs well on other vision tasks;
  2. Many of Google's papers derive their best settings from extensive experiments that are difficult for ordinary practitioners to reproduce;
  3. Asymmetric convolution decomposition works well for feature maps with resolution between 12×12 and 20×20; there, features are extracted with 1×7 and 7×1 convolutions.
  4. Early in training, adding the auxiliary classification layer does not accelerate convergence; late in training, it does speed up convergence.
  5. Removing the first of the two auxiliary classification layers does not affect network performance.
  6. The label smoothing parameter is set so that each non-target label receives a probability of about 10^-4 (ε = 0.1 with 1000 classes gives ε/K = 10^-4).