Inception series: Inception_v1

MobileNet series: MobileNet_v2

MobileNet series: MobileNet_v3


Quote:

Inception_v2 and Inception_v3 come from the same paper; the paper that proposed BN is not the Inception_v2 paper. The distinction between the two is that the paper "Rethinking the Inception Architecture for Computer Vision" describes a set of design and improvement techniques: Inception_v2 uses some of these structures and improvements, while Inception_v3 uses all of them.

Principles of model design

The Inception_v1 structure is very complex and hard to modify: arbitrary changes to it can easily destroy part of its computational advantage. At the same time, the Inception_v1 paper does not describe in detail the factors behind each design decision, which makes it difficult to adapt the architecture to new applications. For this reason, the Inception_v2 paper lays out the following basic design principles in detail and proposes new structures based on them.

1. Avoid representational bottlenecks, especially early in the network. The representation size of a feed-forward network should decrease gently from input to output; a bottleneck occurs when the size drops sharply at some layer instead.

2. Higher-dimensional representations are easier to process locally within a network. Increasing the number of activations makes it easier to disentangle features and speeds up training. (This principle means that the higher the dimension of a representation, the better suited it is to network processing; for example, data that cannot be separated in a two-dimensional plane becomes tractable when mapped to a higher-dimensional space. More activations make it easier for the network to learn the underlying features.)

3. Spatial aggregation can be performed over lower-dimensional embeddings without losing much representational power. For example, before a more spread-out convolution (e.g., 3×3) is performed, the dimension of the input representation can be reduced without serious adverse effects. We assume the reason is that, if the output is used in a spatial aggregation context (middle and high layers), the strong correlation between adjacent units means much less information is lost during dimension reduction. Since these signals should be easy to compress, reducing the dimension can even lead to faster learning.

4. Balance the width and depth of the network. Optimal performance is reached by balancing the number of filters per stage against the depth of the network. Increasing both width and depth can improve the quality of the network, but for a constant computational budget the best improvement comes from increasing both in parallel. The computational budget should therefore be distributed in a balanced way between depth and width.

Some special structures

01 Convolution Decomposition

A 5×5 convolution can be replaced by two consecutive 3×3 convolutions, where the first is an ordinary 3×3 convolution and the second acts as a fully connected layer over the 3×3 outputs of the first. The advantage is that the 5×5 receptive field is preserved while the parameter count drops to 2×9/25 of the original, about 28% smaller. The details are shown in Figure 1 (left) below. Going further, a 3×3 convolution can be decomposed asymmetrically into a 3×1 and a 1×3 convolution, as shown in Figure 2 (right).
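
As a quick sanity check of these parameter savings, here is a minimal PyTorch sketch (the channel count of 64 is an arbitrary assumption, not from the paper):

```python
import torch.nn as nn

# A 5x5 convolution...
conv5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2)

# ...replaced by two stacked 3x3 convolutions with the same 5x5 receptive field.
stacked3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

# Asymmetric decomposition: a 3x3 convolution split into 1x3 followed by 3x1.
asym3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(conv5x5))     # 64*64*25 + 64 = 102,464
print(n_params(stacked3x3))  # 2*(64*64*9 + 64) = 73,856, i.e. ~28% fewer weights
print(n_params(asym3x3))     # 2*(64*64*3 + 64) = 24,704 vs 36,928 for one 3x3
```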

Thus the original Inception structure (Figure 3, left) can be transformed into the structures shown below in Figure 5 (middle) and Figure 6 (right).

Finally, a structure combining the two decomposition methods is derived, as shown in Figure 7 below.
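
To make the module-level transformation concrete, here is a minimal PyTorch sketch of an Inception module in which the 5×5 branch has been replaced by two stacked 3×3 convolutions; the branch layout follows the figures, but the channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FactorizedInception(nn.Module):
    """Inception module sketch: the former 5x5 branch is two stacked 3x3 convs."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)            # 1x1
        self.branch2 = nn.Sequential(                     # 1x1 -> 3x3
            nn.Conv2d(in_ch, 48, 1),
            nn.Conv2d(48, 64, 3, padding=1),
        )
        self.branch3 = nn.Sequential(                     # 1x1 -> 3x3 -> 3x3 (was 5x5)
            nn.Conv2d(in_ch, 64, 1),
            nn.Conv2d(64, 96, 3, padding=1),
            nn.Conv2d(96, 96, 3, padding=1),
        )
        self.branch4 = nn.Sequential(                     # pool -> 1x1
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 64, 1),
        )

    def forward(self, x):
        # All branches keep the spatial size, so outputs concatenate on channels.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```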

In practice, this kind of decomposition does not work well in the lower layers of the network. It works well on feature maps of medium size (M×M feature maps, where M is in the range of 12 to 20). In keeping with the second principle, such Inception structures are therefore placed in the middle layers of the network, while ordinary convolutional structures are kept in the lower layers.

02 Utility of auxiliary classifiers

The auxiliary classifiers play no role in the early stage of training; only late in training does the network with them begin to surpass the accuracy of the network without them, reaching a slightly higher plateau. Moreover, there is no adverse effect after removing these auxiliary classifiers, so the idea in Inception_v1 that they help the lower layers train faster is questionable. The main classifier performs better when the auxiliary branch includes BN or Dropout, which is weak evidence that BN acts as a regularizer.
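
A rough PyTorch sketch of such a BN-regularized auxiliary head; the layer sizes here are assumptions for illustration, not the exact Inception head:

```python
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary head attached to an intermediate feature map.
    BN on the branch is the 'BN-auxiliary' regularizing ingredient."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.AvgPool2d(5, stride=3),
            nn.Conv2d(in_ch, 128, kernel_size=1),
            nn.BatchNorm2d(128),            # regularizes the branch
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.head(x)
```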

03 Reduce grid size efficiently

There are two straightforward ways to reduce the grid size, as shown above. The one on the left violates the first principle, that the representation size should decrease gradually layer by layer; shrinking it abruptly creates a bottleneck. The one on the right satisfies the first principle, but its computational cost is much higher. To this end, the author proposes the new approach shown in Figure 10 below: a convolution with stride 2 and a pooling operation with stride 2 run in parallel and their outputs are concatenated, which reduces the grid size without violating the first principle.
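
A minimal PyTorch sketch of this idea; the channel counts are assumptions, and the real Inception_v2 reduction blocks use deeper convolution branches:

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """Grid-size reduction: parallel stride-2 convolution and pooling branches,
    concatenated on the channel axis (the Figure 10 idea)."""
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # No padding, matching the article's note that padding is not used.
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        # Both branches halve the spatial size; concatenation widens the channels,
        # so the representation shrinks without an abrupt bottleneck.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)
```

For example, ReductionBlock(288, 320) applied to a 35×35 input yields a 17×17 output with 288 + 320 channels.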

The complete Inception_v2 structure diagram is as follows:

Padding is not used anywhere in the structure; the Figure 10 structure proposed above is used between Inception modules.

04 Model regularization via label smoothing

If during training the model learns to put all of the probability mass on the ground-truth label, i.e. to push the largest logit as far away from the other logits as possible, it intuitively becomes very confident in its predictions. But this leads to over-fitting, and generalization cannot be guaranteed. Label smoothing is therefore necessary.

Here δ(k,y) is the Dirac delta, which equals 1 when k = y and 0 otherwise. The original label distribution is q(k|x) = δ(k,y). The smoothed label distribution becomes:

q'(k|x) = (1 - ε)·δ(k,y) + ε·u(k)

Here ε is a small hyperparameter, u(k) is the uniform distribution 1/K, and K is the number of classes. That is, with three classes the new label vector becomes (ε/3, ε/3, 1 - 2ε/3) instead of the original (0, 0, 1).
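
A small PyTorch sketch of this smoothing (the helper name smooth_labels is mine):

```python
import torch

def smooth_labels(targets, num_classes, eps=0.1):
    """Replace one-hot targets q(k|x) = delta(k,y) with
    (1 - eps) * delta(k,y) + eps / K."""
    one_hot = torch.zeros(targets.size(0), num_classes)
    one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    return (1.0 - eps) * one_hot + eps / num_classes

# Three-class example: the hard label (0, 0, 1) becomes
# (eps/3, eps/3, 1 - 2*eps/3).
print(smooth_labels(torch.tensor([2]), num_classes=3, eps=0.1))
# tensor([[0.0333, 0.0333, 0.9333]])
```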

Conclusion

To restate the difference between Inception_v2 and Inception_v3: Inception_v2 refers to a network whose Inception modules use one or more of Label Smoothing, BN-auxiliary, RMSProp, or factorized convolutions; Inception_v3 refers to the network in which all of these techniques are used.

This article comes from the Technical Summary series of the public account CV Technical Guide.
