v1: Going deeper with convolutions


Inception V1 mainly proposes the Inception module structure (combining 1*1, 3*3, 5*5 convolutions and 3*3 pooling in parallel). Its biggest highlight is the introduction of the 1*1 convolution, borrowed from NIN (Network in Network). The structure, representative of GoogLeNet, is shown in the figure below.

Assume the output of the previous layer is 28*28*192, and let A be the naive Inception module (without 1*1 reductions) and B the module with 1*1 reductions. Then:

Weights of A: 1*1*192*64 + 3*3*192*128 + 5*5*192*32 = 387072

Output feature map size of A: 28*28*64 + 28*28*128 + 28*28*32 + 28*28*192 = 28*28*416

Weights of B: 1*1*192*64 + (1*1*192*96 + 3*3*96*128) + (1*1*192*16 + 5*5*16*32) + 1*1*192*32 = 163328

Output feature map size of B: 28*28*64 + 28*28*128 + 28*28*32 + 28*28*32 = 28*28*256

One cannot help but appreciate the genius of 1*1 convolution. From the numbers above, the weights are reduced on the one hand, and the output dimension is reduced on the other.
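As a quick sanity check, the two weight counts can be reproduced in a few lines of Python (the branch widths are the ones from the 28*28*192 example above):

```python
# Sanity check of the weight counts above (28*28*192 input).
# A: naive module (no 1*1 reductions); B: module with 1*1 reductions.
in_ch = 192

# A: 1*1 -> 64, 3*3 -> 128, 5*5 -> 32 (the pooling branch has no weights)
weights_a = 1*1*in_ch*64 + 3*3*in_ch*128 + 5*5*in_ch*32
assert weights_a == 387072

# B: 1*1 reductions to 96 (before the 3*3) and 16 (before the 5*5),
# plus a 1*1 -> 32 conv after the 3*3 pooling branch
weights_b = (1*1*in_ch*64
             + (1*1*in_ch*96 + 3*3*96*128)
             + (1*1*in_ch*16 + 5*5*16*32)
             + 1*1*in_ch*32)
assert weights_b == 163328
```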

Highlights of Inception V1 are summarized as follows:

(1) 1*1 convolution shares a common property of convolutional layers: it can reduce or increase the dimension along the channel direction, depending on the number of filters. In Inception V1, 1*1 convolution is used for dimension reduction, cutting both the weights and the feature map depth.

(2) The unique function of 1*1 convolution: since each kernel has only a single spatial weight per channel, it effectively applies a learned scaling and linear recombination to the original feature map, which helps improve recognition accuracy.

(3) Increase the depth of the network

(4) Increase the width of the network

(5) Using 1*1, 3*3, and 5*5 convolutions in parallel also increases the network's adaptability to scale (a minimal module sketch follows this list).
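To make these points concrete, here is a minimal PyTorch sketch of an Inception V1 module with 1*1 reductions, using the branch widths from the 28*28*192 example above (ReLUs omitted for brevity; the class name is mine, not from the paper):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception V1 module: four parallel branches concatenated on channels."""
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)      # 1*1 branch
        self.b2 = nn.Sequential(                           # 1*1 reduce, then 3*3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(                           # 1*1 reduce, then 5*5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(                           # 3*3 pool, then 1*1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 256, 28, 28]), i.e. 28*28*256
```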

The following figure shows the network structure of GoogLeNet:

There are two things to note here:

(1) To help the network converge, there are three losses in total (one main classifier plus two auxiliary classifiers).

(2) Global average pooling is used before the last fully connected layer, and there is still plenty of room to exploit this design.



v2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Inception V2, whose representative change is the addition of the Batch Normalization (BN) layer, uses an improved GoogLeNet in which two 3*3 convolutions replace one 5*5 convolution.

Highlights of Inception V2 are summarized as follows:

(1) A BN layer is added to reduce Internal Covariate Shift (the changing distribution of internal neuron activations). Each layer's output is normalized toward an N(0, 1) Gaussian (then scaled and shifted by learned parameters), which increases the robustness of the model, allows training at a larger learning rate with faster convergence, makes initialization less critical, and, acting as a regularizer, reduces the need for Dropout.
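A minimal PyTorch sketch of the two V2 changes, a Conv-BN-ReLU block and two stacked 3*3 convolutions standing in for one 5*5 (the helper name and channel counts are placeholders of mine):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k):
    """Conv followed by BN: each layer's pre-activation is normalized."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),        # normalize, then learned scale and shift
        nn.ReLU(inplace=True))

# Two stacked 3*3 convs cover the same receptive field as one 5*5,
# with 2 * (3*3) = 18 weights per channel pair instead of 25.
five_by_five_replacement = nn.Sequential(
    conv_bn_relu(192, 192, 3),
    conv_bn_relu(192, 192, 3))
```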




v3: Rethinking the Inception Architecture for Computer Vision

Inception V3, built primarily on V2, proposes the factorization of convolutions (Factorization) for GoogLeNet.

Highlights of Inception V3 are summarized as follows:

(1) Decompose 7*7 into two one-dimensional convolutions (1*7, 7*1), and likewise 3*3 into (1*3, 3*1). This not only speeds up computation (the saved compute can be used to deepen the network) but also splits one convolution into two, further increasing depth and nonlinearity (a minimal sketch follows this list). The 35*35 / 17*17 / 8*8 modules are designed more carefully.

(2) The network input size is increased from 224*224 to 299*299.
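A minimal sketch of the factorization from point (1), with placeholder channel counts:

```python
import torch.nn as nn

# One 7*7 convolution factorized into 1*7 followed by 7*1:
# 7 + 7 = 14 weights per channel pair instead of 49, and two
# convolutions (hence two nonlinearities) instead of one.
factored_7x7 = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
    nn.Conv2d(192, 192, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True))
```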

 

v4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Inception V4 mainly uses residual connections (Residual Connection) to improve the V3 structure, represented by Inception-ResNet-v1, Inception-ResNet-v2, and Inception-v4.

The residual structure of ResNet is shown below; it is cleverly designed and truly inspired. The original input and the feature map produced by two convolutions are combined by an element-wise (Eltwise) sum. The improvement in Inception-ResNet is to replace the convolution branch in the ResNet shortcut structure with an Inception module followed by a 1*1 conv (to match channels).
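A minimal sketch of that element-wise shortcut; `block` here is a stand-in for either two plain convolutions (ResNet) or an Inception module branch (Inception-ResNet):

```python
import torch.nn as nn

class Residual(nn.Module):
    """y = x + F(x): the shortcut joins input and branch by an Eltwise sum."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # input and branch output must have identical shapes for the sum
        return x + self.block(x)
```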

Highlights of Inception V4 are summarized as follows:

(1) Combining the Inception module with residual connections, Inception-ResNet-v1 and Inception-ResNet-v2 are proposed; training converges faster and reaches higher accuracy.






(2) A deeper, pure-Inception version, Inception-v4, is designed, achieving an effect comparable to Inception-ResNet-v2.

(3) The network input size is the same as V3, still 299*299

Aggregated Residual Transformations for Deep Neural Networks

This paper presents an upgraded version of ResNet. Distinct from the channel and spatial dimensions, cardinality refers to the number of parallel transformation branches in a ResNeXt module. The final conclusions are:

(1) Increasing cardinality is more effective than increasing the width or depth of the model.

(2) Compared with ResNet, ResNeXt has fewer parameters, better accuracy, and a simpler, more convenient design.

The left figure shows a module of ResNet and the right figure a module of ResNeXt, which follows a split-transform-merge idea.
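Since the split-transform-merge branches are mathematically equivalent to one grouped convolution, a ResNeXt block can be sketched in PyTorch as below (cardinality 32 and the 256/128 widths follow the paper's standard example; the class name is mine):

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Split-transform-merge expressed as one grouped 3*3 convolution."""
    def __init__(self, ch=256, width=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, width, kernel_size=1),
            nn.ReLU(inplace=True),
            # groups=cardinality: 32 parallel 4-channel transformations
            nn.Conv2d(width, width, kernel_size=3, padding=1, groups=cardinality),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))
```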

Xception: Deep Learning with Depthwise Separable Convolutions

This paper proposes Xception (Extreme Inception) based on Inception V3. The basic idea is the depthwise separable convolution operation. It ultimately achieves:

(1) A slight decrease in model parameters, as shown in the figure below:

(2) An accuracy improvement over Inception V3; the ImageNet accuracy is shown in the figure below:

First, a convolution operation mainly performs two transformations:

(1) Spatial dimensions

(2) Channel dimension

Xception works on these two transformations. The differences between Xception and Inception V3 are as follows:

(1) The order of the convolution operations

In Inception V3, a 1*1 convolution comes first, followed by a 3*3 convolution, so channels are mixed first, i.e., channel convolution followed by spatial convolution. Xception is the opposite: the 3*3 spatial convolution comes first, followed by the 1*1 channel convolution.

(2) The presence or absence of ReLU

This difference is the most significant: Inception V3 applies a ReLU after each operation within the module, whereas Xception applies no ReLU between the depthwise and pointwise operations.
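A minimal sketch of the two orderings, with placeholder channel counts:

```python
import torch.nn as nn

ch = 128

# Inception-V3 style: channel mixing (1*1) first, ReLU in between, then spatial (3*3)
inception_order = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(ch, ch, kernel_size=3, padding=1))

# Xception style: spatial depthwise 3*3 first, then channel 1*1, no ReLU between
xception_order = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch),  # depthwise
    nn.Conv2d(ch, ch, kernel_size=1))                        # pointwise
```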

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

 

MobileNets is an application of the Xception idea. The difference is that the Xception paper focuses on improving accuracy, while MobileNets focuses on compressing the model while maintaining accuracy.

 

Depthwise separable convolutions: the idea is to decompose a standard convolution into a depthwise convolution plus a pointwise (1*1) convolution. It is simply a factorization, analogous to factorizing a matrix.

The difference between a traditional convolution and a depthwise separable convolution is as follows:

Assume the input feature map size is DF * DF with M channels, the filter size is DK * DK with N output channels, and assume the padding is 1 and the stride is 1. Then:

For the original convolution operation, the number of multiply-accumulate operations required is DK · DK · M · N · DF · DF, and the number of kernel parameters is DK · DK · M · N.

For the depthwise separable convolution, the number of operations is DK · DK · M · DF · DF + M · N · DF · DF, and the number of kernel parameters is DK · DK · M + N · M.

Since convolution is mainly a process of shrinking the spatial dimensions while growing the channel dimension, i.e., N > M, we have DK · DK · N · M > DK · DK · M + N · M. The ratio of the separable cost to the standard cost works out to 1/N + 1/DK², so with DK = 3 the computation drops by roughly 8 to 9 times.
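Plugging in concrete numbers (DF = 14, M = 256, N = 512, DK = 3 are example values of mine):

```python
# Cost comparison: standard convolution vs depthwise separable convolution.
DF, M, N, DK = 14, 256, 512, 3  # example feature map size, channels, kernel size

standard_macs  = DK * DK * M * N * DF * DF
separable_macs = DK * DK * M * DF * DF + M * N * DF * DF

standard_params  = DK * DK * M * N
separable_params = DK * DK * M + N * M

print(separable_macs / standard_macs)      # ~0.113, i.e. 1/N + 1/DK**2
print(separable_params / standard_params)  # same ratio
```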

As a result, depthwise separable convolution compresses both the model size and the amount of computation, making the model faster and cheaper with only a small accuracy trade-off. As shown in the figure below, the horizontal axis is MACs (multiply-accumulate operations) and the vertical axis is accuracy.

In Caffe, depthwise convolution is implemented mainly through the group parameter of the convolutional layer, and the baseline model is about 16 MB in size.

The MobileNet network structure is as follows:



ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

 

This article makes one major improvement on MobileNet:

MobileNet makes only the 3*3 convolution depthwise, while the 1*1 convolution is still a traditional (dense) convolution with a lot of redundancy. ShuffleNet adds shuffle and group operations on this basis, realizing channel shuffle and pointwise group convolution, and ultimately improves both speed and accuracy relative to MobileNet.

As shown in the picture below,

(a) is the original MobileNet-style framework: there is no information exchange between the groups.

(b) The feature maps are shuffled across groups.




The basic idea of shuffle is as follows. Assume the input has 2 groups and the output has 5 groups:

| group 1 | group 2 |
| 1, 2, 3, 4, 5 | 6, 7, 8, 9, 10 |

Reshape into a 2*5 matrix:

1, 2, 3, 4, 5
6, 7, 8, 9, 10

Transpose into a 5*2 matrix:

1, 6
2, 7
3, 8
4, 9
5, 10

Flatten the matrix and regroup:

| group 1 | group 2 | group 3 | group 4 | group 5 |
| 1, 6 | 2, 7 | 3, 8 | 4, 9 | 5, 10 |
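The whole reshape-transpose-flatten trick takes only a few lines in PyTorch (a minimal sketch; the function name is mine):

```python
import torch

def channel_shuffle(x, groups):
    """Shuffle channels: reshape to (groups, ch/groups), transpose, flatten."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # transpose the two group axes
    return x.view(n, c, h, w)                 # flatten back

# 10 channels in 2 groups -> shuffled channel order 0,5,1,6,2,7,3,8,4,9
x = torch.arange(10).float().view(1, 10, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
```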

The structure of the ShuffleNet unit is as follows:

(a) is a bottleneck unit with a depthwise convolution (DWConv)

(b) builds on (a) by adding pointwise group convolution (GConv) and channel shuffle

(c) is the final ShuffleNet unit, which uses AVG pooling and concat operations (for the stride = 2 case)



MobileNetV2: Inverted Residuals and Linear Bottlenecks 

The main contributions are as follows:

· Inverted residuals are proposed.

MobileNetV2 uses a residual structure that is similar to, and derived from, the ResNet residual structure, but with differences.

Since ResNet does not use depthwise conv, the number of feature channels entering the pointwise conv is quite large, so its residual module first applies a 0.25x dimension reduction. MobileNet V2, however, has relatively few channels because of depthwise conv, so its residual module first applies a 6x dimension expansion.

To sum up, there are two differences:

(1) ResNet's residual structure applies a 0.25x dimension reduction, while MobileNet V2's applies a 6x dimension expansion.

(2) In ResNet's residual structure the 3*3 convolution is an ordinary convolution; in MobileNet V2 it is a depthwise conv.

 

MobileNet V1 and MobileNet V2 have 2 differences:

(1) Before the 3*3 convolution, V2 adds a 1*1 pointwise conv to expand the dimension, followed by ReLU.

(2) After the final 1*1 convolution, no ReLU is applied.

· Linear bottlenecks are proposed.

Why no ReLU?

First, look at ReLU's characteristics: ReLU maps all negative values to 0 and is highly nonlinear. The figure below shows the paper's experiment. When the dimension is as low as 2 or 3, the information loss caused by ReLU is serious; when the dimension is 15 or 30, the loss is relatively small.

Therefore, in MobileNet V2, to avoid heavy information loss, the last ReLU in the residual module is removed, which is why it is called a linear bottleneck.
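Putting the pieces together, here is a minimal PyTorch sketch of a stride-1 MobileNet V2 inverted residual block: 1*1 expansion (6x) with ReLU6, depthwise 3*3 with ReLU6, then a linear 1*1 projection with no ReLU (the class name is mine):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Stride-1 MobileNet V2 block: expand -> depthwise -> linear project."""
    def __init__(self, ch, expand=6):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, kernel_size=1, bias=False),  # 1*1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),              # depthwise 3*3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, kernel_size=1, bias=False),  # linear bottleneck:
            nn.BatchNorm2d(ch))                                # no ReLU here

    def forward(self, x):
        return x + self.block(x)  # residual only when stride=1 and shapes match
```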

MobileNet V2 network structure



Here, t denotes the channel expansion factor, c the number of output channels, n the number of times the unit is repeated, and s the stride.

In the bottleneck module, the stride=1 and stride=2 variants are shown in the figure above; only the stride=1 module has a residual connection.

 

Results:

MobileNet V2 is faster and more accurate than MobileNet V1.
