Introduction:

MobileNet v1 presents an efficient network architecture together with two hyperparameters that let model builders choose the right model size for their application based on the constraints of the problem, yielding very small, low-latency models that are easily matched to the design requirements of mobile and embedded vision applications.


01 Depthwise Separable Convolution

Depthwise separable convolution factorizes a standard convolution into a depthwise convolution and a 1×1 pointwise convolution. The depthwise convolution applies a single filter to each input channel, and the pointwise convolution then combines the depthwise outputs. This factorization drastically reduces both computation and model size.

For example, suppose a DF × DF × M feature map is convolved to produce a DF × DF × N feature map.

A standard convolution kernel (as shown in Figure 1) has size DK × DK × M × N, and its computational cost is DK · DK · M · N · DF · DF.

In a depthwise separable convolution, the depthwise convolution (as shown in Figure 2) has size DK × DK × M: the m-th single-channel filter is applied to the m-th input channel, at a cost of DK · DK · M · DF · DF. The pointwise convolution (as shown in Figure 3) has size 1 × 1 × M × N, at a cost of M · N · DF · DF. The total cost of a depthwise separable convolution is therefore DK · DK · M · DF · DF + M · N · DF · DF.

The ratio of the computational cost of a depthwise separable convolution to that of a standard convolution is:

(DK · DK · M · DF · DF + M · N · DF · DF) / (DK · DK · M · N · DF · DF) = 1/N + 1/DK²

Accordingly, MobileNet uses 3×3 depthwise separable convolutions: with DK = 3 the ratio above is 1/N + 1/9, so computation drops by a factor of 8 to 9, while accuracy is only slightly reduced.
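As a quick sanity check, here is a minimal sketch that plugs concrete layer sizes (illustrative values, not from the paper) into the two cost formulas above:

```python
# Mult-Adds of a standard 3x3 convolution vs. its depthwise separable
# replacement, using the formulas above. Layer sizes are illustrative.
d_k, m, n, d_f = 3, 512, 512, 14

standard  = d_k * d_k * m * n * d_f * d_f                  # DK·DK·M·N·DF·DF
separable = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f  # depthwise + pointwise

print(standard / separable)  # ~8.85, i.e. the 8-9x reduction
```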

02 Network Structure and Training

Except for the first layer, which is a standard convolution, and a fully connected layer added at the end, the network is built entirely from depthwise separable convolutions. Every layer except the final fully connected one is followed by BN and ReLU. Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers.
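To make the repeated unit concrete, here is a minimal sketch of it, assuming PyTorch (the original implementation was in TensorFlow):

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """3x3 depthwise conv -> BN -> ReLU, then 1x1 pointwise conv -> BN -> ReLU."""
    return nn.Sequential(
        # Depthwise: groups=in_ch applies one 3x3 filter per input channel.
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: a 1x1 conv linearly combines the channels into out_ch.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```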

Its complete layer-by-layer structure is given in Table 1 of the paper.

The 1×1 pointwise convolutions account for 94% of the computation and 75% of the parameters, and nearly all of the remaining parameters sit in the fully connected layer. The per-layer-type breakdown of parameters and computation is given in Table 2 of the paper.

Depthwise separable convolution does more than reduce the raw operation count. An unstructured sparse matrix needs fewer operations than a dense one, yet dense matrix multiplication is usually faster in practice, because it is served by highly optimized general matrix multiply (GEMM) routines; a conventional convolution must first be reordered in memory by im2col before it can be computed as a matrix multiplication. The 1×1 convolutions that dominate a depthwise separable convolution need no im2col reordering and can be handed to GEMM directly, while the depthwise part has so few parameters and operations that the usual convolution optimizations suffice. As a result, depthwise separable convolutions are also extremely fast in practice.
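The point about 1×1 convolutions is easy to see in code: with the feature map flattened, a 1×1 convolution is literally a matrix multiply. A minimal NumPy sketch (shapes are illustrative):

```python
import numpy as np

m, n, d_f = 64, 128, 14            # input channels, output channels, spatial size
x = np.random.randn(m, d_f * d_f)  # feature map flattened to M x (DF*DF)
w = np.random.randn(n, m)          # the 1x1 kernels form an N x M matrix

y = w @ x                          # direct GEMM, no im2col reordering needed
print(y.shape)                     # (128, 196), i.e. N x (DF*DF)
```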

In addition, MobileNet uses fewer regularization and data augmentation techniques than large models, because such a small model has little trouble with overfitting. The training setup is similar to that of Inception V3 (RMSprop with asynchronous gradient descent), but without side heads or label smoothing, and with less image distortion, achieved by limiting the size of the small crops used in training. Finally, since the depthwise filters contain so few parameters, little or no weight decay (L2 regularization) is placed on them.
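A sketch of how that last point might look in practice, reusing the depthwise_separable block from the earlier sketch (the weight-decay and learning-rate values are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

net = depthwise_separable(32, 64)

decay, no_decay = [], []
for mod in net.modules():
    if isinstance(mod, nn.Conv2d) and mod.groups == mod.in_channels > 1:
        no_decay += list(mod.parameters())   # depthwise filters: no decay
    elif isinstance(mod, (nn.Conv2d, nn.Linear)):
        decay += list(mod.parameters())      # pointwise/standard layers: decay
    elif isinstance(mod, nn.BatchNorm2d):
        no_decay += list(mod.parameters())   # BN params are commonly exempt too

optimizer = torch.optim.RMSprop(
    [{"params": decay, "weight_decay": 4e-5},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.045,
)
```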

03 Width Multiplier: Thinner Models

Even though the basic architecture is already small and low-latency, many applications require an even smaller model. The width multiplier α thins the network uniformly at each layer: the number of input channels M becomes αM and the number of output channels N becomes αN, so the cost of a depthwise separable layer becomes:

DK · DK · αM · DF · DF + αM · αN · DF · DF

With α = 1 this is the baseline MobileNet described above; models with α < 1 are called reduced MobileNets. Computation and parameter count both shrink by roughly a factor of α².
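A minimal sketch of the effect of α on a single layer's cost (layer sizes are illustrative):

```python
# Cost of one depthwise separable layer: DK·DK·αM·DF·DF + αM·αN·DF·DF
def ds_conv_cost(d_k, m, n, d_f, alpha=1.0):
    m, n = int(alpha * m), int(alpha * n)
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

full = ds_conv_cost(3, 512, 512, 14)             # alpha = 1
half = ds_conv_cost(3, 512, 512, 14, alpha=0.5)  # alpha = 0.5
print(full / half)  # ~3.9, i.e. roughly 1/alpha^2 = 4
```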

04 Resolution Multiplier: Reduced Representation

The second hyperparameter, the resolution multiplier ρ, reduces the spatial resolution. It is applied in the same way as α, so the cost becomes:

DK · DK · αM · ρDF · ρDF + αM · αN · ρDF · ρDF

Here ρ ∈ (0, 1], set implicitly by choosing an input resolution of 224, 192, 160, or 128. Computation shrinks by roughly a factor of ρ², while the number of parameters is unchanged.
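Continuing the sketch above, ρ simply rescales DF, so the cost drops by roughly ρ² while the parameter count is untouched:

```python
rho = 128 / 224  # e.g. a 128x128 input instead of 224x224

base    = ds_conv_cost(3, 512, 512, 14)             # rho = 1
reduced = ds_conv_cost(3, 512, 512, int(rho * 14))  # DF scaled by rho
print(base / reduced)  # ~3.06, i.e. roughly 1/rho^2
```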

05 Experiments

The detailed results (ImageNet accuracy against computation and model size for different α and ρ settings, plus applications such as detection and face attributes) are presented in the tables of the original paper.

Related reading in this series: MobileNet_v2 and MobileNet_v3.
