A Quantization-Friendly Separable Convolution for MobileNets


Paper link: arxiv.org/pdf/1803.08…

This paper was published by Qualcomm in 2018. It studies why MobileNetV1 quantizes so poorly, tracing the cause of its severe quantization loss (top-1 accuracy drops from 70.5% for the floating-point model to 1.8% for the quantized model) and finding that the root cause lies mainly in the depthwise separable convolution. Although separable convolution greatly reduces MobileNetV1's parameter count and storage, its side effect is very poor quantization behavior. The authors therefore propose three changes to fix this and improve MobileNetV1's quantized accuracy:

  1. Remove BN and ReLU6 from all depthwise convolution layers;
  2. Replace ReLU6 with ReLU in the remaining layers;
  3. Apply L2 regularization to the weights of the depthwise convolution layers during training.

These changes raise the quantized model's accuracy from 1.8% to 68.03%, largely eliminating the accuracy drop caused by quantization.

Contact information:

GitHub: github.com/ccc013/AI_a…

Zhihu column: Machine learning and Computer Vision, AI paper notes

WeChat official account: AI algorithm notes


1. Introduction

Quantization is critical for running deep learning inference on mobile phones and IoT platforms, mainly because these platforms have tight power and storage budgets and rely heavily on fixed-point hardware such as digital signal processors (DSPs) to achieve high performance compared with GPUs. On such platforms, common deep networks such as VGG, GoogleNet, and ResNet are hard to deploy because of their large parameter counts and computation, while lightweight networks strike a balance between accuracy and efficiency by replacing standard convolution with depthwise separable convolution, as shown in figures (a) and (b) below. Google's MobileNetV1, for example, greatly reduces the number of parameters and the storage footprint and is a common architecture for deployment on mobile phones. The side effect, however, is that its separable convolution layers cause a huge quantization loss, so its quantized accuracy is very poor.

To investigate this quantization problem, the authors implemented MobileNetV1 and InceptionV3 in TensorFlow and compared the accuracy of the floating-point models with that of the quantized fixed-point models, as shown in the figure below. InceptionV3 loses little accuracy after quantization, whereas MobileNetV1's accuracy drops dramatically. The key difference between the two networks is that the former uses only standard convolution, while the latter relies mainly on separable convolution.

There are several possible solutions to this serious problem of quantization loss:

  1. The most direct approach is to quantize with more bits. Increasing from 8-bit to 16-bit, for example, improves accuracy but also increases model size and computation, both of which are constrained by the capacity of the target platform.
  2. The second approach is to retrain a quantized model specifically for fixed-point inference;

For the second approach, Google proposed a quantized training framework in the paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference". It simulates the quantization effect in the forward pass while keeping the original floating-point computation in backpropagation. This framework requires additional training to reduce quantization loss, and multiple models must be maintained for different target platforms.

In this paper, the authors propose a new structure for the separable convolution layer to build a lightweight, quantization-friendly network. The new structure requires only a single floating-point training run, and the resulting model can be deployed on different platforms with minimal accuracy loss for both floating-point and fixed-point inference. To this end, based on the root cause of MobileNetV1's quantization accuracy loss, the authors design a new quantization-friendly MobileNetV1 that keeps the floating-point accuracy while greatly improving the accuracy of the quantized fixed-point model.

The contributions of this paper are:

  1. Identifying BN and ReLU6 as the main cause of MobileNetV1's large quantization loss.
  2. Proposing a quantization-friendly separable convolution and demonstrating its effectiveness on MobileNetV1 in both floating-point training and quantized fixed-point inference.

2. Quantization Scheme and Loss Analysis

This section investigates the root cause of the accuracy loss of the MobileNetV1 model under TensorFlow's (TF) 8-bit quantization in a fixed-point pipeline. The figure below shows a typical 8-bit quantization flow. A TF 8-bit quantized model is first generated directly from the pre-trained floating-point model, with its weights quantized offline. Then, at inference time, all floating-point inputs are first quantized to 8-bit unsigned values and passed to fixed-point runtime operations such as QuantizedConv2d, QuantizedAdd, and QuantizedMul, which produce 32-bit accumulated results; these are converted back to 8-bit outputs by activation re-quantization and passed on to the next operation.

2.1 8-bit quantization scheme of TensorFlow

TensorFlow’s 8-bit quantization uses a uniform quantizer, which means that all quantization steps are of equal length.

Let $x_{float}$ denote a floating-point input $x$ and $x_{quant8}$ its TensorFlow-quantized 8-bit value; the quantization is computed as follows:

where $\Delta_x$ denotes the quantization step size and $[\cdot]$ denotes rounding to the nearest integer.
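For reference, a minimal sketch of a uniform asymmetric 8-bit quantizer of this kind (the exact offset convention and equation numbering in the paper may differ):

$$\Delta_x = \frac{x_{max} - x_{min}}{2^8 - 1}, \qquad x_{quant8} = \left[\frac{x_{float} - x_{min}}{\Delta_x}\right]$$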

Based on the above definitions, the 32-bit accumulated result of a convolution operation can be computed as follows:

Finally, given the known maximum and minimum values of the output, and combining formulas (1)-(4) above, the de-quantized output can be computed with the following formula:
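To make the pipeline concrete, here is a simplified NumPy sketch of the quantize, integer-accumulate, and de-quantize flow for a toy dot product. This is an illustration, not TensorFlow's actual kernels; the helper `quantize_uint8` and its offset convention are assumptions.

```python
import numpy as np

def quantize_uint8(x, x_min, x_max):
    """Uniform asymmetric 8-bit quantization (simplified sketch, not TF's exact kernel)."""
    delta = (x_max - x_min) / 255.0              # quantization step
    zero_point = int(round(-x_min / delta))      # offset so that x_min maps near 0
    q = np.clip(np.round(x / delta) + zero_point, 0, 255).astype(np.uint8)
    return q, delta, zero_point

# Toy "convolution": a dot product between a quantized input and quantized weights.
x = np.random.uniform(-1.0, 1.0, 64).astype(np.float32)
w = np.random.uniform(-0.5, 0.5, 64).astype(np.float32)

xq, dx, zx = quantize_uint8(x, x.min(), x.max())
wq, dw, zw = quantize_uint8(w, w.min(), w.max())

# 32-bit integer accumulation, then de-quantization back to float.
acc = np.sum((xq.astype(np.int32) - zx) * (wq.astype(np.int32) - zw))
approx = acc * dx * dw
print(float(np.dot(x, w)), float(approx))        # the two values should be close
```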

2.2 Measurement of quantified loss

As shown in Figure 2 above, there are five types of loss in this quantization process: input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and, for certain nonlinear operations such as ReLU6, a possible clipping loss. To better understand the contribution of each type of loss, the paper uses the signal-to-quantization-noise ratio (SQNR) to evaluate the quantization accuracy of each layer's output. SQNR is defined as the power of the unquantized signal x divided by the power of the quantization error n:
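In decibel form this is the standard definition, which the paper's Equation (6) presumably takes:

$$\mathrm{SQNR} = 10\log_{10}\frac{E[x^{2}]}{E[n^{2}]}\ \ \mathrm{dB}$$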

Since the average magnitude of the input signal x is much larger than the quantization step $\Delta_x$, it is reasonable to assume that the quantization error follows a uniform distribution with zero mean whose probability density function (PDF) integrates to 1. Therefore, for an 8-bit linear quantizer, the noise power can be calculated as follows:
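This is the standard result for a uniform quantizer (likely the content of Equation (7)):

$$E[n^{2}] = \frac{\Delta_x^{2}}{12}, \qquad \Delta_x = \frac{x_{max} - x_{min}}{2^8 - 1}$$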

Combining (2), (7) and (6), the following formula can be obtained:

SQNR depends strongly on the signal distribution. According to Equation (8), SQNR is determined by the power of x and by the quantization range, so increasing the power of the input signal or reducing its quantization range improves the output SQNR.

2.3 Quantization Loss Analysis of MobileNetV1

2.3.1 BN in the depthwise convolution layer

As shown in figure (b) below, the core layer of MobileNetV1 contains a depthwise convolution and a pointwise convolution, each followed by BN and a nonlinear activation function such as ReLU or ReLU6. In the TensorFlow implementation, ReLU6 is used as the nonlinear activation.

Let x be the input of a layer with d channels and a batch size of m. In the depthwise convolution, BN is applied to each channel independently, computed as follows:
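The paper's Equation (9) presumably takes the form of the standard per-channel BN transform (a reconstruction consistent with the symbols defined next):

$$\hat{x}_i^{k} = \frac{x_i^{k} - \mu_k}{\sqrt{\sigma_k^{2} + \epsilon}}, \qquad y_i^{k} = \gamma_k\,\hat{x}_i^{k} + \beta_k, \qquad k = 1,\dots,d,\ \ i = 1,\dots,m$$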

where $\hat{x}_i^k$ denotes the normalized value of the i-th input $x_i^k$ in channel k, $\mu_k$ and $\sigma_k$ are the mean and variance over the whole batch, $\gamma_k$ and $\beta_k$ are the scale and shift parameters, and $\epsilon$ is a very small constant, set to 0.0010000000475 in the TensorFlow implementation.

In a fixed-point pipeline, the BN transform can be folded into the convolution by first defining:
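Equation (10) presumably defines the per-channel folding factor as:

$$\alpha_k = \frac{\gamma_k}{\sqrt{\sigma_k^{2} + \epsilon}}$$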

Then equation (9) can be rewritten as:
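A reconstruction consistent with the definitions above gives the folded form:

$$y_i^{k} = \alpha_k\,x_i^{k} + \left(\beta_k - \alpha_k\,\mu_k\right)$$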

In the TensorFlow implementation, $\alpha$ in each channel k can be combined with the weights and folded into the convolution operation to reduce computation.

Although depthwise convolution is applied to each channel individually, the minimum and maximum values used for weight quantization are selected over all channels. This leads to a problem: an outlier in one channel can cause a huge quantization loss for the whole model, because the outlier enlarges the data range.

Furthermore, since depthwise convolution has no cross-channel interaction, it can easily produce all-zero values in a channel, giving that channel zero variance; this is common in MobileNetV1. According to Equation (10), when the variance is 0, $\alpha$ becomes very large because of the very small constant $\epsilon$, as shown in the figure below, which measures the values of $\alpha$ across 32 channels. The six outlier values of $\alpha$, caused by zero variance, enlarge the quantization range. As a result, quantization levels are wasted on preserving these large values from the all-zero channels, while the smaller $\alpha$ values from the channels that actually carry information are poorly preserved, and the representational ability of the whole model degrades badly.

Experiments show that, without any retraining, this problem can largely be fixed by replacing the variance of these all-zero channels with the mean of the variances of the remaining channels. This alone improves the accuracy of the quantized MobileNetV1 model on the ImageNet2012 validation set from 1.8% to 45.73%.
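A minimal NumPy sketch of this retraining-free fix; the function name and the zero threshold are illustrative choices, not taken from the paper:

```python
import numpy as np

def fix_zero_variance_channels(variances, eps=1e-6):
    """Replace (near-)zero per-channel BN variances with the mean of the
    remaining channels' variances, as described above."""
    variances = np.asarray(variances, dtype=np.float64).copy()
    zero_mask = variances < eps
    if zero_mask.any() and not zero_mask.all():
        variances[zero_mask] = variances[~zero_mask].mean()
    return variances

print(fix_zero_variance_channels([0.0, 0.3, 0.5, 0.0, 0.4]))  # zeros -> 0.4
```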

A standard convolution filters the inputs and merges them into new output values in a single step, whereas the depthwise separable convolution in MobileNetV1 splits this into two layers: a depthwise convolution for filtering and a pointwise convolution for merging. This greatly reduces computational cost and model size while preserving model accuracy.
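As a rough cost comparison (the standard result from the MobileNetV1 paper, quoted here for context), for a $D_K \times D_K$ kernel with $M$ input channels, $N$ output channels, and a $D_F \times D_F$ feature map, the computation is reduced by a factor of:

$$\frac{D_K D_K M D_F D_F + M N D_F D_F}{D_K D_K M N D_F D_F} = \frac{1}{N} + \frac{1}{D_K^{2}}$$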

Based on this analysis, the paper proposes removing BN and ReLU6 after the depthwise convolution and letting the depthwise convolution learn suitable weights to take over BN's role. This keeps the representational power of the features while allowing the model to quantize well.

As shown in the figure below, SQNR is used to observe the per-layer quantization loss. The black line is the original MobileNetV1 with $\alpha$ folded into the convolution weights; the blue line removes BN and ReLU6 from all depthwise convolution layers; the red line additionally keeps BN in the pointwise convolution layers but replaces their ReLU6 with ReLU. For the measurement, one image is randomly sampled from each category of the ImageNet2012 validation set, 1000 images in total. The results show that keeping BN and ReLU6 in the depthwise convolution layers greatly reduces the per-layer output SQNR.

2.3.2 ReLU6 or ReLU

This section still uses SQNR as the evaluation metric, as shown in the figure above, and discusses which nonlinear activation to use in the pointwise convolution layers: ReLU6 or ReLU. Note that for a linear quantizer, SQNR is higher when the input signal is more uniformly distributed and lower otherwise.

The figure shows that with ReLU6, SQNR drops significantly at the first pointwise convolution layer. According to Equation (8), although ReLU6 reduces the quantization range, the clip operation also reduces the energy of the input signal. Ideally its SQNR should therefore be similar to ReLU's, but in practice the side effect of clipping is that the input distribution is distorted in the first few layers, making it less quantization-friendly; experimentally, the resulting SQNR degradation propagates from the first layer to the later layers.

2.3.3 L2 regularization of weights

As noted above, SQNR depends strongly on the signal distribution, so the authors additionally apply L2 regularization to the weights of all depthwise convolution layers during training. L2 regularization penalizes large weights, and large weights enlarge the quantization range and make the weight distribution less uniform, which increases quantization loss; a better-behaved weight distribution therefore yields a more accurate quantized model.
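In Keras-style TensorFlow code this could look like the following sketch; the 4e-5 regularization strength is an illustrative value, not taken from the paper:

```python
import tensorflow as tf

# Depthwise convolution whose weights are penalized with L2 regularization,
# to keep the weight distribution quantization-friendly.
depthwise = tf.keras.layers.DepthwiseConv2D(
    kernel_size=3,
    strides=1,
    padding="same",
    use_bias=False,
    depthwise_regularizer=tf.keras.regularizers.l2(4e-5),  # illustrative value
)
```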


3. Quantization-Friendly Convolution for MobileNets

Based on the quantization loss analysis in Section 2, the authors propose a quantization-friendly separable convolution framework for MobileNets, whose goal is to eliminate the huge quantization loss so that the quantized model reaches accuracy similar to the floating-point model without retraining.

The new separable convolution layer proposed by the authors is shown in figure (c) below; it contains three major changes that make the separable convolution quantization-friendly (a code sketch of the resulting block follows the list):

  1. Remove BN and ReLU6 from all depthwise convolution layers.
  2. Replace ReLU6 with ReLU in the remaining layers. The authors argue that 6 is a fairly arbitrary value: although ReLU6 encourages the model to learn sparse features earlier, its clip operation also distorts the input distribution at an early stage, producing a distribution that quantizes poorly and reducing the SQNR at each stage.
  3. Apply L2 regularization to the weights of all depthwise convolution layers during training.
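A minimal Keras sketch of such a modified core block, assuming a standard MobileNetV1-style layer layout; the function name, layer choices, and regularization strength are illustrative, not taken from the paper:

```python
import tensorflow as tf

def quant_friendly_separable_block(x, pointwise_filters, stride=1, weight_decay=4e-5):
    """Depthwise separable block with the three changes described above:
    no BN/ReLU6 after the depthwise conv, ReLU (not ReLU6) after the
    pointwise conv, and L2 regularization on the depthwise weights."""
    # Depthwise convolution: no BN, no activation, L2-regularized weights.
    x = tf.keras.layers.DepthwiseConv2D(
        kernel_size=3,
        strides=stride,
        padding="same",
        use_bias=False,
        depthwise_regularizer=tf.keras.regularizers.l2(weight_decay),
    )(x)
    # Pointwise convolution keeps BN, but uses ReLU instead of ReLU6.
    x = tf.keras.layers.Conv2D(
        pointwise_filters, kernel_size=1, padding="same", use_bias=False
    )(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x

# Example usage on a dummy input tensor.
inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = quant_friendly_separable_block(inputs, pointwise_filters=64)
model = tf.keras.Model(inputs, outputs)
```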

Finally, the quantization-friendly MobileNetV1 architecture proposed in the paper is shown in the table below. With BN and ReLU6 removed from the depthwise convolutions, the network still retains the original advantages of reduced computation and model size, while bringing the quantized model's accuracy close to that of the floating-point model.


4. Experiments

The experiments implement the modified MobileNetV1 in TensorFlow. The training settings are the same as the original, except that the batch size is 128; training runs on an Nvidia GeForce GTX TITAN X, uses the ImageNet2012 dataset for training and validation, and is done purely in floating point.

The experimental results are shown in the figure below, giving the accuracy of the floating-point model and the quantized model for each version. The original version reaches 70.5% and 1.8% respectively. The three changes proposed in the paper are then added and evaluated one by one:

  1. Removing BN and ReLU6 from the depthwise convolution layers slightly improves the floating-point model to 70.55% and greatly improves the quantized model to 61.5%.
  2. On top of change 1, replacing ReLU6 with ReLU raises the floating-point model to 70.8% and the quantized model to 67.8%.
  3. Further adding L2 regularization on the depthwise weights slightly lowers the floating-point model to 70.77% but improves the quantized model to 68.03%.


5. Conclusion

This paper analyzes the root cause of the difficulty in quantizing the lightweight network MobileNetV1 (its severe accuracy drop after quantization) and finds that it comes mainly from BN and ReLU6 in the depthwise convolutions. Because depthwise convolution operates on each channel independently, it easily produces all-zero channels, whose variance is then 0; by the BN calculation, this makes $\alpha$ very large and produces large outliers that widen the quantization range. Quantization levels are then wasted on these outliers from uninformative channels, while the smaller values that carry most of the information are quantized coarsely. This greatly weakens the representational ability of the quantized model and lowers its accuracy.

Based on this finding, the authors simply remove BN and ReLU6 from the depthwise convolutions, and the experiments confirm that this works. Of the other two changes, replacing ReLU6 with ReLU in the remaining layers gives a clear additional improvement, while L2 regularization helps only modestly. The authors do not appear to have run an ablation isolating whether the ReLU swap or L2 regularization contributes more to the quantized model on its own.

In addition, the experiments so far cover only MobileNetV1. As the authors note, the method still needs to be verified on other networks that use separable convolution, such as MobileNetV2 and ShuffleNet, and beyond the classification task studied here it also needs to be verified on other tasks such as detection and segmentation.