Please follow my official account [Jizhi Vision] for more shared notes

Hi, I’m Jizhi Vision. This article shares some practice on channel pruning for model pruning.

Model pruning is an important technique for accelerating model inference. The goal is usually to make the model more lightweight with as little precision loss as possible.

So let’s get started.

1 Pruning idea

The channel pruning strategy adopted here mainly targets modules that do not affect the final input and output dimensions. At the same time, because some layers in deep learning networks place strict restrictions on their input dimensions, the strategy is to skip channel pruning for layers that require dimension agreement with the layers before and after them.

1.1 Main Ideas

The YOLOV3 model is structured so that a large number of convolution layers are directly followed by BN layers. The weight (scale) coefficients of a BN layer are used to judge the importance of the corresponding channels; removing the unimportant channels reduces the parameters of the preceding convolution layer, and if the subsequent convolution layer is pruned accordingly, its parameters are reduced as well. The basic module of the channel pruning discussed here is the CBL block (Conv + BN + LeakyReLU), whose structure is shown below:
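
As a minimal sketch of this idea (the function and variable names are illustrative, not the original project's code), channel importance can be read directly from each BN layer's learned scale factors (its weight, i.e. gamma), and a global threshold over all prunable layers then marks the unimportant channels:

import torch
import torch.nn as nn

def gather_bn_scales(bn_modules):
    """Concatenate |gamma| of all prunable BN layers into one vector."""
    return torch.cat([m.weight.data.abs().flatten() for m in bn_modules])

def pruning_threshold(scales, prune_ratio=0.5):
    """Channels whose |gamma| falls below this value are treated as unimportant."""
    sorted_scales, _ = torch.sort(scales)
    return sorted_scales[int(len(sorted_scales) * prune_ratio)]

# example: mark the 50% of channels with the smallest scale factors
# bn_modules = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
# thre = pruning_threshold(gather_bn_scales(bn_modules), prune_ratio=0.5)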

1.2 BN layer selection strategy

Deep learning networks contain various layers that operate on dimensions, and some of them place strict requirements on their input dimensions. For example, a shortcut layer (a residual structure used to prevent information loss or vanishing gradients) usually has multiple inputs. If channel pruning is performed on the module feeding one input while the modules feeding the other inputs are not adjusted accordingly, the shortcut will suffer a dimension mismatch. The pruning process is therefore simplified: instead of pruning all CBL modules, only those CBL modules that do not require dimension agreement are pruned, as in the sketch below. In the YOLOV3 model, the CBL modules excluded from pruning are mainly the following:
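
A hedged sketch of how such a selection can be implemented (assuming a Darknet-style parsed config named module_defs, as in common PyTorch YOLOV3 implementations, rather than the original project's exact code): collect the CBL indices while excluding every convolution whose output feeds a shortcut.

def get_prunable_cbl_indices(module_defs):
    """Indices of CBL blocks that are safe to prune.

    Any convolution whose output is consumed by a shortcut (residual add)
    must keep its channel count, so it is excluded from pruning.
    """
    cbl_idx, keep_idx = [], set()
    for i, d in enumerate(module_defs):
        if d["type"] == "convolutional" and int(d.get("batch_normalize", 0)):
            cbl_idx.append(i)
        elif d["type"] == "shortcut":
            keep_idx.add(i - 1)               # layer right before the add
            keep_idx.add(i + int(d["from"]))  # the other input of the add
    return [i for i in cbl_idx if i not in keep_idx]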

1.3 Examples

To give a simple example of how channel pruning of a BN layer affects the convolution kernel parameters: when a channel of a BN layer is pruned, it inevitably affects the kernel parameters of the convolution layers immediately before and after it. As shown below, suppose the preceding convolution layer has 3×3 kernels with 2 input channels and 4 output channels, and the subsequent convolution layer has 3×3 kernels with 4 input channels and 2 output channels. Pruning the second channel of the BN layer then has the effect shown below, where green marks the channel to be pruned and red marks the kernel parameters removed once that channel is pruned:
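
A small PyTorch sketch of this toy case (illustrative only; a real implementation would construct new, smaller layers rather than reassign .data in place) shows how pruning one BN channel slices the kernels of both neighbouring convolutions:

import torch.nn as nn

# toy setup matching the example: conv(2->4) -> BN(4) -> conv(4->2)
conv_prev = nn.Conv2d(2, 4, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(4)
conv_next = nn.Conv2d(4, 2, kernel_size=3, bias=False)

keep = [0, 2, 3]  # prune the second BN channel (index 1)

conv_prev.weight.data = conv_prev.weight.data[keep]       # [4,2,3,3] -> [3,2,3,3]
bn.weight.data = bn.weight.data[keep]                     # gamma
bn.bias.data = bn.bias.data[keep]                         # beta
bn.running_mean, bn.running_var = bn.running_mean[keep], bn.running_var[keep]
conv_next.weight.data = conv_next.weight.data[:, keep]    # [2,4,3,3] -> [2,3,3,3]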

2 Pruning flow chart

Channel pruning flow chart is as follows:

2.1 Normal training and sparse training

2.1.1 Normal training

The specific training results and curves are shown below; the two rows and three columns of plots on the left show the corresponding losses and loss gains:

The weight distribution of the BN layers before normal training (112 BN layers in total) approximately follows a normal distribution centered at 0.749, as shown below:

During normal training, the overall distribution of the BN-layer weights evolves with the number of training iterations as follows:

The weight distribution of the BN layers after normal training is as follows:
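
Distribution plots like these can be reproduced by collecting every BN scale factor at a given checkpoint and drawing a histogram; a minimal sketch (the plotting details are illustrative, not the original tooling):

import torch.nn as nn
import matplotlib.pyplot as plt

def plot_bn_weight_distribution(model, title):
    gammas = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            gammas.extend(m.weight.data.cpu().flatten().tolist())
    plt.hist(gammas, bins=100)
    plt.title(title)
    plt.xlabel("BN weight (gamma)")
    plt.ylabel("count")
    plt.show()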

2.1.2 Sparse training

The purpose of sparse training is to drive most of the learned network weights close to zero. Because the relationship between weights and inputs in a deep learning network is multiply-accumulate, if most of the weights in the model are zero or close to zero, removing them has little influence on the final output accuracy of the network.

Sparse training is not only used for pruning; it is also a general model compression method, because even without pruning, operations on zero values are faster and the storage is lighter. The principle of sparse training is as follows: the L1 norm of the weights is used as a loss term that effectively counts non-zero weights (the larger the L1 norm, the more non-zero weights there are), and this L1 loss is combined with the original object detection loss to form the final loss function. In short, the zeroing effect of the L1 penalty adds a constraint that sparsifies the parameters. The loss function is expressed as follows:
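
Written out in the standard network-slimming form (the symbols are named here for clarity and are consistent with the gradient code below, though the original formula image is not reproduced), the combined loss is:

L = \sum_{(x, y)} l\left(f(x, W),\, y\right) + s \sum_{\gamma \in \Gamma} \lvert \gamma \rvert

where the first term is the original YOLOV3 detection loss, Γ is the set of BN scale factors taking part in sparse training, and s is the sparsity coefficient weighting the L1 penalty.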

In the program, this combined loss is not computed directly; instead, the (sub)gradient of the L1 norm is accumulated onto the gradients of the weights selected for sparse training. The code is as follows:

import torch

# For each prunable BN layer, add the L1 subgradient s * sign(gamma)
# to the gradient of its scale factors (equivalent to an s * |gamma| loss term).
for idx in prune_idx:
    bn_module = module_list[idx][1]
    bn_module.weight.grad.data.add_(s * torch.sign(bn_module.weight.data))  # L1
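
For context, here is a hedged sketch of where this accumulation sits in one training step (the surrounding names such as compute_loss, optimizer, imgs, and targets are assumptions, not the original project's code):

# one sparse-training step (illustrative)
optimizer.zero_grad()
loss = compute_loss(model(imgs), targets)   # ordinary YOLOV3 detection loss
loss.backward()                             # gradients of the detection loss only
for idx in prune_idx:                       # add the L1 subgradient on BN gammas
    bn_module = module_list[idx][1]
    bn_module.weight.grad.data.add_(s * torch.sign(bn_module.weight.data))
optimizer.step()                            # update with detection + L1 gradients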

The graph of sparse training is as follows:

The weight distribution of the BN layers before sparse training is as follows:

During iterations 1-52 of sparse training, the overall distribution of the BN-layer weights changes as follows:

During iterations 52-120 of sparse training, the overall distribution of the BN-layer weights changes as follows:

The weight distribution of the BN layers after sparse training is as follows:

3 Advance channel pruning based on time series prediction

Comparing the parameters of sparse training and normal training shows that each has its own advantages and disadvantages. With the same number of iterations, the training time is essentially the same. Normal training gives good accuracy before pruning, but once pruning is applied it loses most of that accuracy and needs many rounds of fine-tuning to recover to an acceptable level. In contrast, sparse training allows direct channel pruning at the cost of a small precision loss, but its training accuracy is not high. Weighing the pros and cons of the two raises the question of whether channel pruning can be done with less fine-tuning while keeping suitable precision.

Examining the weight distribution of the BN layers during sparse training, it changes regularly as training iterates, and the final distribution stabilizes as the number of iterations grows. This suggests that the overall distribution of the BN-layer weights can be treated as a time series indexed by the iteration count.

All of this suggests that early pruning may be possible: if the future distribution of the BN-layer weights can be predicted at an early stage of training, the unimportant channels can be cut off in advance. After a certain number of further iterations, better performance indicators can be obtained, striking a balance between normal training and sparse training.

3.1 Design ideas and experiments

To predict the values of the BN-layer weights in the later stages of training, the number of training iterations is treated as the time axis, so the weight values form a time series that changes with the iteration count. The purpose of pruning in advance is to prune only once and then reach good precision through the subsequent training iterations, while also reducing the time cost of training. Sparse training is used to obtain the weight changes in the early stage, which serve as the data the prediction depends on. However, to improve the accuracy after pruning, normal training is used afterwards, so that the pruned model is freed from the L1-norm constraint and reaches a higher accuracy than it would after sparse training.

3.1.1 Design flow chart

The flow chart of advance channel pruning based on time series prediction is as follows:

3.1.2 Key design points

In the whole design, the key is how to predict the weights with a time series prediction model. Building this prediction model requires considering two aspects: whether the BN-layer weights to be predicted are statistically correlated with one another, or can be treated as uncorrelated; and which time series prediction algorithm to choose so that the predicted future values have a distribution that makes channel pruning effective.

Generally speaking, the BN-layer weights in such a network cannot be completely uncorrelated in the statistical sense. A prediction model that accounts for statistical correlation, however, usually has learnable parameters. For the YOLOV3 model there are 13376 prunable BN weights; even with the simplest linear prediction, one would have to learn a matrix with 13376 rows and 13376 × n (n ≥ 1) columns, which can exhaust the hardware resources available to the training program. Another difficulty is that a time series prediction model needs reference data for training, but the number of time series samples is too small, so the predictions of such a model would not be convincing.

Therefore, when verifying the scheme, the BN-layer weights are assumed to be statistically uncorrelated, so that each weight can be predicted separately. In this way, the time series prediction module does not occupy too many hardware resources.

Considering the time cost, the time series prediction module uses the simple moving average method for the experiments; the specific parameter configuration is as follows:
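
A minimal sketch of such a per-weight simple moving average forecast (treating every prunable BN weight as an independent series, as assumed above; the names are illustrative):

import torch

def sma_forecast(history, window=5):
    """Predict the next value of every BN weight with a simple moving average.

    history: list of 1-D tensors, one snapshot of the flattened prunable
             BN weights per recorded epoch/iteration.
    """
    recent = torch.stack(history[-window:])   # [window, num_weights]
    return recent.mean(dim=0)                 # per-weight next-step forecast

# channels whose forecast gamma stays below the pruning threshold
# can then be removed in advance:
# keep_mask = sma_forecast(history) > thre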

3.1.3 Experimental results

The loss comparison diagram is as follows, in which the pink curve at the top is sparse training, the green curve in the middle is advance channel pruning training, and the blue curve at the bottom is normal training:

The accuracy comparison figure is as follows, in which the pink curve at the bottom is sparse training, the green curve in the middle is advance channel pruning training, and the blue curve at the top is normal training:

Result analysis: advance channel pruning balances accuracy against model parameter scale; it not only accelerates the training process, but also achieves a one-shot pruning accuracy higher than that of the sparse-regularization method. The final performance indicators are as follows:

The above data show that advance channel pruning based on time series prediction achieves a very good channel pruning effect. The method needs only one pruning pass, and the normal training after pruning is equivalent to fine-tuning, which not only reduces the time cost but also yields better precision than sparse training without fine-tuning.


Ok, that concludes the sharing of channel pruning strategies and practice for model pruning. I hope it is of some help to your study.


[Official account article]

【Model Inference】On the Channel Pruning Strategy of Model Pruning