Recently, there have been many studies on dynamic networks for accelerating inference. DGC brings the idea of dynamic networks into group convolution, keeping group convolution lightweight while enhancing its expressive power. The overall idea is direct and clear, making it a good choice in network design.

Source: WeChat official account [Algorithm Engineering Notes of Xiaofei]

Paper: Dynamic Group Convolution for Accelerating Convolutional Neural Networks

  • Paper address: Arxiv.org/abs/2007.04…
  • Paper code: Github.com/zhuogege194…

Introduction


At present, group convolution is widely used in lightweight networks, but this paper's analysis finds that group convolution has two fatal shortcomings:

  • Because of the sparse connections it introduces, the expressive power of the convolution is weakened and performance drops, especially on difficult samples.

  • The connection pattern is fixed and does not adapt to the input sample. By visualizing the contribution of input dimensions to output dimensions in intermediate DenseNet layers, the paper finds that different input dimensions contribute differently to different outputs, and that this contribution pattern also differs across input samples.

Drawing on the idea of dynamic networks, this paper proposes dynamic group convolution (DGC). A small feature selector is introduced for each group to dynamically determine which input dimensions are connected, based on the intensity of the input features. Multiple groups can capture different, complementary features of the input image and thus learn rich feature representations. In this way, dynamic group convolution adaptively selects the most relevant input dimensions for each group while keeping the original network structure intact.

Group-wise Dynamic Execution


The structure of DGC is shown in Figure 2. The output dimensions are divided into multiple groups, each equipped with an auxiliary head that decides which input dimensions are used for its convolution. The logic of each group is as follows:

  1. The saliency generator produces importance scores for the input dimensions.
  2. The input channel selector uses a gating strategy to dynamically pick the most important input dimensions based on these scores.
  3. A normal convolution is performed on the selected subset of input dimensions.

Finally, the outputs of all heads are concatenated and shuffled, then fed into the subsequent BN and activation layers.
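The per-head flow can be pictured with a minimal PyTorch sketch. Everything below is illustrative, not the authors' implementation: the module name `DGCLayer`, the SE-style scorer layout, and the use of channel masking instead of the kernel slicing that gives the actual compute savings are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_shuffle(x, groups):
    """Interleave the channels coming from different heads."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)


class DGCLayer(nn.Module):
    """Sketch of a DGC layer: each head scores all C input channels, keeps the
    top (1 - zeta) fraction, convolves the score-weighted selection, and the
    head outputs are concatenated, shuffled and passed through BN + ReLU."""

    def __init__(self, in_ch, out_ch, k=3, heads=4, zeta=0.75, d=16):
        super().__init__()
        assert out_ch % heads == 0
        self.heads = heads
        self.keep = max(1, int(round((1 - zeta) * in_ch)))
        # one SE-style saliency generator and one convolution per head
        self.saliency = nn.ModuleList(
            nn.Sequential(nn.Linear(in_ch, max(1, in_ch // d)), nn.ReLU(inplace=True),
                          nn.Linear(max(1, in_ch // d), in_ch), nn.ReLU(inplace=True))
            for _ in range(heads))
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch // heads, k, padding=k // 2, bias=False)
            for _ in range(heads))
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):                       # x: (N, C, H, W)
        p = x.mean(dim=(2, 3))                  # global average pooling -> (N, C)
        outs = []
        for sal, conv in zip(self.saliency, self.convs):
            g = sal(p)                          # per-sample channel importance, (N, C)
            top = g.topk(self.keep, dim=1).indices
            mask = torch.zeros_like(g).scatter_(1, top, 1.0)
            # keep only the top channels, weighted by their scores; a real
            # implementation slices the kernel instead to save computation
            outs.append(conv(x * (g * mask)[:, :, None, None]))
        out = channel_shuffle(torch.cat(outs, dim=1), self.heads)
        return F.relu(self.bn(out))
```

For example, `DGCLayer(256, 256, heads=4, zeta=0.75)` maps a `(N, 256, 32, 32)` tensor to a `(N, 256, 32, 32)` tensor, with each head convolving only a quarter of the input channels.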

Saliency Generator

The saliency generator assigns a score to each input dimension to indicate its importance. Each head has its own saliency generator, which guides different heads to use different input dimensions and thus increases the diversity of the learned features. The saliency generator follows the design of the SE block. For the $i$-th head, the importance vector $g^i$ is computed as:

$g^i \in \mathbb{R}^{1\times C}$ is the importance vector of the input dimensions, and $(z)_+$ denotes the ReLU activation. $p$ reduces each input feature map to a single scalar. $\beta^i$ and $W^i$ are learnable parameters: $\beta^i$ is a bias, and $W^i$ implements the mapping $\mathbb{R}^{1\times C} \mapsto \mathbb{R}^{1\times C/d} \mapsto \mathbb{R}^{1\times C}$, where $d$ is the compression ratio. Here $x^i$ is the full set of input dimensions, i.e. in each head all input dimensions are candidates.
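Expanding the saliency branch from the sketch above into a standalone module, a rough sketch of one such generator might look as follows, assuming a two-layer SE-style mapping with compression ratio $d$ (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn


class SaliencyGenerator(nn.Module):
    """SE-style scorer for one head: pool each feature map to a scalar, then map
    R^{1xC} -> R^{1xC/d} -> R^{1xC} and apply ReLU to get non-negative scores."""

    def __init__(self, channels, d=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, max(1, channels // d))   # compression by ratio d
        self.fc2 = nn.Linear(max(1, channels // d), channels)   # back to one score per channel

    def forward(self, x):                 # x: (N, C, H, W)
        p = x.mean(dim=(2, 3))            # p(x): each feature map reduced to a single scalar
        return torch.relu(self.fc2(torch.relu(self.fc1(p))))    # (z)_+ -> (N, C) scores
```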

Gating Strategy

After obtaining the importance vector, the next step is to determine which input dimensions the current head selects for the subsequent convolution. Either a head-wise threshold or a network-wise threshold can be used to filter out input features with low scores; the paper uses the head-wise threshold. Given the target pruning ratio $\zeta$, the threshold $\tau^i$ of the $i$-th head satisfies:

The importance scores serve two purposes: 1) input dimensions whose score is below the threshold are removed; 2) the remaining dimensions are weighted by their scores, yielding a weighted feature $y^i \in \mathbb{R}^{(1-\zeta)C \times H \times W}$. Assuming the number of heads is $\mathcal{H}$ and the convolution kernels of the $i$-th head are $w^i \subset \theta^i$ with $\theta^i \in \mathbb{R}^{k \times k \times C \times \frac{C'}{\mathcal{H}}}$, the corresponding convolution is computed as:

$\mathcal{I}_{top\lceil k \rceil}(z)$ returns the indices of the $\lceil k \rceil$ largest elements of $z$, the output is $x'^i \in \mathbb{R}^{\frac{C'}{\mathcal{H}} \times H' \times W'}$, and $\otimes$ denotes conventional convolution. At the end of the DGC layer, the outputs of all heads are concatenated and shuffled into the final output $x'$. To make the importance scores as sparse as possible, a lasso loss is introduced:

$\mathcal{L}$ is the number of DGC layers and $\lambda$ is a preset hyperparameter.
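A minimal sketch of the head-wise gating and the lasso term, assuming scores `g` from a saliency generator, a full kernel `theta` of shape (C'/H, C, k, k), and a pruning ratio `zeta`; the function names and the per-sample Python loop are illustrative, not an efficient implementation:

```python
import torch
import torch.nn.functional as F


def head_forward(x, g, theta, zeta):
    """One head: x (N, C, H, W) inputs, g (N, C) scores, theta (C'/H, C, k, k) kernels."""
    n, c = g.shape
    keep = max(1, int(round((1 - zeta) * c)))
    top = g.topk(keep, dim=1).indices                    # I_top: indices of the largest scores
    outs = []
    for j in range(n):                                   # per-sample selection, for clarity only
        idx = top[j]
        y = x[j:j + 1, idx] * g[j, idx][None, :, None, None]   # weighted selected channels y^i
        w = theta[:, idx]                                # kernel slices w^i for those channels
        outs.append(F.conv2d(y, w, padding=theta.shape[-1] // 2))
    return torch.cat(outs, dim=0)                        # (N, C'/H, H', W')


def lasso_loss(all_scores, lam=1e-4):
    """Sparsity regularizer: all_scores is a list of (N, C) score tensors,
    one per head of every DGC layer; lam is the hyperparameter lambda."""
    return lam * sum(s.abs().sum(dim=1).mean() for s in all_scores)
```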

Computation Cost

For a conventional convolution with kernel size $k$, the MAC count is $k^2 C C' H' W'$. For DGC, the MACs of the saliency generator and of the convolution in each head are $\frac{2C^2}{d}$ and $k^2 (1-\zeta) C \frac{C'}{\mathcal{H}} H' W'$, respectively. Therefore, the MAC saving ratio of a DGC layer relative to a conventional convolution is:

The number of heads $\mathcal{H}$ has almost no effect on the overall computational cost.
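A back-of-the-envelope check with illustrative numbers ($k=3$, $C=C'=256$, $32\times 32$ outputs, $\zeta=0.75$, $d=16$, $\mathcal{H}=4$) confirms that the cost ratio is roughly $1-\zeta$ and that the saliency generators are negligible:

```python
k, C, C_out, H_out, W_out = 3, 256, 256, 32, 32    # kernel size, channels, output spatial size
zeta, d, heads = 0.75, 16, 4

conv_mac = k * k * C * C_out * H_out * W_out                          # conventional convolution
dgc_conv_mac = heads * k * k * (1 - zeta) * C * (C_out / heads) * H_out * W_out
dgc_sal_mac = heads * 2 * C * C / d                                   # saliency generators

print(f"conventional: {conv_mac / 1e6:.1f} M MACs")                      # ~604.0 M
print(f"DGC layer:    {(dgc_conv_mac + dgc_sal_mac) / 1e6:.1f} M MACs")  # ~151.0 M
print(f"cost ratio:   {(dgc_conv_mac + dgc_sal_mac) / conv_mac:.3f}")    # ~0.250 ≈ 1 - zeta
```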

Invariant to Scaling

The overall idea of DGC is somewhat similar to that of the dynamic pruning algorithm FBS. The flow of FBS is shown in the figure above: it computes importance scores for the output dimensions and uses them to weight the final feature output, without BN. This weighting causes the feature distribution to differ greatly from sample to sample, leading to an internal covariate shift problem. Although DGC also weights features by importance scores, it applies BN + ReLU normalization to the final convolution result, which avoids this problem:

Training DGC Networks

DGC networks are trained from scratch and do not require a pre-trained model. In the back-propagation stage, gradients are computed only for the weights associated with the dimensions selected during inference; the rest are set to zero. To prevent pruning from causing too large a change in the training loss, the pruning ratio $\zeta$ is increased gradually during training. Training is divided into three stages: the first stage (the first 1/12 of the epochs) is a warm-up, the second stage gradually increases the pruning ratio, and the third stage (the last 1/4 of the epochs) fine-tunes the sparse network. The learning rate is decayed with a cosine annealing schedule.
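A possible ramp for the pruning ratio over the three stages is sketched below; the exact shape of the schedule (here a linear ramp) is an assumption, as the paper only states that $\zeta$ is increased gradually:

```python
def pruning_ratio(epoch, total_epochs, zeta_target):
    """Warm up for the first 1/12 of epochs, ramp zeta up to its target,
    then hold it fixed for the last 1/4 of epochs (fine-tuning)."""
    warmup_end = total_epochs / 12
    ramp_end = total_epochs * 3 / 4
    if epoch < warmup_end:
        return 0.0                                   # stage 1: warm-up, no pruning
    if epoch < ramp_end:                             # stage 2: gradually increase zeta
        return zeta_target * (epoch - warmup_end) / (ramp_end - warmup_end)
    return zeta_target                               # stage 3: fine-tune the sparse network


# example: a 120-epoch run targeting zeta = 0.75
for e in (0, 10, 60, 100):
    print(e, round(pruning_ratio(e, 120, 0.75), 3))
```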

Experiments


Comparison with pruning methods and dynamic dimension-selection methods. DGC-G uses a network-wise threshold for dimension selection, which is learned during training.

Comparison with other lightweight networks.

Performance comparison under different parameter settings.

Visualization of importance scores and filters for shallow and deep layers.

The pruning probability of each input dimension for one head in one layer of a DGC network.

Conclusion


Dynamic group convolution (DGC) brings the idea of dynamic networks into group convolution, keeping group convolution lightweight while improving its expressive power. The overall idea is direct and clear, making it a good choice in network design.




