
WeightNet: Revisiting the Design Space of Weight Networks

  • Paper: Arxiv.org/abs/2007.11…
  • Code: Github.com/megvii-mode…

Introduction


The paper proposes WeightNet, a simple and efficient dynamic weight-generation network that unifies the characteristics of SENet and CondConv in the weight space. First, a dynamic activation vector is obtained through global average pooling and a fully-connected layer with sigmoid activation. The two earlier methods then use this activation vector differently: SENet uses it to weight the feature channels, while CondConv uses it to weight the candidate convolution kernel parameters. Building on both, WeightNet appends a grouped fully-connected layer after the activation vector to generate the convolution kernel weights directly. This is computationally efficient, and the trade-off between accuracy and speed can be tuned through its hyperparameter settings.

WeightNet


Grouped fully-connected operation

In a fully-connected layer every input neuron is connected to every output neuron, so the layer can be written as a matrix multiplication $Y = WX$, as shown in Figure A. A grouped fully-connected layer divides the neurons into $G$ groups and makes all connections only within each group (each group has $I/G$ inputs and $O/G$ outputs), as shown in Figure B. A notable property of the grouped fully-connected operation is that its weight matrix becomes a sparse block-diagonal matrix; an ordinary fully-connected operation can be regarded as a grouped fully-connected operation with $G = 1$.
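As a concrete illustration, here is a minimal PyTorch sketch (the names and sizes are illustrative, not from the official code) showing that a grouped fully-connected layer is just a grouped 1×1 convolution whose equivalent dense weight matrix is block-diagonal:

```python
import torch
import torch.nn as nn

# A grouped fully-connected layer with I inputs, O outputs and G groups:
# each group connects I/G inputs to O/G outputs.
I, O, G = 8, 12, 4
grouped_fc = nn.Conv1d(I, O, kernel_size=1, groups=G, bias=False)

x = torch.randn(1, I, 1)           # a single input vector, shaped for Conv1d
y = grouped_fc(x)                  # (1, O, 1)

# The equivalent dense weight matrix is block-diagonal.
W = torch.zeros(O, I)
for g in range(G):
    W[g * O // G:(g + 1) * O // G, g * I // G:(g + 1) * I // G] = \
        grouped_fc.weight[g * O // G:(g + 1) * O // G, :, 0]

y_dense = W @ x[0, :, 0]
print(torch.allclose(y[0, :, 0], y_dense, atol=1e-6))  # True: same operation
```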

Rethinking CondConv

CondConv uses an $m$-dimensional vector $\alpha$, dynamically generated from the sample features, to merge $m$ candidate convolution kernels into the final kernel:

$$\alpha = \sigma\left(W_{fc1} \times \frac{1}{hw}\sum_{i \in h,\, j \in w} X_{c,i,j}\right)$$

where $\times$ denotes matrix multiplication, $\sigma(\cdot)$ is the sigmoid function, $W_{fc1} \in \mathbb{R}^{m \times C}$, and $\alpha \in \mathbb{R}^{m \times 1}$. The final convolution kernel weight is obtained by weighting the candidate kernels with $\alpha$:

$$W^{'} = \alpha_1 \cdot W_1 + \alpha_2 \cdot W_2 + \cdots + \alpha_m \cdot W_m$$

where $W_i \in \mathbb{R}^{C \times C \times k_h \times k_w}$. The above operations can be rewritten as a single matrix multiplication (Formula 1):

$$W^{'} = \alpha \times W$$

where $W \in \mathbb{R}^{m \times CCk_hk_w}$ is obtained by concatenating the $m$ candidate kernels as rows. According to Formula 1, the final convolution kernel of CondConv can be computed by appending a fully-connected layer with input dimension $m$ and output dimension $C \times C \times k_h \times k_w$ after the vector $\alpha$, which is far more efficient than the original CondConv implementation.
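The equivalence can be checked numerically with a short PyTorch sketch; the tensor names below (`candidate_kernels`, `alpha`) are illustrative assumptions, not the official CondConv code:

```python
import torch

# Shapes follow the description above.
m, C, kh, kw = 4, 16, 3, 3
candidate_kernels = torch.randn(m, C, C, kh, kw)   # W_1 ... W_m
alpha = torch.sigmoid(torch.randn(m))              # per-sample mixing weights

# Original CondConv view: weighted sum of the m candidate kernels.
W_sum = sum(alpha[i] * candidate_kernels[i] for i in range(m))

# Rewritten view (Formula 1): a fully-connected layer with input m and
# output C*C*kh*kw whose weight matrix stacks the flattened candidate kernels.
W_matrix = candidate_kernels.reshape(m, -1)        # (m, C*C*kh*kw)
W_fc = (alpha @ W_matrix).reshape(C, C, kh, kw)

print(torch.allclose(W_sum, W_fc, atol=1e-5))      # True: the two forms agree
```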

Rethinking SENet

The SE module first dynamically generates a $C$-dimensional vector $\alpha$ from the sample features and then uses it to weight the $C$ feature channels. The vector $\alpha$ is computed from global pooling, two fully-connected layers, a ReLU operation $\delta(\cdot)$ and a sigmoid operation $\sigma(\cdot)$:

$$\alpha = \sigma\left(W_{fc2} \times \delta\left(W_{fc1} \times \frac{1}{hw}\sum_{i \in h,\, j \in w} X_{c,i,j}\right)\right)$$

where $W_{fc1} \in \mathbb{R}^{C/r \times C}$, $W_{fc2} \in \mathbb{R}^{C \times C/r}$, and $\times$ denotes matrix multiplication. The main purpose of the two-layer fully-connected structure is to reduce the number of parameters: since $\alpha$ is a $C$-dimensional vector, a single fully-connected layer would introduce too many parameters. Once $\alpha$ is obtained, it can be applied before the convolution layer, $Y_c = W^{'}_c * (X \cdot \alpha)$, or after it, $Y_c = (W^{'}_c * X) \cdot \alpha$, where $\cdot$ denotes channel-wise multiplication. Both implementations are equivalent to weighting the weight matrix $W^{'}_c$: $Y_c = (W^{'}_c \cdot \alpha_c) * X$. Unlike Formula 1, there is no dimensionality reduction here; it is equivalent to a grouped fully-connected operation with input $C$, output $C \times C \times k_h \times k_w$, and $C$ groups.
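A small PyTorch sketch of this equivalence, taking the "after the convolution" case; the tensor names and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

C, kh, kw = 8, 3, 3
X = torch.randn(1, C, 16, 16)
W = torch.randn(C, C, kh, kw)
alpha = torch.sigmoid(torch.randn(C))

# SE applied after the convolution: Y_c = (W'_c * X) . alpha_c
y_after = F.conv2d(X, W, padding=1) * alpha.view(1, C, 1, 1)

# Equivalent view: fold alpha into the weight, Y_c = (W'_c . alpha_c) * X
W_scaled = W * alpha.view(C, 1, 1, 1)
y_folded = F.conv2d(X, W_scaled, padding=1)

print(torch.allclose(y_after, y_folded, atol=1e-4))  # True
```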

WeightNet Structure

From the above analysis, the grouped fully-connected layer has a minimum group count of 1 (CondConv) and a maximum equal to the input dimension (SENet), which leads to the general grouped fully-connected layer shown in Figure C.

As shown in Table 1, the grouped fully-connected layer has two hyperparameters, $M$ and $G$: $M$ controls the input dimension, and together with $G$ it controls the trade-off between the number of parameters and accuracy.

The structure of the WeightNet core module is shown in Figure 2. To reduce the number of parameters when generating the activation vector $\alpha$, a two-layer fully-connected structure with reduction ratio $r$ is used:

$$\alpha = \sigma\left(W_{fc2} \times W_{fc1} \times \frac{1}{hw}\sum_{i \in h,\, j \in w} X_{c,i,j}\right)$$

where $W_{fc1} \in \mathbb{R}^{C/r \times C}$, $W_{fc2} \in \mathbb{R}^{(M \times C) \times C/r}$, and $r = 16$. The subsequent kernel weight generation then directly uses a grouped fully-connected layer with input $M \times C$, output $C \times C \times k_h \times k_w$, and $G \times C$ groups. The computational costs of the convolution operation and the weight branch in WeightNet are $O(hwCCk_hk_w)$ and $O(MCCk_hk_w/G)$ respectively, and their parameter counts are zero and $O(M/G \times C \times C \times k_h \times k_w)$ respectively.
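Putting the pieces together, below is a minimal PyTorch sketch of a WeightNet convolution as described above. It is a simplified illustration under stated assumptions (no batch norm, bias-free layers, square feature maps, class and variable names invented here), not the official released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightNetConv(nn.Module):
    """Sketch of a WeightNet convolution: the kernel is generated per sample
    by a grouped fully-connected layer applied to the activation vector."""
    def __init__(self, C, kh=3, kw=3, M=2, G=2, r=16):
        super().__init__()
        self.C, self.kh, self.kw = C, kh, kw
        # Two-layer FC with reduction ratio r, expanding to the M*C activation vector.
        self.fc1 = nn.Conv2d(C, C // r, 1, bias=False)
        self.fc2 = nn.Conv2d(C // r, M * C, 1, bias=False)
        # Grouped FC: input M*C, output C*C*kh*kw, groups G*C.
        self.wn_fc = nn.Conv2d(M * C, C * C * kh * kw, 1, groups=G * C, bias=False)

    def forward(self, x):
        b = x.size(0)
        alpha = torch.sigmoid(self.fc2(self.fc1(F.adaptive_avg_pool2d(x, 1))))
        weight = self.wn_fc(alpha)                          # (b, C*C*kh*kw, 1, 1)
        weight = weight.reshape(b * self.C, self.C, self.kh, self.kw)
        # Apply the per-sample kernels by folding the batch into the channel
        # dimension and running one grouped convolution.
        x = x.reshape(1, b * self.C, x.size(2), x.size(3))
        out = F.conv2d(x, weight, padding=self.kh // 2, groups=b)
        return out.reshape(b, self.C, out.size(2), out.size(3))

x = torch.randn(4, 32, 56, 56)
print(WeightNetConv(32)(x).shape)   # torch.Size([4, 32, 56, 56])
```

Folding the batch into the channel dimension keeps the per-sample convolution as a single standard grouped convolution, which is why the whole weight branch stays cheap relative to the main convolution.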

Experiment


Comparison of the parameter-count/computation versus accuracy curves; for CondConv, 2x and 4x denote the number of candidate convolution kernels.

The effect of different $\lambda$ configurations of WeightNet, with $G = 2$ and a few dimension modifications to ShuffleNetV2 to ensure divisibility by $G$.

A comparative experiment of various attention modules.

Results on object detection tasks.

Conclusion


This paper unifies SENet and CondConv in the weight space and proposes the WeightNet framework, which dynamically generates convolution kernel weights from sample features and achieves a trade-off between accuracy and speed by adjusting its hyperparameters.





If this article helped you, please give it a like or share it.

For more content, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei].