The core of the IGC series of networks is the extreme use of group convolution: splitting a conventional convolution into multiple group convolutions greatly reduces the number of parameters, while the complementarity principle and the permutation operations guarantee information flow between groups with a minimal number of parameters. Overall, however, although an IGC module reduces parameters and computation, the network structure becomes more fragmented, which may make it slower in actual use.
IGCV1
Interleaved Group Convolutions for Deep Neural Networks
- Paper address: Arxiv.org/abs/1707.02…
Introduction
The Interleaved Group Convolution (IGC) module contains a primary group convolution and a secondary group convolution, which extract features over the primary and secondary partitions respectively. The primary partitions are obtained by grouping the input features: for example, the input features are divided into $L$ partitions, each containing $M$-dimensional features, and the corresponding secondary partitions divide the same channels into $M$ partitions, each containing $L$-dimensional features. The primary group convolution performs grouped feature extraction on the input feature map, while the secondary group convolution, a $1\times 1$ convolution, fuses the outputs of the primary group convolution. The IGC module is similar in form to depthwise separable convolution, but the idea of grouping runs through the whole module, which is the key to saving parameters. In addition, two permutation operations are inserted in the module to guarantee information exchange between channels.
Interleaved Group Convolutions
The IGC module is shown in Figure 1. The primary group convolution extracts grouped features from the input; its outputs are then permuted across groups so that the subsequent secondary group convolution can fuse features from different groups, and the outputs of the secondary convolution are permuted back to the original order and concatenated as the final output.
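As a concrete illustration, here is a minimal PyTorch sketch of this structure (my own sketch, not the paper's code); the `IGCBlock` name, the interleaving used for the permutation, and the example sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class IGCBlock(nn.Module):
    """Sketch of an IGCV1 block: primary group conv -> permute ->
    secondary 1x1 group conv -> permute back."""
    def __init__(self, channels, L, M, kernel_size=3):
        super().__init__()
        assert channels == L * M
        # Primary group convolution: L groups, each over M channels.
        self.primary = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=L, bias=False)
        # Secondary group convolution: M groups, each over L channels, 1x1.
        self.secondary = nn.Conv2d(channels, channels, 1, groups=M, bias=False)
        self.L, self.M = L, M

    @staticmethod
    def permute(x, groups):
        # Channel interleaving, playing the role of the permutation matrix P.
        n, c, h, w = x.shape
        return (x.view(n, groups, c // groups, h, w)
                 .transpose(1, 2).reshape(n, c, h, w))

    def forward(self, x):
        y = self.primary(x)
        y = self.permute(y, self.L)   # gather the m-th channel of every primary partition
        y = self.secondary(y)
        y = self.permute(y, self.M)   # restore the original channel order
        return y

x = torch.randn(1, 48, 32, 32)               # L=2 partitions of M=24 channels
print(IGCBlock(48, L=2, M=24)(x).shape)      # torch.Size([1, 48, 32, 32])
```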
- Primary group convolutions
Assume there are $L$ primary partitions, each containing $M$-dimensional features. The primary group convolution is given by Formula 1, where $z_l$ is the $(MS)$-dimensional feature vector gathered within the receptive field of the convolution kernel, $S$ is the kernel size, $W^p_{ll}$ is the convolution kernel of the $l$-th partition with size $M\times (MS)$, and $x=[z^{\top}_1\ z^{\top}_2\ \cdots\ z^{\top}_L]^{\top}$ is the input of the primary group convolution.
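Formula 1 is not reproduced in this copy; from the definitions above it can be reconstructed (my reconstruction) as the block-diagonal form

$$
y=\begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_L\end{bmatrix}
=\begin{bmatrix}W^p_{11}&0&\cdots&0\\ 0&W^p_{22}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&W^p_{LL}\end{bmatrix}
\begin{bmatrix}z_1\\ z_2\\ \vdots\\ z_L\end{bmatrix},
\qquad y_l=W^p_{ll}\,z_l .
$$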
- Secondary group convolutions
The outputs of the primary group convolution $\{y_1, y_2, \cdots, y_L\}$ are rearranged into $M$ secondary partitions, each containing $L$-dimensional features, so that every secondary partition contains features coming from every primary partition. The $m$-th secondary partition is composed of the $m$-th dimension of each primary partition:
$\bar{y}_m$ corresponds to the $m$-th secondary partition, $y_{lm}$ is the $m$-th dimension of $y_l$, and $P$ is the permutation matrix. The secondary group convolution is then computed on the secondary partitions:
$W^d_{mm}$ is the $1\times 1$ convolution kernel of the $m$-th secondary partition, with size $L\times L$. The outputs of the secondary group convolution are rearranged back into the order of the primary partitions, giving the $L$ rearranged partitions $\{x^{'}_1, x^{'}_2, \cdots, x^{'}_L\}$, computed as follows:
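The formulas referenced above are missing from this copy; a reconstruction from the definitions (written so that it is consistent with the composite kernel $W = P W^d P^{\top} W^p$ given below) is:

$$
\bar{y}_m=\begin{bmatrix}y_{1m}\\ y_{2m}\\ \vdots\\ y_{Lm}\end{bmatrix},\qquad
\bar{y}=\begin{bmatrix}\bar{y}_1\\ \bar{y}_2\\ \vdots\\ \bar{y}_M\end{bmatrix}=P^{\top}y,\qquad
\bar{x}_m=W^d_{mm}\,\bar{y}_m,\qquad
\begin{bmatrix}x^{'}_1\\ x^{'}_2\\ \vdots\\ x^{'}_L\end{bmatrix}=P\begin{bmatrix}\bar{x}_1\\ \bar{x}_2\\ \vdots\\ \bar{x}_M\end{bmatrix}.
$$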
Combining the formulas of the primary and secondary group convolutions, the IGC module can be summarized as:
$W^p$ and $W^d$ are block-diagonal matrices. Defining $W = P W^d P^{\top} W^p$ as the composite convolution kernel, we obtain:
That is, the IGC module can be regarded as a conventional convolution whose kernel is the product of two sparse kernels.
Analysis
- Wider than regular convolutions
Considering a single spatial position of the input, the number of parameters of the IGC module is:
$G=ML$ is the width (number of dimensions) covered by the IGC module. For a conventional convolution whose input and output dimensions are both $C$, the number of parameters is:
Given the same number of parameters, $T_{igc}=T_{rc}=T$, we have $C^2=\frac{1}{S}T$ and $G^2=\frac{1}{S/L+1/M}T$, which further gives:
For $S=3\times 3=9$, $G>C$ whenever $L>1$; that is, under common settings the IGC module covers more input dimensions than a conventional convolution with the same number of parameters.
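From the kernel sizes given above (each of the $L$ primary kernels is $M\times(MS)$ and each of the $M$ secondary kernels is $L\times L$), the parameter count is $T_{igc}=LM^2S+ML^2=G(MS+L)$, consistent with $G^2=\frac{1}{S/L+1/M}T$. The snippet below is a quick numeric sanity check of the width comparison (my own example; the values of $L$ and $M$ are arbitrary):

```python
import math

S = 9          # 3x3 kernel
L, M = 4, 16   # primary partitions and channels per partition (example values)

G = L * M                             # width covered by the IGC module
T_igc = L * M * (M * S) + M * L * L   # primary + secondary parameters = G * (M*S + L)

# Regular convolution with the same parameter budget: T = S * C^2  =>  C = sqrt(T / S)
C = math.sqrt(T_igc / S)

print(f"parameters    T = {T_igc}")
print(f"IGC width     G = {G}")
print(f"regular width C = {C:.1f}")   # G > C whenever L > 1
```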
- When is the widest?
The paper studies the influence of the partition numbers $L$ and $M$ on the convolution width, and rearranges Formula 7 to obtain:
Equality in Formula 12 holds when $L=MS$; for a given number of parameters, the upper bound of the convolution width is:
When $L=MS$, the convolution width is maximal.
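The upper bound referred to here is missing from this copy; it can be recovered (my derivation) by applying the AM-GM inequality to $G^2=\frac{T}{S/L+1/M}$ with $G=LM$:

$$
\frac{S}{L}+\frac{1}{M}\;\ge\;2\sqrt{\frac{S}{LM}}=2\sqrt{\frac{S}{G}}
\quad\Longrightarrow\quad
G^{3/2}\le\frac{T}{2\sqrt{S}},
\qquad\text{i.e.}\qquad
G\le\Big(\frac{T}{2\sqrt{S}}\Big)^{2/3},
$$

with equality exactly when $S/L=1/M$, i.e. $L=MS$.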
The paper lists the width under different settings; the width is largest when $L\simeq 9M$ (i.e. $L=MS$ with $S=9$).
- Wider leads to better performance?
A fixed number of parameters means that the effective parameters of the primary and secondary group convolutions are fixed. As the input gains more feature dimensions (a larger width), the composite convolution kernel becomes larger and sparser, which may hurt performance. For this reason, the paper also compares the performance of different configurations, as shown in Figure 3.
Experiment
The network structures for the small-scale experiments, with a comparison of parameter counts and computation; note that the implemented block is IGC+BN+ReLU.
Performance comparison on CIFAR-10.
Performance comparison with SOTA methods on multiple datasets.
Conclusion
The IGC module uses two layers of group convolution together with permutation operations to save parameters and computation, and the structural design is simple and clever. The paper also gives thorough derivations and analysis of IGC. Note, however, that the lightness of the IGC module is measured in parameter count and computation; as pointed out in ShuffleNetV2, parameters and FLOPs are not equivalent to inference latency.
IGCV2
IGCV2: Interleaved Structured Sparse Convolutional Neural Networks
- Paper address: Arxiv.org/abs/1804.06…
Introduction
IGCV1 decomposes an ordinary convolution into two group convolutions to reduce parameters while keeping complete information extraction. However, the authors observed that, because the numbers of groups in the primary and secondary convolutions are complementary, the secondary convolution usually has fewer groups, each group has more dimensions, and its kernel is therefore relatively dense. IGCV2 therefore proposes Interleaved Structured Sparse Convolution, which replaces the single dense secondary group convolution with multiple consecutive sparse group convolutions, each with enough groups to keep its kernel sparse.
Interleaved Structured Sparse Convolutions
The core structure of IGCV2 is shown in Figure 1: multiple sparse group convolutions replace the originally dense secondary convolution, which can be formulated as follows:
$P_l W_l$ is a sparse matrix, where $P_l$ is a permutation matrix that rearranges the dimensions and $W_l$ is a sparse block-diagonal matrix whose groups each have dimension $K_l$.
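The formula itself is not preserved here; from the description, the composition of the $L$ sparse layers should be (my reconstruction):

$$
x' = P_L W_L \, P_{L-1} W_{L-1} \cdots P_1 W_1 \, x ,
$$

where each factor $P_l W_l$ is sparse but their product is dense.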
IGCV2 follows a complementarity principle: each group of one group convolution should be connected with every group of the next group convolution, and connected through exactly one dimension of each group, i.e. there is exactly one path between any pair of groups. Figure 1 gives an intuition; the key lies in the permutation scheme. From the complementarity principle, the relationship between the input dimension $C$ and the per-layer group dimension $K_l$ can be obtained:
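The missing relation is, under the complementary condition (my reconstruction):

$$
C=\prod_{l=1}^{L}K_l ,
$$

since following an output channel back through the $L$ layers multiplies its fan-in by $K_l$ at each layer, and it must reach every one of the $C$ input channels exactly once.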
In addition, similar to the derivation in IGCV1, IGCV2 has the fewest parameters when $L=\log(SC)$, where $S$ is the size of the spatial convolution kernel and the remaining group convolutions in this calculation are $1\times 1$.
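A sketch of why (my reconstruction of the argument, assuming one spatial group convolution of kernel size $S$ followed by $1\times 1$ group convolutions, all of width $C$): a $1\times 1$ group convolution with group dimension $K_l$ has $C K_l$ parameters and the spatial one has $S C K_1$, so

$$
Q = C\Big(SK_1+\sum_{l=2}^{L}K_l\Big)\;\ge\;C\,L\,\Big(S\prod_{l=1}^{L}K_l\Big)^{1/L}=C\,L\,(SC)^{1/L}
$$

by AM-GM, using the complementary condition $\prod_l K_l=C$; minimizing $L\,(SC)^{1/L}$ over $L$ gives $L=\log(SC)$.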
Discussions
The paper also discusses the design of IGCV2:
- Non-structured sparse kernels: instead of using structured sparse matrices, sparsity could be induced by regularization, but this restricts the expressive ability of the network.
- Complementary condition: the complementarity principle is not strictly necessary, but a guideline for efficiently designing group convolutions; a composed convolution can also be designed to be sparse without fully connecting input and output.
- Sparse matrix products vs. low-rank matrix products: decomposition into low-rank matrices is a common compression method, while decomposition into sparse matrices has been little studied; a next step is to combine sparse and low-rank matrix decompositions to compress convolutional networks.
Experiment
The network structures compared in the paper, where the primary convolution of IGCV2 uses depthwise convolution.
Comparison with similar network structures.
Comparison with SOTA networks.
Conclusion
Building on IGCV1, IGCV2 goes further in sparsification, replacing the original dense secondary convolution with multiple sparse group convolutions. The paper again gives thorough derivations to analyse the principle and hyperparameters of IGCV2. However, as mentioned above, parameter count and computation are not equivalent to inference latency, which needs to be measured on real devices.
IGCV3
IGCV3: Interleaved Low-Rank Group Convolutions for Efficient Deep Neural Networks
- Paper address: Arxiv.org/abs/1806.00…
Introduction
Based on the ideas of IGC and the bottleneck block, IGCV3 combines low-rank and sparse convolution kernels to compose a dense convolution kernel. As shown in Figure 1, IGCV3 uses low-rank sparse group convolutions (as in a bottleneck module) to expand the dimension of the grouped input features and to reduce the output back to the original dimension, with a depthwise convolution in the middle for feature extraction. In addition, a relaxed complementarity principle, analogous to the strict complementarity principle of IGCV2, is introduced to handle the case where the input and output dimensions of the group convolutions differ.
Interleaved Low-Rank Group Convolutions
IGCV3 mainly extends the structure of IGCV2 by replacing the original group convolutions with low-rank group convolutions: a $G_1$-group low-rank pointwise group convolution, a depthwise convolution, and a $G_2$-group low-rank pointwise group convolution. The two low-rank group convolutions are used respectively to expand the feature dimension and to restore it to the original size, which can be formulated as:
$P^1$ and $P^2$ are permutation matrices, $W^1$ is the $3\times 3$ depthwise convolution, and $\hat{W}^0$ and $W^2$ are low-rank sparse matrices, structured as follows:
$\hat{W}^g_{j,k}\in \mathbb{R}^C$ contains $\frac{C}{G_1}$ non-zero weights and corresponds to the $k$-th convolution kernel of the $g$-th group of the first group convolution, which is used to expand the dimension. $W^2_g$ is the convolution kernel of the $g$-th group of the second group convolution (the third group convolution in Figure 2), which is used to reduce the dimension back to its original size.
Since the input and output dimensions of IGCV3's group convolutions differ, the complementarity principle proposed in IGCV2 cannot be satisfied (there are multiple connection paths between input and output), and the permutation cannot be applied as before. To solve this, the paper proposes the concept of super-channels: the input, output, and intermediate dimensions are all divided into $C_s$ super-channels, where each input/output super-channel contains $\frac{C}{C_s}$ channels and each intermediate super-channel contains $\frac{C_{int}}{C_s}$ channels, as shown in Figure 2. The complementarity principle and the permutation are then applied at the granularity of super-channels, which gives the relaxed complementarity principle, defined as follows:
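For concreteness, here is a minimal PyTorch sketch of such a block (not the official implementation; the expansion ratio, the BN/ReLU placement, and the simple interleaving used in place of the super-channel permutation are my own assumptions):

```python
import torch
import torch.nn as nn

def interleave(x, groups):
    # Channel permutation (stands in for P^1 / P^2).
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).reshape(n, c, h, w))

class IGCV3Block(nn.Module):
    """Sketch of an IGCV3-style block:
    G1-group 1x1 expand -> permute -> 3x3 depthwise -> permute -> G2-group 1x1 reduce."""
    def __init__(self, channels, expansion=6, g1=2, g2=2, stride=1):
        super().__init__()
        mid = channels * expansion
        self.g1, self.g2 = g1, g2
        self.expand = nn.Sequential(
            nn.Conv2d(channels, mid, 1, groups=g1, bias=False),
            nn.BatchNorm2d(mid))
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.reduce = nn.Sequential(
            nn.Conv2d(mid, channels, 1, groups=g2, bias=False),
            nn.BatchNorm2d(channels))
        self.use_residual = (stride == 1)

    def forward(self, x):
        y = self.expand(x)
        y = interleave(y, self.g1)   # mix channels coming from different expand groups
        y = self.depthwise(y)
        y = interleave(y, self.g2)   # mix channels before the grouped reduction
        y = self.reduce(y)
        return x + y if self.use_residual else y

x = torch.randn(1, 32, 56, 56)
print(IGCV3Block(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```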
Experiment
Comparison with the previous two versions; $G_1$ and $G_2$ of IGCV3-D are 1 and 2, respectively.
Comparison with other networks on ImageNet.
This experiment examines how ReLU is placed, mainly following MobileNetV2's way of using ReLU.
Comparison of different numbers of groups.
Conclusion
IGCV3 integrates the main structure of MobileNetV2 on top of IGCV2 and uses more expressive low-rank sparse group convolutions, making its overall structure very close to MobileNetV2. The core still lies in sparse group convolution and the permutation operation; although the performance is slightly better than MobileNetV2, the overall novelty is somewhat limited.
Conclusion
The core of the IGC series of networks is the extreme use of group convolution: splitting a conventional convolution into multiple group convolutions greatly reduces the number of parameters, while the complementarity principle and the permutation operations guarantee information flow between groups with a minimal number of parameters. Overall, however, although an IGC module reduces parameters and computation, the network structure becomes more fragmented, which may make it slower in actual use.
If this article was helpful to you, please give it a thumbs up or check it out
For more content, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei]