Abstract: In a convolutional neural network, different features are extracted by using filters whose weights are automatically learned during training, and all of these extracted features are "combined" to make a decision.

This article is shared from the Huawei Cloud Community post "Summary of Common Convolutions in Neural Networks", original author: FDAFAD.

The purpose of convolution is to extract useful features from the input. In image processing, a variety of filters can be selected. Each type of filter helps to extract different features from the input image, such as horizontal/vertical/diagonal edge features. In a Convolutional Neural Network, different features are extracted by using filters whose weights are automatically learned during training, and all of these extracted features are “combined” to make a decision.


  1. 2D convolution
  2. 3D convolution
  3. 1×1 convolution
  4. Spatially separable convolution
  5. Depthwise separable convolution
  6. Grouped convolution
  7. Dilated convolution
  8. Deconvolution
  9. Involution

2D convolution

Single channel: In deep learning, convolution is essentially an element-wise multiply-and-accumulate of the kernel with the input signal. For an image with one channel, the following figure illustrates the operation:

The filter here is a 3 × 3 matrix with elements [[0,1,2], [2,2,0], [0,1,2]]. The filter slides over the input data; at each position, the overlapping elements are multiplied and summed, so each position yields a single number, and the final output is a 3 × 3 matrix.
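This sliding multiply-and-sum can be sketched in a few lines of NumPy. The 5 × 5 input below is made up for illustration; the kernel is the one from the text, and valid padding with stride 1 is assumed:

```python
import numpy as np

def conv2d_single(x, k):
    """Valid 2D cross-correlation (what deep-learning frameworks call convolution)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply the window by the kernel element-wise, then sum
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# the 3 x 3 filter from the text
k = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]])
# a hypothetical 5 x 5 single-channel input
x = np.arange(25).reshape(5, 5)
y = conv2d_single(x, k)
print(y.shape)  # (3, 3): a 5 x 5 input with a 3 x 3 kernel gives a 3 x 3 output
```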

Multi-channel: Since images generally have three RGB channels, convolution is most often applied to multi-channel input. The following diagram illustrates the multi-channel case:

Here the input layer is a 5 × 5 × 3 matrix with 3 channels, and the filter is a 3 × 3 × 3 matrix. First, each kernel in the filter is applied to one of the three channels of the input layer; these three convolutions produce three channels of size 3 × 3:

These three channels are then added element by element to form a single channel (3 × 3 × 1), which is the result of convolving the input layer (a 5 × 5 × 3 matrix) with the filter (a 3 × 3 × 3 matrix):
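The per-channel convolutions and the final summation can be collapsed into a single multiply-and-add over height, width, and channels. A minimal sketch, using random data of the shapes described above:

```python
import numpy as np

def conv2d_multichannel(x, k):
    """x: (H, W, C) input; k: (kh, kw, C) filter. Returns one single-channel map."""
    kh, kw, _ = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply-and-add across height, width AND channels: this is why
            # the three per-channel maps collapse into a single output channel
            out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))   # 5 x 5 x 3 input layer
k = rng.standard_normal((3, 3, 3))   # 3 x 3 x 3 filter
y = conv2d_multichannel(x, k)
print(y.shape)  # (3, 3): one channel, as described above
```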

3D convolution

As you can see in the previous illustration, this is actually performing a 3D operation, yet in deep learning it is still called 2D convolution: because the depth of the filter equals the depth of the input layer, the 3D filter moves only in two dimensions (the height and width of the image), and the result is a single channel.

3D convolution generalizes this: the depth of the filter is less than the depth of the input layer (that is, the kernel depth is smaller than the number of input channels), so the 3D filter must slide in all three dimensions (length, width, and height of the input layer). Each sliding position yields one value after a convolution with the filter, and as the filter moves through the 3D space, the structure of the output is also 3D.

The main difference between 2D convolution and 3D convolution is therefore the spatial dimensionality in which the filter slides. The advantage of 3D convolution is that it describes object relations in 3D space. Such 3D relationships are important in applications like 3D object segmentation and the reconstruction of medical images.
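The key point, that a filter shallower than the input also slides along depth and so produces a 3D output, can be sketched directly. The sizes below are made up for illustration:

```python
import numpy as np

def conv3d(x, k):
    """x: (D, H, W) input; k: (kd, kh, kw) with kd < D, so the filter also slides in depth."""
    kd, kh, kw = k.shape
    od = x.shape[0] - kd + 1
    oh = x.shape[1] - kh + 1
    ow = x.shape[2] - kw + 1
    out = np.zeros((od, oh, ow))
    for d in range(od):
        for i in range(oh):
            for j in range(ow):
                # one multiply-and-add per position in all three dimensions
                out[d, i, j] = np.sum(x[d:d + kd, i:i + kh, j:j + kw] * k)
    return out

x = np.ones((6, 5, 5))   # input depth 6
k = np.ones((2, 3, 3))   # filter depth 2 < 6, so it slides in depth too
z = conv3d(x, k)
print(z.shape)  # (5, 3, 3): the output itself is 3D
```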

1×1 convolution

On the surface, a 1×1 convolution seems to merely multiply each value in the feature maps by a number, but in fact it does more than that. Firstly, because the result passes through an activation layer, it is actually a nonlinear mapping; secondly, it can change the number of channels of the feature maps.

The figure above illustrates the operation on an input layer with dimensions H × W × D. After a 1 × 1 convolution with a filter of size 1 × 1 × D, the output channel has dimensions H × W × 1. If we perform such a 1 × 1 convolution N times and then stack the results, we get an output layer with dimensions H × W × N.
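A 1 × 1 convolution is just a per-pixel linear map across channels, which makes the channel change easy to show. The sizes H, W, D, N below are arbitrary illustrative values:

```python
import numpy as np

H, W, D, N = 4, 4, 8, 3  # hypothetical sizes
rng = np.random.default_rng(1)
x = rng.standard_normal((H, W, D))       # input layer H x W x D
w = rng.standard_normal((N, D))          # N filters, each of shape 1 x 1 x D

# each output channel n is a weighted sum of the D input channels at every pixel
y = np.einsum('hwd,nd->hwn', x, w)
print(y.shape)  # (4, 4, 3): spatial size unchanged, channel count D -> N
```

Note that spatial dimensions are untouched; only the channel dimension is remapped, which is why 1 × 1 convolutions are commonly used to shrink or expand the number of channels cheaply.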

Spatially separable convolution

In a separable convolution, we can split the kernel operation into multiple steps. Denote the convolution by y = conv(x, k), where y is the output image, x is the input image, and k is the kernel. Next, assume k can be computed as k = k1.dot(k2), i.e. k is the product of a column vector k1 and a row vector k2. This makes it a separable convolution, because we can obtain the same result by performing two one-dimensional convolutions with k1 and k2, instead of a two-dimensional convolution with k.

Take, for example, the Sobel kernel commonly used in image processing. You can obtain the same 3 × 3 kernel by multiplying the column vector [1, 2, 1]ᵀ with the row vector [1, 0, −1]. Performing the same operation then requires only six parameters instead of nine.
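The factorization is a plain outer product, which we can verify directly (this reconstructs the common horizontal Sobel variant):

```python
import numpy as np

col = np.array([[1], [2], [1]])   # 3 x 1 column vector (smoothing)
row = np.array([[1, 0, -1]])      # 1 x 3 row vector (differencing)
sobel = col @ row                 # outer product reconstructs the 3 x 3 kernel
print(sobel)
# [[ 1  0 -1]
#  [ 2  0 -2]
#  [ 1  0 -1]]

# storing col and row needs 3 + 3 = 6 parameters instead of 9
print(col.size + row.size)  # 6
```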

Depthwise separable convolution

Unlike the spatially separable convolution of the previous section, a depthwise separable convolution in deep learning first performs a spatial convolution while keeping the channels independent (the depthwise step), and then a 1 × 1 convolution across channels (the pointwise step).

Suppose we have a 3 × 3 convolution layer with 16 input channels and 32 output channels. In a standard convolution, each of the 16 channels is traversed by 32 different 3 × 3 kernels, producing 512 (16 × 32) feature maps. Next, for each output channel, the feature maps from all input channels are summed into one large feature map; since we do this 32 times, we get the desired 32 output channels.

Now, for the same example, how does a depthwise separable convolution behave? We traverse the 16 channels with one 3 × 3 kernel each, which gives 16 feature maps. Then, before any merging, we apply 32 separate 1 × 1 convolutions to these 16 feature maps and sum channel by channel. This results in 656 (16×3×3 + 16×32×1×1) parameters, as opposed to the 4608 (16×32×3×3) parameters above.

More details are given below, building on the 2D convolution and the 1 × 1 convolution of the previous sections. Let's step through standard 2D convolution quickly. As a concrete example, assume the input layer has size 7 × 7 × 3 (height × width × channels) and the filter size is 3 × 3 × 3; after a 2D convolution with one filter, the output layer has size 5 × 5 × 1 (only one channel), as shown in the figure below:

In general, multiple filters are applied between two neural-network layers; assume 128 filters here. These 128 2D convolutions yield 128 output maps of size 5 × 5 × 1, which are then stacked into a single layer of size 5 × 5 × 128. The spatial dimensions (height and width) have shrunk, while the depth has increased, as shown in the figure below:

Let's see how the same transformation can be achieved using a depthwise separable convolution. First, we apply a depthwise convolution to the input layer. Instead of using a single filter of size 3 × 3 × 3, we use three kernels (each of size 3 × 3 × 1) in the 2D convolution. Each kernel convolves only one channel of the input layer, and each convolution yields a map of size 5 × 5 × 1; stacking these maps produces an output image of size 5 × 5 × 3. This way, the depth of the image remains the same as the original.

Depthwise separable convolution, step 1: use three kernels (each of size 3 × 3 × 1) in the 2D convolution instead of a single filter of size 3 × 3 × 3. Each kernel convolves only one channel of the input layer, yielding a map of size 5 × 5 × 1; stacking these maps gives an output image of size 5 × 5 × 3. Step 2 of the depthwise separable convolution is to increase the depth: we perform a 1 × 1 convolution with a 1 × 1 × 3 kernel, and each 1 × 1 × 3 kernel convolves the 5 × 5 × 3 intermediate image into a map of size 5 × 5 × 1.

So in this case, after 128 such 1 × 1 convolutions, we get a layer of size 5 × 5 × 128.
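The parameter savings claimed in the text can be checked with a few lines of arithmetic (biases ignored):

```python
# Parameter counts for the 16-in / 32-out / 3x3 example from the text.
c_in, c_out, k = 16, 32, 3

standard = c_in * c_out * k * k      # one 3x3 x c_in filter per output channel
depthwise = c_in * k * k             # one 3x3x1 kernel per input channel
pointwise = c_in * c_out * 1 * 1     # 32 filters of size 1x1x16
separable = depthwise + pointwise

print(standard)   # 4608
print(separable)  # 656
```

The separable version uses roughly one seventh of the parameters here, which is where the efficiency of architectures built on depthwise separable convolutions comes from.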

Grouped convolution

Grouped convolution first appeared in AlexNet. Due to the limited hardware resources at the time, the convolution operations could not all be processed on the same GPU during AlexNet's training, so the authors distributed the feature maps across multiple GPUs for processing and finally fused the results.

The following describes how grouped convolution is implemented. First, the traditional 2D convolution is shown in the figure below: an input layer of 7 × 7 × 3 is converted to an output layer of 5 × 5 × 128 by applying 128 filters (each of size 3 × 3 × 3). For the general case, this can be summarized as follows: an input layer of size H_in × W_in × D_in is converted to an output layer of size H_out × W_out × D_out by applying D_out kernels, each of size h × w × D_in. In grouped convolution, the filters are split into different groups, and each group performs a traditional 2D convolution over a fraction of the depth. The example below makes this a little clearer.
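A quick sketch of the parameter count makes the effect of grouping concrete. With g groups, each group's filters only see D_in / g input channels and produce D_out / g outputs, so parameters shrink by a factor of g:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with the given channel grouping."""
    assert c_in % groups == 0 and c_out % groups == 0
    # each group's filters only see c_in/groups channels
    return groups * (c_out // groups) * (c_in // groups) * k * k

print(conv_params(3, 128, 3))             # the 7x7x3 -> 5x5x128 example: 3456
print(conv_params(16, 32, 3))             # standard conv: 4608
print(conv_params(16, 32, 3, groups=2))   # two groups: 2304, half the parameters
```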

Dilated convolution

Dilated (atrous) convolution introduces another parameter to the convolution layer: the dilation rate, which defines the spacing between the values in the kernel. A 3×3 kernel with a dilation rate of 2 has the same field of view as a 5×5 kernel while using only nine parameters; imagine taking a 5×5 kernel and deleting every second row and column. The network can thus obtain a larger receptive field at the same computational cost. Dilated convolution is particularly popular in real-time segmentation; use it when you need a larger field of view and cannot afford multiple convolutions or larger kernels.

Intuitively, dilated convolution makes the kernel "inflate" by inserting spaces between its elements. The added parameter l (the dilation rate) indicates how far we want to stretch the kernel. The following figure shows the kernel size when l = 1, 2, 4 (when l = 1, dilated convolution reduces to standard convolution).
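The figure's l = 1, 2, 4 progression follows from a one-line formula for the region a dilated kernel covers:

```python
def effective_kernel_size(k, rate):
    """Span of a k x k kernel dilated by `rate`: k + (k - 1) * (rate - 1)."""
    return k + (k - 1) * (rate - 1)

for rate in (1, 2, 4):
    print(rate, effective_kernel_size(3, rate))
# rate 1 -> 3 (standard convolution)
# rate 2 -> 5 (same view as a 5x5 kernel, still only 9 parameters)
# rate 4 -> 9
```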


Deconvolution

This is different from one-dimensional deconvolution as computed in signal processing. The FCN authors call it backwards convolution, and it has been claimed that "deconvolution layer" is a very unfortunate name that should rather be called a transposed convolutional layer. As we know, a CNN contains convolution layers and pooling layers: convolution layers extract features from the image, while pooling layers reduce the image by half to screen the important features. For a classic image-recognition CNN trained on a dataset such as ImageNet, the final output is 1 × 1 × 1000, where 1000 is the number of categories and 1 × 1 is the collapsed spatial size. The FCN authors, and the end-to-end researchers after them, apply deconvolution to this final result (in fact the FCN authors' final output is not 1 × 1 but 1/32 of the image size, which does not affect the use of deconvolution). The principle of image deconvolution here is the same as that of full convolution in Figure 6: deconvolution is used to make the image larger. The method used by the FCN authors is a variant of the deconvolution described here, so that corresponding pixel values can be obtained and the network can work end to end.

There are two commonly used types of deconvolution:

Method 1: Full convolution, which can enlarge the original domain.

Method 2: Record the pooling indices, then expand the space and fill it in with convolution. The process of image deconvolution is as follows:

Input: 2×2; convolution kernel: 4×4; sliding stride: 3; output: 7×7

That is, a 2×2 input image is deconvolved through a 4×4 convolution kernel with a stride of 3:

1. Perform a full convolution for each pixel of the input image. From the full-convolution size formula, each pixel produces a map of size 1 + 4 − 1 = 4, i.e. a 4×4 feature map.

2. Fuse (i.e., add) the four feature maps with a stride of 3. For example, the red feature map stays at its original input position (top left) and the green one at its original position (top right); a stride of 3 means the maps are placed three pixels apart and added where they overlap. For instance, the output element at row 1, column 4 is the sum of the element at row 1, column 4 of the red feature map and the element at row 1, column 1 of the green feature map, and so on.

It can be seen that the output size of deconvolution is determined by the kernel size and the sliding stride. Let in be the input size, k the kernel size, s the sliding stride, and out the output size.

Then out = (in − 1) × s + k, which here gives (2 − 1) × 3 + 4 = 7.
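The two-step recipe above (fully convolve each pixel, then overlap-add with the stride) can be implemented directly, and the output size formula falls out of it:

```python
import numpy as np

def transposed_conv2d(x, k, stride):
    """Deconvolution as described above: scale the kernel by each input pixel,
    place the resulting patches `stride` apart, and add where they overlap."""
    n, m = x.shape
    kh, kw = k.shape
    out = np.zeros(((n - 1) * stride + kh, (m - 1) * stride + kw))
    for i in range(n):
        for j in range(m):
            # each pixel contributes a kernel-sized patch at offset i*stride, j*stride
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * k
    return out

x = np.ones((2, 2))          # 2x2 input
k = np.ones((4, 4))          # 4x4 kernel
y = transposed_conv2d(x, k, stride=3)
print(y.shape)  # (7, 7), matching out = (2 - 1) * 3 + 4 = 7
```

With stride 3 and a 4×4 kernel, adjacent patches overlap by one row/column; those overlap cells are exactly where the fusion step adds values together.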


Involution: Inverting the Inherence of Convolution for Visual Recognition (CVPR’21)

Code open source address: https://github.com/d-li14/inv…

Despite the rapid development of neural-network architectures, convolution is still the main component in building deep neural networks. Inspired by classical image filtering, the convolution kernel has two notable properties: it is spatial-agnostic and channel-specific. In the spatial domain, the former property ensures that the kernel is shared among different locations and achieves translation invariance. In the channel domain, a spectrum of kernels is responsible for collecting the different information encoded in different channels, satisfying the latter property. In addition, since the emergence of VGGNet, modern neural networks keep the convolution kernel compact by limiting its spatial span to no more than 3×3.

On the one hand, although the spatial-agnostic and spatially compact properties improve efficiency and explain translation invariance, they deprive the convolution kernel of the ability to adapt to different visual patterns at different spatial locations. In addition, locality limits the receptive field of convolution, posing a challenge to small targets or blurry images. On the other hand, it is well known that the inter-channel redundancy within the convolution kernel is prominent in many classic deep neural networks, which limits the flexibility of the kernel across different channels.

To overcome the above limitations, the authors of this paper propose an operation called involution. Compared with standard convolution, involution has symmetrically inverted properties, namely spatial-specific and channel-agnostic. Specifically, the involution kernel differs across spatial positions but is shared across channels. Because of this spatial specificity, if the involution kernel were parameterized as a fixed-size matrix like a convolution kernel and updated by back-propagation, the learned kernel could not transfer between input images of different resolutions. To handle variable feature resolutions, the involution kernel belonging to a specific spatial position is instead generated as an instance conditioned only on the feature vector at that position. In addition, the authors reduce kernel redundancy by sharing the involution kernel along the channel dimension.

Combining these two factors, the computational complexity of involution increases linearly with the number of feature channels, and the dynamically parameterized involution kernel has wide coverage in the spatial dimension. Through this inverted design, the involution proposed in this paper has the dual advantages of convolution:

1: Involution can aggregate context over a wider spatial extent, overcoming the difficulty of modeling long-range interactions.

2: Involution can adaptively allocate weights at different positions, prioritizing the most informative visual elements in the spatial domain.

As is well known, recent self-attention research has shown that many tasks benefit from Transformer-based modeling to capture long-range dependencies of features; in these studies, pure self-attention can be used to build independent models with good performance. This paper reveals that self-attention, which models the relationships between neighboring pixels through a complex formula over the kernel structure, is actually a special case of involution. In contrast, the kernel used in this paper is generated from a single pixel, rather than from its relationship to adjacent pixels. Furthermore, the authors demonstrate experimentally that even a simple version can achieve the accuracy of self-attention.

The calculation process of involution is shown in the figure below:

For the feature vector at a coordinate point of the input feature map, first expand it into the kernel shape through the ∅ transformation (fc-bn-relu-fc) and a reshape (channel-to-space), obtaining the involution kernel for that coordinate point; then multiply-and-add it with the feature vectors in that point's neighborhood on the input feature map to obtain the final output feature map. The specific operations and the shape changes of the tensors are as follows:
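A minimal NumPy sketch of this process follows. All sizes are toy values, and the kernel-generating function ∅ is simplified to a single linear map instead of the paper's fc-bn-relu-fc; the point is the data flow: a per-position kernel generated from that position's feature vector, applied over its K×K neighborhood, and shared across C/G channels per group:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K, G = 6, 6, 8, 3, 2          # toy sizes; G kernel-sharing groups
x = rng.standard_normal((H, W, C))

# simplified stand-in for the paper's fc-bn-relu-fc kernel generator:
# maps each pixel's C-dim feature vector to a K*K*G kernel
W_phi = rng.standard_normal((C, K * K * G))
kernels = (x.reshape(-1, C) @ W_phi).reshape(H, W, K, K, G)

pad = K // 2
xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad the spatial dims
out = np.zeros_like(x)
for i in range(H):
    for j in range(W):
        patch = xp[i:i + K, j:j + K, :]   # K x K x C neighborhood of (i, j)
        kern = kernels[i, j]              # K x K x G kernel, specific to (i, j)
        for c in range(C):
            g = c * G // C                # channels within a group share a kernel
            out[i, j, c] = np.sum(patch[:, :, c] * kern[:, :, g])
print(out.shape)  # (6, 6, 8): same shape as the input feature map
```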

In addition, implementations of some MMClassification, MMSegmentation, and MMDetection models are provided based on the MM-series codebases.
