
This post was published by Forrestlin in the Cloud+ Community.

Introduction: The transpose convolution layer, also known as the deconvolution layer or fractionally strided convolution layer, appears more and more often in recently proposed convolutional neural networks, especially in generative adversarial networks (GANs), where it shows up in the up-sampling part of the generator network to recover reduced dimensions. So what is the relationship and difference between the transpose convolution layer and the ordinary (forward) convolution layer, and how is the transpose convolution layer actually implemented? The author summarizes them in this article based on a recent pre-research project.

1. Convolution layer and fully connected layer

Before CNNs were proposed, the artificial neural networks we usually talked about were feedforward neural networks. The main difference is that a CNN uses convolution layers, while a feedforward neural network uses fully connected layers. A fully connected layer assumes that every node of the current layer is needed by every node of the next layer, so the layers are connected by multiplying with a dense weight matrix. A convolution layer assumes that, for a given node of the next layer, most nodes of the current layer are actually not needed, hence the concept of a convolution kernel: if the kernel size is n×m, the kernel assumes that only n×m nodes of the current layer matter for each node it maps to in the next layer (the exact mapping is described in the next section). At this point some beginners may think the fully connected layer can do the same thing by simply setting some weights of its weight matrix to 0; for example, if I compute the second node of the current layer and decide that I do not need the first node of the previous layer, I set w01 = 0. This is indeed true: the convolution layer can be regarded as a special case of the fully connected layer, and the convolution kernel can be expanded into a sparse fully connected weight matrix containing many zeros. The figure below shows the fully connected weight matrix expanded from a 3×3 convolution kernel applied to a 4×4 image to produce a 2×2 output.
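To make this equivalence concrete, here is a minimal NumPy sketch (an illustration, not code from the original post) that expands a hypothetical 3×3 kernel into the 4×16 sparse weight matrix described above and checks that it produces the same 2×2 output as the direct convolution.

```python
# A minimal sketch: a 3x3 kernel applied to a 4x4 input (no padding, stride 1)
# rewritten as a sparse 4x16 weight matrix acting on the flattened input,
# exactly like a fully connected layer with most weights fixed to zero.
import numpy as np

kernel = np.arange(1, 10).reshape(3, 3).astype(float)   # hypothetical 3x3 kernel
image = np.random.rand(4, 4)                            # hypothetical 4x4 input

# Build the 4x16 sparse matrix: one row per output pixel.
W = np.zeros((4, 16))
for out_row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for ki in range(3):
        for kj in range(3):
            W[out_row, (i + ki) * 4 + (j + kj)] = kernel[ki, kj]

# Direct 2x2 convolution for comparison.
direct = np.array([[(kernel * image[i:i + 3, j:j + 3]).sum() for j in range(2)]
                   for i in range(2)])

matmul = (W @ image.reshape(-1)).reshape(2, 2)
assert np.allclose(direct, matmul)   # both routes give the same 2x2 output
```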

As you can see, the 4×16 matrix above is much larger than the 3×3 convolution kernel, so the first reason for using a convolution layer instead of a fully connected layer is that it greatly reduces the number of parameters. The second reason is that the convolution kernel captures the relationship between neighbouring nodes and learns local features of the image, which can be described as learning with a purpose; for example, a 3×3 kernel learns the relationships among nodes within a distance of 2. This is very different from the fully connected layer, which treats all nodes indiscriminately, and it overcomes the inability of feedforward neural networks to learn translation invariance. For example, suppose a feedforward neural network is learning 4×4 images that contain a diagonal stroke. If it is trained only on the four training images shown below, then only the weights of the four nodes 5, 6, 9, … get adjusted, and when the image below appears as a test sample the network fails to recognize it. Since the weights of a convolution kernel are shared among different nodes, this problem is naturally avoided.

2. Operation process of convolution layer

2.1 The simplest convolution

The operation of the convolution layer is simply applying multiple convolution kernels to the input. The figure below shows the operation performed by the simplest convolution kernel: no padding and no stride. The blue squares at the bottom are the input image, the shaded part is the 3×3 convolution kernel (kernels are usually squares with odd side length), and as the kernel scans over the input it is multiplied element-wise with the input and summed, producing the 2×2 output, corresponding to the cyan region.
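As a quick sanity check of this case, here is a small PyTorch snippet (an illustration assuming a 4×4 input and a random 3×3 kernel, matching the sizes in the figure):

```python
# "No padding, no stride" case: a 4x4 input convolved with a 3x3 kernel
# yields a 2x2 output.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)          # (batch, channels, H, W)
k = torch.randn(1, 1, 3, 3)          # one 3x3 kernel
y = F.conv2d(x, k)                   # stride=1, padding=0 by default
print(y.shape)                       # torch.Size([1, 1, 2, 2])
```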

Usually a convolution layer contains multiple convolution kernels, and their number is the depth of the layer's output. For example, the figure below shows a deep network architecture often seen in papers; its first layer is a convolution layer plus max pooling. Ignoring the max pooling layer, we can at least tell that the kernel size is 5×5, the number of kernels is 16, and the output size of this layer is 18×18.
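The depth can be read directly from the layer definition. The sketch below is only an illustration: the 22×22 input width is an assumption chosen so that a 5×5 kernel with no padding and stride 1 yields the 18×18 output mentioned above.

```python
# A convolution layer with multiple kernels: 16 kernels of size 5x5 give an
# output of depth 16; the 22x22 input is an assumed size for illustration.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5)
x = torch.randn(1, 1, 22, 22)
print(conv(x).shape)                 # torch.Size([1, 16, 18, 18])
```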

2.2 Convolution with padding

From the GIF of the simplest convolution we can see that the output becomes smaller than the input after convolution, but sometimes we want the output size to stay the same as the input, and padding is introduced for this purpose. Padding used to keep the input and output the same size is called "same padding"; see the GIF below, where the kernel size is 3×3 and the padding is 1. Padding simply surrounds the input with as many layers of values as the padding amount, and its upper limit is the kernel size minus 1. The padding size is usually not given in papers, so we have to deduce it ourselves; the derivation formula is given below.

According to the padding size, there are three types of padding (a small helper computing each case is sketched after this list):

  • Same padding: padding that makes the output the same size as the input, e.g. a 3×3 kernel needs same padding = 1, and a 5×5 kernel needs same padding = 2.
  • Full padding: padding = kernel size − 1.
  • Valid padding: padding = 0.
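Here is the small helper promised above, following the definitions in the list (stride 1 is assumed, which is the case the list describes):

```python
# Padding for "same", "full" and "valid" modes given the kernel size F,
# following the list above (stride 1 assumed).
def padding_for(mode: str, F: int) -> int:
    if mode == "same":
        return (F - 1) // 2          # e.g. F=3 -> 1, F=5 -> 2
    if mode == "full":
        return F - 1
    if mode == "valid":
        return 0
    raise ValueError(mode)

print(padding_for("same", 3), padding_for("same", 5), padding_for("full", 3))
# 1 2 2
```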

2.3 Convolution with stride greater than 1

Stride is the step length, i.e. the distance the convolution kernel moves between two successive convolution operations; its default value is 1, and both examples above use a stride of 1. When the stride is greater than 1, we usually call it isometric down-sampling, because the output will inevitably lose information and its size becomes smaller than the input.
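A brief illustration of stride-2 down-sampling (the 8×8 input and 3×3 kernel are assumptions for the example):

```python
# Stride-2 convolution as isometric down-sampling: the output is roughly
# half the input size.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
k = torch.randn(1, 1, 3, 3)
print(F.conv2d(x, k, stride=2).shape)   # torch.Size([1, 1, 3, 3])
```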

2.4 Relationship between input and output size and convolution kernel

We mentioned above that the padding usually needs to be computed by ourselves, based on the relationship between the input size, the output size and the convolution kernel. The convolutions above actually involve three parameters: the kernel size (F), the padding (P) and the stride (S). Careful readers who study the GIFs will find that the output size can be computed from the input size and these three parameters; the formula is as follows (only the width is given, the height is analogous).

W2 = (W1 − F + 2P) ÷ S + 1

Notice that the formula above contains a division, so there are cases where it does not divide evenly; we then round down, and that case is called odd convolution. You can see how it arises in the following GIF.
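The formula, including the rounding down that produces the odd-convolution case, can be written as a one-line helper (the sample values are illustrative):

```python
# Output size of a convolution: floor((W1 - F + 2P) / S) + 1.
def conv_output_size(W1: int, F: int, P: int, S: int) -> int:
    return (W1 - F + 2 * P) // S + 1   # floor division when it does not divide evenly

print(conv_output_size(4, 3, 0, 1))    # 2  (the 4x4 / 3x3 example above)
print(conv_output_size(6, 3, 0, 2))    # 2  (an "odd" case: floor(1.5) + 1)
```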

3. Transpose convolution layer

After talking about the convolution layer, let's look at the other convolution operation in CNNs, the transpose convolution layer, sometimes called the deconvolution layer, because its process is the reverse of normal convolution. However, it is the reverse only in terms of size, not necessarily in content, so some people object to conflating the two. The biggest use of the transpose convolution layer is up-sampling: we just said that in normal convolution, when the stride is greater than 1 we perform isometric down-sampling, which makes the output smaller than the input, whereas in the transpose convolution layer we use a convolution with a stride less than 1 to up-sample and make the output larger. That is why the transpose convolution layer is also called the fractionally strided convolution layer. The most common up-sampling scenario is the generator network in a GAN, as shown in the figure below. Although the author of that paper labels the layers conv, they represent transpose convolution layers because their stride is 1/2.
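As a hedged sketch of this up-sampling use in a generator network: a ConvTranspose2d whose forward-convolution stride is 2 (the "stride 1/2" of the transpose view) doubles the spatial size. The kernel size 4, padding 1 and channel counts below are common illustrative choices, not values taken from the figure.

```python
# Up-sampling with a transposed convolution: stride 2 doubles the width
# and height, as used in GAN generator networks.
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 64, 16, 16)
print(up(x).shape)                   # torch.Size([1, 32, 32, 32])
```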

To understand the transpose convolution layer, we need to understand what it means to be the reverse of a normal convolution, which is often the hardest part for beginners. The author explains it with the two figures below: the first figure shows the process of a normal convolution, and the second its corresponding transpose convolution. In the first figure, cell 1 of the large square participates only in the computation of cell 1 of the small square, so in the transpose convolution, cell 1 of the large square can only be generated from cell 1 of the small square. That is the sense in which the process is reversed.

Following the normal convolution cases described above, the author now gives the corresponding transpose convolutions one by one.

3.1 Transpose convolution of the convolution with no padding and no stride

The figure used above to explain the reverse process of transpose convolution is in fact the simplest case (no padding, no stride) and its corresponding transpose convolution. Its GIF is shown here.

3.2 Transpose convolution of the convolution with padding

If the forward convolution has padding, its transpose convolution may not need padding. The calculation formula is given later; here we first give the transpose convolution GIF corresponding to the convolution in 2.2.

3.3 Transpose convolution of the convolution with stride greater than 1

As mentioned at the beginning of this section, a convolution with a stride greater than 1 performs down-sampling, and its corresponding transpose convolution up-samples with a stride less than 1. However, in both PyTorch and TensorFlow the stride parameter of the convTranspose functions is an integer; it is impossible to set the stride to a floating-point number less than 1, so we still pass the forward convolution's stride to convTranspose. How, then, does convTranspose handle it? See the following GIF, which is the transpose convolution corresponding to the no-padding convolution in 2.3. Ignoring the transpose padding (the dashed area around the GIF) for the moment, you can see that a white block, i.e. a zero, is inserted between every two blue blocks, so that each step of the kernel effectively moves 1/2 of a stride. From this we can deduce that we need to insert stride − 1 zeros between every two blue blocks.
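The zero-insertion trick can be verified directly. The sketch below (sizes are illustrative assumptions) reproduces PyTorch's own transposed convolution by inserting stride − 1 zeros between input pixels, padding with F − P − 1 (the transpose padding derived in 3.4), and running an ordinary stride-1 convolution with the spatially flipped kernel.

```python
# Simulating a stride-2 transposed convolution with zero insertion plus a
# normal stride-1 convolution (single-channel case for simplicity).
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 3, 3)
w = torch.randn(1, 1, 3, 3)
S, P, K = 2, 0, 3

# Reference result from PyTorch's own transposed convolution.
ref = F.conv_transpose2d(x, w, stride=S, padding=P)

# 1) insert S-1 zeros between neighbouring pixels ("fractional stride")
dilated = torch.zeros(1, 1, (3 - 1) * S + 1, (3 - 1) * S + 1)
dilated[:, :, ::S, ::S] = x
# 2) pad with the transpose padding PT = K - P - 1, then normal stride-1 conv
out = F.conv2d(dilated, w.flip(-1, -2), padding=K - P - 1)

print(torch.allclose(ref, out, atol=1e-6))   # True
```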

3.4 Conversion relation of forward convolution and transpose convolution

3.4.1 Padding of the transpose convolution

From the three examples of transpose convolution above, we can see that when a transpose convolution is implemented with a forward convolution, the kernel size stays the same and the stride is the reciprocal of the forward convolution's stride (we merely insert zeros to simulate the fractional movement). Finally, how do we compute the padding of the transpose convolution? Although we can simply pass in the forward convolution's padding when calling PyTorch or TensorFlow without worrying about it, understanding how convTranspose handles it helps us understand transpose convolution. To ensure that the transpose convolution is the reverse of the forward convolution, we have to use a transpose padding, denoted PT, whose calculation formula is PT = F − P − 1, where F is the kernel size of the forward convolution and P is the padding of the forward convolution.
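As a tiny helper (illustrative only), the transpose padding from the formula above:

```python
# Transpose padding PT = F - P - 1 (F: forward kernel size, P: forward padding).
def transpose_padding(F: int, P: int) -> int:
    return F - P - 1

print(transpose_padding(3, 0))   # 2  (the no-padding case in 3.1)
print(transpose_padding(3, 1))   # 1  (the same-padding case in 3.2)
```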

3.4.2 Output size of transpose convolution

This is actually easy to compute: since the transpose convolution is the size-wise inverse of the forward convolution, we just rearrange the formula given in 2.4 and solve for W1, which gives:

W1 = (W2 − 1) × S − 2P + F

where S is the stride of the forward convolution, P is the padding of the forward convolution, and F is the kernel size of the forward convolution.
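The inverse size formula can likewise be written as a small function (illustrative):

```python
# Given the forward convolution's output size W2 and its parameters S, P, F,
# recover the input size W1 (formula from 3.4.2).
def transpose_output_size(W2: int, S: int, P: int, F: int) -> int:
    return (W2 - 1) * S - 2 * P + F

print(transpose_output_size(2, 1, 0, 3))   # 4  (inverts the 4 -> 2 example in 2.1)
```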

3.4.3 Transpose convolution of odd convolution

This is arguably the hardest case of transpose convolution to understand. In 2.4 we mentioned that the division by the stride may not come out even, so when we solve for W1 there is some ambiguity. For example, take the figure given at the beginning of section 3: we want to enlarge a feature map from width W/4 to W/2, which is a transpose convolution. First consider the forward convolution, which down-samples from W/2 to W/4: the kernel size F is 3, the stride S is the reciprocal of 1/2, namely 2, and the padding is deduced to be 1 from the formula in 2.4. The forward calculation is therefore (W/2 − 3 + 2) ÷ 2 + 1 = W/4 + 1/2, which rounds down to W/4, as shown in the figure. But if we compute it in reverse with the formula in 3.4.2, we get (W/4 − 1) × 2 − 2 + 3 = W/2 − 1; this is the ambiguity of odd transpose convolution. Going back to the GIF given in 2.4, you will notice that the rightmost and bottom padding areas were never convolved; we ignored them because we rounded down, so we need to add them back when we do the transpose convolution. For this, PyTorch's convTranspose function has an argument output_padding that handles exactly this case; TensorFlow should have a corresponding parameter as well, but the author is not familiar with it. Below is PyTorch's description of this parameter, which is exactly the situation we encountered.

The output_padding value should be (W1 − F + 2P) % S, which in the example above is 1.
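Here is a concrete PyTorch check of this discussion, using an assumed width of 8: a stride-2, padding-1, kernel-3 convolution maps 8 → 4, but without output_padding the transpose convolution only recovers 7; output_padding = (W1 − F + 2P) % S = 1 restores 8.

```python
# output_padding compensates for the rounding lost in the forward convolution.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
y = down(x)
print(y.shape)                           # torch.Size([1, 1, 4, 4])

up_bad = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
up_ok = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1,
                           output_padding=1)
print(up_bad(y).shape)                   # torch.Size([1, 1, 7, 7])
print(up_ok(y).shape)                    # torch.Size([1, 1, 8, 8])
```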

4. Summary

This article first introduced the relationship and differences between convolutional neural networks and traditional feedforward neural networks, then walked through the convolution operation with different parameters, then introduced transpose convolution, which is often obscure to deep learning beginners, giving the corresponding transpose convolution for convolutions with different parameters, and finally summarized the formulas used in these convolution operations. I hope the analysis and explanations above are helpful to readers who have just started with CNNs. In addition, the author works in iOS development and has only just started with CNNs and deep learning, so comments and corrections from AI experts are very welcome.

5. References

  • An intuitive explanation of CNNs on Zhihu, from which the discussion of translation invariance is drawn
  • The GitHub repository of A Guide to Convolution Arithmetic for Deep Learning, from which all the GIFs in this article are taken
  • On the relationship and difference between transpose convolution and convolution
