This post is about a semantic segmentation paper: DeepLab v3, arxiv.org/pdf/1706.05… . It is recommended to read the paper and the code carefully after this article for a better understanding.

In this paper, the author mainly tries to solve/optimize two problems in semantic segmentation: first, the resolution of the feature map is too low, which makes the subsequent restoration to the original resolution inaccurate; second, performance on multi-scale objects is poor. The author proposes several ways to address these two problems, some of which are improvements on previously proposed methods.

Atrous convolution

In the earlier post introducing FCN, we said that its authors believed the fully connected layers in a typical classification network discard the spatial location of the target, so they replaced them with convolutional layers to preserve location information. In this paper, the author argues that the downsampling of the original image by convolution and pooling makes the feature map lose the precise location of the target, which motivates atrous convolution, a variant of ordinary convolution:

Atrous convolution (also called dilated convolution) was already proposed in DeepLab v1. By inserting zeros between the elements of the convolution kernel, it enlarges the field of view without increasing the amount of computation. As shown in the figure above, the kernel of the standard convolution on the left is 3×3 and its receptive field is also 3×3. After zeros are inserted between the kernel elements, it becomes the atrous convolution on the right: the number of weights actually involved in the computation is still 3×3, while the receptive field has expanded to 5×5. The effective kernel size is k_out = k_in + (k_in − 1)(r − 1), where r is the atrous rate and (r − 1) is the number of zeros inserted between adjacent kernel elements; r = 1 recovers standard convolution, and for k_in = 3, r = 2 this gives k_out = 5.

There are two ways to implement atrous convolution: one is to keep the input unchanged and insert zeros into the convolution kernel; the other is to keep the kernel unchanged and sample the input at equal intervals. PyTorch uses the second method. Let's look at a concrete example to understand the computation:

  • Random input, with values limited to 0, 1, 2 for easy calculation.

  • The convolution kernel: kernel_size is 3, and the dilation parameter corresponds to the atrous rate, set here to 2. Printing the kernel parameters shows that PyTorch does not actually store zeros inside the kernel; it uses the second method described above.

  • For the output, take the 5 in the upper left corner as an example. The calculation is shown below, where the second factor of each product is the zero-inserted kernel value and only the non-zero kernel values contribute:

0×1 + 1×0 + 1×1 + 0×0 + 2×1 +
0×0 + 1×0 + 0×0 + 1×0 + 2×0 +
0×1 + 0×0 + 1×1 + 0×0 + 0×1 +
2×0 + 1×0 + 0×0 + 0×0 + 2×0 +
1×1 + 0×0 + 0×1 + 2×0 + 0×1 = 5

The advantage of atrous convolution is that, with appropriate padding, the input and output can be kept the same size (as you can see below) while the receptive field is enlarged, without increasing computation. Now that atrous convolution is clear, the rest of the paper is easier to follow.
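Here is a minimal PyTorch sketch of the example walked through above. The input values are random rather than the exact ones from the figure; it also shows that setting the padding equal to the dilation rate keeps the output the same size as the input for a 3×3 kernel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random 7x7 input restricted to {0, 1, 2}, as in the walkthrough above
# (the values here are random, not the exact ones from the figure).
x = torch.randint(0, 3, (1, 1, 7, 7)).float()

# 3x3 kernel with dilation=2: the receptive field becomes 5x5, but only
# 9 weights take part in each multiply-accumulate.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)
print(conv.weight.shape)   # torch.Size([1, 1, 3, 3]) -- no zeros are stored
print(conv(x).shape)       # torch.Size([1, 1, 3, 3]) -- without padding

# With padding equal to the dilation rate, the output keeps the input size.
conv_same = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
print(conv_same(x).shape)  # torch.Size([1, 1, 7, 7])
```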

Multi-grid

In order to make the model handle multi-scale objects better, the author uses different atrous rates for different convolutional layers:

The upper row of the figure shows the result of using standard convolution: the deeper the network, the smaller the feature map, which makes it harder to recover the location of objects. In the lower row, after atrous convolution is introduced, the feature map stays at least 1/16 the size of the original image (this ratio is configurable). In the figure, Block4 through Block7 share the same structure, each containing three 3×3 convolutions. The paper defines a unit rate (r1, r2, r3) for these three convolutions, and the actual rate is rate × unit rate. Take Block4 as an example: assume its unit rate is (1, 2, 4) and the rate of Block4 in the figure is 2; then the actual rates of Block4 are (2, 4, 8).
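A minimal sketch of this multi-grid idea, under my own simplifying assumptions: plain 3×3 conv/BN/ReLU layers stand in for the ResNet bottleneck units the paper actually modifies, and the function name multi_grid_block is just illustrative.

```python
import torch
import torch.nn as nn

def multi_grid_block(channels, block_rate, unit_rates=(1, 2, 4)):
    """Three 3x3 atrous convolutions whose dilations are block_rate * unit_rate.

    Illustrative stand-in for a block such as Block4 (block_rate=2 gives
    actual rates (2, 4, 8)); the real model uses ResNet bottleneck units
    with residual connections.
    """
    layers = []
    for unit_rate in unit_rates:
        rate = block_rate * unit_rate
        # padding = dilation keeps the spatial size unchanged for a 3x3 kernel
        layers += [nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
                   nn.BatchNorm2d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

x = torch.randn(2, 256, 33, 33)
print(multi_grid_block(256, block_rate=2)(x).shape)  # torch.Size([2, 256, 33, 33])
```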

ASPP (Atrous Spatial Pyramid Pooling)

ASPP is spatial pyramid pooling built on atrous convolution. The name is just a label; what matters is what it does. It is essentially SPP combined with atrous convolution, so let's start with SPP.

  • SPP (Spatial Pyramid Pooling)

SPP was proposed in the SPPNet paper to solve the problem that convolutional networks with fully connected heads can only take fixed-size inputs. The structure of SPP:

Assume the feature map output by the convolutional layer has 256 channels (the black part in the figure). Pooling each channel down to a single value produces the grey 256-d vector in the figure; pooling each channel down to four values produces the green 4×256-d vector; pooling each channel down to 16 values produces the blue 16×256-d vector. These three outputs are concatenated into a vector of size (16 + 4 + 1)×256. Therefore, whatever the size of the feature map, the output size after SPP is fixed.
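A short sketch of that idea, assuming 4×4, 2×2 and 1×1 pooling grids to match the 16-, 4- and 1-bin levels described above (the function name spp and the use of adaptive max pooling are my choices, not taken from the SPPNet code):

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    """Pool an (N, C, H, W) feature map to fixed 4x4, 2x2 and 1x1 grids
    and concatenate, so the output length no longer depends on H and W."""
    n = feature_map.shape[0]
    pooled = [F.adaptive_max_pool2d(feature_map, level).reshape(n, -1)
              for level in levels]
    return torch.cat(pooled, dim=1)   # shape (N, (16 + 4 + 1) * C)

# Two inputs with different spatial sizes yield the same output length.
for size in [(13, 13), (27, 19)]:
    x = torch.randn(1, 256, *size)
    print(spp(x).shape)               # torch.Size([1, 5376]); 21 * 256 = 5376
```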

  • ASPP (Atrous Spatial Pyramid Pooling)

Combining atrous convolution with SPP gives ASPP. Its basic structure:

By connecting in parallel several convolutional layers with different atrous rates, the model performs better on multi-scale objects. The overall ASPP structure, shown in the yellow box, consists of two parts, (a) and (b): (a) one 1×1 convolution and three 3×3 atrous convolutions with atrous rates (6, 12, 18); (b) image-level pooling of the final feature map, followed by a 1×1 convolution, BN, and bilinear upsampling, to bring in more global information.
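A minimal sketch of such an ASPP head, not the reference implementation: the channel counts (in_ch=2048 from a ResNet-style backbone, out_ch=256) are illustrative defaults, and details such as dropout are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """(a) one 1x1 conv and three 3x3 atrous convs with rates (6, 12, 18);
    (b) image-level pooling + 1x1 conv + BN + bilinear upsampling.
    The five branches are concatenated and fused by a final 1x1 conv."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, rate=1):
            pad = 0 if k == 1 else rate
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=rate, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

x = torch.randn(2, 2048, 33, 33)
print(ASPP()(x).shape)   # torch.Size([2, 256, 33, 33])
```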

Model performance

  • The performance of the model under different values of output_stride. output_stride is the ratio of the input image size to the feature map size.

  • The performance of the model under different combinations of unit rates.

  • The impact of ASPP on model performance.

The code address

I originally wanted to reproduce the model myself, but gave up for lack of time, so here is a link to an existing implementation:

Github.com/jfzhang95/p…

This write-up covers several key points of the paper. For more details, I recommend reading the original paper and the code. If you have questions, feel free to leave a comment so we can discuss and improve together. Writing these posts takes effort, so a like is appreciated!

PS: Feel free to follow my personal WeChat public account [MachineLearning Learning road], where a CV paper interpretation is posted every week!