The DeepLab series: principles
DeepLabv1 was proposed in 2015, DeepLabv2 and DeepLabv3 in 2017, and DeepLabv3+ in 2018. Next, let's look at what problems these models solve and how they solve them.
DeepLabv3+: arxiv.org/pdf/1802.02...
When we use a CNN for image segmentation, two problems usually arise. First, downsampling loses fine detail. Second, a CNN is designed to be spatially invariant: when the same image undergoes a spatial transformation (such as translation or rotation), the image classification result stays the same, but for a segmentation task the desired output does change, so this invariance hurts precise localization.
DeepLabv1 solves the first problem with atrous (dilated) convolution, and the second with DenseCRF post-processing.
The concept of atrous convolution:
Atrous convolution introduces a dilation-rate hyperparameter, which enlarges the receptive field without shrinking the output feature map.
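This can be checked with the standard convolution arithmetic. A small sketch (the helper names are mine, not from the paper): with "same" padding p = d·(k-1)/2 and stride 1, a dilated 3×3 convolution keeps the spatial size while its receptive field grows linearly with the dilation rate d.

```python
# Output size of a conv layer: out = floor((in + 2p - d*(k-1) - 1) / s) + 1.
# Receptive field of a single dilated kernel: d*(k-1) + 1.
def conv_out_size(n, k, s=1, p=0, d=1):
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

def receptive_field(k, d=1):
    return d * (k - 1) + 1

for d in (1, 2, 4):
    p = d  # "same" padding for a 3x3 kernel with dilation d
    print(conv_out_size(64, 3, p=p, d=d), receptive_field(3, d))
# 64 3
# 64 5
# 64 9
```

The output size stays 64 in every case, while the receptive field grows from 3 to 9 — which is exactly why atrous convolution can replace downsampling.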
DeepLabv1 mainly makes a few modifications to the original VGG network:
- The original fully connected layers are re-implemented as convolution layers
- VGG has 5 max-pooling layers; the downsampling of the last two is removed, so the network downsamples by 8× overall, and the ordinary convolution layers after those two pooling layers are changed to atrous convolutions
- To reduce computation, the 7×7 convolution obtained from converting the first fully connected layer is replaced with an atrous convolution with a smaller kernel
- Loss function: the sum of the cross-entropy losses over the output pixels
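The modifications above can be sketched in PyTorch. This is an illustrative reconstruction, not the official implementation: layer names, channel counts, and dilation rates are my assumptions for a VGG-style tail where the last two pooling stages no longer downsample and the converted fc layers become convolutions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the DeepLabv1-style VGG surgery (hypothetical names):
# the last two pools keep stride 1, the convolutions after them are dilated,
# and fc6/fc7 are re-implemented as convolutions.
class DilatedVGGTail(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.pool4 = nn.MaxPool2d(3, stride=1, padding=1)   # was stride 2
        self.conv5 = nn.Conv2d(in_ch, in_ch, 3, padding=2, dilation=2)
        self.pool5 = nn.MaxPool2d(3, stride=1, padding=1)   # was stride 2
        # fc6 as a dilated 3x3 convolution, fc7 as a 1x1 convolution
        self.fc6 = nn.Conv2d(in_ch, 1024, 3, padding=4, dilation=4)
        self.fc7 = nn.Conv2d(1024, 1024, 1)

    def forward(self, x):
        x = self.conv5(self.pool4(x))
        x = self.fc7(torch.relu(self.fc6(self.pool5(x))))
        return x

x = torch.randn(1, 256, 28, 28)   # a feature map at 1/8 input resolution
print(DilatedVGGTail()(x).shape)  # spatial size preserved: (1, 1024, 28, 28)
```

The key point is visible in the output shape: two stages that would each have halved the resolution now leave it untouched, while dilation keeps the receptive field comparable to the original network.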
DeepLabv2 makes heavier use of atrous convolution and proposes Atrous Spatial Pyramid Pooling (ASPP) to exploit multi-scale information for better segmentation. Compared with DeepLabv1, DeepLabv2 replaces the VGG backbone with ResNet and introduces ASPP to handle objects at multiple scales.
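A minimal sketch of the DeepLabv2-style ASPP head: parallel 3×3 atrous convolutions with different rates applied to the same feature map, with the per-branch score maps fused by summation. The rates (6, 12, 18, 24) follow the paper's ASPP-L setting; the channel sizes here are illustrative.

```python
import torch
import torch.nn as nn

# DeepLabv2-style ASPP sketch: each branch sees the same input with a
# different dilation rate, i.e. a different effective field of view.
class ASPPv2(nn.Module):
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, 3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Fuse the parallel branches by summing their score maps.
        return sum(b(x) for b in self.branches)

x = torch.randn(1, 2048, 33, 33)  # e.g. a ResNet output at stride 16
print(ASPPv2(2048, 21)(x).shape)  # (1, 21, 33, 33)
```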
DeepLabv3 mainly focuses on two issues. First, repeated pooling or strided convolution makes the feature-map resolution smaller and smaller and the learned features more and more abstract, which hurts semantic segmentation, a dense-prediction task that needs detailed spatial information. Second, the multi-scale problem of objects.
For the first problem, the author again uses atrous convolution. For the second, the ASPP module is improved by combining atrous convolutions with different dilation rates and adding a BN layer after each. The author also found that a 3×3 atrous convolution with a very large dilation rate loses long-range information because of image-boundary effects (it degenerates toward a 1×1 convolution), so that branch is reduced to a 1×1 convolution, and image-level features are fused into the ASPP module.
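The improved head can be sketched as follows: one 1×1 convolution, three 3×3 atrous convolutions (rates 6/12/18 at output stride 16), each followed by BN, plus a globally average-pooled image-level branch that is upsampled and concatenated. The 256-channel width follows the paper; other details are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# DeepLabv3-style ASPP sketch: 1x1 branch + three dilated 3x3 branches with
# BN, plus an image-level (global average pooling) branch.
class ASPPv3(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn(k, r=1):
            pad = 0 if k == 1 else r
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [conv_bn(1)] + [conv_bn(3, r) for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.ReLU(inplace=True))
        self.project = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        # Upsample the image-level feature back to the map size, then concat.
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))

x = torch.randn(1, 2048, 33, 33)
print(ASPPv3(2048)(x).shape)  # (1, 256, 33, 33)
```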
DeepLabv3+ is concerned with two problems. The first is multi-scale objects, already addressed by the ASPP structure designed in DeepLabv3. The second is that deep networks contain layers with stride = 2, which reduce feature resolution, lowering prediction accuracy and losing boundary information.
The author focuses on the second problem and adopts an encoder-decoder structure, as follows:
Decoder part: first select a low-level feature and compress it with a 1×1 convolution (it originally has 256 or 512 channels), in order to reduce its proportion in the fused feature. The author argues that the features produced by the encoder carry richer information and should therefore account for a higher proportion, which is good for training.
Then the encoder output is upsampled so its resolution matches the low-level feature; for example, if the feature from ResNet conv2 is used, 4× upsampling is needed. After the two features are concatenated, a 3×3 convolution is applied (refinement), followed by another upsampling to obtain the pixel-level prediction. The subsequent experiments show that with output stride 16 this structure achieves both high accuracy and fast speed; output stride 8 brings only a small accuracy gain at a large computational cost.
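The decoder steps above can be sketched as follows. This is a simplified reconstruction: the 48-channel compression and 256-channel refinement widths follow the paper, while the class count and input shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# DeepLabv3+ decoder sketch: compress the low-level feature with a 1x1 conv
# (to 48 channels, so it does not dominate the concatenation), upsample the
# encoder/ASPP output 4x, concatenate, refine with 3x3 convs, then upsample
# 4x again to the pixel-level prediction.
class Decoder(nn.Module):
    def __init__(self, low_ch=256, aspp_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)  # shrink low-level feature
        self.refine = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, low, aspp):
        # 4x upsampling of the encoder output to match the low-level feature.
        aspp = F.interpolate(aspp, size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        x = self.refine(torch.cat([self.reduce(low), aspp], dim=1))
        # Final 4x upsampling to full resolution.
        return F.interpolate(x, scale_factor=4,
                             mode="bilinear", align_corners=False)

low = torch.randn(1, 256, 128, 128)   # e.g. ResNet conv2 output (stride 4)
aspp = torch.randn(1, 256, 32, 32)    # encoder output at stride 16
print(Decoder()(low, aspp).shape)     # (1, 21, 512, 512)
```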