Semantic segmentation is the process of assigning a category label, such as person, car, flower, or piece of furniture, to every pixel in an image. This article surveys some of the best recent ideas and solutions in semantic segmentation and is intended as a 2019 guide to the field.

Compiled by Derrick Mwiti, Heart of the Machine, Nurhachu Null, Geek AI.

We can think of semantic segmentation as image classification at the pixel level. For example, in an image with many cars, a semantic segmentation model labels all of the cars as a single class (vehicle). A related task, instance segmentation, instead marks each object that appears in an image as a separate instance. Instance segmentation is useful for counting objects (for example, counting the number of customers in a mall).
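To make "classification at the pixel level" concrete, here is a minimal sketch. The score array and the three class names are made up for illustration; a real model would produce the scores. Each pixel simply receives the class with the highest score.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (0 = background, 1 = car, 2 = person); shape is (classes, height, width).
scores = np.array([
    [[0.90, 0.20, 0.10], [0.80, 0.10, 0.30]],  # background
    [[0.05, 0.70, 0.80], [0.10, 0.80, 0.10]],  # car
    [[0.05, 0.10, 0.10], [0.10, 0.10, 0.60]],  # person
])

# Semantic segmentation assigns each pixel the class with the highest score.
label_map = scores.argmax(axis=0)
print(label_map)
# [[0 1 1]
#  [0 1 2]]
```

Note that the three "car" pixels share one label. Telling separate cars apart would require instance segmentation.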

Some of the major applications of semantic segmentation are autonomous driving, human-computer interaction, robotics, and photo editing/authoring tools. For example, semantic segmentation is a key technology in autonomous driving and robotics because it is important for models in these fields to understand the context of their operating environment.

Image source: www.cs.toronto.edu/~tingwuwang…

Next, we will review some research papers on the most advanced approaches to constructing semantic segmentation models. They are:
  1. Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

  2. Fully Convolutional Networks for Semantic Segmentation

  3. U-Net: Convolutional Networks for Biomedical Image Segmentation

  4. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

  5. Multi-Scale Context Aggregation by Dilated Convolutions

  6. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

  7. Rethinking Atrous Convolution for Semantic Image Segmentation

  8. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  9. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

  10. Improving Semantic Segmentation via Video Propagation and Label Relaxation

  11. Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

1. Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation (ICCV, 2015)
This paper proposes a solution to the challenge of training deep convolutional networks on weakly labeled data, as well as on a combination of well-labeled and weakly labeled data. It combines deep convolutional networks with fully connected conditional random fields.
  • Address: arxiv.org/pdf/1502.02…

On the PASCAL VOC segmentation benchmark, this model achieves an intersection over union (IoU) of more than 70%.
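Since most results in this guide are reported as intersection over union (IoU) or its mean over classes (mIoU), a small sketch of the metric may help. The function name `mean_iou` and the toy label maps are illustrative only. Benchmarks usually accumulate intersections and unions over a whole dataset, whereas this version scores a single image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1],
                   [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])

# Class 0: intersection 2, union 3. Class 1: intersection 3, union 4.
print(mean_iou(pred, target, num_classes=2))
```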


The main contributions of this paper are as follows:

  • Introduce EM algorithms for training from bounding-box or image-level annotations, which can be used in both weakly supervised and semi-supervised settings.

  • Show that combining weak and strong annotations improves performance. By combining annotations from the MS-COCO and PASCAL datasets, the authors reach 73.9% IoU on PASCAL VOC 2012.

  • Show that their method achieves better performance by combining a small number of pixel-level annotations with a large number of bounding-box (or image-level) annotations.

2. Fully Convolutional Networks for Semantic Segmentation (PAMI, 2016)
The model presented in this paper achieves a mean IoU of 67.2% on the PASCAL VOC 2012 dataset. A fully convolutional network takes an image of arbitrary size as input and produces an output of corresponding spatial dimensions. In this model, ILSVRC classifiers are cast into fully convolutional networks and augmented for dense prediction with in-network upsampling and a per-pixel loss. Training for segmentation is then done by fine-tuning, with back-propagation through the whole network.
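The step of turning a classifier into a fully convolutional network can be illustrated with a toy example. This is a sketch under simplifying assumptions (a single-channel input, a made-up 4x4 dense classifier head, random weights), not the paper's implementation. It shows that reshaping dense weights into convolution kernels yields a score map whose entries equal the dense classifier applied to each window.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classifier head: a dense layer over a flattened 4x4 patch, 3 classes.
patch = 4
num_classes = 3
dense_w = rng.normal(size=(num_classes, patch * patch))

def dense_scores(window):
    """Classify one 4x4 window with the dense layer."""
    return dense_w @ window.reshape(-1)

def conv_scores(image):
    """Slide the same weights, reshaped into 4x4 kernels, over a larger image."""
    kernels = dense_w.reshape(num_classes, patch, patch)
    h, w = image.shape
    out = np.empty((num_classes, h - patch + 1, w - patch + 1))
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            window = image[i:i + patch, j:j + patch]
            out[:, i, j] = (kernels * window).sum(axis=(1, 2))
    return out

image = rng.normal(size=(6, 7))      # larger than the 4x4 training patch
score_map = conv_scores(image)       # one score vector per window position
print(score_map.shape)               # (3, 3, 4)
```

Because the convolutional form accepts any input at least as large as the patch, the same weights produce a spatial map of predictions instead of a single label.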
  • Address: arxiv.org/pdf/1605.06…






3. U-Net: Convolutional Networks for Biomedical Image Segmentation (MICCAI, 2015)

In biomedical image processing, it is very important to obtain a class label for every cell in the image. One of the biggest challenges in the biomedical field is that training images are hard to obtain and datasets are small. U-Net is a well-known solution. It builds on the fully convolutional network and modifies it so that it works with very few training images and yields more precise segmentations.

  • The paper address: https://arxiv.org/pdf/1505.04597.pdf

Since only a small amount of training data is available, the model uses data augmentation, applying elastic deformations to the available images. As depicted in Figure 1 of the paper, the network consists of a contracting path on the left and an expanding path on the right.
The contracting path repeatedly applies two 3x3 convolutions, each followed by a ReLU activation, and then a 2x2 max pooling operation for downsampling. Each step of the expanding path upsamples the feature map with a 2x2 transposed convolution, which halves the number of feature channels and enlarges the feature map. The last layer is a 1x1 convolution that maps each feature vector to the required number of classes.
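The following sketch traces how the spatial size changes along the two paths. It assumes details taken from the U-Net paper rather than from the text above: unpadded 3x3 convolutions (each removes 2 pixels per dimension), four downsampling steps, and a 572-pixel input tile. The function name `unet_sizes` is illustrative.

```python
def unet_sizes(input_size=572, depth=4):
    """Trace the spatial size through a U-Net-style network (unpadded 3x3 convs)."""
    size = input_size
    trace = []
    for _ in range(depth):
        size -= 4                      # two unpadded 3x3 convs, 2 pixels each
        trace.append(("down", size))
        if size % 2 != 0:
            raise ValueError("2x2 max pooling needs an even feature map size")
        size //= 2                     # 2x2 max pooling with stride 2
    size -= 4                          # two convs at the bottom of the "U"
    trace.append(("bottom", size))
    for _ in range(depth):
        size *= 2                      # 2x2 transposed convolution
        size -= 4                      # two unpadded 3x3 convs
        trace.append(("up", size))
    return trace                       # the final 1x1 conv keeps the size

for stage, size in unet_sizes():
    print(stage, size)
```

Under these assumptions a 572x572 input yields a 388x388 output map, which is why the paper's segmentation map is smaller than its input tile.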

In this model, the network is trained on input images and their segmentation maps using stochastic gradient descent. Data augmentation teaches the network the robustness and invariance it needs when very little training data is available. The model achieved 92% mIoU in one of the experiments.



4. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation (2017)
The idea behind DenseNets is that connecting each layer to every other layer in a feed-forward fashion makes the network easier to train and more accurate.
The model architecture is built from dense blocks arranged in a downsampling path and an upsampling path. The downsampling path contains two Transitions Down (TD), and the upsampling path contains two Transitions Up (TU). In the paper's architecture diagram, circles and arrows represent the connectivity patterns in the network.
  • The paper address: https://arxiv.org/pdf/1611.09326.pdf
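The dense connectivity described above can be sketched in a few lines. This is a simplified illustration, not the paper's layer: a random 1x1 convolution followed by ReLU stands in for the real layer, and the names `dense_block` and `growth_rate` are chosen for this example. The point is that every layer receives the concatenation of all earlier feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(x, num_layers, growth_rate):
    """x has shape (channels, height, width). Each layer sees all earlier maps."""
    features = [x]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)
        # Stand-in for the real layer: random 1x1 convolution followed by ReLU.
        w = rng.normal(size=(growth_rate, stacked.shape[0]))
        new_maps = np.maximum(np.einsum("oc,chw->ohw", w, stacked), 0.0)
        features.append(new_maps)
    return np.concatenate(features, axis=0)

x = rng.normal(size=(8, 5, 5))
out = dense_block(x, num_layers=4, growth_rate=3)
print(out.shape)  # (20, 5, 5): 8 input channels + 4 layers * 3 new channels
```

The channel count grows linearly with depth, which is one reason dense architectures insert transition layers between blocks.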

The main contributions of this paper are:
  • Extend the DenseNet architecture to fully convolutional networks for semantic segmentation.

  • Propose an upsampling path for dense networks that performs better than other upsampling paths.

  • Demonstrate that the network produces state-of-the-art results on standard benchmarks.

This model achieves 88% global accuracy on the CamVid dataset.

5. Multi-Scale Context Aggregation by Dilated Convolutions (ICLR, 2016)


This paper proposes a convolutional network module that aggregates multi-scale context information without losing resolution. The module is based primarily on dilated convolutions and can be plugged into existing architectures at any resolution.

  • The paper address: https://arxiv.org/abs/1511.07122
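A one-dimensional sketch can show why dilated convolutions gather context without losing resolution. The helper `dilated_conv1d` and the choice of three 3-tap layers with dilations 1, 2, and 4 are assumptions made for this illustration. Pushing a single impulse through the stack shows how far its influence spreads, which by symmetry equals the receptive field of one output position.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 1D dilated convolution that keeps the input length."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * padded[i + j * dilation]
    return out

# Push a single impulse through three 3-tap layers with dilations 1, 2, 4.
signal = np.zeros(31)
signal[15] = 1.0
for d in (1, 2, 4):
    signal = dilated_conv1d(signal, np.ones(3), d)

receptive_field = int(np.count_nonzero(signal))
print(len(signal), receptive_field)  # 31 15
```

The signal length stays at 31 throughout, while the receptive field grows from 3 to 7 to 15, roughly doubling with each layer.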

The module was tested on the Pascal VOC 2012 dataset. The results show that adding a context module to existing semantic segmentation architectures improves their accuracy.

The front-end module trained in the experiments achieves 69.8% mean IoU (mIoU) on the VOC-2012 validation set and 71.3% mIoU on the test set. The paper also reports this module's prediction accuracy for individual object classes.

6. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (TPAMI, 2017)

In this paper, the authors make the following contributions to the semantic segmentation task:

  • Using convolution with upsampled filters (atrous convolution) for dense prediction tasks

  • Atrous spatial pyramid pooling (ASPP) for segmenting objects at multiple scales

  • Improving the localization of object boundaries by combining DCNNs with fully connected CRFs

  • The paper address: https://arxiv.org/abs/1606.00915

In this paper, the proposed DeepLab system achieves 79.7% mean intersection over union (mIoU) on the PASCAL VOC-2012 semantic image segmentation task.

This paper addresses the major challenges of semantic segmentation, including:

  • Reduced feature resolution caused by repeated max pooling and downsampling

  • The existence of objects at multiple scales

  • Reduced localization accuracy caused by DCNN invariance, since an object-centric classifier needs to be invariant to spatial transformations

Atrous convolution can be implemented in two ways: either by upsampling the filter by inserting zeros between its taps, or by sparsely sampling the input feature map. The second method subsamples the input feature map by a factor equal to the atrous convolution rate r and deinterlaces it to produce r^2 reduced-resolution maps, one for each of the r x r possible shifts. A standard convolution is then applied to these intermediate feature maps, and the results are reinterlaced to the original image resolution.
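The equivalence of the two implementations can be checked numerically. The sketch below is one-dimensional for brevity, so there are r shifts rather than r^2. The function names, the 3-tap filter, and the random input are illustrative choices.

```python
import numpy as np

def atrous_by_filter_upsampling(x, w, rate):
    """Insert (rate - 1) zeros between filter taps, then filter as usual."""
    w_up = np.zeros((len(w) - 1) * rate + 1)
    w_up[::rate] = w
    return np.correlate(x, w_up, mode="same")

def atrous_by_input_subsampling(x, w, rate):
    """Deinterlace x into `rate` shifted sub-signals, filter each, reinterlace."""
    y = np.empty(len(x))
    for shift in range(rate):
        y[shift::rate] = np.correlate(x[shift::rate], w, mode="same")
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=12)
w = np.array([1.0, -2.0, 0.5])

y_filter = atrous_by_filter_upsampling(x, w, rate=3)
y_input = atrous_by_input_subsampling(x, w, rate=3)
print(np.allclose(y_filter, y_input))  # True
```

Both routes compute the same output. The second one only ever runs a standard dense filter, which is convenient when a framework has no native support for dilation.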

7. Rethinking Atrous Convolution for Semantic Image Segmentation (2017)
This paper addresses two of the challenges mentioned earlier for semantic segmentation with DCNNs: the reduction in feature resolution caused by consecutive pooling operations, and the existence of objects at multiple scales.
  • The paper address: https://arxiv.org/pdf/1706.05587.pdf

To address the first problem, the paper uses atrous convolution, also known as dilated convolution. The second problem is addressed by using atrous convolution to enlarge the receptive field and thereby incorporate multi-scale context.

Without dense conditional random field (DenseCRF) post-processing, the DeepLabv3 model from the paper achieves 85.7% mIoU on the PASCAL VOC 2012 test set.

8. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (ECCV, 2018)


The method in this paper, "DeepLabv3+", achieves 89.0% and 82.1% mIoU on the PASCAL VOC 2012 and Cityscapes datasets, respectively, without any post-processing. The model improves segmentation results by adding a simple decoder module on top of DeepLabv3.

  • The paper address: https://arxiv.org/pdf/1802.02611v3.pdf

This paper combines two kinds of neural network designs for semantic segmentation. Spatial pyramid pooling captures context information by pooling features at different resolutions, while an encoder-decoder structure recovers sharp object boundaries.

9. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation (2019)

This paper proposes a joint upsampling module called Joint Pyramid Upsampling (JPU) to replace dilated convolutions, which consume a lot of time and memory. It formulates the extraction of high-resolution feature maps as a joint upsampling problem.

  • The paper address: https://arxiv.org/pdf/1903.11816v1.pdf

This method achieves 53.13% mIoU on the Pascal Context dataset and runs three times faster.

In this method, a fully convolutional network (FCN) serves as the backbone, and JPU upsamples the low-resolution final feature maps to produce high-resolution feature maps. Replacing dilated convolutions with JPU causes no loss in performance.

Joint upsampling takes a low-resolution target image and a high-resolution guidance image, and generates a high-resolution target image by transferring the structure and details of the guidance image.

10. Improving Semantic Segmentation via Video Propagation and Label Relaxation (CVPR, 2019)

This paper proposes a video-based approach that scales up the training set by synthesizing new training samples, with the aim of improving the accuracy of semantic segmentation networks. It exploits the ability of video prediction models to predict future frames in order to also predict future labels.

  • The paper address: https://arxiv.org/pdf/1812.01593v3.pdf

This paper shows that training semantic segmentation networks on datasets augmented with synthesized samples improves prediction accuracy. The proposed method achieves 83.5% mIoU on Cityscapes and 82.9% mIoU on CamVid.

The paper proposes two methods for predicting future labels:
  • Label Propagation (LP): creates new training samples by pairing a propagated label with the original future frame.

  • Joint image-label Propagation (JP): creates new training samples by pairing a propagated label with the corresponding propagated image.

This paper makes three major contributions: propagating labels to immediately neighboring frames using a video prediction model, introducing joint image-label propagation (JP) to deal with misalignment, and relaxing one-hot label training by maximizing the likelihood of the union of class probabilities along object boundaries.
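The label relaxation idea can be sketched for a single boundary pixel. The function name `relaxed_boundary_loss`, the example probabilities, and the set of allowed classes are assumptions for illustration. Instead of penalizing everything except one class, the loss is the negative log of the total probability assigned to any class found in the pixel's neighborhood.

```python
import numpy as np

def relaxed_boundary_loss(probs, allowed_classes):
    """Negative log of the summed probability of the classes allowed at a pixel."""
    return float(-np.log(probs[list(allowed_classes)].sum()))

# Hypothetical softmax output for one pixel on a road/sidewalk boundary.
probs = np.array([0.45, 0.40, 0.15])

one_hot_loss = relaxed_boundary_loss(probs, [0])     # only class 0 is accepted
relaxed_loss = relaxed_boundary_loss(probs, [0, 1])  # classes 0 and 1 accepted

print(one_hot_loss > relaxed_loss)  # True
```

With a single allowed class the expression reduces to ordinary cross-entropy, so the relaxation only changes the loss near boundaries where several labels are plausible.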

11. Gated-SCNN: Gated Shape CNNs for Semantic Segmentation (2019)

This paper is among the most recent results in semantic segmentation (July 2019). The authors propose a two-stream CNN architecture in which the shape information of objects is processed in a separate branch, and this shape stream processes only boundary-related information. This is enforced by the model's gated convolutional layers (GCL) and by local supervision.

  • The paper address: https://arxiv.org/pdf/1907.05740.pdf
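A rough sketch of the gating idea follows. It is loosely modeled on the description above and is not the paper's layer: the function `gated_layer`, the random gate weights, and the feature shapes are all hypothetical. An attention map is computed from both streams and then used to emphasize locations in the shape stream.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_layer(shape_feat, regular_feat, w_gate):
    """Gate shape-stream features with attention computed from both streams.

    shape_feat: (Cs, H, W), regular_feat: (Cr, H, W), w_gate: (Cs + Cr,)
    """
    both = np.concatenate([shape_feat, regular_feat], axis=0)
    # A 1x1 convolution to one channel, then a sigmoid: values lie in (0, 1).
    alpha = sigmoid(np.einsum("c,chw->hw", w_gate, both))
    # Residual-style gating: attended features plus the original features.
    gated = shape_feat * alpha + shape_feat
    return gated, alpha

shape_feat = rng.normal(size=(4, 6, 6))
regular_feat = rng.normal(size=(8, 6, 6))
w_gate = rng.normal(size=12)

gated, alpha = gated_layer(shape_feat, regular_feat, w_gate)
print(gated.shape, alpha.shape)  # (4, 6, 6) (6, 6)
```

In a trained model the gate weights would be learned so that the attention map responds mainly near object boundaries. Here they are random and only demonstrate the data flow.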

On the Cityscapes benchmark, the model's mIoU is 1.5% higher than that of DeepLab-v3, and its F-boundary score is 4% higher. On thinner and smaller objects, the model achieves an improvement of 7% in IoU. The paper includes a table comparing the performance of Gated-SCNN with other models.





These are the main recent advances in semantic segmentation. As models and data continue to improve, semantic segmentation is becoming faster and more accurate, and it may be applied to a wide range of real-world scenarios in the future.

Original link:
https://heartbeat.fritz.ai/a-2019-guide-to-semantic-segmentation-ca8242f5a7fc