Preface

This article gives a brief overview of important papers on semantic segmentation, introduces their main improvements and results, and explains how to download them.

This article is part of the technical summary series of the public account CV Technical Guide.

Welcome to CV Technical Guide, which focuses on computer vision technical summaries, tracking of the latest techniques, and interpretation of classic papers.

Semantic segmentation refers to the process of linking each pixel in an image to a class label. These labels may include people, cars, flowers, furniture, etc.

We can think of semantic segmentation as image classification at the pixel level. For example, in an image with many cars, semantic segmentation labels all of them with the same class, car. A separate category of models, called instance segmentation, can label the individual instances of an object class that appear in the image. Instance segmentation is useful in applications that count objects, such as measuring foot traffic in a shopping mall.
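To make the pixel-level classification view concrete, here is a minimal sketch in Python with NumPy. The class list and the score values are made up for illustration and do not come from any of the papers below.

```python
import numpy as np

# Hypothetical class list and per-pixel scores for a tiny 2x3 image.
CLASSES = ["background", "car", "person"]

# scores[c, y, x] is the score of class c at pixel (y, x).
scores = np.array([
    [[0.90, 0.2, 0.1], [0.8, 0.1, 0.3]],  # background
    [[0.05, 0.7, 0.8], [0.1, 0.8, 0.1]],  # car
    [[0.05, 0.1, 0.1], [0.1, 0.1, 0.6]],  # person
])

# Semantic segmentation keeps one class label per pixel.
label_map = scores.argmax(axis=0)
class_names = [[CLASSES[c] for c in row] for row in label_map]
```

Every pixel predicted as class 1 is simply "car"; nothing in the label map says which car it belongs to. Telling the cars apart is what instance segmentation adds.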

Major applications include self-driving cars, human-computer interaction, robotics, and photo editing and creative tools. For example, semantic segmentation matters in self-driving cars and robotics because the models need to understand the context in which they operate.

“Two men riding on a bike in front of a building on the road. And there is a car.”

This article introduces research papers on recent methods for building semantic segmentation models, namely:

  • Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

  • Fully Convolutional Networks for Semantic Segmentation

  • U-Net: Convolutional Networks for Biomedical Image Segmentation

  • The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

  • Multi-Scale Context Aggregation by Dilated Convolutions

  • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

  • Rethinking Atrous Convolution for Semantic Image Segmentation

  • Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  • FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

  • Improving Semantic Segmentation via Video Propagation and Label Relaxation

  • Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

Instructions for downloading the papers above are given at the end of the article.

Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation

Paper: Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation (ICCV, 2015)

Code: https://bitbucket.org/deeplab/deeplab-public

This paper proposes a way to train deep convolutional neural networks (CNNs) on weakly labeled data, and on combinations of a small amount of strongly labeled data with a large amount of weakly labeled data.

The model combines a deep CNN with a fully connected conditional random field (CRF).

On the PASCAL VOC segmentation benchmark, the model gives a mean intersection-over-union (IoU) score above 70%. One of the main challenges with such a model is that it requires images annotated at the pixel level during training.
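Since mean IoU is the metric quoted for most of the papers in this article, a short sketch of how it is commonly computed may help. This is a minimal NumPy version under the usual definition (per-class intersection over union, averaged over classes); the function name and the toy label maps are assumptions for illustration, and benchmark toolkits may differ in details such as how ignored pixels are handled.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    # Skip classes that appear in neither the prediction nor the ground truth.
    valid = union > 0
    return float((intersection[valid] / union[valid]).mean())

# Toy 2x3 example with three classes; one pixel of class 0 is predicted as class 1.
target = np.array([[0, 0, 1], [1, 1, 2]])
pred = np.array([[0, 1, 1], [1, 1, 2]])
score = mean_iou(pred, target, num_classes=3)  # (0.5 + 0.75 + 1.0) / 3
```

Because the score is averaged per class, a single wrong pixel in a small class costs as much as many wrong pixels in a large one.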

The main contributions of this paper are:

  • An expectation-maximization (EM) algorithm is introduced for training from bounding-box or image-level annotations in weakly supervised and semi-supervised settings.

  • Combining weak and strong annotations is shown to improve performance. After combining annotations from the MS-COCO and PASCAL datasets, the authors reach 73.9% IoU on PASCAL VOC 2012.

  • The method is shown to achieve higher performance by combining a small number of pixel-level annotated images with a large number of bounding-box or image-level annotated images.

Fully convolutional networks for semantic segmentation

Paper: Fully Convolutional Networks for Semantic Segmentation (PAMI, 2016)

Code: fcn.berkeleyvision.org

The proposed model achieves 67.2% mean IU on PASCAL VOC 2012.

A fully convolutional network takes an image of any size and produces an output with the corresponding spatial dimensions. In this model, ILSVRC classifiers are cast into fully convolutional networks and augmented for dense prediction with in-network upsampling and a pixel-wise loss. The network is then trained for segmentation by fine-tuning, which is done by backpropagation through the whole network.
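The idea of casting a classifier into a convolutional form can be illustrated with a small NumPy sketch. It is not the authors' code: the weights are random stand-ins and the names `classify_vector` and `classify_map` are hypothetical. It only shows that a fully connected classifier applied as a 1×1 convolution accepts feature maps of any spatial size and returns scores at the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FEATURES, NUM_CLASSES = 8, 3

# Hypothetical classifier weights (a fully connected layer: features -> classes).
weights = rng.normal(size=(NUM_CLASSES, NUM_FEATURES))
bias = rng.normal(size=NUM_CLASSES)

def classify_vector(feature_vector):
    """Fully connected classifier applied to a single feature vector."""
    return weights @ feature_vector + bias

def classify_map(feature_map):
    """The same classifier as a 1x1 convolution over a (features, H, W) map."""
    return np.einsum("cf,fhw->chw", weights, feature_map) + bias[:, None, None]

# The convolutional form accepts any spatial size and preserves it.
small = classify_map(rng.normal(size=(NUM_FEATURES, 4, 4)))
feature_map = rng.normal(size=(NUM_FEATURES, 5, 7))
large = classify_map(feature_map)
```

At every location the convolutional output equals what the fully connected classifier would produce for that pixel's feature vector, which is why pretrained classifier weights can be reused for dense prediction.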

U-Net: Convolutional networks for biomedical image segmentation

Paper: U-Net: Convolutional Networks for Biomedical Image Segmentation (MICCAI, 2015)

Code: lmb.informatik.uni-freiburg.de/people/ronn…

In biomedical image processing, it is very important to obtain a class label for each cell in the image. The biggest challenge in biomedical tasks is that thousands of training images are hard to obtain.

This paper builds on the fully convolutional network and modifies it so that it works with few training images and produces more precise segmentations.

Since very little training data is available, the model uses data augmentation by applying elastic deformations to the available data. As shown in Figure 1, the network architecture consists of a contracting path on the left and an expansive path on the right.

The contracting path consists of repeated blocks of two 3×3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2×2 max pooling operation for downsampling. Each downsampling step doubles the number of feature channels. Each step in the expansive path upsamples the feature map and applies a 2×2 convolution that halves the number of feature channels. The final layer is a 1×1 convolution that maps each feature vector to the desired number of classes.
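The size and channel bookkeeping described above can be traced with a few lines of Python. This sketch makes assumptions that the text here does not state but that come from the original U-Net paper: the 3×3 convolutions are unpadded (each trims 2 pixels), there are four downsampling steps, the first level has 64 channels, and each expansive step also applies two 3×3 convolutions. The function name is hypothetical.

```python
def unet_trace(input_size, base_channels=64, depth=4):
    """Return (output_size, output_channels) for a U-Net style layout."""
    size, channels = input_size, base_channels
    # Contracting path: two unpadded 3x3 convs, then 2x2 max pooling.
    for _ in range(depth):
        size -= 4                      # each unpadded 3x3 conv trims 2 pixels
        if size % 2:
            raise ValueError("feature map size must be even before pooling")
        size //= 2                     # 2x2 max pooling halves the resolution
        channels *= 2                  # channels double at each downsampling step
    size -= 4                          # two 3x3 convs at the bottom of the "U"
    # Expansive path: upsample, 2x2 conv halves the channels, two 3x3 convs.
    for _ in range(depth):
        size *= 2
        channels //= 2
        size -= 4
    return size, channels

output_size, output_channels = unet_trace(572)
```

With a 572×572 input this gives a 388×388 output with 64 channels before the final 1×1 convolution, so the predicted map is smaller than the input whenever unpadded convolutions are used.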

In this model, training uses the input images, their segmentation maps, and Caffe’s stochastic gradient descent implementation. Data augmentation teaches the network the required robustness and invariance when very little training data is available. The model achieved an average IoU score of 92% in one experiment.

The One Hundred Layers Tiramisu: Fully convolutional DenseNets for semantic segmentation

Paper: The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation (2017)

Code: github.com/SimJeg/FC-D…

The idea behind DenseNets is that connecting each layer to every other layer in a feed-forward fashion makes the network easier to train and more accurate.
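The dense connectivity pattern can be sketched in NumPy as follows. This is an illustration only: a random 1×1 projection followed by ReLU stands in for the real convolutional layers, and the channel counts and growth rate are made-up values. What it shows is that each layer takes the concatenation of all earlier feature maps as input and contributes a fixed number of new channels.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Run a toy dense block on x of shape (channels, H, W)."""
    features = [x]
    input_channels = []
    for _ in range(num_layers):
        # Each layer sees the concatenation of every earlier feature map.
        stacked = np.concatenate(features, axis=0)
        input_channels.append(stacked.shape[0])
        # Stand-in for the real layer: random 1x1 projection plus ReLU.
        weights = rng.normal(size=(growth_rate, stacked.shape[0]))
        new_maps = np.maximum(np.einsum("oc,chw->ohw", weights, stacked), 0.0)
        features.append(new_maps)
    return np.concatenate(features, axis=0), input_channels

rng = np.random.default_rng(0)
block_input = rng.normal(size=(48, 8, 8))
block_output, seen_channels = dense_block(block_input, num_layers=4, growth_rate=16, rng=rng)
```

The input width of successive layers grows linearly (48, 64, 80, 96 here), which is why dense blocks keep the per-layer growth rate small.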

The model is built from dense blocks arranged in a downsampling path and an upsampling path. The downsampling path has two Transitions Down (TD) and the upsampling path has two Transitions Up (TU). The circles and arrows represent connection patterns within the network.

The main contributions of this paper are:

  • The DenseNet architecture is extended to fully convolutional networks for semantic segmentation.

  • An upsampling path built from dense blocks is proposed that performs better than other upsampling paths.

  • The network is shown to produce state-of-the-art (SOTA) results on standard benchmarks.

  • The model achieved 88% global accuracy on the CamVid dataset.

Multi-scale context aggregation by dilated convolutions

Paper: Multi-Scale Context Aggregation by Dilated Convolutions (ICLR, 2016)

Code: github.com/fyu/dilatio…

This paper develops a convolutional network module that aggregates multi-scale contextual information without losing resolution. The module can be plugged into existing architectures at any resolution. It is based on dilated convolutions.
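A one-dimensional sketch shows why stacked dilated convolutions gather wide context while keeping resolution. The helper names and the choice of a 3-tap kernel with dilations 1, 2, 4 are assumptions for illustration, not the paper's context module. The code pushes a single impulse through the stack and counts how many output positions it reaches.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1D dilated convolution (odd kernel) with same-length output."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = np.pad(signal, pad)
    out = np.zeros(len(signal))
    for i in range(len(signal)):
        for j in range(k):
            out[i] += kernel[j] * padded[i + j * dilation]
    return out

def receptive_field(dilations, length=31):
    """Count the output positions influenced by one input sample."""
    signal = np.zeros(length)
    signal[length // 2] = 1.0
    kernel = np.ones(3)
    for dilation in dilations:
        signal = dilated_conv1d(signal, kernel, dilation)
    return int(np.count_nonzero(signal)), len(signal)

field, output_length = receptive_field([1, 2, 4])
```

Three layers with dilations 1, 2, 4 cover 15 positions, compared with 7 for three undilated layers, and the output stays at the input length of 31 throughout.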

The module was tested on the Pascal VOC 2012 dataset. The results show that adding a context module to existing semantic segmentation architectures improves their accuracy.

The front-end module trained in the experiments achieved 69.8% mean IoU on the VOC-2012 validation set and 71.3% mean IoU on the test set. The prediction accuracy of this model for different objects is shown below.

DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

Paper: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (TPAMI, 2017)

Code: github.com/tensorflow/… (Unofficial)

In this paper, the authors make the following contributions to semantic segmentation with deep learning:

  • Convolution with upsampled filters (atrous convolution) for dense prediction tasks.

  • Atrous spatial pyramid pooling (ASPP) for segmenting objects at multiple scales.

  • Improved localization of object boundaries by combining DCNNs with fully connected CRFs.

The proposed DeepLab system achieves 79.7% mIoU on the PASCAL VOC-2012 semantic image segmentation task.

This paper addresses the main challenges of using deep CNNs for semantic segmentation, including:

  • Reduced feature resolution caused by repeated max pooling and downsampling.

  • The existence of objects at multiple scales.

  • Reduced localization accuracy caused by DCNN invariance, since an object-centric classifier requires invariance to spatial transformations.

Atrous convolution can be implemented either by upsampling the filters by inserting zeros, or by sparsely sampling the input feature maps. The second method subsamples the input feature map by a factor equal to the atrous convolution rate r, deinterlacing it to produce r² reduced-resolution maps, one for each of the r×r possible shifts. Standard convolution is then applied to these intermediate feature maps, and the results are reinterlaced to the original image resolution.
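The equivalence of the two implementations can be checked numerically. The sketch below is a one-dimensional analogue, so deinterlacing yields r sub-signals rather than r² maps; the function names are hypothetical, and zero padding at the borders and an odd-length filter are assumed.

```python
import numpy as np

def correlate_same(x, w):
    """Standard zero-padded convolution (odd filter) with same-length output."""
    pad = (len(w) - 1) // 2
    padded = np.pad(x, pad)
    return np.array([padded[i:i + len(w)] @ w for i in range(len(x))])

def atrous_by_filter_upsampling(x, w, rate):
    """Method 1: insert rate - 1 zeros between filter taps, then convolve."""
    upsampled = np.zeros(rate * (len(w) - 1) + 1)
    upsampled[::rate] = w
    return correlate_same(x, upsampled)

def atrous_by_input_subsampling(x, w, rate):
    """Method 2: deinterlace the input, convolve each part, reinterlace."""
    out = np.zeros(len(x))
    for shift in range(rate):
        out[shift::rate] = correlate_same(x[shift::rate], w)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=12)
w = np.array([1.0, -2.0, 1.0])
results_match = all(
    np.allclose(atrous_by_filter_upsampling(x, w, r),
                atrous_by_input_subsampling(x, w, r))
    for r in (1, 2, 3)
)
```

Both routes give the same output; the second one is convenient when only a standard convolution routine is available, since it never materializes the zero-filled filter.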

Rethinking atrous convolution for semantic image segmentation

Paper: Rethinking Atrous Convolution for Semantic Image Segmentation (2017)

Code: github.com/pytorch/vis… (Unofficial)

This paper addresses two challenges (mentioned earlier) of semantic segmentation with DCNNs: the reduced feature resolution caused by consecutive pooling operations, and the existence of objects at multiple scales.

To address the first problem, the paper uses atrous convolution, also known as dilated convolution. To address the second, it uses atrous convolution to enlarge the field of view and thereby incorporate multi-scale context.

The paper’s “DeepLabv3” achieved 85.7% performance on the PASCAL VOC 2012 test set without DenseCRF post-processing.

Encoder-decoder with atrous separable convolution for semantic image segmentation

Paper: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (ECCV, 2018)

Code: github.com/tensorflow/…

The proposed method, “DeepLabv3+”, achieves test set performance of 89.0% and 82.1% on the PASCAL VOC 2012 and Cityscapes datasets, respectively, without any post-processing. The model extends DeepLabv3 by adding a simple decoder module that refines the segmentation results.

The paper considers two types of neural networks for semantic segmentation: networks with a spatial pyramid pooling module, which capture contextual information by pooling features at different resolutions, and networks with an encoder-decoder structure, which capture sharp object boundaries.

FastFCN: Rethinking dilated convolution in the backbone for semantic segmentation

Paper: FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation (2019)

Code: github.com/wuhuikai/Fa…

This paper proposes a Joint Pyramid Upsampling (JPU) module to replace dilated convolutions, which are time-consuming and memory-intensive. It works by formulating the extraction of high-resolution feature maps as a joint upsampling problem.

The method achieves 53.13% mIoU on the Pascal Context dataset while running three times faster.

The method uses a fully convolutional network (FCN) as the backbone and applies JPU to upsample the final low-resolution feature maps into high-resolution feature maps. Replacing dilated convolutions with JPU does not result in any performance loss.

Joint upsampling takes a low-resolution target image and a high-resolution guidance image, and generates a high-resolution target image by transferring the structure and details of the guidance image.

Improving semantic segmentation via video propagation and label relaxation

Paper: Improving Semantic Segmentation via Video Propagation and Label Relaxation (2019)

Code: github.com/NVIDIA/sema…

This paper proposes a video-based method that scales up the training set by synthesizing new training samples, with the aim of improving the accuracy of semantic segmentation networks. It exploits the ability of video prediction models to predict future frames in order to also predict future labels.

The paper shows that training segmentation networks on datasets augmented with the synthesized data improves prediction accuracy. The proposed method achieves 83.5% mIoU on Cityscapes and 82.9% on CamVid.

The paper proposes two methods for predicting future labels:

  • Label Propagation (LP) creates new training samples by pairing propagated labels with the original future frames.

  • Joint Image-Label Propagation (JP) creates new training samples by pairing propagated labels with the corresponding propagated images.

The paper makes three main proposals: using a video prediction model to propagate labels to immediately neighboring frames, introducing joint image-label propagation to deal with misalignment, and relaxing one-hot label training by maximizing the likelihood of the union of class probabilities along object boundaries.
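The label relaxation idea can be sketched as a loss function. This is a simplified NumPy reading of the description above, not the authors' implementation: for each pixel it collects the classes present in a small window of the label map and penalizes the negative log of the summed probability of those classes. The function name, the window size, and the toy inputs are assumptions.

```python
import numpy as np

def relaxed_label_loss(probs, labels, window=1):
    """Mean of -log(sum of probabilities of the classes found near each pixel).

    probs:  (num_classes, H, W) per-pixel class probabilities.
    labels: (H, W) integer label map.
    window: neighborhood radius; 0 reduces to ordinary one-hot cross-entropy.
    """
    _, height, width = probs.shape
    loss = np.zeros((height, width))
    for y in range(height):
        for x in range(width):
            y0, y1 = max(0, y - window), min(height, y + window + 1)
            x0, x1 = max(0, x - window), min(width, x + window + 1)
            # Classes present in the neighborhood are all accepted as correct.
            allowed = np.unique(labels[y0:y1, x0:x1])
            loss[y, x] = -np.log(probs[allowed, y, x].sum())
    return float(loss.mean())

# Toy 1x4 strip with a boundary between class 0 and class 1 in the middle.
labels = np.array([[0, 0, 1, 1]])
probs = np.full((2, 1, 4), 0.5)   # a maximally unsure two-class prediction
relaxed = relaxed_label_loss(probs, labels, window=1)
one_hot = relaxed_label_loss(probs, labels, window=0)
```

Away from boundaries only one class is present in the window, so the loss matches one-hot cross-entropy; at the two boundary pixels either class is accepted, which halves the loss in this toy example.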

Gated-SCNN: Gated shape CNNs for semantic segmentation

Paper: Gated-SCNN: Gated Shape CNNs for Semantic Segmentation (2019)

Code: nv-tlabs.github.io/GSCNN/

This paper is the most recent work covered here. The authors propose a two-stream CNN architecture in which shape information is processed as a separate branch. This shape stream processes only boundary-related information, which is enforced by the model’s gated convolution layer (GCL) and local supervision.

The model outperforms DeepLabv3+ by 1.5% on mIoU and by 4% on F-boundary score. It was evaluated on the Cityscapes benchmark. On smaller and thinner objects, it achieves a 7% improvement in IoU.

The following table shows the performance of Gated-SCNN compared to other models.

Conclusion

You should now be familiar with some of the most common, and some of the more recent, techniques for performing semantic segmentation in a variety of contexts.

To get all of the papers above, reply with the keyword “0009” to the public account CV Technical Guide.

By Derrick Mwiti

Compiled by: CV Technical Guide

Original article: heartbeat.comet.ml/a-2019-guid…
