This post records my reading of the CVPR 2020 paper “Semi-supervised Semantic Image Segmentation with Self-correcting Networks”, last updated on March 26, 2020.

The original post was published on another platform and imported here, so there may be formatting problems.

Semi-supervised Semantic Image Segmentation with Self-correcting Networks

Abstract

  1. This paper introduces a principled semi-supervised framework for semantic segmentation.
  2. Two variants of the self-correction module are introduced, one based on a linear function and one based on convolutions.
  3. The resulting models are equal to or better than models trained on the large fully supervised set, while requiring about 7× less annotation work.

Introduction

  1. This paper proposes a semi-supervised method that uses cheap object bounding-box labels for training, in order to reduce the data requirements of semantic segmentation. The reduced annotation need comes at the cost of having to infer mask labels for the objects inside the bounding boxes.
  2. A principled framework is proposed to train semantic segmentation models in a semi-supervised setting, using a small set of fully supervised images (with semantic object masks and bounding boxes) and a set of weak images (with bounding-box annotations only).
  3. The framework is a self-correcting segmentation model: it improves the weakly supervised labels based on its current probabilistic model of object masks.
  4. Experiments on the PASCAL VOC and Cityscapes datasets show that models trained with a small subset of the fully supervised set perform as well as (and in some cases even better than) models trained with all fully supervised images.

Related Work

  1. Semantic segmentation: DeepLabv3+ is used as the segmentation model because, with a simple factorized per-pixel output, it outperforms the earlier CRF-based DeepLab models.
  1. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
  1. Robust training: Training segmentation models with bounding-box information can be cast as a robust-learning problem with noisily labeled instances. The methods below, however, are limited to image classification and have not been applied to image segmentation.
  1. Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6575-6583. IEEE, 2017.
  2. Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Computer Vision and Pattern Recognition (CVPR), 2018.
  3. Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Regularizing very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML), 2018.
  4. Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. Fidelity-weighted learning. In International Conference on Learning Representations (ICLR), 2018.
  1. Semi-supervised semantic segmentation: **This paper focuses on deep segmentation CNNs trained with bounding-box labels.** Papandreou et al. [41] propose, building on DeepLabv1 [6], an expectation-maximization (EM) algorithm to estimate segmentation labels for the weak image set (box information only); at each training step, segmentation labels are estimated from the network output in an EM fashion. Dai et al. [12] propose an iterative training method that alternates between generating region proposals (from a fixed proposal pool) and fine-tuning the network. Similarly, Khoreva et al. [26] use an iterative algorithm, but rely on GrabCut [47] and hand-crafted rules to extract segmentation masks in each iteration. ==Our work differs from the previous approaches in two significant ways:==

    i) We replace hand-crafted rules with an auxiliary CNN that extracts probabilistic segmentation labels for the objects inside the boxes of the weak set.

    ii) During training, a self-correction module reconciles the mismatch between the output of the auxiliary CNN and that of the primary segmentation model.

[41] George Papandreou, Liang-Chieh Chen, Kevin P. Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
[12] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
[47] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG). ACM, 2004.

In addition to box annotations, segmentation models can also use other forms of weak annotation, such as image-level labels [60, 62, 22, 3, 17, 61, 15], image tags [68], scribbles [64, 31], point annotations [5], or web videos [20]. Recently, methods based on adversarial learning [23, 51] have been proposed for this problem. Our framework is complementary to these other forms of supervision and to adversarial training, and can be used together with them.

[60] Xiang Wang, Shaodi You, Xi Li, and Huimin Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[22] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] Weifeng Ge, Sibei Yang, and Yizhou Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[61] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[64] Jia Xu, Alexander G. Schwing, and Raquel Urtasun. Learning to segment under various forms of weak supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[31] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Proposed Approach

In Section 3.1 we present the auxiliary model, and in Section 3.2 we show a simple way to use it to train the primary model. In Sections 3.3 and 3.4 we propose two self-correction models.

3.1. Auxiliary segmentation model

The key to semi-supervised training of a segmentation model with bounding-box labels is inferring the segmentation of the objects inside the boxes. Existing approaches to this problem rely mainly on hand-coded, rule-based procedures such as GrabCut [47] or on iterative label-refinement schemes [41, 12, 26]. The latter typically alternate between extracting a segmentation from the image and refining the labels using the bounding-box information (for example, by zeroing out the predicted mask outside the boxes).

[47] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG). ACM, 2004.
[41] George Papandreou, Liang-Chieh Chen, Kevin P. Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
[12] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
[16] Mark Everingham, S. M. Ali Eslami, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 2015.
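As an aside, here is a minimal NumPy sketch of the "zero out the mask outside the boxes" refinement step mentioned above; the function name and the box format are my own, not taken from any of the cited methods:

```python
import numpy as np

def refine_with_boxes(probs, boxes, num_classes):
    """Zero out each foreground class's probability outside that class's boxes,
    then take the per-pixel argmax as the refined label.

    probs: (H, W, C+1) per-pixel class probabilities, channel 0 = background.
    boxes: list of (class_id, x1, y1, x2, y2) with class_id in 1..C.
    """
    h, w, _ = probs.shape
    allowed = np.zeros((h, w, num_classes + 1), dtype=bool)
    allowed[..., 0] = True                    # background allowed everywhere
    for cls, x1, y1, x2, y2 in boxes:
        allowed[y1:y2, x1:x2, cls] = True     # class allowed only inside its boxes
    return np.where(allowed, probs, 0.0).argmax(axis=-1)
```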

The main problems with such procedures are: i) the bounding-box information is not used directly when extracting the segmentation mask; ii) the procedure may be suboptimal because it is designed by hand; iii) the segmentation becomes ambiguous when multiple boxes overlap.

In this paper, we take a different approach and design an auxiliary segmentation model that produces a per-pixel label distribution given the image and its bounding-box annotations. The model is easy to train using the fully supervised set F and can then serve as a training signal for the images in W: at inference time, an image and its bounding boxes are fed to the network to obtain p_anc(y | x^(w), b^(w)), a distribution over segmentation labels.

Our key observation in designing the auxiliary model is that ==encoder-decoder segmentation networks typically rely on encoders initialized from image classification models (e.g., ImageNet-pretrained models)==, which usually improves segmentation performance by transferring knowledge from large image classification datasets. To keep this advantage, we augment the encoder-decoder segmentation model with a parallel bounding-box encoder network that embeds the bounding-box information at several scales (see Figure 2).

The input to the box encoder is a 3D tensor representing the binary mask of the boxes, plus a target shape for the encoder output. ==The input mask tensor is resized to the target shape and then passed through a 3×3 convolution layer with a sigmoid activation function.==

The resulting tensor can be interpreted as an attention map that is multiplied element-wise with the feature map produced by the segmentation encoder. Figure 2 shows two such paths at two different scales, as in the DeepLabv3+ architecture. For each scale, an attention map is generated, fused with the corresponding feature map by element-wise multiplication, and fed to the decoder. For an image of size W×H×3, its object bounding boxes are represented by a binary mask of size W×H×(C+1), i.e., C+1 binary masks: the c-th mask is 1 at a pixel if that pixel lies inside a bounding box of class c, and the background mask is 1 at pixels not covered by any bounding box.
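A sketch of how this might look in code, assuming the W×H×(C+1) mask layout and the resize → 3×3 conv → sigmoid → element-wise multiply path described above (PyTorch is used purely for illustration; the class and function names are mine, not from the paper's implementation):

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def boxes_to_mask(h, w, boxes, num_classes):
    """Encode object boxes as the H x W x (C+1) binary mask described above.
    boxes: list of (class_id, x1, y1, x2, y2), class_id in 1..C.
    Channel c is 1 where a box of class c covers the pixel; channel 0
    (background) is 1 where no box covers the pixel."""
    mask = np.zeros((h, w, num_classes + 1), dtype=np.float32)
    for cls, x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2, cls] = 1.0
    mask[..., 0] = (mask[..., 1:].sum(axis=-1) == 0).astype(np.float32)
    return mask

class BoxAttention(nn.Module):
    """One scale of the box encoder: resize the box mask to the feature map's
    spatial size, apply a 3x3 conv with sigmoid, and gate the encoder features
    element-wise with the resulting attention map."""
    def __init__(self, num_classes, feat_channels):
        super().__init__()
        self.conv = nn.Conv2d(num_classes + 1, feat_channels, kernel_size=3, padding=1)

    def forward(self, box_mask, features):
        # box_mask: (N, C+1, H, W); features: (N, feat_channels, h, w)
        m = F.interpolate(box_mask, size=features.shape[-2:], mode="nearest")
        return features * torch.sigmoid(self.conv(m))
```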

We train the auxiliary model with the cross-entropy loss on the fully supervised set F:
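The equation image did not survive the import; a plausible reconstruction of the per-pixel cross-entropy objective over F, with m indexing pixels and θ the auxiliary model's parameters (the notation is assumed, not copied from the paper):

```latex
\mathcal{L}_{\mathrm{anc}}(\theta) = -\sum_{(\mathbf{x},\,\mathbf{b},\,\mathbf{y}) \in \mathcal{F}} \; \sum_{m}
\log p_{\mathrm{anc}}\!\left(y_m \mid \mathbf{x}, \mathbf{b};\, \theta\right)
```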

The auxiliary model is then kept fixed for the remaining experiments.

3.2. No self-correction

We observe experimentally that the auxiliary model performs better than a segmentation model without box information, mainly because the bounding-box information guides the auxiliary model to look for objects inside the boxes at inference time. ==The simplest way to train the primary model is to train it to match the ground-truth labels on the fully supervised set F and the labels generated by the auxiliary model on the weak set W.== For this “no self-correction” model, the self-correction module in Figure 1 simply copies the predictions of the auxiliary segmentation model. Training is guided by optimizing:
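The original equation is missing from the import; a plausible reconstruction based on the description below (first term: cross entropy against one-hot ground-truth labels on F; second term: cross entropy against the soft labels from p_anc on W; φ denotes the primary model's parameters):

```latex
\mathcal{L}(\phi) =
-\sum_{(\mathbf{x},\,\mathbf{y}) \in \mathcal{F}} \sum_{m} \log p\!\left(y_m \mid \mathbf{x};\, \phi\right)
\;-\;
\sum_{(\mathbf{x},\,\mathbf{b}) \in \mathcal{W}} \sum_{m} \sum_{c}
p_{\mathrm{anc}}\!\left(y_m = c \mid \mathbf{x}, \mathbf{b};\, \theta\right)
\log p\!\left(y_m = c \mid \mathbf{x};\, \phi\right)
```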

Here, the first term is the cross-entropy loss against the one-hot ground-truth labels, and the second term is the cross-entropy loss against the soft probabilistic labels generated by p_anc. Note that the auxiliary model, parameterized by θ, is kept fixed. We call this the no-self-correction model because it relies directly on the auxiliary model to train the primary model on W.

3.3. Linear self-correction

Equation 2 relies on the auxiliary model to predict the label distribution on the weak set. However, that model is trained only on instances from F, never on data from W.

Vahdat [56] introduced a regularized expectation-maximization algorithm that uses a linear combination of KL divergences to infer the distribution over missing labels in general classification problems. **The main idea is that the inferred label distribution q(y | x, b) should stay close to both the distribution produced by the auxiliary model, p_anc(y | x, b), and the one produced by the primary model, p(y | x).** However, since the primary model cannot predict segmentation masks accurately early in training, the two terms are re-weighted with a positive factor α:
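The equation itself is missing from the import; a plausible reconstruction that is consistent with the limiting behavior described below (α → ∞ gives q = p_anc, α = 0 gives q = p):

```latex
q(\mathbf{y} \mid \mathbf{x}, \mathbf{b}) =
\arg\min_{q} \;
\alpha \,\mathrm{KL}\!\left(q(\mathbf{y} \mid \mathbf{x}, \mathbf{b}) \,\middle\|\, p_{\mathrm{anc}}(\mathbf{y} \mid \mathbf{x}, \mathbf{b})\right)
+ \mathrm{KL}\!\left(q(\mathbf{y} \mid \mathbf{x}, \mathbf{b}) \,\middle\|\, p(\mathbf{y} \mid \mathbf{x})\right)
```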

Because p_anc(y | x, b) and p(y | x) both factorize into a product of per-pixel probabilities, and each per-pixel factor is a categorical distribution, q also factorizes, and each of its per-pixel categorical distributions is obtained by applying a softmax to a linear combination of the primary and auxiliary model logits. ==Here σ(.) is the softmax function, and l_m and l_anc,m are the logits produced by the primary and auxiliary models for the m-th pixel.==
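A minimal NumPy sketch of the per-pixel linear self-correction under this reading, i.e., q = σ((l + α·l_anc)/(1 + α)); variable names are mine:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def linear_self_correction(logits_main, logits_anc, alpha):
    """Soft labels q = softmax((logits_main + alpha * logits_anc) / (1 + alpha)).

    This is the closed-form minimizer of
        alpha * KL(q || p_anc) + KL(q || p)
    applied independently at every pixel.
    """
    return softmax((logits_main + alpha * logits_anc) / (1.0 + alpha))
```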

At each training iteration, the primary model is trained on the weak set against the fixed distribution q(y | x^(w), b^(w)); that is, the primary model is trained with the objective of Equation 2, with q(y | x^(w), b^(w)) in place of p_anc.

Note that α in Equation 3 controls how close q stays to p(y | x) versus p_anc(y | x, b). As α → ∞, q = p_anc(y | x, b) and linear self-correction collapses to Equation 2, while α = 0 gives q = p(y | x). A finite α keeps q close to both p(y | x) and p_anc(y | x, b). At the start of training, the primary model p(y | x) cannot yet predict the segmentation label distribution well. Therefore, we define a schedule in which α decreases from a large value to a small value over the course of training the primary model.

This model is called a linear self-correction model ==because it uses the solution of a linear combination of KL divergences (Equation 3) to infer the distribution over the latent segmentation labels. As the parameters of the primary model are optimized during training, decreasing α biases the self-correction mechanism towards the primary model.==

3.4. Convolutional self-correction

A disadvantage of linear self-correction is that it requires a hyperparameter search to tune the α schedule during training.

To overcome this difficulty, we replace the linear function with a ==convolutional network that learns the self-correction mechanism==. The network thus adapts the mechanism automatically and dynamically as the primary model is trained: if the primary model predicts labels accurately, the network can shift its predictions towards the primary model.
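A sketch of what such a subnetwork could look like, following the two-layer CNN over concatenated logits described in Figure 3 (PyTorch for illustration; the layer widths and names are my guesses, not the paper's):

```python
import torch
import torch.nn as nn

class ConvSelfCorrection(nn.Module):
    """Two-layer CNN that refines the label distribution (cf. Figure 3): it takes
    the concatenated logits of the primary and ancillary models and predicts a
    corrected per-pixel distribution over the C+1 classes."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * (num_classes + 1), hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes + 1, kernel_size=3, padding=1),
        )

    def forward(self, logits_main, logits_anc):
        q_logits = self.net(torch.cat([logits_main, logits_anc], dim=1))
        return torch.softmax(q_logits, dim=1)   # refined soft labels q(y | x, b)
```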

To this end, we introduce an additional term into the objective function that trains this subnetwork on the training examples in F, while the primary model is trained on the entire dataset:
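The equation (Equation 6) is missing from the import. One plausible reconstruction, consistent with the surrounding text (first term: cross entropy on F for the primary model; middle term: cross entropy on W against the refined labels q_conv; last term: cross entropy on F for the self-correction subnetwork with parameters λ) — this is my reading, not a verified copy of the paper's formula:

```latex
\mathcal{L}(\phi, \lambda) =
-\sum_{\mathcal{F}} \sum_{m} \log p\!\left(y_m \mid \mathbf{x};\, \phi\right)
-\sum_{\mathcal{W}} \sum_{m} \sum_{c}
q_{\mathrm{conv}}\!\left(y_m = c \mid \mathbf{x}, \mathbf{b};\, \lambda\right)
\log p\!\left(y_m = c \mid \mathbf{x};\, \phi\right)
-\sum_{\mathcal{F}} \sum_{m}
\log q_{\mathrm{conv}}\!\left(y_m \mid \mathbf{x}, \mathbf{b};\, \lambda\right)
```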

Since the subnetwork is randomly initialized, it cannot predict the segmentation labels on W accurately at the beginning of training. To address this, we use the following training procedure:

1. Initial training of the auxiliary model: As for the previous self-correction models, we need to train the auxiliary model. Here, only half of the fully supervised set F is used for this purpose.

2. Initial training of the convolutional self-correction network: The fully supervised data F is used to train the primary model and the convolutional self-correction network, using only the first and last terms of Equation 6.

3. Main training: Fine-tune the models from the previous stage with the full objective of Equation 6 on all the data (F and W).

The rationale for using only half of F in the first stage is that, if p_anc(y | x, b) were trained on all of F, it would be a near-perfect predictor of the segmentation masks on that set, and the convolutional self-correction network would subsequently learn only to copy p_anc(y | x, b). To avoid this, the second half of F gives the self-correction network examples on which it must actually learn how to combine p_anc(y | x, b) and p(y | x).
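A compact outline of this three-stage schedule, with the actual training loops abstracted away as callables (purely illustrative; none of these names come from the paper's code):

```python
def train_self_correcting_pipeline(F, W, train_ancillary,
                                   pretrain_main_and_subnet, finetune_all):
    """Three-stage schedule described above. The three callables stand in for
    the real training loops and are not defined here."""
    half = len(F) // 2
    ancillary = train_ancillary(F[:half])           # stage 1: half of F
    main, subnet = pretrain_main_and_subnet(F)      # stage 2: all of F (1st + last terms of Eq. 6)
    return finetune_all(main, subnet, ancillary, F, W)  # stage 3: F and W, full Eq. 6
```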

Experiments

We evaluate the models on PASCAL VOC 2012 and the Cityscapes dataset. Both datasets contain object segmentation masks and bounding-box annotations. We split the fully annotated training data into two parts to simulate the semi-supervised setup. Following [9, 41], performance is measured with the mean intersection-over-union (mIoU) over the available classes.

Figure 3: The convolutional self-correction model learns to refine the input label distributions. The subnetwork receives the logits from the primary and ancillary models, concatenates them, and feeds the result to a two-layer CNN.

(Side note on the word “logit”: it decomposes as log + odds, i.e., the logarithm of the odds.)

Training: We use the public TensorFlow [1] implementation of DeepLabv3+ [9] as the primary model. We train for 30,000 steps with an initial learning rate of 0.007, starting from the ImageNet-pretrained Xception-65 backbone [9]. For all other parameters, we use the standard settings suggested by the authors. At evaluation time, we apply flipping and multi-scale processing to the images, as in [9]. We use 4 GPUs, each with a batch of 4 images.
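The stated hyperparameters, collected in one place as a small sketch (the key names are mine; anything not listed falls back to the DeepLabv3+ defaults):

```python
# Hyperparameters as stated in the paragraph above.
train_config = {
    "backbone": "xception_65",           # ImageNet-pretrained
    "initial_learning_rate": 0.007,
    "training_steps": 30_000,
    "num_gpus": 4,
    "images_per_gpu": 4,                 # effective batch size 16
    "eval_augmentation": ["horizontal_flip", "multi_scale"],
}
```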

We defined the following baselines in all experiments:

  1. Ancillary model: The auxiliary model described in Section 3.1, which predicts semantic segmentation labels given an image and its object bounding boxes. Its performance is expected to be better than the other models because it uses bounding-box information at inference time.

  2. No self-correction: The primary model trained as described in Section 3.2.

  3. Lin. self-correction: The primary model trained with linear self-correction (Section 3.3).

  4. Conv. self-correction: The primary model trained with convolutional self-correction (Section 3.4).

  5. EM-fixed baseline: Since our linear self-correction model is derived from a regularized EM model [56], we compare against Papandreou et al. [41], which is also EM-based. For a fair comparison, we re-implemented their EM-fixed baseline using DeepLabv3+; this variant achieved their best semi-supervised results in [41].

For linear self-correction, α controls the relative weight of the two KL divergence terms: a large α favors the auxiliary model, a small α favors the primary model. We explore different start and end values of α, decaying exponentially between them, and find that a start value of α = 30 and a final value of α = 0.5 work well on both datasets. This setting is robust: moderate changes to these values have little impact.
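A minimal sketch of such an exponential α schedule, assuming it is interpolated against the fraction of training completed (the exact schedule used in the paper may differ):

```python
def alpha_schedule(step, total_steps, alpha_start=30.0, alpha_end=0.5):
    """Exponentially decay the linear self-correction weight alpha, from a
    value that trusts the ancillary model to one that trusts the primary model."""
    ratio = step / max(1, total_steps)
    return alpha_start * (alpha_end / alpha_start) ** ratio
```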

4.1. PASCAL VOC dataset

In this section, we evaluate all models on the PASCAL VOC 2012 segmentation benchmark [16]. The dataset consists of 1464 training, 1449 validation, and 1456 test images, covering 20 foreground object classes plus one background class. [18] provides an auxiliary set of 9118 training images; however, the segmentation labels in [18] appear to contain a small amount of noise. In this section, we refer to the union of the original PASCAL VOC training set and the auxiliary set as the training set. We evaluate primarily on the validation set, and use the online evaluation server to evaluate only the best model, once, on the test set.

In Table 1, we show the performance of the different variants of our model for fully supervised sets F of different sizes; the remaining examples of the training set are used as W. From Table 1 we observe:

i) The auxiliary model, which predicts segmentation labels given an image and its object bounding boxes, performs well even when trained on as few as 200 images. ==This indicates that the model can provide a good training signal for the weak set, which lacks segmentation labels.==

ii) Linear self-correction generally performs better than no self-correction, supporting our view that ==combining the primary and auxiliary models to infer segmentation labels trains the primary model better==.

iii) ==The convolutional self-correction model performs on par with or better than the linear self-correction model==, while eliminating the need to define an α schedule. Figure 4 shows example outputs of these models.

Table 2 compares the performance of our model with different baselines and published results. In this experiment, 1464 images were used as F and 9118 images from the auxiliary data set were used as W. Both self-correcting models achieved similar results and were superior to other models.

Surprisingly, ==our semi-supervised models outperform the fully supervised model. We see two possible explanations for this observation==. First, it may be due to label noise in the 9K auxiliary set [18] negatively affecting the performance of vanilla DeepLabv3+; as evidence, Figure 5 compares the output of the auxiliary model with the ground-truth annotations and highlights some instances of incorrect labeling. Second, the improvement may also come from explicitly modeling label uncertainty and correcting it. To test this hypothesis, we trained vanilla DeepLabv3+ on only the 1.4K instances of the original PASCAL VOC 2012 training set and obtained 68.8% mIoU on the validation set. Training the convolutional self-correction model on the same training set, and allowing it to refine the ground-truth labels through self-correction, raises this to 76.88% mIoU (convolutional self-correction on top of the bounding boxes yields 75.97%). ==This indicates that a robust loss function with a self-correcting noise model can significantly improve segmentation performance==. This is in line with self-correction methods that have proven effective for edge detection [66, 2], and runs counter to the common practice of training segmentation models with cross entropy against one-hot annotation masks. Closely related to our approach and reasoning, [67] uses logits to train lightweight pose estimation models via knowledge distillation.

Unfortunately, ==most state-of-the-art methods still use older versions of DeepLab==. It was not feasible for us to re-implement most of these methods with DeepLabv3+, or to re-implement our work with the older versions. The only exception is the EM-fixed baseline [41]: our re-implementation using DeepLabv3+ reaches 79.25% on the validation set, compared to 64.6% reported in the original paper with DeepLabv1. In the bottom half of Table 2, we list previously published results (using older versions of DeepLab). A close look at the results suggests that our approach improves on prior work in that our semi-supervised models surpass their fully supervised counterparts, whereas previous methods usually do not.

Finally, comparing Tables 1 and 2, we find that with |F| = 200 and |W| = 10382, our linear self-correction model performs similarly to DeepLabv3+ trained on the entire dataset. Using the annotation costs reported in [5], this theoretically translates into roughly a seven-fold reduction in annotation cost.

4.2. Cityscapes dataset

We also evaluate on the Cityscapes dataset [11], which contains images collected from cars driving through cities in different seasons. This dataset has high-quality annotations, although some instances are over- or under-segmented. It includes 2975 training, 500 validation, and 1525 test images covering 19 semantic classes (stuff and things) for segmentation.

Table 3: Ablation study of our model on the Cityscapes validation set, reporting mIoU for different sizes of F. For the last three rows, the remaining images of the training set are used as W, i.e., |F| + |W| = 2975.

Table 4: Cityscapes validation set results. 30% of the training examples are used as F and the rest as W.

Figure 4: Qualitative results on the PASCAL VOC 2012 validation set. The last four columns correspond to the models trained with |F| = 1464 in Table 1. The Conv. self-correction model generally segments objects better than the other models.

5. Conclusion

In this paper, we propose a semi-supervised framework for training deep CNN segmentation models using a small set of fully labeled images and a set of weakly labeled images (with box annotations only). We introduce two mechanisms that enable the underlying primary model to correct the weak labels provided by the ancillary model. The proposed self-correction mechanisms combine the predictions of the primary and ancillary models, using either a linear function or a trainable CNN. Experiments show that the proposed framework outperforms previous semi-supervised models on the PASCAL VOC 2012 and Cityscapes datasets. Our framework could also be applied to instance segmentation tasks [21, 74, 72], but we leave further study of this for future work.

[21] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[74] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[72] Xiangyun Zhao, Shuang Liang, and Yichen Wei. Pseudo mask augmented object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.