1. Background

Semantic segmentation, which aims to classify every pixel in an image, has long been one of the core tasks in computer vision. In practical applications it is a reliable route to fine-grained recognition and image understanding, because its pixel-level accuracy can precisely localize the region an object occupies and exclude the influence of the background.

However, building a semantic segmentation dataset requires annotating every pixel of every image. According to [1], segmenting and labeling a single 1280*720 image takes about 1.5 hours, and the manpower and resources needed to label a dataset of tens or hundreds of thousands of images to a usable standard make the input-output ratio of real business projects extremely low.

To address this problem, weakly supervised semantic segmentation, which needs only image-level annotation to approach fully supervised segmentation quality, has become a hot topic in semantic segmentation research in recent years. The technique trains a classification model on far cheaper image-level labels, uses it to generate and refine seed segmentation regions for objects, and thereby realizes pixel-level dense prediction.

After in-depth research, the eShield algorithm team analyzed the characteristics of weakly supervised semantic segmentation in practice, verified its effectiveness, and successfully deployed the technique in a production project, achieving a significant improvement in project metrics and effectively supporting fine-grained recognition in the eShield content security service.

The rest of this paper introduces the categories and general pipeline of weakly supervised semantic segmentation, then briefly reviews several representative papers in this direction.

2. Basic Information

1. Classification

According to the form of the weak supervision signal, common weakly supervised semantic segmentation falls into the following four categories (Figure 1):

(1) Image-level annotation: only the categories of the objects present in the image are labeled; this is the simplest form of annotation;

(2) Point annotation: a single point is marked on each object, together with its category;

(3) Bounding-box annotation: the rectangle enclosing each object is marked, together with its category;

(4) Scribble annotation: a line is drawn across each object, together with its category.

Figure 1 Categories of weakly supervised semantic segmentation

This paper focuses on weakly supervised semantic segmentation based on image-level annotation, which is the cheapest form of annotation and therefore the most difficult setting.

      

2. Pipeline of weakly supervised semantic segmentation based on image-level annotation

Weakly supervised semantic segmentation based on image-level annotation mostly adopts a multi-stage pipeline, as shown in Figure 2 [2]:

 

Figure 2 Pipeline of weakly supervised semantic segmentation

First, a classification model is trained with single-label or multi-label classification using the image-level category labels. The classification model then computes the class activation map (CAM) [3] for each labeled category in the image, and the CAM serves as the pseudo-label seed region. Next, refinement algorithms (such as CRF [4] or AffinityNet [5]) optimize and expand the seed region into the final pixel-level segmentation pseudo-label. Finally, a conventional segmentation model (such as the Deeplab series [6]) is trained on the image dataset with these pseudo-labels.

3. Representative Work

This part introduces several typical papers on image-level weakly supervised segmentation. We first cover CAM [3], the foundation of weakly supervised segmentation, then two algorithms that produce better CAM seed regions for pseudo-labels (OAA [7] and SEAM [8]), and finally a typical seed-region refinement and expansion algorithm, AffinityNet [5].

1. CAM (Class Activation Mapping) [3]

This paper was presented by Bolei Zhou et al. at CVPR 2016. The authors found that even without localization labels, the intermediate layers of a trained CNN already localize the target, but this property is destroyed when the convolutional features are flattened into a vector and passed through successive fully connected layers. It can be preserved, however, by replacing the final fully connected layers with global average pooling (GAP) followed by a single fully connected layer and Softmax. A simple computation then recovers the class-discriminative region the CNN uses to decide that an image belongs to a given category, namely the CAM.

 

Figure 3 CAM

The CAM is computed as follows (see Figure 3):

Let f_k(x, y) be the value of the k-th feature map from the last convolutional layer at position (x, y), and w_k^c the k-th weight of the final fully connected layer for class c. The class activation map M_c for class c at position (x, y) is then:

M_c(x, y) = Σ_k w_k^c · f_k(x, y)

M_c is the CAM. The larger the value of the CAM, the larger that position's contribution to the classification: the red area in the heat map at the bottom of Figure 3 marks the maximum of the CAM, which coincides with the face of the Australian terrier.
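As an illustration, the weighted sum above can be sketched in a few lines of NumPy (the function name and array shapes are ours, not from the paper; min-max normalization is added for visualization):

```python
import numpy as np

def compute_cam(feature_maps, fc_weights, class_idx):
    """Class activation map: M_c(x, y) = sum_k w_k^c * f_k(x, y).

    feature_maps: (K, H, W) output of the last convolutional layer.
    fc_weights:   (C, K) weights of the single fully connected layer after GAP.
    class_idx:    index c of the target class.
    """
    w_c = fc_weights[class_idx]                    # (K,) weights for class c
    cam = np.tensordot(w_c, feature_maps, axes=1)  # weighted sum over K -> (H, W)
    cam -= cam.min()                               # min-max normalize to [0, 1]
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

Thresholding this normalized map (e.g. keeping values above 0.2 of the peak, as the paper does for localization) yields the class-discriminative region.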

In the paper, the authors show that the peaks of the CAM can be used directly as weakly supervised object localization predictions. Their experiments not only clearly outperform the best weakly supervised localization algorithms of the time, but also obtain the bounding box with a single forward pass.

CAM has remained the core algorithm for seed-region generation in weakly supervised semantic segmentation. Its drawback is equally obvious: it highlights only the most discriminative region and cannot cover the whole target. Most subsequent algorithms either address this problem or post-process the CAM. Several representative works are introduced next.

2. Obtaining better seed regions

(1) OAA [7]

The motivation of this paper is simple and direct. The authors observe that, at different training stages before convergence, the CAM produced by the model highlights different parts of the same target: B, C and D in Figure 4 correspond to different training stages, and the highlighted CAM regions shift between them. When the CAMs generated at different stages for the same image are integrated, the combined CAM covers the whole target much better, as shown in Figure 4-E.

 

Figure 4 Shift of the CAM during training

The algorithm is shown as follows:

 

Figure 5 Schematic diagram of the OAA algorithm

For a given category c in a single image, the CAM feature map F_t^c produced at training epoch t is passed through ReLU and normalized:

A_t^c = norm(ReLU(F_t^c))

Then a pixel-wise comparison with the attention map A_{t-1}^c saved at the previous epoch keeps the maximum value of each pixel, yielding the accumulated response map M_t:

M_t = max(M_{t-1}, A_t^c)

After training finishes, M_t is taken as the seed region of the weakly supervised segmentation pseudo-label.
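A minimal NumPy sketch of this accumulation step (function names are illustrative, not from the paper):

```python
import numpy as np

def normalize_cam(f):
    """ReLU then scale to [0, 1], as applied to the epoch-t CAM F_t^c."""
    a = np.maximum(f, 0.0)
    peak = a.max()
    return a / peak if peak > 0 else a

def accumulate(prev, current):
    """One OAA step: pixel-wise maximum of the stored map and the new CAM."""
    return np.maximum(prev, normalize_cam(current))
```

Starting from an all-zero map and calling `accumulate` once per epoch reproduces the accumulation described above; because the update is a running maximum, the accumulated map can only grow, which is also why noise, once admitted, is hard to remove.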

The OAA algorithm is simple and effective, and achieved SOTA performance on the VOC weakly supervised segmentation benchmark. In our actual project attempts, however, its results were mediocre: taking the pixel-wise maximum of CAMs over many epochs tends to accumulate noise in the seed region that is hard to remove by post-processing, noticeably reducing segmentation accuracy.

(2) SEAM [8]

Published as self-supervised learning was gaining popularity, this paper observes that the CAMs of the same image under different affine transformations are inconsistent (Figure 6-a shows the effect of a scale change). It introduces a consistency-regularization learning mechanism, in the spirit of self-supervised learning, that constrains the CAM to be equivariant to such transformations; reducing the inconsistency optimizes the CAM and yields high-precision seed regions (Figure 6-b).

 

Figure 6 CAMs of the same image at different input sizes

The SEAM algorithm is shown as follows:

 

Figure 7 Schematic diagram of the SEAM algorithm

The original image and the image after a simple affine transformation A(·) (such as downscaling or horizontal flipping) are fed into a shared-parameter CNN to obtain the CAM feature maps Y^o and Y^t. A learned Pixel Correlation Module (PCM) then produces the refined CAM feature maps Ŷ^o and Ŷ^t. The PCM is a self-attention-like operation whose query and key are the CNN feature maps and whose value is the CAM feature. The loss consists of three parts:

L = L_cls + L_ER + L_ECR

where L_cls is the standard multi-label soft margin loss, the equivariant regularization (ER) loss is

L_ER = ||A(Y^o) − Y^t||_1

and the equivariant cross regularization (ECR) loss is

L_ECR = ||A(Ŷ^o) − Y^t||_1 + ||Ŷ^t − A(Y^o)||_1

At inference time, the CAMs generated at multiple scales and from the horizontally flipped image are normalized and merged into the final CAM.
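The ER and ECR terms can be sketched with NumPy, using a horizontal flip as a stand-in for the affine transform A(·) and mean absolute error as the L1 term (array shapes and names are our assumptions):

```python
import numpy as np

def A(y):
    """Stand-in affine transform: horizontal flip of a (C, H, W) CAM map."""
    return y[..., ::-1]

def er_loss(y_o, y_t):
    """Equivariant regularization: L1 distance between A(Y^o) and Y^t."""
    return np.abs(A(y_o) - y_t).mean()

def ecr_loss(y_o, y_t, y_o_hat, y_t_hat):
    """Equivariant cross regularization between raw (Y) and PCM-refined
    (Y-hat) CAMs: each branch's refined map is pulled toward the other
    branch's raw map."""
    return np.abs(A(y_o_hat) - y_t).mean() + np.abs(y_t_hat - A(y_o)).mean()
```

The cross terms matter: comparing a refined map only against itself would let the PCM collapse to a trivial solution, whereas pairing each refined map with the other branch's raw CAM keeps both branches anchored to the classification signal.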

Compared with other algorithms, SEAM not only brings a larger improvement on public datasets but also performed better in eShield's actual project. Its drawback is that both training and inference are time-consuming.

3. CAM post-processing

Once the CAM seed region is obtained, it can be used directly as a semantic segmentation pseudo-label for training. To achieve better segmentation, however, the seed region is usually refined first. AffinityNet [5], a very effective refinement algorithm, is introduced next.

The main idea of the paper is as follows: for an image and its CAM, build an adjacency graph connecting each pixel to the pixels within a certain radius, then use AffinityNet to estimate the semantic affinity between the pixel pairs in the graph and form a probability transition matrix. For each category, a random walk driven by this transition matrix encourages points on the target to diffuse to positions of the same semantics up to the object edge, while penalizing diffusion into other categories. This semantic diffusion markedly improves the CAM's coverage of the whole target, producing more accurate and complete pseudo-labels.

The main difficulty is training the model without extra supervision. The authors observe that the CAM itself can supply AffinityNet's training signal: although a CAM suffers from incomplete coverage and noise, it still covers local regions correctly and thus confirms the semantic affinity between pixels within such a region, which is exactly what AffinityNet must learn. To obtain reliable labels, pixels with relatively low CAM scores are discarded and only confident foreground and background pixels are kept. Pixel pairs are sampled from these confident regions and labeled 1 if the two pixels belong to the same category and 0 otherwise, as shown in Figure 8.

 

Figure 8 Samples and labels for AffinityNet

During training, AffinityNet maps the image to a feature map f^aff, and the semantic affinity W_ij between pixels i and j is computed as:

W_ij = exp(−||f^aff(x_i, y_i) − f^aff(x_j, y_j)||_1)

where (x_i, y_i) is the coordinate of the i-th pixel on the feature map f^aff. The network is then trained with a cross-entropy loss against the pair labels.
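A minimal sketch of this affinity computation under the definitions above (the (x, y) indexing convention and channel-first layout are our assumptions):

```python
import numpy as np

def affinity(f_aff, i, j):
    """W_ij = exp(-||f_i - f_j||_1) for pixels i and j of feature map f_aff.

    f_aff: (C, H, W) feature map produced by AffinityNet.
    i, j:  (x, y) pixel coordinates on that map.
    """
    f_i = f_aff[:, i[1], i[0]]          # C-dim feature of pixel i
    f_j = f_aff[:, j[1], j[0]]          # C-dim feature of pixel j
    return np.exp(-np.abs(f_i - f_j).sum())
```

Identical features give W_ij = 1 and the affinity decays toward 0 as the features diverge, so W_ij behaves like a probability of semantic sameness, which is what makes the cross-entropy training objective well-posed.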

The overall training and inference process is shown in the figure below:

 

Figure 9 Schematic diagram of AffinityNet training and inference

First, the CAM of each training image is used to select pixel pairs as training samples, together with their semantic-affinity labels; these data train AffinityNet (left of Figure 9). The trained AffinityNet then runs inference on each image to produce the semantic affinity matrix of the image's adjacency graph, which serves as the probability transition matrix. Finally, a random walk with this matrix is applied to the image's CAM to obtain the final refined semantic segmentation pseudo-label (right of Figure 9).
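The random-walk refinement step can be sketched as repeated multiplication of the vectorized CAM by the row-normalized transition matrix (a simplified version: the paper additionally sharpens the affinity matrix with an exponent before normalizing):

```python
import numpy as np

def random_walk(cam, affinities, n_iters=10):
    """Diffuse a seed CAM over the pixel adjacency graph.

    cam:        (H, W) seed activation map for one class.
    affinities: (H*W, H*W) semantic affinity matrix W.
    """
    # Row-normalize W into a probability transition matrix T.
    t = affinities / affinities.sum(axis=1, keepdims=True)
    v = cam.reshape(-1)
    for _ in range(n_iters):
        v = t @ v                       # one random-walk step
    return v.reshape(cam.shape)
```

With a dense all-ones affinity matrix the activation spreads uniformly; with affinities that are high only inside an object, the walk expands the seed up to the object boundary while leaving other regions untouched.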

AffinityNet is a clearly motivated algorithm with reliable results. It is often used to post-process the CAMs produced by OAA or SEAM, improving pseudo-label accuracy and expanding their coverage, with obvious gains in both qualitative and quantitative evaluation.

4. Summary

This paper has briefly introduced the concept and pipeline of weakly supervised semantic segmentation, reviewed several representative papers, and analyzed the strengths and weaknesses of these algorithms from a practical perspective. Existing weakly supervised semantic segmentation still suffers from a cumbersome multi-stage pipeline; academia has proposed end-to-end solutions with encouraging results (e.g. [9][10]). Going forward, the eShield algorithm team will continue to follow the latest academic developments and attempt to deploy them, further improving the fine-grained recognition capability of the eShield content security service.

[1] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 3213-3223.

[2] Zhang D, Zhang H, Tang J, et al. Causal intervention for weakly-supervised semantic segmentation[J]. Advances in Neural Information Processing Systems, 2020, 33: 655-666.

[3] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2921-2929.

[4] Krähenbühl P, Koltun V. Efficient inference in fully connected crfs with gaussian edge potentials[J]. Advances in neural information processing systems, 2011, 24.

[5] Ahn J, Kwak S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 4981-4990.

[6] Chen L C, Papandreou G, Kokkinos I, et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 40(4): 834-848.

[7] Jiang P T, Hou Q, Cao Y, et al. Integral object mining via online attention accumulation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2070-2079.

[8] Wang Y, Zhang J, Kan M, et al. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12275-12284.

[9] Zhang B, Xiao J, Wei Y, et al. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 12765-12772.

[10] Araslanov N, Roth S. Single-stage semantic segmentation from image labels[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4253-4262.