Active learning

The data annotation process is framed as an interaction between a learning algorithm and a human annotator: the algorithm selects the training samples it considers most valuable, and the annotator labels them. In other words, the machine learning model identifies the samples that are "hard" to classify, a human confirms or corrects their labels, and the newly labelled data is used to retrain the model in a supervised or semi-supervised way. The model improves gradually, and human experience is folded into it. Take image classification as an example: first, part of the data is manually annotated and an initial model is trained; then the trained model predicts labels for the remaining unannotated data; the labels of the "difficult" samples are corrected manually, added to the training set, and the model is fine-tuned again. The query strategy is one of the core components of active learning; the most common ones are uncertainty-based and diversity-based sample query strategies.
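The loop described above can be summarized in a few lines of code. The sketch below is a minimal pool-based example: the classifier (scikit-learn's LogisticRegression), the query batch size, and the `label_fn` oracle that stands in for the human annotator are all illustrative assumptions, not part of any specific system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, label_fn,
                         n_rounds=5, batch_size=10):
    """label_fn stands in for the human annotator: given queried samples,
    it returns their labels."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_pool)
        # Least-confidence query: pick samples whose top class probability is lowest.
        uncertainty = 1.0 - probs.max(axis=1)
        query_idx = np.argsort(-uncertainty)[:batch_size]
        # The "annotator" labels the queried samples; they move from the pool
        # into the training set, and the model is retrained in the next round.
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, label_fn(X_pool[query_idx])])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model
```

In practice, the least-confidence score used here would be swapped for whichever query strategy fits the task; the common options are listed next.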


Query methods

  • Uncertainty sampling
  • Query-by-committee
  • Expected model change
  • Expected error reduction
  • Variance reduction
  • Density-weighted methods

Uncertainty sampling

Least confidence: select the samples for which the model's most probable class has the lowest probability. For example, a prediction of (0.9, 0.1) is more confident than (0.51, 0.49), so the latter would be queried.

Margin sampling: select the samples with the smallest difference between the largest and second-largest predicted probabilities. In binary classification this is equivalent to least confidence.

Entropy: select the samples whose predicted class distribution has the highest entropy, i.e. the flattest distribution. A sketch of all three measures follows.
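The three uncertainty measures above can be written directly in terms of the model's predicted class probabilities. The sketch below assumes a `probs` array of softmax outputs; in each case, the samples with the largest scores are the ones sent for annotation.

```python
import numpy as np

def least_confidence(probs):
    # Higher score = more uncertain: the top class probability is low.
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Smaller gap between the two most likely classes = more uncertain.
    sorted_probs = np.sort(probs, axis=1)
    return -(sorted_probs[:, -1] - sorted_probs[:, -2])   # negated so larger = more uncertain

def entropy(probs, eps=1e-12):
    # Higher entropy = flatter predicted distribution = more uncertain.
    return -np.sum(probs * np.log(probs + eps), axis=1)

# probs: (n_samples, n_classes) softmax outputs of the current model.
```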

Query-by-committee

Train several models and select the samples that the committee finds "difficult", i.e. the ones the models disagree on most when they vote.
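As a rough illustration, committee disagreement can be measured by the vote entropy over the members' hard predictions; the function below is a minimal sketch under that assumption, with the committee predictions stored as an array of class labels.

```python
import numpy as np

def vote_entropy(committee_preds, n_classes):
    """committee_preds: (n_models, n_samples) array of hard class predictions.
    Returns a per-sample disagreement score; higher = more disagreement."""
    _, n_samples = committee_preds.shape
    scores = np.zeros(n_samples)
    for c in range(n_classes):
        # Fraction of committee members voting for class c on each sample.
        vote_frac = (committee_preds == c).mean(axis=0)
        scores -= vote_frac * np.log(vote_frac + 1e-12)
    return scores
```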

Conclusion

In active learning, the key question is how to select a suitable candidate set for manual annotation; the selection method is the query strategy. A query strategy can be built on a single machine learning model or on several, depending on the situation.

Interactive annotation in semantic segmentation

1, Deep Interactive Object Selection

This is a CVPR 2016 paper that proposes an interactive 2D image segmentation method based on a convolutional neural network (CNN). The principle is shown in the figure below: the user's interaction consists of a few clicks in the foreground and background (green points are foreground, red points are background), and these clicks are converted into a foreground distance map and a background distance map. The original image has three RGB channels; adding the two distance maps gives five channels in total, which are fed into an FCN to obtain the segmentation result. During segmentation, the user therefore only needs to provide a few points to guide the FCN.
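As a rough sketch of this input construction (not the authors' code), the two distance maps can be computed with a Euclidean distance transform and stacked onto the RGB channels; the clipping value and the click format are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_five_channel_input(rgb, fg_clicks, bg_clicks, clip_value=255):
    """rgb: (H, W, 3) image; fg_clicks/bg_clicks: lists of (row, col) user clicks."""
    h, w = rgb.shape[:2]

    def distance_map(clicks):
        if len(clicks) == 0:
            # No clicks of this type: treat every pixel as maximally far away.
            return np.full((h, w), float(clip_value), dtype=np.float32)
        mask = np.ones((h, w), dtype=bool)      # True = not a click
        for r, c in clicks:
            mask[r, c] = False
        # Euclidean distance of every pixel to its nearest click, clipped so that
        # far-away pixels saturate.
        return np.minimum(distance_transform_edt(mask), clip_value).astype(np.float32)

    fg_dist = distance_map(fg_clicks)
    bg_dist = distance_map(bg_clicks)
    return np.concatenate([rgb.astype(np.float32),
                           fg_dist[..., None],
                           bg_dist[..., None]], axis=-1)    # (H, W, 5)
```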

A probability map is obtained from the output of the FCN, and graph cut is then used to refine it so that the segmentation result adheres more closely to image edges. The figure below shows an example, where (a) is the input image with the user-provided foreground and background points, (b) is the probability map produced by the FCN, and (c) is the result after graph cut optimization.

The paper uses a pre-trained FCN fine-tuned on the PASCAL VOC 2012 segmentation dataset. During training, the foreground and background points are obtained by simulating user interaction, so no one has to click on the training images. At test time, users can add points as needed to edit the segmentation result. Below is an example in which the user provides one to three points: the first row shows the result of thresholding the probability map, and the second row shows the result after refining the probability map with graph cut.

2, DeepIGeoS: A deep interactive geodesic framework for medical image segmentation

DeepIGeoS is a recent TPAMI article. The algorithm targets 2D and 3D medical image segmentation. Unlike Deep Interactive Object Selection, which requires the user to provide interaction points from the start, DeepIGeoS only asks the user to interact on the mis-segmented regions of an automatic segmentation result, which saves time and is more efficient.

The flow chart of DeepIGeoS

DeepIGeoS uses two CNNs, as shown above. The first CNN (called P-Net) produces an automatic segmentation result; on top of it, the user provides clicks or short scribbles to mark mis-segmented regions, and these are fed to the second CNN (called R-Net) to obtain the refined result.

DeepIGeoS also converts user interactions into distance maps that are used as CNN input, but it uses the geodesic distance, as shown below. In (a), the green curve is the initial segmentation result, the red dots mark foreground during the user's correction, and the cyan dots mark background. (d) and (e) are the geodesic distance maps corresponding to these two kinds of interaction. Geodesic distance reflects image edges and context better than Euclidean distance. The original image, the initial segmentation result, the foreground-interaction distance map and the background-interaction distance map are stacked into a four-channel input for the second CNN (R-Net).
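A sketch of assembling this four-channel R-Net input is given below; `geodesic_distance_map` is a hypothetical helper standing in for an actual geodesic distance transform, which is the part the paper emphasises.

```python
import numpy as np

def build_rnet_input(image, initial_seg, fg_scribbles, bg_scribbles,
                     geodesic_distance_map):
    """image, initial_seg, fg_scribbles, bg_scribbles: (H, W) arrays for one 2D slice.
    geodesic_distance_map(image, seed_mask) is a hypothetical helper returning the
    geodesic distance of every pixel to the seed pixels."""
    fg_geo = geodesic_distance_map(image, fg_scribbles)   # distance to foreground scribbles
    bg_geo = geodesic_distance_map(image, bg_scribbles)   # distance to background scribbles
    # Stack the four channels described in the text: image, initial result,
    # foreground-interaction distance map, background-interaction distance map.
    return np.stack([image, initial_seg, fg_geo, bg_geo], axis=0)   # (4, H, W)
```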

DeepIGeoS trains R-Net with interaction points simulated during training. Below are simulated user interactions on the initial segmentation results produced by P-Net:

The paper also proposes a network structure (P-Net) in which the resolution of the feature maps stays unchanged throughout the convolutions. Because an FCN downsamples and then upsamples, some details are lost in its output; P-Net uses no downsampling or upsampling layers. To keep the receptive field large enough, it replaces pooling with dilated (atrous) convolutions, similar to DeepLab-v3.
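The sketch below (PyTorch) illustrates the idea of such a resolution-preserving block: stacked dilated 3×3 convolutions with padding equal to the dilation rate keep the spatial size fixed while enlarging the receptive field. The channel counts and dilation rates are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Resolution-preserving convolutional block in the spirit of P-Net."""
    def __init__(self, in_ch=1, mid_ch=16, n_classes=2):
        super().__init__()
        layers = []
        ch = in_ch
        for dilation in (1, 2, 4, 8):
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels.
            layers += [nn.Conv2d(ch, mid_ch, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(mid_ch, n_classes, kernel_size=1)

    def forward(self, x):                               # x: (N, in_ch, H, W)
        return self.classifier(self.features(x))        # (N, n_classes, H, W), same H, W
```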

The paper runs experiments on 2D fetal MRI images and 3D brain tumour images. The results show that, compared with traditional interactive segmentation methods such as Graph Cuts, Random Walks and ITK-SNAP, DeepIGeoS greatly reduces the amount of user interaction and the user time required. Interactive segmentation takes about 8 seconds for a 2D placenta image and about 60 seconds for a 3D brain tumour image.

3, Interactive medical image segmentation using deep learning with image-specific fine-tuning

This is an article published in TMI this year. The method is also called BIFSeg (Bounding box and Image-Specific Fine-tuning based Segmentation). BIFSeg performs segmentation in a way similar to GrabCut: the user first draws a bounding box, the region inside the box is fed to a CNN to obtain an initial result, and then image-specific fine-tuning is applied to the CNN so that it adapts to the particular test image and improves the segmentation. Where GrabCut segments an image by learning a Gaussian mixture model (GMM) from that specific image, BIFSeg learns a CNN from the specific image.
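As a hedged sketch (not the authors' implementation), the test-time workflow might look like the following: crop the user's bounding box, get an initial prediction, fine-tune the network on that single image, and predict again. The optimiser, step count and `fine_tune_loss` are placeholders; a loss in the spirit of the weighted loss described further below could be plugged in.

```python
import torch

def bifseg_style_segment(model, image, bbox, fine_tune_loss, steps=20, lr=1e-3):
    """image: (1, C, H, W) tensor; bbox: (r0, c0, r1, c1) user-drawn bounding box."""
    r0, c0, r1, c1 = bbox
    crop = image[:, :, r0:r1, c0:c1]                  # region inside the bounding box
    with torch.no_grad():
        initial = model(crop).argmax(dim=1)           # initial segmentation of the crop
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):                            # image-specific fine-tuning
        optimiser.zero_grad()
        logits = model(crop)
        loss = fine_tune_loss(logits, initial)        # placeholder for a weighted loss
        loss.backward()
        optimiser.step()
    with torch.no_grad():
        return model(crop).argmax(dim=1)              # refined segmentation of the crop
```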

CNN-based segmentation methods can usually only handle object classes that appeared in the training set, which limits their flexibility. BIFSeg tries to use a CNN to segment objects that were never seen during training. As shown above, the training set includes the placenta and the fetal brain, while the test set additionally includes fetal lungs and maternal kidneys. The training process effectively teaches BIFSeg to extract the foreground object inside a bounding box. At test time, adaptive fine-tuning lets the CNN better exploit the information in the specific image; the fine-tuning can be automatic (unsupervised) or guided by user interaction (supervised).

Image-specific fine-tuning of BIFSeg

Above is an example of unsupervised image-specific fine-tuning. The fetal lungs were never seen in the training set, yet the CNN obtains a good initial segmentation, and in this case unsupervised fine-tuning improves the result further.

This figure shows an example of fine-tuning guided by user interaction. The training set contained only brain tumours in FLAIR images (including the oedema), while the test set additionally contained the tumour core region in T1 images (excluding the oedema). Above is an example of segmenting a tumour core in a T1 image at test time. The image-specific fine-tuning in BIFSeg alternately optimizes the segmentation result and the CNN parameters. When updating the CNN parameters, a weighted loss function ignores the pixels that are likely to be segmented inaccurately, so that only high-reliability pixels are used to optimize the CNN.
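One way to realise such a reliability-weighted loss is sketched below: pixels whose current prediction is not confident are masked out of the cross-entropy, while user scribbles, if present, are always trusted. The confidence threshold and the exact weighting are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def reliability_masked_loss(logits, pseudo_labels, conf_threshold=0.9,
                            scribble_labels=None):
    """logits: (N, C, H, W); pseudo_labels: (N, H, W) integer labels from the current
    segmentation; scribble_labels: optional (N, H, W) with -1 where the user gave no input."""
    probs = F.softmax(logits, dim=1)
    confidence, _ = probs.max(dim=1)                      # (N, H, W) top-class probability
    targets = pseudo_labels.clone()
    weights = (confidence > conf_threshold).float()       # keep only high-reliability pixels
    if scribble_labels is not None:
        user_px = scribble_labels >= 0
        targets[user_px] = scribble_labels[user_px]       # trust user input wherever it exists
        weights[user_px] = 1.0
    per_pixel = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_pixel).sum() / weights.sum().clamp(min=1.0)
```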

4, Guide Me: Interacting with Deep Networks

While the previous articles deal with interactions in which the user places initial or additional marks on the image, the Guide Me paper addresses another type of interaction: updating the image segmentation result based on the user's text input.

The schematic of the method is shown below. The CNN is viewed as a head (encoder) and a tail (decoder) with a feature map in between. During segmentation, user interaction is used to modify this intermediate feature map, and thereby the output of the network.

The feature map is modified through a set of guide parameters, consisting of multiplicative coefficients and bias coefficients:

$\hat{h}_c = \gamma_c \, h_c + \beta_c$

Here $h_c$ denotes the c-th channel of the feature map, and $\gamma_c$ and $\beta_c$ are the multiplicative and bias coefficients. Defining two parameters per channel keeps the number of parameters small, but it ignores spatial information, so the parameters can also be defined per spatial location:

$\hat{h}_{c,i,j} = (\gamma_c + \gamma^{r}_{i} + \gamma^{s}_{j}) \, h_{c,i,j} + \beta_c$

where $\gamma^{r}_{i}$ and $\gamma^{s}_{j}$ are coefficients for each row and each column. For a feature map of height H, width W and C channels, this gives H + W + C multiplicative coefficients, so the total number of guide parameters remains small.
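A minimal sketch of such a guiding block is given below (PyTorch), following the description above: per-channel, per-row and per-column multiplicative coefficients plus a per-channel bias are applied to the intermediate feature map, and only these parameters would be optimised at interaction time. The exact parameterisation in the paper may differ.

```python
import torch
import torch.nn as nn

class GuidingBlock(nn.Module):
    """Applies guide parameters (scale and bias) to an intermediate feature map."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.gamma_c = nn.Parameter(torch.ones(1, channels, 1, 1))   # per-channel scale
        self.gamma_r = nn.Parameter(torch.zeros(1, 1, height, 1))    # per-row scale offset
        self.gamma_s = nn.Parameter(torch.zeros(1, 1, 1, width))     # per-column scale offset
        self.beta_c = nn.Parameter(torch.zeros(1, channels, 1, 1))   # per-channel bias

    def forward(self, feat):                                  # feat: (N, C, H, W)
        scale = self.gamma_c + self.gamma_r + self.gamma_s    # broadcasts to (1, C, H, W)
        return scale * feat + self.beta_c
```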

When the user adds interactive feedback to a segmentation result, the network weights are not updated; only the guide parameters are updated to obtain the new segmentation. How are the guide parameters updated?

The paper proposes two methods. In the first, the system actively asks the user whether a certain region is, say, sky, and the user answers yes or no. Based on the answer, the region is assigned the corresponding class, and a locally optimal value of the guide parameters is found by back-propagation. In the second, the user gives text feedback directly, such as "the sky is not visible in this image", and an RNN converts the text into guide parameters. This again requires simulating user input while training the RNN, so that at test time the network can process the text the user provides.

The process of simulating user input is shown in the figure above. An initial segmentation result is compared with the gold standard to identify mis-segmented regions, and text describing those regions is generated as the simulated user input.

In the first interaction experiment, on the PASCAL VOC 2012 dataset, having the system ask the user 20 questions raises the segmentation mIoU from 62.6% to 81.0%. In the second experiment, on the COCO-Stuff dataset, interactive correction raises the mIoU from 30.5% to 36.5%, compared with DeepLab's 30.8% mIoU on this dataset.

5, Polygon-RNN

Polygon-RNN is a method for annotating object boundaries in 2D images. Its workflow is shown in the figure below: the user provides a bounding box around an object of interest, and the method uses an RNN to predict a sequence of vertices along the object boundary inside the box.

The vertices are predicted in a fixed order, e.g. clockwise. Assuming there are T vertices, a CNN predicts the first vertex, and the RNN then predicts the t-th vertex from the first vertex together with the (t-1)-th and (t-2)-th vertices. The RNN model is shown in the figure below (a VGG network is used as the feature extractor):

Vertex prediction is implemented as a classification task: the paper divides the image into a 28×28 grid and classifies each cell as containing a vertex or not (see the sketch below). On a test image, the user can correct the prediction of the t-th vertex, after which the subsequent vertices are re-predicted to obtain an updated segmentation result.
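For illustration, the sketch below shows how a single vertex coordinate could be converted into such a 28×28 grid classification target; the grid size comes from the text, while the coordinate convention (x = column, y = row) is an assumption.

```python
import numpy as np

def vertex_to_grid_target(x, y, img_w, img_h, grid=28):
    """Map an image-space vertex (x, y) to a one-hot grid-cell target."""
    col = min(int(x / img_w * grid), grid - 1)
    row = min(int(y / img_h * grid), grid - 1)
    target = np.zeros((grid, grid), dtype=np.float32)
    target[row, col] = 1.0
    return target   # flattened, this is the classification target for one vertex
```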

As the user edits more vertices, the segmentation becomes more accurate. The method is roughly five times more efficient than having the user draw the object's vertices entirely by hand. Because the vertex prediction grid has a low resolution (28×28), the achievable segmentation accuracy is limited.

6, Polygon-RNN++

Polygon-RNN++ makes a number of improvements to Polygon-RNN, so that the polygon vertices predicted on object boundaries are more accurate and of higher resolution. In terms of network structure, the VGG network is replaced with a ResNet-50-based network. Polygon-RNN uses two separate networks to predict the first vertex and the subsequent vertices; Polygon-RNN++ proposes a unified framework that combines the two predictions and can be trained jointly. It also uses an attention mechanism so that the prediction of the next vertex focuses on the region near the previous one.

Polygon-RNN++ also changes the RNN training substantially: it uses reinforcement learning over the whole predicted vertex sequence, so that the predictions overlap better with the gold standard. Polygon-RNN only uses cross-entropy as the loss function, which differs considerably from the intersection over union (IoU) used for evaluation.
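For reference, the overlap measure in question is the standard binary-mask IoU, which can be computed as in the small sketch below once the predicted polygon has been rasterised into a mask.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks (e.g. rasterised polygons)."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0                      # both masks empty: define IoU as 1
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return inter / union
```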

To address the low resolution of Polygon-RNN's segmentation results, Polygon-RNN++ adds a graph neural network that upscales and refines the polygons produced by the RNN.

Polygon-RNN++ runs experiments on Cityscapes, KITTI, ADE and other datasets; training is done on Cityscapes, and the other datasets are used as cross-domain or out-of-domain data to verify the generality of the method. The results show that good performance can be obtained on datasets not seen during training.