Abstract: Semantic segmentation data sets are relatively large, so training requires very strong hardware support.

This article is shared from the Huawei Cloud Community post “[Cloud Co-creation] Semantic Segmentation Algorithms Based on Transfer Learning”, by the original author Qiming.

This article shares two papers on semantic segmentation algorithms based on transfer learning: 1. Learning to Adapt Structured Output Space for Semantic Segmentation; 2. Advent: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation.

Part 1: Background on transfer-based segmentation

Semantic segmentation, detection, and classification are the three main directions in the field of machine vision. However, compared with detection and classification, semantic segmentation faces two very difficult problems:

One is the lack of data sets. A classification label is a single category per image, and a detection label is a bounding box, but segmentation aims to make semantic-level predictions, which means its annotations must also be pixel level. As we know, annotating a data set at pixel level takes time and effort. Take Cityscapes in autonomous driving as an example: annotating a single image in this data set takes about 1.5 hours. At this rate, building a semantic segmentation data set ourselves would be very costly.

The other problem is that a semantic segmentation data set also needs to cover real-world conditions. In reality it is difficult to cover them all: different weather, different places, different styles of architecture. This is the second problem of semantic segmentation.

In the face of the above two situations, how do researchers solve the two problems of semantic segmentation?

Beyond building data sets directly, they found that they could use techniques such as computer graphics to synthesize simulated data sets in place of real-world data sets, thereby reducing the cost of labeling.

Take the familiar GTA5 game as an example: researchers collect simulated data from the GTA5 game and then cut the annotation cost by using the labels that the game engine produces naturally during rendering. But there is a problem: models trained on such simulated data suffer performance degradation in the real world. Traditional machine learning rests on a premise that the test set and the training set come from the same distribution, and a simulated data set and a real data set will inevitably differ in distribution.

Therefore, our goal is to use transfer (domain adaptation) algorithms to solve the performance degradation that a model trained on the source domain suffers on the target domain.

Main contributions and related work of the two papers

The main contributions

Part I Learning to Adapt Structured Output Space for Semantic Segmentation

1. A transfer segmentation algorithm based on adversarial learning is proposed;

2. It is verified that adversarial training in the output space can effectively align the scene layout and context information of the two domains;

3. The transfer performance of the model is further improved by adversarial training on the outputs of multiple feature levels.

Part II Advent: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation

1. An entropy-based loss function is used to prevent the network from making low-confidence predictions on the target domain;

2. An entropy-based adversarial learning method is proposed, which simultaneously considers entropy reduction and the structural alignment of the two domains;

3. A constraint method based on a prior over the category distribution is proposed.

Related work

Before explaining the two papers, let’s briefly introduce another article: FCNs in the Wild. This was the first paper to apply transfer learning to semantic segmentation. It proposes feeding the features produced by the segmentation network’s feature extractor into a discriminator, and then completing the transfer of the segmentation task by aligning global information.

First, let’s introduce what a common semantic segmentation network looks like. It generally consists of two parts. One is the feature extractor, such as the ResNet or VGG series, which extracts features from the picture. The other is the classifier, which takes the features extracted in front; common choices are PSP or, most common in DA segmentation, the ASPP of DeepLab V2. The whole DA is completed by feeding the extracted features into a discriminator.
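
To make this structure concrete, here is a minimal PyTorch sketch of the two-part design described above. It is only an illustration: the simplified ASPP head and the class names are our assumptions, not the papers’ exact code.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ASPPHead(nn.Module):
        """Simplified ASPP-style classifier: parallel dilated convolutions, summed."""
        def __init__(self, in_channels=2048, num_classes=19):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=d, dilation=d)
                for d in (6, 12, 18, 24)
            ])

        def forward(self, feat):
            return sum(branch(feat) for branch in self.branches)

    class SegNet(nn.Module):
        """Feature extractor (ResNet backbone) + classifier (ASPP-style head)."""
        def __init__(self, num_classes=19):
            super().__init__()
            resnet = models.resnet101(weights=None)
            # Keep everything up to the last residual stage; drop avgpool and fc.
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])
            self.classifier = ASPPHead(2048, num_classes)

        def forward(self, x):
            feat = self.backbone(x)          # high-dimensional features
            return self.classifier(feat)     # per-pixel class scores (logits)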

Why can feeding a feature into a discriminator accomplish DA? We can understand this in terms of the role of the discriminator.

In a GAN, the discriminator is trained to tell a real picture from a generated one. By analogy, here a discriminator is needed to distinguish whether an input feature comes from the source domain or the target domain. After obtaining a discriminator that can make this distinction, its parameters are fixed, and the feature extractor of the segmentation network is trained. How? By making the feature extractor confuse the discriminator.

So how does the feature extractor confuse the discriminator? Whether it extracts features from the source domain or the target domain, it should align the distributions of the two. The features of the two domains then look equivalent, the discriminator can no longer tell them apart, and the task of “confusion” is complete. Once it is complete, it indicates that the feature extractor has learned to extract “domain-invariant” information.

Extracting “domain-invariant” information is, in effect, the transfer process. Because the network can extract “domain-invariant” information, it can extract very good features from both the source domain and the target domain.
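
Below is a minimal sketch of this alternating game at the feature level, in the spirit of FCNs in the Wild. It assumes the SegNet from the sketch above plus a small convolutional discriminator netD; the loss weight 0.001 and all names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    bce = torch.nn.BCEWithLogitsLoss()

    def train_step(seg_net, netD, opt_seg, opt_d, x_src, y_src, x_tgt):
        # --- Step 1: update the segmentation network ---
        opt_seg.zero_grad()
        feat_src = seg_net.backbone(x_src)
        feat_tgt = seg_net.backbone(x_tgt)
        logits_src = F.interpolate(seg_net.classifier(feat_src),
                                   size=y_src.shape[-2:], mode="bilinear",
                                   align_corners=False)
        seg_loss = F.cross_entropy(logits_src, y_src)   # supervised loss on the source domain
        # "Confuse" the discriminator: label target features as if they were source.
        d_tgt = netD(feat_tgt)
        adv_loss = bce(d_tgt, torch.ones_like(d_tgt))
        (seg_loss + 0.001 * adv_loss).backward()
        opt_seg.step()

        # --- Step 2: update the discriminator (features detached) ---
        opt_d.zero_grad()
        d_src = netD(feat_src.detach())
        d_tgt = netD(feat_tgt.detach())
        d_loss = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))
        d_loss.backward()
        opt_d.step()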

The following two papers are both based on this idea of adversarial training with a discriminator; the difference lies in what information is fed into the discriminator, which will be described in detail below.

Analysis of the first paper’s algorithm and model

Title: Learning to Adapt Structured Output Space for Semantic Segmentation

Like FCNs in the Wild introduced in the related work above, the model in this paper is composed of a segmentation network and a discriminator. As can be seen from the picture above (or from the title), what it adds is the output space. So what is the output space?

The output space here is the probability map obtained by passing the output of the semantic segmentation network through Softmax. We call this probability map the output space.
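
In code this is a one-liner; the sketch below reuses the SegNet from earlier (seg_net and x are placeholders):

    import torch

    logits = seg_net(x)                    # [B, C, H, W] raw class scores
    prob = torch.softmax(logits, dim=1)    # the "output space": per-pixel class probabilities
    # For every pixel (i, j), prob[b, :, i, j] is a C-dimensional vector that sums to 1.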

The author of this paper thinks that directly using the feature for the adversarial game is not ideal; it is better to use the output-space probability instead. Why? Originally, as in classification, everyone used features, but segmentation is different. The high-dimensional feature of a segmentation network, the feature just before the classifier, is a very long vector; for example, the last layer of ResNet101 outputs 2048-dimensional features. Such a high-dimensional feature of course encodes more complex information, but for semantic segmentation this complex information may not be useful. That is the author’s first point.

The author’s second point is that although the output of semantic segmentation is low-dimensional, with only as many dimensions as there are categories (if the number of categories is C, the probability at each pixel is a C×1 vector), the output over an entire image actually contains rich information such as scene, layout, and context. The author believes that whether an image comes from the source domain or the target domain, the segmentation results should have very strong spatial similarity, because simulated data and real data address the same segmentation task. As shown in the figure above, both the source and target domains are designed for autonomous driving. One obvious point: most of the middle is likely to be road, the top is generally sky, and the left and right may be buildings. The scene distributions are very similar, so the author believes that directly using the low-dimensional probability, namely the Softmax output, for the adversarial game can achieve a very good effect.

Based on the above two insights, the author designs the model to feed the probability directly into the discriminator. The training process is actually the same as that of a GAN, except that instead of passing the feature into the discriminator, the final output probability is passed in.
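
Compared with the feature-level sketch shown earlier, the only change on the segmentation side is what gets handed to the discriminator. A hedged sketch of the adversarial term (reusing seg_net, netD, and bce from above):

    # Output-space adversarial term (sketch): the discriminator now sees
    # softmax probability maps instead of backbone features.
    prob_tgt = torch.softmax(seg_net(x_tgt), dim=1)
    d_out = netD(prob_tgt)                            # netD must accept C input channels here
    adv_loss = bce(d_out, torch.ones_like(d_out))     # push target outputs to look "source-like"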

Going back to the top picture, you can see the two discriminators on the left; they are multi-level. Recall the semantic segmentation network we started with: there is a feature extractor and a classifier, and the input to the classifier is the feature produced by the feature extractor.

As you know, ResNet has several stages. Based on this fact, the author proposes adversarial training in the output space at both the last level and the penultimate level. That is, the two feature maps are each sent to a classifier, the classifier results are taken out, and each is input to a Discriminator for the adversarial game.

To sum up, the algorithmic innovations of this paper are as follows:

1. Adversarial training in the output space, making use of the structural information in the network’s predictions;

2. The model is further improved by adversarial training on outputs at multiple levels.

So what do the results look like?

The figure above shows the experimental results on the GTA5-to-Cityscapes task. The first row, the baseline (ResNet), trains a model on the source domain and then tests it on the target domain. The second row is the result of adversarial training in the feature dimension, 39.3; although it improves on the Source Only model, it is relatively low compared with the two output-space models below. In the single-level model, features are extracted from the last layer of ResNet and input to the classifier to generate the result used for the adversarial game. The multi-level model runs the adversarial game on both of the last two feature levels of ResNet, and its result, as can be seen, is better still.

Analysis of the second paper’s algorithm and model

Advent: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation

Next, we will talk about the second paper: a transfer segmentation method based on entropy minimization and entropy-based adversarial training.

To understand this paper, we first need the concept of entropy. The author uses the Shannon information entropy, −Σ p·log p: each predicted probability multiplied by the logarithm of that probability, summed and negated.

Semantic segmentation predicts every pixel of a picture. For each pixel the network outputs a C×1 vector, where C is the number of possible categories, so we take the probability of each category times the log of that probability. Summing these terms over the C categories (and negating) gives the entropy of that pixel. For an image, you then sum over its height and width, i.e., over the entropy of every pixel.
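
A sketch of this computation in PyTorch; the eps term for numerical stability is an implementation detail we add, not something from the paper:

    import torch

    def entropy_map(prob, eps=1e-12):
        """prob: [B, C, H, W] softmax probabilities -> per-pixel entropy [B, H, W]."""
        return -(prob * torch.log(prob + eps)).sum(dim=1)   # sum -p*log(p) over the C classes

    # Entropy of a whole image: sum the per-pixel entropies over height and width.
    image_entropy = entropy_map(prob).sum(dim=(1, 2))       # shape [B]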

By observing the figure above, the author notices a phenomenon in the distribution of entropy in the source-domain segmentation maps: the entropy is high only at the edges between categories (the darker the color, the lower the entropy; the lighter the color, the higher the entropy). For an image in the target domain, however, the predicted result has a lot of light parts across the whole image. The author therefore believes the gap between the source domain and the target domain can be reduced by reducing the entropy of the target domain, since the target predictions contain too many useless high-entropy regions (caused by noise from the domain shift).

So how do we reduce the entropy of the target domain? The author proposes two methods, which are also the algorithmic innovations of this paper:

1. An entropy-based loss function is used to prevent the network from making low-confidence predictions on the target domain;

2. An entropy-based adversarial learning approach is proposed, which takes into account both entropy reduction and the structural alignment of the two domains.

Direct entropy minimization computes the overall entropy of an image and optimizes it directly through gradient backpropagation. However, the author believes that reducing the entropy directly ignores a lot of information, such as the semantic structure of the image itself. Therefore, borrowing the output-space adversarial method from the first paper, the author proposes reducing entropy through adversarial learning.
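
Direct entropy minimization simply turns the quantity above into a training loss; a minimal sketch reusing entropy_map from earlier (lambda_ent and seg_loss_src are assumed names for the loss weight and the supervised source loss):

    # Direct entropy minimization (MinEnt) on the target domain (sketch).
    prob_tgt = torch.softmax(seg_net(x_tgt), dim=1)
    ent_loss = entropy_map(prob_tgt).mean()              # average per-pixel entropy
    loss = seg_loss_src + lambda_ent * ent_loss          # joint with the supervised source loss
    loss.backward()
    opt_seg.step()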

As can be seen from the picture above, the entropy of the source domain is very low. The author therefore reasons that if a Discriminator is used to distinguish the source domain from the target domain, so that the entropy maps finally output for the two domains become very similar, the entropy of the target domain will be reduced. The procedure is just like the first paper’s, except that the first paper puts the probability directly into the discriminator, while this one puts the entropy maps in.

This paper thus considers the entropy-reduction process on the one hand and applies structural information on the other. The experimental results show that on GTA5-to-Cityscapes, direct entropy minimization is greatly improved over FCNs in the Wild and over the output-space method, and the entropy-adversarial variant is a little better still.

In addition, the author finds that if you combine the two models, direct entropy minimization and entropy-adversarial training, by summing their predicted probabilities and taking the maximum, the result is a few points higher still. In semantic segmentation tasks, such an improvement is very significant.

Code reproduction

Now let’s get into the code reproduction.

The principle of reproducing a paper is to stay consistent with the specific methods, parameters, data augmentation, and so on described in the paper.

For this reproduction, we first searched for open-source code on GitHub. Building on it, we implemented both papers in the same PyTorch-based framework. If you can read the code of one paper, the code of the other is quite easy to understand.

Below are two QR codes for the code of the two papers; you can scan them to have a look.

ModelArts overview

The reproductions of both papers run on Huawei Cloud’s ModelArts, so let’s start with a brief introduction to it.

ModelArts is a one-stop AI development platform for developers. It provides massive data preprocessing and semi-automated annotation, large-scale distributed training, automated model generation, and on-demand device-edge-cloud model deployment for machine learning and deep learning, helping users quickly create and deploy models and manage the full-cycle AI workflow. It has the following core functions:

Data management, which can save up to 80% of the cost of manual data processing: it covers four categories of data formats (image, sound, text, and video) and nine kinds of annotation tools, while providing intelligent annotation and team annotation to greatly improve annotation efficiency; it supports data cleaning, data augmentation, data verification, and other common data processing capabilities; and its flexible visual management of multiple data set versions supports import and export, so data sets can easily be used in ModelArts model development and training.

Development management, which lets you use a local development environment (IDE) to connect to cloud services: in addition to developing in the cloud through the web interface (management console), ModelArts provides a Python SDK, which allows you to access ModelArts from any local IDE using Python, including creating and training models and deploying services, closer to your own development habits.

Training management, for faster training and more accurate models, with three advantages of the universal AI modeling workflow based on EI-Backbone:

1. High-precision models can be trained from small samples of data, greatly saving the cost of data annotation;

2. Full-space network architecture search and automatic hyperparameter optimization can automatically and quickly improve model accuracy;

3. After loading an EI-Backbone pre-trained model, the process from model training to deployment can be shortened from several weeks to a few minutes, greatly reducing training cost.

Model management, with unified management of all iterations and debugging: developing and tuning an AI model usually requires many iterations and debugging runs, and changes to the training data set, the code, or the parameters can all affect model quality; without unified metadata management of the development process, the optimal model may become impossible to reproduce. ModelArts supports importing models from four sources: from training, from templates, from container images, and from OBS.

Deployment management, with one-click deployment to device, edge, and cloud: ModelArts supports online inference, batch inference, and edge inference. High-concurrency online inference meets the demands of high-volume online business, high-throughput batch inference quickly handles inference over accumulated data, and highly flexible edge deployment allows inference to be completed in a local environment.

Image management, with custom images supporting custom runtime engines: ModelArts uses container technology at the bottom layer, so you can build your own container images and run them on ModelArts. The custom image feature supports free-form command-line arguments and environment variables, flexibly supporting job startup with any computing engine.

Code interpretation

Next, let’s walk through the code concretely.

First, the multi-level output-space adversarial training in AdaptSegNet. The code is as follows:

As mentioned earlier, the probabilities obtained through Softmax are fed into the discriminator; that is the red box in the picture above.

Why are there a D1 and a D2? As mentioned earlier, features can be taken from the last and the penultimate feature levels of ResNet101 to form a multi-level adversarial process. As for the specific loss, BCE_LOSS handles the adversarial part.
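
Since the original code appears here as a screenshot, the sketch below reconstructs the same multi-level logic. The variable names (model_D1, model_D2, pred1/pred2, the lambda weights) follow common AdaptSegNet reimplementations and are assumptions:

    import torch
    import torch.nn.functional as F

    bce_loss = torch.nn.BCEWithLogitsLoss()
    source_label, target_label = 1.0, 0.0

    # Adversarial part of the segmentation update: pred1/pred2 are the classifier
    # outputs from the penultimate and last feature levels on a *target* batch.
    d_out1 = model_D1(F.softmax(pred1_tgt, dim=1))
    d_out2 = model_D2(F.softmax(pred2_tgt, dim=1))
    loss_adv = (lambda_adv1 * bce_loss(d_out1, torch.full_like(d_out1, source_label))
                + lambda_adv2 * bce_loss(d_out2, torch.full_like(d_out2, source_label)))

    # Discriminator updates: learn to tell source (1) from target (0) probability maps.
    for D, pred_s, pred_t in [(model_D1, pred1_src, pred1_tgt),
                              (model_D2, pred2_src, pred2_tgt)]:
        d_s = D(F.softmax(pred_s.detach(), dim=1))
        d_t = D(F.softmax(pred_t.detach(), dim=1))
        loss_d = (bce_loss(d_s, torch.full_like(d_s, source_label))
                  + bce_loss(d_t, torch.full_like(d_t, target_label)))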

Second, minimizing entropy with adversarial learning in Advent.

This paper requires calculating entropy. So how is the entropy calculated in the code?

First we get the probability: the output of the network goes through Softmax, the entropy is then calculated using P*logP, and the result is sent to the Discriminator.
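
A sketch of that flow; note that Advent feeds the per-class “weighted self-information” maps −p·log p into the discriminator rather than a single scalar entropy, and the helper name below is our own:

    import torch
    import torch.nn.functional as F

    def self_information(logits, eps=1e-12):
        """Per-class -p*log(p) maps, [B, C, H, W]; this is what the discriminator sees."""
        p = F.softmax(logits, dim=1)
        return -p * torch.log(p + eps)

    # Adversarial step: make the target-domain entropy maps look source-like.
    i_tgt = self_information(seg_net(x_tgt))
    d_out = netD(i_tgt)
    adv_loss = bce(d_out, torch.ones_like(d_out))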

Both papers use adversarial methods; the only difference is that one feeds in the output of Softmax, while the other converts the Softmax output probability into entropy before feeding it in. This is the only change in the code. Therefore, if you understand the flow of the first code base, it is easy to pick up the second.

Conclusion

Semantic segmentation data sets are relatively large, so training requires very strong hardware support. In general, a lab may only have a 10/11/12 GB GPU, but with Huawei Cloud ModelArts (see the feature introduction above), you can get better results.

If you are interested, you can visit the AI development platform ModelArts to experience it.

If you are interested in the two papers, Learning to Adapt Structured Output Space for Semantic Segmentation and Advent: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, the full texts can be obtained through the QR code below.

Follow us to be the first to learn about Huawei Cloud’s latest technologies~