Abstract: Semantic segmentation datasets are relatively large, so training requires very powerful hardware support.
This article is shared from the Huawei Cloud community post "Semantic Segmentation Algorithm Sharing Based on Transfer Learning" by the original author, Qiming.
This article shares two papers on transfer-learning-based semantic segmentation. The first is "Learning to Adapt Structured Output Space for Semantic Segmentation"; the second is "ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation".
Part 1: Background on transfer-based segmentation
Semantic segmentation, detection, and classification are the three mainstream directions in machine vision. Compared with detection and classification, however, semantic segmentation faces two very difficult problems:
The first is the lack of datasets. A classification label is just a category, and a detection label is a bounding box, but segmentation aims at semantic-level prediction, so its labels must be pixel-level as well. Annotating a pixel-level dataset is extremely time-consuming and labor-intensive: in Cityscapes, the autonomous-driving dataset, labeling a single image takes about 1.5 hours. Building a semantic segmentation dataset ourselves would therefore cost enormous time and effort.
The other problem is that semantic segmentation must also cover real-world conditions, yet in practice it is hard to cover every situation: different weather, different places, different architectural styles. These are the problems semantic segmentation faces.
Given these two situations, how do researchers tackle the problems?
Besides working on the datasets themselves, they found that computer graphics and related techniques can synthesize simulated datasets to stand in for real-world data, thereby reducing annotation cost.
Take the familiar game GTA5 as an example: simulation data can be collected inside the game, and the game engine's native annotations reduce labeling cost. But a problem follows: models trained on such simulated data suffer performance degradation in the real world. Traditional machine learning rests on the premise that the test set and the training set share the same distribution, whereas the simulated dataset and the real dataset inevitably differ in distribution.
Our goal, therefore, is to use transfer algorithms to address the performance drop that a model trained on the source domain suffers on the target domain.
Main contributions of the two papers and related work
Main contributions
The first paper, "Learning to Adapt Structured Output Space for Semantic Segmentation":
1. Proposes a transfer segmentation algorithm based on adversarial learning;
2. Verifies that adversarial training in the output space can effectively exploit the scene layout and context information of the two domains;
3. Shows that adversarial training on multi-level outputs further improves the model's transfer performance.
The second "ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation":
1. Use the loss function based on entropy to prevent the network from making low-reliability predictions on the target domain;
2. Propose a learning method for opportunistic entropy, taking into account the entropy reduction and the structural alignment of the two domains at the same time;
3. A constraint method based on the priori of category distribution is proposed.
Related work
Before explaining the two papers, let us briefly introduce another article, "FCNs in the Wild", the first paper to apply transfer learning to semantic segmentation. It sends the features extracted by the segmentation network's feature extractor into a discriminator, and completes transfer on the segmentation task by aligning global information.
First, what form does a commonly used semantic segmentation network take? It generally consists of two parts: a feature extractor, for example a ResNet-series or VGG-series backbone that extracts image features, and a classifier that takes the extracted features as input (PSP heads are common, and ASPP from DeepLab v2 is the most widely used in DA segmentation). Domain adaptation (DA) is then completed by sending the extracted features into a discriminator.
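To make this two-part structure concrete, here is a minimal PyTorch sketch. It assumes a torchvision ResNet-101 backbone; the 1×1-convolution classifier is a deliberate simplification standing in for PSP/ASPP heads, and `SegNet` is just an illustrative name.

```python
import torch
import torch.nn as nn
from torchvision import models

class SegNet(nn.Module):
    """Minimal feature-extractor + classifier segmentation network (illustrative)."""
    def __init__(self, num_classes=19):
        super().__init__()
        resnet = models.resnet101(weights=None)  # pretrained=False on older torchvision
        # Feature extractor: everything up to the last residual stage.
        self.features = nn.Sequential(*list(resnet.children())[:-2])
        # Classifier: a bare 1x1 conv stands in for PSP/ASPP heads here.
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feat = self.features(x)           # B x 2048 x h/32 x w/32
        logits = self.classifier(feat)    # B x C x h/32 x w/32
        # Upsample predictions back to the input resolution.
        return nn.functional.interpolate(logits, size=(h, w),
                                         mode='bilinear', align_corners=False)
```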
Why can DA be completed by sending features into a discriminator? The discriminator's role explains it.
In a GAN, the discriminator is trained to tell whether an input image is real or fake. Here, a discriminator is trained instead to tell whether an input feature comes from the source domain or the target domain. Once such a discriminator is obtained, its parameters are fixed, and the segmentation network's feature extractor is trained, with the objective of confusing the discriminator.
How does the feature extractor confuse the discriminator? Whether it extracts features from the source domain or the target domain, the two feature distributions must be aligned. When the two domains' features are aligned and the discriminator can no longer tell them apart, the "confusion" task is complete.
Extracting the "domain invariant" information is actually doing a migration process. Because the network has the function of extracting "domain invariant" information, both the source domain and the target domain can extract a very good feature.
The two papers below both follow this idea of adversarial training against a discriminator; the difference is what gets fed into the discriminator, which we detail later.
Analysis of the first paper's algorithm
Article title: "Learning to adapt structured output space for semantic segmentation"
Like the related work above, this paper's model consists of a segmentation network and a discriminator. From the figure above, or simply from the title, it is clear that what gets adapted is the output space. So what is the output space?
The output space here is the probability map produced when the semantic segmentation network's output passes through softmax; this probability is what we call the output space.
The authors argue that adversarial training directly on features is worse than adversarial training on the output-space probabilities. Why? In classification, features are the conventional choice, but segmentation differs: the high-dimensional segmentation feature is a very long vector; the last layer of ResNet-101, for instance, is 2048-dimensional. Such a high-dimensional feature naturally encodes complicated information, but for semantic segmentation much of that complexity may be useless. That is the authors' first observation.
Their second observation is that although the semantic segmentation output is low-dimensional (with c categories, each pixel's probability is a c×1 vector), the output for a whole image still contains rich information such as scene, layout, and context. The authors hold that whether an image comes from the source domain or the target domain, the segmentation results should be spatially very similar: simulated or real, it is the same segmentation task. As the figure above shows, both domains depict autonomous-driving scenes; the middle is mostly road, the top is usually sky, and the left and right tend to be buildings. Since the scene distributions are so similar, adversarial training directly on the low-dimensional softmax output can achieve very good results.
Based on these two insights, the authors feed the probabilities directly into the discriminator. The training process is the same as for a GAN, except that what is passed to the discriminator is no longer the features but the final output probabilities.
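The change relative to the feature-level sketch is small. A fragment, reusing `seg_net`, `discriminator`, `bce`, and `SRC_LABEL` from the sketches above (all illustrative names):

```python
import torch
import torch.nn.functional as F

# The output-space variant: softmax probabilities, not backbone features,
# go into the discriminator.
logits_tgt = seg_net(x_tgt)                 # B x C x H x W logits
prob_tgt = F.softmax(logits_tgt, dim=1)     # the "output space"
d_out = discriminator(prob_tgt)
adv_loss = bce(d_out, torch.full_like(d_out, SRC_LABEL))  # push target toward "source"
```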
Returning to the figure above, the two DA branches on the left are multi-scale. Recall the semantic segmentation network described at the beginning: it splits into a feature extractor and a classifier, and the classifier's input is the features produced by the extractor.
ResNet, as is well known, has several stages. Building on this fact, the authors propose adversarial training on the output space at both the last stage and the second-to-last stage: the two sets of features are each sent to a classifier, and each classifier's output is fed into a discriminator for adversarial training.
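In sketch form (all names below are illustrative; `lambda_adv1` and `lambda_adv2` stand for the per-level loss weights, which are assumed hyperparameters):

```python
import torch
import torch.nn.functional as F

# Multi-level sketch: two classifier heads on the last two ResNet stages,
# each paired with its own discriminator. backbone_to_layer3, layer4,
# classifier1/2, D1/D2, bce, SRC_LABEL, lambda_adv1/2: assumed names.
feat3 = backbone_to_layer3(x_tgt)    # second-to-last stage features
feat4 = layer4(feat3)                # last stage features
prob1 = F.softmax(classifier1(feat3), dim=1)
prob2 = F.softmax(classifier2(feat4), dim=1)

d1_out, d2_out = D1(prob1), D2(prob2)
adv_loss = lambda_adv1 * bce(d1_out, torch.full_like(d1_out, SRC_LABEL)) \
         + lambda_adv2 * bce(d2_out, torch.full_like(d2_out, SRC_LABEL))
```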
To sum up, the paper's algorithmic innovations are:
1. Adversarial training in the output space, exploiting the structural information in the network's predictions;
2. Further improvement of the model by adversarial training on multiple levels of output.
So how do the results look?
The figure above shows the experimental results on the GTA5-to-Cityscapes task. The first row, Baseline (ResNet), trains a model on the source domain and tests it directly on the target domain. The second row adapts in the feature dimension and reaches 39.3; although this improves on the source-only model, it is lower than the two output-space variants below. The single-level variant takes the features from ResNet's last stage, feeds them into the classifier, and adapts on that output; the multi-level variant adapts on the last two stages, and as the results show, it does better still.
Analysis of the second paper's algorithm
Article title: "ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation"
Next, the second paper: a transfer segmentation method based on entropy minimization and entropy-based adversarial training.
To understand this paper, we must first understand the concept of "entropy". The authors use information (Shannon) entropy, built from the terms −P·log P, a probability times its logarithm.
Semantic segmentation predicts every pixel of an image. For each pixel the final result is a c×1 vector, where c is the number of possible categories. The entropy of a pixel is the sum over categories of each category's probability times the log of that probability, E = −Σ_c P_c · log P_c; the entropy of a whole image is then the sum of the per-pixel entropies over the image's height and width.
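In code, under the same conventions as the earlier sketches (a B×C×H×W logit tensor; the small epsilon is an assumption to avoid log(0)):

```python
import torch
import torch.nn.functional as F

def pixel_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel Shannon entropy of a segmentation prediction.

    logits: B x C x H x W network output (before softmax).
    Returns a B x H x W entropy map.
    """
    prob = F.softmax(logits, dim=1)
    # -sum_c P * log P; the epsilon avoids log(0).
    return -(prob * torch.log(prob + 1e-30)).sum(dim=1)

# The entropy of a whole image is the sum over all of its pixels:
# image_entropy = pixel_entropy(logits).sum(dim=(1, 2))
```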
Observing the figure above, the authors notice a phenomenon: plotting the entropy of the source-domain segmentation map, the entropy is high only at category edges (the darker the color, the lower the entropy; the lighter the color, the higher the entropy). For the target-domain image, by contrast, the prediction is light over much of the picture. The authors therefore argue that because the target domain carries so much useless high entropy (a certain amount of noise), the gap between source and target can be narrowed by reducing the entropy of the target domain.
So how can the target-domain entropy be reduced? The authors propose two methods, which are the paper's algorithmic innovations:
1. Use an entropy-based loss function to prevent the network from making low-confidence predictions on the target domain;
2. Propose an entropy-based adversarial learning method that also considers entropy reduction and structural alignment of the two domains.
The direct route is to obtain the overall entropy of an image and minimize it by gradient backpropagation. The authors argue, however, that direct entropy reduction ignores a lot of information, such as the semantic structure of the image itself. They therefore borrow the output-space adversarial method from the first paper and propose entropy reduction through adversarial learning.
As the earlier figure showed, the source-domain entropy is very low. So if a discriminator can be trained to distinguish source from target, and the segmentation network trained until the final entropy maps of the two domains look very similar, the target-domain entropy is driven down. The procedure mirrors the first paper's, except that where the first paper puts the probability into the discriminator, the second sends in the entropy, completing the whole process.
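Putting the pieces together, a hedged sketch of one training step, reusing `pixel_entropy` and the discriminator pattern from earlier. One simplification: for readability this feeds a 1-channel entropy map to `D`, whereas the paper's actual formulation feeds per-class weighted self-information maps (shown in the code section later); `lambda_adv`, `y_src`, and all other names are assumptions.

```python
import torch
import torch.nn.functional as F

# Discriminator step: learn to separate source entropy maps from target ones.
ent_src = pixel_entropy(seg_net(x_src)).unsqueeze(1)   # B x 1 x H x W
ent_tgt = pixel_entropy(seg_net(x_tgt)).unsqueeze(1)
p_src, p_tgt = D(ent_src.detach()), D(ent_tgt.detach())
d_loss = bce(p_src, torch.full_like(p_src, SRC_LABEL)) \
       + bce(p_tgt, torch.full_like(p_tgt, TGT_LABEL))

# Segmentation step: supervised loss on the source, plus an adversarial term
# that pushes target entropy maps to look like the low-entropy source ones.
p_tgt = D(ent_tgt)
seg_loss = F.cross_entropy(seg_net(x_src), y_src)
total_loss = seg_loss + lambda_adv * bce(p_tgt, torch.full_like(p_tgt, SRC_LABEL))
```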
This paper thus performs entropy reduction on one hand and uses structural information on the other. The experimental results show clearly that on GTA5-to-Cityscapes, direct entropy minimization already improves substantially over FCNs in the Wild and over output-space adversarial training, and the entropy-based adversarial method does a little better still.
The authors also find that if the predicted probabilities of the direct entropy-minimization model and the entropy-adversarial model are added together and the maximum taken, the result improves a bit further. For semantic segmentation tasks, such an improvement is considerable.
Code reproduction
Next we move on to reproducing the code.
The principle of reproducing a paper is to stay consistent with the specific methods, parameters, and data augmentation described in the paper.
For this reproduction, we first searched for the open-source code on GitHub and, based on it, implemented both papers in the same PyTorch-based framework. Once you understand one paper's code, the other's is very easy to follow.
The two QR codes below link to the code of the two papers; scan them to view.
Introduction to ModelArts
Both papers were run on Huawei Cloud's ModelArts, so let us briefly introduce ModelArts first.
ModelArts is a one-stop AI development platform for developers. It provides massive data preprocessing and semi-automatic annotation, large-scale distributed training, automated model generation, and device-edge-cloud model deployment for machine learning and deep learning, helping users quickly create and deploy models and manage the full-cycle AI workflow. Its core functions are as follows:
Data management, saving up to 80% of manual data-processing cost: covers four data formats (image, sound, text, video) and nine annotation tools, and provides intelligent annotation and team annotation to greatly improve labeling efficiency; supports data cleaning, data augmentation, data inspection, and other common data processing; offers flexible, visual management of multiple dataset versions, with dataset import and export for easy use in ModelArts model development and training.
Development management: besides a local development environment (IDE), ModelArts supports development on the cloud through its management-console interface. A Python SDK is also provided, so any local IDE can access ModelArts in Python, including creating and training models and deploying services, staying closer to your own development habits.
Training management, training high-precision models faster: three advantages of the large-model-centered (EI-Backbone) general AI modeling workflow, listed below:
1. Training high-precision models from small samples of data, saving substantial data-labeling cost;
2. Full-space network-architecture search and automatic hyperparameter optimization, which automatically and quickly improve model accuracy;
3. After loading an EI-Backbone pre-trained model, the pipeline from model training to deployment shrinks from weeks to minutes, greatly reducing training cost.
Model management, unified management of all iterated and debugged models: AI model development and tuning require many iterations, and changes in datasets, training code, or parameters all affect model quality; if the metadata of the development process is not managed in one place, the optimal model may not be reproducible. ModelArts supports importing models from four sources: from training, from a template, from a container image, or from OBS.
Deployment management, one-click deployment to device, edge, and cloud: ModelArts supports online inference, batch inference, and edge inference. High-concurrency online inference meets large online business volume, high-throughput batch inference quickly handles accumulated data, and flexible edge deployment lets inference run in the local environment.
Image management, custom images supporting custom runtime engines: ModelArts is container-based underneath, so you can build your own container image and run it on ModelArts. Custom images support command-line parameters and environment variables as free text, which is flexible and accommodates the startup requirements of any compute engine.
Code explanation
Next, we explain the code concretely.
First, the first paper: the multi-level output-space adversarial training in AdaptSegNet. The code is as follows:
As mentioned earlier, the probabilities obtained from softmax are sent to the discriminator; this is the part in the red box of the figure above.
Why are there a D1 and a D2? As noted before, the features from ResNet-101's last stage and second-to-last stage form the multi-level adversarial setup. For the adversarial loss itself, bce_loss is used.
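Since the original screenshots are not reproduced here, the following is a sketch of that step modeled on the open-source AdaptSegNet code; `model_D1`/`model_D2`, `pred1`/`pred2`, the label convention, and the `lambda_adv*` weights are assumed names, and details may differ from the actual repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce_loss = nn.BCEWithLogitsLoss()
source_label = 0.   # an arbitrary but fixed convention for "source domain"

# pred1 / pred2: outputs of the two classifier heads on a target image;
# model_D1 / model_D2: their discriminators (assumed defined elsewhere).
D_out1 = model_D1(F.softmax(pred1, dim=1))
D_out2 = model_D2(F.softmax(pred2, dim=1))

# Adversarial terms: target predictions are labeled "source" to fool D.
loss_adv1 = bce_loss(D_out1, torch.full_like(D_out1, source_label))
loss_adv2 = bce_loss(D_out2, torch.full_like(D_out2, source_label))
loss = lambda_adv1 * loss_adv1 + lambda_adv2 * loss_adv2
```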
Part 2: Minimizing entropy with adversarial learning in ADVENT
This paper needs to compute entropy. How is entropy computed?
First obtain the probabilities: pass the network's output through softmax, then use P·log P to obtain the entropy map, and send that entropy map into the discriminator.
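In code, this softmax-then-entropy conversion can look like the sketch below, modeled on the `prob_2_entropy` helper in the open-source ADVENT code (the epsilon and the normalization by log2(C) follow that code; treat the details as assumptions):

```python
import numpy as np
import torch
import torch.nn.functional as F

def prob_2_entropy(prob: torch.Tensor) -> torch.Tensor:
    """Turn a B x C x H x W probability map into per-class weighted
    self-information maps, -P * log P per channel, normalized by log2(C)."""
    n, c, h, w = prob.size()
    return -torch.mul(prob, torch.log2(prob + 1e-30)) / np.log2(c)

# Usage (seg_net, x_tgt, model_D assumed defined as in earlier sketches):
# prob_tgt = F.softmax(seg_net(x_tgt), dim=1)
# d_out = model_D(prob_2_entropy(prob_tgt))
```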
Both papers use the adversarial approach. The only difference is that one feeds the softmax output into the discriminator, while the other converts the softmax probabilities into entropy first. The code differs in exactly the same way, so if you can follow the first paper's code, the second is easy to pick up.
Conclusion
Semantic segmentation datasets are relatively large, so training requires very powerful hardware. A typical lab may only have GPUs with 10/11/12 GB of memory, whereas with Huawei Cloud ModelArts (introduced above) you can obtain better output results.
If you are interested, you can click >>> AI development platform ModelArts to experience it.
If you are interested in "Learning to Adapt Structured Output Space for Semantic Segmentation" and "ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation", you can obtain the full original papers through the QR codes below.