Weakly Supervised Semantic Segmentation: Fast Forwarding from Image-Level Annotation to Pixel-Level Prediction

1. Background

Semantic segmentation, which aims to classify every pixel in an image, has long been one of the main tasks in computer vision. In practical applications, because it can locate the region occupied by an object with pixel-level accuracy and eliminate the influence of the background, it has always been a reliable method for fine-grained recognition and image understanding.

However, building a semantic segmentation dataset requires labeling every pixel of every image. According to statistics, segmenting and labeling a single 1280*720 image takes about 1.5 hours [1]. Datasets typically need tens of thousands or even 100,000 images to produce ideal results, and the manpower and material resources required to label them make the input-output ratio of real business projects extremely low.

To address this problem, weakly supervised semantic segmentation, which needs only image-level annotations to achieve results close to those of full supervision, has been a hot topic in recent years. It uses simpler, more readily available image-level annotations to train a classification model, obtains and refines the seed segmentation regions of objects from that model, and thereby achieves pixel-level dense prediction.

After in-depth research, the Yidun algorithm team analyzed the characteristics of weakly supervised semantic segmentation in practice and verified its effectiveness in real projects. The technique was successfully deployed and brought a significant improvement in project metrics, effectively supporting fine-grained recognition in Yidun content security services.

Next, this article introduces the categories and general pipeline of weakly supervised semantic segmentation, and briefly presents several representative papers in this direction.

2. Basic information

1. Classification

According to the form of weakly supervised signals, common weakly supervised semantic segmentation can be divided into the following four categories (Figure 1):

① Image-level annotation: the simplest form, which only labels the categories of the objects present in the image;

② Point annotation: mark one point on each object, together with its category;

③ Bounding-box annotation: mark the rectangular box enclosing each object, together with its category;

④ Scribble annotation: draw a line across each object, together with its category.

Figure 1 Classification of Weakly Supervised Semantic Segmentation

In this paper, we focus on weakly supervised semantic segmentation based on image-level annotation, which is the easiest and most convenient to label.

2. Weakly supervised semantic segmentation based on image-level annotation

Weakly supervised semantic segmentation based on image-level annotation is mostly carried out as a pipeline of several modules in series, as shown in Figure 2 [2]:


Figure 2. Steps of Weakly Supervised Semantic Segmentation

First, a classification model is trained with single-label or multi-label classification on the image-level class labels, and CAM [3] or one of its variants is applied to that model to generate the seed regions of the objects; then, the seed regions are refined and expanded with optimization algorithms (such as CRF [4] or AffinityNet [5]) to obtain the final pixel-level segmentation pseudo-labels; finally, the images and the segmentation pseudo-labels are used to train a conventional segmentation algorithm (such as the DeepLab series [6]).

1. Representative work

This part introduces several typical papers on image-level weakly supervised segmentation. It first covers CAM [3], the foundational paper for weakly supervised segmentation, then two algorithms (OAA [7] and SEAM [8]) that obtain a wider and more accurate CAM to serve as the seed region for segmentation pseudo-labels, and finally AffinityNet [5], a typical algorithm for refining and expanding seed regions.

1. CAM (Class Activation Mapping) [3]

This paper was presented by Bolei Zhou et al. at CVPR 2016. The authors found that the intermediate layers of a CNN can localize objects even when the network is trained without any localization labels, but this ability is destroyed when the convolutional output is flattened into a vector and passed through successive fully connected layers. It can be preserved if the final fully connected layers are replaced by a global average pooling (GAP) layer and a single fully connected layer followed by Softmax. With one additional simple computation, one can then obtain the class-discriminative region that leads the CNN to assign the image to a given class, that is, the CAM.

Figure 3 CAM

The CAM is computed as follows (see Figure 3):

Let $f_k(x, y)$ be the value at position $(x, y)$ of the $k$-th feature map produced by the last convolutional layer, and let $w_k^c$ be the weight in the last fully connected layer that connects the $k$-th feature map to class $c$. The value of the class activation map of class $c$ at position $(x, y)$ is then:

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y)$$

$M_c$ is the CAM. The larger the CAM value, the greater the contribution to the classification: the red area of the heat map in the last picture of Figure 3 marks the largest CAM values, which coincide with the face of the Australian terrier.
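The computation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature maps and fully connected weights are random stand-ins, the names `class_activation_map` and `normalize_for_display` are hypothetical, and the bias of the fully connected layer is ignored. The CAM of class c is the weighted sum of the last convolutional feature maps, using the weights that class c assigns to each channel after GAP.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Weighted sum of feature maps for one class.

    features:   (K, H, W) output of the last convolutional layer
    fc_weights: (C, K) weights of the single fully connected layer after GAP
    returns:    (H, W) map M_c(x, y) = sum_k w_k^c * f_k(x, y)
    """
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def normalize_for_display(cam):
    """Clip negatives and scale to [0, 1] so the map can be drawn as a heat map."""
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Random stand-ins for a trained network (assumption, for illustration only).
rng = np.random.default_rng(0)
features = rng.random((4, 7, 7))          # K=4 channels on a 7x7 grid
fc_weights = rng.standard_normal((3, 4))  # C=3 classes

cam = class_activation_map(features, fc_weights, class_idx=1)

# The class logit produced by GAP + FC (without bias) equals the spatial mean of the CAM.
logit = fc_weights[1] @ features.mean(axis=(1, 2))
```

The last line shows why the map is meaningful: the class score is exactly the spatial average of the CAM, so each position's value is its contribution to that score.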

In the paper, the authors point out that the region highlighted by the CAM can be used directly as the prediction for weakly supervised object localization, and ran the corresponding experiments. The method not only improved significantly on the best weakly supervised localization algorithm of the time, but also needed only a single forward pass to produce a bounding box.

In weakly supervised semantic segmentation, CAM has been the core algorithm for generating seed regions. Its shortcoming is also obvious: it focuses only on the most discriminative region and cannot cover the entire object. Most subsequent algorithms either address this problem or post-process the CAM. Several representative works are introduced next.

2. Get a better seed area

① OAA[7]

The motivation of this paper is simple and direct. The authors observed that, at different training stages before convergence, the CAM generated by the model moves across different parts of the same object. In Figure 4, b, c, and d show different training stages, and the CAM (highlighted region) shifts between them. When the CAMs generated at different stages for the same image are integrated, the integrated CAM covers the entire object better, as shown in Figure 4-e.

Figure 4 CAM movement diagram

The schematic diagram of its algorithm is as follows:

Figure 5 Schematic diagram of OAA algorithm

For a given class $c$ in a single image, when the image is trained on at epoch $t$, ReLU and normalization are applied to its CAM feature map $F_t^c$ to obtain the attention map $A_t^c$:

$$A_t^c = \frac{\mathrm{ReLU}(F_t^c)}{\max \mathrm{ReLU}(F_t^c)}$$

$A_t^c$ is then compared pixel by pixel with the map saved from the previous epoch, and the maximum is kept at each pixel:

$$M_t = \max(M_{t-1}, A_t^c), \quad M_1 = A_1^c$$

The final output is the cumulative response map $M_T$. After training, $M_T$ is used as the seed region of the pseudo-label for weakly supervised semantic segmentation.
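The accumulation step can be illustrated with a toy NumPy sketch. It assumes that normalization means dividing by the maximum after ReLU, and the per-epoch CAMs are hand-written values rather than network outputs; the function names are hypothetical.

```python
import numpy as np

def attention_from_cam(cam_feature):
    """ReLU followed by max-normalization to [0, 1] (normalization choice is an assumption)."""
    relu = np.maximum(cam_feature, 0.0)
    peak = relu.max()
    return relu / peak if peak > 0 else relu

def accumulate(cumulative, attention):
    """Keep the element-wise maximum of the stored map and the new attention map."""
    if cumulative is None:
        return attention.copy()
    return np.maximum(cumulative, attention)

# Toy CAMs of one class at three epochs: the peak moves from left to right.
epoch_cams = [
    np.array([[2.0, 0.0, -1.0]]),
    np.array([[0.0, 4.0, 0.0]]),
    np.array([[-3.0, 1.0, 5.0]]),
]

cumulative = None
for cam_feature in epoch_cams:
    cumulative = accumulate(cumulative, attention_from_cam(cam_feature))
```

Each single-epoch map highlights only one pixel, while the accumulated map covers all three. The same maximum operation also keeps any spurious high response from any epoch, which is where the noise problem discussed next comes from.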

The OAA algorithm is simple and effective, and achieved SOTA performance on the VOC weakly supervised segmentation benchmark at the time. In our own project trials, however, its results were mediocre: taking the maximum over the CAMs of multiple epochs easily introduces noise into the seed region, and this noise is hard to remove by post-processing, which ultimately leads to a significant drop in segmentation accuracy.

② SEAM[8]

This paper brings in the currently popular idea of self-supervised learning. The authors observed that the CAMs produced from different affine transformations of the same image are inconsistent (see the different scale transformations in Figure 6-a). Using the implicit equivariance constraint, they establish a consistency regularization mechanism similar to self-supervised contrastive learning that reduces this inconsistency and thereby refines the CAM, yielding high-precision seed regions (Figure 6-b).

Figure 6 CAM of input images of different sizes

The schematic diagram of SEAM algorithm is as follows:

Figure 7 Schematic diagram of SEAM algorithm

The original image and its version after a simple affine transformation $A(\cdot)$ (such as downscaling or horizontal flipping) are fed into a CNN with shared parameters to obtain the corresponding CAM feature maps $\hat{y}^o$ and $\hat{y}^t$. These two CAMs are then refined by the Pixel Correlation Module (PCM), which learns pixel-to-pixel relationships, to give the refined CAMs $y^o$ and $y^t$. PCM is an operation similar to self-attention: the query and key are feature maps obtained from the CNN, and the value is the CAM feature. The loss consists of three parts:

$$L = L_{cls} + L_{ER} + L_{ECR}$$

Here $L_{cls}$ is the multi-label soft margin loss, a commonly used multi-label classification loss. The equivariant regularization (ER) loss is:

$$L_{ER} = \left\| A(\hat{y}^o) - \hat{y}^t \right\|_1$$

The equivariant cross regularization (ECR) loss is:

$$L_{ECR} = \left\| A(y^o) - \hat{y}^t \right\|_1 + \left\| A(\hat{y}^o) - y^t \right\|_1$$
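The two regularization terms can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: a horizontal flip stands in for the affine transformation, the L1 norm is averaged over elements (an assumed normalization), the CAMs are random arrays, and all function names are hypothetical. `cam_o` and `cam_t` are the CAMs of the original and transformed branches before PCM, and `refined_o` and `refined_t` are the PCM outputs.

```python
import numpy as np

def hflip(cam):
    """Horizontal flip of a (C, H, W) CAM; stands in for the affine transformation A."""
    return cam[..., ::-1]

def l1(a, b):
    """Mean absolute difference (an averaged L1 norm, assumed here for scale)."""
    return float(np.abs(a - b).mean())

def seam_regularizers(cam_o, cam_t, refined_o, refined_t, transform=hflip):
    """Return (ER loss, ECR loss) for one image.

    ER compares the two pre-PCM CAMs after aligning the original branch with A.
    ECR compares each branch's PCM output with the other branch's pre-PCM CAM.
    """
    er = l1(transform(cam_o), cam_t)
    ecr = l1(transform(refined_o), cam_t) + l1(transform(cam_o), refined_t)
    return er, ecr

rng = np.random.default_rng(0)
cam_o = rng.random((2, 4, 4))

# Perfectly equivariant case: the transformed branch equals the flipped original.
er_ok, ecr_ok = seam_regularizers(cam_o, hflip(cam_o), cam_o, hflip(cam_o))

# Inconsistent case: the transformed branch responds differently.
cam_t = rng.random((2, 4, 4))
er_bad, ecr_bad = seam_regularizers(cam_o, cam_t, cam_o, cam_t)
```

When the network is perfectly equivariant both terms vanish; any mismatch between the branches makes them positive, which is the signal the training uses to pull the two CAMs together.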

During inference, the CAMs generated from the image at multiple scales, together with those of its horizontally flipped versions, are summed and normalized to give the final CAM.
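A minimal sketch of this fusion step is shown below. It assumes nearest-neighbor resizing to a common resolution and division by the maximum as the normalization; both choices, as well as the names `resize_nearest` and `fuse_cams`, are illustrative assumptions.

```python
import numpy as np

def resize_nearest(cam, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) map (a simple stand-in for interpolation)."""
    h, w = cam.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return cam[rows][:, cols]

def fuse_cams(cams, flipped, out_h, out_w):
    """Undo flips, resize every CAM to (out_h, out_w), sum, and scale to [0, 1]."""
    total = np.zeros((out_h, out_w))
    for cam, was_flipped in zip(cams, flipped):
        if was_flipped:
            cam = cam[:, ::-1]  # flip back so all maps are spatially aligned
        total += resize_nearest(cam, out_h, out_w)
    peak = total.max()
    return total / peak if peak > 0 else total

# Toy example: the same CAM seen as-is, horizontally flipped, and at 2x scale.
base = np.array([[0.0, 1.0], [2.0, 3.0]])
cams = [base, base[:, ::-1], resize_nearest(base, 4, 4)]
flipped = [False, True, False]

fused = fuse_cams(cams, flipped, out_h=2, out_w=2)
```

The important detail is that each flipped CAM is flipped back before summation; otherwise the responses of the two versions would be averaged at mirrored positions.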

Compared with other algorithms, SEAM not only achieved a large improvement on public datasets, but also delivered good results in real Yidun projects. Its drawback is that both training and inference are time-consuming.

3. CAM post-processing

Once the CAM seed region is obtained, it can be used directly as a pseudo-label for training semantic segmentation, but for better segmentation results the seed region is usually refined first. Next, a very effective algorithm, AffinityNet [5], is introduced.

The main approach of this paper is as follows. For an image and its CAM, an adjacency graph is built that connects every pixel to the pixels within a certain radius around it, and AffinityNet estimates the semantic affinity between the pixel pairs in this graph to build a probability transition matrix. For each class, a random walk governed by this transition matrix encourages the activation of a point near the object boundary to spread to nearby locations with the same semantics, while penalizing spread into other classes. This semantic diffusion and penalty significantly refine the CAM so that it covers the entire object better, resulting in more accurate and complete pseudo-labels.

The main difficulty in training this model is doing so without additional supervision. The authors found that the CAM itself can serve as the source of supervision for training AffinityNet. Although the CAM suffers from incomplete coverage and noise, it still covers local regions correctly and identifies the semantic affinity between pixels within those regions, which is exactly what AffinityNet needs to learn. To obtain reliable labels from the CAM, the authors discard pixels with relatively low CAM scores and keep only high-scoring pixels and background pixels. Pixel pairs are then sampled from these regions, with label 1 if the two pixels belong to the same class and 0 otherwise, as shown in Figure 8.
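The pair-labeling rule can be sketched as follows. The sketch assumes the CAM has already been converted into a label map in which reliable object pixels carry their class index, reliable background pixels carry 0, and discarded pixels carry the hypothetical marker `IGNORE`; the function name and the circular neighborhood are also illustrative choices.

```python
import numpy as np

IGNORE = -1  # hypothetical marker for pixels discarded as unreliable

def affinity_pairs(label_map, radius):
    """Collect (pixel, neighbor, label) triples within `radius`.

    label is 1 if both pixels have the same class and 0 otherwise.
    Pairs that touch an ignored pixel are skipped. Each unordered pair is
    visited once by only looking at neighbors to the right or below.
    """
    h, w = label_map.shape
    pairs = []
    for y in range(h):
        for x in range(w):
            if label_map[y, x] == IGNORE:
                continue
            for dy in range(0, radius + 1):
                for dx in range(-radius, radius + 1):
                    if (dy, dx) <= (0, 0):
                        continue  # skip self and already-visited directions
                    if dy * dy + dx * dx > radius * radius:
                        continue  # outside the circular neighborhood
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    if label_map[ny, nx] == IGNORE:
                        continue
                    same = int(label_map[y, x] == label_map[ny, nx])
                    pairs.append(((y, x), (ny, nx), same))
    return pairs

# Toy label map: two pixels of class 1, one background pixel, one discarded pixel.
label_map = np.array([[1, 1],
                      [0, IGNORE]])
pairs = affinity_pairs(label_map, radius=1)
```

Only pairs whose two pixels are both reliable contribute to training, so noisy or uncertain CAM regions do not produce wrong affinity labels.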


Figure 8 Schematic diagram of AffinityNet samples and labels

During training, the image is passed through AffinityNet to obtain the feature map $f^{aff}$. The semantic affinity $W_{ij}$ between pixels $i$ and $j$ on the feature map is computed as:

$$W_{ij} = \exp\left(-\left\| f^{aff}(x_i, y_i) - f^{aff}(x_j, y_j) \right\|_1\right)$$

where $(x_i, y_i)$ denotes the coordinates of the $i$-th pixel on the feature map $f^{aff}$. The network is then trained with a cross-entropy loss.
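The affinity between two pixels can be sketched as the exponential of the negative L1 distance between their feature vectors. The sketch below uses a hand-built feature map instead of a trained network, and a plain binary cross-entropy on a single pair as a simplified stand-in for the training loss; the function names are hypothetical.

```python
import numpy as np

def semantic_affinity(feature_map, p, q):
    """Affinity between pixels p=(x, y) and q=(x, y) on a (D, H, W) feature map.

    W = exp(-||f(p) - f(q)||_1), so identical features give 1 and
    very different features give a value near 0.
    """
    (xi, yi), (xj, yj) = p, q
    distance = np.abs(feature_map[:, yi, xi] - feature_map[:, yj, xj]).sum()
    return float(np.exp(-distance))

def pair_cross_entropy(affinity, label, eps=1e-7):
    """Binary cross-entropy for one pair (simplified; per-group averaging is omitted)."""
    a = np.clip(affinity, eps, 1.0 - eps)
    return float(-(label * np.log(a) + (1 - label) * np.log(1.0 - a)))

# Toy feature map with D=2 channels on a 1x2 grid; the two pixels differ by ln(2) in L1.
feature_map = np.zeros((2, 1, 2))
feature_map[:, 0, 1] = np.log(2.0) / 2.0

w_same = semantic_affinity(feature_map, (0, 0), (0, 0))
w_diff = semantic_affinity(feature_map, (0, 0), (1, 0))
loss_if_same_class = pair_cross_entropy(w_diff, label=1)
```

Because the affinity already lies in (0, 1], it can be treated directly as the predicted probability that the two pixels share a class, which is what makes a cross-entropy loss applicable.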

The overall training and inference process is shown in the following figure:

Figure 9 Schematic diagram of AffinityNet training and inference

First, multiple pixel pairs are selected from the CAMs of the training images as training samples, together with their semantic affinity labels, and these data are used to train AffinityNet (left of Figure 9). Then, the trained AffinityNet is run on each image to compute the semantic affinity matrix of the image's adjacency graph, which serves as the probability transition matrix. Finally, this matrix is applied to the image's CAM as a random walk to obtain the final refined segmentation pseudo-labels (right of Figure 9).
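The random-walk step can be sketched as repeated multiplication of the flattened CAM by a row-normalized affinity matrix. The exponent `beta`, the number of `steps`, and the function name are assumed hyperparameters and names for illustration; the affinity matrix here is hand-written rather than predicted by a network.

```python
import numpy as np

def random_walk_refine(cam, affinity, beta=1.0, steps=2):
    """Propagate CAM scores along semantic affinities.

    cam:      (C, N) CAM of each class, flattened over N pixels
    affinity: (N, N) non-negative affinity matrix with non-zero row sums
    """
    powered = affinity ** beta  # element-wise power sharpens the affinities
    transition = powered / powered.sum(axis=1, keepdims=True)  # row-stochastic matrix
    refined = cam.copy()
    for _ in range(steps):
        # refined[c, i] becomes the affinity-weighted average of its neighbors' scores
        refined = refined @ transition.T
    return refined

# Toy example: pixels 0 and 1 belong together, pixel 2 is unrelated.
affinity = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
cam = np.array([[1.0, 0.0, 0.0]])  # the CAM only fires on pixel 0

refined = random_walk_refine(cam, affinity)
```

In the toy example the activation on pixel 0 spreads to pixel 1, which shares its semantics, while pixel 2 stays at zero because its affinity to the other two pixels is zero.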

The AffinityNet algorithm has a clear idea and reliable results. It is often used to post-process the CAMs produced by algorithms such as OAA or SEAM, improving the accuracy of the pseudo-labels and expanding their coverage. The improvement is evident in both qualitative and quantitative analysis.

3. Summary

This article briefly introduced the concept and pipeline of weakly supervised semantic segmentation, presented several representative algorithms, and analyzed their advantages and disadvantages from a practical point of view. Existing weakly supervised semantic segmentation still suffers from a cumbersome and lengthy pipeline. Some academic works have proposed end-to-end solutions and achieved promising results (e.g., [9]). In the future, the Yidun algorithm team will continue to follow the latest academic developments and try to put them into practice, so as to further improve fine-grained recognition in Yidun content security services.

[1] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 3213-3223.

[2] Zhang D, Zhang H, Tang J, et al. Causal intervention for weakly-supervised semantic segmentation[J]. Advances in Neural Information Processing Systems, 2020, 33: 655-666.

[3] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 2921-2929.

[4] Krähenbühl P, Koltun V. Efficient inference in fully connected crfs with gaussian edge potentials[J]. Advances in neural information processing systems, 2011, 24.

[5] Ahn J, Kwak S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 4981-4990.

[6] Chen L C, Papandreou G, Kokkinos I, et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 40(4): 834-848.

[7] Jiang P T, Hou Q, Cao Y, et al. Integral object mining via online attention accumulation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2070-2079.

[8] Wang Y, Zhang J, Kan M, et al. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12275-12284.

[9] Zhang B, Xiao J, Wei Y, et al. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 12765-12772.

[10] Araslanov N, Roth S. Single-stage semantic segmentation from image labels[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 4253-4262.
