
(Reproduced from Microsoft Research AI headlines)

Editor's note: As research on self-supervised learning deepens, the transfer-learning paradigm has been widely adopted across visual learning: a large number of vision tasks are deployed through self-supervised pre-training followed by supervised fine-tuning. Researchers from Microsoft Research Asia hope to break this paradigm. In a paper published at NeurIPS 2021, they propose a model that learns object detection and segmentation from unlabeled videos, so that the self-supervised pre-trained model can serve applications directly, without any supervised fine-tuning, achieving zero-label learning.

Contrastive learning is currently the mainstream method for training visual self-supervised models. Its core idea is to treat each individual sample in the training set as its own category and to design the pre-training task as the discrimination of individual instances. Since each category contains only one sample, instance discrimination by itself would be trivially easy, so researchers use data augmentation to create rich in-class samples for each instance. For images, common augmentations include translation, scaling, flipping, color-contrast and color changes, blurring, and grayscale transformation. Although these augmentations change the details of an image, they do not change its semantic content; in effect, contrastive learning learns feature representations that are invariant to these augmentation transforms. Experiments show that contrastive learning depends heavily on data augmentation.

Figure 1: Contrastive learning relies heavily on image augmentation to learn invariance. Commonly used augmentations include translation, scaling, color enhancement, and local blurring.
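To make the role of augmentation concrete, below is a minimal, illustrative sketch of the kind of augmentation pipeline used in contrastive pre-training (e.g., instance discrimination). The specific transforms and parameter values are assumptions for demonstration, not taken from any particular paper.

```python
from torchvision import transforms

# Illustrative contrastive-learning augmentation pipeline.
# Transforms and parameters below are assumptions for demonstration only.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),        # translation + scaling
    transforms.RandomHorizontalFlip(),                          # flipping
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),                          # grayscale transformation
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # local blurring
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Create two augmented 'in-class' views of the same image; a contrastive
    loss then pulls their feature representations together."""
    return augment(pil_image), augment(pil_image)
```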

As a pre-training method, contrastive learning only learns a feature representation, and this representation still requires a (small) amount of supervised downstream data for fine-tuning before it can be applied to downstream tasks. Although the pre-trained representation can greatly improve fine-tuning performance on downstream tasks, this reliance on fine-tuning is itself a shortcoming of the self-supervised model.

Figure 2: The transfer-learning framework: general pre-training + task-specific fine-tuning. Self-supervised learning has become a powerful pre-training method, but it must still use a small amount of supervised data from the downstream task before it can serve applications.

Learn object detection and segmentation from videos

Based on this analysis of the shortcomings of contrastive learning, researchers at Microsoft Research Asia set out to design a self-supervised model that can be applied to downstream tasks directly, without fine-tuning. To achieve this, they turned to video for useful signals. Unlike computers learning image recognition tasks, humans learn from continuously changing temporal signals. A video contains a great deal of useful information that a still image cannot provide. For example, a video can capture the motion and deformation of an object, whereas in a static image dataset the same object is rarely captured multiple times. As another example, geometric methods can reconstruct the three-dimensional shape of an object from a video, which is difficult to recover from a single static image. The researchers therefore hope to analyze how objects move in video, and to use that motion to help detect the presence of objects and segment their shapes.

The view synthesis task

First, the researchers needed to find suitable free supervision in video for learning object detection and segmentation. One learning objective commonly used with video is the view synthesis task. Specifically, given two frames of a video, an initial frame and a target frame, the view synthesis task tries to learn a warping function that models the pixel-level reconstruction from the initial frame to the target frame. This seemingly simple task has rich applications. For example, if the warping function is represented by pixel-to-pixel correspondences, view synthesis enables self-supervised optical flow learning. As another example, if the camera parameters are available, view synthesis can be used for self-supervised monocular depth estimation. The key to realizing different self-supervised tasks is to find a suitable representation that can both complete the view synthesis task and serve the application of interest, such as optical flow or depth estimation. As a further example, prior work designed a new multi-plane image representation in order to perform stereo magnification of binocular images.

Figure 3: The view synthesis task can drive a new multi-plane image representation, which helps generate views under large baselines. The figure is taken from the paper "Stereo Magnification: Learning View Synthesis using Multiplane Images".
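As a concrete illustration of the warping function described above, here is a minimal sketch of backward warping with a dense flow field in PyTorch, followed by the photometric reconstruction objective that supervises the task. The function name, tensor shapes, and the simple L1 loss are assumptions for illustration, not a specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    """Backward-warp `src` (B, C, H, W) using a dense flow field (B, 2, H, W),
    returning the reconstructed target frame. Sketch only; shapes and
    conventions are illustrative assumptions."""
    _, _, h, w = src.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                           # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

# View synthesis objective (sketch): reconstruct frame j from frame i and
# compare it against the real observation -- no labels are needed.
# recon_j = warp_with_flow(frame_i, flow_i_to_j)
# loss = (recon_j - frame_j).abs().mean()
```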

The researchers hope to use the view synthesis task to achieve object detection and segmentation. The biggest difference from previous work is that they try to extract and learn mid-level and even high-level representations of the image, rather than only low-level ones. To this end, they designed a new model, AMD (Appearance-Motion Decomposition), to achieve zero-label object segmentation.

The related paper "The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos" has been accepted by NeurIPS 2021.

Paper link:
https://papers.nips.cc/paper/2021/file/6d9cb7de5e8ac30bd5e8734bc96a35c1-Paper.pdf

Segment flow and the AMD model

Figure 4 shows the basic architecture of the AMD model. The model consists of two sub-networks: an appearance pathway and a motion pathway. Given an input frame i, the appearance (shape) network segments it into several regions, three in this example. Given two consecutive input frames i and j, the motion network first extracts motion features that describe the spatial correspondence between them, and then estimates a single overall flow offset for each region predicted by the shape network.

Figure 4: The basic architecture of the AMD model. The lower branch is the shape network that predicts the segmentation, and the upper branch is the motion network that predicts the segment flow. The entire model is trained with the view synthesis task as its objective.

Here, the researchers apply the Gestalt principle of "common fate" and assume that all pixels in a region share a single flow vector. This assumption is a good approximation for the motion of rigid objects, but it does not hold for objects with complex deformations. From the predicted per-region flow and the corresponding regions, the researchers reconstruct a flow map. Because this flow is constrained by the segmentation, it has only very few degrees of freedom, so it is called segment flow. With this segment flow, frame i can be warped to frame j, and the reconstructed frame j can be compared with the actual observation to supervise the learning of the entire network.
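Below is a minimal sketch of how such a segment flow could be assembled from the soft region masks and the per-region flow offsets and plugged into the view-synthesis objective, reusing the warp_with_flow sketch above. The tensor names, shapes, and the simple L1 photometric loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def segment_flow(masks, region_flows):
    """Compose a low-degree-of-freedom flow map from soft segmentation masks
    and a single 2D flow vector per region (the 'common fate' assumption).

    masks:        (B, K, H, W) soft assignment of each pixel to K regions
    region_flows: (B, K, 2)    one flow offset per region
    returns:      (B, 2, H, W) segment flow
    """
    # Each pixel inherits the flow of the regions it belongs to,
    # weighted by its soft membership.
    return torch.einsum("bkhw,bkc->bchw", masks, region_flows)

# Training objective (sketch, reusing warp_with_flow from above):
# flow_ij = segment_flow(appearance_net(frame_i), motion_net(frame_i, frame_j))
# recon_j = warp_with_flow(frame_i, flow_ij)
# loss    = (recon_j - frame_j).abs().mean()
```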

The AMD model decomposes the appearance and motion information of a video, thereby enabling zero-label image segmentation. In terms of implementation, the shape network uses a standard ResNet-50, and the motion network uses the common PWC-Net. Both networks are trained from scratch without any pre-trained initialization. After pre-training, the shape branch can be applied directly to entirely new images for segmentation, without any fine-tuning. It is worth noting that training the AMD model does not require heavy image augmentation, which to some extent alleviates the reliance on augmentation seen in contrastive learning.
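For concreteness, here is a minimal sketch of what such an appearance pathway could look like: a ResNet-50 backbone with random initialization, followed by a 1x1 convolution head that predicts K soft region masks. The head design and the choice of K = 3 are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class AppearancePathway(nn.Module):
    """Shape/appearance branch: predicts K soft region masks for a single frame.
    Backbone is a randomly initialized ResNet-50 (trained from scratch); the
    1x1-conv head and K = 3 are illustrative assumptions."""
    def __init__(self, num_regions=3):
        super().__init__()
        backbone = resnet50(weights=None)  # no pre-trained initialization
        # Keep everything up to the last conv stage; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(2048, num_regions, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.features(x)                       # (B, 2048, H/32, W/32)
        logits = F.interpolate(self.head(feats), size=(h, w),
                               mode="bilinear", align_corners=False)
        return logits.softmax(dim=1)                   # (B, K, H, W) soft masks
```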

Figure 5: Comparison of optical flow and segment flow. Optical flow describes object motion with the single pixel as its basic unit, while segment flow describes motion with a local region as its basic unit. Because of its fine-grained description, optical flow varies greatly over time, and it is difficult to segment objects accurately from it. Although the researchers' segment flow sacrifices motion accuracy, it gains an understanding of object structure.

Downstream applications and experimental results

Without any fine-tuning, the AMD model can be applied to segmentation tasks such as salient object segmentation in images and moving object segmentation in videos. For image segmentation, only the appearance branch needs to be transferred. Figure 6 shows the segmentation results on the saliency detection dataset DUTS. The pre-trained model can not only detect and segment "movable objects", but also generalize to some static objects, such as sculptures, plates, benches, and trees.

Figure 6: Saliency detection results on DUTS
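As a usage sketch, applying the pre-trained appearance pathway to a new image requires nothing more than a forward pass. The checkpoint path and preprocessing below are hypothetical placeholders, and `AppearancePathway` refers to the illustrative sketch given earlier.

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # illustrative input size
    transforms.ToTensor(),
])

model = AppearancePathway(num_regions=3)
model.load_state_dict(torch.load("amd_appearance_pathway.pth"))  # hypothetical checkpoint
model.eval()

with torch.no_grad():
    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    masks = model(img)               # (1, K, H, W) soft region masks
    segmentation = masks.argmax(1)   # per-pixel region index, no fine-tuning needed
```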

For segmenting moving objects in video, both branches of the AMD model need to be transferred. For a test video, in order to make use of motion information, the researchers applied test-time adaptation: they continued to optimize the model on the test video with the same self-supervised view synthesis task, and evaluated AMD on three test datasets (the model had never seen the training sets of these datasets). The results show that AMD greatly outperforms existing methods on two of the datasets. Figure 7 shows the quantitative performance and visualization results.

Figure 7: Segmentation of moving objects in video. The top figure is a visual comparison, and the bottom table is a numerical comparison.
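Below is a minimal sketch of the test-time adaptation step described above: the pre-trained model is optimized for a few extra steps on the unlabeled test video with the same view-synthesis loss before predicting masks. The step count, learning rate, and frame-sampling scheme are assumptions, and `segment_flow` / `warp_with_flow` refer to the earlier sketches.

```python
import torch

def test_time_adapt(appearance_net, motion_net, test_frames, steps=100, lr=1e-4):
    """Adapt a pre-trained AMD-style model to one unlabeled test video by
    minimizing the same self-supervised view-synthesis loss. Hyperparameters
    are illustrative assumptions."""
    params = list(appearance_net.parameters()) + list(motion_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        # Sample a random pair of neighboring frames from the test video.
        i = torch.randint(0, len(test_frames) - 1, (1,)).item()
        frame_i, frame_j = test_frames[i], test_frames[i + 1]
        masks = appearance_net(frame_i)            # (B, K, H, W) soft masks
        flows = motion_net(frame_i, frame_j)       # (B, K, 2) per-region offsets
        recon_j = warp_with_flow(frame_i, segment_flow(masks, flows))
        loss = (recon_j - frame_j).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return appearance_net, motion_net
```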

Summary

This paper proposes and designs a zero-label self-supervised learning model that can be used in certain application scenarios without any fine-tuning. The work decouples the appearance and motion representations in video, enabling the model to segment and detect objects. The researchers hope this work will inspire more research on zero-label learning.

References

  1. Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
  2. Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838, 2019.
  3. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
  4. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.

Welcome to follow the Microsoft China MSDN subscription account for the latest releases!

