
(Reproduced from Microsoft Research AI headlines)

Editor's note: As research on self-supervised learning deepens, the transfer-learning paradigm has been widely adopted across visual learning: a large number of vision tasks are deployed through self-supervised pre-training followed by supervised fine-tuning. Researchers from Microsoft Research Asia hope to break this paradigm. In a paper published at NeurIPS 2021, they propose a model that learns object detection and segmentation from unlabeled videos, so that the self-supervised pre-trained model can serve applications directly, without any supervised fine-tuning, achieving zero-label learning.

Contrastive learning is currently the mainstream method for training visual self-supervised models. Its core idea is to treat each individual sample in the training set as its own category and to design the pre-training task as the discrimination of individual instances. Since each category contains only one sample, instance discrimination by itself would be trivially easy, so researchers use data augmentation to create rich in-class samples for each instance. For images, common augmentations include translation, scaling, flipping, color-contrast and color changes, blurring, and grayscale transformation. Although these augmentations change the details of an image, they do not change its semantic content; in effect, contrastive learning learns feature representations that are invariant to these augmentation transforms. Experiments show that contrastive learning depends heavily on data augmentation.

Figure 1: Contrastive learning relies heavily on image augmentation to learn invariance. Commonly used augmentations include translation, scaling, color enhancement, and local blurring.
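To make the role of augmentation concrete, below is a minimal, illustrative sketch of the kind of augmentation pipeline used in contrastive pre-training (e.g., instance discrimination). The specific transforms and parameter values are assumptions for demonstration, not taken from any particular paper.

```python
from torchvision import transforms

# Illustrative contrastive-learning augmentation pipeline.
# Transforms and parameters below are assumptions for demonstration only.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),        # translation + scaling
    transforms.RandomHorizontalFlip(),                          # flipping
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),                          # grayscale transformation
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # local blurring
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Create two augmented 'in-class' views of the same image; a contrastive
    loss then pulls their feature representations together."""
    return augment(pil_image), augment(pil_image)
```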

As a pre-training method, contrastive learning only learns a feature representation, and this representation still requires a (small) amount of supervised downstream data for fine-tuning before it can be applied to downstream tasks. Although the pre-trained representation can greatly improve fine-tuning performance on downstream tasks, this reliance on fine-tuning is itself a shortcoming of the self-supervised model.

Figure 2: The transfer-learning framework: general pre-training + task-specific fine-tuning. Self-supervised learning has become a powerful pre-training method, but it must still use a small amount of supervised data from the downstream task before it can serve applications.

Learn object detection and segmentation from videos

Based on this analysis of the shortcomings of contrastive learning, researchers at Microsoft Research Asia set out to design a self-supervised model that can be applied to downstream tasks directly, without fine-tuning. To achieve this, they turned to video for useful signals. Unlike computers learning image recognition tasks, humans learn from continuously changing temporal signals. A video contains a great deal of useful information that a still image cannot provide. For example, a video can capture the motion and deformation of an object, whereas in a static image dataset the same object is rarely captured multiple times. As another example, geometric methods can reconstruct the three-dimensional shape of an object from a video, which is difficult to recover from a single static image. The researchers therefore hope to analyze how objects move in video, and to use that motion to help detect the presence of objects and segment their shapes.

The view synthesis task

First, the researchers needed to find suitable free supervision in video for learning object detection and segmentation. One learning objective commonly used with video is the view synthesis task. Specifically, given two frames of a video, an initial frame and a target frame, the view synthesis task tries to learn a warping function that models the pixel-level reconstruction from the initial frame to the target frame. This seemingly simple task has rich applications. For example, if the warping function is represented by pixel-to-pixel correspondences, view synthesis enables self-supervised optical flow learning. As another example, if the camera parameters are available, view synthesis can be used for self-supervised monocular depth estimation. The key to realizing different self-supervised tasks is to find a suitable representation that can both complete the view synthesis task and serve the application of interest, such as optical flow or depth estimation. As a further example, prior work designed a new multi-plane image representation in order to perform stereo magnification of binocular images.

Figure 3: The view synthesis task can drive a new multi-plane image representation, which helps generate views under large baselines. The figure is taken from the paper "Stereo Magnification: Learning View Synthesis using Multiplane Images".
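As a concrete illustration of the warping function described above, here is a minimal sketch of backward warping with a dense flow field in PyTorch, followed by the photometric reconstruction objective that supervises the task. The function name, tensor shapes, and the simple L1 loss are assumptions for illustration, not a specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    """Backward-warp `src` (B, C, H, W) using a dense flow field (B, 2, H, W),
    returning the reconstructed target frame. Sketch only; shapes and
    conventions are illustrative assumptions."""
    _, _, h, w = src.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                           # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

# View synthesis objective (sketch): reconstruct frame j from frame i and
# compare it against the real observation -- no labels are needed.
# recon_j = warp_with_flow(frame_i, flow_i_to_j)
# loss = (recon_j - frame_j).abs().mean()
```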

The researchers hope to use the view synthesis task to achieve object detection and segmentation. The biggest difference from previous work is that they try to extract and learn mid-level and even high-level representations of the image, rather than only low-level ones. To this end, they designed a new model, AMD (Appearance-Motion Decomposition), to achieve zero-label object segmentation.

The related paper "The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos" has been accepted by NeurIPS 2021.

Paper link:
https://papers.nips.cc/paper/2021/file/6d9cb7de5e8ac30bd5e8734bc96a35c1-Paper.pdf

Segment flow and the AMD model

Figure 4 shows the basic architecture of the AMD model. The model consists of two sub-networks: an appearance pathway and a motion pathway. Given an input frame i, the appearance (shape) network segments it into several regions, three in this example. Given two consecutive input frames i and j, the motion network first extracts motion features that describe the spatial correspondence between them, and then estimates a single overall flow offset for each region predicted by the shape network.

Figure 4: The basic architecture of the AMD model. The lower branch is the shape network that predicts the segmentation, and the upper branch is the motion network that predicts the segment flow. The entire model is trained with the view synthesis task as its objective.

Here, the researchers apply the Gestalt principle of "common fate" and assume that all pixels in a region share a single flow vector. This assumption is a good approximation for the motion of rigid objects, but it does not hold for objects with complex deformations. From the predicted per-region flow and the corresponding regions, the researchers reconstruct a flow map. Because this flow is constrained by the segmentation, it has only very few degrees of freedom, so it is called segment flow. With this segment flow, frame i can be warped to frame j, and the reconstructed frame j can be compared with the actual observation to supervise the learning of the entire network.
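Below is a minimal sketch of how such a segment flow could be assembled from the soft region masks and the per-region flow offsets and plugged into the view-synthesis objective, reusing the warp_with_flow sketch above. The tensor names, shapes, and the simple L1 photometric loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def segment_flow(masks, region_flows):
    """Compose a low-degree-of-freedom flow map from soft segmentation masks
    and a single 2D flow vector per region (the 'common fate' assumption).

    masks:        (B, K, H, W) soft assignment of each pixel to K regions
    region_flows: (B, K, 2)    one flow offset per region
    returns:      (B, 2, H, W) segment flow
    """
    # Each pixel inherits the flow of the regions it belongs to,
    # weighted by its soft membership.
    return torch.einsum("bkhw,bkc->bchw", masks, region_flows)

# Training objective (sketch, reusing warp_with_flow from above):
# flow_ij = segment_flow(appearance_net(frame_i), motion_net(frame_i, frame_j))
# recon_j = warp_with_flow(frame_i, flow_ij)
# loss    = (recon_j - frame_j).abs().mean()
```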

The AMD model decomposes the appearance and motion information of a video, thereby enabling zero-label image segmentation. In terms of implementation, the shape network uses a standard ResNet-50, and the motion network uses the common PWC-Net. Both networks are trained from scratch without any pre-trained initialization. After pre-training, the shape branch can be applied directly to entirely new images for segmentation, without any fine-tuning. It is worth noting that training the AMD model does not require heavy image augmentation, which to some extent alleviates the reliance on augmentation seen in contrastive learning.
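For concreteness, here is a minimal sketch of what such an appearance pathway could look like: a ResNet-50 backbone with random initialization, followed by a 1x1 convolution head that predicts K soft region masks. The head design and the choice of K = 3 are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class AppearancePathway(nn.Module):
    """Shape/appearance branch: predicts K soft region masks for a single frame.
    Backbone is a randomly initialized ResNet-50 (trained from scratch); the
    1x1-conv head and K = 3 are illustrative assumptions."""
    def __init__(self, num_regions=3):
        super().__init__()
        backbone = resnet50(weights=None)  # no pre-trained initialization
        # Keep everything up to the last conv stage; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(2048, num_regions, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.features(x)                       # (B, 2048, H/32, W/32)
        logits = F.interpolate(self.head(feats), size=(h, w),
                               mode="bilinear", align_corners=False)
        return logits.softmax(dim=1)                   # (B, K, H, W) soft masks
```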

Figure 5: Comparison of optical flow and segment flow. Optical flow describes object motion with the single pixel as its basic unit, while segment flow describes motion with a local region as its basic unit. Because of its fine-grained description, optical flow varies greatly over time, and it is difficult to segment objects accurately from it. Although the researchers' segment flow sacrifices motion accuracy, it gains an understanding of object structure.

Downstream applications and experimental results

Without any fine-tuning, the AMD model can be applied to segmentation tasks such as salient object segmentation in images and moving object segmentation in videos. For image segmentation, only the appearance branch needs to be transferred. Figure 6 shows the segmentation results on the saliency detection dataset DUTS. The pre-trained model can not only detect and segment "movable objects", but also generalize to some static objects, such as sculptures, plates, benches, and trees.

Figure 6: Saliency detection results on DUTS
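As a usage sketch, applying the pre-trained appearance pathway to a new image requires nothing more than a forward pass. The checkpoint path and preprocessing below are hypothetical placeholders, and `AppearancePathway` refers to the illustrative sketch given earlier.

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # illustrative input size
    transforms.ToTensor(),
])

model = AppearancePathway(num_regions=3)
model.load_state_dict(torch.load("amd_appearance_pathway.pth"))  # hypothetical checkpoint
model.eval()

with torch.no_grad():
    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    masks = model(img)               # (1, K, H, W) soft region masks
    segmentation = masks.argmax(1)   # per-pixel region index, no fine-tuning needed
```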

For segmenting moving objects in video, both branches of the AMD model need to be transferred. For a test video, in order to make use of motion information, the researchers applied test-time adaptation: they continued to optimize the model on the test video with the same self-supervised view synthesis task, and evaluated AMD on three test datasets (the model had never seen the training sets of these datasets). The results show that AMD greatly outperforms existing methods on two of the datasets. Figure 7 shows the quantitative performance and visualization results.

Figure 7: Segmentation of moving objects in video. The top figure is a visual comparison, and the bottom table is a numerical comparison.
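Below is a minimal sketch of the test-time adaptation step described above: the pre-trained model is optimized for a few extra steps on the unlabeled test video with the same view-synthesis loss before predicting masks. The step count, learning rate, and frame-sampling scheme are assumptions, and `segment_flow` / `warp_with_flow` refer to the earlier sketches.

```python
import torch

def test_time_adapt(appearance_net, motion_net, test_frames, steps=100, lr=1e-4):
    """Adapt a pre-trained AMD-style model to one unlabeled test video by
    minimizing the same self-supervised view-synthesis loss. Hyperparameters
    are illustrative assumptions."""
    params = list(appearance_net.parameters()) + list(motion_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        # Sample a random pair of neighboring frames from the test video.
        i = torch.randint(0, len(test_frames) - 1, (1,)).item()
        frame_i, frame_j = test_frames[i], test_frames[i + 1]
        masks = appearance_net(frame_i)            # (B, K, H, W) soft masks
        flows = motion_net(frame_i, frame_j)       # (B, K, 2) per-region offsets
        recon_j = warp_with_flow(frame_i, segment_flow(masks, flows))
        loss = (recon_j - frame_j).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return appearance_net, motion_net
```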

Summary

This paper proposes and designs a zero-label self-supervised learning model that can be used in certain application scenarios without any fine-tuning. The work decouples the appearance and motion representations in video, enabling the model to segment and detect objects. The researchers hope this work will inspire more research on zero-label learning.

References

  1. Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
  2. Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838, 2019.
  3. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
  4. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.

Welcome to follow the Microsoft China MSDN subscription account for the latest releases!

