
From June 19 to 25, CVPR 2021 (the IEEE Conference on Computer Vision and Pattern Recognition), one of the world's top computer vision conferences, was held online. Despite the virtual format it remained extremely popular, and the enthusiasm of the participants was as hot as a summer day.

This year, the Alibaba Cloud Multimedia AI team (a joint team of Alibaba Cloud Video Cloud and the DAMO Academy vision team, hereinafter MMAI) competed in six tracks across four challenges: the large-scale human behavior understanding challenge ActivityNet, the largest spatio-temporal action localization challenge AVA-Kinetics, the ultra-large-scale temporal action detection challenge HACS, and the first-person human behavior understanding challenge EPIC-Kitchens. The team won five championships and one runner-up in one fell swoop, including defending its titles on the ActivityNet and HACS tracks for a second consecutive year!

An outstanding record in top challenges

The large-scale temporal action detection challenge ActivityNet was launched in 2016. Hosted by KAUST, Google, DeepMind, and others, it has now been held six times.

This challenge focuses on temporal action detection, testing an AI algorithm's ability to understand long videos, and is one of the most influential challenges in the field. Past contestants have come from many well-known institutions at home and abroad, including Microsoft, Baidu, Shanghai Jiao Tong University, Huawei, SenseTime, Peking University, and Columbia University.

This year, the Alibaba Cloud MMAI team won the challenge with an average mAP of 44.67%!

Figure 1 ActivityNet Challenge Certificate

The spatio-temporal action localization challenge AVA-Kinetics started in 2018 and has been held four times. Organized by Google, DeepMind, and Berkeley, it aims to identify atomic-level actions in videos along both the temporal and spatial dimensions.

Owing to its difficulty and practical value, it has attracted many top international universities and research institutions over the years, such as DeepMind, FAIR, SenseTime-CUHK, and Tsinghua University.

This year, the Alibaba Cloud MMAI team took first place with 40.67% mAP!

Figure 2 AVA-Kinetics Challenge Award Certificate

The action detection challenge HACS started in 2019 and is hosted by MIT. It is the largest challenge for temporal action detection and includes two tracks: fully-supervised action detection and weakly-supervised action detection.

Since its dataset is more than twice the size of ActivityNet's, it is very challenging. Past participating teams include Microsoft, Samsung, Baidu, Shanghai Jiao Tong University, SenseTime, and Xi'an Jiaotong University.

This year, the Alibaba Cloud MMAI team entered both tracks at the same time and won both championships, with average mAPs of 44.67% and 22.45% respectively!

Figure 3 Award certificates for the two tracks of the HACS Challenge

The first-person human action understanding challenge EPIC-Kitchens began in 2019 and has been held three times so far. Organized by the University of Bristol, it addresses the interactive understanding of human actions and target objects from a first-person (egocentric) perspective.

The participating teams over the years include Baidu, FAIR, NTU, NUS, Inria-Facebook, Samsung (SAIC-Cambridge), etc.

This year, the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, finishing champion and runner-up with an average mAP of 16.11% and an accuracy of 48.5% respectively!

Figure 4 EPIC-Kitchens Challenge Award Certificate

Key technology exploration behind the four major challenges

Behavior understanding faces four major challenges:

The first is the wide distribution of action durations, which range from 0.5 seconds to 400 seconds. Taking a 200-second test video as an example, sampling 15 frames per second yields 3,000 frames within which the algorithm must localize the action precisely.

Second, video backgrounds are complex: a video usually contains many irregular, non-target activities, which greatly increases the difficulty of action detection.

Third, intra-class variation is large: the visual appearance of the same action can change significantly with the performer, viewpoint, and environment.

Finally, in detecting human actions the algorithm also faces interference such as mutual occlusion between human bodies, insufficient video resolution, and varying lighting and viewing angles.

The team's excellent results in these challenges were achieved mainly thanks to the advanced technical framework EMC2 behind them, which explores the following core technologies:

(1) Strengthened optimization and training of the backbone network

The backbone network is one of the core elements of behavior understanding.

In this challenge, the Alibaba Cloud MMAI team mainly explored two directions: an in-depth study of the Video Transformer (ViViT), and the complementarity of Transformer and CNN heterogeneous models.

As the main backbone, ViViT is trained in two stages: pre-training and fine-tuning. During fine-tuning, the MMAI team thoroughly analyzed the influence of variables such as input size and data augmentation to find the best configuration for the task at hand.

In addition, given the complementarity of Transformer and CNN structures, architectures such as SlowFast and CSN were also used. Finally, through ensemble learning, classification accuracies of 48.5%, 93.6%, and 96.1% were achieved on EPIC-Kitchens, ActivityNet, and HACS respectively, a significant improvement over last year's championship results.
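
How such heterogeneous backbones might be ensembled can be illustrated with a minimal sketch; the weights, class count, and model set below are illustrative assumptions, not the team's actual configuration:

```python
import torch
import torch.nn.functional as F

def ensemble_predictions(logits_list, weights=None):
    """Weighted average of per-model softmax scores.

    logits_list: list of [batch, num_classes] tensors, one per backbone
                 (e.g., ViViT, SlowFast, CSN). The weights are illustrative;
                 in practice they would be tuned on a validation set.
    """
    if weights is None:
        weights = [1.0 / len(logits_list)] * len(logits_list)
    return sum(w * F.softmax(l, dim=-1) for w, l in zip(weights, logits_list))

# Usage: three hypothetical backbones scoring the same clip batch.
vivit_logits = torch.randn(4, 200)     # e.g., ActivityNet has 200 classes
slowfast_logits = torch.randn(4, 200)
csn_logits = torch.randn(4, 200)
scores = ensemble_predictions([vivit_logits, slowfast_logits, csn_logits],
                              weights=[0.5, 0.3, 0.2])
pred = scores.argmax(dim=-1)           # final class per clip
```

Averaging softmax probabilities rather than raw logits keeps differently calibrated backbones on a comparable scale before their votes are combined.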


Figure 5 ViViT structure and performance

(2) Entity spatio-temporal relationship modeling in video understanding

For spatio-temporal action detection tasks, learning the human-human, human-object, and human-scene relationships in a video through relation modeling is particularly important for correct action recognition, especially for interactive actions.

Therefore, in this challenge, Alibaba Cloud MMAI focused on modeling and analyzing these relationships.

Specifically, the people and objects in the video are first localized, and feature representations are extracted for each. To model different types of action relationships at a finer granularity, these features are combined with global video features in the spatio-temporal domain for feature enhancement. Transformer-based relation learning modules are then applied separately across the temporal and spatial domains, and weight sharing makes the relation learning invariant to the position of the associated regions.
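
A minimal sketch of such a Transformer-based relation module follows; the dimensions, depth, and exact attention layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    """Sketch of relation learning: person features (queries) attend to
    person, object, and global-context features (keys/values).
    Dimensions and depth are assumptions, not the team's exact design."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, persons, context):
        # persons: [B, P, dim] RoI features of detected people
        # context: [B, K, dim] object RoI features + global video features
        kv = self.norm1(torch.cat([persons, context], dim=1))
        x = persons + self.attn(self.norm1(persons), kv, kv)[0]
        x = x + self.ffn(self.norm2(x))
        return x  # relation-enhanced person features
```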

To further model long-range temporal correlations, the team built a two-stage temporal feature pool maintained both online and offline, merging feature information from before and after each video segment into the relation learning.
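
Such a feature pool could look roughly like the sketch below, assuming an offline store of precomputed per-window features plus an online FIFO queue for the video being processed; the window size and layout are illustrative:

```python
import collections
import torch

class TemporalFeatureBank:
    """Sketch of a two-stage long-term feature pool (assumed design):
    an offline dict of precomputed per-window features, plus an online
    FIFO queue holding the most recent windows of the current video."""

    def __init__(self, window=4, dim=512):
        self.offline = {}                          # video_id -> {t: [N, dim]}
        self.online = collections.deque(maxlen=window)
        self.window = window
        self.dim = dim

    def write(self, video_id, t, feats):
        self.offline.setdefault(video_id, {})[t] = feats.detach()
        self.online.append(feats.detach())         # streaming-side cache

    def read(self, video_id, t):
        """Gather features from the +/-window neighborhood around time t,
        giving the relation module context before and after the segment."""
        bank = self.offline.get(video_id, {})
        neighbors = [bank[s]
                     for s in range(t - self.window, t + self.window + 1)
                     if s in bank]
        if not neighbors:
            return torch.zeros(0, self.dim)
        return torch.cat(neighbors, dim=0)
```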

Finally, the relation-enhanced human features are fed to the action recognition task, and a decoupled learning method enables effective learning of difficult, small-sample categories under the long-tailed distribution of action classes.

Figure 6 Relationship modeling network

(3) Long-video understanding based on action proposal relation encoding

In many action understanding tasks, long video duration is one of the main challenges under limited computing budgets, and temporal relation learning is an important means of understanding long videos.

In EMC2, a module based on action proposal relation encoding is designed to improve the algorithm's long-range perception.

Specifically, a basic action detection network produces dense action proposals, where each proposal can be roughly regarded as a time interval in which a specific action instance occurs.

Then, based on the self-attention mechanism, these proposals are encoded along the temporal dimension so that each proposal can perceive global information and predict more accurate action boundaries. With this technology, EMC2 achieved championship results in temporal action detection on ActivityNet and other benchmarks.
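
A minimal sketch of such proposal relation encoding, assuming per-proposal features and a small Transformer encoder (the head layout and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ProposalRelationEncoder(nn.Module):
    """Sketch: self-attention over dense action proposals lets each
    proposal see global context before its boundaries and confidence
    are refined. Depth and dimensions are illustrative assumptions."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 3)  # start offset, end offset, score

    def forward(self, proposal_feats):
        # proposal_feats: [B, N, dim], one feature per candidate segment
        x = self.encoder(proposal_feats)
        return self.head(x)            # refined boundaries + confidence
```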

Figure 7 Action proposal relation encoding

(4) Network initialization training based on self-supervised learning

Initialization is an important part of deep network training and one of the main components of EMC2.

The Alibaba Cloud MMAI team designed MoSI, a self-supervised initialization method that trains a video model from static images.

MoSI mainly includes two components: pseudo-motion generation and static mask design.

First, pseudo video clips are generated from a static image by sliding a window in a specified direction and at a specified speed; a suitable mask is then designed so that only the motion pattern of a local region is retained, giving the network the ability to perceive local motion. During training, the model's optimization objective is to correctly predict the speed and direction of the input pseudo video.
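
Pseudo-motion generation can be sketched as follows; the crop size, speed, and clip length are illustrative assumptions, and the static mask component is omitted for brevity:

```python
import torch

def make_pseudo_clip(image, num_frames=8, crop=112, direction=(0, 1), speed=4):
    """Sketch of MoSI-style pseudo-motion generation (parameters assumed):
    slide a fixed-size crop across a static image so that the resulting
    clip exhibits a known direction and speed, which become the label.

    image: [C, H, W] tensor; direction: (dy, dx) step sign; speed: px/frame.
    """
    C, H, W = image.shape
    frames, y, x = [], 0, 0
    for _ in range(num_frames):
        frames.append(image[:, y:y + crop, x:x + crop])
        y = min(max(y + direction[0] * speed, 0), H - crop)
        x = min(max(x + direction[1] * speed, 0), W - crop)
    clip = torch.stack(frames, dim=1)      # [C, T, crop, crop]
    return clip, (direction, speed)        # clip + self-supervised target

# Usage: a clip that "pans" right at 4 px/frame over one image.
img = torch.rand(3, 256, 256)
clip, label = make_pseudo_clip(img, direction=(0, 1), speed=4)
```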

In this way, the trained model acquires the ability to perceive video motion. In the challenges, given the rule forbidding additional data, training MoSI on only a limited number of challenge video frames still brought significant performance improvements and ensured the quality of model training for each challenge.

Figure 8 MoSI training process and semantic analysis

"Video behavior analysis has always been considered a very challenging task, mainly due to the diversity of its content.

Although various advanced techniques in fundamental machine vision have been proposed, our innovations in this competition mainly include:
1) deep exploration of self-supervised learning and Transformer+CNN heterogeneous fusion;
2) continuous research on methods for modeling the relationships between different entities in video.
These explorations confirm the importance of current advanced technologies (such as self-supervised learning) for video content analysis.

In addition, our success also illustrates the important role of entity relationship modeling in video content understanding, which has not yet received enough attention from the industry," Alibaba senior researcher Jin Rong concluded.

Building multimedia AI cloud products on video understanding technology

On the foundation of EMC2, alongside its in-depth research on video understanding, the Alibaba Cloud MMAI team has actively pursued industrialization and launched a multimedia AI (MultiMedia AI) product: the Retina Video Cloud Multimedia AI Experience Center (click 👉 Multimedia AI Cloud Product Experience Center to try it).

The product implements core functions such as video search, moderation, structuring, and production. It processes millions of hours of video data every day, providing customers with core capabilities in application scenarios such as video search, video recommendation, video moderation, copyright protection, video cataloging, video interaction, and video-assisted production, greatly improving customers' work efficiency and traffic efficiency.

Figure 9 Multimedia AI products

Currently, multimedia AI cloud products have been deployed in the media, pan-entertainment, short-video, sports, and e-commerce industries:

1) In the media industry, it mainly supports the production workflows of leading customers such as CCTV and People's Daily, greatly improving production efficiency and reducing labor costs. In news production scenarios, for example, it has increased cataloging efficiency by 70% and search efficiency by 50%;

2) In the pan-entertainment and short-video industries, it mainly supports the group's business units Youku, Weibo, Qutoutiao, and other video businesses with video structuring, image/video moderation, video fingerprint search, and copyright traceability, helping to protect video copyrights and improve traffic distribution efficiency, handling hundreds of millions of calls per day on average;

3) In the sports industry, it supported the 21st FIFA World Cup by bringing together multi-modal information such as vision, motion, audio, and speech, realizing cross-modal analysis of live football streams with editing efficiency an order of magnitude higher than traditional methods;

4) In the e-commerce industry, it supports Taobao, Xianyu, and other business units with structuring of newly published videos and video/image moderation, and helps customers quickly generate short videos to improve distribution efficiency.

Figure 10 Multimedia AI's label recognition of the sports industry and the film and television industry


Figure 11 Multimedia AI's label recognition of the media industry and e-commerce industry

With the support of EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following advantages:

1) Multi-modal learning: uses massive multi-modal data such as video, audio, and text for cross-media understanding, integrating understanding/production systems with knowledge from different fields;

2) Lightweight customization: users can independently register the entities they need to recognize; the algorithm supports "plug and play" for newly added entity tags, and with only lightweight data a new category can approach the accuracy of known categories;

3) High performance: a self-developed high-performance audio/video codec library, deep learning inference engine, and GPU preprocessing library, with targeted optimization for the IO-intensive and compute-intensive characteristics of video scenarios, delivering nearly 10x performance improvements across different scenarios;

4) Versatility: multimedia AI cloud products have production deployments in the media, pan-entertainment, short-video, sports, and e-commerce industries.

"Video is very helpful to improve the easy-to-understand, easy-to-receive, and easy-to-dissemination of content. In the past few years, we have also seen all walks of life, and various scenarios are accelerating the process of content video. The demand is getting stronger and stronger. How to efficiently and high-quality produce videos that meet the needs of users has become a core issue. There are many detailed issues involved, such as the discovery of hot spots, the understanding of a large number of video materials, and multi-mode. Retrieval, template construction based on user portraits/scenes, etc., all require a large amount of reliance on the development of visual AI technology. The MMAI team combines industries and scenarios to continuously improve visual AI technology, and polishes and builds business-level multimedia based on this. AI cloud products enable high-quality and efficient production of videos, thereby effectively advancing the process of content video in various industries and scenarios." Cloud Video Cloud Director Bi Xuan commented.

At CVPR 2021, MMAI defeated many strong domestic and international opponents across multiple academic challenges and won several championships, a strong validation of its technology. Its cloud product, Multimedia AI, already serves leading customers in multiple industries and will continue to create application value across them.

👇Click to experience
Multimedia AI Cloud Product Experience Center: http://retina.aliyun.com

Source and open source address: https://github.com/alibaba-mmai-research/pytorch-video-understanding

References:

[1] Huang Z, Zhang S, Jiang J, et al. Self-supervised motion learning from static images. CVPR2021: 1276-1285.
[2] Arnab A, Dehghani M, Heigold G, et al. Vivit: A video vision transformer[J]. arXiv preprint arXiv:2103.15691, 2021.
[3] Feichtenhofer C, Fan H, Malik J, et al. Slowfast networks for video recognition. ICCV2019: 6202-6211.
[4] Tran D, Wang H, Torresani L, et al. Video classification with channel-separated convolutional networks. ICCV2019: 5552-5561.
[5] Lin T, Liu X, Li X, et al. Bmn: Boundary-matching network for temporal action proposal generation. ICCV2019: 3889-3898.
[6] Feng Y, Jiang J, Huang Z, et al. Relation Modeling in Spatio-Temporal Action Localization[J]. arXiv preprint arXiv:2106.08061, 2021.
[7] Qing Z, Huang Z, Wang X, et al. A Stronger Baseline for Ego-Centric Action Detection[J]. arXiv preprint arXiv:2106.06942, 2021.
[8] Huang Z, Qing Z, Wang X, et al. Towards training stronger video vision transformers for epic-kitchens-100 action recognition[J]. arXiv preprint arXiv:2106.05058, 2021.
[9] Wang X, Qing Z., et al. Proposal Relation Network for Temporal Action Detection[J]. arXiv preprint arXiv:2106.11812, 2021.
[10] Wang X, Qing Z., et al. Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling[J]. arXiv preprint arXiv:2106.11811, 2021.
[11] Qing Z, Huang Z, Wang X, et al. Exploring Stronger Feature for Temporal Action Localization

"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Technology Exchange Group, discuss audio and video technologies with the author, and get more industry latest information.
