5 championships and 1 runner-up! The Alibaba Cloud Multimedia AI team scores big again at CVPR 2021!


From June 19 to 25, CVPR 2021 (the Conference on Computer Vision and Pattern Recognition), one of the world's top computer vision conferences, was held online. Despite the virtual format, the event was as lively as ever, and the participants' enthusiasm was as hot as the summer days.

This year, the Alibaba Cloud Multimedia AI team (formed by Alibaba Cloud Video Cloud and the DAMO Academy vision team, hereinafter MMAI) competed in six tracks across the large-scale human action understanding challenge ActivityNet, the largest current spatio-temporal action localization challenge AVA-Kinetics, the ultra-large-scale temporal action detection challenge HACS, and the egocentric action understanding challenge EPIC-Kitchens, winning five championships and one runner-up in one fell swoop, including back-to-back championships on the ActivityNet and HACS tracks.

An outstanding record in top challenges

The large-scale temporal action detection challenge ActivityNet started in 2016. Hosted by KAUST, Google, DeepMind, and others, it has been held successfully six times so far.

The challenge focuses on temporal action detection, testing an AI algorithm's ability to understand long videos, and is one of the most influential challenges in the field. Past participants have come from well-known institutions in China and abroad, including Microsoft, Baidu, Shanghai Jiao Tong University, Huawei, Peking University, and Columbia University.

This year, the Alibaba Cloud MMAI team won the challenge with an Avg. mAP of 44.67%!


Figure 1 ActivityNet Challenge Certificate

The spatio-temporal action localization challenge AVA-Kinetics started in 2018 and has been held four times. Organized by Google, DeepMind, and Berkeley, it aims to recognize atomic-level actions in videos along both the temporal and spatial dimensions.

Owing to its difficulty and practical value, it has attracted many top international universities and research institutions over the years, such as DeepMind, FAIR, SenseTime-CUHK, and Tsinghua University.

This year, the Alibaba Cloud MMAI team took first place with 40.67% mAP!


Figure 2 AVA-Kinetics Challenge Award Certificate

The ultra-large-scale action detection challenge HACS started in 2019 and is hosted by MIT. It is the largest challenge for temporal action detection today, with two tracks: fully supervised action detection and weakly supervised action detection.

With more than twice the data of ActivityNet, it is highly challenging. Past participating teams include Microsoft, Samsung, Baidu, Shanghai Jiao Tong University, and Xi'an Jiaotong University.

This year, the Alibaba Cloud MMAI team entered both tracks at the same time and won both championships, with Avg. mAP of 44.67% and 22.45% respectively!


Figure 3 Award certificates for the two tracks of the HACS Challenge

The egocentric (first-person) action understanding challenge EPIC-Kitchens started in 2019 and has been held three times so far. Sponsored by the University of Bristol, it is dedicated to understanding the interaction between human actions and target objects from a first-person viewpoint.

Participating teams over the years include Baidu, FAIR, NTU, NUS, Inria-Facebook, and Samsung (SAIC-Cambridge).

This year, the Alibaba Cloud MMAI team entered the temporal action detection and action recognition tracks, winning the championship and runner-up of the two challenges with an Avg. mAP of 16.11% and an accuracy of 48.5% respectively!


Figure 4 EPIC-Kitchens Challenge Award Certificate

Exploration of key technologies for the four major challenges

Action understanding faces four major challenges:

First, action durations vary widely, from 0.5 seconds to 400 seconds. Taking a 200-second test video as an example, with 15 frames sampled per second, the algorithm must localize actions precisely within 3,000 frames.

Second, video backgrounds are complex: videos usually contain many irregular, non-target actions, which greatly increases the difficulty of action detection.

Third, intra-class variance is large: the visual appearance of the same action can change significantly with the individual, the viewpoint, and the environment.

Finally, when detecting human actions, the algorithm also has to cope with interference such as mutual occlusion between human bodies, insufficient video resolution, lighting, and viewing angles.

The team's excellent results in these challenges rest mainly on the support of its technology framework EMC2, which explores the following core technologies:

(1) Strengthened training and optimization of the backbone network

The backbone network is one of the core elements of action understanding.

In this challenge, the Alibaba Cloud MMAI team focused on two directions: an in-depth study of the video Transformer (ViViT), and the complementarity between Transformer and CNN heterogeneous models.

ViViT, the main backbone, is trained in two stages: pre-training and fine-tuning. During fine-tuning, the MMAI team thoroughly analyzed the influence of variables such as input size and data augmentation to find the best configuration for each task.

In addition, given the complementarity of Transformer and CNN structures, architectures such as SlowFast and CSN were also used. Through ensemble learning, classification performance of 48.5%, 93.6%, and 96.1% was finally achieved on EPIC-Kitchens, ActivityNet, and HACS respectively, a significant improvement over last year's championship results.
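For illustration, here is a minimal PyTorch sketch of this kind of heterogeneous late fusion, assuming each backbone (ViViT, SlowFast, CSN, ...) has already been trained and is exposed as a clip-level classifier; the uniform weights and the single shared input tensor are simplifying assumptions, not the team's actual competition configuration:

```python
import torch
import torch.nn as nn

class LateFusionEnsemble(nn.Module):
    """Average the softmax scores of heterogeneous video backbones.

    `models` is any list of clip-level classifiers mapping a tensor
    (B, C, T, H, W) to logits (B, num_classes). Uniform weights are
    a placeholder; in practice they would be tuned on validation data.
    """

    def __init__(self, models, weights=None):
        super().__init__()
        self.models = nn.ModuleList(models)
        self.weights = weights or [1.0 / len(models)] * len(models)

    @torch.no_grad()
    def forward(self, clip):
        probs = 0.0
        for w, m in zip(self.weights, self.models):
            # Fuse at the probability level so heterogeneous logit
            # scales (Transformer vs. CNN) do not dominate each other.
            probs = probs + w * m(clip).softmax(dim=-1)
        return probs
```

In reality each backbone typically consumes its own clip sampling and input resolution, so predictions are usually extracted per model and fused afterwards rather than computed from one shared tensor.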


Figure 5 ViViT structure and performance

(2) Spatio-temporal entity relation modeling for video understanding

For spatio-temporal action detection, learning the human-human, human-object, and human-scene relationships in a video through relation modeling is particularly important for correct action recognition, especially for interactive actions.

Therefore, in this challenge, Alibaba Cloud MMAI focused on modeling and analyzing these relationships.

Specifically, people and objects are first localized in the video and their feature representations are extracted separately. To model different types of action relationships at a finer granularity, these features are fused with global video features over the spatio-temporal domain for enhancement. A Transformer-based relation learning module is then applied separately across the temporal and spatial domains; by sharing weights, relation learning at different positions achieves position invariance over the associated regions.

To further model long-range temporal correlations, the team built a two-stage temporal feature bank, maintained both online and offline, and fed the feature information before and after the current video segment into the relation learning.

Finally, the human features produced by relation learning are used for the action recognition task. A decoupled learning method enables effective learning of hard and few-sample categories under the long-tailed distribution of action classes.
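As a rough illustration of the relation-learning step, here is a minimal PyTorch sketch in which actor features attend over context tokens (object features, global video features, feature-bank entries) through a shared attention block; the module name, dimensions, and layout are assumptions for illustration, not the team's implementation (reference [6] describes the real one):

```python
import torch
import torch.nn as nn

class ActorContextRelation(nn.Module):
    """Hypothetical sketch: each actor feature attends over context
    tokens with a shared Transformer-style attention block, so the
    same weights apply at every spatial/temporal position."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, actors, context):
        # actors: (B, Na, dim) detected-person features;
        # context: (B, Nc, dim) object/global/feature-bank tokens.
        x, _ = self.attn(actors, context, context)
        actors = self.norm1(actors + x)
        return self.norm2(actors + self.ffn(actors))
```

The relation-enhanced actor features returned here would then feed the per-actor action classification head.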


Figure 6 Relationship modeling network

(3) Long video understanding based on action proposal relation encoding

For many action understanding tasks under limited computing budgets, long video duration is one of the main challenges, and temporal relation learning is an important means of understanding long videos.

In EMC2, a module based on action proposal relation encoding is designed to improve the algorithm's long-range perception. Specifically, a basic action detection network produces dense action proposals, where each proposal can be roughly regarded as the time interval in which a specific action instance occurs.

These proposals are then encoded along the temporal dimension with a self-attention mechanism, so that each proposal perceives global information and can predict more accurate action boundaries. With this technique, EMC2 won the temporal action detection championships on ActivityNet and other benchmarks.
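A minimal sketch of what such proposal relation encoding could look like in PyTorch; the class, layer counts, and the two prediction heads are illustrative assumptions rather than the actual EMC2 module (reference [9] describes the real one):

```python
import torch
import torch.nn as nn

class ProposalRelationEncoder(nn.Module):
    """Illustrative sketch: encode dense action proposals with
    self-attention so each proposal sees global video context
    before its confidence and boundaries are predicted."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)   # proposal confidence
        self.offset_head = nn.Linear(dim, 2)  # start/end refinement

    def forward(self, proposals):
        # proposals: (B, N, dim) features of N proposals per video.
        x = self.encoder(proposals)
        return self.score_head(x).squeeze(-1), self.offset_head(x)
```

Because attention runs over proposals rather than raw frames, even a very long video is reduced to a short token sequence, which is what makes the long-range modeling affordable.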


Figure 7 Action proposal relation encoding

(4) Network initialization training based on self-supervised learning

Initialization is an important process of deep network training and one of the main components of EMC2.

The Alibaba Cloud MMAI team designed MoSI, a self-supervised initialization method that trains video models from static images. MoSI has two main components: pseudo-motion generation and static mask design.

First, pseudo video clips are generated from a static image with a sliding window that moves in a specified direction at a specified speed; a suitable mask then keeps only the motion pattern of a local region, giving the network the ability to perceive local motion. During training, the model's optimization objective is to correctly predict the speed and direction of the input pseudo video.

A model trained this way acquires the ability to perceive video motion. In the challenges, to respect the rule against using extra data, MoSI was trained only on a limited number of challenge video frames, which still brought a significant performance gain and safeguarded model quality across the challenges.
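To make the pseudo-motion idea concrete, here is a hypothetical PyTorch sketch of clip generation from a static image; the function name and parameters are invented for illustration, and the static mask design described above is omitted (reference [1] gives the full recipe):

```python
import torch

def pseudo_motion_clip(image, dx, dy, speed, num_frames=8, crop=112):
    """Hypothetical MoSI-style pseudo-motion generation: slide a
    fixed-size crop window across a static image so the resulting
    clip moves with a known direction (dx, dy in {-1, 0, 1}) and
    speed (pixels per frame), which the model then learns to predict.

    image: (C, H, W) tensor. Returns a clip of shape (C, T, crop, crop).
    """
    c, h, w = image.shape
    span = speed * (num_frames - 1)
    assert h >= crop + span and w >= crop + span, "image too small"
    # Pick the start corner so the whole trajectory stays in bounds.
    x0 = 0 if dx >= 0 else span
    y0 = 0 if dy >= 0 else span
    frames = []
    for t in range(num_frames):
        x, y = x0 + dx * speed * t, y0 + dy * speed * t
        frames.append(image[:, y:y + crop, x:x + crop])
    return torch.stack(frames, dim=1)

# e.g. a clip that "pans" right at 4 px/frame; the self-supervised
# training label for this sample is the (direction, speed) pair.
clip = pseudo_motion_clip(torch.rand(3, 256, 256), dx=1, dy=0, speed=4)
```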


Figure 8 MoSI training process and semantic analysis

"Video behavior analysis has always been considered a very challenging task, mainly due to the diversity of its content.

Although many advanced techniques have been proposed in fundamental machine vision, our innovations in this competition were mainly: 1) deep exploration of self-supervised learning and Transformer+CNN heterogeneous fusion; 2) continued research on methods for modeling the relationships between different entities in videos.

These explorations confirm the importance of current advanced technologies (such as self-supervised learning) for video content analysis.

In addition, our success also shows the important role that entity relationship modeling plays in video content understanding, a role that has not yet received enough attention from the industry," concluded Jin Rong, a senior researcher at Alibaba.

Building multimedia AI cloud products on video understanding technology

Building on the EMC2 technology base, alongside its in-depth research into video understanding, the Alibaba Cloud MMAI team has actively pursued industrialization and launched a multimedia AI (MultiMedia AI) technology product: the Retina Video Cloud Multimedia AI Experience Center (click 👉 Multimedia AI Cloud Product Experience Center to try it).

The product implements core functions such as video search, moderation, structuring, and production. It processes millions of hours of video data every day, providing customers with core capabilities in application scenarios such as video search, video recommendation, video moderation, copyright protection, video cataloging, video interaction, and video-assisted production, greatly improving customers' work and traffic efficiency.

Figure 9 Multimedia AI products

Multimedia AI cloud products have already been deployed in the media, pan-entertainment, short video, sports, and e-commerce industries:

1) In the media industry, it supports the production workflows of leading customers such as CCTV and People's Daily, greatly improving production efficiency and reducing labor costs. In news production scenarios, for example, it has raised cataloging efficiency by 70% and search efficiency by 50%;

2) In the pan-entertainment and short video industries, it supports the group's business units such as Youku, Weibo, and Qutoutiao with video structuring, image/video moderation, video fingerprint search, and copyright tracing, helping to protect video copyrights and improve traffic distribution efficiency, serving hundreds of millions of calls per day on average;

3) In the sports industry, it supported the 21st FIFA World Cup, combining multi-modal information such as vision, motion, audio, and speech to realize cross-modal analysis of live football broadcasts, improving efficiency by an order of magnitude over traditional editing;

4) In the e-commerce industry, it supports business units such as Taobao and Xianyu, providing structuring of new videos and video/image moderation, and helping customers quickly generate short videos to improve distribution efficiency.


Figure 10 Multimedia AI label recognition in the sports and film/TV industries


Figure 11 Multimedia AI label recognition in the media and e-commerce industries

With the support of EMC2, the Retina Video Cloud Multimedia AI Experience Center has the following advantages:

1) Multi-modal learning: uses massive multi-modal data such as video, audio, and text for cross-media understanding, and integrates knowledge from different domains into its understanding/production systems;

2) Lightweight customization: users can register the entities they need to recognize on their own; the algorithm offers "plug and play" support for newly added entity tags, and new categories trained with lightweight data can approach the accuracy of known categories;

3) High performance: a self-developed high-performance audio/video codec library, deep learning inference engine, and GPU preprocessing library, with targeted optimization for the IO- and compute-intensive characteristics of video workloads, yielding nearly 10x performance improvements across scenarios;

4) Proven deployments: multimedia AI cloud products have production use cases in the media, pan-entertainment, short video, sports, and e-commerce industries.

"Video is very helpful to improve the easy-to-understand, easy-to-receive, and easy-to-dissemination of content. In the past few years, we have also seen all walks of life, and various scenarios are accelerating the process of content video. The demand is getting stronger and stronger. How to efficiently and high-quality produce videos that meet the needs of users has become a core issue. There are many detailed issues involved, such as the discovery of hot spots, the understanding of a large number of video materials, and multi-mode. Retrieval, template construction based on user portraits/scenes, etc., all require a large amount of reliance on the development of visual AI technology. The MMAI team combines industries and scenarios to continuously improve visual AI technology, and polishes and builds business-level multimedia based on this. AI cloud products enable high-quality and efficient production of videos, thereby effectively advancing the process of content video in various industries and scenes." Cloud Video Cloud Director Bi Xuan commented.

At CVPR 2021, MMAI defeated many strong domestic and international rivals across multiple academic challenges and took home multiple championships, strongly validating its technology. Its multimedia AI cloud product already serves leading customers in multiple industries and will continue to create application value across them.

👇Click to experience

Multimedia AI Cloud Product Experience Center: http://retina.aliyun.com

Source code open source address: https://github.com/alibaba-mmai-research/pytorch-video-understanding

References:

[1] Huang Z, Zhang S, Jiang J, et al. Self-Supervised Motion Learning from Static Images. CVPR 2021: 1276-1285.

[2] Arnab A, Dehghani M, Heigold G, et al. ViViT: A Video Vision Transformer. arXiv preprint arXiv:2103.15691, 2021.

[3] Feichtenhofer C, Fan H, Malik J, et al. SlowFast Networks for Video Recognition. ICCV 2019: 6202-6211.

[4] Tran D, Wang H, Torresani L, et al. Video Classification with Channel-Separated Convolutional Networks. ICCV 2019: 5552-5561.

[5] Lin T, Liu X, Li X, et al. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. ICCV 2019: 3889-3898.

[6] Feng Y, Jiang J, Huang Z, et al. Relation Modeling in Spatio-Temporal Action Localization. arXiv preprint arXiv:2106.08061, 2021.

[7] Qing Z, Huang Z, Wang X, et al. A Stronger Baseline for Ego-Centric Action Detection. arXiv preprint arXiv:2106.06942, 2021.

[8] Huang Z, Qing Z, Wang X, et al. Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition. arXiv preprint arXiv:2106.05058, 2021.

[9] Wang X, Qing Z, et al. Proposal Relation Network for Temporal Action Detection. arXiv preprint arXiv:2106.11812, 2021.

[10] Wang X, Qing Z, et al. Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling. arXiv preprint arXiv:2106.11811, 2021.

[11] Qing Z, Huang Z, Wang X, et al. Exploring Stronger Feature for Temporal Action Localization.

"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Technology Exchange Group, discuss audio and video technologies with the author, and get more industry latest information.

