
1. Introduction

From October 11 to 17, the much-anticipated International Conference on Computer Vision (ICCV 2021) was held online as scheduled, drawing extensive attention from computer vision researchers around the world.
This year, the Alibaba Cloud Multimedia AI team (composed of Alibaba Cloud Video Cloud and the DAMO Academy vision team) participated in the MFR Masked Face Recognition Challenge and won 1 first place, 1 second place, and 2 third places across a total of 5 tracks, demonstrating our deep technical accumulation and industry-leading strength in the field of face recognition.

2. Competition overview

The MFR Masked Face Recognition Challenge is a global competition jointly organized by Imperial College London, Tsinghua University, and InsightFace.AI. Its main goal is to address the challenge that mask wearing during the COVID-19 pandemic poses to face recognition algorithms. The competition ran from June 1 to October 11, lasting more than 4 months and attracting nearly 400 teams from around the world, making it the largest and most widely attended authoritative event in the field of face recognition to date. According to official statistics, more than 10,000 submissions were received, and competition among the teams was extremely fierce.

2.1 Training data set

To ensure a fair comparison of algorithms, training is restricted to the three officially provided datasets; no additional datasets or pre-trained models are allowed. The three official datasets are ms1m (small scale), glint360k (medium scale), and webface260m (large scale). The number of person IDs and images in each dataset is shown in the following table:

[Table: number of person IDs and images in each training dataset]

2.2 Evaluation data set

The evaluation data for this competition contains on the order of trillions of positive and negative sample pairs, making it the largest and most comprehensive authoritative evaluation set in the industry. Notably, none of the evaluation data is publicly released; only a backend interface for automatic evaluation is provided, in order to prevent algorithms from overfitting the test set.
The detailed statistics of the InsightFace track evaluation data set are shown in the following table:

[Table: InsightFace track evaluation dataset statistics]

The detailed statistics of the WebFace260M track evaluation data set are shown in the following table:

[Table: WebFace260M track evaluation dataset statistics]

2.3 Evaluation metrics

The evaluation criteria for this competition include not only accuracy metrics but also constraints on feature dimension and inference time, making them closer to real business scenarios. The detailed evaluation metrics are shown in the following tables:

[Tables: detailed evaluation metrics and constraints for each track]

3. Solution

Below, we break down our solution in terms of data, model, loss function, and other aspects.

3.1 Data cleaning based on self-learning

As is well known, noise is widespread in face recognition training data: images of the same person may be scattered across different person IDs, and images of multiple people may be mixed under a single person ID. Such noise has a significant impact on recognition performance. To address this, we propose a self-learning-based data cleaning framework, shown in the following figure:

[Figure: self-learning-based data cleaning framework]

First, we train an initial model M0 on the raw data, and then use this model for a series of operations: feature extraction, ID merging, inter-class cleaning, and intra-class cleaning. For each person ID, we use the DBSCAN clustering algorithm to compute a center feature, which is then used for similarity retrieval. The high-dimensional vector retrieval engine used in this step is Proxima, developed in-house by DAMO Academy, which quickly and accurately recalls the top-K documents most similar to each query. We then train a new model M1 on the cleaned dataset and repeat the cleaning and training process. Through this iterative self-learning, the data quality becomes higher and higher, and the model becomes stronger and stronger. Schematic diagrams of inter-class and intra-class cleaning are shown in the following figure:

[Figure: inter-class cleaning and intra-class cleaning]
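
As an illustration of the intra-class step, here is a minimal sketch of the idea, assuming L2-normalized embeddings and using scikit-learn's DBSCAN plus brute-force similarity in place of our internal tooling (the actual pipeline also performs ID merging and inter-class cleaning via Proxima retrieval, which is omitted here):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clean_identity(features, eps=0.5, min_samples=2, sim_threshold=0.25):
    """Intra-class cleaning for one person ID.

    features: (N, D) L2-normalized embeddings of the images under this ID.
    Returns the indices of images kept after cleaning."""
    # Cluster with cosine distance; DBSCAN marks outliers with label -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(features)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return np.arange(len(features))      # too few consistent samples to judge; keep all
    # Use the largest cluster as the identity's core and its mean as the center feature.
    main_cluster = np.bincount(valid).argmax()
    center = features[labels == main_cluster].mean(axis=0)
    center /= np.linalg.norm(center) + 1e-12
    # Drop images whose similarity to the center falls below the cleaning threshold.
    sims = features @ center
    return np.where(sims >= sim_threshold)[0]
```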

It is worth noting that in our pipeline we perform inter-class cleaning before intra-class cleaning, which differs from the CAST [1] data cleaning framework. This allows the new ID center features to be updated after inter-class cleaning has completed, making the whole cleaning process more complete and more effective. To verify the impact of data cleaning on final performance, we ran a series of comparative experiments on the ms1m dataset; the results are shown in the following table:

[Table: effect of the intra-class cleaning threshold on performance (ms1m)]

The threshold in the table is the similarity threshold used for intra-class cleaning. When the threshold is set too low (e.g., 0.05), the noise is not removed and performance is suboptimal; when it is set too high (e.g., 0.50), hard samples are removed along with the noise, which weakens the model's generalization ability and degrades performance on the evaluation sets. We therefore chose an intermediate threshold of 0.25, which removes much of the noise while retaining hard samples and achieves the best performance on all evaluation metrics. In addition, we plotted the relationship between the similarity threshold and the number of remaining images, as shown in the following figure:

[Figure: similarity threshold vs. number of remaining images]

3.2 Mask data generation

To address the shortage of mask-wearing data, a feasible solution is to draw masks onto existing unmasked images. However, most existing drawing schemes simply paste the mask at an estimated position, producing images that are not realistic enough and lack flexibility. We therefore borrow the idea of PRNet [2, 3] and adopt an image fusion scheme [4] to obtain mask images that better match real conditions, as shown in the following figure:

[Figure: mask data generation pipeline based on UV texture mapping]

The idea is to generate a UV texture map from the mask image and the original image via 3D reconstruction, and then synthesize the masked image in texture space. In the data generation process we used 8 types of masks, which means we can generate 8 different styles of masked images from the existing datasets. Compared with traditional planar projection, the UV-mapping-based scheme avoids unnatural seams and distortion between the original image and the mask. In addition, because a rendering step is involved, the masked image can be given different rendering effects, such as varying the mask angle and the lighting. Examples of the generated mask-wearing images are shown in the following figure:

[Figure: examples of generated mask-wearing images]
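
For illustration, the texture-space fusion step itself can be as simple as an alpha blend; the sketch below assumes the UV texture maps of the face and of the mask, plus a mask opacity map, have already been produced by a PRNet-style 3D reconstruction (not shown), and the blended texture is then rendered back onto the original image via the reconstructed geometry:

```python
import numpy as np

def blend_mask_in_uv(face_uv, mask_uv, mask_alpha):
    """Alpha-blend a mask texture onto a face texture in UV space.

    face_uv, mask_uv: (H, W, 3) float32 textures in [0, 1];
    mask_alpha:       (H, W) float32 opacity of the mask in UV space."""
    alpha = mask_alpha[..., None]
    return face_uv * (1.0 - alpha) + mask_uv * alpha
```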

While training models with the generated mask-wearing data, we found that the proportion of masked samples affects model performance to varying degrees. We therefore set the masked-sample ratio to 5%, 10%, 15%, 20%, and 25% respectively; the experimental results are shown in the following table:

[Table: effect of the masked-sample ratio on performance]

The table shows that with a 5% masked-sample ratio, the model achieves the highest performance on the MR-ALL evaluation set; when the ratio is raised to 25%, performance on the Mask evaluation set improves significantly, but performance on MR-ALL drops noticeably. This shows that when masked and normal data are mixed for training, the ratio is an important parameter affecting model performance. In the end we chose a masked-sample ratio of 15%, which strikes a good balance between masked and normal performance.
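
A minimal sketch of how such a ratio might be applied during training; the `add_synthetic_mask` helper is hypothetical and stands in for the UV-texture-based generator described above:

```python
import random

class MaskAugmentedDataset:
    """Wraps a face dataset and masks a fixed fraction of samples on the fly."""

    def __init__(self, base_dataset, add_synthetic_mask, mask_ratio=0.15):
        self.base = base_dataset
        self.add_mask = add_synthetic_mask   # hypothetical UV-texture-based mask generator
        self.mask_ratio = mask_ratio         # 15% gave the best Mask / MR-ALL trade-off

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, label = self.base[idx]
        if random.random() < self.mask_ratio:
            image = self.add_mask(image)     # e.g., randomly pick one of the 8 mask styles
        return image, label
```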

3.3 NAS-based backbone network

Different backbone networks differ considerably in their feature extraction ability. In face recognition, the commonly used industry baseline is IR-100, proposed in ArcFace [5]. In this competition we adopted the zero-shot NAS paradigm (Zen-NAS [6]) proposed by DAMO Academy to search the model space for a backbone with stronger representation capability. Unlike traditional NAS methods, Zen-NAS uses the Zen-Score in place of an accuracy-based evaluation of candidate models; notably, the Zen-Score is directly proportional to the model's final performance, so the whole search process is very efficient. The core algorithm of Zen-NAS is shown in the figure below:

[Figure: Zen-NAS core algorithm]

Starting from the IR-SE baseline backbone, we used Zen-NAS to search three structural variables: the number of channels in the input layer, the number of channels in each block, and the number of times each block is stacked, under the constraint that the searched backbone satisfies the inference time limit of each track. An interesting finding is that the backbone found by Zen-NAS performs almost the same as IR-SE-100 on the small ms1m track, but is clearly better than the baseline on a large-scale track such as WebFace260M. The reason may be that as the dataset grows, the search space and the range NAS can explore increase, and so does the probability of finding a more powerful model.
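To make the constrained, training-free search loop concrete, here is a highly simplified sketch. The scoring function is a generic expressivity proxy standing in for the real Zen-Score (see the Zen-NAS repository for the actual score), and the toy network builder is ours for illustration, not the IR-SE search space actually used:

```python
import copy
import random
import time

import torch
import torch.nn as nn

def build_net(cfg):
    """Toy conv stack; cfg = {"input_channels": int, "stages": [(channels, depth), ...]}."""
    layers, c_in = [], cfg["input_channels"]
    layers += [nn.Conv2d(3, c_in, 3, 2, 1), nn.BatchNorm2d(c_in), nn.ReLU()]
    for c_out, depth in cfg["stages"]:
        for i in range(depth):
            stride = 2 if i == 0 else 1
            layers += [nn.Conv2d(c_in, c_out, 3, stride, 1), nn.BatchNorm2d(c_out), nn.ReLU()]
            c_in = c_out
    return nn.Sequential(*layers)

def latency_ms(net, reps=10, size=112):
    """Rough latency estimate used to enforce the per-track inference-time budget."""
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        net(x)                                   # warm-up
        t0 = time.time()
        for _ in range(reps):
            net(x)
    return (time.time() - t0) / reps * 1000

def proxy_score(net, size=112):
    """Simplified training-free proxy (NOT the real Zen-Score): response to a small perturbation."""
    x = torch.randn(8, 3, size, size)
    with torch.no_grad():
        diff = net(x + 0.01 * torch.randn_like(x)) - net(x)
    return diff.abs().mean().log().item()

def mutate(cfg):
    """Randomly perturb one stage's channel count or depth, the variables we searched over."""
    new = copy.deepcopy(cfg)
    i = random.randrange(len(new["stages"]))
    c, d = new["stages"][i]
    new["stages"][i] = (max(16, c + random.choice([-16, 16])), max(1, d + random.choice([-1, 1])))
    return new

def search(budget_ms, iters=100):
    best = {"input_channels": 32, "stages": [(64, 2), (128, 2), (256, 2)]}
    best_score = proxy_score(build_net(best))
    for _ in range(iters):
        cand = mutate(best)
        net = build_net(cand)
        if latency_ms(net) > budget_ms:          # reject candidates over the inference-time limit
            continue
        score = proxy_score(net)
        if score > best_score:
            best, best_score = cand, score
    return best
```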

3.4 Loss function

The baseline loss function we used in this competition is Curricular Loss [7], which mimics the idea of curriculum learning by training on samples in order from easy to hard. However, training datasets are usually extremely unbalanced: popular identities may have thousands of images, while unpopular identities often have only one. To alleviate the long-tail problem caused by this imbalance, we introduce the idea of Balanced Softmax Loss [8] into Curricular Loss and propose a new loss function, Balanced Curricular Loss, whose expression is shown in the following figure:

[Formula: Balanced Curricular Loss]
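
For illustration, the sketch below shows one plausible way to combine CurricularFace's margin and hard-negative modulation with the Balanced Softmax log-prior over class counts. The exact formulation we used is the one in the formula above, and details (scale, margin, how the prior enters) may differ:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedCurricularLoss(nn.Module):
    def __init__(self, class_counts, s=64.0, m=0.5, momentum=0.99):
        super().__init__()
        # Balanced Softmax prior: shift each class logit by log(n_j), the class sample count.
        self.register_buffer("log_prior", torch.as_tensor(class_counts, dtype=torch.float32).log())
        self.s, self.m = s, m
        self.momentum = momentum
        self.register_buffer("t", torch.zeros(1))  # curriculum statistic from CurricularFace

    def forward(self, cosine, labels):
        # cosine: (B, C) cosine similarities between embeddings and class centers.
        target_cos = cosine.gather(1, labels.view(-1, 1))                       # cos(theta_y)
        sin = (1.0 - target_cos.pow(2)).clamp(min=0).sqrt()
        target_margin = target_cos * math.cos(self.m) - sin * math.sin(self.m)  # cos(theta_y + m)
        with torch.no_grad():                                                    # EMA of positive cosines
            self.t.mul_(self.momentum).add_((1 - self.momentum) * target_cos.mean())
        # Hard negatives (cos(theta_j) >= cos(theta_y + m)) are re-weighted by (t + cos(theta_j)).
        hard = cosine > target_margin
        logits = torch.where(hard, cosine * (self.t + cosine), cosine)
        logits = logits.scatter(1, labels.view(-1, 1), target_margin)
        # Balanced Softmax: add the log class-frequency prior before cross-entropy.
        return F.cross_entropy(self.s * logits + self.log_prior, labels)
```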

On the ms1m track, we compared the performance of the Balanced Curricular Loss (BCL) and the original Curricular Loss (CL). The results are shown in the following table:

[Table: BCL vs. CL on the ms1m track]

Compared with Curricular Loss, Balanced Curricular Loss brings large improvements on both the Mask and MR-ALL metrics, which fully demonstrates its effectiveness.

3.5 Knowledge distillation

Because this competition constrains the model's inference time, results are invalidated outright if the model exceeds the limit. We therefore use knowledge distillation to transfer the strong representation ability of a large model to a small one, and then use the small model for inference so as to meet the time requirement. The knowledge distillation framework we adopted is shown in the figure below:

[Figure: knowledge distillation framework]

The distillation loss is a simple L2 loss that transfers the feature information of the teacher model, while the student model is trained with Balanced Curricular Loss; the final loss is a weighted sum of the distillation loss and the training loss. After distillation, the student model even exceeds the teacher on some evaluation metrics, while its inference time is greatly reduced, which significantly improves performance on the ms1m small-dataset track.
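
A minimal sketch of one training step under this scheme; the embedding normalization and the weight `lambda_kd` are illustrative choices, not the exact competition settings:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, head_loss, images, labels, lambda_kd=1.0):
    """One step: student classification loss + L2 feature distillation from the teacher."""
    with torch.no_grad():
        teacher_feat = F.normalize(teacher(images))   # teacher stays frozen during distillation
    student_feat = F.normalize(student(images))
    kd_loss = F.mse_loss(student_feat, teacher_feat)  # transfer the teacher's feature space
    cls_loss = head_loss(student_feat, labels)        # e.g., the Balanced Curricular Loss head
    return cls_loss + lambda_kd * kd_loss
```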

3.6 Simultaneous model and data parallelism

The WebFace260M large-scale track has more than 2 million training IDs and more than 40 million images, so the traditional multi-machine multi-GPU data-parallel training approach can no longer hold a complete model. Partial FC [9] distributes the FC layer evenly across GPUs: each GPU computes the results of the sub-FC layer stored in its own memory, and an approximate full FC result is then obtained through synchronized communication among all GPUs. A schematic of Partial FC is shown below:

[Figure: Partial FC schematic]

With Partial FC, model parallelism and data parallelism can be used at the same time, so large models that previously could not be trained can now be trained normally. In addition, negative-class sampling can further increase the training batch size and shorten the training cycle.
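
A conceptual sketch of the idea follows; this is not the official Partial FC implementation, and gradient handling for the collectives as well as negative-class sampling are omitted for brevity:

```python
import torch
import torch.distributed as dist

class ShardedFC(torch.nn.Module):
    """Each rank stores a shard of the class-center matrix; the softmax denominator
    is assembled across ranks with collective operations.
    Assumes num_classes is divisible by world_size."""

    def __init__(self, embed_dim, num_classes, world_size):
        super().__init__()
        shard = num_classes // world_size                 # classes handled by this rank
        self.weight = torch.nn.Parameter(0.01 * torch.randn(shard, embed_dim))
        self.world_size = world_size

    def forward(self, local_features):
        # 1. Gather features from all ranks so every shard sees the full batch.
        #    (A full implementation needs an autograd-aware gather.)
        gathered = [torch.zeros_like(local_features) for _ in range(self.world_size)]
        dist.all_gather(gathered, local_features)
        feats = torch.cat(gathered)
        # 2. Logits only against the classes stored on this rank.
        local_logits = feats @ self.weight.t()
        # 3. Numerically stable global softmax denominator across all shards.
        global_max = local_logits.max(dim=1, keepdim=True).values.detach()
        dist.all_reduce(global_max, op=dist.ReduceOp.MAX)
        sum_exp = (local_logits - global_max).exp().sum(dim=1, keepdim=True)
        dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM)
        # The per-sample loss is then computed from local_logits, global_max and sum_exp;
        # negative-class sampling would subsample rows of self.weight before step 2.
        return local_logits, global_max, sum_exp
```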

3.7 Other techniques

Throughout the competition we tried various strategies such as data augmentation, label reconstruction, and learning-rate schedules. The strategies that proved effective are shown in the following figure:

[Figure: other effective training strategies]

4. Competition results

In this competition, our team mind_ft won 1 first place (WebFace260M SFR), 1 second place (InsightFace unconstrained), and 2 third places (WebFace260M Main and InsightFace ms1m) across the 5 InsightFace and WebFace260M tracks. A screenshot of the final official ranking of the WebFace260M track is shown below:

[Screenshot: final official WebFace260M leaderboard]

In the workshop after the competition, we were invited to share our solution with a global audience. In addition, our competition paper was accepted by the ICCV 2021 Workshop [10]. Finally, here are the certificates we received in this competition:

[Image: award certificates]

5. EssentialMC2 introduction and open source

EssentialMC2 (Entity Spatio-temporal Relationship Reasoning for Multimedia Cognitive Computing) is the core algorithm architecture distilled from long-term research on video understanding by the MinD Digital Media Group of DAMO Academy. Its core content consists of three basic modules: representation learning (MHRL), relational reasoning (MECR2), and open-set learning (MOSL3), which optimize the video understanding framework from the perspectives of basic representation, relational reasoning, and learning methods, respectively. Based on these three modules, we have built and open-sourced a code framework suitable for large-scale video understanding research, development, and training. The open-source release includes the group's recently published papers and algorithm competition results.

[Figure: EssentialMC2 architecture]

essmc2 is a complete deep learning training framework suitable for developing and training large-scale video understanding algorithms. The main goal of open-sourcing it is to provide a large number of verifiable algorithms and pre-trained models so that users can iterate quickly at low cost, and at the same time to build an influential open-source ecosystem in the video understanding field and attract more contributors to the project. The main design idea of essmc2 is "configuration is the object": through concise configuration files and the Registry design pattern, model definitions, optimizers, datasets, preprocessing pipelines, and other components can be declared in configuration and constructed directly from it, which fits the constant tuning and experimentation of daily deep learning work. At the same time, a consistent abstraction enables seamless switching between single-machine single-GPU, single-machine multi-GPU, and distributed environments: the user defines things once and can switch freely, making the framework simple, easy to use, and highly portable.
The first usable version of essmc2 has been released; everyone is welcome to try it out, and we will add more algorithms and pre-trained models in the future. Link: https://github.com/alibaba/EssentialMC2 .
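
As a generic illustration of the "configuration is the object" idea with a registry, here is a short sketch; the names are illustrative and not the exact essmc2 API:

```python
class Registry:
    """Maps the "type" string in a config to a registered class and builds it from the remaining keys."""

    def __init__(self, name):
        self.name, self._modules = name, {}

    def register(self, cls):
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        cfg = dict(cfg)                          # e.g. {"type": "ToyModel", "num_classes": 100}
        cls = self._modules[cfg.pop("type")]
        return cls(**cfg)                        # remaining keys become constructor arguments

MODELS = Registry("models")

@MODELS.register
class ToyModel:
    def __init__(self, num_classes=10):
        self.num_classes = num_classes

model = MODELS.build({"type": "ToyModel", "num_classes": 100})
```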

6. Productization

With the shift of Internet content toward video and the rise of applications such as VR and the metaverse, the amount of unstructured video content is growing rapidly. Quickly identifying and accurately understanding this content has become a key part of mining its value.
People are important content in videos. High-precision video face recognition can quickly extract key information about the people in a video and enable intelligent applications such as person-based clip editing and person search. Furthermore, by analyzing the visual, audio, and textual content of a video, richer entity tags such as people, events, objects, scenes, and logos can be recognized, forming structured video information and helping to extract key information more comprehensively.
Structured entity tags in turn serve as the basis for semantic inference; combined with multi-modal information fusion, they help capture the core content of a video and enable high-level semantic analysis, such as understanding categories and topics.
The high-accuracy face recognition and video analysis technology of the Alibaba Cloud Multimedia AI team has been integrated into the EssentialMC2 core algorithm architecture and productized. It supports analyzing and understanding the multi-dimensional content of videos and images and outputs structured tags (try it out: Retina Video Cloud Multimedia AI Experience Center - Smart Label product, https://retina.aliyun.com/#/Label ).

[Figure: Multimedia AI products]

The Smart Label product comprehensively analyzes the visual, textual, speech, and behavioral information in a video and combines multi-modal information fusion and alignment technology to achieve high-accuracy content recognition and comprehensive video category analysis, outputting multi-dimensional, scene-appropriate tags for the video content.

Category tags: high-level semantic analysis of video content that yields an understanding of categories and topics. Video category tags are organized into first, second, and third levels to support media asset management and personalized recommendation.

Entity tags: entity-level recognition of video content, covering dimensions such as video category and theme, film/TV/variety IP, people, actions and events, objects, scenes, logos, and on-screen text, with support for knowledge graph information about people and IP. IP search for TV series is based on video fingerprinting, which compares the target video against TV series and other resources in the library; it supports IP identification for more than 60,000 movies, TV series, variety shows, animations, and music works, and can identify which movie, TV series, or other IP content appears in the target video, enabling accurate personalized recommendation, copyright search, and other applications. Based on data from sources such as Youku, Douban, and encyclopedias, a knowledge graph covering film and TV, music, people, landmarks, and objects has been constructed; for entity tags hit in a video, the corresponding knowledge graph information can be output for media asset association, related recommendation, and other applications.

Keyword tags: video speech recognition and video OCR text recognition, combined with NLP analysis of the recognized speech and text, output keyword tags related to the video's topic for fine-grained content matching and recommendation.


Complete label system, flexible customization capabilities

The Smart Label product is trained on PGC and UGC video content from platforms such as Youku, Tudou, and UC overseas, providing a comprehensive, high-quality video label system. In addition to a general label taxonomy, it offers open, multi-level customization: extended functions such as face self-registration and custom entity labels are supported, and for customers with their own label systems we provide one-to-one label customization services such as label mapping and customized training, helping customers solve video processing efficiency problems in a more targeted way.

High-quality human-machine collaboration service

For business scenarios that demand high accuracy, the Smart Label product supports bringing human judgment into the loop, forming an efficient and professional human-machine collaboration service in which AI recognition algorithms and human reviewers complement each other to deliver accurate video labels for individual business scenarios.
The human-machine collaboration system includes mature collaboration platform tools, a professional annotation team, and standardized delivery processes covering personnel training, trial runs, quality inspection, and acceptance, ensuring annotation quality and enabling high-quality, low-cost data annotation services. With AI algorithms supplemented and corrected by manual annotation, the service delivers accurate, high-quality results and improves business efficiency and user experience.

[Figure: video tag recognition in the sports and film/TV industries]

[Figure: video tag recognition in the media and e-commerce industries]

The above capabilities have been integrated into the Alibaba Cloud Video Cloud Smart Label product, which provides high-quality video analysis and human-machine collaboration services. You are welcome to learn more and try it out (Smart Label product: https://retina.aliyun.com/#/Label ) to build more efficient and intelligent video applications.

References:
[1] Zheng Zhu, et al. WebFace260M: A benchmark unveiling the power of million-scale deep face recognition. CVPR 2021.
[2] Yao Feng, et al. Joint 3D face reconstruction and dense alignment with position map regression network. ECCV 2018.
[3] Jun Wang, et al. FaceX-Zoo: A PyTorch toolbox for face recognition. arXiv, abs/2101.04407, 2021.
[4] Jiankang Deng, et al. Masked Face Recognition Challenge: The InsightFace Track Report. arXiv, abs/2108.08191, 2021.
[5] Jiankang Deng, et al. ArcFace: Additive angular margin loss for deep face recognition. CVPR 2019.
[6] Ming Lin, et al. Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition. ICCV 2021.
[7] Yuge Huang, et al. CurricularFace: Adaptive curriculum learning loss for deep face recognition. CVPR 2020.
[8] Jiawei Ren, et al. Balanced meta-softmax for long-tailed visual recognition. NeurIPS 2020.
[9] Xiang An, et al. Partial FC: Training 10 million identities on a single machine. ICCV 2021.
[10] Tao Feng, et al. Towards Mask-robust Face Recognition. ICCV 2021.

"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Product Technology Exchange Group, discuss audio and video technologies with industry leaders, and get more industry latest information.
