The results of the international challenges at CVPR 2021, the just-concluded top conference in computer vision and pattern recognition, have all been announced.


Alibaba Amoy Technology's multimedia algorithm and video content understanding algorithm team took home in one sweep:

🎉 3 international championships 🎉
🎉 1 international runner-up 🎉
🎉 1 international third place 🎉

The technical domains span image caption generation, large-scale instance-level object recognition, multi-modal video emotion understanding, and video human-object interaction detection.

As an industry-leading team in multimedia algorithms, this Amoy Technology team focuses on building a video content perception and understanding algorithm platform featuring "device-cloud integration and cross-modal understanding". Its work covers AR live streaming, 3D digital venues, and intelligent content production, review, retrieval, and high-level semantic understanding. The team supports Taobao Live, shopping, Diantao, and other Taobao content services, and provides capability support for content businesses across the Alibaba Group through self-developed content platforms.

Below are the details of the three international championships and the methods we used to win them.

🏆 Champion 🏆 VizWiz Image Captioning

▐ Title

Workshop: CVPR 2021 VizWiz Grand Challenge Workshop
Track: Image Captioning

▐ Participants

Hongli, Hongji, Yongliang, Yuqi, Shaolin, Dingren

▐ Technical field

Image description generation

▐ Competition background

The VizWiz Grand Challenge has been held since 2018 and aims to use computer vision technology to help blind and visually impaired people "see" the world.

The input of this task is an image taken by a blind person, and the output is a description of the image.

Unlike other image-captioning datasets, the images in this competition were taken by blind and visually impaired photographers, so image quality is relatively poor and the task is more difficult.

▐ Our achievements

We won first place with a CIDEr-D score of 94.06, far surpassing the second-place score of 71.98.

Our score also exceeded the 81.04 CIDEr-D achieved by last year's champion, IBM.


▐ Task difficulty

There are two main difficulties in this task:

  1. Poor image quality: the images cover various indoor and outdoor scenes, and because the photographers are visually impaired, the images often suffer from out-of-focus blur, incomplete framing, and occlusion;
  2. Many descriptions depend on the text, objects, colors, and other details in the image, so the model needs fine-grained understanding abilities such as OCR and object detection.

▐ How we addressed these difficulties

  1. To match the characteristics of VizWiz images, we replaced object-region features with grid features extracted by a Swin-Transformer, so that all image regions are fully characterized;
  2. Since OCR text and object information provide positive guidance for caption generation, we extracted OCR results and object-detection category labels as supplementary features;
  3. Not all images contain OCR text, so we used multiple models that complement each other: purely visual models handle images without OCR, while visual + text (OCR + object category) multi-modal models handle images that contain OCR;
  4. For the results generated by multiple models, and given that the final metric is CIDEr, we ensembled the outputs with several strategies such as self-CIDEr selection and OCR-coverage maximization.
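As an illustration of the last step, a consensus-style caption selection (in the spirit of the self-CIDEr strategy) can be sketched in a few lines of Python. This is a hypothetical toy version, not our competition code; unigram-overlap F1 is only a crude stand-in for the actual CIDEr similarity:

```python
from collections import Counter

def overlap_f1(a, b):
    """Unigram-overlap F1 between two captions (crude stand-in for CIDEr)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    common = sum((ca & cb).values())
    if common == 0:
        return 0.0
    p, r = common / sum(ca.values()), common / sum(cb.values())
    return 2 * p * r / (p + r)

def consensus_caption(candidates):
    """Pick the candidate most similar, on average, to all other candidates."""
    def avg_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(overlap_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=avg_sim)

caps = [
    "a bottle of water on a table",
    "a plastic water bottle on a table",
    "a dog running in the park",
]
print(consensus_caption(caps))  # → "a bottle of water on a table"
```

The intuition is that captions most models agree on tend to score highest against human references, so the outlier ("a dog running in the park") is discarded.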

▐ Applicable scenarios

Image captioning requires both visual understanding and text generation, combining vision and NLP. It can automatically generate content titles for Internet products, and it can also help blind and visually impaired users better perceive the world.

▐ Event link

  1. Workshop: https://vizwiz.org/workshops/2021-workshop/
  2. Challenge: https://eval.ai/web/challenges/challenge-page/739/overview

🏆 Champion 🏆 Herbarium 2021 - Half-Earth Challenge

▐ Title

Workshop: The Eighth Workshop on Fine-Grained Visual Categorization
Task: Fine-grained plant species identification

▐ Participants

In the first year, Lanjing, Liuxiao, neighbors, warm rain, Jiyu, Liyou

▐ Technical field

Large-scale instance-level object recognition

▐ Competition background

Herbarium 2021 is a competition in the CVPR 2021 FGVC8 workshop, which has been held for eight consecutive years on instance-level fine-grained recognition problems.

The Herbarium 2021 dataset contains about 2.5 million specimen images of roughly 65,000 plant species, collected from many large botanical institutions across the Americas, Oceania, and other regions covering half the earth. The data are used to train plant recognition algorithms that help botanists identify plants and discover and protect new species.

The dataset has a long-tailed distribution: the smallest category has only 3 samples. At the same time, different plant species can look very similar, while samples of the same species can differ considerably, which poses great challenges for instance-level recognition.

▐ Our achievements

We achieved first place in this competition with an F1 score of 0.757, far surpassing the second place's 0.735 and the third place's 0.689.

▐ Task difficulty

There are two main difficulties in this task:

  1. There are many plant species with small inter-class differences: different species look very similar, while samples of the same species vary, making categories easy to confuse and hard to distinguish;
  2. The sample distribution is highly imbalanced with a long tail; the smallest category has only 3 samples, so improving accuracy on long-tail categories is crucial.

▐ How we addressed these difficulties

We cast instance-level plant recognition in natural scenes as a large-scale fine-grained feature-representation problem and proposed self-attention pooling for local feature enhancement to improve representational power. We introduced an imbalanced sampler and an adaptive per-class loss to address the skewed class distribution. In addition, mixed-precision multi-machine, multi-GPU training enabled rapid iteration on nearly 3 million images, and efficient online hard-example mining at the ten-thousand scale greatly improved feature generalization in complex scenes. In the end, we won the championship with a 2.2% lead over the runner-up.
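The imbalanced-sampler idea can be illustrated with a minimal Python sketch (a hypothetical toy version, not our production sampler): each sample is drawn with probability inversely proportional to its class frequency, so long-tail classes appear as often as head classes during training:

```python
import random
from collections import Counter

def balanced_sample_indices(labels, n_draws, seed=0):
    """Draw dataset indices with probability inversely proportional to
    class frequency, so rare (long-tail) classes are sampled as often
    as frequent (head) classes."""
    freq = Counter(labels)
    weights = [1.0 / freq[y] for y in labels]  # rare classes get larger weight
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)

# Toy long-tail dataset: class 0 has 8 images, class 1 has only 2.
labels = [0] * 8 + [1] * 2
drawn = Counter(labels[i] for i in balanced_sample_indices(labels, 10000))
print(drawn)  # both classes are drawn roughly 5000 times each
```

In practice this is usually combined with a loss re-weighting term (our adaptive category loss plays that role), since pure over-sampling alone can cause the rare classes to overfit.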

▐ Applicable scenarios

Instance-level fine-grained recognition technology can distinguish subtle visual differences between objects to achieve fine object recognition, and is widely used in product recognition, animal and plant recognition, pedestrian recognition, landmark recognition and other fields.

▐ Event link
  1. Workshop: https://sites.google.com/view/fgvc8/home
  2. Challenge: https://sites.google.com/view/fgvc8/competitions/herbariumchallenge2021
  3. Kaggle leaderboard: https://www.kaggle.com/c/herbarium-2021-fgvc8/leaderboard

🏆 Champion 🏆 ActivityNet Home Action Genome Challenge

▐ Title

Workshop: International Challenge on Activity Recognition
Task: Home Action Genome Challenge

▐ Participants

Shaolin, Liao Yue (Beihang University), Yongliang, Yeying, Liyou, Liu Si (Beihang University)

▐ Technical field

Video human-object interaction detection

▐ Competition background

The Home Action Genome Challenge was held for the first time this year at the CVPR 2021 ActivityNet Workshop, hosted by Professor Fei-Fei Li's research group at Stanford University. The competition provides a large-scale multi-view video dataset and asks participants to detect human-object interactions in video through multi-modal analysis.

▐ Our achievements

We achieved first place in this competition with an accuracy of 76.5%, significantly ahead of the second place's 68.4% and the third place's 65.7%.
Home Action Genome Challenge Award Certificate

▐ Task difficulty

There are three main difficulties in this task:

  1. The household scenes in the dataset are complex, making detection of human bodies and objects difficult;
  2. Human-object relations include both action relations and spatial relations, which rely on different visual features;
  3. Each human-object pair can have multiple relations, and the evaluation counts a prediction as correct only if all of its relations are exactly right.

▐ How we addressed these difficulties

Adopting stronger detection models: we used state-of-the-art detectors with Swin-Transformer and ResNeSt backbones, and improved detection accuracy through several data-augmentation strategies during training and multi-scale fusion at inference.
Enhancing the visual features of human-object relations: we designed a scheme that fuses two-stage and one-stage relation-detection networks. First, Swin-Transformer is integrated into the two-stage relation-detection network for end-to-end training; then the improved one-stage network directly extracts <person, object> pairs and determines their relations through a cascade structure, yielding <person, object, relation> triplets. Strategically, visual features are used to decide action relations, while spatial positions are fed in as input to help decide spatial relations.
A generation strategy based on statistical bias: when generating the final interaction triplets, we combine several strategies that weight the co-occurrence probability of <person, object, relation> with dataset statistical biases.
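The statistical-bias step can be sketched as re-ranking model scores by a co-occurrence prior. This is a minimal illustrative sketch under assumed names (`rerank_with_prior`, the `alpha` exponent, and the toy numbers are all hypothetical, not the exact formulation we used):

```python
def rerank_with_prior(model_scores, cooccur_prior, alpha=0.5):
    """Re-weight model scores for <person, object, relation> triplets by a
    co-occurrence prior raised to a tunable power alpha, then rank them.
    The exact weighting scheme here is illustrative only."""
    reranked = {
        triplet: score * (cooccur_prior.get(triplet, 1e-6) ** alpha)
        for triplet, score in model_scores.items()
    }
    return sorted(reranked, key=reranked.get, reverse=True)

# Toy example: "person holds cup" is common in training statistics,
# "person rides cup" is not, so the prior demotes the implausible triplet
# even though the raw model score preferred it.
scores = {("person", "cup", "hold"): 0.60, ("person", "cup", "ride"): 0.65}
prior = {("person", "cup", "hold"): 0.30, ("person", "cup", "ride"): 0.001}
print(rerank_with_prior(scores, prior)[0])  # → ('person', 'cup', 'hold')
```

The design intuition is that the relation classifier's scores alone are noisy on rare pairs, while dataset co-occurrence statistics provide a strong prior over which triplets are plausible.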

▐ Applicable scenarios

Video human-object interaction detection extracts dynamic structured <person, object, relation> information from video, and can be applied to scenarios such as structured video indexing and human-computer interaction.

▐ Event link

  1. Challenge: https://homeactiongenome.org/results.html
  2. Workshop: http://activity-net.org/challenges/2021/challenge.html

In addition to the three championships above, we also won second place in the Hotel-ID 2021 Hotel Recognition Challenge and third place in the Evoked Expressions from Videos (EEV) Challenge, placing us among the top teams in the multimedia algorithm field.

The Amoy Technology multimedia algorithm competition team said: "As video accounts for an ever-growing share of media, video information overload is a problem for both individuals and platforms. Multi-dimensional structured representation of video content will be one of the hot research directions in the vision field. Going forward, we will integrate multi-modal information such as text, speech, and vision to understand video content, so that users can see more of the content they like, spend less time choosing what to watch, and enjoy a better visual experience."

