For video data, an important research and development topic is how to use computer vision technology to provide better services for users and merchants. This article shares the practice of short video content understanding and generation technology in Meituan's business scenarios.

1. Background

Meituan has accumulated a wealth of video data around its rich local life service and e-commerce scenarios.

An example of a short video in the Meituan scene

video link

The above shows an example of a dish review in a Meituan business scenario. It can be seen that video provides richer information than text and images: the dynamic interaction of flame, chocolate, and ice cream in the creative dish "A Song of Ice and Fire" is vividly presented in short video form, helping merchants provide users with diversified content display and consumption guidance.

Video Industry Development

We were able to enter the era of video explosion quickly because significant progress has been made in many technical fields, including the miniaturization of shooting and acquisition equipment, advances in video encoding and decoding, and improvements in network communication. In recent years, as visual AI algorithms have matured, they have been widely applied in video scenarios. This article focuses on how to improve the efficiency of video content creation, production, and distribution with the support of visual AI technology.

Meituan AI - scene-driven technology

When it comes to Meituan, the first thing that comes to mind is the food delivery scene. However, in addition to food delivery, Meituan has more than 200 other businesses, covering life service scenarios such as eating, living, traveling, and entertainment, as well as retail e-commerce such as "Meituan Preferred" and "Tuanhaohuo". The rich business scenarios bring diverse data and a wide variety of landing applications, which in turn drive the innovation and iteration of the underlying technology. At the same time, the accumulated underlying technology can empower the digital and intelligent upgrade of each business, forming a positive cycle in which the two promote each other.

Meituan business scene short video

Rich content and display forms (C-side)

The technical practice cases shared in this article mainly revolve around "eating". Meituan has content layouts and display forms in every scene and site, and short video technology is widely applied on Meituan's C-side, for example: video cards in the homepage feed, immersive video, the video notes you see when opening the Dianping App, user reviews, search result pages, and more. Before these video contents are presented to users, they must be understood and processed by many algorithm models.

Rich content and display forms (B-side)

On the merchant side (B-side), video content display forms include:

  1. Scenic spot introductions - giving consumers a more three-dimensional online preview of the visiting experience;
  2. Hotel album overviews - synthesizing the static images in an album into a video that fully displays hotel information and helps users quickly grasp the whole picture of a hotel (the automatic generation technology is introduced in Section 2.2.2 below);
  3. Merchant brand advertising - through functions such as smart editing, the algorithm lowers the threshold for merchants to edit and create videos;
  4. Merchant video albums - merchants can upload all kinds of video content themselves, and the algorithm tags the videos to help merchants manage them;
  5. Product videos and motion pictures - as mentioned above, Meituan's business also includes retail e-commerce, where video is very advantageous for displaying product information. For example, it is difficult to convey the motion of fresh products such as crabs and shrimps with static images, but dynamic images can give users much richer reference information.

Short video technology application scenarios

From the perspective of application scenarios, online applications of short video mainly include content operation and management, content search and recommendation, advertising and marketing, and creative production. The underlying supporting technologies can be divided into two categories: content understanding and content production. Content understanding mainly answers the question of what content appears in the video and when. Content production is usually built on top of content understanding and processes video materials; typical technologies include smart video covers and smart clipping. Below I will introduce the practice of these two types of technologies in Meituan scenarios.

2. Short video content understanding and generation technology practice

2.1 Understanding of short video content

2.1.1 Video Tags

The main goal of video content understanding is to summarize the important concepts that appear in a video, open the "black box" of video content so that the machine knows what is in the box, and provide semantic information for downstream applications to better manage and distribute videos. According to the form of the results, content understanding can be divided into two types: explicit and implicit. Explicit understanding refers to attaching human-readable text labels to videos through video classification and related technologies. Implicit understanding mainly refers to embedded features represented as vectors, which are combined with models in scenarios such as recommendation and search to directly model the final task. Roughly speaking, the former mainly serves people, and the latter mainly serves machine learning algorithms.

Explicit video content labels are necessary in many scenarios, such as content operation, where operators need to analyze supply and demand and select high-value content based on labels. The figure above outlines the process of tagging videos for content understanding. Each tag here is a keyword that people can understand. Usually, to make them easier to maintain and use, a large number of tags are organized into a tag system according to the logical relationships between them.

2.1.2 Different dimensions and granularities of video tags

So what are the application scenarios of video tags, and what are the technical difficulties behind them? A representative example in the Meituan scene is the food exploration (restaurant visit) video, whose content is very rich. The design of the tag system is particularly critical: what kind of tags are appropriate for describing the video content?

First of all, the definition of tags needs to be finalized from the perspectives of product, operations, and algorithm. In this case, there are three layers of tags, and the higher the layer, the more abstract it is. The topic tag at the top has a strong ability to summarize the overall video content, such as the food exploration theme; the middle layer further describes content related to the shooting scene, such as the environment inside and outside the store; the bottom layer is divided into fine-grained entities, understanding dishes at the granularity of Kung Pao Chicken or scrambled eggs with tomato. Tags at different layers have different applications, and the top-level video topic tags can be applied to screening high-value content and designing operation strategies. The main difficulty is the high degree of abstraction: a phrase like "food exploration" is highly generalized, and people can understand it after watching a video, but from the perspective of visual feature modeling, what characteristics must a video have to be considered a food exploration video? This poses a great challenge to the learning ability of the model.
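
To make the layering concrete, here is a minimal sketch of such a three-layer tag system as a tree data structure; the node names are illustrative only, not Meituan's actual taxonomy:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tag:
    """One node in a hypothetical three-layer video tag taxonomy."""
    name: str
    level: str                      # "topic" | "scene" | "entity"
    children: List["Tag"] = field(default_factory=list)

# Topic layer: highly abstract, summarizes the whole video.
food_exploration = Tag("food exploration", "topic", [
    # Scene layer: describes the shooting scene, e.g. inside/outside the store.
    Tag("in-store environment", "scene", [
        # Entity layer: fine-grained dishes that actually appear in the clip.
        Tag("Kung Pao Chicken", "entity"),
        Tag("scrambled eggs with tomato", "entity"),
    ]),
    Tag("storefront", "scene"),
])
```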

2.1.3 Basic Representation Learning

The solution mainly focuses on two aspects: on the one hand, improving the label-independent general base representation, and on the other hand, improving classification performance for specific labels. The initial model needs a good basic representation ability. This part does not involve the final downstream task (for example, identifying whether a video is a food exploration video), but rather pre-trains the model weights. A good basic representation makes it much easier to improve the performance of downstream tasks.

Since labeling videos is very expensive, the technical solution needs to consider how to learn better basic features while using as little business-supervised labeled data as possible. First, at the level of task-independent basic model representation, we use features self-supervised pre-trained on Meituan video data, which fit the business data distribution better than models pre-trained on public datasets.

Second, at the level of semantic information embedding (as shown in the figure above), there is multi-source labeled data that can be utilized. It is worth mentioning that Meituan's business scenarios contain fairly distinctive weakly labeled data. For example, when users post reviews in restaurants, the upper-level abstract label of the pictures and videos is food, and the review text is very likely to mention the names of the dishes eaten in the restaurant. This is high-quality supervision information that can be mined and cleaned by technical means such as visual-text correlation measurement. Shown here is an automatically mined video sample tagged as "barbecue".

Video sample

By using this part of the data for pre-training, an initial Teacher Model can be obtained, which is then used to pseudo-label the unlabeled data in the business scenario. The key here is that, because the prediction results are not completely accurate, pseudo-labels need to be cleaned based on information such as classification confidence. The cleaned incremental data, together with the Teacher Model, is then used to learn feature representations that fit the business scenario better, and iterative cleaning yields a Student Model that serves as the underlying representation model for downstream tasks. In practice, we found that iterating over the data yields more gains than improving the model structure.
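
The cleaning step can be illustrated with a minimal sketch, assuming a PyTorch teacher model and a data loader that yields (frames, video_id) batches; the confidence threshold is an illustrative value, not the one used in production:

```python
import torch

@torch.no_grad()
def clean_pseudo_labels(teacher, unlabeled_loader, conf_threshold=0.9):
    """Pseudo-label unlabeled business videos with the Teacher Model and keep
    only high-confidence predictions (a simple form of pseudo-label cleaning)."""
    teacher.eval()
    kept = []
    for frames, video_ids in unlabeled_loader:
        probs = torch.softmax(teacher(frames), dim=-1)
        conf, labels = probs.max(dim=-1)
        for vid, c, y in zip(video_ids, conf, labels):
            if c.item() >= conf_threshold:           # drop low-confidence, noisy predictions
                kept.append((vid, int(y), float(c)))
    return kept

# The kept (video, pseudo-label) pairs are merged with the weakly labeled data
# to train a Student Model, which can act as the teacher for the next round.
```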

2.1.4 Model Iteration

The main problem in improving performance for specific labels is how to efficiently iterate the sample data of the target category on top of the basic representation model, so as to improve the performance of the label classification model. Sample iteration is divided into two parts: offline and online. Taking the food exploration label as an example, a small number of positive samples are first labeled offline, and the basic representation model is fine-tuned to obtain an initial classification model. At this stage the recognition accuracy of the model is usually low, but even so, it is very helpful for sample cleaning and iteration. Imagine an annotator sifting aimlessly through the stock sample pool: they may have to watch hundreds or thousands of videos before finding one sample of the target category. With pre-screening by the initial model, however, they may only need to watch a few videos to find a target sample, which significantly improves labeling efficiency.

The second step is how to continuously iterate more samples online and improve the accuracy of the label classification model. We have two return paths for results predicted by the online model. If the online model's predictions are highly confident, or several models agree on the same prediction, the predicted labels can be automatically fed back into model training, with confidence learning used to automatically reject noisy labels. Even more valuable, we found in practice that the higher ROI for model performance improvement comes from manually correcting the data the model is not confident about. For example, samples on which the prediction results of three models differ greatly are screened out and handed over for manual confirmation. This active learning approach avoids wasting labeling manpower on a large number of easy samples and instead expands, in a targeted way, the labeled data that is most valuable for improving model performance.
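
A minimal sketch of this routing logic, assuming several PyTorch classifiers and an illustrative confidence threshold (names and thresholds are not from the article):

```python
import torch

@torch.no_grad()
def split_by_agreement(models, frames, conf_threshold=0.95):
    """Route a batch of videos: labels that all models agree on with high
    confidence are fed back to training automatically; disagreements are
    the most valuable samples to send to human annotators."""
    preds, confs = [], []
    for m in models:
        p = torch.softmax(m(frames), dim=-1)
        c, y = p.max(dim=-1)
        preds.append(y)
        confs.append(c)
    preds = torch.stack(preds)                 # [num_models, batch]
    confs = torch.stack(confs)

    agree = (preds == preds[0]).all(dim=0)     # all models predict the same label
    confident = (confs >= conf_threshold).all(dim=0)

    auto_accept = agree & confident            # returned to model training
    needs_review = ~agree                      # routed to manual confirmation
    return auto_accept, needs_review
```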

2.1.5 Application of Video Tags - Screening and Aggregation of High-value Content

The picture above shows an application case of visual tags in the Dianping recommendation business. The most representative one is the selection of high-value content: in the "Daren Tandian" (influencer restaurant visit) tab of the information flow on the Dianping App homepage, operations colleagues filter out videos with the "food exploration" tag for display. This lets users gain a more comprehensive understanding of a restaurant in an immersive experience, and also provides merchants with a good window for publicity and attracting traffic.

2.1.6 Different dimensions and granularities of video tags

The figure above shows that tags of different dimensions place different requirements on technology. Fine-grained entity understanding needs to identify the specific dish, which differs from the problem of upper-layer coarse-grained labels and brings its own technical challenges. First, it is a fine-grained recognition task and requires more detailed modeling of visual features; second, understanding dishes in video is more challenging than recognizing dishes in a single image, and cross-domain differences in the data must be handled.

2.1.7 The migration of dish image recognition capabilities to the video field

After abstracting the key issues, let's address them separately. First, on the problem of fine-grained recognition, the challenge of measuring the visual similarity of dishes is that the appearance and positional relationships of different ingredients have no standardized definition; the same dish made by different chefs may look completely different. This requires the model not only to focus on local fine-grained features, but also to integrate global information for discrimination. To solve this problem, we proposed a stacked global-local attention network that simultaneously captures global shape and texture cues and local ingredient differences, significantly improving dish recognition. The related results were published at the ACM MM international conference (ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network).
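
The published network is considerably more elaborate, but the core idea of fusing global features with attention-pooled local features can be sketched as follows; this is a simplified illustration under my own assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torchvision

class GlobalLocalNet(nn.Module):
    """Simplified sketch: combine global shape/texture cues with
    attention-weighted local ingredient details for dish recognition."""
    def __init__(self, num_classes=500):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # [B, 2048, h, w]
        self.attn = nn.Conv2d(2048, 1, kernel_size=1)                   # spatial attention map
        self.fc = nn.Linear(2048 * 2, num_classes)

    def forward(self, x):
        f = self.features(x)                               # [B, 2048, h, w]
        global_feat = f.mean(dim=(2, 3))                   # global average pooling
        attn_logits = self.attn(f)                         # [B, 1, h, w]
        attn = torch.softmax(attn_logits.flatten(2), dim=-1).view_as(attn_logits)
        local_feat = (f * attn).sum(dim=(2, 3))            # attention-weighted local pooling
        return self.fc(torch.cat([global_feat, local_feat], dim=1))
```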

The second challenge is illustrated in the figure above: the same object often looks different in images and video frames. For example, crabs in images are usually cooked and plated, while crabs in video frames are often fresh and in the middle of being cooked, and the two look very different visually. We mainly address this cross-domain difference from the perspective of data distribution.

Our business scenarios have accumulated a large number of annotated food images, and predictions on such samples are usually well discriminated, but due to differences in data distribution, crabs in video frames cannot be predicted with confidence. We therefore aim to improve the discriminativeness of predictions on video frames. On the one hand, nuclear norm maximization is used to obtain a better prediction distribution; on the other hand, knowledge distillation is used so that powerful models continuously guide the predictions of lightweight networks. Combined with semi-automatic annotation of video frame data, better performance can be obtained in video scenarios.
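
Assuming the "nuclear norm maximization" here refers to batch nuclear-norm maximization on the prediction matrix (my reading of the description, not confirmed by the article), the two training signals could look roughly like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def bnm_loss(logits):
    """Batch nuclear-norm maximization: encourages predictions on unlabeled
    video frames to be both confident and diverse by maximizing the nuclear
    norm of the batch prediction matrix (returned negated, as a loss)."""
    probs = torch.softmax(logits, dim=-1)                    # [batch, num_classes]
    return -torch.linalg.matrix_norm(probs, ord='nuc') / probs.shape[0]

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Knowledge distillation: a powerful image model guides the predictions
    of a lightweight network on video frames."""
    p_teacher = torch.softmax(teacher_logits / T, dim=-1)
    log_p_student = torch.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T * T)
```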

2.1.8 Fine-grained dish image recognition capability

Based on the above accumulation of content understanding in food scenes, we held the Large-Scale Fine-Grained Food Analysis competition at ICCV 2021. The dish images come from Meituan's actual business scenarios and cover 1,500 categories of Chinese dishes. The competition dataset remains open at https://foodai-workshop.meituan.com/foodai2021.html#index; you are welcome to download it and work with us to improve recognition performance in this challenging scenario.

2.1.9 Application of fine-grained dish labels - search-aware cover selection

What applications does identifying fine-grained dish names in videos have? Here I will share an application in Dianping's search business: search-aware cover selection. The effect achieved is to display different covers for the same video content depending on the search keywords entered by the user. The offline part of the figure shows the process of segmenting and selecting video clips. First, images suitable for display are screened out through key-frame extraction and basic quality filtering; then, through fine-grained dish label recognition, we know which dishes appear at which time, and these frames are stored in the database as candidate cover materials.

When a user searches online for content of interest, the most suitable cover is selected from the video's multiple cover candidates according to their relevance to the user's query, improving the search experience.
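
A minimal sketch of the online step, assuming cover candidates already carry dish tags and quality scores from the offline pipeline; the keyword-overlap relevance and the weights are toy placeholders for the real relevance model:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CoverCandidate:
    frame_id: str
    dish_tags: List[str]       # from offline fine-grained dish recognition
    quality: float             # clarity / aesthetic score from offline filtering

def pick_cover(candidates: List[CoverCandidate], query: str) -> CoverCandidate:
    """Choose the candidate cover most relevant to the search query."""
    def score(c: CoverCandidate) -> float:
        # Toy relevance: does any dish tag overlap with the query text?
        relevance = any(query in tag or tag in query for tag in c.dish_tags)
        return float(relevance) + 0.1 * c.quality   # quality breaks ties
    return max(candidates, key=score)

# Example: searching "hot pot" picks a frame tagged with a hot-pot dish
# instead of the default, person-centered cover.
```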

For example, for the same search "hot pot", the left picture shows the default covers and the right picture shows the result of search-aware cover selection. Some results on the left have person-centered covers, which do not match what users expect to see when searching for hot pot videos and intuitively feel like irrelevant bad cases. With search-aware cover selection, the displayed covers are all hot pot images, and the experience is better. This is an innovative application of fine-grained tag understanding of video clips in the Meituan scene.

2.1.10 Mining richer video clip tags

The above examples are all about food videos, but Meituan has many other business scenarios. How to automatically mine richer video tags, so that the tag system can expand automatically instead of relying on manual sorting and definition, is an important topic. Our work is based on content-rich user review data. The example in the figure above is a user's note; it contains a video, several pictures, and a long text description. These modalities are related and share common concepts. Through statistical learning methods, cross-validation between the visual and text modalities can be used to mine the correspondence between video clips and tags.
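
As a rough illustration of the idea (the field names and counting rule are hypothetical, not Meituan's actual mining pipeline), candidate tags can be surfaced by counting cross-modal agreements between review text and the visual concepts predicted on clips:

```python
from collections import Counter
from typing import Dict, List

def mine_clip_tags(notes: List[dict], min_count: int = 20) -> Dict[str, int]:
    """Toy cross-modal tag mining: count candidate phrases that appear both in
    a note's review text and among the visual concepts predicted on its video
    clips; frequent co-occurrences become candidate tags, which naturally
    follow a long-tailed frequency distribution."""
    counter = Counter()
    for note in notes:
        text_phrases = set(note["text_phrases"])         # noun phrases from the review text
        visual_concepts = set(note["visual_concepts"])    # concepts predicted on video clips
        counter.update(text_phrases & visual_concepts)    # keep cross-modal agreements
    return {tag: count for tag, count in counter.items() if count >= min_count}
```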

2.1.11 Example of video clip semantic tag mining results

The figure shows examples of video clips and tags automatically mined by the algorithm. The left figure shows the frequency of the tags, which follows an obvious long-tailed distribution. It is worth noting that in this way the algorithm can discover meaningful finer-grained tags, such as "scarf painting". More of the tags that matter in Meituan scenarios can thus be discovered while minimizing manual involvement.

2.2 Generation of short video content

Next, let's talk about how to do content production on the basis of content understanding. Content production is a very important part of short video AI application scenarios. The following mainly concerns the deconstruction and understanding of video materials.

The figure above shows the process of video content production. In the content generation step, after the original video is uploaded to the cloud, it is used as material and edited and processed by algorithms to better realize the potential value of the content. For example, in advertising scenarios, the algorithm identifies and edits the highlights of the business environment and the presentation of dishes from the original video to improve information density and quality.

In addition, video content production can be divided into three categories according to the application form:

  1. Images generate videos; a common form is the automatic generation of album quick-view videos;
  2. Videos generate video clips; a typical case is editing the highlights of a long video into a more streamlined short video for secondary distribution;
  3. Pixel-level video editing, which mainly involves refined special-effects editing of the picture.

Below, we describe the three types of application forms.

2.2.1 Image Generation Video - Food Motion Picture Generation in Dining Scenes

The first category is generating videos from images. What needs to be done here is to understand and process the image materials so that users can generate ideal material end-to-end with one click, without being aware of the technical details. As shown in the figure above, the merchant only needs to input an image album as production material and leave everything to the AI algorithm: first, the algorithm automatically removes poorly shot pictures that are unsuitable for display; then it performs content recognition and quality analysis, where content recognition includes content tags, and quality analysis includes clarity and aesthetic scores; since the size of the original image material rarely matches the target display slot, the image is then cropped intelligently according to the aesthetic evaluation model; finally, Ken Burns effects, transitions, and other special effects are overlaid to render the result. The merchant obtains a beautifully arranged food video.
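
As an illustration of this pipeline, here is a minimal, self-contained sketch using OpenCV. The Laplacian-variance sharpness filter and the center crop are crude stand-ins for the quality and aesthetic models, and all thresholds and sizes are arbitrary assumptions:

```python
import cv2
import numpy as np
from typing import List

def sharpness(img: np.ndarray) -> float:
    """Variance of the Laplacian as a cheap blur / quality proxy."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def center_crop_to_ratio(img: np.ndarray, w: int, h: int) -> np.ndarray:
    """Crude stand-in for the aesthetic smart-crop model: take the largest
    centered window with the target aspect ratio, then resize."""
    H, W = img.shape[:2]
    target = w / h
    if W / H > target:                        # too wide -> crop width
        new_w = int(H * target)
        x0 = (W - new_w) // 2
        img = img[:, x0:x0 + new_w]
    else:                                     # too tall -> crop height
        new_h = int(W / target)
        y0 = (H - new_h) // 2
        img = img[y0:y0 + new_h, :]
    return cv2.resize(img, (w, h))

def album_to_video(paths: List[str], out_path="album.mp4", size=(720, 1280),
                   fps=25, seconds_per_image=2.0, min_sharpness=100.0):
    """Minimal album-to-quick-view-video pipeline: drop blurry images,
    crop to the display slot, and render a slow zoom per image."""
    frames_per_image = int(fps * seconds_per_image)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for p in paths:
        img = cv2.imread(p)
        if img is None or sharpness(img) < min_sharpness:
            continue                           # skip unreadable or blurry images
        img = center_crop_to_ratio(img, *size)
        for t in range(frames_per_image):      # simple zoom-in Ken Burns effect
            scale = 1.0 + 0.1 * t / frames_per_image
            zoomed = cv2.resize(img, None, fx=scale, fy=scale)
            y0 = (zoomed.shape[0] - size[1]) // 2
            x0 = (zoomed.shape[1] - size[0]) // 2
            writer.write(zoomed[y0:y0 + size[1], x0:x0 + size[0]])
    writer.release()
```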

2.2.2 Image generation video - hotel scene album snapshot video generation

There is also an example of album quick-view video generation in the hotel scene. Compared with food motion pictures, this requires combining audio and transition effects. At the same time, such videos have higher requirements on which content should be displayed first: it is necessary to combine the characteristics of the business scenario and, following a script template designed by a designer, automatically filter specific types of images with the algorithm to fill the corresponding positions of the template.

2.2.3 Video to generate video clips

The second category is generating video clips from a video. It mainly segments a long video and selects several of the more exciting clips that meet user expectations for display. The algorithm is divided into two stages: clip generation and clip screening and sorting. In the clip generation part, shot segments and key frames are obtained through a temporal segmentation algorithm. The clip sorting part is more critical, as it determines the priority order of the clips. This is also the hard part; it has two dimensions (a minimal scoring sketch follows the list):

  1. The generic quality dimension, including clarity, aesthetic rating, etc.;
  2. The semantic dimension; for example, in food videos, shots of the finished dish and of the cooking process are usually the more exciting clips. Understanding of the semantic dimension is mainly supported by the content understanding models introduced earlier.
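
A minimal sketch of fusing the two dimensions into a single ranking score; the weights and field names are illustrative assumptions, not production values:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    start: float              # seconds
    end: float
    clarity: float            # generic quality scores
    aesthetic: float
    semantic: float           # e.g. probability of "finished dish" / "cooking process"

def rank_clips(clips: List[Clip], w_quality: float = 0.4, w_semantic: float = 0.6,
               top_k: int = 3) -> List[Clip]:
    """Fuse generic quality with the semantic score from the content
    understanding model, then keep the top-k clips for the highlight or
    dynamic cover."""
    def score(c: Clip) -> float:
        quality = 0.5 * c.clarity + 0.5 * c.aesthetic
        return w_quality * quality + w_semantic * c.semantic
    return sorted(clips, key=score, reverse=True)[:top_k]
```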

2.2.3.1 Smart Covers and Highlights



Original video (1min)
Algorithm clip video (10s)

We generate video clips from videos for two application scenarios. The first is the smart dynamic cover, which mainly relies on general quality to select video clips with higher definition, rich dynamic information, and no flickering as the cover of the video, performing better than the default clip.

2.2.4 Pixel-level editing and processing of video - video effects of dishes

video link

The third category is pixel-level video editing. Shown here, for example, is a creative special effect based on Video Object Segmentation (VOS) technology. The key technology behind it is an efficient semantic segmentation method developed by Meituan and published at CVPR 2021 (Rethinking BiSeNet For Real-time Semantic Segmentation); interested readers can refer to it.

One of the most important technologies for pixel-level editing is semantic segmentation. The main technical challenge in application scenarios is to guarantee both the latency and the resolution of the segmentation model while preserving high-frequency detail information. We further improved the classic BiSeNet method and proposed an efficient semantic segmentation method based on detail guidance.

The specific method is shown in the network structure diagram. The light blue part on the left is the inference framework of the network, which follows the design of the BiSeNet Context branch; the backbone of the Context branch uses our self-developed backbone STDCNet. Different from BiSeNet, we apply detail-guided training to Stage3, as shown in the light green part on the right, to guide Stage3 to learn detail features. The light green part participates only in training, not in inference, so it incurs no extra inference cost. First, for the segmentation ground truth, we apply Laplacian convolutions with different strides to obtain a detail ground truth that emphasizes edge and corner information; then the detail ground truth and a designed detail loss guide the shallow features of Stage3 to learn detail features.

Since the foreground-background distribution of the detail ground truth is severely imbalanced, we use a joint training scheme of Dice loss and BCE loss. To verify the effectiveness of detail guidance, we ran experiments whose feature visualizations show that using detail ground truth fused from multiple scales to guide the network yields the best results, and that detail guidance also improves model performance.
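
A minimal sketch of this detail-guidance recipe, reconstructed from the description above (not the released implementation; the fusion of multi-scale detail maps here is a simple max, where the paper may use learned weights):

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[[[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]]]])               # shape [1, 1, 3, 3]

def detail_ground_truth(seg_gt, strides=(1, 2, 4), thresh=0.1):
    """Build a binary detail map from the segmentation ground truth: apply a
    Laplacian kernel at several strides, upsample back, fuse, and threshold,
    so that edge/corner pixels become the positive 'detail' class."""
    seg_gt = seg_gt.float().unsqueeze(1)                    # [B, 1, H, W]
    kernel = LAPLACIAN.to(seg_gt.device)
    maps = []
    for s in strides:
        d = F.conv2d(seg_gt, kernel, stride=s, padding=1).abs()
        maps.append(F.interpolate(d, size=seg_gt.shape[-2:],
                                  mode='bilinear', align_corners=False))
    detail = torch.stack(maps).max(dim=0).values            # fuse multi-scale maps
    return (detail > thresh).float()

def detail_loss(pred_logits, detail_gt, eps=1.0):
    """Joint BCE + Dice loss for the severely imbalanced detail map."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, detail_gt)
    p = torch.sigmoid(pred_logits)
    inter = (p * detail_gt).sum(dim=(2, 3))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(2, 3)) + detail_gt.sum(dim=(2, 3)) + eps)
    return bce + dice.mean()
```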

In terms of effect, the comparison shows that our method has a clear advantage in preserving the high-frequency details of the segmentation.

3. Summary and Outlook

The above shared how Meituan, through combination with business scenarios, hopes to provide merchants and users with more intelligent ways to display and obtain information in the areas of video tags, video covers and clips, and fine-grained pixel-level video editing. In the future, short video technology will have even greater potential value in Meituan's rich business scenarios, including local life services and retail e-commerce. In terms of video understanding technology, multi-modal self-supervised training is very valuable for reducing dependence on labeled data and improving model generalization in complex business scenarios, and we are making attempts and explorations in this direction.

4. About the author

Ma Bin is an engineer in the Visual Intelligence Department of Meituan.


This article is produced by the Meituan technical team and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email tech@meituan.com to apply for authorization.

