As video and interactive media play an ever more prominent role in daily life, increasingly diverse video scenes and rising expectations for visual quality pose greater challenges for video coding. Compared with the many coding tools that experts design by hand, AI-based coding can learn a broader range of the inherent regularities of video from large-scale data. Industry and academia have been working to promote AI video coding standards and to explore new frameworks.
Alibaba Cloud Video Cloud has made important contributions to the JVET video coding standard for human vision and the MPEG video coding standard for machine vision, strongly driving the development of both. For scenarios with strong industrial demand, such as video conferencing and live streaming, Alibaba Cloud Video Cloud has also built an AI generative compression system that, at the same quality, reduces the bit rate by a factor of two to three compared with VVC, achieving truly ultra-low-bit-rate video communication.
At the LiveVideoStackCon 2021 Beijing Summit, Wang Zhao, algorithm expert at Alibaba Cloud Intelligent Video Cloud, presented Alibaba Cloud's latest explorations in AI video coding technology.
Text | Wang Zhao
Edited by | LiveVideoStack
Hello everyone, I am Wang Zhao from Alibaba Cloud Video Cloud. The theme of today's talk is "Embracing Intelligence: New Explorations in AI Video Coding Technology", and I mainly want to introduce two pieces of cutting-edge work from Alibaba Cloud Video Cloud.
The talk has four parts: background and motivation, character video generative coding, machine vision coding, and future prospects.
1. Background and Motivation
I will describe the background and motivation for Alibaba Cloud Video Cloud's exploration of AI video coding technology from two angles: human vision and machine vision.
Video carries an enormous amount of data. A single raw 4K image is 24.3 MB, uncompressed 4K video requires roughly 6 Gbps of bandwidth, and a single ultra-HD camera can produce up to 63 TB of raw video per day. Such video can only be transmitted and stored after encoding.
As technology evolves, video in scenarios such as smart security, autonomous driving, smart cities, and the industrial Internet can also be received, perceived, and understood by machines.
Take autonomous driving as an example. A car's main systems and devices include camera systems (detecting objects ahead), night-vision infrared, radar rangefinders, inertial sensors, GPS locators, and lidar (360° scanning). Images and video are captured by machines and then handed to machines for analysis, to find and solve problems and to improve functionality.
Machines outperform humans along certain dimensions, such as observation accuracy, perception sensitivity, tolerance of work intensity (machines can operate around the clock), objectivity, and quantifiability.
According to Cisco statistics, machine-to-machine transmission driven by machine vision will account for 50% of global data traffic, a very large share.
Whether for human vision or machine vision, the principle of video coding is the same: it exploits the correlations within the video signal itself. Adjacent pixels within an image have similar values, which is spatial correlation; pixels at the same position in adjacent frames have similar values, which is temporal correlation; and if pixels are transformed from the spatial domain to the frequency domain, correlations exist there as well. These correspond to the three most basic redundancies exploited by video compression: spatial redundancy, temporal redundancy, and information-entropy redundancy. From them, the three main modules of video codecs are derived: intra-frame prediction, inter-frame prediction, and transform/entropy coding.
Eliminating redundancy is itself lossless, yet video compression does introduce distortion. Where does the distortion come from? To push the compression ratio further, the video information is transformed into some domain, typically the frequency domain in traditional coding, and the resulting coefficients are ranked by importance; low-priority information, such as high-frequency components, is quantized or discarded. This greatly increases the compression ratio, so quantization raises compression efficiency at the cost of distortion.
In summary, video compression relies on two mechanisms: one is eliminating correlation, which causes no distortion; the other is transforming the information into some domain, ranking it by priority, and discarding or quantizing the low-priority parts.
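To make the second mechanism concrete, here is a minimal sketch in Python (NumPy/SciPy), not taken from the talk: it transforms an 8×8 block to the frequency domain with a 2D DCT, quantizes the coefficients with a uniform step, and reconstructs the block. A larger quantization step zeroes out more high-frequency coefficients (fewer values to entropy code) at the cost of more distortion.

```python
import numpy as np
from scipy.fft import dctn, idctn

def quantize_block(block, step):
    """Transform an 8x8 block to the frequency domain, quantize, and reconstruct.

    Illustrates why quantization trades distortion for compression: coarser steps
    zero out more (mostly high-frequency) coefficients, so fewer values remain to
    entropy code, but the reconstruction error grows.
    """
    coeffs = dctn(block, norm="ortho")           # spatial domain -> frequency domain
    q = np.round(coeffs / step)                  # lossy step: many coefficients become 0
    recon = idctn(q * step, norm="ortho")        # dequantize and transform back
    nonzero = int(np.count_nonzero(q))           # proxy for the bits still to be spent
    mse = float(np.mean((block - recon) ** 2))   # distortion introduced by quantization
    return nonzero, mse

block = np.random.rand(8, 8) * 255
for step in (4, 16, 64):
    print(step, quantize_block(block, step))     # larger step -> fewer coefficients, higher MSE
```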
Based on these compression principles, over the past 50 years the video coding community has produced standard after standard. Although the standards keep evolving, they all rest on the same framework of partitioning, prediction, transform, quantization, and entropy coding, which has not changed. In the past year, JVET finalized the VVC standard, and beyond VVC it is committed to exploring both traditional coding and neural-network coding.
In China, after finalizing AVS3, the community is likewise digging into traditional coding and neural-network coding in the hope of further improving coding efficiency. On the machine-vision side, MPEG has established the Video Coding for Machines (VCM) working group, and China has established the Data Coding for Machines (DCM) working group.
Take VVC as an example. Compared with the HEVC standard published in 2013, VVC, finalized last year, doubles the compression performance. But a closer look at how the number of modes in each module has changed shows that intra prediction, inter prediction, and transform have all gained many more modes, which means the average compression gain contributed by each individual mode has become smaller.
Each coding mode is a mathematical expression written down by codec experts based on their own study and understanding of the video signal. In essence, every mode is a mathematical model, and the models people can handle are quite simple: linear models, exponential functions, logarithmic functions, polynomials, and so on, usually with only a few parameters, at most a few dozen. This is why further improving compression performance is getting harder: the models people can distill from observed regularities are relatively simple and have limited expressive power, while the inherent regularity in video is essentially unlimited.
From a modeling perspective, AI-based neural network models can keep improving their mathematical expressive power by adding parameters. Mathematics has rigorously shown that, given enough parameters, neural networks can approximate essentially arbitrary functions, so their expressive power keeps growing. Hand-designed modes have only a handful or a few dozen parameters, whereas a neural network model can easily have millions, and Google has even introduced super-large models with hundreds of millions of parameters.
In terms of the upper bound, AI-based video compression will certainly have a higher performance ceiling than traditional video compression.
From the perspective of redundancy in the video signal itself, as mentioned above, traditional codecs have spent the past fifty years eliminating spatial redundancy, temporal redundancy, and information-entropy redundancy.
Beyond these three, there are other redundancies that still leave a lot of room for improving compression. The first is structural redundancy. The two flowers at the bottom right are very similar; if the first flower has already been encoded, much of the information needed to encode the second can be derived from it and does not need to be fully coded. The second is prior-knowledge redundancy. Look at the figure on the upper right: if you cover the right half of the face with your hand, only the left half remains, yet because faces are nearly symmetrical we can still imagine the covered part. That is because people carry the prior knowledge that faces are approximately symmetrical. Likewise, a machine can memorize such prior knowledge, so that information does not need to be transmitted from the encoder to the decoder.
So structural redundancy and prior-knowledge redundancy also matter a great deal for video compression. Traditional coding is not incapable of exploiting them, but AI and neural networks can exploit these additional redundancies far more efficiently and naturally.
2. Character Video Generative Coding
Let's first look at a simple two-frame coding problem. The encoder first sends the decoder the information for the first image; the decoder receives and decodes it, and we use it as the reference frame. Given that, how do we compress the current frame?
(Top-right two pictures) In traditional coding, the current image is divided into blocks, and for each block the most similar reference block is found in the reference frame. The relative displacement between the current block and its reference block is the motion vector. From the reference block a prediction of the current block is formed, yielding the most likely prediction of the current frame, and the difference between that prediction and the current frame is encoded.
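As an illustration of this block-based inter prediction, the sketch below (Python/NumPy, a simplified example of my own rather than any real codec) performs an exhaustive block-matching search: for each block of the current frame it finds the best match in the reference frame within a search window and returns the motion vectors plus the residual that would still need to be coded.

```python
import numpy as np

def block_match(cur, ref, bs=16, search=8):
    """Minimal full-search block matching (illustrative, not an actual codec).

    For every bs x bs block in the current frame, find the best-matching block in
    the reference frame within +/-search pixels, and return the motion vectors
    plus the prediction residual that would still need to be encoded.
    """
    cur = cur.astype(float)
    ref = ref.astype(float)
    h, w = cur.shape
    mvs = np.zeros((h // bs, w // bs, 2), dtype=int)
    residual = np.zeros_like(cur)
    for by in range(0, h - bs + 1, bs):
        for bx in range(0, w - bs + 1, bs):
            cur_blk = cur[by:by + bs, bx:bx + bs]
            best, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - bs and 0 <= x <= w - bs:
                        sad = np.abs(cur_blk - ref[y:y + bs, x:x + bs]).sum()
                        if best is None or sad < best:
                            best, best_mv = sad, (dy, dx)
            mvs[by // bs, bx // bs] = best_mv
            dy, dx = best_mv
            pred = ref[by + dy:by + dy + bs, bx + dx:bx + dx + bs]
            residual[by:by + bs, bx:bx + bs] = cur_blk - pred
    return mvs, residual   # both would be entropy coded in a real codec
```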
Compared with image compression, video coding is very efficient because its temporal prediction exploits strong temporal correlation. But the bit rate still cannot be very low, because many things must be encoded: block partitioning information, the motion information of each block, and the residual values. So although the compression efficiency is far higher than image compression, it does not reach ultra-low bit rates.
To achieve ultra-low-bit-rate compression, we propose an AI generative compression method. (Bottom-right two pictures) The image is no longer divided into blocks; instead, the whole image is transformed into a feature domain by a neural network, and a small number of keypoints are extracted in that feature domain. Only the keypoints need to be transmitted; after receiving them, the decoder uses them to drive generation of the current frame from the reference frame.
The number of keypoints is configurable; in this example there are ten, so each frame only needs to transmit a few dozen values, and the bit rate is far lower than with traditional coding.
For an entire video, the first image can be transmitted with traditional coding and subsequent images with AI generative coding: the encoder extracts each frame's keypoints and transmits them to the decoder. How does the decoder generate a frame? First, the keypoints of the reference frame are extracted and fed into a neural network together with the decoded keypoints of the current frame, producing sparse motion fields in the feature domain.
These sparse motion fields are then fed into the dense motion network to obtain a dense motion field, along with an occlusion map. Finally, the reference frame, the dense motion field, and the occlusion map are fed into the generator together to produce the current frame.
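A rough structural sketch of such a keypoint-driven decoder is shown below in PyTorch. The module names, channel counts, and the way keypoints are broadcast into a spatial map are illustrative assumptions of mine, not Alibaba's actual model; the real dense motion network and generator are far more sophisticated. The sketch only captures the data flow described above: keypoint differences form a sparse motion, a dense-motion network predicts a dense flow plus an occlusion map, and a generator turns the warped reference frame into the output frame.

```python
import torch
import torch.nn as nn

class KeypointDrivenDecoder(nn.Module):
    """Illustrative sketch of a keypoint-driven generative decoder (hypothetical modules)."""

    def __init__(self, num_kp=10):
        super().__init__()
        self.dense_motion_net = nn.Sequential(        # sparse motion -> dense motion + occlusion
            nn.Conv2d(3 + 2 * num_kp, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),           # 2 flow channels + 1 occlusion logit
        )
        self.generator = nn.Sequential(               # warped reference -> generated frame
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, ref_frame, ref_kp, cur_kp):
        # ref_kp, cur_kp: (batch, num_kp, 2) keypoint coordinates in the feature domain
        sparse_motion = cur_kp - ref_kp               # per-keypoint displacement
        b, _, h, w = ref_frame.shape
        kp_map = sparse_motion.flatten(1)[:, :, None, None].expand(b, -1, h, w)
        out = self.dense_motion_net(torch.cat([ref_frame, kp_map], dim=1))
        flow, occlusion = out[:, :2], torch.sigmoid(out[:, 2:3])
        warped = self.warp(ref_frame, flow)           # warp reference with the dense motion field
        return self.generator(torch.cat([warped * occlusion, occlusion], dim=1))

    @staticmethod
    def warp(frame, flow):
        b, _, h, w = frame.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        return nn.functional.grid_sample(frame, grid + flow.permute(0, 2, 3, 1), align_corners=True)
```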
This is the visualization result of the key points in the feature domain.
Take the first row as an example. The first pair is the reference image and its keypoints, and the second pair is the image to be encoded and its keypoints. The ten colored images in the middle show the motion information carried by each keypoint in the feature domain: the third one reflects the overall frontal motion of the face, the later ones may reflect motion around the outside of the head, and the ones toward the right may reflect motion of the chin or lips. Finally, the motion fields from the ten feature maps are merged to obtain the dense motion field.
This is a visual illustration of each stage in the full pipeline of keypoint-driven generation.
The first column is the reference frame, the second column is the current frame, and the third column is the sparse motion field produced in the first decoding step from the transmitted keypoints. In this case the sparse field for each image is a 4×4 matrix, and you can see the 4×4 squares in the picture; this is the sparse motion field. Applying the sparse motion field to the reference frame gives the coarse approximation of the current image in the fourth column: the position and motion of the face there are already very close to the current frame, but the texture details still fall short. The sparse motion field then passes through a more complex motion model to obtain the dense motion field, which is applied again to produce the finer warped image in the sixth column. Finally, the occlusion map is applied to the warped image to obtain the generated image of the current frame.
We tested the AI generative compression scheme on a talking-head dataset; the subjective comparison is shown below.
The two columns on the left are results from the latest VVC reference software, and the two columns on the right are results from the AI generative compression scheme. Our bit rate is slightly lower than VVC's, yet the comparison clearly shows far better picture quality. VVC suffers from severe blocking artifacts and blurring, while the AI generative scheme preserves the details of hair, eyes, and eyebrows much better, and the smoothness of head motion and the naturalness of expressions are also significantly improved.
This is a quality comparison at similar bit rates, and the improvement can be described as a generational difference in quality.
What happens when the AI generative compression scheme is used at even lower bit rates?
In this experiment, the VVC bit rate stays the same while the AI generative scheme's bit rate is reduced to one third of VVC's. The results show that the generated quality is still better than VVC's.
The test videos here have a resolution of 256×256. At this resolution, the AI generative compression scheme needs only 3-5 kbps to support video calls between users. It follows that even on weak or ultra-weak networks, the scheme can still support audio and video calls.
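A rough back-of-envelope, using my own illustrative assumptions rather than figures from the talk, shows why the keypoint payload lands in this range:

```python
# Illustrative estimate only: assume 10 keypoints, 2 values each, 16 bits per value, 25 fps.
keypoints, values_per_kp, bits_per_value, fps = 10, 2, 16, 25
raw_bps = keypoints * values_per_kp * bits_per_value * fps
print(raw_bps)  # 8000 bps = 8 kbps raw; temporal prediction and entropy coding push it lower
```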
3. Machine Vision Coding
The original motivation for our machine-vision coding work is that in today's video applications, video encoding/decoding, video processing, and machine-vision analysis are all separate stages; we hope to combine them into a unified system that can be optimized and trained end to end.
We chose an object detection task. This image (top right), for example, might come from a surveillance camera or an autonomous-driving camera. Object detection determines which objects are in the image, and each object carries two pieces of information: localization (the boxes in the figure) and category recognition (pedestrian, vehicle, and so on).
We chose object detection first because it is the most widely used technology in contemporary machine vision, and second because it is the foundation of many other vision tasks. Only after object detection is done can pose recognition be performed; only after detecting that an "object" is a person can we further judge behaviors such as falling or walking; and only after pose recognition can event analysis continue.
For an input image, a neural network at the encoder converts the image from the pixel domain into multiple feature maps, which are entropy coded and transmitted to the decoder. The decoder parses the feature maps and then both reconstructs the image and completes the machine-vision detection task.
On the encoder side we proposed an innovative inverse-bottleneck structure (pictured on the right): the network is designed to be wide first and then narrow. Network models in machine vision generally grow more channels as the layers deepen, making each layer denser and the vision task more accurate. Compression is the opposite: it must reduce the bit rate and cannot afford to transmit too much data. So how can compression and vision be unified?
We found that there is a great deal of redundancy across the feature channel maps, and this redundant information can be compressed. We therefore designed the model as a wide-then-narrow inverse-bottleneck structure, which greatly improves compression efficiency while having essentially no impact on machine-vision detection accuracy.
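Below is a minimal PyTorch sketch of what a wide-then-narrow (inverse-bottleneck) feature encoder could look like; the channel counts and layer depths are my own illustrative choices, not the actual proposal. The point is simply that the final stage squeezes the channel dimension so that far fewer feature maps need to be entropy coded.

```python
import torch.nn as nn

class InverseBottleneckEncoder(nn.Module):
    """Wide-then-narrow feature encoder sketch (illustrative channel counts).

    Unlike typical vision backbones that widen with depth, the last stage narrows
    sharply so the transmitted feature maps carry far fewer channels.
    """

    def __init__(self, in_ch=3, wide=256, narrow=16):
        super().__init__()
        self.analysis = nn.Sequential(
            nn.Conv2d(in_ch, wide, 5, stride=2, padding=2), nn.ReLU(),   # wide: keep task-relevant detail
            nn.Conv2d(wide, wide, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(wide, narrow, 5, stride=2, padding=2),             # narrow: few channels to entropy code
        )

    def forward(self, x):
        return self.analysis(x)    # feature maps handed to the entropy coder / decoder side
```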
Since the whole system must perform both the compression task and the machine-vision recognition task, we combine the human-vision loss and the machine-vision loss into a joint loss function for overall optimization, and propose an iterative search to determine the weighting among the loss terms.
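One hedged way to write down such a joint objective is sketched below; the specific terms and default weights are illustrative assumptions, with the weights standing in for whatever the iterative search would determine.

```python
def joint_loss(rate_bits, recon, target, det_loss, w_rate=1.0, w_pixel=0.01, w_task=1.0):
    """Illustrative joint rate / human-vision / machine-vision objective.

    rate_bits: estimated bits of the entropy-coded features (compression cost)
    recon, target: reconstructed and original images (human-vision distortion)
    det_loss: detection loss from the downstream vision network (machine vision)
    The weights stand in for those found by the iterative search mentioned above.
    """
    pixel_loss = ((recon - target) ** 2).mean()
    return w_rate * rate_bits + w_pixel * pixel_loss + w_task * det_loss
```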
In the MPEG-VCM standards group, many companies around the world submit proposals.
Compared with the latest VVC standard, our machine-vision compression solution improves compression performance by 41.74% on the COCO dataset. Across recent MPEG-VCM meetings, our proposal has remained first in performance.
Here are a few examples of performance comparisons.
In the upper-left image the scene is very dark, and the machine must recognize how many people are in it. The leftmost image is the ground truth: it boxes the person's position and labels it "person" with 100% probability. VVC and our scheme compress this image at the same bit rate, and each decoder produces its own distorted decoded image.
Running recognition on the VVC-decoded image, the boy in the red short-sleeved shirt is not detected at all, whereas our solution detects him, boxes his position, and labels him "person" with 98% probability. That does not reach 100%, but it is a large improvement over VVC.
The ground truth in the lower-right corner boxes six people. Again compressing the image at the same bit rate, only one person (white box) can be identified in the VVC-decoded image, while our solution identifies four, a very large performance improvement over VVC.
4. Future Prospects
First, for character video coding, our goal is to achieve ultra-low-bit-rate video calls and video conferences in complex scenes with multiple people, multiple objects, and multiple motions.
For visual analysis tasks, our goal is separable multi-task coding: a single encoding pass at the encoder and multiple branches at the decoder, forming a unified multi-task system.
That concludes this talk. Thank you!
"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Product Technology Exchange Group, discuss audio and video technologies with industry leaders, and get more industry latest information.