On July 24, WICC 2021, the Global Internet Communication Cloud Conference themed "New Horizons Connecting the Future," successfully concluded in Beijing. Huang Zhenkun, a video algorithm expert at Rongyun, the conference organizer and a leading global internet communication cloud provider, delivered a speech titled "Artificial Intelligence-Based Video Coding Optimization" at the "RTC New Technologies and Applications" technical sub-forum.
[Image: Rongyun shares video coding optimization technology at the WICC 2021 sub-forum]

Figure 1 Huang Zhenkun, Rongyun video algorithm expert, delivering his speech at WICC

According to a research report by Cisco, global mobile data traffic will reach 930 exabytes per year by 2022, equivalent to transmitting every movie ever made over the global network every 5 minutes, and video's share of mobile data traffic will soar to 79%. Under such enormous transmission pressure, video coding and compression technology becomes especially important. At WICC, Huang Zhenkun therefore focused on cutting-edge video compression technology, walking developers through the latest research in video coding, Rongyun's exploration and practice, and the field's future prospects according to the needs of different scenarios.

Rongyun's video compression technology and solutions in surveillance scenarios

At WICC 2021, Huang Zhenkun took traffic surveillance video as an example: with the rapid development of smart transportation, the volume of traffic surveillance video data has exploded, placing tremendous pressure on existing transmission and storage systems. Improving compression efficiency in surveillance scenarios is therefore critical.

Huang Zhenkun believes that to compress traffic surveillance video efficiently, the background area must be carefully distinguished from the motion area according to the characteristics of the scene. Typical background areas include buildings, trees, and so on; they occupy a relatively large portion of the frame and change little over time. Moving areas include vehicles, pedestrians, and so on, and occupy only a small part of the frame. A typical surveillance video is shown in Figure 2, where the moving vehicle is the foreground area. Overall, the subtle changes between adjacent frames should be the focus of surveillance video compression.


Figure 2 Typical surveillance scenario

Given these characteristics, the early practice in the industry was to select a long-term reference frame from the reconstructed frames and combine it with the existing short-term reference frames to provide references for inter-frame prediction of the current frame. However, the selected long-term reference frame may contain foreground objects, resulting in an "unclean" background frame.

To solve this problem, Rongyun uses LaBGen-P to extract background frames. LaBGen-P applies a pixel-level median filtering mechanism: based on a motion-detection selection mechanism, the pixel with the least motion is chosen as the background pixel, and by computing inter-frame differences a pure background frame can be extracted.
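The pixel-level median idea can be sketched in a few lines of Python with NumPy. This is a simplified illustration of temporal median filtering only, not the actual LaBGen-P implementation, which additionally ranks pixels by estimated motion before taking the median:

```python
import numpy as np

def extract_background(frames):
    """Estimate a clean background frame from a stack of video frames.

    For each pixel position, take the temporal median across all frames:
    a pixel covered by a briefly passing foreground object in only a few
    frames keeps its background value in the majority of frames, so the
    median recovers the background.

    frames: array of shape (T, H, W) or (T, H, W, C), e.g. uint8.
    """
    frames = np.asarray(frames)
    return np.median(frames, axis=0).astype(frames.dtype)
```

Feeding a handful of frames in which a small object moves across a static scene returns a frame with the object removed, which is exactly the "clean" long-term reference the encoder wants.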

In addition, extracting the background frame with LaBGen-P and adding it to the long-term reference frame list not only prevents network loss and decoding errors from propagating to subsequent P-frames, but, combined with a feedback mechanism for long-term reference frames, also helps repair lost video data.

Experiments show that, compared with the original OpenH264 encoder without background frames, the standard test video CiscoVT2people_320x192_12fps.yuv was reduced from 56 KB to 54 KB.

Research models and practical exploration of region-of-interest video compression

People focus on different things in different scenarios. Still taking smart transportation as an example, when traffic police examine a vehicle in violation, the focus is the license plate: whether the plate is legible directly determines whether the law-enforcement evidence is usable. Under limited bandwidth, guaranteeing the quality of the region of interest is therefore the key to video compression.

Traditional encoding methods are dedicated to decorrelating the image. While this removes information redundancy, it ignores visual redundancy. The typical approach in recent research is therefore to obtain the region of interest through object detection on the video and then allocate more bitrate to it, improving its coding quality. Figure 3 shows a typical region of interest found by object detection and the resulting bitrate allocation, in which the coding quality of the region of interest is preserved.


Figure 3 Typical bitrate allocation effect based on object detection

On how to allocate more bitrate to regions of interest, Huang Zhenkun introduced developers to theoretical research from Wuhan University, representing academia, and to Rongyun's exploration and practice, representing industry.

In 2021, Wuhan University proposed a bitrate allocation model based on game theory. Its main points are:

The coding quality of the region of interest acts as the leader, and the coding quality of the non-interest region as the follower;

Under a set target bitrate, the leader decides the bitrate assigned to the region of interest, and the follower decides the bitrate assigned to the non-interest region;

The utility of the region of interest depends not only on its own quality but also on the coding quality of the whole image;

The non-interest region can only use the remaining bitrate to achieve its optimal utility.
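The leader-follower structure above can be sketched numerically in Python. The quality model here, log(1 + rate), is a toy rate-distortion proxy chosen for illustration, not the model from the Wuhan University paper, and the weight is hypothetical:

```python
import numpy as np

def allocate_bits(total_rate, roi_weight=0.7, grid=1000):
    """Toy Stackelberg (leader-follower) bit allocation sketch.

    The leader (region of interest) picks its rate r to maximize a
    utility that mixes its own quality with the overall image quality;
    the follower (non-interest region) then simply spends the remaining
    total_rate - r, which is its best response under this toy model.
    """
    candidates = np.linspace(0.0, total_rate, grid)

    def leader_utility(r):
        roi_quality = np.log1p(r)
        bg_quality = np.log1p(total_rate - r)  # follower uses the rest
        return roi_weight * roi_quality + (1 - roi_weight) * bg_quality

    best = candidates[np.argmax([leader_utility(r) for r in candidates])]
    return best, total_rate - best
```

Because the leader's utility also contains the follower's quality term, it stops short of grabbing the entire budget, mirroring the paper's point that the ROI's utility depends on the whole image.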

Rongyun's region-of-interest video coding scheme combines motion-area detection with the game-theory-based bitrate allocation scheme, forming a scene-aware region-of-interest detection and bitrate allocation scheme. Its main feature is training different YOLO models: the same YOLO pre-trained model is fine-tuned for different scenarios.


Figure 4 Rongyun ROI video coding scheme

Specifically, a detection model trained on people is used for videos of people, and a model trained on cars for videos of cars. For motion detection, ViBe is used: a background model of N sample values is maintained for each pixel, and the similarity between the pixel to be classified and its background model is computed; if they are similar, the pixel is classified as background.
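The ViBe-style classification step can be sketched as follows. This is a minimal NumPy sketch of the match-counting rule only; real ViBe also performs random model updates and neighborhood diffusion, which are omitted here, and the radius and match threshold are illustrative:

```python
import numpy as np

def classify_pixels(frame, model, radius=20, min_matches=2):
    """ViBe-style per-pixel background classification sketch.

    model: N background samples per pixel, shape (N, H, W).
    frame: current frame, shape (H, W).
    A pixel is background if its value lies within `radius` of at least
    `min_matches` of its N background samples.
    Returns a boolean (H, W) mask, True = background.
    """
    # Widen to int16 so the subtraction cannot wrap around in uint8.
    diff = np.abs(model.astype(np.int16) - frame.astype(np.int16))
    matches = diff <= radius
    return matches.sum(axis=0) >= min_matches
```

Inverting the returned mask gives the motion (foreground) regions that are then merged with the object-detection results to form the final region of interest.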


Figure 5 The extraction effect of the region of interest combined with target and motion detection

Experiments show that after the region of interest is extracted, the game-theory-based method allocates bitrate between the region of interest and the non-interest region, improving the coding quality of the region of interest under limited bandwidth without losing too much overall quality. The effect is shown in Figure 6: the quantization parameter of the face region is smaller than that of the background, so facial detail is preserved even when bandwidth is limited.


Figure 6 The effect of bitrate allocation based on the region of interest

The latest research and application prospects of video compression technology

At present, research on video compression mainly centers on artificial-intelligence-based deep learning techniques and end-to-end video compression frameworks.

Deep learning can replace modules of the hybrid coding framework, such as bitrate allocation, block partitioning, and intra- and inter-frame prediction. Taking inter-frame prediction as an example, experimental results show that compared with HEVC, a deep-learning-based method can achieve an average bitrate reduction of 1.7% (up to 8.6%) in the low-delay P configuration. The latest end-to-end research compresses video entirely with deep networks using only a small number of reference frames; researchers have proposed a recurrent autoencoder and a recurrent probability estimation model.

According to Huang Zhenkun, these technologies are still at a very cutting-edge research stage, but they have broad application prospects. First, replacing the hybrid video compression framework with deep learning networks can improve coding efficiency and has important application value in WebRTC; second, allocating bitrate with deep reinforcement learning can reduce stuttering in WebRTC video transmission; third, deep-learning-based bandwidth estimation models will also have advantages over traditional bandwidth estimation methods.

Concluding remarks

In the real-time audio and video field, video compression is a crucial technology. To guarantee both high video quality and high transmission efficiency, it must weigh the total cost of storage, codec complexity, computing power, and bandwidth, balancing image quality, bitrate, and performance. As 5G infrastructure matures, new video application scenarios keep emerging, and video compression technology will keep iterating and innovating. Rongyun will remain deeply involved and help lead its development.

