Guide: On October 21, 2021, the QCon Global Software Development Conference was held in Shanghai. Chen Gong, VP of Technology at NetEase Intelligent Enterprise, produced the "Converged Communication Technology in the AI Era" track and invited a number of technical experts to share talks on related topics.
We will introduce the four talks from this track one by one. This second installment covers the exploration and practice of key video communication technologies.
Guest introduction: Qingrui, senior technical expert, NetEase Yunxin Audio and Video Laboratory.
Foreword
Whether in entertainment and social networking, online learning, or remote banking, video has become one of the most important ways people interact, and users' expectations for video quality keep rising. Low latency, strong resistance to weak networks, and clear picture quality therefore pose demanding technical challenges for enterprises.
As an expert in converged communication cloud services, NetEase Yunxin covers the major video scenarios, including low-latency real-time audio and video, live broadcast scenarios that tolerate some delay, and on-demand scenarios where delay is not a concern. This article introduces the key video technologies NetEase Yunxin applies in each of these scenarios.
NetEase Yunxin Video Technology Deployment
The following figure shows the main network diagram of NetEase Yunxin's converged communication system. On the left and right are the end-side devices, which can be of any type: mobile phone, tablet, PC, web, and so on. In the center are the servers, including the forwarding servers and the MCU servers. If a scenario tolerates some delay, the stream is forwarded to the interactive live broadcast server. In the RTC scenario, Yunxin's video technology is deployed mainly on the end side; for live broadcast and on-demand services, Yunxin mainly provides video transcoding as part of the live stream relay deployment.
Video Technology for RTC Scenarios
The following introduces NetEase Yunxin's video technology in the RTC scenario, covering three main areas.
New-Generation Audio and Video SDK Architecture
The following figure is the architecture diagram of NetEase Yunxin's audio and video SDK. At the end of last year, NetEase Yunxin released G2, its new generation of audio and video SDK. The SDK architecture has five layers, with the media engine layer at the core; it consists of three engines: the video engine, the audio engine, and the communication engine.
NetEase Yunxin Video Engine Architecture and Application Scenarios
The following figure shows the architecture of Yunxin's video engine. It is divided into five modules: video pre-processing, video encoding, video QoE, video decoding, and video post-processing. On the capture side, the business Yunxin supports falls into two types: real images captured from a camera, and images captured from the screen for screen sharing. Captured images are first sent to video pre-processing. Yunxin's business is distributed all over the world and runs on all kinds of devices, including many low-end and entry-level ones, so captured images often degrade because of the camera. To improve and restore the image quality, video pre-processing performs picture-quality processing before the frames enter encoding and compression, after which they are transmitted over the network.
Because networks vary widely, a video QoE module ensures that Yunxin users get a good video experience. After transmission over the network to the far end, the stream is decoded and post-processing is applied. Post-processing mainly reduces or compensates for the image-quality loss caused by compression and network transmission.
The following figure shows the application scenarios of the video engine. Yunxin's video scenarios fall into four types: real-time communication scenarios such as video conferencing, low-latency application scenarios, interactive live broadcast scenarios, and low-latency live broadcast scenarios for interactive teaching.
Key Technologies of the Video Engine
Video pre-processing
Video pre-processing mainly improves the end-to-end effect of real-time video. NetEase Yunxin's global business connects all kinds of devices, so video pre-processing is needed to improve picture quality.
- Video AI enhancement
Video enhancement is a relatively old technique that has been studied for many years, and with the advancement of AI and deep learning it has improved greatly. However, deep learning models carry a heavy computational cost. Yunxin's business is spread all over the world and must support all kinds of devices, especially on mobile; the Indian and Southeast Asian markets in particular have many entry-level devices.
These mobile devices are very sensitive to power consumption and performance: a slightly larger computation load quickly drains the battery. As a result, large, high-quality deep learning models are hard to deploy in these scenarios, while small models cannot guarantee the enhancement effect.
Ours is a communication business, so the enhanced video must also be transmitted to the far end. Enhancement may look good locally yet turn out badly after transmission: the decoded image still shows the enhancement, but its blocking artifacts are worse than without enhancement. Enhancement introduces more high-frequency components, which forces the encoder to compress harder and lose more.
Yunxin uses two methods to address the problems above: that training easily overfits, and that subjective quality can actually become worse after enhancement.
First, a scene recognition module identifies content such as text areas, sports scenes, and game scenes. Each scene type has its own model: for example, one model for game scenes and another for text scenes, or possibly the same model with different parameters. This keeps the computational cost affordable while ensuring good results.
Second, our model is a small one. As mentioned earlier, the model cannot be too small: if it is, its expressive power suffers. Our model is therefore a "lightweight model" with roughly 1-2K parameters. Many small models in the industry cannot match this: their parameter counts are below 1K, sometimes only a few hundred, with only three or four network layers. We can afford the slightly larger model because we have a self-developed, efficient inference framework, NENN. Compared with open-source inference frameworks it is specifically optimized for small models, so small models run much faster on it than on other open-source frameworks.
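To make the "1-2K parameter" scale concrete, here is a minimal PyTorch sketch of what an enhancement network of that size could look like: four 3x3 convolution layers predicting a residual on top of the input frame. This is purely illustrative; Yunxin's actual model and the NENN inference framework are proprietary and not shown here.

```python
import torch
import torch.nn as nn

class TinyEnhanceNet(nn.Module):
    """Illustrative ~1.6K-parameter, 4-layer residual enhancement network."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Predict a residual so the network only has to learn the enhancement delta.
        return torch.clamp(x + self.body(x), 0.0, 1.0)

if __name__ == "__main__":
    net = TinyEnhanceNet()
    print("parameters:", sum(p.numel() for p in net.parameters()))  # about 1.6K
    frame = torch.rand(1, 3, 360, 640)   # one 640x360 RGB frame in [0, 1]
    print(net(frame).shape)
```

At this size the per-frame cost stays small enough for entry-level mobile devices, which is exactly why an inference framework tuned for tiny models matters more than raw model capacity.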
- Video noise reduction
Some devices and cameras produce a lot of noise in dark scenes, and that high-frequency noise wastes bits. Removing it helps coding, helps transmission, and improves subjective quality.
Noise reduction in the RTC scenario faces the same constraints. Most of the traffic comes from mobile devices, and many regions use entry-level devices that are very sensitive to performance and power consumption, so complex, power-hungry algorithms cannot be used, while fast algorithms are often not effective. Improper noise reduction not only erases the noise but also removes useful high-frequency components, which hurts overall video quality.
NetEase Yunxin approaches this problem from the perspective of human visual perception. The eye's resolving power differs from scene to scene: in some scenes it is high and can distinguish many high-frequency components, while in others it drops sharply.
NetEase Yunxin therefore uses a human-eye sensitivity analysis that extracts, at the pixel level, the regions the eye is sensitive to. In those regions we would rather reduce the denoising strength and let some noise through than sacrifice high-frequency detail; where the eye cannot perceive the difference, noise can be removed aggressively. We also have a very simple but efficient noise estimation algorithm. Together these two produce a weighting value, so the denoising is both fast and effective.
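The sketch below illustrates the general idea under simple assumptions: a cheap noise estimate sets the denoising strength, and a gradient-based "sensitivity" map protects edges and text from being smoothed. It uses standard OpenCV calls and is not Yunxin's actual algorithm.

```python
import cv2
import numpy as np

def denoise_with_sensitivity(gray: np.ndarray) -> np.ndarray:
    """Denoise an 8-bit grayscale frame while protecting detail the eye is sensitive to."""
    # Cheap noise estimate: median absolute deviation of the Laplacian response.
    lap = cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F)
    sigma = float(np.median(np.abs(lap - np.median(lap)))) / 0.6745

    # Sensitivity map: strong gradients (edges, text) are where high frequencies matter most.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sensitivity = cv2.GaussianBlur(cv2.magnitude(gx, gy), (7, 7), 0)
    sensitivity = sensitivity / (sensitivity.max() + 1e-6)   # 0 = flat area, 1 = sharp detail

    # Denoising strength follows the noise estimate; the blend weight follows sensitivity,
    # so flat areas get full denoising while detailed areas keep the original pixels.
    strength = float(np.clip(sigma, 3.0, 15.0))
    denoised = cv2.fastNlMeansDenoising(gray, None, h=strength,
                                        templateWindowSize=7, searchWindowSize=21)
    weight = 1.0 - sensitivity
    return (weight * denoised + (1.0 - weight) * gray).astype(np.uint8)
```

A production RTC pipeline would use a much cheaper filter than non-local means, but the structure — noise estimate plus perceptual weighting — is the point being made in the text.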
Video encoding
Yunxin's video encoding supports the mainstream encoders, including the most widely used H.264 and H.265, and, based on a deep understanding of RTC, it has developed its own encoder, NE264CC.
The Yunxin encoder is very fast: quality can improve by 50%, and compared with H.265 the encoding speed can be up to 60 times faster. The figure below shows our self-developed NE264. H.264 is an excellent protocol: it has been in the industry for 20 years, has stood the test of time, and is currently the real-time communication protocol with the widest coverage. Based on H.264, Yunxin developed the NE264 encoder with fast mode decision, efficient sub-pixel search, adaptive reference frames, and CBR rate control.
From the figure below we can see that, compared with openh264, x264, and the iPhone's hardware encoder, Yunxin leads in both encoding speed and encoding quality. Bit-rate volatility is easy to overlook: for RTC, video quality and speed are one aspect, but another very important one is how much the bit rate fluctuates. In strict low-latency RTC scenarios, bit-rate fluctuation causes picture jitter and forces the resolution down. Here, too, NE264 has the smallest bit-rate fluctuation.
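To show why CBR rate control matters for keeping the bit rate stable, here is a toy per-frame controller: it tracks a virtual buffer against a constant per-frame bit budget and nudges QP up or down. It is only a sketch of the general technique; NE264's actual rate control is far more sophisticated.

```python
def cbr_qp_controller(target_kbps: float, fps: float, qp_init: int = 30,
                      qp_min: int = 20, qp_max: int = 45):
    """Toy CBR controller: adjust QP each frame so a virtual buffer stays near empty."""
    frame_budget = target_kbps * 1000.0 / fps   # bits available per frame
    buffer_bits = 0.0                           # how far we are over budget
    qp = qp_init
    while True:
        actual_bits = yield qp                  # caller reports bits of the last encoded frame
        buffer_bits += actual_bits - frame_budget
        # Over budget -> raise QP (coarser quantization); under budget -> lower QP (finer).
        if buffer_bits > 2 * frame_budget:
            qp = min(qp + 2, qp_max)
        elif buffer_bits > 0.5 * frame_budget:
            qp = min(qp + 1, qp_max)
        elif buffer_bits < -0.5 * frame_budget:
            qp = max(qp - 1, qp_min)

# Usage sketch:
# ctrl = cbr_qp_controller(target_kbps=800, fps=25)
# qp = next(ctrl)                      # QP for the first frame
# qp = ctrl.send(bits_of_last_frame)   # after encoding each frame
```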
The figure below is a comparison with x264-ultrafast, its fastest preset. Our speed is about 25% lower, but our compression rate is nearly 50% higher: at the same quality, where x264 needs one megabit of bandwidth, we only need about 500 kbps. That is the optimization for natural camera images.
For compression of screen-sharing content, the industry has many very good solutions, such as H.265+SCC, AV1, and H.264+SCC; these are all good ideas.
When NetEase Yunxin considered this issue, we concluded that H.264 is the most widely used protocol for RTC scenarios, and as a lightweight protocol its overhead is very small; protocols of the H.264 type carry the least cost.
On the other hand, even without changing the protocol or adding coding tools, and only optimizing the encoder, screen-sharing content itself leaves a great deal of room to exploit on the encoding side. Based on the H.264 protocol, we dug deeper into screen-sharing optimization to improve the effect. The following are some of our results: with the screen-sharing encoding algorithm enabled, our compression rate increases by 36.72% in screen-sharing scenarios while the speed is only 3%-4% slower. Compared with openh264, our compression rate is 41% higher with essentially unchanged speed.
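One reason screen content leaves so much room is that text and UI blocks behave very differently from camera blocks. A simple, hypothetical way to exploit this inside an unchanged H.264 encoder is to classify blocks first and then bias mode decision and quantization for the screen-content ones; the heuristic below (few distinct gray levels per block) is only an illustration of that idea, not Yunxin's algorithm.

```python
import numpy as np

def classify_blocks(gray: np.ndarray, block: int = 16, max_levels: int = 12) -> np.ndarray:
    """Label 16x16 blocks as screen content (few distinct gray levels) vs camera content."""
    h, w = gray.shape
    labels = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            tile = gray[by * block:(by + 1) * block, bx * block:(bx + 1) * block]
            # Text/UI blocks are dominated by a handful of flat colors with sharp edges.
            labels[by, bx] = len(np.unique(tile)) <= max_levels
    return labels   # True = screen-content block: bias coding decisions toward sharper, finer quantization
```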
Let's look at the self-developed NE265, which is still being iterated. NE265 is characterized by an implementable design with an efficient architecture, and its computationally complex algorithms have been finely optimized in multiple dimensions. Anyone familiar with encoders knows the veryslow preset: our speed is 64 times faster than it, and that is not even our fastest preset, which is more than 200 times faster.
We also compared against H.264, using the x264 faster preset, since the main disadvantage of H.265 is its slow speed. As shown, this NE265 preset is nearly 30% faster than x264 faster, while the compression rate improves by 34.78% on average. The test sequences include the official standard test sequences as well as sequences from Yunxin's RTC business and social-entertainment sequences.
Based on a deep understanding of RTC and audio-video communication, we developed NEVC, a multi-scale video compression technology. Compared with NE265 the speed is basically the same, but the compression rate is higher: in the example, the texture on the right is clearly better preserved while the one on the left is largely blurred. After video encoding, the compressed stream is sent to the network. For RTC the network is the most complicated part, especially in a global business with all kinds of networks. To guarantee the best possible video quality across such varied and complex networks, we have a video QoE module, which safeguards five aspects: video fluency, clarity, quality stability, delay, and performance/power consumption.
Video QoE
- Video quality control module
Video quality control is considered along three dimensions: fluency, clarity, and quality stability. After capture, pre-processing, and encoding, the stream finally reaches the network, which can take many forms: very low bandwidth, continuous packet loss, or heavy jitter.
It is impossible to serve every network with a single resolution, frame rate, and bit rate; doing so produces very poor results on some of them. The video quality control module, called VQC, first receives the available network bandwidth estimated by network QoS, allocates a suitable video resolution and frame rate for that bandwidth, and configures the encoder accordingly to achieve the best video. At the same time, the incoming picture varies with the environment: some scenes are noisy, some are dark. The VQC module collects this information and decides which video algorithm switches to turn on or off, or which video parameters to adjust, covering enhancement, noise reduction, and certain encoding algorithms.
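The first half of that logic — mapping a bandwidth estimate to a resolution/frame-rate/bit-rate target — can be sketched as a simple ladder lookup. The thresholds and sizes below are made-up illustrative values, not Yunxin's actual VQC configuration.

```python
# (bandwidth_kbps_threshold, (width, height), fps) — illustrative ladder only
LADDER = [
    (2000, (1280, 720), 30),
    (1200, ( 960, 540), 30),
    ( 700, ( 640, 360), 25),
    ( 350, ( 480, 270), 20),
    (   0, ( 320, 180), 15),
]

def pick_video_config(estimated_kbps: float) -> dict:
    """Map the QoS bandwidth estimate to a resolution / frame-rate / bit-rate target."""
    for threshold, size, fps in LADDER:
        if estimated_kbps >= threshold:
            # Reserve ~10% of the estimate as headroom for audio, FEC and retransmission.
            return {"size": size, "fps": fps, "bitrate_kbps": int(estimated_kbps * 0.9)}

print(pick_video_config(900))   # -> 640x360 @ 25 fps, ~810 kbps
```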
- Device control module
Yunxin's business is spread all over the world, across very different networks: the very poor networks of parts of Asia, Africa, Latin America, India, and Southeast Asia, as well as Europe and the United States; domestic networks are relatively good. In addition, there are many types of terminal platforms: high-end phones, low-end phones, PCs, and tablets. Yunxin's device control module configures our video algorithms according to the network characteristics of each region and the type of device platform.
For example, poorer devices use a lower resolution and frame rate, while better devices use a higher frame rate and more advanced algorithms.
In practice the network is not static, and factors such as device status and CPU/GPU occupancy also change, so the device control module monitors these data in real time and adjusts the algorithms on the fly to stay at the optimum.
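A minimal sketch of such a feedback loop is shown below, assuming hypothetical callbacks for reading device load and toggling features; Yunxin's real device control module is not public, so this only illustrates the tier-up/tier-down idea.

```python
import time

def device_control_loop(get_cpu_load, get_encode_fps, set_feature, target_fps: int = 25):
    """Periodically downgrade optional video features when the device falls behind.

    get_cpu_load(), get_encode_fps() and set_feature(name, enabled) are hypothetical
    hooks into the engine; the tier thresholds are illustrative.
    """
    tier = 2                                    # 2 = all features, 1 = reduced, 0 = minimal
    while True:
        load, fps = get_cpu_load(), get_encode_fps()
        if (load > 0.85 or fps < 0.8 * target_fps) and tier > 0:
            tier -= 1                           # device is struggling: drop one tier
        elif load < 0.6 and fps >= target_fps and tier < 2:
            tier += 1                           # headroom recovered: restore one tier
        set_feature("ai_enhance", tier == 2)    # AI enhancement only on the top tier
        set_feature("denoise",    tier >= 1)    # denoising allowed on mid tier and above
        time.sleep(2.0)                         # re-evaluate every couple of seconds
```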
Video decoding
After QoE, the stream reaches the receiving end for video decoding. Yunxin's video decoding is very efficient and supports almost all video formats, so interoperability is not a problem.
Video post-processing
Video post-processing restores and improves video quality through screen-content optimization and video super-resolution. Yunxin's super-resolution network has 2K-4K parameters and fewer than 8 layers. With our self-developed AI inference engine, which is specifically optimized, the speed is very fast. To improve the super-resolution effect, we also build targeted data sets: data are collected at different focal lengths with iPhones and other phones and used for training on real data, together with data pre-processing and augmentation to secure the effect. The main advantages are efficiency and speed.
In the table below, the first three rows are the processing times of traditional methods; next is our self-developed super-resolution, and then a well-known lightweight network. In terms of processing time, Yunxin's AI super-resolution is more than 30 times faster than the well-known lightweight AI network; in terms of effect, its video quality far exceeds the non-AI methods, and the gap to the classic network is extremely small and basically invisible.
The second item is desktop-sharing optimization. For desktop screen sharing, post-processing is applied on top of the H.264 encoding to optimize text scenes. For deep learning, the biggest difficulty of screen sharing is that its resolution is usually very large. The module provides high-precision text recognition and enhances the text after decoding, and our self-developed inference framework NENN keeps the speed up. The figure shows the text enhancement effect.
Video Technology for the Live and On-Demand Business
Live and on-demand architecture
The forwarding server introduced above is basically the low-latency RTC line. For live broadcast and on-demand, the stream is instead pushed to the live streaming server and distributed through the CDN.
The live/on-demand link runs from the client, through stream pushing, to the edge media server, then on to live transcoding, and then to the CDN.
There are two problems with this link. First, the picture quality is already degraded when the device uploads it: the stream has been compressed, and the camera capture itself may have problems, which also causes loss. Second, after transcoding, when the stream is distributed through the CDN, the transcoded bit rate is very high.
To solve these two problems, Yunxin proposed its smart code ultra-clear technology. First, deep-learning video repair technology repairs or enhances the video before transcoding; then, coding based on human visual perception saves bit rate without degrading the subjective quality of the video.
The image first passes through the video repair module, which repairs, enhances, or beautifies it, and then goes through perceptual coding. Perceptual coding analyzes the video content, so a video analysis module runs before it.
Smart Code Ultra Clear Technology Architecture
Video Repair Technology
Video restoration is a relatively difficult technology in the industry because the degradation models vary widely: video can degrade for many reasons, such as camera noise, compression loss, over- or under-exposure caused by a poor camera, or incorrect focus.
Yunxin has developed a picture-quality evaluation algorithm that uses deep learning to obtain the degradation model of a video, and then applies a different restoration method for each degradation model: video noise reduction for noise, deblurring for blur, and texture enhancement and image correction when the texture is poor. Repairing after evaluation beautifies or enhances the subjective effect of the video.
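The assess-then-dispatch structure can be sketched as below. All the callables are hypothetical placeholders standing in for the quality-evaluation model and the individual repair models described above.

```python
def repair_video_frame(frame, assess, denoise, deblur, enhance_texture):
    """Dispatch a frame to a repair method based on an image-quality assessment.

    `assess(frame)` is assumed to return degradation scores in [0, 1], e.g.
    {"noise": 0.7, "blur": 0.2, "weak_texture": 0.1}; the other three callables
    are placeholders for the actual repair models.
    """
    scores = assess(frame)
    dominant = max(scores, key=scores.get)
    if scores[dominant] < 0.3:
        return frame                      # quality acceptable: pass the frame through untouched
    if dominant == "noise":
        return denoise(frame)             # e.g. spatial/temporal noise reduction
    if dominant == "blur":
        return deblur(frame)              # e.g. a learned deblurring model
    return enhance_texture(frame)         # texture enhancement / image correction
```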
Video Perceptual Coding Technology
After repair comes encoding. Yunxin's perceptual coding uses JND (just-noticeable difference) technology, which uses the smallest error the eye can perceive to measure how sensitive the eye is to distortion in different areas of the image.
JND has been proposed many times in the literature. As the figure below shows, objective distortion is a continuous curve, while what the eye perceives is a staircase. Wherever there is redundancy between the two, the encoder can save bit rate without a drop in subjective quality.
JND itself is a fairly traditional method, and traditional JND coding focuses on low-level image features such as texture, edges, brightness, and color.
What distinguishes Yunxin's JND is the addition of video content analysis. For the picture above, for example, we analyze the video to identify the foreground, faces, text, and other information, and then construct a JND model separately for each kind of region to save bit rate. This process outputs the foreground, text, and face regions, computes a JND coefficient for each, and feeds those coefficients into the encoder.
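One concrete way such per-region coefficients can reach an encoder is as per-macroblock QP offsets, as in this minimal sketch. The labels and offset values are illustrative assumptions, not Yunxin's actual JND model.

```python
def jnd_qp_offsets(base_qp: int, regions: dict) -> dict:
    """Assign per-region QP offsets from content analysis + JND-style weights.

    `regions` maps a macroblock index to a label from the analysis stage.
    A lower offset spends more bits where the eye notices distortion first.
    """
    offset_by_label = {"face": -4, "text": -3, "foreground": -2, "background": +3}
    return {mb: max(0, min(51, base_qp + offset_by_label.get(label, 0)))
            for mb, label in regions.items()}

# Example: text and face blocks get finer quantization, background gets coarser.
print(jnd_qp_offsets(30, {0: "text", 1: "face", 2: "background"}))
```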
The picture below shows the test results of smart code ultra-clear. Blue represents Yunxin, and the other colors are competing products. On the left is the subjective human-eye score, where higher is better; the compressed file size, obviously, is better the lower it is.
NetEase Yunxin Video Technology for the Entertainment and Social Industry Line
This is an area in which NetEase Yunxin invests heavily and delivers key results.
Beauty Technology
Yunxin's beauty technology provides 26 functions such as skin smoothing, whitening, and eye enlargement, more than 50 filters, age, gender, and gaze recognition and tracking, and supports 2D and 3D stickers. Other vendors in the industry offer these features too; what sets us apart is efficient processing speed at the same beauty quality, which is our core competitive advantage.
Consider the cost of basic beautification (skin smoothing, whitening, face slimming, and so on) for 720P video: on a Snapdragon processor, Yunxin's basic beauty pipeline can reach 30 fps. For overseas markets, especially India and Southeast Asia where entry-level models are everywhere, this is very competitive and makes the whole video experience completely different.
Background Segmentation Technology
Yunxin's background segmentation technology is trained on a large number of data sets. Its accuracy is relatively high, with an IoU of 0.93; its robustness is good; and its inference is fast, under 10 milliseconds. The figure below compares our accuracy with industry peers: the higher the accuracy, the better.
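For readers unfamiliar with the metric, IoU (intersection over union) is the standard measure behind a figure like 0.93; here is a minimal sketch of how it is computed for binary foreground masks.

```python
import numpy as np

def segmentation_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between a predicted and a ground-truth foreground mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)
```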
Landing Practice
With the technology covered, let's look at NetEase Yunxin's actual deployments. NetEase Yunxin's video engine has served more than 10,000 users worldwide.
Users that integrate both the SDK and the video engine include LOOK live streaming, NetEase Cloud Music's online KTV, NetEase Meeting, NetEase's internal POPO, and some third-party vendors building on the conference components.
NetEase News' live and on-demand application and Cloud Music's large-scale online concerts all use NetEase Yunxin's live and on-demand capabilities. For example, last year's famous concert that broke the audience record also used Yunxin's video engine. Going forward, we will continue to cultivate this technical field and bring more and better products to everyone.