With the development and application of 5G, users' quality expectations for audio and video communication keep rising. Users accustomed to high-definition viewing obviously cannot accept a return to the "mosaic era".

In a global Internet communication cloud service, however, networks and terminals are so varied that insufficient user bandwidth is inevitable. It is therefore important to use technical means to raise the resolution of images and video, so that users still receive high-definition content even at lower bandwidth.

This article shares an optimization scheme that uses image super-resolution technology to achieve high-definition WebRTC video transmission under limited bandwidth.

Main Factors Affecting Video Definition in Video Coding

In theory, for the same encoder and the same resolution, a higher bit rate yields higher video quality. Visually, however, there is an optimal bit rate for a given encoder and resolution.

The video compression scheme used for video calls and conferences is usually H264 or VP8. For H264 encoding, the recommended video bit rates for 720P and 1080P are 3500 Kbps and 8500 Kbps respectively.

On mobile devices, the recommended H264 encoder bit rates are as follows.

Bit rate level        1280 x 720    1920 x 1080
Very low bit rate     500 Kbps      1 Mbps
Low bit rate          1 Mbps        2 Mbps
Medium bit rate       2 Mbps        4 Mbps
High bit rate         4 Mbps        8 Mbps
Very high bit rate    8 Mbps        16 Mbps

(Recommended bit rates for different resolutions on the mobile terminal)

As the table shows, when a WebRTC video call uses the H264 encoder to compress 1080P high-definition video at a medium bit rate, 4 Mbps of bandwidth is required.

For end users with insufficient bandwidth, continuing to play 1080P video will cause freezes. Within the WebRTC framework, simulcast or SVC can be used to deliver streams of different resolutions or frame rates to different end users according to their network conditions.

Put simply, if a user's network cannot sustain 1080P video, send 720P video to that user instead; the required bit rate is half that of 1080P, which keeps video transmission as smooth as possible under limited bandwidth.

High definition has become a trend. How to obtain high-definition video on the terminal without changing the encoder (i.e., H264 or VP8) has become a topic that academia and industry need to study and solve together.

Application of Image Super-Resolution Technology in WebRTC

Image super-resolution (Super Resolution, SR) is an important computer vision technology for improving the resolution of images and videos: it reconstructs a low-resolution image into a high-resolution one.

In layman's terms, when we zoom in on a small image, it becomes blurry. Super-resolution reconstruction uses multiple pixels in the enlarged image to represent the content that a single pixel carried in the original, making the enlarged image as clear as possible.

With the development of deep learning, image super-resolution has shifted from traditional computer vision methods to solutions based on CNNs and Transformers.

The usual way to apply image super-resolution in WebRTC is to transmit low-resolution video and upscale it with super-resolution when each terminal renders it. Because the transmitted video is low-resolution, higher definition can be obtained with a lower volume of transmitted video data.

In WebRTC, the image super-resolution algorithm must satisfy two requirements at once: real-time performance and sharpness. After extensive research on and experiments with the industry's mainstream deep-learning super-resolution algorithms, we found that an optimized ESPCN meets these requirements.

ESPCN was proposed in the paper "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network", published at CVPR 2016. As the figure shows, its experimental results strike a balance between speed and accuracy.

Due to space limitations, the specific structure of the model will not be introduced here.

(Comparison of the accuracy and speed of various algorithms)

ESPCN model experiment

In this experiment, we use PyTorch to build, train, and run inference with the ESPCN model. Training data can come from common image datasets such as PASCAL VOC and COCO; of course, datasets can also be collected for a specific industry to make model inference more targeted.
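
Although the model's structure is not described in detail above, a minimal PyTorch sketch of the ESPCN architecture may help orient the reader. Layer sizes follow the original paper, the final sigmoid reflects the 0-to-1 output range used later in this article, and all names here are illustrative rather than the production code.

import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """Minimal ESPCN sketch: three convolutions extract features from the
    low-resolution Y channel, then pixel shuffle rearranges r*r feature
    maps into a single channel upscaled by factor r."""
    def __init__(self, upscale_factor: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv2d(32, upscale_factor ** 2, kernel_size=3, padding=1),
            nn.Sigmoid(),  # keep the output in the 0-to-1 range
        )
        self.shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) -> (B, 1, H * r, W * r)
        return self.shuffle(self.body(x))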

To enlarge the dataset, we apply data augmentation during training. One of the most effective methods is random cropping of images, which not only expands the dataset but also reduces sensitivity to data noise and improves model stability.
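
As an illustration only (crop size, scale factor, and function names here are assumptions, not the production pipeline), generating a training pair with random cropping might look like this:

import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode, RandomCrop

def make_training_pair(img, crop_size=128, scale=2):
    """Cut a random high-resolution patch from a training image, then
    derive the low-resolution network input by bicubic downscaling."""
    hr = RandomCrop(crop_size)(img)
    lr = TF.resize(hr, [crop_size // scale, crop_size // scale],
                   interpolation=InterpolationMode.BICUBIC)
    return TF.to_tensor(lr), TF.to_tensor(hr)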

In this WebRTC video transmission scheme, the image is downsampled by 1/2 at the encoding end. For example, when transmitting 1080P video, the encoding end downsamples it to 960×540.
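
A hedged sketch of this encoder-side step (shapes are stand-ins; the production pipeline may downscale during capture or encoding instead):

import torch
import torch.nn.functional as F

# Stand-in 1080P Y plane, downsampled by 1/2 before encoding.
y_1080 = torch.rand(1, 1, 1080, 1920)
y_540 = F.interpolate(y_1080, scale_factor=0.5,
                      mode="bilinear", align_corners=False)
# y_540.shape == (1, 1, 540, 960)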

The above analysis and our experimental results show that transmitting 960×540 video reduces the data volume by more than half compared with transmitting 1080P video; after all, a 960×540 frame contains only a quarter of the pixels of a 1920×1080 frame.

In practical applications, when end users cannot transmit 1080P video directly due to bandwidth limitations, they can transmit 960×540 video and then use the ESPCN model to super-resolve it back to 1920×1080, meeting users' requirements for high-definition video.
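
Building on the illustrative ESPCN class sketched earlier (the input tensor below is stand-in data, not a decoded frame), the receiving side would run something like:

import torch

# Reconstruct a 960 x 540 Y plane into 1920 x 1080.
model = ESPCN(upscale_factor=2).eval()
y_lr = torch.rand(1, 1, 540, 960)   # (batch, channel, height, width)
with torch.no_grad():
    y_hr = model(y_lr)              # y_hr.shape == (1, 1, 1080, 1920)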

Optimizing the ESPCN model for terminal deployment

Generally speaking, once a deep learning model has been trained, it needs to be deployed to different terminals in a targeted manner.

First, we shrink the model through quantization, reducing its storage footprint on the terminal, speeding up computation, and accelerating deep learning inference.

In this regard, the PyTorch framework provides interfaces that support several kinds of quantization; we use int8 quantization.

With int8 quantization, part of the model's data is converted from float to int. The model is integrated into WebRTC through libtorch, so the trained model is converted into the .pt format that libtorch can read.
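
The article does not specify which PyTorch quantization workflow was used; as one hedged possibility, post-training static int8 quantization followed by TorchScript export might look like this (the stub wrapper, backend choice, file names, and calibration data are all assumptions):

import torch

class QuantReadyESPCN(torch.nn.Module):
    """Wraps the float model with quant/dequant stubs, as PyTorch's
    post-training static quantization workflow expects."""
    def __init__(self, float_model: torch.nn.Module):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.model = float_model
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.model(self.quant(x)))

# In practice the trained weights would be loaded into ESPCN first.
qmodel = QuantReadyESPCN(ESPCN(upscale_factor=2)).eval()
qmodel.qconfig = torch.quantization.get_default_qconfig("fbgemm")

torch.quantization.prepare(qmodel, inplace=True)
for frame in [torch.rand(1, 1, 270, 480) for _ in range(8)]:  # stand-in calibration data
    qmodel(frame)
torch.quantization.convert(qmodel, inplace=True)

# Save as TorchScript so the .pt file can be read from C++ via torch::jit::load().
torch.jit.script(qmodel).save("espcn_int8.pt")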

The results show that the ESPCN model is 95 KB before quantization and 27 KB after quantization.

Second, regarding inference optimization across different phone models: for upscaling from 256×256 to 512×512, the model reaches an inference time of 30 ms on mid-to-high-end phones such as the Huawei P10, while on terminals such as the Asus Fonepad 8 and Samsung Galaxy M31 real-time performance is not yet ideal. We will continue to optimize the model with pruning and various quantization methods to meet the requirements of a wide range of phones.
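
For the pruning mentioned above, PyTorch's torch.nn.utils.prune utilities are one possible tool. A sketch (the 30% amount is an arbitrary illustration, not a tuned value):

import torch
import torch.nn.utils.prune as prune

model = ESPCN(upscale_factor=2)
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        # Zero out the 30% of weights with the smallest L1 magnitude.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent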

ESPCN model experimental results

The key code for loading the model is as follows, where espcn.pt is the model before quantization.

#include <torch/script.h>
#include <iostream>

// Deserialize the ScriptModule from a file using torch::jit::load().
torch::jit::script::Module module;
auto device = torch::kCUDA;  // use torch::kCPU on terminals without CUDA
std::string str = "espcn.pt";

try
{
    module = torch::jit::load(str, device);
}
catch (const c10::Error& e)
{
    std::cerr << "error loading the model\n" << e.what();
}

// Keep a pointer to the loaded module for the inference step below.
torch::jit::script::Module* pmodule = &module;

Model inference is shown in the following code, where output holds the inference result. Because the output values lie between 0 and 1, they need to be scaled, clamped, and converted to 8-bit integers.

Note that by default the ESPCN model super-resolves only the Y channel; the UV data is upscaled directly with bicubic interpolation, which together completes the super-resolution of all three YUV channels (a sketch of the UV path follows the inference code below).

// Run the model on the low-resolution Y-channel tensor (img_tensor1).
std::vector<torch::jit::IValue> inputs;
inputs.push_back(img_tensor1.cuda());
at::Tensor output = pmodule->forward(inputs).toTensor();
// The model outputs values in [0, 1]: scale to [0, 255], clamp, and convert to uint8.
output = output.mul(255).clamp(0, 255).to(torch::kU8);
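
For the UV path described above, a hedged sketch follows, shown in Python for brevity (the plane shapes are stand-ins for the chroma of a 960×540 I420 frame, which is 480×270 per plane):

import torch
import torch.nn.functional as F

# U and V stacked as two channels of a 480 x 270 chroma plane.
uv = torch.rand(1, 2, 270, 480)
# Plain bicubic upscaling to 960 x 540; no network is involved for chroma.
uv_hr = F.interpolate(uv, scale_factor=2.0,
                      mode="bicubic", align_corners=False)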

Quantized ESPCN model effect


(The original image)

(The original image upscaled to the target size using bicubic interpolation)

(Reconstruction with ESPCN super-resolution)

Visually, the ESPCN model performs much better than bicubic interpolation.

This article has introduced a WebRTC transmission optimization scheme based on image super-resolution technology, covering model selection, training, quantization, and more.

There are many optimization schemes for video transmission, and image super-resolution is only one of them. Now that real-time audio and video interaction has become a basic communication requirement, Rongyun will continue to combine relevant theory with mature algorithms to develop new technologies, optimize video compression and transmission, and meet the varied needs of users around the world.

