Research and practice of super-resolution technology in the field of real-time audio and video

Foreword

CVPR, the top conference in computer vision and pattern recognition, was recently held in New Orleans, USA. The results of NTIRE, the most influential international competition in image restoration, were announced at the conference, where the NetEase Yunxin Video Lab won the overall performance track of the Efficient Super-Resolution Challenge and placed third in the runtime track. This article follows AI super-resolution from research to deployment: it introduces the current state of super-resolution technology and the opportunities and challenges of applying video super-resolution on mobile devices.

Overview of Super Resolution Technology

In recent years, the volume of Internet video has exploded, and video resolution keeps rising to meet users' growing demand for quality of experience (QoE). Because bandwidth is limited, however, video transmitted over the network is usually downsampled and compressed, which inevitably degrades its quality and harms the viewing experience. Super-resolution technology restores a high-resolution output with better visual quality from a low-resolution input. It can effectively mitigate poor video quality and meet playback-side users' demand for very high-definition pictures. It has important application value in live streaming and video on demand, surveillance equipment, video codecs, mobile photography, medical imaging, digital high definition, and video restoration.

Classification and development direction of super-resolution technology

In a broad sense, super-resolution technology includes three situations: single-image super-resolution, super-resolution reconstruction of single-frame images from multiple consecutive images, and super-resolution reconstruction of video sequences.

Single-image super-resolution mainly relies on prior knowledge about high-resolution images and on high-frequency information that is present in the input in aliased form. In the latter two cases, complementary information between adjacent images can be used in addition to prior knowledge and single-image information, so the reconstruction can reach a higher resolution than any individual low-resolution image. However, these two cases often bring unacceptable computational cost and the risk of inconsistent reconstruction across adjacent frames. In practical deployment, single-image super-resolution is therefore usually preferred.

By era and by reconstruction quality, single-image super-resolution algorithms can be divided into traditional algorithms and deep learning algorithms.

Traditional Super-Resolution Reconstruction Algorithms

Traditional super-resolution reconstruction algorithms mainly rely on basic digital image processing techniques for reconstruction. The common ones are as follows:

Interpolation-based super-resolution reconstruction: Interpolation-based methods treat each pixel as a point on the image plane, so estimating the super-resolution image amounts to fitting the unknown pixel values on that plane from the known ones, usually with a predefined transformation function or interpolation kernel. These methods are simple to compute and easy to understand, but they have obvious defects: the restored images often look blurry and jagged. Common interpolation-based methods include nearest-neighbor, bilinear, and bicubic interpolation.

Degradation model-based super-resolution reconstruction: These methods start from the image degradation model and assume that the low-resolution image is obtained from the high-resolution image through motion transformation, blurring, and noise. They constrain the generation of the super-resolution image by extracting key information from the low-resolution image and combining it with prior knowledge about the unknown super-resolution image. Common methods include iterative back projection (IBP), projection onto convex sets (POCS), and maximum a posteriori (MAP) estimation.

Learning-based super-resolution reconstruction: Learning-based methods use a large amount of training data to learn the correspondence between low-resolution and high-resolution images, and then use the learned mapping to predict the high-resolution image that corresponds to a low-resolution input. Common learning-based methods include manifold learning and sparse coding.
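The first two families above can be illustrated with a small NumPy sketch. It upscales a grayscale image with bilinear interpolation and then refines the result with iterative back projection. The degradation model used here (averaging each block of pixels) and the function names are assumptions chosen for illustration, not the degradation of any particular camera or codec.

```python
import numpy as np

def bilinear_upscale(img, scale):
    """Upscale a 2-D float image by an integer factor with bilinear interpolation."""
    h, w = img.shape
    # Map output pixel centres back to input coordinates (half-pixel convention).
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def degrade(hr, scale):
    """Assumed degradation model: average every scale x scale block."""
    h, w = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def back_project(err, scale):
    """Spread a low-resolution error map back onto the high-resolution grid."""
    return np.repeat(np.repeat(err, scale, axis=0), scale, axis=1)

def iterative_back_projection(lr, scale, iters=30, step=0.5):
    """Refine an interpolated estimate until degrading it reproduces the input."""
    hr = bilinear_upscale(lr, scale)
    for _ in range(iters):
        err = lr - degrade(hr, scale)  # mismatch measured in the low-resolution domain
        hr = hr + step * back_project(err, scale)
    return hr
```

The refined estimate is consistent with the observed low-resolution image under the assumed model, but it cannot invent detail that the degradation destroyed, which is why this family depends on additional priors.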

Super-resolution reconstruction algorithm based on deep learning

SRCNN was the first attempt to apply deep learning to super-resolution. It is a relatively simple convolutional network with three convolutional layers, each with its own role: the first extracts high-frequency features, the second performs the nonlinear mapping from low-definition to high-definition features, and the last reconstructs the high-resolution image. The structure of SRCNN is simple and its super-resolution quality leaves room for improvement, but it established the basic approach of deep learning methods to super-resolution and similar problems. Later deep learning methods largely follow this approach.
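The three-stage data flow described above can be sketched in NumPy. The kernel sizes (9, 1, 5) and filter counts (64, 32) follow the commonly cited SRCNN configuration; the random weights, the zero padding, and the helper names are assumptions made for this sketch, so it shows the structure only and does not produce a useful image.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Stride-1 convolution with zero padding. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    k = w.shape[-1]
    p = k // 2
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) across all channel pairs.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def init_srcnn(rng, channels=1):
    """Untrained weights with a 9-1-5 layout; real use would load trained weights."""
    shapes = [(64, channels, 9, 9), (32, 64, 1, 1), (channels, 32, 5, 5)]
    return [(rng.normal(0.0, 1e-3, s), np.zeros(s[0])) for s in shapes]

def srcnn_forward(x, params):
    """x is a low-resolution image already upscaled to the target size."""
    (w1, b1), (w2, b2), (w3, b3) = params
    features = relu(conv2d_same(x, w1, b1))      # stage 1: feature extraction
    mapped = relu(conv2d_same(features, w2, b2))  # stage 2: nonlinear mapping
    return conv2d_same(mapped, w3, b3)            # stage 3: reconstruction
```

Note that the input must be interpolated to the target resolution before it enters the network, so all three layers run at high resolution.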

ESPCN later made some improvements over SRCNN, but its reconstruction capacity was limited and the super-resolution quality was still not ideal, because training deep convolutional networks was problematic at the time. In general, the performance of a convolutional neural network improves as layers are added. In practice, however, once the depth grows beyond a certain point, gradients vanish during backpropagation, which leads to poor convergence and lower model performance. This problem was not solved until ResNet introduced the residual network structure. It is worth noting that ESPCN was the first to propose the sub-pixel convolution layer, which removes the upsampling step applied before the low-resolution image enters the network. This greatly reduces the computation compared with SRCNN and improves reconstruction efficiency.
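The rearrangement at the heart of a sub-pixel convolution layer can be sketched as follows. The network runs its convolutions at low resolution and produces r*r channels per output channel, and the final step interleaves those channels into an image that is r times larger in each dimension. The function name and the channel ordering are assumptions for this sketch.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r).

    Output pixel (h*r + i, w*r + j) of channel c is taken from
    input channel c*r*r + i*r + j at position (h, w).
    """
    c_rr, h, w = x.shape
    assert c_rr % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)
```

For example, four 1x1 feature maps holding the values 0 to 3 become a single 2x2 image with rows [0, 1] and [2, 3] when r = 2.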

VDSR was the first to apply residual networks and residual learning to super-resolution, increasing the depth of a super-resolution network to 20 layers for the first time. With residual learning the network learns residual features, converges quickly, and is more sensitive to details. Later convolutional neural networks proposed more complex structures. For example, SRGAN uses a generative adversarial network to generate high-resolution images. SRGAN consists of two parts, a generator network and a discriminator network. The generator produces a high-resolution image from a low-resolution one, and the discriminator tries to tell the generated high-resolution images apart from real ones. During training the two networks compete with each other until they reach an equilibrium, which yields high-resolution images with more realistic details and textures and better subjective visual quality. Other deep convolutional methods, such as SRDenseNet, EDSR, and RDN, use more complex network structures. As the networks grow deeper, single-image super-resolution quality keeps improving.
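The competition between the two networks in SRGAN can be made concrete with the standard adversarial losses. This is a minimal sketch of the adversarial terms only, assuming the discriminator outputs a probability that its input is a real high-resolution image. SRGAN combines this term with a content loss, which is omitted here, and the function names are illustrative.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy for D: score real images near 1 and generated ones near 0."""
    return float(-(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean()))

def generator_adversarial_loss(d_fake, eps=1e-12):
    """Non-saturating loss for G: small when D believes the generated images are real."""
    return float(-np.log(d_fake + eps).mean())
```

When the discriminator can no longer separate the two sets and outputs 0.5 for everything, its loss is 2 ln 2 and the generator's adversarial loss is ln 2, which corresponds to the balance described above.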

However, many of these methods are difficult to deploy on resource-constrained devices because of their high computational cost and memory footprint. Efficient model design for super-resolution has therefore attracted wide attention. FSRCNN was the first to accelerate SR networks with a compact hourglass-shaped architecture; DRCN and DRRN use recursive layers to build deep networks with fewer parameters; CARN reduces the computation of SR networks by combining efficient residual blocks with group convolutions. Attention mechanisms have also been introduced to find the most informative regions for better reconstruction, and knowledge distillation has been applied to lightweight super-resolution networks to improve their performance.
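To see why group convolutions reduce cost, it helps to count parameters. The helper below is a small illustrative calculation; the layer sizes in the usage note are made up for the example and are not taken from CARN.

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Number of learnable parameters in a k x k convolution layer.

    With grouping, each output channel only sees c_in / groups input channels.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)
```

A 3x3 layer with 64 input and 64 output channels has 36,928 parameters, and the same layer with 4 groups has 9,280, roughly a quarter. The multiply-accumulate count per output pixel shrinks by the same factor.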

Challenges of real-time video super-resolution

In the mobile Internet era, mobile devices are the most important platform for video content and play back a large amount of PGC and UGC video. The following characteristics of AI-based super-resolution algorithms make real-time deployment on mobile devices very challenging:

Limited subjective improvement. Applying the deep learning-based super-resolution algorithms mentioned above directly gives a subjective result similar to that of traditional algorithms such as bicubic interpolation, so the gain in video quality is very limited.

Excessive model size. The network models of academic SOTA methods have too many parameters. Even many lightweight networks have more than 500K parameters, which leads to heavy computation and slow inference and cannot meet the requirements of real-time video processing on mobile devices.
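A rough calculation shows why 500K parameters is a lot for real-time video. The sketch assumes a fully convolutional network whose layers all run at the input resolution with stride 1, so that each weight is used about once per pixel. The 640x360 input and the 30 fps target are assumed values for illustration, not measurements from the article.

```python
def macs_per_frame(num_params, height, width):
    """Approximate multiply-accumulates for one frame (each weight used once per pixel)."""
    return num_params * height * width

def macs_per_second(num_params, height, width, fps):
    """Sustained throughput needed to keep up with the video frame rate."""
    return macs_per_frame(num_params, height, width) * fps

def frame_budget_ms(fps):
    """Time available to process one frame."""
    return 1000.0 / fps
```

Under these assumptions, a 500K-parameter model needs about 115.2 billion multiply-accumulates per 640x360 frame, or about 3.46 trillion per second at 30 fps, and each frame must finish in roughly 33 ms.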

Yunxin AI Super-Resolution

Training data based on real downsampling

The training data of existing deep learning-based super-resolution algorithms is usually produced by bicubic or another known downsampling method. Real-world degradation often differs from this, which leaves a large gap between the training data and the data seen at inference time and makes the super-resolution algorithm perform worse than expected.

We adopt a realistic downsampling approach based on generative adversarial networks. As shown in the figure below, given high-resolution images, we train a downsampling generator G and a discriminator D so that the low-resolution images generated by G are close to real low-resolution images; the trained G then models the real downsampling. With G in hand, we can use high-resolution images to generate a large number of training pairs that follow the real downsampling degradation.
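The last step, turning high-resolution images into training pairs, might look like the sketch below. The trained generator G is not available here, so `stand_in_downsampler` is a hypothetical placeholder (block averaging plus mild noise) that only marks where the learned G would be called. It is not the method described above.

```python
import numpy as np

def stand_in_downsampler(hr, scale=2, noise_std=0.01, rng=None):
    """Hypothetical placeholder for the learned generator G."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = hr.shape
    assert h % scale == 0 and w % scale == 0, "image size must divide evenly"
    lr = hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    return lr + rng.normal(0.0, noise_std, lr.shape)

def make_training_pairs(hr_images, downsampler):
    """Return (low-resolution input, high-resolution target) pairs."""
    return [(downsampler(hr), hr) for hr in hr_images]
```

Because the downsampler is passed in as a callable, a trained G can replace the placeholder without changing the pairing code.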

Yunxin Super Resolution Algorithm

The NetEase Yunxin Video Lab proposed an edge-oriented efficient feature distillation network (EFDN). In the NTIRE 2022 Efficient Super-Resolution Challenge at CVPR, Yunxin won first place in the Overall Performance track by a clear margin and placed third in the Runtime track.

Full challenge report:

https://arxiv.org/abs/2205.05675

To improve accuracy while reducing cost, the method applies structural re-parameterization: during training, edge-oriented convolution blocks (ECB) replace the shallow residual blocks (SRB) in the residual feature distillation block (RFDB). At inference time, each ECB is converted into an ordinary 3x3 convolutional layer. This extracts texture and edge information from the image more efficiently and improves network performance while reducing overhead. In addition, the enhanced spatial attention (ESA) module is pruned to reduce the number of parameters, and the stride of its pooling layer is increased, which further reduces the algorithm's overhead.
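The idea behind structural re-parameterization can be shown with a deliberately simplified case. The sketch below merges just two parallel branches, a 3x3 convolution and a 1x1 convolution, into one 3x3 convolution. A real ECB has more branches (including edge-filter branches), so this is an assumption-laden illustration of the principle and not the ECB conversion itself. All names are illustrative.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Stride-1 convolution with zero padding. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    k = w.shape[-1]
    p = k // 2
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def merge_branches(w3, b3, w1, b1):
    """Fold a parallel 1x1 branch into a 3x3 kernel.

    The 1x1 kernel is zero-padded to 3x3 so that it sits at the centre tap.
    Because convolution is linear, kernels and biases of parallel branches add.
    """
    w1_as_3x3 = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))
    return w3 + w1_as_3x3, b3 + b1
```

After merging, a single 3x3 convolution gives the same output as the sum of the two branches, so the extra branches cost nothing at inference time.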

To bring the method into production and obtain a model that runs in real time on mobile devices, the Yunxin team adopted the following optimizations:

Model compression: To meet real-time requirements in production, we start from the model used in the CVPR NTIRE 2022 Efficient Super-Resolution Challenge, the edge-oriented efficient feature distillation network (EFDN), and apply model compression techniques such as channel pruning and knowledge distillation. These further reduce redundant parameters in the optimized architecture and remove channels that contribute little to model performance, lowering model complexity. In addition, quantization stores the weights in low-bit form, which reduces the model size and speeds up computation.

Engineering optimization: Mobile devices have limited computing power and memory bandwidth, yet the super-resolution algorithm must process video in real time without adding much power consumption, which places very high demands on engineering deployment. On the engineering side, we mainly reduce memory overhead and inference time through SIMD, model memory optimization, and data layout optimization. The algorithm is also tightly integrated with the business scenario to achieve zero memory copies between the rendering pipeline and the device, completing a high-performance deployment of the algorithm.
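The low-bit weight storage mentioned under model compression can be illustrated with symmetric per-tensor int8 quantization. This is one common scheme, chosen here as an assumption; the article does not state which quantization scheme or bit width Yunxin uses.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with one shared scale (symmetric, per tensor)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored int8 values."""
    return q.astype(np.float32) * scale
```

Storing int8 instead of float32 cuts weight storage to a quarter, and the rounding error per weight is at most half of the scale.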

The following table shows the single-frame processing time of Yunxin super-resolution on different platforms and devices.

Results and future outlook

Mobile video super-resolution can push past the limits of encoding and decoding and the efficiency bottleneck of video transmission, improving aspects of the user experience such as transmission speed and playback smoothness. Its practical benefits include:

Improve video clarity: take advantage of the high-resolution screens of high-end devices, play low-definition video in high definition, and provide ultra-high-definition picture quality for high-definition video, improving the viewing experience.
Reduce bandwidth: the sender or server lowers the resolution of the transcoded and distributed video, and super-resolution on the receiver presents a high-resolution result. This lowers the threshold for high-definition playback, improves smoothness, and relieves pressure on the user's network.

In the future, we will continue to optimize video enhancement algorithms, including super-resolution, to build industry-leading image restoration and enhancement technology. The goal is to help customers improve video quality and reduce playback costs with algorithms that are faster, consume less power, deliver better subjective quality, cover more device models, and save more bitrate, so that users can enjoy an ultra-high-definition video experience on different phones and in different network environments.

Left: bicubic upsampling result. Right: super-resolution result.

References

[1] Yawei Li, Kai Zhang, Luc Van Gool, Radu Timofte, et al. NTIRE 2022 challenge on efficient super-resolution: Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2022.

[2] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the ACM International Conference on Multimedia, pages 2024–2032, 2019.

[3] Jie Liu, Jie Tang, and Gangshan Wu. Residual feature distillation network for lightweight image super-resolution. In European Conference on Computer Vision Workshops, pages 41–55. Springer, 2020.

[4] Xindong Zhang, Hui Zeng, and Lei Zhang. Edge-oriented convolution block for real-time super resolution on mobile devices. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4034–4043, 2021.
