The paper "Deformable Convolution Dense Network for Compressed Video Quality Enhancement" by the Alibaba Cloud Video Cloud Video Coding and Enhancement Technology Team has been accepted to the ICASSP 2022 Image, Video & Multidimensional Signal Processing track, and the team was invited to present it to industry and academia at the conference this May. The following is a summary of the core technical content.
Author: Jiafu
Background
Video compression is a technology widely used in video transmission and storage. It saves bandwidth and storage space, but it also degrades video quality. The goal of the compressed video quality enhancement task is to reduce the artifacts introduced by compression and improve video quality.
In recent years, methods based on multi-frame strategies have become the mainstream approach to compressed video quality enhancement. To fuse information across frames, most of these methods rely heavily on optical flow estimation; however, inaccurate and inefficient optical flow estimation limits their performance. To break this limitation, this paper proposes a densely connected residual network combined with deformable convolution, which compensates low-quality frames with information from high-quality frames without explicit optical flow estimation.
Deformable convolution performs implicit motion estimation, and dense residual connections improve the model's tolerance to errors. Specifically, the proposed network consists of two modules: a motion compensation module that uses deformable convolution for implicit motion estimation, and a quality enhancement module that uses dense residual connections to improve error tolerance and information retention. In addition, this paper proposes a new edge enhancement loss to strengthen object edge structure. Experimental results on public datasets show that this method significantly outperforms the baseline models.
Method Analysis
Inspired by MFQE [1], our method also uses Peak Quality Frames (PQFs) as reference frames. In MFQE, a PQF is defined as a video frame whose quality is higher than that of the consecutive frames before and after it; in this paper, I-frames are used as PQFs. High-quality PQFs provide more accurate information for low-quality input frames, maximizing the quality of the enhanced video frames.
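As a minimal sketch of the reference-frame selection described above, the snippet below locates, for a given frame, the nearest PQF at or before it and at or after it. The frame indices and I-frame interval are illustrative assumptions, not details from the paper.

```python
# Sketch: finding the nearest preceding/following PQFs for a frame.
# Here PQFs are taken to be I-frame positions, following the paper.

def nearest_pqfs(frame_idx, pqf_indices):
    """Return the closest PQF at or before and at or after frame_idx."""
    before = max((i for i in pqf_indices if i <= frame_idx), default=pqf_indices[0])
    after = min((i for i in pqf_indices if i >= frame_idx), default=pqf_indices[-1])
    return before, after

# Example: a sequence with (assumed) I-frames every 8 frames
pqfs = [0, 8, 16, 24]
print(nearest_pqfs(11, pqfs))  # -> (8, 16)
```

If the frame is itself a PQF, both references coincide with it, so the compensation step becomes trivial for such frames.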
Figure 1 shows the structure of our model, where \( F_{np} \) denotes the current frame, \( F_{p1} \) and \( F_{p2} \) denote the nearest preceding and following PQFs, and the MC module denotes the motion compensation module; the subsequent dense residual blocks and convolutional layers constitute the quality enhancement module.
Taking a PQF ( \( F_{p1} \) or \( F_{p2} \) ) as the reference frame, the deformable convolutional layers in the motion compensation module predict its temporal motion information and warp the reference frame toward the content of the input frame. The resulting compensated frames \( F^{c}_{p1} \) and \( F^{c}_{p2} \) have content similar to the input frame \( F_{np} \) and quality similar to the reference frames \( F_{p1} \) and \( F_{p2} \).
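To make the implicit compensation concrete, here is a toy single-channel deformable 3x3 convolution in numpy: each kernel tap is sampled at its regular grid position plus a per-pixel learned offset, via bilinear interpolation. This is a didactic sketch only; the paper's module is a learned, multi-channel, pyramid-structured network, and the offset values here are placeholders.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear sample of img at float coords (y, x), clamped to the border."""
    h, w = img.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def deform_conv3x3(img, weights, offsets):
    """Toy deformable 3x3 conv: tap k at pixel (i, j) reads the input at the
    regular grid location plus a learned offset offsets[i, j, k] = (dy, dx)."""
    h, w = img.shape
    out = np.zeros((h, w))
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for k, (gy, gx) in enumerate(grid):
                oy, ox = offsets[i, j, k]
                acc += weights[k] * bilinear(img, i + gy + oy, j + gx + ox)
            out[i, j] = acc
    return out
```

With all offsets zero this reduces to an ordinary 3x3 convolution; non-zero offsets let each tap follow object motion, which is why no explicit optical flow is needed.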
Next, the quality enhancement module \( R_{\theta_{qe}} \) fuses the information of the multiple reference frames and outputs the enhanced frame \( F_{enh} \):
$$ F_{enh}=F_{np}+R_{\theta_{qe}}\left( \left[ F^{c}_{p1}, F_{np}, F^{c}_{p2} \right] \right) $$
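The residual fusion in the equation above can be sketched as follows. The quality enhancement module \( R_{\theta_{qe}} \) is stubbed with a trivial function purely for illustration; in the paper it is a dense residual network.

```python
import numpy as np

def enhance(f_np, f_c_p1, f_c_p2, R):
    """F_enh = F_np + R([F_c_p1, F_np, F_np_c_p2]): the module predicts a
    residual from the concatenated frames, added back to the input frame."""
    stacked = np.stack([f_c_p1, f_np, f_c_p2], axis=0)  # channel-wise concat
    return f_np + R(stacked)

# Illustrative stand-in for R (NOT the paper's network): a scaled
# difference between the frame average and the input frame.
R_stub = lambda x: 0.1 * (x.mean(axis=0) - x[1])

h, w = 4, 4
f_np = np.full((h, w), 0.5)     # low-quality current frame
f_c_p1 = np.full((h, w), 0.6)   # compensated preceding PQF
f_c_p2 = np.full((h, w), 0.7)   # compensated following PQF
out = enhance(f_np, f_c_p1, f_c_p2, R_stub)
print(out.shape)  # -> (4, 4)
```

The key design point is the global skip connection: the network only has to learn the correction, not the whole frame.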
In addition, since artifacts usually appear near object edges, we propose an edge enhancement loss that detects and emphasizes the object edges \( W \) in video frames, helping the model better reconstruct object contours corrupted by artifacts:
$$ L_{e} =\frac{1}{N} \sum_{i=1}^{N}{W} \ast \left ( F_{raw} - F_{enh} \right ) ^{2} $$
Experimental Results
Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are the most widely used image quality assessment metrics. For a more convenient and intuitive comparison, this paper uses \( \Delta PSNR \) and \( \Delta SSIM \), i.e., the increase in PSNR and SSIM of the enhanced frame relative to the input frame, as evaluation metrics.
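The \( \Delta PSNR \) metric can be computed as below; the frames and the toy "enhancement" are synthetic stand-ins for illustration only.

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """PSNR in dB against a reference frame, for values in [0, peak]."""
    mse = np.mean((ref - img) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def delta_psnr(raw, compressed, enhanced):
    """Delta-PSNR: PSNR gain of the enhanced frame over the compressed
    input, both measured against the uncompressed raw frame."""
    return psnr(raw, enhanced) - psnr(raw, compressed)

rng = np.random.default_rng(0)
raw = rng.random((8, 8))
compressed = np.clip(raw + 0.1 * rng.standard_normal((8, 8)), 0, 1)
enhanced = raw + 0.5 * (compressed - raw)  # toy: halve the compression error
print(round(delta_psnr(raw, compressed, enhanced), 2))  # -> 6.02
```

Halving the pixel error quarters the MSE, which is exactly a \( 10\log_{10}4 \approx 6.02 \) dB gain; \( \Delta SSIM \) is defined analogously as a difference of SSIM values.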
Our method is compared with five baseline models. Among them, ARCNN [2], DnCNN [3], and RNAN [4] are compressed image quality enhancement algorithms that enhance each frame independently, with mediocre performance. MFQE 1.0 [1] is a compressed video quality enhancement algorithm based on a multi-frame strategy and PQFs; MFQE 2.0 [5] further improves on it with a better PQF detector and quality enhancement module. As shown in Table 1, our method achieves higher \( \Delta PSNR \) and \( \Delta SSIM \) than the other five methods. In particular, on the test sequences with QP = 37, our gain over MFQE 2.0 is nearly twice that of MFQE 2.0 over MFQE 1.0.
Figure 2 shows the subjective results of the five methods; clearly, our proposed method improves video frame quality the most. Taking the ball, umbrella stand, and mouth in Figure 2 as examples, our method recovers sharper object edges and more details. This indicates that, for fast-moving objects in the video such as the ball, the pyramid-structured deformable convolution in our network compensates motion more accurately; together with the quality enhancement module and the guidance of the edge enhancement loss, our method achieves better edge reconstruction and detail restoration.
Building on in-depth research and development of this technology, the edge detail restoration of Alibaba Cloud Video Cloud's Narrowband HD products on low-quality video has improved significantly, especially the viewing experience in face regions, which viewers care about most. This achievement can be widely applied in short-video and live-streaming scenarios, such as the CCTV Spring Festival Gala and Alibaba Health. The technology also provides good visual enhancement for medium- and high-quality videos: at the same bandwidth, the overall picture becomes clearer. In the future, it will be applied in more scenarios to improve the viewing experience.
About Narrowband HD
Narrowband HD is a media processing capability built on Alibaba Cloud's proprietary transcoding technology. Using Alibaba Cloud's unique algorithms, it pushes past the limits of video encoders and continuously optimizes the smoothness and clarity of video playback, achieving a more bandwidth-efficient viewing experience at the same image quality and a higher-definition viewing experience at the same bandwidth. With features such as low-bitrate HD, image quality regeneration, scene customization, and up to 50% bandwidth cost savings, Narrowband HD provided key technical support for the 2022 Beijing "Cloud Winter Olympics" and Alibaba Cloud's "Alibaba Cloud ME". (Narrowband HD product official website)
References
[1]Ren Yang, Mai Xu, Zulin Wang, and Tianyi Li, “Multiframe quality enhancement for compressed video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6664–6673.
[2]Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang, “Compression artifacts reduction by a deep convolutional network,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 576–584.
[3] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[4]Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu, “Residual non-local attention networks for image restoration,” arXiv preprint arXiv:1903.10082, 2019.
[5] Zhenyu Guan, Qunliang Xing, Mai Xu, Ren Yang, Tie Liu, and Zulin Wang, “MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
"Video Cloud Technology" is an official account on audio and video technology worth following, publishing practical technical articles from the Alibaba Cloud front line every week, where you can exchange ideas with first-class engineers in the audio and video field. Reply [Technology] to the official account to join the Alibaba Cloud Video Cloud product technology exchange group, discuss audio and video technology with industry leaders, and get the latest industry information.