With the advent of the 5G era, audio and video services such as short videos, movies and TV series, e-commerce live streaming, game live streaming, and video conferencing have seen explosive growth.

As a general-purpose cloud transcoding platform, Alibaba Cloud Video Cloud's Narrowband HD has to process massive volumes of video of widely varying quality. For medium- and high-quality sources, the existing Narrowband HD 1.0 already delivers satisfactory transcoding results and cuts bandwidth costs by roughly 30%. For low-quality sources with obvious compression artifacts and imaging noise, the more capable Narrowband HD 2.0 is needed to remove compression distortion, denoise, and enhance the picture for a better viewing experience.

At the 2022 Rare Earth Developers Conference, Zhou Mingcai, a technology expert at Alibaba Cloud Intelligent Video Cloud, gave an in-depth talk titled "Alibaba Cloud Narrowband HD: Evolution, Breakthroughs, and Scenario Practice", sharing Alibaba Cloud Video Cloud's R&D thinking and practice around Narrowband HD.

01 Source of Narrowband HD

Before discussing Narrowband HD, let's look at the ordinary cloud transcoding process. Transcoding is essentially decoding followed by re-encoding. As the figure below shows, in ordinary cloud transcoding the original video is produced on the client side, encoded, and transmitted to the server as a video stream; after transcoding in the cloud, it is distributed through the CDN (content distribution network). At this stage, the main job of ordinary transcoding is to unify the video format and reduce the bit rate to some extent.

What is Narrowband HD, and how does it differ from ordinary transcoding? The name itself explains it: "narrowband" means that after Narrowband HD transcoding the video needs less bandwidth, while "HD" means the transcoded picture still retains a high-definition, visually rich experience.

The lower half of the figure shows the Narrowband HD process. It differs from ordinary transcoding in that, after decoding in the cloud, Narrowband HD also enhances the video quality, using encoding information to assist the enhancement. The improved video is then encoded with an encoder optimized for subjective quality and finally distributed.
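
As a rough illustration of the difference, the hypothetical Python snippet below drives FFmpeg for the two flows: a plain transcode versus a transcode with an extra enhancement stage between decode and encode. The filters (hqdn3d, unsharp, eq), the codec settings, and the file names are generic choices for illustration only, not the filters or parameters Narrowband HD actually uses.

```python
# Plain transcode vs. transcode with an enhancement stage, driven via FFmpeg.
# Filter and encoder choices are illustrative, not the Narrowband HD pipeline.
import subprocess

def plain_transcode(src, dst):
    # Decode -> re-encode only: unify the format and cap the bit rate.
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-crf", "23", "-preset", "medium",
        "-c:a", "copy", dst,
    ], check=True)

def enhanced_transcode(src, dst):
    # Decode -> denoise / sharpen / adjust -> encode with quality-oriented settings.
    filters = "hqdn3d=2:1:3:3,unsharp=5:5:0.6,eq=contrast=1.05:saturation=1.1"
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", filters,
        "-c:v", "libx264", "-crf", "21", "-preset", "slow", "-tune", "film",
        "-c:a", "copy", dst,
    ], check=True)
```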

To sum up, Narrowband HD is essentially about quality improvement plus compression, and its main goal is the optimal balance of quality, bit rate, and cost. Alibaba Cloud proposed the concept of Narrowband HD as early as 2015, and the Narrowband HD technology brand was officially launched and commercialized in 2016.

This year, Alibaba Cloud launched the Narrowband HD 2.0 Ultimate Repair Generation version. Compared with the previous version, its biggest feature is that it can generate detailed textures while performing extreme restoration.

Narrowband HD Panorama

Narrowband HD mainly considers three dimensions when making adaptive parameter decisions: business scenarios, video popularity, and video content.

Different business scenarios, such as e-commerce live streaming, game live streaming, and live event broadcasting, require different enhancement and encoding parameters. For popular, high-traffic content, such as videos in the mobile shopping scenario, Narrowband HD 2.0 can be triggered for a second transcoding pass to further improve quality and save bit rate. In the video content dimension, both high-level and low-level analysis are performed on the current video: high-level analysis covers semantic analysis, especially ROI detection, while low-level analysis covers video quality analysis of compression, blur, and noise.

Based on the analysis along these dimensions, the adaptive parameter decisions are obtained, and Narrowband HD then performs the corresponding video repair and video enhancement. Specifically, video repair includes removal of strong compression distortion, noise reduction, and so on, while video enhancement includes detail enhancement, color enhancement, contrast enhancement, and so on.

02 Video Content Analysis

ROI

The main purpose of ROI processing is to allocate bits, when the bit rate is limited or held constant, to the regions the human eye pays most attention to. In movies and TV series, for example, viewers focus on the protagonist's face.

ROI-based processing and compression has two difficulties: one is obtaining a low-cost ROI algorithm; the other is making rate-control decisions based on the ROI, for example improving the subjective quality of the ROI region while ensuring that the subjective quality of non-ROI regions does not drop noticeably, and keeping the result temporally consistent so that it does not flicker.

For low-cost ROI computation, Alibaba Cloud developed an adaptively scheduled face detection and tracking algorithm that is both cheap and accurate. Most of the time only lightweight face tracking is needed, and full face detection runs only a small fraction of the time, so the ROI is obtained quickly and at very low cost while maintaining high precision.
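
The snippet below is a minimal sketch of this detect-rarely, track-often pattern, assuming OpenCV with a Haar cascade detector and a KCF tracker (cv2.TrackerKCF_create in opencv-contrib builds; some builds expose it under cv2.legacy). The interval, file name, and tracker choice are placeholders; this is not Alibaba Cloud's self-developed algorithm.

```python
# Detect faces only every N frames; in between, update cheap trackers.
import cv2

DETECT_INTERVAL = 30  # run the (expensive) detector once every 30 frames

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("input.mp4")   # placeholder path
trackers, frame_idx = [], 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % DETECT_INTERVAL == 0:
        # Expensive path: full-frame face detection, then (re)initialise trackers.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        trackers, rois = [], []
        for (x, y, w, h) in faces:
            box = (int(x), int(y), int(w), int(h))
            t = cv2.TrackerKCF_create()
            t.init(frame, box)
            trackers.append(t)
            rois.append(box)
    else:
        # Cheap path: only update the trackers to follow the faces.
        rois = []
        for t in trackers:
            success, box = t.update(frame)
            if success:
                rois.append(tuple(int(v) for v in box))
    # `rois` would be handed to the rate-control stage as the ROI map.
    frame_idx += 1
cap.release()
```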

As the table below shows, compared with open-source face detection algorithms, Alibaba Cloud's self-developed algorithm loses essentially nothing in precision and recall, while its complexity and computation time are reduced by orders of magnitude.

Once the ROI algorithm is in place, decisions must be made on adaptive bit allocation for the scene and the video quality. The main idea is to work together with the encoder to balance subjective and objective quality while keeping the result temporally consistent.

JND

Traditional video compression is mainly based on information theory: it reduces temporal redundancy, spatial redundancy, and statistical redundancy through the prediction structure, but that is far from enough to exploit visual (perceptual) redundancy.

The JND scheme uses two algorithms: a spatial-domain JND algorithm and a temporal-domain JND algorithm. With the JND results, a MOS-based adaptive rate-control algorithm then allocates QP adaptively, ultimately saving more than 30% bit rate in general scenarios at equivalent subjective quality.
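
As a hedged sketch of how ROI and a masking term might be turned into per-block QP offsets, the snippet below lowers QP inside the ROI and raises it in texture-heavy non-ROI blocks, using local variance as a crude stand-in for a real spatial JND model. The block size, constants, and variance proxy are illustrative and are not the MOS-based adaptive rate control described above.

```python
# Build a per-block QP offset map from an ROI mask plus a crude texture-masking
# term (local variance). Negative offsets spend more bits, positive ones fewer.
import numpy as np

def qp_offset_map(gray, roi_mask, block=16, roi_bonus=-3.0, max_penalty=3.0):
    h, w = gray.shape
    bh, bw = h // block, w // block
    offsets = np.zeros((bh, bw), dtype=np.float32)
    for by in range(bh):
        for bx in range(bw):
            blk = gray[by*block:(by+1)*block, bx*block:(bx+1)*block].astype(np.float32)
            roi = roi_mask[by*block:(by+1)*block, bx*block:(bx+1)*block]
            if roi.mean() > 0.5:
                offsets[by, bx] = roi_bonus            # spend bits on the ROI
            else:
                # Busy texture masks distortion, so a higher QP is tolerable there.
                texture = min(blk.var() / 400.0, 1.0)
                offsets[by, bx] = max_penalty * texture
    return offsets  # would be fed to the encoder's adaptive-QP interface
```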

03 Video Repair Enhancement

Detail enhancement

When it comes to video repair and enhancement, detail enhancement is the part mentioned most often, and its effect is indeed the most visible.

Typical detail enhancement is based on the unsharp mask (USM) framework. Alibaba Cloud Video Cloud's self-developed detail enhancement algorithm has three characteristics. First, it extracts image texture details in a more refined way, capturing texture structures of different sizes and characteristics, which gives a better enhancement effect. Second, it analyzes the texture structure of the image content and adapts the enhancement locally according to regional texture complexity. Third, it can work together with the encoder, adaptively adjusting the enhancement strategy according to feedback from the encoder's encoding information.
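
A minimal unsharp-mask sketch with a locally adaptive gain is shown below. It illustrates only the first two characteristics (detail-layer extraction and texture-adaptive strength); the multi-scale extraction and encoder feedback of the production algorithm are not reproduced, and all parameters are arbitrary.

```python
# Adaptive unsharp mask: extract a detail layer, scale its gain by local
# texture complexity (local standard deviation), and add it back.
import cv2
import numpy as np

def adaptive_detail_enhance(img_bgr, base_gain=0.8):
    img = img_bgr.astype(np.float32)
    blur = cv2.GaussianBlur(img, (0, 0), sigmaX=1.5)
    detail = img - blur                                  # high-frequency layer

    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    mean = cv2.GaussianBlur(gray, (0, 0), sigmaX=3.0)
    var = cv2.GaussianBlur(gray * gray, (0, 0), sigmaX=3.0) - mean * mean
    std = np.sqrt(np.maximum(var, 0.0))

    # Flat areas (low std) get little boost to avoid amplifying noise;
    # heavily textured areas are attenuated to avoid over-sharpening halos.
    gain = base_gain * np.clip(std / 20.0, 0.2, 1.0) * np.clip(40.0 / (std + 1e-3), 0.3, 1.0)
    out = img + detail * gain[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```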

Color enhancement

Captured video material often looks dull in color because of the device or the lighting. In short video scenarios in particular, such footage loses its visual appeal, so color enhancement is needed.

What are the difficulties in color enhancement? How to do color enhancement?

Take the eq filter in FFmpeg as an example: it enhances color through the UV channels. Our self-developed algorithm instead works in the RGB color space. It adapts locally according to the saturation of each color point and, at the same time, adapts globally according to the overall appearance of the current frame.

As for skin tone protection: after conventional color enhancement, face regions tend to turn reddish, which looks subjectively unnatural. To solve this, we added skin tone protection, which gives extra protection to skin-colored regions.
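
The sketch below shows the general idea in simplified form: a saturation-adaptive boost plus a skin mask that blends face regions back toward the original. Note that it scales saturation in HSV and uses a textbook YCrCb skin range, whereas the production algorithm works in RGB with its own local and global adaptation; everything here is illustrative.

```python
# Saturation-adaptive color boost with a rough skin-tone protection mask.
import cv2
import numpy as np

def enhance_color(img_bgr, boost=1.25):
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    s = hsv[..., 1]
    # Already-vivid pixels get a smaller boost (local adaptation by saturation).
    gain = 1.0 + (boost - 1.0) * (1.0 - s / 255.0)
    hsv[..., 1] = np.clip(s * gain, 0, 255)
    enhanced = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    # Rough skin mask in YCrCb (a commonly used range, not the production detector).
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127)).astype(np.float32) / 255.0
    skin = cv2.GaussianBlur(skin, (0, 0), sigmaX=5.0)[..., None]

    # Blend back toward the original in skin areas so faces do not turn reddish.
    out = enhanced.astype(np.float32) * (1 - skin) + img_bgr.astype(np.float32) * skin
    return out.astype(np.uint8)
```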

This is a before-and-after comparison of color enhancement. The enhanced greens and meat look fuller in color, which whets the audience's appetite in food videos.

Contrast enhancement

For contrast enhancement, the classic CLAHE algorithm is used. The idea is to divide a video frame into blocks, usually 8x8 tiles, and compute a histogram within each block. The histogram is then clipped; this is the "contrast-limited" part of contrast-limited adaptive histogram equalization, which mainly prevents noise from being over-amplified. CLAHE-based video contrast enhancement has one real difficulty: temporal flicker. This remains a hard problem in academia and has not been fully solved so far.
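
A minimal CLAHE example with OpenCV, applied to the luma channel only so chroma is untouched, is shown below; the clip limit is arbitrary and no temporal-flicker handling is included.

```python
# CLAHE on the L channel of LAB: 8x8 tiles with a clip limit, as described above.
import cv2

def clahe_contrast(img_bgr, clip_limit=2.0):
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    l = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
```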

Noise reduction

There are many denoising algorithms, such as BM3D, BM4D, and NLM, some of which are available as FFmpeg filters. They denoise well, but their complexity is very high, which makes them slow and costly, and they may also need to be paired with a noise estimation module.

There are also more balanced algorithms that run relatively fast, but their denoising effect is weak; increasing their strength usually introduces artifacts or loss of detail.

Based on this investigation, our self-developed denoising algorithm adopts a filtering framework based on multi-resolution decomposition. First, the input image is wavelet-decomposed into high-frequency and low-frequency components. Soft thresholding is applied to the high frequencies, and bilateral filtering is applied to the low frequencies for noise reduction. After filtering and thresholding, the components are re-synthesized, achieving the denoising goal. The core difficulty is acceleration: the cost and running speed must meet transcoding requirements, especially for real-time transcoding, which demands very high speed.
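
The snippet below sketches this framework at a single decomposition level, assuming PyWavelets and OpenCV: soft-threshold the high-frequency sub-bands, bilateral-filter the low-frequency band, then reconstruct. The wavelet, threshold, and filter parameters are illustrative choices, not the production settings.

```python
# One-level wavelet denoising: soft-threshold high frequencies, bilateral-filter
# the low-frequency band, then reconstruct.
import cv2
import numpy as np
import pywt

def wavelet_denoise(gray_u8, thresh=10.0):
    img = gray_u8.astype(np.float32)
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")

    # High frequencies: soft thresholding suppresses small (noisy) coefficients.
    cH, cV, cD = (pywt.threshold(c, thresh, mode="soft") for c in (cH, cV, cD))

    # Low frequency: edge-preserving bilateral filtering.
    cA = cv2.bilateralFilter(cA.astype(np.float32), d=5, sigmaColor=25, sigmaSpace=5)

    out = pywt.idwt2((cA, (cH, cV, cD)), "haar")
    return np.clip(out, 0, 255).astype(np.uint8)
```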

Acceleration

For wavelet transform acceleration, the algorithm team tried many approaches, including converting to integer (fixed-point) arithmetic, but these always introduced accumulated errors. We therefore settled on floating-point acceleration, and AVX2 floating-point acceleration delivers roughly a 3x speedup.

The other part is accelerating the bilateral filter. Traditional bilateral filtering operates on a pixel neighborhood, which is very slow. We therefore use the fast RBF (recursive bilateral filtering) algorithm, which decomposes the 2D filter into 1D passes applied recursively left to right, right to left, top to bottom, and bottom to top, achieving an effect close to the original bilateral filter. RBF alone gives roughly a 13x speedup, and an additional AVX2 assembly optimization speeds it up by roughly another 10x.
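
The toy code below only illustrates the four-pass recursive structure (left to right, right to left, top to bottom, bottom to top) with an edge-aware feedback weight; it is a didactic simplification of the idea, not the RBF algorithm or its AVX2 implementation, and the alpha/sigma values are arbitrary.

```python
# Separable recursive smoothing in four directional passes; the feedback weight
# shrinks across large intensity jumps so edges are roughly preserved.
import numpy as np

def recursive_pass(row, alpha=0.7, sigma_r=20.0):
    out = row.astype(np.float32).copy()
    for i in range(1, len(out)):
        # Edge-aware feedback: big differences reduce the smoothing weight.
        w = alpha * np.exp(-abs(float(row[i]) - float(row[i - 1])) / sigma_r)
        out[i] = (1 - w) * row[i] + w * out[i - 1]
    return out

def recursive_smooth(gray):
    img = gray.astype(np.float32)
    img = np.array([recursive_pass(r) for r in img])                    # left -> right
    img = np.array([recursive_pass(r[::-1])[::-1] for r in img])        # right -> left
    img = np.array([recursive_pass(c) for c in img.T]).T                # top -> bottom
    img = np.array([recursive_pass(c[::-1])[::-1] for c in img.T]).T    # bottom -> top
    return img
```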

The picture above shows the overall effect of SDR+. After SDR+ processing, the contrast, brightness, and clarity of the picture are all greatly improved. That covers the work on video enhancement.

CDEF de-ringing

The first technique is CDEF de-ringing. CDEF itself comes from AV1. Before CDEF processing there are many jaggies and ringing artifacts near strong edges; after CDEF processing, this noise is largely removed.

The core step of CDEF is essentially a smoothing filter, but its weights and the deviations it uses are treated specially. In particular, the filter weights depend on the dominant direction of the 8x8 pixel block containing the current pixel, shown in the lower-left corner of the figure; the algorithm searches for the optimal direction, and once found, the positions and weights of the filter taps are determined by that dominant direction. CDEF has two sets of weights: the primary-direction weights (WP) and the secondary-direction weights (WS). In addition, the grayscale difference between each neighboring tap and the current pixel is clamped, which prevents over-smoothing.
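
The sketch below conveys only the flavor of the idea: clamped differences accumulated along a chosen direction. It fixes the direction globally and replaces AV1's per-block direction search, constraint function, and tap weights with a plain clip and a uniform weight, so it is a schematic rather than CDEF itself.

```python
# Simplified directional de-ringing: average clamped differences along one
# direction so strong edges (large differences) are not smeared.
import numpy as np

# Offsets (dy, dx) for a few candidate directions (subset, for brevity).
DIRECTIONS = {
    "horizontal": [(0, -2), (0, -1), (0, 1), (0, 2)],
    "vertical":   [(-2, 0), (-1, 0), (1, 0), (2, 0)],
    "diag45":     [(-2, 2), (-1, 1), (1, -1), (2, -2)],
}

def directional_smooth(gray, direction="horizontal", strength=8, weight=0.15):
    img = gray.astype(np.float32)
    out = img.copy()
    for dy, dx in DIRECTIONS[direction]:
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)  # wraps at borders
        diff = np.clip(shifted - img, -strength, strength)       # clamp keeps strong edges
        out += weight * diff
    return np.clip(out, 0, 255).astype(np.uint8)
```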

Compression distortion removal

In addition to the traditional CDEF-based de-ringing, a deep learning-based compression distortion removal algorithm is also used. It is based on a multi-frame scheme, which favors inter-frame continuity and is less prone to inter-frame flicker. The algorithm is divided into two parts: a quality detection module and an artifact removal module. The quality detection module identifies the degree of compression of sources of different quality and outputs a QP map as a measure of compression strength. The artifact removal module takes multiple frames and the QP maps of the corresponding frames as input and uses the QP map to remove compression artifacts adaptively.
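
A schematic of how such a network might consume the QP map is sketched below in PyTorch (the framework choice is ours): neighboring decoded frames and per-frame QP maps are concatenated along the channel dimension, and the network predicts a residual for the center frame. The layer sizes and architecture are arbitrary placeholders, not the product model.

```python
# QP-map-conditioned multi-frame restoration: stack frames and QP maps as
# input channels, predict a residual correction for the centre frame.
import torch
import torch.nn as nn

class QPGuidedRestorer(nn.Module):
    def __init__(self, num_frames=3, channels=32):
        super().__init__()
        self.num_frames = num_frames
        in_ch = num_frames * 3 + num_frames      # RGB frames + one QP map per frame
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, frames, qp_maps):
        # frames: (B, num_frames*3, H, W); qp_maps: (B, num_frames, H, W)
        x = torch.cat([frames, qp_maps], dim=1)
        c = (self.num_frames // 2) * 3            # channel offset of the centre frame
        return frames[:, c:c + 3] + self.net(x)   # residual restoration

model = QPGuidedRestorer()
out = model(torch.rand(1, 9, 64, 64), torch.rand(1, 3, 64, 64))
```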

Ultimate Repair Generation

Ultimate repair generation mainly targets scenes with poor image quality: while removing strong compression distortion, it also generates details that were lost to compression. Its research and development involved several points: first, constructing the training data (here the second-order degradation idea of Real-ESRGAN is referenced); second, ensuring the stability of face generation for the more sensitive face regions; third, model compression, keeping the computation low while retaining good results; and fourth, model deployment.
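
For the first point, a hedged sketch of Real-ESRGAN-style second-order degradation is shown below: blur, rescaling, noise, and JPEG compression applied twice with random parameters to synthesize low-quality inputs from high-quality frames. The parameter ranges are illustrative, not the production recipe.

```python
# Second-order degradation for training-pair construction: apply a random
# blur -> resize -> noise -> JPEG chain twice to a clean frame.
import cv2
import numpy as np
import random

def degrade_once(img):
    # 1) Gaussian blur
    img = cv2.GaussianBlur(img, (0, 0), sigmaX=random.uniform(0.5, 2.0))
    # 2) Down/up scaling
    h, w = img.shape[:2]
    scale = random.uniform(0.5, 1.0)
    img = cv2.resize(img, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)
    img = cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)
    # 3) Gaussian noise
    noise = np.random.normal(0, random.uniform(1, 8), img.shape).astype(np.float32)
    img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # 4) JPEG compression
    q = random.randint(30, 90)
    _, enc = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), q])
    return cv2.imdecode(enc, cv2.IMREAD_COLOR)

def second_order_degrade(hq_img):
    lq = degrade_once(degrade_once(hq_img))
    return hq_img, lq   # (target, degraded input) training pair
```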

Ultimate repair in real scenarios

During the NBA Finals broadcast in June, Blockbuster TV used our Narrowband HD 2.0 repair-and-generation technology to improve the quality of its live broadcast. In the screenshot in the middle, the upper half shows the video as pushed directly by the broadcaster, and the lower half shows the result after ultimate repair.

After repair, the edges of the YouTube lettering are clearer, cleaner, and less fuzzy. In other basketball shots, the numbers on the players' backs and the outlines of their bodies also become noticeably sharper. There are also generation effects, such as texture generated on the court floor, which greatly improves the overall look of the event.

Besides self-developed algorithms, Alibaba Cloud also runs cooperative projects with universities, and subtitle restoration is one of their results. The lower-right corner of the figure shows an actual subtitle repair example taken from the MV of an old movie. The top line is the subtitle in the original MV: the horizontal strokes next to the character "hua" stick together, and there is a lot of noise around the edges of the text. The bottom line shows the subtitle after repair; it has become very clean and clear.

In the future, Narrowband HD will continue to be upgraded: further improving the repair and generation effect through stronger algorithms, reducing bit rate and cost, connecting front-end and back-end processing, and exploring more immersive scenarios such as Narrowband HD for VR. The technology will also be applied to more top-tier events, delivering a new level of visual experience on top of cost optimization.

"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.

CloudImagine
222 声望1.5k 粉丝