
In the NBA Finals that ended half a month ago, BesTV was the only live streaming platform on the Chinese internet to adopt the "watch the NBA with an anchor" format, competing on content differentiation with its "co-watching" event commentary. At the same time, BesTV used "Narrowband HD 2.0" live transcoding technology to deliver a further improvement in game picture quality for its viewers.

To put it simply, "Narrowband HD" is a set of video coding technologies aimed at the "best subjective experience". Let's look at a comparison image to get a feel for the picture-quality improvement:


Top: the original stream pushed by the broadcaster; bottom: the restored image

The upper half of the image above is the original stream pushed by the broadcaster, and the lower half is the image transcoded with Narrowband HD 2.0. After transcoding, the numbers on the jerseys, the lettering on the floor, the net, and the boundary lines all become clearer. Overall sharpness also improves noticeably; even the floor texture and the outlines of spectators beyond the court are visibly crisper to the naked eye.
The rest of this article takes an in-depth look at the "Narrowband HD" technology behind the ultra-clear picture quality of these NBA live broadcasts.

1. Narrowband HD technology

Alibaba Cloud proposed the concept of "Narrowband HD" as early as 2015 and officially launched it as a commercial technology brand in 2016. Narrowband HD represents a video service philosophy that reconciles cost and experience: a video coding technology built around the optimal subjective experience of the human eye.

"Narrowband HD" schematic diagram

In essence, Narrowband HD is a quality-enhancement-plus-compression problem whose main goal is the optimal balance of quality, bitrate, and cost. There are two versions along this direction: Narrowband HD 1.0 and Narrowband HD 2.0.

Narrowband HD 1.0 is the balanced version. Its focus is achieving content-adaptive processing and encoding at minimal cost, saving bitrate while improving picture quality. It therefore makes full use of information inside the encoder to assist video processing, realizing content-adaptive processing and encoding with low-cost pre-processing; inside the encoder, rate control is driven mainly by subjective quality.

Compared with 1.0, Narrowband HD 2.0 brings more, and more sophisticated, techniques to guarantee adaptivity, including JND-based content-adaptive coding, ROI coding, SDR+, and more natural detail enhancement. 2.0 also adds restoration capabilities better suited to popular content, saving even more bitrate while raising quality.

2. Challenges of live broadcast

Narrowband HD technology is now widely used in long-form video, short video, entertainment, online education, e-commerce live streaming, and other scenarios.

Compared with long video or e-commerce live streaming, NBA basketball broadcasts feature fast shot changes and intense motion and therefore normally call for a high-bitrate stream. High-bitrate live streams, however, especially NBA broadcasts carried across borders, are vulnerable to network-quality fluctuations that cause audio and video stuttering and delay.

To guarantee stream stability and smooth playback on the viewer side, BesTV chose a source stream with a lower bitrate, which raised several challenges in practice:

Challenge 1: low bitrate blurs and distorts the court picture

Compared with a high-bitrate stream, a low-bitrate stream shows more obvious compression distortion, blurred details, and loss of fine texture. In a basketball broadcast this translates into blurry jersey numbers and text, a smeared net, and ragged boundary lines and on-court lettering, degrading the viewing experience.

Challenge 2: "Deinterlacing" Residues of Vigorous Motion Pictures

Beyond the compression blur of a low-bitrate stream, sports broadcasts have a problem of their own: the original signal is usually captured with interlaced scanning and must be "de-interlaced" before internet delivery. For intensely moving pictures, perfect de-interlacing is hard to guarantee, so some interlacing typically remains behind as residual noise.

Challenge 3: picture quality loss after multiple transcodes

In addition, under current enterprise business flows, live video is transcoded several times on its way from the camera to the end user, and every transcode adds some compression distortion and quality loss.

To balance fluency, stability, and HD picture quality, BesTV first chose a relatively low bitrate for stable cross-border transmission of the NBA Finals, then pulled the source stream into China and restored it there. For this restoration step, BesTV used the "Narrowband HD 2.0" technology of Alibaba Cloud Video Cloud.

3. Solutions for sports events

For sports video, simply applying Alibaba Cloud's regular online Narrowband HD transcoding has two major drawbacks:

First, it struggles to repair the noise unique to sports footage and may even amplify some of it, hurting the viewing experience.

Second, regular Narrowband HD cannot faithfully restore elements specific to basketball scenes, such as jersey numbers, the net, and boundary lines.

To address this, Narrowband HD 2.0 tunes its existing atomic algorithm capabilities for sports events, with some algorithms optimized specifically for basketball scenes.

The final transcoding process is shown in the following figure:

Live Streaming Transcoding Algorithm Process

4. Analysis of key technologies

4.1 Video Processing

Restoration and generation

As mentioned earlier, the image quality of our input source is not high to begin with and has been transcoded multiple times. The first processing step is therefore restoration and generation, whose main purpose is to repair the various defects in the video, such as compression blockiness, compression artifacts, edge ringing, residual de-interlacing noise, and blur, while also generating texture details lost to compression.

In academia there is plenty of research using deep learning for compression-artifact removal and deblurring: for example, the early ARCNN[1] for image decompression, MFQE[2] for video decompression, and DeepDeblur[3], an early end-to-end deblurring algorithm.

More recent methods include FBCNN[4], an image decompression algorithm with built-in compression-level estimation; STDF[5], a video decompression algorithm based on deformable convolution; and NAFNet[6], which dispenses with nonlinear activations.

Most of these algorithms target a single task: construct a dataset, design a network, train a model, and the resulting model handles only one degradation type. Our real-world footage, in contrast, contains a mixture of degradations: besides typical video compression there is camera defocus blur and motion blur, residual de-interlacing noise, and so on.

The network structure of image decompression algorithm ARCNN


Network Structure of Video Decompression Algorithm MFQE


The network structure of the end-to-end deblurring algorithm DeepDeblur

One way to handle all these "degradations" is to train one model per degradation and run the models in sequence. Each model's task then becomes simpler, which eases dataset construction and training, but the approach works poorly in practice: the remaining degradations interfere heavily with each individual model, and algorithm performance drops sharply.

We therefore take the second approach: one model handling multiple degradations. This achieves better overall results; the difficulty is that constructing the training data becomes complex and the demands on network capacity grow, since the model must cope with many degradation types and their many permutations and combinations.

For the training data, we borrowed the degradation schemes of BSRGAN[7]/Real-ESRGAN[8] from image super-resolution and RealBasicVSR[9] from video super-resolution, and added degradation modes specific to live sports broadcasts, to simulate defects such as jagged and white-fringed court boundary lines.
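As a flavor of this style of training-data synthesis, here is a minimal Python sketch in the spirit of the BSRGAN/Real-ESRGAN degradation pipelines (the probabilities and parameter ranges are illustrative assumptions, and the sports-specific degradations mentioned above are omitted):

```python
import io
import random

import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    """Apply a random chain of degradations to a clean frame, in the spirit
    of BSRGAN/Real-ESRGAN pipelines. All parameter ranges are illustrative."""
    # 1) Random blur: stands in for out-of-focus or mild motion blur.
    if random.random() < 0.8:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))
    # 2) Random down-up scaling to wash out fine texture.
    if random.random() < 0.7:
        w, h = img.size
        s = random.uniform(0.5, 0.9)
        img = img.resize((int(w * s), int(h * s)), Image.BILINEAR)
        img = img.resize((w, h), Image.BILINEAR)
    # 3) Additive Gaussian noise (stand-in for sensor / de-interlacing residue).
    if random.random() < 0.5:
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0, random.uniform(1, 8), arr.shape)
        img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # 4) JPEG round-trip to simulate compression artifacts.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(30, 85))
    return Image.open(io.BytesIO(buf.getvalue())).convert("RGB")
```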

For the network structure, to reduce computation we process single frames; the model can use the classic ESRGAN[10] architecture, the common UNet[11] structure, or the VGG-style structure proposed in RepSR[12].

For the loss function, since details lost to the various degradations must be regenerated, we use perceptual loss and GAN loss in addition to the common L1/L2 losses.
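A minimal PyTorch-style sketch of such a combined objective (the loss weights and the choice of VGG features are illustrative assumptions, not our tuned configuration):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RestorationLoss(nn.Module):
    """L1 + perceptual (VGG feature) + adversarial loss; weights illustrative."""
    def __init__(self, w_l1=1.0, w_percep=0.1, w_gan=0.005):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:35].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)   # perceptual net stays frozen
        self.vgg = vgg
        self.l1 = nn.L1Loss()
        self.bce = nn.BCEWithLogitsLoss()
        self.w = (w_l1, w_percep, w_gan)

    def forward(self, pred, target, disc_logits_on_pred):
        loss_l1 = self.l1(pred, target)                          # pixel fidelity
        loss_percep = self.l1(self.vgg(pred), self.vgg(target))  # texture/structure
        # Generator side of the GAN loss: try to fool the discriminator.
        loss_gan = self.bce(disc_logits_on_pred,
                            torch.ones_like(disc_logits_on_pred))
        return (self.w[0] * loss_l1 + self.w[1] * loss_percep
                + self.w[2] * loss_gan)
```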


Various image degradation methods proposed by BSRGAN

A major problem with GAN-based generative networks is insufficient robustness and temporal consistency. Robustness refers to whether natural textures are generated stably: some GAN models occasionally produce strange, unnatural textures, which is especially jarring in face regions.

Temporal consistency refers to whether adjacent frames generate consistent textures; if not, the result flickers and the viewing experience suffers.

To improve robustness, especially in face regions, we borrowed the idea of LDL[13]: detect fine-scale detail regions and impose an extra penalty on them to improve the generation of fine-scale details. Face regions are obtained via segmentation, and an additional penalty on the generation quality there improves the robustness of facial detail generation.


Face region segmentation

For temporal consistency, we add the TCRNet network as an extra supervision signal. TCRNet was originally designed for super-resolution tasks and can be adapted to restoration with minor changes. It uses an IRRO iterative offset-correction module combined with deformable convolution to improve motion-compensation accuracy, and ConvLSTM to propagate temporal information and suppress error accumulation, thereby improving temporal consistency.


TCRNet network structure
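TCRNet itself is more involved; as a rough illustration of the general idea of temporal supervision, a flow-warped consistency penalty between adjacent outputs can be sketched as follows (a generic formulation, not the exact TCRNet loss; the optical flow is assumed to come from an external estimator such as RAFT):

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (NCHW) by per-pixel flow (N2HW, in pixels)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(img.device)  # (H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)        # add displacement -> (N, H, W, 2)
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(img, grid, align_corners=True)

def temporal_consistency_loss(out_t, out_prev, flow, visibility_mask):
    """Penalize flicker: the current output should match the previous output
    warped along optical flow, wherever pixels remain visible."""
    warped_prev = warp(out_prev, flow)
    diff = torch.abs(out_t - warped_prev)
    return (diff * visibility_mask).mean()        # mask out occluded regions
```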

The following two images compare the source stream and the effect after restoration.

In the first comparison, the edges of the letters GARDEN on the restored floor have become sharp and clean; the boundary lines, the players' outlines, and the number 22 on the jersey are clearer; and the floor texture has been recovered.

The second comparison shows that the outlines of spectators outside the court and the lines on their clothing have become clearer, and the originally distorted, jagged floor boundary lines are now straight.

Model acceleration

To get the best restoration and generation results, deep-learning-based AI algorithms are usually the first choice. Their drawback is heavy computation, and for a low-level vision task such as video restoration the computation is far larger than for typical high-level vision tasks.

On the one hand, a video restoration model usually takes the video's original resolution as input, whereas high-level models such as detection and classification can run on inputs far smaller than the original resolution with essentially no impact on accuracy. For the same network structure, larger input resolution means more computation, so the restoration model does much more work.

On the other hand, a video restoration model must output frames at the same resolution as the input, so the second half of the model also operates on relatively high-resolution feature maps, making it expensive. High-level tasks such as detection and classification output only semantic information such as boxes or class labels; although the channel counts in the second half of those models are large, the feature maps are small, so the overall computation is much lower (the toy calculation below makes this gap concrete).

Moreover, live sports typically run at 50fps, and the Blu-ray-grade stream is 1080p; the deep learning model therefore has to sustain at least 50fps on 1080p input, a very demanding target for a deep learning algorithm.
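A back-of-the-envelope comparison makes the resolution gap concrete (a toy count of multiply-accumulates for a single 3×3, 64-channel convolution layer; real networks differ, but the scaling with resolution is the point):

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for one k x k convolution at full resolution."""
    return h * w * c_in * c_out * k * k

# Restoration model: operates on the full 1080p frame.
restore = conv_macs(1080, 1920, 64, 64)
# Classification model: typically fed a 224 x 224 crop/resize.
classify = conv_macs(224, 224, 64, 64)

print(f"1080p layer: {restore / 1e9:.1f} GMACs")    # 76.4 GMACs
print(f"224px layer: {classify / 1e9:.2f} GMACs")   # 1.85 GMACs
print(f"ratio: {restore / classify:.0f}x")          # ~41x more work per layer
```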

To address this, we accelerate model inference along several dimensions.

First, compress the deep learning model, for example by shrinking it with Neural Architecture Search (NAS) or pruning. To recover the performance lost when the model shrinks, the compressed model is trained with knowledge distillation; computation can then be cut further with 8-bit integer quantization or FP16 half precision. A minimal sketch of the distillation objective follows this list.

Second, substantial speedups come from pairing suitable hardware with a matching inference framework, for example a high-performance GPU and a well-tuned inference engine. To push speed further, multiple GPUs can run in parallel.

With all of these acceleration techniques combined, the processing speed on 1080p input rose from 8fps to 67fps, comfortably meeting the 50fps live transcoding requirement.
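As a concrete flavor of the first point, a distillation objective for the compressed model might look like this minimal PyTorch sketch (the weighting `alpha` is an illustrative assumption, not our production recipe; FP16 inference is shown in the trailing comment):

```python
import torch
import torch.nn as nn

def distillation_loss(student_out, teacher_out, target, alpha=0.7):
    """Train the pruned/NAS-searched student to match both the ground truth
    and the large teacher's output. alpha is an illustrative weight."""
    l1 = nn.L1Loss()
    return alpha * l1(student_out, teacher_out.detach()) + \
           (1 - alpha) * l1(student_out, target)

# FP16 half-precision inference (the other knob mentioned above):
# model = model.half().cuda()
# with torch.no_grad():
#     out = model(frame.half().cuda())
```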


Categories of deep learning inference acceleration

Sharpness enhancement

To further improve the viewing experience, sharpness enhancement is applied on top of the restoration and generation described above.

The simplest way to improve clarity is sharpening. For example, unsharp and cas, which ship with ffmpeg, are two simple sharpening filters, and both are designed around the USM (UnSharp Mask) framework. The USM framework can be described by the following formula[14]:

sharpened = original + (original − blurred) × amount

Here, original is the image to be sharpened, and blurred is a blurred version of original, for example after Gaussian blur, which is also where the name "unsharp" comes from. (original − blurred) represents the detail component of the original image; multiplied by amount and added back onto original, it yields the sharper, clearer image sharpened.
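A minimal NumPy/OpenCV rendering of this formula (the blur sigma and amount are illustrative; the self-developed sharpening described below is considerably more adaptive):

```python
import cv2
import numpy as np

def usm_sharpen(original: np.ndarray, amount: float = 0.8,
                sigma: float = 1.5) -> np.ndarray:
    """Unsharp masking: sharpened = original + amount * (original - blurred)."""
    # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
    blurred = cv2.GaussianBlur(original, ksize=(0, 0), sigmaX=sigma)
    detail = original.astype(np.float32) - blurred.astype(np.float32)
    sharpened = original.astype(np.float32) + amount * detail
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```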

Beyond sharpening, clarity can also be improved by adjusting contrast, brightness, color, and more. For BesTV's basketball broadcasts, we used our self-developed algorithms for sharpening plus brightness, contrast, and color enhancement to push clarity further.

Compared with open-source sharpeners such as unsharp, the Alibaba Cloud Video Cloud self-developed sharpening algorithm has the following characteristics:

• A more refined extraction of image texture details: it separates texture structures of different scales and characteristics, giving a better enhancement effect;

• Local adaptive enhancement: the texture structure of the image content is analyzed, and enhancement strength adapts to regional texture complexity;

• Coding-aware: the enhancement strategy adapts to coding information fed back from the encoder.

Detail enhancement (sharpening) algorithm flow

4.2 Bit Rate Allocation

JND

Thanks to the restoration generation and sharpness enhancement above, detail information has increased considerably, and we want to preserve as much of it as possible through compression encoding. Traditional video coding is grounded in information theory and keeps removing temporal, spatial, and statistical redundancy, but it mines visual redundancy far less thoroughly. The figure below is taken from a paper by Dr. Wang Haiqiang: traditional RDO treats the rate-distortion relationship as a continuous convex curve, but to human eyes it is a staircase. If we can find that staircase, we can save bitrate without affecting subjective quality. JND (Just Noticeable Difference) mines visual redundancy based precisely on this idea.


Bitrate vs Perceptual Distortion

Alibaba Cloud Video Cloud's self-developed JND algorithm fully exploits visual redundancy in both the spatial and temporal dimensions, achieving more than 30% bitrate savings at the same subjective quality in general scenarios.

With this self-developed JND algorithm, the detail recovered through restoration generation and sharpness enhancement can still be preserved after encoding at a lower bitrate.


JND algorithm flow
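The production JND algorithm is proprietary, but as a toy illustration of how a JND map could steer rate allocation, per-block QP offsets might be derived like this (block size, thresholds, and offsets are all invented for illustration):

```python
import numpy as np

def qp_offsets_from_jnd(jnd_map: np.ndarray, block: int = 16) -> np.ndarray:
    """Toy mapping: blocks whose average JND threshold is high (the eye is
    tolerant there) get a positive QP offset (coarser quantization, fewer
    bits); low-JND blocks are protected. All thresholds are illustrative."""
    h, w = jnd_map.shape
    bh, bw = h // block, w // block
    offsets = np.zeros((bh, bw), dtype=np.int8)
    for by in range(bh):
        for bx in range(bw):
            tol = jnd_map[by*block:(by+1)*block, bx*block:(bx+1)*block].mean()
            if tol > 6.0:      # very tolerant region: save bits
                offsets[by, bx] = +3
            elif tol < 2.0:    # sensitive region: spend bits
                offsets[by, bx] = -2
    return offsets
```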

ROI

The JND algorithm above saves more than 30% bitrate by mining visual redundancy, but those savings rest purely on low-level statistical information and ignore high-level semantics.

For the close-ups of players that viewers focus on in sports, we want the faces rendered as clearly as possible. Beyond obtaining clear close-ups through restoration generation, we need a way to keep them clear after encoding, and this is where our self-developed ROI coding technology comes in.

ROI (Region of Interest) coding is a video coding technique built around regions of interest. Simply put, it allocates more bitrate to the regions of interest in the image to raise their quality, and less to regions the viewer cares less about, improving the overall viewing experience while keeping the total bitrate essentially unchanged.

The main difficulties of ROI coding are:

1) The ROI algorithm must be cheap and fast enough to keep up with high-resolution, high-frame-rate live sports;

2) Rate-control decisions must use the ROI so that subjective quality rises inside the ROI without dropping outside it, while staying temporally consistent and flicker-free.

For low-cost ROI computation, we developed a face detection-and-tracking algorithm with adaptive scheduling: most of the time only lightweight face tracking runs, and the full face detector runs only occasionally, yielding ultra-low-cost, fast ROI extraction while maintaining high accuracy.
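Schematically, that adaptive scheduling can be sketched as follows (`detect_faces` is an assumed detector returning an (x, y, w, h) box or None; the KCF tracker and the detection interval are illustrative stand-ins for the production components):

```python
import cv2

DETECT_INTERVAL = 25   # illustrative: full detection roughly once per second at 25fps

def roi_stream(frames):
    """Yield one face ROI per frame: run the (expensive) detector only every
    DETECT_INTERVAL frames, and a cheap correlation tracker in between."""
    tracker, box = None, None
    for i, frame in enumerate(frames):
        if i % DETECT_INTERVAL == 0 or tracker is None:
            box = detect_faces(frame)   # assumed external detector
            if box is not None:
                # cv2.legacy.TrackerKCF_create on some OpenCV builds.
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, box)
        elif tracker is not None:
            ok, box = tracker.update(frame)
            if not ok:                  # lost the target: force re-detection
                tracker, box = None, None
        yield box
```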

For rate-control decisions, we work with the encoder to balance subjective and objective quality and keep the ROI temporally consistent, and we combine with JND to balance subjective quality between ROI and non-ROI regions, achieving scene- and quality-adaptive bitrate allocation.


ROI algorithm process

4.3 Coding Kernel

For live sports events, within the video coding kernel we optimized subjective block partitioning and blocking artifacts, raising the subjective clarity of the compressed video and reducing blockiness to improve the overall viewing experience.

Subjective block partitioning

The encoder's block partitioning decisions are based on rate-distortion optimization (RDO), minimizing the cost

J = D + λ × R

where D is the distortion and R is the number of bits needed to encode the current mode.

In partitioning decisions, there are cases where the encoder settles on a large block even though splitting into small blocks looks subjectively better. This happens because the large-block mode has higher distortion D but lower R, so its RD cost wins and the encoder chooses the large partition.

To counter this, we modified the distortion terms of the different partition modes, adding different weighting coefficients for blocks of different sizes so that the final partitioning better matches subjective perception.
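In effect the cost becomes J = w(size) × D + λ × R, with a larger weight w for larger partitions. A toy comparison with invented numbers shows how the weight flips the decision:

```python
def rd_cost(distortion, bits, lam, size_weight=1.0):
    """Weighted rate-distortion cost: penalize distortion more heavily for
    large partitions so subjectively worse big blocks lose the comparison."""
    return size_weight * distortion + lam * bits

lam = 10.0
# Hypothetical numbers: the 64x64 mode has the lower plain RD cost...
plain_64 = rd_cost(1000, 40, lam)                        # 1400
plain_8  = rd_cost(800, 65, lam)                         # 1450 -> encoder picks 64x64
# ...but with a distortion weight on large blocks, the 8x8 split wins:
weighted_64 = rd_cost(1000, 40, lam, size_weight=1.5)    # 1900
weighted_8  = rd_cost(800, 65, lam, size_weight=1.0)     # 1450 -> picks 8x8
```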


Left: before optimization; right: after optimization


Block partitioning before (left) and after (right) optimization

Blocking artifact optimization

The rate-distortion theory underlying video coding approximates human perception fairly well, and encoders built on it are largely optimized for subjective quality. The main exception is blocking artifacts: the human eye is extremely sensitive to straight edges, so block boundaries stand out.

We observed that under purely objective RDO, the chosen partition modes can amplify blocking artifacts, and the deblocking filter in the H.265 standard fails in this scenario. We also found that in flat areas, slight blur plus noise looks better than crisp blocking artifacts.

Based on these observations, we adopted the following blocking-artifact optimization strategy to minimize blockiness and improve the viewing experience.


Block effect optimization algorithm flow

The figure below compares results before and after blocking-artifact optimization; the blockiness is significantly reduced in the optimized result on the right.


Left: before optimization; right: after optimization

4.4 Video results

With the video processing, bitrate allocation, and coding-kernel optimizations described above, we achieved top-tier picture restoration and 50fps live transcoding at 1080p, giving viewers a smooth, stable, high-definition experience.

https://www.youku.com/video/XNTg4MjMyNzMwOA==
Left: the source stream; right: the restored result

The collaboration with BesTV on NBA events fully demonstrates the value of "Narrowband HD 2.0" technology in elevating the visual experience of live basketball, balancing the commercial benefits of higher definition with viewing quality.

Going forward, Narrowband HD technology will keep evolving, further improving restoration and generation, reducing bitrate, and cutting cost through algorithmic capability. It will also be applied to more top-tier events, upgrading the visual experience while reconciling it with cost.
References:
[1] ARCNN: Chao Dong, et al., Compression Artifacts Reduction by a Deep Convolutional Network, ICCV2015
[2] MFQE: Ren Yang, et al., Multi-Frame Quality Enhancement for Compressed Video, CVPR2018
[3] DeepDeblur: Seungjun Nah, et al., Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring, CVPR2017
[4] FBCNN: Jiaxi Jiang, et al., Towards Flexible Blind JPEG Artifacts Removal, ICCV2021
[5] STDF: Jianing Deng, et al., Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement, AAAI2020
[6] NAFNet: Liangyu Chen, et al., Simple Baselines for Image Restoration, https://arxiv.org/abs/2204.04676
[7] BSRGAN: Kai Zhang, et al., Designing a Practical Degradation Model for Deep Blind Image Super-Resolution, CVPR2021
[8] Real-ESRGAN: Xintao Wang, et al., Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data, ICCVW2021
[9] RealBasicVSR: Kelvin CK Chan, et al., Investigating Tradeoffs in Real-World Video Super-Resolution, CVPR2022
[10] ESRGAN: Xintao Wang, et al., ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks, ECCVW2018
[11] UNet: Olaf Ronneberger, et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI2015
[12] RepSR: Xintao Wang, et al., RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization, https://arxiv.org/abs/2205.05671
[13] LDL: Jie Liang, et al., Details or Artifacts: A Locally Discriminative Learning Approach to Realistic Image Super-Resolution, CVPR2022
[14] USM: https://en.wikipedia.org/wiki/Unsharp_masking

"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.
