[Focus on RongCloud Global Internet Communication Cloud] This article covers five parts: the characteristics of three commonly used types of scalable video coding; the encoders used by WebRTC and how they are applied; the current status of scalable coding in WebRTC; an object detection and bit rate allocation method based on scalable coding; and the application prospects and research directions for combining AI with scalable coding.

Features of three commonly used types of scalable video coding

After a video image is digitized, the data volume is enormous; existing networks and storage devices cannot handle raw video directly, so the video must be compressed. The mainstream video compression standards today are H.264, VP8, VP9, HEVC, VVC, and so on. On one hand, from H.264 to VVC, both encoding complexity and compression efficiency keep increasing; on the other hand, transmission bandwidth varies across networks and changes over time, so a single bit stream cannot adapt to the many different network and device environments of different receivers. For example, 4G and 5G networks offer different bandwidths: if the same bit stream is transmitted on both, the 5G bandwidth may not be fully utilized, ultimately hurting the viewing experience.

Today's video applications involve many different receivers. Two technologies can address this problem: Simulcast and Scalable Video Coding (SVC).

As shown in Figure 1, Simulcast transmits multiple streams simultaneously; different streams have different bit rates and serve different bandwidth conditions. When a terminal is on a high-bandwidth network, it can receive a high-bit-rate video for a better viewing experience; on a low-bandwidth network, it can receive a low-bit-rate video to reduce playback stalls. However, Simulcast supports only a limited set of bit rates, making it hard to adapt to complex network environments. To address this, researchers proposed Scalable Video Coding (SVC), in which video data is compressed only once but can be decoded at multiple frame rates, spatial resolutions, or quality levels. For example, with three spatial layers and two temporal layers, six combined modes are available. Compared with Simulcast, this greatly improves the system's adaptability.


(Figure 1 Simulcast and scalable coding)

There are three commonly used types of scalability, namely: spatial scalability, quality scalability, and temporal scalability.


(Figure 2 Three commonly used methods of scalable coding)

Spatial scalable coding (Figure 3) generates multiple images at different spatial resolutions for each frame of the video. Decoding the base layer bit stream alone yields a low-resolution image; feeding the enhancement layer bit stream to the decoder as well yields a high-resolution image.


(Figure 3 Spatial scalability)
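
To make the layering concrete, below is a minimal C++ sketch of the idea, not taken from any real encoder: the base layer is a 2x downsampled picture, and the enhancement layer is the residual between the original and the upsampled base layer. All names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of spatial scalability (not a real encoder): the base
// layer carries a downsampled picture; the enhancement layer carries the
// residual between the original and the upsampled base-layer picture.
struct Frame {
    int width = 0, height = 0;
    std::vector<uint8_t> luma;  // 8-bit luma samples, row-major
};

// Base layer: 2x2 average downsampling of the original frame.
Frame Downsample2x(const Frame& src) {
    Frame dst{src.width / 2, src.height / 2, {}};
    dst.luma.resize(static_cast<size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y)
        for (int x = 0; x < dst.width; ++x) {
            int sum = src.luma[(2 * y) * src.width + 2 * x] +
                      src.luma[(2 * y) * src.width + 2 * x + 1] +
                      src.luma[(2 * y + 1) * src.width + 2 * x] +
                      src.luma[(2 * y + 1) * src.width + 2 * x + 1];
            dst.luma[y * dst.width + x] = static_cast<uint8_t>(sum / 4);
        }
    return dst;
}

// Enhancement layer: per-pixel residual between the original and the
// nearest-neighbour upsampled base layer. A real SVC encoder would then
// transform, quantize, and entropy-code this residual.
std::vector<int16_t> EnhancementResidual(const Frame& orig, const Frame& base) {
    std::vector<int16_t> residual(orig.luma.size());
    for (int y = 0; y < orig.height; ++y)
        for (int x = 0; x < orig.width; ++x)
            residual[y * orig.width + x] =
                static_cast<int16_t>(orig.luma[y * orig.width + x]) -
                base.luma[(y / 2) * base.width + (x / 2)];
    return residual;
}
```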

Quality scalability (Figure 4): one feasible approach is to apply coarse quantization to the DCT coefficients of the original image and entropy-code the result to form the base layer bit stream. The coarsely quantized data is then inverse-quantized to obtain base layer coefficients, which are subtracted from the original DCT coefficients to form a difference signal; this difference signal is finely quantized and entropy-coded to produce the enhancement layer bit stream.


(Figure 4 Quality scalability)
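
The coarse/fine quantization just described can be sketched as follows. This is an illustrative C++ fragment operating on a given array of DCT coefficients; the transform and entropy coding are omitted, and the step sizes are arbitrary example values.

```cpp
#include <vector>

// Illustrative sketch of quality (SNR) scalability on a block of DCT
// coefficients (coefficients assumed given; transform/entropy coding omitted).
struct QualityLayers {
    std::vector<int> base;         // coarsely quantized coefficients
    std::vector<int> enhancement;  // finely quantized residual coefficients
};

QualityLayers EncodeQualityScalable(const std::vector<int>& dct_coeffs,
                                    int coarse_step,   // e.g. 32
                                    int fine_step) {   // e.g. 8
    QualityLayers out;
    for (int c : dct_coeffs) {
        int base_q = c / coarse_step;          // coarse quantization -> base layer
        int base_rec = base_q * coarse_step;   // inverse quantization
        int residual = c - base_rec;           // difference vs. original coefficient
        out.base.push_back(base_q);
        out.enhancement.push_back(residual / fine_step);  // fine quantization -> enhancement layer
    }
    return out;
}

// Decoding the base layer alone gives a coarse picture; adding the
// enhancement layer refines each coefficient.
int ReconstructCoefficient(int base_q, int enh_q, int coarse_step, int fine_step) {
    return base_q * coarse_step + enh_q * fine_step;
}
```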

Temporal scalability (Figure 5): the video sequence is divided into multiple non-overlapping layers. Ordinary video coding is applied to the base layer frames, providing a base layer bit stream at a basic temporal resolution; the enhancement layer frames are then encoded with inter-frame prediction from the base layer data to generate the enhancement layer bit stream.


(Figure 5 Temporal scalability)
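
As a concrete illustration, the sketch below assigns a temporal layer ID to each frame under a dyadic layering. With two layers, dropping every temporal_id == 1 frame halves the frame rate while the base layer remains decodable, because enhancement frames only predict from lower layers. This is an illustrative scheme, not code from any particular encoder.

```cpp
#include <cstdio>

// Two-layer temporal scalability: even frames form the base layer
// (temporal_id 0), odd frames the enhancement layer (temporal_id 1).
int TemporalIdForFrame(int frame_index, int num_temporal_layers) {
    // Dyadic layering; for 2 layers this yields the pattern 0,1,0,1,...
    for (int tid = 0; tid < num_temporal_layers; ++tid)
        if (frame_index % (1 << (num_temporal_layers - tid - 1)) == 0)
            return tid;
    return num_temporal_layers - 1;
}

int main() {
    for (int i = 0; i < 8; ++i)
        std::printf("frame %d -> temporal_id %d\n", i, TemporalIdForFrame(i, 2));
    // A receiver limited to the base layer simply discards frames whose
    // temporal_id exceeds 0, e.g. keeping frames 0, 2, 4, 6 (half the rate).
}
```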

Encoders used by WebRTC and their application methods

The encoders supported by WebRTC include VP8, VP9, and H.264. In terms of user experience, VP8 and H.264 perform roughly the same. VP9, the next-generation successor to VP8, outperforms both VP8 and H.264 in high-definition video compression.

As shown in Figure 6, combining encoder performance with browser support leads to the following conclusions: VP8 and H.264 deliver essentially the same encoding quality, and either can be used under normal circumstances; VP9 is mainly used in Google's own video products, and it is worth noting that VP9 supports multiple SVC modes; HEVC is currently usable only on Apple systems, so it cannot be promoted widely and is not recommended; AV1 is still too new and is well supported only in Google's products, so it is not recommended for the time being.


(Figure 6 Encoder support in the browser)

The Application Status of Scalable Coding in WebRTC

Before discussing how scalable coding is applied in WebRTC, let's briefly review WebRTC's communication and networking process.

As shown in Figure 7, client A and client B communicate in a direct connection mode or a server mode. In large-scale networks, a server-based mode will be used for forwarding and signal processing.


(Figure 7 WebRTC simple process)

To handle multiple receivers across different application scenarios, WebRTC offers three networking solutions: Mesh, MCU, and SFU.

In the Mesh scheme (Figure 8), terminals connect to each other in pairs, forming a mesh structure. For example, suppose three terminals A, B, and C communicate many-to-many. When A wants to share media (such as audio and video), it must send the data to B and C separately; likewise, when B wants to share media, it must send the data to A and C, and so on. Since each terminal uploads a copy of its stream to every other terminal, this solution demands relatively high bandwidth from each terminal.


(Figure 8 Mesh scheme)
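
A quick back-of-the-envelope sketch of why Mesh is expensive: with N terminals each sharing one stream, every terminal uploads N-1 copies, and the mesh carries N(N-1) directed streams in total. The bitrate below is an assumed figure for illustration only.

```cpp
#include <cstdio>

// Rough uplink-cost sketch for the Mesh topology.
int main() {
    const int N = 3;                 // terminals A, B, C
    const int bitrate_kbps = 1500;   // assumed per-stream bitrate
    int uplink_per_terminal = (N - 1) * bitrate_kbps;  // copies each terminal uploads
    int streams_total = N * (N - 1);                   // directed streams in the mesh
    std::printf("uplink per terminal: %d kbps, total streams: %d\n",
                uplink_per_terminal, streams_total);
}
```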

In the MCU (Multipoint Conferencing Unit) solution (Figure 9), a server and multiple terminals form a star topology. Each terminal sends the audio and video streams it wants to share to the server; the server mixes the streams of all terminals in the same room and sends the resulting mixed stream back to each terminal, so every terminal can see and hear the others. The server is effectively an audio/video mixer, which puts heavy pressure on it.


(Figure 9 MCU solution)

The SFU (Selective Forwarding Unit) solution (Figure 10) also consists of a server and multiple terminals, but unlike the MCU, the SFU does not mix audio and video: after receiving a stream shared by one terminal, it simply forwards that stream directly to the other terminals in the room.


(Figure 10 SFU scheme)
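
The "selective" part can be sketched as follows: given the per-layer bitrates of an SVC stream (base layer first), the SFU forwards to each subscriber only as many layers as that subscriber's estimated bandwidth allows. The function and its parameters are illustrative, not part of any WebRTC SFU API.

```cpp
#include <vector>

// Sketch of selective forwarding with an SVC stream: keep the layer bitrates
// (base layer first) and, per subscriber, forward only as many layers as the
// subscriber's estimated bandwidth can carry.
int LayersToForward(const std::vector<int>& layer_bitrate_kbps,
                    int subscriber_bandwidth_kbps) {
    int used = 0, layers = 0;
    for (int b : layer_bitrate_kbps) {
        if (used + b > subscriber_bandwidth_kbps) break;
        used += b;
        ++layers;
    }
    return layers;  // 0 means even the base layer does not fit
}

// Example: layers of 300/600/1200 kbps; a 1 Mbps subscriber gets 2 layers
// (base + first enhancement), while a 3 Mbps subscriber gets all 3.
```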

The bandwidth figures for the three schemes are shown in Figure 11: the maximum bandwidth of the SFU reaches 25 Mbps, while the minimum bandwidth of the MCU is 10 Mbps.


(Figure 11 Bandwidth of the three networking schemes)

In terms of characteristics, the Mesh solution is inflexible; the MCU solution requires operations such as transcoding, mixing, and splitting of bit streams; the SFU solution puts less pressure on the server, offers better flexibility, and is therefore widely adopted.

Figure 12 illustrates the forwarding modes of Simulcast and SVC. Comparing the upper and lower diagrams, the SVC-based stream distribution method offers greater adjustability for the PC endpoint. Whichever networking method is adopted, SVC is more robust than Simulcast.


(Figure 12 Forwarding modes of Simulcast and SVC)

The support situation is shown in Figure 13: H.264 supports only Simulcast, VP8 supports temporal scalability, and VP9 supports SVC in all dimensions. VP9 is the codec Google promotes most, while optimization of the H.264 codec has been pushed less strongly, which limits the application of WebRTC to a certain extent. For example, Apple's latest iPhone 13 ships with H.264 hardware acceleration; if the AV1 encoder is used instead, the advantages of SVC can be obtained, but hardware decoding is unavailable. In WebRTC, Simulcast by default uses multi-threading to run multiple OpenH264 encoders simultaneously, while SVC can invoke OpenH264 for temporal and spatial scalable coding.


(Figure 13 Support for scalable coding in WebRTC)

Object detection and bit rate allocation scheme based on scalable coding

For an N-party SFU session, the SFU must account for the sum of the bit rates of the other N-1 terminals. For most video conferences, the ratio of each layer's bit rate to the total bit rate, under a given combination of temporal and spatial layers, is roughly constant, as shown in Figure 14.


(Figure 14 Code stream distribution diagram of different layers)
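
If those per-layer ratios are treated as constants, rate allocation reduces to splitting a target total bit rate by fixed ratios, as in the illustrative sketch below. The ratio values in the usage comment are placeholders, not measured data.

```cpp
#include <vector>

// Sketch of ratio-based rate allocation: if each (spatial, temporal) layer's
// share of the total bitrate is roughly constant for conference content
// (Figure 14), the sender can split a target bitrate by fixed ratios.
std::vector<int> AllocateLayerBitrates(int total_kbps,
                                       const std::vector<double>& ratios) {
    std::vector<int> out;
    out.reserve(ratios.size());
    for (double r : ratios)
        out.push_back(static_cast<int>(total_kbps * r + 0.5));  // round to kbps
    return out;
}

// Usage: AllocateLayerBitrates(2000, {0.2, 0.3, 0.5}) -> {400, 600, 1000}.
```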

Based on the phenomenon in Figure 14, video motion can be used as the main metric for distributing the bit stream. The framework from the related papers is shown in Figure 15.


(Figure 15 SVC encoder code rate allocation)

This solution leaves two areas for improvement. First, the motion measure relies on the difference between the current frame and the previous frame, which struggles to accurately reflect how motion in the video changes. Second, features beyond motion could be added to better capture changes in the images and video. The proposed solution is shown in Figure 16.


(Figure 16 The proposed solution)
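
For reference, the simple motion measure criticized above can be implemented as the mean absolute difference (MAD) between consecutive frames' luma planes, as in this illustrative sketch; a larger MAD is read as more motion, steering more bit rate to that stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Mean absolute difference between the current and previous frame's luma
// samples; this is the frame-difference motion measure the text criticizes.
double MeanAbsFrameDifference(const std::vector<uint8_t>& cur,
                              const std::vector<uint8_t>& prev) {
    if (cur.empty() || cur.size() != prev.size()) return 0.0;
    long long sum = 0;
    for (size_t i = 0; i < cur.size(); ++i)
        sum += std::abs(static_cast<int>(cur[i]) - static_cast<int>(prev[i]));
    return static_cast<double>(sum) / cur.size();
}
```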

In WebRTC, the H.264 encoder is Cisco's open-source OpenH264. An OpenH264 scalable-encoding configuration file is shown below (Figure 17); it sets up two temporal layers.


(Figure 17 OpenH264 scalable encoding configuration file)
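
Alongside the configuration-file route, the same two-temporal-layer setup can be expressed through OpenH264's C++ API. The sketch below is a minimal example; the field names follow OpenH264's public headers (codec_api.h / codec_app_def.h) to the best of my reading, and those headers should be consulted as the authoritative reference.

```cpp
#include "codec_api.h"  // OpenH264 public API

// Minimal sketch: configure an OpenH264 encoder (created elsewhere via
// WelsCreateSVCEncoder) for one spatial layer and two temporal layers,
// mirroring what the Figure 17 configuration file expresses.
bool InitTwoTemporalLayerEncoder(ISVCEncoder* encoder) {
    SEncParamExt param;
    encoder->GetDefaultParams(&param);
    param.iUsageType = CAMERA_VIDEO_REAL_TIME;  // real-time communication preset
    param.iPicWidth = 1280;
    param.iPicHeight = 720;
    param.iTargetBitrate = 1500000;             // bits per second (example value)
    param.fMaxFrameRate = 30.0f;
    param.iTemporalLayerNum = 2;                // two temporal layers (as in Figure 17)
    param.iSpatialLayerNum = 1;                 // single spatial layer
    param.sSpatialLayers[0].iVideoWidth = 1280;
    param.sSpatialLayers[0].iVideoHeight = 720;
    param.sSpatialLayers[0].fFrameRate = 30.0f;
    param.sSpatialLayers[0].iSpatialBitrate = 1500000;
    return encoder->InitializeExt(&param) == cmResultSuccess;
}
```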

The defining characteristic of an SVC bit stream is that a single stream has a multi-layer structure, so in practice sub-streams must be extracted from it. For temporal scalability, extraction is done by inspecting the Temporal ID in each NAL unit; for spatial scalability, the Spatial ID; and for quality scalability, the Quality ID.
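Per the H.264 Annex G syntax, for NAL unit types 14 (prefix NAL) and 20 (coded slice extension) the one-byte NAL header is followed by a three-byte SVC extension carrying priority_id, dependency_id (the spatial ID), quality_id, and temporal_id. A minimal parsing sketch follows; the helper is illustrative, not code from WebRTC or JSVM.

```cpp
#include <cstddef>
#include <cstdint>

// Layer IDs carried in the H.264 SVC NAL header extension.
struct SvcLayerIds {
    int dependency_id;  // spatial layer
    int quality_id;     // quality layer
    int temporal_id;    // temporal layer
};

bool ParseSvcNalHeader(const uint8_t* nal, size_t size, SvcLayerIds* ids) {
    if (size < 4) return false;
    int nal_unit_type = nal[0] & 0x1F;
    if (nal_unit_type != 14 && nal_unit_type != 20) return false;  // not SVC
    // nal[1]: svc_extension_flag(1) | idr_flag(1) | priority_id(6)
    // nal[2]: no_inter_layer_pred_flag(1) | dependency_id(3) | quality_id(4)
    // nal[3]: temporal_id(3) | use_ref_base_pic_flag(1) | discardable_flag(1) |
    //         output_flag(1) | reserved_three_2bits(2)
    ids->dependency_id = (nal[2] >> 4) & 0x07;
    ids->quality_id = nal[2] & 0x0F;
    ids->temporal_id = (nal[3] >> 5) & 0x07;
    return true;
}
```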

As Figure 18 shows, the base layer bit stream produced by OpenH264 can be decoded directly by an AVC decoder, and the base layer's SVC_extension_flag equals 1.


(Figure 18 Decoding diagram of scalable coding base layer)

The NAL units of the SVC enhancement layer bit stream contain SVC-specific syntax, so the SVC bit stream must be transcoded before a plain AVC decoder can use it. JSVM, the reference software for scalable coding, includes a dedicated transcoding module. Figure 19 shows the transcoding process, where multiple NAL units can be seen being rewritten into AVC format.


(Figure 19 Scalable coding enhancement layer NAL transcoding)

Figure 20 shows the bit stream after conversion with JSVM being decoded by a standard AVC decoder.


(Figure 20 Decoding diagram after NAL layer transcoding)

Application prospects and research directions of the combination of AI and scalable coding

The most frequently used form of scalable coding is spatial scalability, but when switching between resolutions the quality drops noticeably. At the ICME 2020 conference, researchers proposed a super-resolution model for video coding that reconstructs high-resolution images by extracting frames at different times and fusing their features. Experimental results show improved super-resolution quality.


(Figure 21 Video super-resolution structure diagram)

Using this model in a scalable encoder can effectively reduce the visual discomfort caused by switching between streams of different resolutions.

MPEG-5 introduced Low Complexity Enhancement Video Coding (LCEVC). Compared with H.264, this coding method achieves higher compression efficiency at the same PSNR. The encoder structure is shown in Figure 22, where the Base Encoder can be any existing encoder, such as H.264, VP8, or VP9.

Combining WebRTC with LCEVC is a promising future direction. As a new video coding standard, LCEVC has several features: it improves the compression capability of the base layer codec, has low encoding and decoding complexity, and provides a platform for additional feature enhancement.

It can be seen from Figure 22 that the encoding complexity depends mainly on the Base Encoder. If H.264, which is widely used in WebRTC, is enhanced with LCEVC, coding performance improves significantly. Generally speaking, a high-frame-rate 1080p real-time sports video stream encoded with H.264 requires a peak bit rate of about 8 Mbps, whereas with LCEVC only about 4.8 Mbps is needed, a saving of 40%.


(Figure 22 LCEVC encoder)

Given LCEVC's coding performance, it is fair to judge that the combination of LCEVC and WebRTC will be an important direction for research and application.

