This article is a translation of Peikang's session at WWDC 2021. The speaker, Peikang, is from the Video Coding and Processing team. The translator, Tao Jinliang, is a senior audio and video development engineer at NetEase Yunxin with many years of client-side audio and video experience.

Support for low-latency encoding has become an important part of video application development, with wide use in low-latency live streaming and RTC. This session explains how VideoToolbox supports low-latency H.264 hardware encoding to minimize end-to-end delay and reach a new level of performance, ultimately enabling the best real-time communication and high-quality video playback. (VideoToolbox is a low-level framework that provides direct access to hardware encoders and decoders; it offers video compression and decompression services, as well as conversion between raster image formats stored in CoreVideo pixel buffers.)

Session recording: https://developer.apple.com/videos/play/wwdc2021/10158

Low-latency encoding is very important for many video applications, especially real-time video communication apps. In this talk, I will introduce a new encoding mode in VideoToolbox for low-latency encoding. The goal of this new mode is to optimize the existing encoder pipeline for real-time video communication applications. So what do real-time video communication applications need? We need to minimize the end-to-end delay in communication.

We also want to enhance interoperability by letting video applications communicate with more devices. And when there are multiple recipients in a call, the encoder pipeline should remain efficient, while the application still presents video at the best visual quality. Finally, we need a reliable mechanism to recover communication from errors introduced by network loss.

The low-latency video encoding discussed today is optimized in all of these areas. With the low-latency encoding mode, real-time applications can reach a new level of performance. In this talk, I will first give an overview of low-latency video encoding, so we get a basic understanding of how low latency is achieved in the pipeline. Then I will show how to use the VTCompressionSession API to build the pipeline and encode in low-latency mode. Finally, I will discuss several features we introduced in low-latency mode.

Low-latency video encoding

Let me first give an overview of low-latency video encoding. This is a simplified diagram of the video encoder pipeline on Apple platforms. VideoToolbox takes a CVImageBuffer as the input image and passes it to the video encoder, which applies a compression algorithm such as H.264 to reduce the size of the raw data. The compressed output is wrapped in a CMSampleBuffer, which can be transmitted over the network for video communication. As the diagram shows, end-to-end delay is affected by two factors: processing time and network transmission time.

To minimize processing time, the low-latency mode eliminates frame reordering and follows a one-in, one-out encoding pattern. In addition, the rate controller in low-latency mode adapts to network changes faster, so the delay caused by network congestion is also minimized. With these two optimizations, we already see a significant performance improvement over the default mode: for 720p 30fps video, low-latency encoding can reduce the delay by up to 100 milliseconds. This saving is critical for video conferencing.

With the delay reduced in this way, we can build a more efficient encoding pipeline for real-time communication such as video conferencing and live streaming.

In addition, the low-latency mode always uses a hardware-accelerated video encoder to save power. Note that the video codec type supported in this mode is H.264, and the feature is available on both iOS and macOS.

Using low-latency mode in VideoToolbox

Next, I want to talk about how to use the low-latency mode in VideoToolbox. I will first review the use of VTCompressionSession, and then show the steps required to enable low-latency encoding.

Using VTCompressionSession

When we use VTCompressionSession, we first create a session with the VTCompressionSessionCreate API, then configure it through the VTSessionSetProperty API, for example setting the target bit rate. If no configuration is provided, the encoder runs with its default behavior.

After the session is created and properly configured, we can pass a CVImageBuffer to the session by calling VTCompressionSessionEncodeFrame, and retrieve the encoded result from the output handler provided during session creation.

Enabling low-latency encoding in a compression session is easy; the only thing we need to do is change how the session is created, as shown in the following steps:

  • First, we need a CFMutableDictionary for the encoder specification, which specifies the particular video encoder that the session must use.
  • Then we set the EnableLowLatencyRateControl flag in that encoderSpecification.

  • Finally, we pass this encoderSpecification to VTCompressionSessionCreate, and the compression session will run in low-latency mode.

The configuration steps are the same as usual. For example, we can use the AverageBitRate property to set the target bit rate.
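Putting the steps above together, here is a minimal Swift sketch of the flow. The makeLowLatencySession and encode helpers, the 1280×720 dimensions, the 1 Mbps bit rate, and the print statement are illustrative choices, not values from the session.

```swift
import VideoToolbox

// Create an H.264 compression session that runs in low-latency mode.
func makeLowLatencySession(width: Int32, height: Int32) -> VTCompressionSession? {
    // Encoder specification: ask for the low-latency rate controller.
    let encoderSpecification = [
        kVTVideoEncoderSpecification_EnableLowLatencyRateControl: kCFBooleanTrue
    ] as CFDictionary

    var session: VTCompressionSession?
    let status = VTCompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        width: width,
        height: height,
        codecType: kCMVideoCodecType_H264,
        encoderSpecification: encoderSpecification,
        imageBufferAttributes: nil,
        compressedDataAllocator: nil,
        outputCallback: nil,        // nil so we can use the per-frame output handler below
        refcon: nil,
        compressionSessionOut: &session
    )
    guard status == noErr, let session = session else { return nil }

    // Configure the session as usual, e.g. the target bit rate.
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_AverageBitRate,
                         value: 1_000_000 as CFNumber)
    return session
}

// Encode one CVImageBuffer; the compressed CMSampleBuffer arrives in the output handler.
func encode(_ imageBuffer: CVImageBuffer, at presentationTime: CMTime,
            using session: VTCompressionSession) {
    VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: imageBuffer,
        presentationTimeStamp: presentationTime,
        duration: .invalid,
        frameProperties: nil,
        infoFlagsOut: nil
    ) { status, _, sampleBuffer in
        guard status == noErr, let encodedBuffer = sampleBuffer else { return }
        // Packetize encodedBuffer and hand it to the network layer here.
        print("encoded frame: \(CMSampleBufferGetTotalSampleSize(encodedBuffer)) bytes")
    }
}
```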

OK, we have covered the basics of VideoToolbox low-latency mode. Next, I want to introduce the new features of this mode, which further help us develop real-time video applications.

New features of VideoToolbox low-latency mode

So far we have discussed the latency advantage of low-latency mode; the remaining benefits come from the features introduced below.

The first feature is two new profiles, which enhance interoperability. I will also talk about temporal layered SVC, a feature that is very useful in video conferencing. You can also use the maximum frame quantization parameter (max QP) for fine-grained control over image quality. Finally, long-term reference (LTR) frames improve error resilience.

New Profiles support

Let's talk about the new profile support. A profile defines a set of coding capabilities that the decoder must support. The profile determines which algorithms are used for inter-frame compression during encoding (for example, whether B-frames, CABAC, or certain color spaces are supported). The higher the profile, the more advanced the compression features, and the higher the demands on the codec hardware. To communicate with a receiver, the encoded bitstream should conform to a specific profile that the decoder supports.

In VideoToolbox, we support a series of profiles, such as Baseline Profile, Main Profile and High Profile. Today, we added two new profiles to the series: Constrained Baseline Profile (CBP) and Constrained High Profile (CHP).

CBP is mainly used for low-cost applications, while CHP has more advanced algorithms for a better compression ratio. We should first check the decoder's capabilities to determine which profile to use.

To use CBP, simply set the ProfileLevel session property to ConstrainedBaseline_AutoLevel. Similarly, we can set the profile level to ConstrainedHigh_AutoLevel to use CHP.
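As a sketch, assuming the compression session created earlier (here called session), the corresponding VideoToolbox constants should be kVTProfileLevel_H264_ConstrainedBaseline_AutoLevel and kVTProfileLevel_H264_ConstrainedHigh_AutoLevel:

```swift
// Pick the profile the receiver's decoder supports.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_ProfileLevel,
                     value: kVTProfileLevel_H264_ConstrainedBaseline_AutoLevel)

// Or, for receivers that can decode Constrained High Profile:
// VTSessionSetProperty(session,
//                      key: kVTCompressionPropertyKey_ProfileLevel,
//                      value: kVTProfileLevel_H264_ConstrainedHigh_AutoLevel)
```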

Temporal layered SVC

Now let's talk about temporal layered SVC. We can use temporal layering to improve the efficiency of multi-party video calls.

Take a simple three-party video conference as an example. In this model, receiver A has a lower downlink bandwidth of 600 kbps, while receiver B has a higher bandwidth of 1,000 kbps. Normally the sender would need to encode two bitstreams to match each receiver's downlink bandwidth, which is not optimal.

With temporal layered SVC this can be done more efficiently: the sender only needs to encode a single bitstream, whose output can then be split into two layers.

Let's take a look at how this process works. This is a sequence of coded video frames, where each frame uses the previous frame as a prediction reference.

We can pull half of the frames out into another layer and change the references so that only frames in the original layer are used for prediction. The original layer is called the base layer, and the newly constructed layer is called the enhancement layer. The enhancement layer supplements the base layer to increase the frame rate.

For receiver A, we can send just the base-layer frames, because the base layer is already decodable on its own. More importantly, since the base layer contains only half of the frames, the transmitted data rate is low.

On the other hand, receiver B can enjoy smoother video because it has enough bandwidth to receive base layer frames and enhancement layer frames.

Let's look at a video encoded with temporal layered SVC. I will play two videos: one decoded from the base layer only, and the other from the base layer plus the enhancement layer. The base layer by itself plays back fine, but we may notice that the video is not as smooth. If we play the second video, we can immediately see the difference: compared with the video on the left, the video on the right has a higher frame rate because it contains both the base layer and the enhancement layer.

The video on the left uses 50% of the input frame rate and 60% of the target bit rate. The two videos only require the encoder to encode a single bitstream at a time, which is much more power-efficient when running multi-party video conferences.

Another benefit of temporal layering is error resilience. Notice that the enhancement-layer frames are not used for prediction, so no other frames depend on them. If one or more enhancement-layer frames are lost during network transmission, other frames are not affected. This makes the whole session much more robust.

Enabling temporal layering is very simple. In low-latency mode there is a new property, BaseLayerFrameRateFraction; just set it to 0.5, which means half of the input frames are assigned to the base layer and the rest to the enhancement layer.

We can check the layer information from the sample buffer attachments: for a base-layer frame, CMSampleAttachmentKey_IsDependedOnByOthers is true; otherwise it is false.
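A sketch of both steps, continuing the low-latency session from the earlier example; the isBaseLayerFrame helper name and its fallback behavior are made up for illustration:

```swift
// Assign half of the input frames to the base layer, half to the enhancement layer.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_BaseLayerFrameRateFraction,
                     value: 0.5 as CFNumber)

// In the output handler, inspect the sample attachments to see which layer a frame is in.
func isBaseLayerFrame(_ sampleBuffer: CMSampleBuffer) -> Bool {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                    createIfNecessary: false)
            as? [[CFString: Any]],
          let isDependedOn = attachments.first?[kCMSampleAttachmentKey_IsDependedOnByOthers] as? Bool
    else { return true }   // no attachment present: treat as base layer in this sketch
    return isDependedOn    // true → base layer, false → enhancement layer
}
```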

We can also set a target bit rate for each layer. Remember that we use the AverageBitRate session property to configure the overall target bit rate. Once the target bit rate is configured, we can set the new BaseLayerBitRateFraction property to control the fraction of the target bit rate allocated to the base layer. If this property is not set, a default value of 0.6 is used. We recommend keeping the base-layer bit-rate fraction between 0.6 and 0.8.
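For example, continuing the same sketch, a 1 Mbps overall target with 60% reserved for the base layer might be set up like this (the numbers are illustrative):

```swift
// Overall target bit rate for both layers combined.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_AverageBitRate,
                     value: 1_000_000 as CFNumber)

// Give the base layer 60% of that target; the rest goes to the enhancement layer.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_BaseLayerBitRateFraction,
                     value: 0.6 as CFNumber)
```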

Maximum frame QP

Now, let us look at the maximum frame quantization parameter or maximum frame QP. Frame QP is used to adjust image quality and data rate.

We can use a low frame QP to generate a high-quality image, but in that case the frame size will be large.

On the other hand, we can use a high frame QP to generate a low-quality but smaller image.

In low-latency mode, the encoder adjusts the frame QP based on factors such as image complexity, input frame rate, and video motion, to produce the best visual quality under the current target bit rate constraint. So we encourage relying on the encoder's default behavior for adjusting frame QP.

However, when a client has specific requirements for video quality, we can cap the frame QP the encoder is allowed to use. The encoder will then always pick a frame QP smaller than this limit, so the client gets fine-grained control over image quality.

It is worth mentioning that even when a maximum frame QP is specified, regular rate control is still in effect. If the encoder hits the maximum frame QP limit but the bit rate budget is exhausted, it will start dropping frames to maintain the target bit rate.

One example of using this feature is transmitting screen-content video over a poor network. We can trade frame rate for clarity and send sharp images of the screen content. Setting the maximum frame QP meets this requirement.

We can pass the maximum frame QP through the new session property MaxAllowedFrameQP. According to the standard, the maximum frame QP must be between 1 and 51.
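As a sketch on the same session, with 21 as an arbitrary illustrative cap:

```swift
// Cap the frame QP, e.g. to keep screen content sharp on a poor network.
// The H.264 standard allows values from 1 to 51.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_MaxAllowedFrameQP,
                     value: 21 as CFNumber)
```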

Long-term reference frame (LTR)

Let's talk about the last feature we developed for low-latency mode: long-term reference (LTR) frames. Long-term reference frames can be used for error resilience. Take a look at this diagram, which shows the encoder, the sender client, and the receiver client in the pipeline.

Suppose the video communication goes over a poorly connected network and frames are lost due to transmission errors. When the receiver client detects frame loss, it can request a frame refresh to reset the session. When the encoder receives such a request, it normally encodes a key frame for the refresh, but key frames are usually quite large. A large key frame takes longer to reach the receiver, and since the network is already in poor condition, a large frame can aggravate the congestion. So, can we use a predicted frame instead of a key frame for the refresh? The answer is yes, as long as we have frame acknowledgement. Let's see how it works.

First, we need to decide which frames to acknowledge. These frames are called long-term reference frames, or LTRs, and the choice is made by the encoder. When the sender client transmits an LTR frame, it also requests acknowledgement from the receiver client. If the LTR frame is received successfully, the receiver returns an acknowledgement. Once the sender client obtains the acknowledgement and passes it to the encoder, the encoder knows which LTR frames the other side has received.

Now look at what happens when the network is bad. By the time the encoder receives a refresh request, it has a set of acknowledged LTRs, so it can encode a frame predicted from one of them. A frame encoded this way is called an LTR-P. An LTR-P is usually much smaller than a key frame, so it is easier to transmit.

Now let's talk about the LTR APIs. Note that frame acknowledgement needs to be handled by the application layer, for example through the RPSI message in the RTP Control Protocol. Here we only focus on how the encoder and the sender client communicate during this process. With low-latency encoding turned on, we can enable this feature with the EnableLTR property.

When an LTR frame is encoded, the encoder signals a unique frame token in the RequireLTRAcknowledgementToken sample attachment.

The sender client reports the acknowledged LTR frames back to the encoder through the AcknowledgedLTRTokens frame property. Since multiple acknowledgements can arrive at once, an array is used to hold the frame tokens.

We can request a frame refresh at any time through the ForceLTRRefresh frame property. Once the encoder receives this request, it encodes an LTR-P. If no acknowledged LTR is available, the encoder generates a key frame instead.
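A sketch of the sender-side LTR flow, assuming the low-latency session from the earlier example and that the prefixed VideoToolbox constants match the property names above; the ltrToken(of:) and ltrFrameProperties(...) helpers are made up for illustration, and the acknowledgement transport (for example RTCP RPSI) is left to the application:

```swift
import Foundation
import VideoToolbox

// 1. Enable LTR frames on the low-latency session.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_EnableLTR,
                     value: kCFBooleanTrue)

// 2. When an encoded frame carries an LTR token, read it from the sample attachments
//    and ship it to the receiver alongside the frame so it can be acknowledged.
func ltrToken(of sampleBuffer: CMSampleBuffer) -> NSNumber? {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                    createIfNecessary: false)
            as? [[CFString: Any]] else { return nil }
    return attachments.first?[kVTSampleAttachmentKey_RequireLTRAcknowledgementToken] as? NSNumber
}

// 3. Feed acknowledged tokens (and, when the receiver asks for a refresh, the refresh
//    flag) back to the encoder as per-frame properties on the next encode call.
func ltrFrameProperties(acknowledgedTokens: [NSNumber], forceRefresh: Bool) -> CFDictionary? {
    var properties: [CFString: Any] = [:]
    if !acknowledgedTokens.isEmpty {
        properties[kVTEncodeFrameOptionKey_AcknowledgedLTRTokens] = acknowledgedTokens
    }
    if forceRefresh {
        properties[kVTEncodeFrameOptionKey_ForceLTRRefresh] = kCFBooleanTrue
    }
    return properties.isEmpty ? nil : properties as CFDictionary
}
// Pass the result as the frameProperties argument of VTCompressionSessionEncodeFrame.
```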

Summary

That is the full translation of Peikang's session at WWDC 2021. If anything in the translation is inaccurate, corrections are welcome.

At present, NetEase Yunxin has implemented software-encoded SVC and a long-term reference frame scheme on the client side, and SVC-based forwarding on the server side. SVC gives the server an additional means of controlling the forwarding bit rate of video streams, working together with high/low simulcast streams, bit rate suppression, and client-side downlink bandwidth estimation and congestion control. In pursuit of the best viewing experience, NetEase Yunxin keeps polishing its products, and we believe the features shared here will be put to good use in Yunxin's products in the near future.

Session recording: https://developer.apple.com/videos/play/wwdc2021/10158

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.

