In the media industry, video has become an increasingly important carrier of information. To suit different playback scenarios, a video's media stream attributes, such as container format, encoding format, resolution, and bit rate, usually need to be modified. These processing steps are collectively referred to as video transcoding.
NetEase Yunxin integrates NetEase's 21 years of IM and audio/video technology to provide industry-leading integrated communications cloud services. Yunxin's video-on-demand service, built on distributed processing clusters and large-scale distribution resources, meets the playback needs of all terminal devices and provides enterprise users with fast, stable cloud functions such as video upload, storage, transcoding, playback, and download. Yunxin's distributed task processing system carries the media processing capacity. Its main functions include audio/video transcoding, transmuxing (container conversion), merging, screenshots, encryption, and adding or removing watermarks, as well as a series of pre-processing functions such as image scaling, image enhancement, and volume normalization.
Video transcoding is the core function of media processing. Transcoding large video files usually takes a long time, so to improve quality of service we focus on increasing transcoding speed. This article centers on sliced transcoding and introduces NetEase Yunxin's work on transcoding acceleration and the improvements achieved.
Transcoding performance: influencing factors and optimizations
The common video transcoding process is generally as shown in the figure below:
In the transcoding pipeline, the bottleneck lies mainly in the video stream, so our discussion of speed improvement focuses on video stream processing; audio stream processing is left out of scope for now. For the video stream, we consider the following aspects:
- Source video: generally, the longer the source video, the longer decoding and encoding take.
- Encapsulation, encoding, and decoding: processing that needs no re-encoding, such as container conversion or keyframe-aligned clipping, requires little computation and generally takes 1 to 2 seconds. If re-encoding is required, the time depends on the source video and the output parameters: encoding algorithms differ in compression rate and computational complexity (for example, AV1 encoding takes longer than H.264 encoding), and the larger the target bit rate, resolution, and frame rate, the more computation is usually needed.
- Horizontal and vertical scaling of computing resources: the stronger the single-core computing power of a general-purpose processor, the less time transcoding takes. Using GPUs, which are better suited to image computation, also helps reduce transcoding time, as does increasing the concurrency of the transcoding execution stream. Concurrency here can be multi-threaded or multi-process: multi-threading means increasing thread parallelism within a single process, while multi-process means that, after the file is sliced, multiple processes compute the slice files in parallel.
- Cluster task scheduling: multi-tenant cloud service systems usually use a scheduling algorithm that balances the priority of resource allocation across tenants with the priority of transcoding tasks within each tenant. Scheduling efficiency is mainly reflected in how to schedule more tasks in less time, how to achieve high throughput with fewer cluster resources, and how to trade off priority against starvation.
Given the above factors, we propose the following optimization directions: improving hardware capabilities, optimizing encoding, sliced transcoding, and improving cluster scheduling efficiency.
Dedicated hardware acceleration
Multimedia processing is a typical compute-intensive scenario, so optimizing the overall computing performance of multimedia applications is crucial. The CPU is a general-purpose computing resource, and offloading video and image operations to dedicated hardware is a common solution. Chip manufacturers such as Intel, NVIDIA, AMD, ARM, and TI offer multimedia hardware acceleration solutions that improve encoding efficiency for high-bitrate, high-resolution video scenarios.
Our transcoding system is mainly based on the FFmpeg multimedia processing framework. The vendor solutions supported on Linux include Intel's VA-API (Video Acceleration API) and NVIDIA's VDPAU (Video Decode and Presentation API for Unix), as well as the more proprietary Intel Quick Sync Video and NVENC/NVDEC acceleration schemes. At present we mainly use the video acceleration capabilities of Intel HD Graphics, combined with the FFmpeg community's QSV and VA-API plugins, to implement hardware acceleration for the AVDecoder, AVFilter, and AVEncoder modules. These hardware acceleration technologies are still being optimized by the relevant manufacturers and communities, and we will cover our further practice in this area in follow-up articles.
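As a concrete illustration, the sketch below drives a QSV-accelerated transcode through the FFmpeg command line from Python. It is a minimal sketch, not our production pipeline: the file names and bitrate target are hypothetical, and it assumes an FFmpeg build with Quick Sync Video support running on a machine with an Intel GPU.

```python
import subprocess

def transcode_qsv(src: str, dst: str, bitrate: str = "2M") -> None:
    """Decode and encode H.264 on Intel Quick Sync Video hardware.
    Assumes FFmpeg was built with QSV support and an Intel GPU is
    present; error handling and software fallback are omitted."""
    cmd = [
        "ffmpeg", "-y",
        "-hwaccel", "qsv",       # hardware-accelerated decode path
        "-c:v", "h264_qsv",      # QSV H.264 decoder
        "-i", src,
        "-c:v", "h264_qsv",      # QSV H.264 encoder
        "-b:v", bitrate,         # target video bitrate
        "-c:a", "copy",          # leave the audio stream untouched
        dst,
    ]
    subprocess.run(cmd, check=True)

transcode_qsv("input.mp4", "output_qsv.mp4")
```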
AMD many-core servers
This mainly refers to servers equipped with AMD EPYC series processors. Compared with our previous online servers, they offer stronger single-core computing power and better parallel computing capability. The single-core improvement brings a general speedup across decoding, pre-processing, and encoding, while the very high core count strengthens the single-machine multi-process scenario in our sliced transcoding solution, largely avoiding cross-machine IO for media data.
Self-developed Codec
NE264/NE265 are video encoders independently developed by NetEase Yunxin and used in Yunxin's NERTC and live/on-demand services. Beyond encoding performance, NE264's key technical advantage is low bandwidth with high image quality, which suits high-bitrate, high-definition live scenarios (such as game streaming, online concerts, and product launches): with subjectively unchanged image quality, it saves 20%~30% of bit rate on average. We will not expand on it here; interested readers can follow the NetEase Smart Enterprise Technology+ WeChat official account.
Sliced transcoding
If the performance optimizations above are vertical, the sliced transcoding described in this section is horizontal. A video stream is essentially a series of images, divided into a series of GOPs with IDR frames as boundaries, and each GOP is an independently decodable set of images. This characteristic of video files means we can borrow the algorithmic idea of MapReduce: split the video file into multiple slices, transcode the slices in parallel, and finally merge them into a complete file.
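To make the splitting step concrete, here is a minimal sketch using FFmpeg's segment muxer; the segment length and file names are our own illustrative choices. With stream copy the muxer can only cut at keyframes, which is exactly the GOP/IDR boundary property described above.

```python
import subprocess

def split_by_gop(src: str, seg_seconds: int = 60) -> None:
    """Split a video into independently decodable slices without
    re-encoding. Stream copy forces cuts onto keyframes, so every
    slice starts at an IDR frame and can be transcoded on its own."""
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-c", "copy",                      # no re-encode, split only
        "-f", "segment",                   # FFmpeg segment muxer
        "-segment_time", str(seg_seconds), # target slice length (s)
        "-reset_timestamps", "1",          # each slice starts at t=0
        "slice_%03d.mp4",
    ]
    subprocess.run(cmd, check=True)

split_by_gop("source.mp4")
```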
Task scheduling
In addition to optimizing a single transcoding execution stream, we also need to improve the overall scheduling efficiency of cluster resources. In the scheduling algorithm, the scheduling node must not only receive task submissions but also complete the key process of task delivery. The delivery algorithm needs to balance, as far as possible, multi-tenant allocation, task priority preemption, and the utilization of all parties' resources.
We designed two mechanisms for task delivery:
- The Master node pushes tasks to compute nodes
- Compute nodes actively pull tasks from the Master node
The former offers better real-time behavior, but the Master's view of a compute node's resources is a snapshot, and in some cases stale snapshot information will overload some nodes. The latter lets nodes take tasks on demand, so no node is overloaded and task selection is easier to program; its disadvantage is that the Master has weaker real-time control over global resource allocation.
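A minimal sketch of the pull model follows. The HTTP endpoint, payloads, and field names are hypothetical, not Yunxin's actual protocol; the point is only that the node asks for work when it has free capacity.

```python
import time
import requests  # third-party HTTP client, assumed available

MASTER = "http://master:8080"  # hypothetical Master address

def execute(task: dict) -> None:
    ...  # run the transcode described by `task`, then report status

def worker_loop(capacity: dict) -> None:
    """Pull model: the compute node requests a task only when it has
    spare capacity, so stale resource snapshots on the Master (the
    push model's weakness) can never overload the node."""
    while True:
        resp = requests.post(f"{MASTER}/fetch_task", json=capacity, timeout=5)
        if resp.status_code == 204:   # no eligible task right now
            time.sleep(1)
            continue
        execute(resp.json())
```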
Sliced transcoding in practice
Media processing flow
The simplified media processing flow is shown in the figure below. It is mainly divided into four steps: container conversion (when needed), video splitting, parallel transcoding, and video merging.
Slicing process
When cluster resources are sufficient, that is, when task scheduling and distribution produce no backlog or resource preemption, the processing of the video stream itself generally consumes 80%-90% of the entire task cycle, so optimizing this stage yields the most cost-effective benefits.
Improving hardware capabilities and optimizing encoding both target the computational efficiency of a single transcoding process, but the resources a single process can call upon are limited, and the speedup for large video files is correspondingly limited. We therefore discuss how to use the distributed MapReduce idea to shorten the time consumed by a single transcoding task. The following sections describe the technical solution for sliced transcoding in detail.
The basic sliced transcoding process is shown in the figure above. We first introduce the following concepts:
- Parent task: similar to a Job in Hadoop; the transcoding job submitted by the client, whose video is to be split into multiple small slices;
- Subtask: similar to a Task in Hadoop; the small slices are packaged into subtasks that can be scheduled and executed independently;
- Parent worker: the compute node that executes the parent task;
- Sub worker: a compute node that executes subtasks.
The main flow of sliced transcoding:
- The Dispatch center dispatches a transcoding job to worker0, and worker0 decides whether to perform sliced transcoding according to the master switch, the job configuration, the video file size, and other strategies;
- If sliced transcoding is chosen, the video is split into n slices;
- The slices are packaged into n transcoding subtasks and submitted to the Dispatch center;
- The Dispatch center dispatches these n subtasks to n workers that meet the requirements;
- After worker1~n finish transcoding, they send callbacks to worker0;
- worker0 downloads the transcoded slice videos from the n workers and, once all slices are transcoded, merges them together (a merge sketch follows this list);
- A callback is sent to the client.
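The final merge step can be done with FFmpeg's concat demuxer; the sketch below is illustrative (slice names are hypothetical). Stream copy suffices because every slice was transcoded with identical codec parameters.

```python
import subprocess

def merge_slices(slices: list[str], dst: str) -> None:
    """Concatenate the transcoded slices into one file. Stream copy
    works because all slices share codec, resolution and timebase."""
    with open("concat.txt", "w") as f:
        for path in slices:
            f.write(f"file '{path}'\n")   # concat demuxer list format
    cmd = [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",     # allow arbitrary paths
        "-i", "concat.txt",
        "-c", "copy",                     # no re-encode on merge
        dst,
    ]
    subprocess.run(cmd, check=True)

merge_slices([f"slice_{i:03d}.mp4" for i in range(4)], "merged.mp4")
```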
Subtask scheduling
In the scheduling system, each user's task queue is independent, and task quotas are set per user. When the Dispatch center receives a fetch-job request from a compute node, the scheduling thread first selects, from the user queues, the user with the smallest used-quota ratio (a simple model: number of scheduled tasks / total user quota), then takes from the head of that queue a subtask the compute node can run, and returns it. Subtask scheduling differs from ordinary task scheduling in scheduling priority and node selection and needs separate design; we briefly introduce both below.
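The fairness rule above can be sketched as follows; the data shapes are ours, not the production schema, and the global high-priority queue for slice subtasks (described next) is checked first.

```python
def pick_user_queue(queues: dict) -> str | None:
    """Pick the tenant with the smallest used-quota ratio.
    `queues` maps user -> {"running": int, "quota": int, "tasks": list}."""
    candidates = {u: q for u, q in queues.items() if q["tasks"]}
    if not candidates:
        return None
    return min(candidates,
               key=lambda u: candidates[u]["running"] / candidates[u]["quota"])

def next_task(high_priority: list, queues: dict) -> dict | None:
    """Slice subtasks live in a global high-priority queue and are
    served first; otherwise fall back to the per-tenant fairness rule."""
    if high_priority:
        return high_priority.pop(0)
    user = pick_user_queue(queues)
    return queues[user]["tasks"].pop(0) if user else None
```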
- Subtask scheduling priority
Slice subtasks do not need to be re-queued in their respective user queues; the goal of subtask scheduling is to get them scheduled as soon as possible. The parent task has already been scheduled, and the whole point of slicing is to speed up its execution. If its subtasks had to compete again with other not-yet-scheduled tasks, that would be unfair to the task and would weaken the acceleration. Therefore, slice subtasks are placed in a global high-priority queue and are selected for scheduling first.
- Subtask scheduling node selection
The main factors affecting subtask node selection are the following (see the sketch after this list):
1. Machine type
Machines are divided into hardware transcoding machines and ordinary transcoding machines. Because the encoders used in the two environments differ, the merged video could show artifacts, so we schedule subtasks onto the same machine type as the parent task.
2. Code version
Slices produced by different code versions may not merge cleanly, so after such a version iteration, the code version on each compute worker determines which other compute nodes a subtask can be scheduled to.
3. Data storage
When the parent worker runs many concurrent tasks, multiple upload/download transfers run simultaneously, which increases the IO time of the slice files. We therefore prefer to run subtasks on the parent worker itself, saving network IO and upload/download time.
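A sketch of the three selection rules as a node filter; all field names are illustrative.

```python
def eligible_nodes(subtask: dict, nodes: list[dict]) -> list[dict]:
    """Filter and order candidate nodes for a slice subtask:
    same machine type as the parent (different encoders may yield
    slices that do not merge cleanly), same code version, and the
    parent worker itself first to avoid network IO."""
    ok = [n for n in nodes
          if n["machine_type"] == subtask["parent_machine_type"]
          and n["code_version"] == subtask["parent_code_version"]]
    # data locality: sort the parent worker to the front if it qualifies
    ok.sort(key=lambda n: n["id"] != subtask["parent_worker_id"])
    return ok
```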
Stragglers
In the sliced transcoding scenario, the straggler problem means that, among multiple subtasks, most have finished but a few remain unfinished, so the parent worker cannot proceed to the next stage for a long time and the task is blocked. This is a fairly common phenomenon in distributed systems, and research papers in the systems field targeting it are not uncommon.
How this problem is solved greatly affects system efficiency. If the parent worker simply waits for the child tasks, the task may wait too long, which defeats our original goal of acceleration. Based on the principle that the task must complete within a bounded time, we have the following optimization directions:
1. Redundant scheduling
This solution follows MapReduce's approach to stragglers in Hadoop: when the timeout criterion is reached and a subtask is still unfinished, the parent worker submits a new subtask for the same slice file to the Dispatch center, letting it be rescheduled and re-executed. When either of the two subtasks completes, the other is cancelled.
This approach trades space for time: instead of pinning hope on a single node, it adopts a horse-racing mechanism. However, when such situations occur frequently, a large amount of task redundancy is generated, and there is no guarantee that the newly started subtask will not also stall.
2. Parent worker takeover
To address the shortcomings of redundant scheduling, we optimized it: when the timeout criterion is reached and a subtask is still unfinished, the parent worker selects the subtask with the least progress and transcodes that slice itself. As before, when one of the two copies completes, the other redundant copy is cancelled. If unfinished subtasks remain, the parent worker keeps selecting and transcoding them itself until all subtasks are complete.
The difference from the first scheme is that redundant tasks are not rescheduled to other workers; the parent worker executes them redundantly by preference. This way the parent worker keeps transcoding slices until the entire job is fully complete. The biggest advantage is that, without unbounded resource consumption, the parent worker is guaranteed not to wait forever. Only in a few cases, when the parent worker's load is high, are other workers with idle resources considered.
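The takeover rule can be sketched like this; the timeout criterion and task fields are simplified placeholders, not our production data model.

```python
import time

def pick_straggler(subtasks: list[dict], timeout_s: float) -> dict | None:
    """After the timeout criterion fires, return the unfinished
    subtask with the least progress so the parent worker can
    transcode that slice itself; the losing copy is cancelled."""
    now = time.time()
    late = [t for t in subtasks
            if not t["done"] and now - t["started_at"] > timeout_s]
    if not late:
        return None
    return min(late, key=lambda t: t["progress"])  # slowest slice first
```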
Subtask progress tracking
When the parent worker selects a subtask to take over, it needs to collect the progress of all subtasks and pick the one with the slowest progress for redundant execution. For progress calculation, we divide a transcode into four stages: waiting for scheduling, download and preparation, transcoding execution, and upload and finishing.
Entering each stage corresponds to reaching a given progress:
Waiting for scheduling 0% → Download and preparation 20% → Transcoding execution 30% → Upload and finishing 90% → End 100%
The transcoding execution stage spans the largest share of the progress (the 30%→90% window) and is the stage whose speed cannot be guaranteed, so it needs finer-grained progress calculation. The transcoding execution stream periodically outputs metric logs, from which we compute: current transcoding progress = transcoded duration (the time field) / total duration to be transcoded.
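The stage mapping and the in-stage formula combine as in the sketch below; the stage names are ours, and the 30%→90% window for the transcode stage follows the chain above.

```python
# Progress reached when each stage begins (from the chain above).
STAGE_BASE = {"waiting": 0.00, "download": 0.20,
              "transcode": 0.30, "upload": 0.90, "done": 1.00}

def subtask_progress(stage: str,
                     transcoded_s: float = 0.0,
                     total_s: float = 1.0) -> float:
    """Overall progress of one subtask. Within the transcode stage,
    the `time` field from the periodic metric logs divided by the
    slice duration is scaled into the 30%-90% window."""
    base = STAGE_BASE[stage]
    if stage != "transcode":
        return base
    return base + 0.60 * min(transcoded_s / max(total_s, 1e-9), 1.0)
```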
HLS/DASH packaging
The HLS format differs from other container formats in that it produces multiple ts files plus m3u8 playlists, and slicing-and-transcoding an HLS video increases the complexity of transmitting and managing the slice videos. Our solution is to first transcode the source video's slices to mp4, merge them on the parent worker, and then convert the merged video to HLS encapsulation.
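Since the merged file is already fully transcoded, the final HLS step is pure repackaging; here is a hedged sketch with FFmpeg's hls muxer (segment length and names are illustrative).

```python
import subprocess

def mp4_to_hls(src: str, playlist: str = "index.m3u8") -> None:
    """Repackage the merged MP4 into HLS (m3u8 playlist + ts
    segments) with stream copy; no re-encoding at this point."""
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-c", "copy",              # repackage only
        "-f", "hls",
        "-hls_time", "10",         # target segment length, seconds
        "-hls_list_size", "0",     # keep every segment in the playlist
        playlist,
    ]
    subprocess.run(cmd, check=True)

mp4_to_hls("merged.mp4")
```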
Test data
By recording and comparing the speed of converting the same video to different resolutions, we found that each individual optimization improves transcoding speed to a varying degree. In real online scenarios, we usually decide which optimizations to combine based on user settings and video attributes.
Test video 1 attributes:
Duration: 00:47:19.21, bitrate: 6087 kb/s
Stream #0:0: Video: h264 (High), yuv420p, 2160x1216, 5951 kb/s, 30 fps
Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp, 127 kb/s
Test video 2 attributes:
Duration: 02:00:00.86, bitrate: 4388 kb/s
Stream #0:0: Video: h264 (High), yuvj420p, 1920x1080, 4257 kb/s, 25 fps
Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp, 125 kb/s
Conclusion
That concludes this article. The NetEase Yunxin transcoding team improves video transcoding speed mainly along the dimensions of scheduling optimization, hardware capabilities, self-developed encoders, and sliced transcoding, and test results show a significant speedup. We also introduced the main design of the sliced transcoding module in the Yunxin transcoding system. We will keep exploring ways to increase speed and cover more scenarios, and in subsequent articles we will share more on cluster resource scheduling algorithms, hardware transcoding practice, and other topics. We welcome your continued attention.
Author introduction
Luo Weiheng, senior server development engineer at NetEase Yunxin, graduated from the School of Computer Science of Wuhan University. He is a core member of the NetEase Yunxin transcoding team, is currently responsible for the design and development of Yunxin's media task processing system, and is committed to improving the quality of Yunxin's video transcoding services.