In order to meet the challenges of increasingly diversified audio and video interactive scenarios, Agora set out to design its own next-generation video processing engine. Along the way, the team accumulated a great deal of experience in engine architecture, performance tuning, and plug-in system design, which it shares here with audio and video enthusiasts and industry practitioners.
On the afternoon of May 26th, at the "Real-time Audio and Video" track of the QCon Global Software Development Conference, Agora architect Li Yaqi gave a talk titled "Agora's Next-Generation Video Processing Engine". This article is organized from the content of that talk.
The talk is divided into three parts:
1. Why we set out to build a next-generation video processing engine, and what its design principles and goals are;
2. How we realized those design principles and goals;
3. How the next-generation video engine has actually performed once deployed.
01 Design goals of the next-generation video processing engine
With the rapid development of audio and video technology, real-time audio and video interaction has been widely adopted in fields such as social entertainment, live streaming, and healthcare. Since the start of the pandemic, more and more scenarios have moved online, and many readers will have watched live streams or accompanied their children through online classes.
So what are the pain points in these two scenarios? In live video streaming, the need to handle multiple video sources is becoming more and more common. In e-commerce live streaming, for example, the host usually shoots from multiple cameras at different angles to achieve a better selling effect. Such a broadcast may also use a director console to combine multiple video sources into a variety of real-time layouts and switch seamlessly among them.
In online education, the traditional approach is to film the teacher with one camera while the teacher shares the screen. To enrich the teaching experience, we can add another camera to capture the teacher's writing on a tablet, and even support the teacher playing local or online courseware.
On top of multiple video sources, there is also a need for real-time editing and compositing. In a live-streaming assistant application, the host may need to combine and edit multiple captured video sources in real time, then add local materials and animated emoticons to enrich the broadcast while reducing upstream bandwidth pressure. In multi-person interactive scenarios, to reduce bandwidth pressure and performance cost on the receiving side, the hosts' video streams need to be composited in the cloud before being sent to each receiver.
With the rapid progress of AI in image processing, advanced video pre-processing functions built on AI algorithms are also increasingly used, such as advanced beautification, background segmentation, and background replacement. Looking at these three scenarios together, we can see that the next-generation video engine must be flexible and extensible.
In addition, as our company's business and team have grown, the number of users of the next-generation video engine has also increased sharply, and different users have different integration needs. Developers with small R&D teams and individual developers need ease of integration: a small amount of code and a fast path to launch. Developers building video into their core business need the engine to expose more of its basic video processing capabilities, so that they can customize the video processing to their business.
To serve these different developer groups, Agora's next-generation video processing engine needs a flexible design that can meet differentiated integration needs.
Beyond flexible design, the live video experience is also a key metric. With the arrival of the 5G era, network infrastructure can now support users' demand for clearer, smoother live streaming, so the next-generation SDK must push performance optimization to the limit and support higher video resolutions and higher frame rates. And as real-time interactive scenarios keep expanding, users are spread ever more widely. To deliver a good live video experience in countries and regions with weak network infrastructure and on low-end devices with limited performance, we need better resistance to weak networks and lower performance and resource consumption.
Combining the richness of scenarios, the diversity of users, and the demands on live video experience described above, we can summarize the design principles and goals of the next-generation video processing engine in four points:
1. Meet different users' differentiated integration needs;
2. Be flexible and extensible, so that new business and new technology scenarios can be supported quickly;
3. Provide a rich, powerful, fast, and reliable core system that reduces developers' mental burden;
4. Deliver superior, monitorable performance: keep optimizing the engine's video processing performance while improving monitoring, forming a closed loop of iterative optimization.
Now for the second part: for each of the four design goals above, let's look at the software design approaches Agora used to achieve them.
02 Architecture design of the next generation video processing engine
For the first design goal, we need to meet the differentiated integration needs of different users. Engine users are naturally layered: some pursue low-code, quick launches, so the engine needs to provide functionality as close to their business as possible; others need us to expose more of the core video capabilities, on top of which they can customize their own video processing services.
Based on this user profile, our architecture adopts a layered service design, split into a High Level and a Low Level. The Low Level models the core functions of video processing, abstracting the video source unit, the pre/post-processing units, the renderer unit, the codec unit, the core infrastructure unit, and so on. By combining these basic modules, we built on top of the Low Level the business-oriented concepts of the video source Track, the network video stream, and the scene, and wrapped them into the High Level API that sits closer to the user's business.
Let's look at the difference between High Level and Low Level through a practical example. Suppose we want to implement a very simple scenario: open the local camera, start a preview, and publish the stream to the remote end.
With the High Level API, this real-time interactive scenario can be built with just two simple calls: first open the local camera and preview it via the StartPreview API, then join the channel and publish the video stream via the JoinChannel API. Users who want more customized behavior in this scenario can use the Low Level API instead: first create the local camera capture pipeline with CreateCameraTrack. This Track provides a variety of interfaces for configuration and state control. At the same time, local media processing and network publishing are decoupled, so the video stream can be published either to Agora's self-developed RTC system or to a CDN.
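To make the contrast concrete, here is a minimal C++ sketch. Only the names StartPreview, JoinChannel, and CreateCameraTrack come from the text above; every other type, signature, and behavior is an illustrative assumption, not the actual SDK interface.

```cpp
// Illustrative sketch only: signatures and types are assumptions built around
// the API names mentioned in the text, not the real Agora SDK interface.
#include <iostream>
#include <memory>
#include <string>

// High Level: two calls cover the whole "preview + publish" scenario.
struct HighLevelEngine {
    void StartPreview() { std::cout << "camera opened, preview started\n"; }
    void JoinChannel(const std::string& ch) { std::cout << "joined " << ch << ", stream published\n"; }
};

// Low Level: the camera track and the publishing target are decoupled,
// so the same local pipeline can feed the RTC network or a CDN.
struct CameraTrack {
    void SetResolution(int w, int h) { std::cout << "capture " << w << "x" << h << "\n"; }
    void Start() { std::cout << "camera track started\n"; }
};

struct LowLevelEngine {
    std::shared_ptr<CameraTrack> CreateCameraTrack() { return std::make_shared<CameraTrack>(); }
    void PublishToRtc(const CameraTrack&, const std::string& ch) { std::cout << "published to RTC channel " << ch << "\n"; }
    void PublishToCdn(const CameraTrack&, const std::string& url) { std::cout << "pushed to CDN at " << url << "\n"; }
};

int main() {
    // Quick launch with the High Level API.
    HighLevelEngine high;
    high.StartPreview();
    high.JoinChannel("classroom-1");

    // Customized control with the Low Level API.
    LowLevelEngine low;
    auto track = low.CreateCameraTrack();
    track->SetResolution(1280, 720);
    track->Start();
    low.PublishToRtc(*track, "classroom-1");
}
```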
As the example shows, to meet users' differentiated needs we adopted a layered design: the High Level provides business-level ease of use, while the Low Level provides core functionality and flexibility.
Now for the second goal: how did we achieve flexibility and extensibility? Before that, a brief introduction to the basics of video processing. Video processing uses video frames as the carrier of video data. Take the local sending flow as an example: after capture, the video data passes through a series of pre-processing units, is sent to the encoder unit for compression, and is finally packaged according to the relevant network protocol and sent to the remote end. The receiving flow is the reverse: the video stream received from the network is depacketized, sent to the decoder, passed through a series of post-processing units, and finally handed to the renderer for display.
We call this serialized processing, with video frames as the data carrier, the video processing pipeline, and each video processing unit a module. Each processing unit can have different implementations: the video source module may be a custom (pushed) video source, a camera capture source, or a screen-sharing source; encoders can be extended by coding standard and by encoder implementation; the network sending node can send to our self-developed RTC network or to a CDN over different protocols. Different video services are really just different arrangements of these basic processing units. We want to open this flexible orchestration up as a basic capability of the video processing engine, so that developers can build pipelines that fit their business through free combinations of APIs.
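The sketch below shows the general idea of a pipeline of modules with frames as the data carrier. All of the types are invented for illustration under that assumption; they are not the engine's real interfaces.

```cpp
// Minimal sketch of "modules chained into a pipeline, video frames as the data
// carrier"; all types here are illustrative assumptions.
#include <cstdint>
#include <memory>
#include <vector>

struct VideoFrame {
    int width = 0, height = 0;
    std::vector<uint8_t> data;  // e.g. I420 pixels
};

// Every processing unit (source, pre-processor, encoder, sender, ...)
// exposes the same data-plane interface, so the pipeline can be arranged freely.
class Module {
public:
    virtual ~Module() = default;
    virtual VideoFrame Process(VideoFrame frame) = 0;
};

class Crop : public Module {
public:
    VideoFrame Process(VideoFrame f) override { /* crop the frame */ return f; }
};

class Beauty : public Module {
public:
    VideoFrame Process(VideoFrame f) override { /* apply beautification */ return f; }
};

// The pipeline simply runs each frame through whatever modules the business
// chose to arrange, in order.
class Pipeline {
public:
    void Append(std::unique_ptr<Module> m) { modules_.push_back(std::move(m)); }
    VideoFrame Run(VideoFrame f) {
        for (auto& m : modules_) f = m->Process(std::move(f));
        return f;
    }
private:
    std::vector<std::unique_ptr<Module>> modules_;
};

int main() {
    Pipeline sendPipeline;
    sendPipeline.Append(std::make_unique<Crop>());
    sendPipeline.Append(std::make_unique<Beauty>());
    VideoFrame captured{1280, 720, {}};
    VideoFrame processed = sendPipeline.Run(captured);  // would then be encoded and sent
    (void)processed;
}
```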
To achieve this, the core of our video processing engine adopts a microkernel architecture, which separates the variable parts of the engine from the invariant ones. As shown in the figure, it is divided into two parts: the Core System in the middle and the pluggable modules on the periphery. The yellow part in the middle is the core system, which corresponds to the invariants of the engine. In the core system we abstracted each basic video processing unit into a module and provided unified control-plane and data-plane interfaces, along with a control interface for assembling and flexibly arranging these basic video modules. The core system also provides a series of infrastructure functions, such as video data format conversion, basic video processing algorithms, a memory pool optimized for the memory patterns of video processing, a threading model, a logging system, a message bus, and so on.
With the core system's underlying capabilities, each module can be extended easily. The video source module, for example, can be a push-mode source, a pull-mode source, or even a special source: in the transcoding flow, the decoded video frames of a remote user can be fed into the local sending pipeline as a new source. The pre-processing and post-processing modules can likewise be extended into various implementations, such as basic cropping and scaling, beautification, and watermarking. The codec module is more complicated: it must support multiple coding standards and multiple implementations of each, such as software and hardware codecs, and codec selection is itself a fairly complex dynamic decision. We therefore built into the codec module an encoder selection strategy that dynamically selects and switches based on capability negotiation, the device model, and real-time encoding quality.
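As a rough illustration of that kind of dynamic encoder selection, here is a hedged sketch. The inputs (capability negotiation result, hardware availability, recent encoding cost) mirror the factors named above, but the decision rules and thresholds are invented for illustration.

```cpp
// Hedged sketch of a dynamic encoder-selection strategy; rules and thresholds
// are illustrative, not the engine's actual policy.
enum class Codec { H264, H265 };
enum class Impl { Software, Hardware };

struct EncoderChoice { Codec codec; Impl impl; };

struct CapabilityReport {
    bool peerSupportsH265 = false;   // result of capability negotiation with the remote side
    bool hwEncoderAvailable = false; // probed from the device model / platform
    double recentEncodeMs = 0.0;     // real-time feedback on per-frame encoding cost
};

// Pick a codec and implementation, and fall back when real-time quality degrades.
EncoderChoice SelectEncoder(const CapabilityReport& cap) {
    EncoderChoice choice{Codec::H264, Impl::Software};
    if (cap.peerSupportsH265) choice.codec = Codec::H265;
    if (cap.hwEncoderAvailable) choice.impl = Impl::Hardware;
    // If hardware encoding is slower than one 30 fps frame interval on this
    // device, switch back to the software implementation.
    if (choice.impl == Impl::Hardware && cap.recentEncodeMs > 33.0)
        choice.impl = Impl::Software;
    return choice;
}

int main() {
    CapabilityReport cap{true, true, 40.0};
    EncoderChoice chosen = SelectEncoder(cap);  // H.265 with software fallback in this case
    (void)chosen;
}
```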
Next, let's look at how video processing pipelines can be flexibly built to match different business combinations, using real scenarios. Back to online education: suppose a complex class needs one camera on the teacher's blackboard writing and another on the teacher's portrait, while the teacher shares courseware through screen sharing or plays local or online media files through a media player. In more advanced scenarios, to get a better live effect, the teacher may enable background segmentation and background replacement so that their portrait is overlaid on the courseware. The teacher can also turn on local recording and save the live class video to disk.
Such combined applications can be realized by constructing the pipeline accordingly. The picture above is a conceptual diagram of the local processing pipeline. For the blackboard camera, the teacher portrait, and the shared courseware just mentioned, we can dynamically replace the capture source module. Background segmentation is a special pre-processing module that analyzes the teacher's portrait in real time and overlays it on the screen-sharing source. Local recording is a special form of renderer module: it packages local video frames into a file format and stores them at a local path. Because media processing is decoupled from network transmission, we can dynamically choose whether to push the stream into our RTC network or to a CDN.
Next, a scenario that combines modules on the receiving pipeline. We have a media processing center in the backend that performs real-time streaming media processing according to the user's business needs, including cloud recording (dumping the received video), cloud content moderation, low-bitrate high-definition processing, composited transcoding services, and so on. There is also the Cloud Player function, which pulls a remote video stream and pushes it into an RTC channel, as well as bypass streaming, which forwards a video stream received in our RTC network to a CDN.
Now let's see how building the receiving pipeline serves these different application scenarios.
First is the network receiving source module, which can receive video streams from the RTN network or from a CDN, switching dynamically between them. After the decoder module, the frames pass through a series of post-processing modules, including the content moderation module just mentioned, the low-bitrate high-definition post-processing module, and so on. The number and position of renderer modules on the receiving side can be customized flexibly; the cloud recording function just mentioned, for example, is really a special renderer module.
So far we have seen how the microkernel architecture achieves the goal of flexible extension: each module's functionality can be extended quickly, and the pipeline can be orchestrated for the business like building blocks. Next, let's look at the goal of being fast and reliable: the core system should provide rich and stable functionality, which greatly reduces developers' mental burden and improves R&D efficiency.
Before going into that, consider a question: if we did not have a stable and reliable core system, what would a developer need to worry about when building a beautification plug-in on our pipeline from scratch?
The first thing is, of course, the beautification business logic itself. Beyond that, when integrating with the pipeline, the developer must consider whether the module can be loaded at the correct position in the pipeline, what effect the upstream pre-processing modules have on it, and what effect it has on downstream functions.
Second is the data format problem. When the format of the data flowing through the pipeline is not what the beautification module needs, the data must be converted, and this conversion logic would also fall on the module developer.
Next comes integration with the pipeline itself. The developer needs to understand the threading model and memory management model of the whole pipeline, and the beautification module must implement the corresponding state-control logic to cooperate with the pipeline's state switching. Within a pipeline, downstream nodes may also give feedback to upstream nodes based on video quality, for example asking them to adjust their throughput, so the module needs a mechanism to receive and handle messages from downstream modules. And if the beautification plug-in wants to send notifications to the user while it runs, a message notification mechanism has to be designed as well.
Because the beautification plug-in integrates with an SDK that already provides these core functions, developing the plug-in becomes much simpler: the plug-in author only needs to implement the relevant interfaces according to the core system's interface contract, and the core system automatically loads it at the correct position from the perspective of global performance optimization, after which users of our SDK can use the plug-in.
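The sketch below shows what such a plug-in contract could look like: the author implements only the agreed interface, while the core system handles placement, format conversion, and state/message routing. All interface and type names are hypothetical.

```cpp
// Sketch of a plug-in contract: the plug-in author only writes the business
// logic; the core system decides placement, converts frame formats, and routes
// state changes and messages. All names here are hypothetical.
#include <cstdint>
#include <vector>

enum class PixelFormat { I420, NV12, RGBA };

struct VideoFrame {
    PixelFormat format = PixelFormat::I420;
    int width = 0, height = 0;
    std::vector<uint8_t> data;
};

// The contract every pluggable video filter fulfils.
class IVideoFilterPlugin {
public:
    virtual ~IVideoFilterPlugin() = default;
    virtual PixelFormat PreferredFormat() const = 0;        // core converts frames for us
    virtual void OnStateChange(bool enabled) = 0;           // pipeline state hook
    virtual VideoFrame ProcessFrame(VideoFrame frame) = 0;  // the actual business logic
};

// The beautification author only writes the beautification logic.
class BeautyPlugin : public IVideoFilterPlugin {
public:
    PixelFormat PreferredFormat() const override { return PixelFormat::RGBA; }
    void OnStateChange(bool enabled) override { enabled_ = enabled; }
    VideoFrame ProcessFrame(VideoFrame frame) override {
        if (enabled_) { /* smoothing, whitening, ... */ }
        return frame;
    }
private:
    bool enabled_ = false;
};

int main() {
    BeautyPlugin beauty;
    beauty.OnStateChange(true);
    VideoFrame frame{PixelFormat::RGBA, 1280, 720, {}};
    frame = beauty.ProcessFrame(std::move(frame));  // the core would have converted the format first
}
```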
To sum up the fast-and-reliable goal: rich and powerful core system functionality greatly reduces the mental burden of module developers and thereby improves R&D efficiency.
Finally, let's look at superior, monitorable performance. First, we optimized the data transfer efficiency of the whole video processing pipeline on mobile, with full-link support for the mobile platform's native data formats. When the capture, rendering, and pre-processing modules use hardware, the entire processing link can achieve zero copy. At the same time, based on each module's processing characteristics, the module's position in the pipeline can be optimized to reduce crossings between CPU and GPU and further improve data transfer efficiency.
In addition, as mentioned earlier, the basic video processing units separate the control plane from the data plane to a certain degree, which has clear benefits: module control gets a timely response. Operating a device such as a camera is a relatively heavy operation, and when users frequently switch between the front and rear cameras, such operations can block the UI and introduce noticeable delay. By separating the control plane from the data plane, we make camera operations respond quickly while still guaranteeing the correctness of the final state, and the control path no longer blocks the data flow. Because the control path does not block the data flow, we can also edit the composited sources and send them in real time.
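One common way to realize this separation is to queue heavy control commands onto a dedicated control thread so the caller returns immediately. The sketch below shows that pattern under that assumption; it is not the engine's actual threading model.

```cpp
// Sketch of control-plane / data-plane separation: heavy device operations
// (e.g. switching cameras) are queued to a control thread so the caller returns
// immediately and the UI is never blocked. Illustrative only.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class ControlPlane {
public:
    ControlPlane() : worker_([this] { Run(); }) {}
    ~ControlPlane() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Returns immediately; the actual camera switch happens on the worker thread.
    void SwitchCamera(bool front) {
        Post([front] { std::cout << "switched to " << (front ? "front" : "rear") << " camera\n"; });
    }
private:
    void Post(std::function<void()> cmd) {
        { std::lock_guard<std::mutex> lk(mu_); commands_.push(std::move(cmd)); }
        cv_.notify_one();
    }
    void Run() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !commands_.empty(); });
                if (stop_ && commands_.empty()) return;
                cmd = std::move(commands_.front());
                commands_.pop();
            }
            cmd();  // only the final state matters to the caller
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> commands_;
    bool stop_ = false;
    std::thread worker_;
};

int main() {
    ControlPlane control;
    control.SwitchCamera(true);   // the UI thread returns immediately
    control.SwitchCamera(false);  // the data plane keeps flowing meanwhile
}
```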
To reduce system resource consumption, we built a memory pool tailored to video data storage formats, supporting inter-frame memory reuse across multiple video formats. The pool is also adjusted dynamically based on system memory usage and pipeline load, reaching a dynamic balance that reduces frequent memory allocation and release and therefore lowers CPU usage.
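Here is a minimal sketch of the buffer-reuse idea, assuming a simple cache of frame buffers; the real pool, as described above, also adapts its size to memory pressure and pipeline load.

```cpp
// Minimal sketch of a frame-buffer pool that reuses allocations between frames
// instead of allocating and freeing per frame. Illustrative only.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

class FrameBufferPool {
public:
    explicit FrameBufferPool(size_t maxCached) : maxCached_(maxCached) {}

    std::shared_ptr<std::vector<uint8_t>> Acquire(size_t bytes) {
        std::lock_guard<std::mutex> lk(mu_);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if ((*it)->capacity() >= bytes) {   // reuse a cached buffer that is large enough
                auto buf = std::move(*it);
                free_.erase(it);
                buf->resize(bytes);
                return buf;
            }
        }
        return std::make_shared<std::vector<uint8_t>>(bytes);  // pool miss: allocate
    }

    void Release(std::shared_ptr<std::vector<uint8_t>> buf) {
        std::lock_guard<std::mutex> lk(mu_);
        if (free_.size() < maxCached_) free_.push_back(std::move(buf));
        // Otherwise let it be freed: a crude stand-in for dynamic shrinking.
    }

private:
    size_t maxCached_;
    std::mutex mu_;
    std::vector<std::shared_ptr<std::vector<uint8_t>>> free_;
};

int main() {
    FrameBufferPool pool(/*maxCached=*/8);
    auto frame = pool.Acquire(1280 * 720 * 3 / 2);  // one I420 720p frame
    // ... fill and process the frame ...
    pool.Release(std::move(frame));                 // buffer is reused for the next frame
}
```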
Finally, to close the feedback loop for performance optimization, we implemented a full-link performance and quality monitoring mechanism. Each basic video processing unit collects and reports the resolution and frame rate of its incoming and outgoing frames, plus module-specific data; at the system level, time-consuming tasks are also monitored and reported. Depending on the problem being investigated, this data can be written on demand into the user's local log, while experience-related data is reported to the online quality monitoring system, enabling quick problem localization and feedback on the effect of performance optimizations.
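The sketch below illustrates per-module statistics of the kind described above (in/out resolution and frame rate, per-frame processing time); the field and type names are hypothetical.

```cpp
// Sketch of per-module quality statistics; names and fields are hypothetical.
#include <iostream>
#include <string>
#include <vector>

struct ModuleStats {
    std::string moduleName;
    int inWidth = 0, inHeight = 0, inFps = 0;
    int outWidth = 0, outHeight = 0, outFps = 0;
    double avgProcessMs = 0.0;   // time-consuming work is tracked as well
};

// Collector that each module reports into; the data could then be written to
// local logs or uploaded to an online quality-monitoring system on demand.
class QualityMonitor {
public:
    void Report(const ModuleStats& s) { stats_.push_back(s); }
    void Dump() const {
        for (const auto& s : stats_)
            std::cout << s.moduleName << ": " << s.inWidth << "x" << s.inHeight << "@" << s.inFps
                      << " -> " << s.outWidth << "x" << s.outHeight << "@" << s.outFps
                      << ", " << s.avgProcessMs << " ms/frame\n";
    }
private:
    std::vector<ModuleStats> stats_;
};

int main() {
    QualityMonitor monitor;
    monitor.Report({"camera_source", 1280, 720, 30, 1280, 720, 30, 1.2});
    monitor.Report({"beauty_filter", 1280, 720, 30, 1280, 720, 30, 4.8});
    monitor.Dump();
}
```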
To sum up superior, monitorable performance: we optimized the mobile data processing link and separated the control plane from the data plane to improve overall video data transfer efficiency; we built a memory pool tailored to video processing to reduce system resource consumption; and we implemented full-link video quality monitoring to close the feedback loop on performance optimization.
Our next-generation video processing engine has now entered the stage of rollout and refinement, and the advantages of the architecture have been verified in practice. Let's look at some actual application cases.
The next-generation video engine is highly flexible and extensible. On top of it, we built a unified front-end and back-end compositing and transcoding framework by combining business modules, and with this framework we can quickly respond to all kinds of compositing requirements. Take online video dating, a typical multi-person interactive real-time scenario. Traditionally, each guest needs to subscribe to the matchmaker's and the other guests' video streams, which puts great pressure on downlink bandwidth and device performance. To solve this, we used the general compositing and transcoding framework to quickly launch a cloud compositing service: the videos of the guests and the matchmaker are composited in the cloud and then pushed to the audience. The layout of the composited picture, the background image, and the display strategy when a guest's video is interrupted can all be customized for the business: for example showing the last frame, the background image, or a placeholder image.
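A hedged sketch of what such a compositing configuration could look like follows: layout regions, a background image, and an interruption-display policy, mirroring the options listed above. All names are illustrative, not the framework's real configuration schema.

```cpp
// Illustrative compositing configuration: regions, background, and a policy
// for what to show when a guest's video is interrupted.
#include <string>
#include <vector>

enum class InterruptPolicy { LastFrame, BackgroundImage, Placeholder };

struct Region {            // where one participant's video lands in the mixed frame
    std::string uid;
    int x = 0, y = 0, width = 0, height = 0;
    int zOrder = 0;
};

struct MixLayout {
    int canvasWidth = 1280, canvasHeight = 720;
    std::string backgroundImageUrl;
    InterruptPolicy onInterrupt = InterruptPolicy::LastFrame;
    std::vector<Region> regions;
};

int main() {
    MixLayout layout;
    layout.backgroundImageUrl = "https://example.com/bg.png";
    layout.onInterrupt = InterruptPolicy::Placeholder;
    layout.regions.push_back({"matchmaker", 0, 0, 640, 720, 0});
    layout.regions.push_back({"guest_1", 640, 0, 640, 360, 0});
    layout.regions.push_back({"guest_2", 640, 360, 640, 360, 0});
}
```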
Applying the same compositing and transcoding framework locally, we implemented local real-time video editing and mixing, which can be used in e-commerce live streaming and other fields: the host can locally combine various sources, such as multiple cameras, multiple screen-sharing streams, a media player, remote users' videos, and different picture materials, and push the composited result in real time.
That's all for today's sharing. Thank you, everyone!