Interactive live streaming, online meetings, online healthcare, and online education are important application scenarios for real-time audio and video technology. These scenarios have stringent requirements for high availability, high reliability, and low latency, and many teams run into a variety of problems while developing audio and video products. For example: fluency, because frequent freezes during a video session make good interaction nearly impossible; echo cancellation, because sound reflected by the environment and re-captured by the microphone also hurts the interactive experience; interconnection between China and overseas, because as more and more products go abroad, domestic-international interoperability is a technical problem that must be solved; and massive concurrency, which poses a great challenge to the stress resistance of audio and video products.
On May 29, at the QCon Beijing Global Software Development Conference, the Real-Time Audio and Video track, produced by Agora VP of Technology Feng Yue, invited technical experts from New Oriental, Banyu English, and Agora to share topics including next-generation video engine architecture, the difficulties and breakthroughs of large-scale self-developed audio and video systems, voice evaluation and localization practice, and the research and practice of front-end audio and video players.
01 Exploring Agora's Next-Generation Video Engine Architecture
With the rapid development of audio and video technology, real-time audio and video interaction has been widely used in many fields (social entertainment, live streaming, healthcare, etc.). At the same time, as AI techniques for image processing mature rapidly, advanced video pre-processing functions that incorporate AI algorithms are used more and more. The variety of scenarios places high demands on the flexibility and extensibility of the next-generation video engine.
Yaqi Li, the architect responsible for the design of Agora's next-generation video engine, opened the track with a talk on "Exploration and Practice of Agora's Next-Generation Video Engine Architecture".
To better serve the richness of video scenarios, user differentiation, and the demands of the live-streaming experience, Agora summarizes the design principles and goals of its next-generation video processing engine in the following four aspects:
1. Meet different users' differentiated integration needs;
2. Be flexible and extensible, so that new business and technology scenarios can be supported and landed quickly;
3. Be fast and reliable: give the engine's core system rich and powerful capabilities and greatly reduce developers' mental burden;
4. Deliver superior, monitorable performance: continuously optimize the performance of the video processing engine while improving monitoring so that quality data is transparent.
To achieve these four design goals, what software design methods did Agora use?
The engine's users are naturally hierarchical. Some want low-code, fast time to launch, so the engine needs to provide APIs that sit as close to the business as possible; others want the engine to expose more of its core video processing capabilities, on top of which they can customize video processing to their own needs. Matching this user profile, Agora adopts a layered design that separates service composition from core functions: the High Level API provides ease of use for common services, while the Low Level API provides core capabilities and flexibility.

To hand the engine's orchestration power to developers, letting them compose APIs freely to arrange pipelines for different business needs, the core of Agora's video processing engine adopts the Microkernel Architecture pattern, which separates the variable parts of the engine from the invariant parts. The microkernel model achieves the goal of flexibility and extensibility: each module's functions can be extended quickly, and the video processing pipeline can be assembled like building blocks for flexible business orchestration.
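As an illustration of what such layered, composable APIs might look like, here is a minimal TypeScript sketch. All names (`VideoPipeline`, `VideoNode`, `createDefaultPipeline`) are hypothetical, invented only to convey the building-block style of orchestration described above; they are not Agora's actual API.

```typescript
// Hypothetical Low Level API: the pipeline is assembled from
// pluggable processing nodes, mirroring the "building blocks" idea.
interface VideoFrame {
  width: number;
  height: number;
  data: Uint8Array; // e.g. I420 pixel data
}

interface VideoNode {
  name: string;
  process(frame: VideoFrame): VideoFrame;
}

class VideoPipeline {
  private nodes: VideoNode[] = [];

  // Orchestration: business code freely decides the node order.
  add(node: VideoNode): this {
    this.nodes.push(node);
    return this;
  }

  run(frame: VideoFrame): VideoFrame {
    return this.nodes.reduce((f, n) => n.process(f), frame);
  }
}

// A High Level API could then be a thin preset over the same core:
function createDefaultPipeline(): VideoPipeline {
  return new VideoPipeline()
    .add({ name: "denoise", process: (f) => f })        // placeholder stage
    .add({ name: "encoderAdapter", process: (f) => f }); // placeholder stage
}
```

The point of the sketch is the shape, not the content: the Low Level API exposes the raw pipeline, while the High Level API is just a pre-composed arrangement of the same nodes.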
A stable and reliable core system is just as important. Without one, a developer who wants to build, say, a beauty filter plug-in for the video processing pipeline from scratch has to consider many issues beyond his own business logic: the module's position in the pipeline, data format conversion, the threading model, memory management, property configuration, and so on. Agora solidifies the solutions to this series of engineering and integration issues into the underlying core system, giving users rich and powerful basic capabilities. This video engine core system includes basic video processing units, pipeline construction and control, support for basic video formats and algorithms, and system infrastructure. With it, integration becomes very simple: a plug-in only needs to implement the relevant wrapper interfaces according to the core system's interface protocol. The rich, powerful core system greatly reduces module developers' mental burden and thereby improves overall R&D efficiency.
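To make the plug-in contract concrete, here is a hypothetical sketch of how a beauty filter might hook into such a core system. The interface names (`IVideoFilterPlugin`, `BeautyFilter`) are invented for illustration; in the design described above, threading, format conversion, and memory management would live behind this boundary, inside the core system.

```typescript
// Hypothetical plug-in protocol: the developer implements only the
// wrapper interface; pipeline placement, threads, format conversion
// and memory are owned by the core system.
interface VideoFrame {
  width: number;
  height: number;
  data: Uint8Array;
}

interface IVideoFilterPlugin {
  onAttach(): void;                       // called when inserted into the pipeline
  onFrame(frame: VideoFrame): VideoFrame; // pure business logic per frame
  onDetach(): void;                       // called when removed
}

class BeautyFilter implements IVideoFilterPlugin {
  onAttach(): void {
    // allocate lookup tables, load models, etc.
  }
  onFrame(frame: VideoFrame): VideoFrame {
    // smoothing/whitening would go here; identity pass for the sketch
    return frame;
  }
  onDetach(): void {
    // release resources
  }
}
```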
For superior, monitorable performance, Agora optimized the on-device data processing link, separated the control plane from the data plane, and improved overall data and video transfer efficiency. In addition, it built a memory pool tailored to the characteristics of video processing to reduce system resource consumption. Finally, it implemented full-link video quality monitoring, so that video performance optimization gets closed-loop feedback.
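The talk did not detail the pool's internals, but in spirit a video-frame memory pool looks like this minimal sketch (hypothetical names; a production pool would also be size-bucketed and thread-aware):

```typescript
// Minimal frame-buffer pool: reuse byte buffers of a fixed size
// instead of reallocating per frame, cutting allocation/GC pressure.
class FrameBufferPool {
  private free: Uint8Array[] = [];

  constructor(private readonly frameBytes: number,
              private readonly maxPooled = 16) {}

  acquire(): Uint8Array {
    return this.free.pop() ?? new Uint8Array(this.frameBytes);
  }

  release(buf: Uint8Array): void {
    if (buf.byteLength === this.frameBytes && this.free.length < this.maxPooled) {
      this.free.push(buf); // recycle; otherwise let it be garbage-collected
    }
  }
}

// Usage: one pool per resolution, e.g. 1280x720 I420 (~1.5 bytes/pixel).
const pool = new FrameBufferPool(1280 * 720 * 1.5);
const frame = pool.acquire();
// ...fill and process the frame...
pool.release(frame);
```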
02 Difficulties and Challenges of Self-Developed Large-Scale Real-Time Audio and Video Systems
As a veteran with deep experience in the RTC field, Dong Haibing, an industry architect from Agora, introduced the basic concepts of RTC at the conference, analyzed in detail RTC's scenario characteristics and the architectural design and difficulties of building it in-house, and finally shared his views on the future direction of RTC.
Compared with the traditional Internet's mature solutions for large scale and high concurrency (caching, asynchronous processing, and distribution), the challenges in real-time audio and video are more complicated. To count as "real-time", end-to-end delay must be controlled within one second; a cache, by contrast, works at the second or minute level, and millisecond-level requirements rarely arise. In large-scale, high-concurrency scenarios, real-time audio and video (RTC) must consider audio and video quality, fluency, low latency, scalability, and availability all at once. This is very different from the traditional Internet, and it also means the solutions are more complicated.
During development, common challenges include development cost, network construction, quality monitoring, audio processing, and final testing. In his talk, Dong Haibing used self-developed audio as an example. First, audio transmission has three key problems to solve: no sound or low volume, echo, and noise. Second, resilience to weak networks is also very important: when the network changes, how should bit rate and frame rate be adjusted to absorb the change, and how should an intelligent routing algorithm select and transmit along the optimal path? Another challenge is multi-dimensional quality assessment, which must run in real time and form a closed loop with dynamic adjustment; that is the best way to make weak-network countermeasures effective. On the difficulties of using open-source servers, Dong Haibing also discussed several common options (Jitsi/Jitsi Videobridge, Kurento, Licode/Erizo, Pion, Janus).
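As a hedged illustration of the bit-rate/frame-rate adaptation idea (a toy loss-driven controller, not the algorithm from the talk):

```typescript
// Toy weak-network adaptation: step encoder bitrate/framerate down
// when packet loss rises, and probe upward gently when the path is clean.
interface NetworkStats { lossRate: number; rttMs: number; }
interface EncoderConfig { bitrateKbps: number; fps: number; }

function adapt(cfg: EncoderConfig, stats: NetworkStats): EncoderConfig {
  if (stats.lossRate > 0.10) {
    // heavy loss: cut bitrate hard and drop framerate
    return {
      bitrateKbps: Math.max(200, Math.floor(cfg.bitrateKbps * 0.7)),
      fps: Math.max(10, cfg.fps - 5),
    };
  }
  if (stats.lossRate < 0.02 && stats.rttMs < 150) {
    // clean network: recover slowly to avoid oscillation
    return {
      bitrateKbps: Math.min(2000, Math.floor(cfg.bitrateKbps * 1.05)),
      fps: Math.min(30, cfg.fps + 1),
    };
  }
  return cfg; // otherwise hold steady
}
```

Running such a controller on real-time stats is exactly the "closed loop with dynamic adjustment" the talk describes; a real system would also fold in jitter, bandwidth estimates, and FEC/retransmission decisions.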
Beyond server development, real-time audio and video operations and quality monitoring also differ somewhat from traditional Internet practice. For example, in addition to the usual disaster recovery planning, containerized deployment, automated operations, performance analysis, and log systems, RTC operations must also face challenges such as a global network (cross-region, cross-carrier) and last-mile strategy.
Teams that choose to build in-house may also face problems such as large-scale on-mic interaction, RTC recording/playback solutions, and operating cost control. But even with so many difficulties and challenges to face and solve, it cannot be ignored that real-time audio and video technology is being applied in more and more scenarios and opening up more and more possibilities.
The Metaverse is a hot concept at the moment. It can be understood as a kind of role transformation: the virtual world offers a brand-new experience and lets users switch among multiple virtual-world roles. VRChat is similar, using VR for social networking and entertainment to help people interact better online. This may well be the Internet's future direction of development and exploration. Dong Haibing noted that a self-development team should not work in isolation; it must keep pace with the times and with industry trends, and invest its effort in its core business and areas of expertise as much as possible, so that together the industry can do better in real-time audio and video.
03 New Oriental Cloud Classroom's Web Audio and Video Player Practice
Online education is probably the real-time audio and video scenario people have become most familiar with over the past two years. For this track we invited Li Bianru, front-end interaction architect of New Oriental Cloud Classroom, to share New Oriental's best practices for migrating quickly from offline to online.
New Oriental started building its own cloud classroom at the end of 2018. During the 2020 Chinese New Year, within a single week, it leapt from supporting 10,000 concurrent users to supporting 300,000.
New Oriental Cloud Classroom is a complete online-class solution delivered as SaaS. One of its notable features is a very fast update-iteration pace. Native development on each end (Windows on PC, Android and iOS on mobile) could never keep up with that rhythm, so the strategy was to embed H5 pages in the client: apart from real-time audio and video, interactive features are basically implemented in H5. One set of web code adapts to every platform, which is the fastest development mode.
Real-time audio and video (RTC) latency is on the order of a hundred milliseconds and at most no more than 500 ms, which is basically imperceptible to the human ear. Online education has two different scenarios: small classes and large classes. Small classes have higher requirements for low-latency real-time interaction; but for university courses and lectures, or large-class scenes where famous teachers lecture to the public, using RTC would be comparatively costly. For large classes, New Oriental Cloud Classroom adopts an H5 super-large-class approach that supports millions of students in class at the same time: the teacher pushes the stream via RTMP, while students still pull the stream over HTTP.
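The talk does not name the exact pull protocol, but HTTP-FLV via the open-source flv.js player is a common way to implement "HTTP pull" in the browser. A minimal sketch (the stream URL and element id are placeholders, not New Oriental's real endpoints):

```typescript
import flvjs from "flv.js";

// Minimal HTTP-FLV live playback in the browser with flv.js.
const video = document.getElementById("player") as HTMLVideoElement;

if (flvjs.isSupported()) {
  const player = flvjs.createPlayer({
    type: "flv",
    isLive: true,
    url: "https://example.com/live/classroom.flv", // placeholder URL
  });
  player.attachMediaElement(video); // bind to the <video> element
  player.load();
  player.play();
}
```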
(Figure: Web live-streaming player architecture diagram)
As for future extensibility, if the cloud classroom adopts the H.265 video coding standard, the compressed bit rate will be roughly half that of H.264, greatly reducing network pressure. H5 has the advantages of broad applicability and cross-platform support: the same solution adapts to different clients, so one product can be developed quickly and brought online quickly. The self-developed universal player can switch input source streams and can be customized or extended rapidly.
04 Voice Evaluation and Localization
To provide better education services, online-education platforms have also shipped many new features built on deep learning in the past two years. Voice evaluation is one of them: in English education in particular, there is huge demand for scoring children's spoken English. How can evaluation latency be reduced and the evaluation experience improved, while also cutting server load and cost? Huang Zhichao, head of AI algorithms on Banyu's technology middle platform, shared "Voice Evaluation and Localization".
Voice evaluation uses machines instead of humans to score children's spoken language intelligently. Banyu's voice-evaluation practice mainly covers algorithm and framework selection, acoustic model training, and accuracy and speed optimization. For the algorithm, Banyu chose deep neural networks plus hidden Markov models (DNN + HMM), mainly because this deep-learning approach is very mature today. For the framework it chose Kaldi, which has the largest user base in the speech industry and comprehensive documentation.
The evaluation process of the DNN + HMM algorithm is shown in the figure above. First, a DNN acoustic model is trained together with the HMM topology parameters. At evaluation time, the input text is compiled into a decoding graph, features are extracted from the audio and passed through the acoustic model, and then a scoring model produces the sentence score.
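The talk does not spell out the scoring formula, but a standard pronunciation score in DNN + HMM systems is the Goodness of Pronunciation (GOP), sketched below under the assumption that frame posteriors come from the DNN and phone boundaries from HMM forced alignment:

```latex
% GOP for phone p over its aligned frames t_s..t_e,
% using DNN posteriors P(p | o_t) from forced alignment:
\mathrm{GOP}(p) \;=\; \frac{1}{t_e - t_s + 1} \sum_{t = t_s}^{t_e} \log P\big(p \mid o_t\big)
```

A sentence score is then typically an aggregation of such phone- and word-level scores.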
In this process, data screening, acoustic model training, and optimization of evaluation accuracy are all key. Later in the talk, Huang Zhichao also shared in detail Banyu's experience with model size optimization, the robustness of the evaluation service, and how difficult abnormal cases were analyzed and solved during the localization of voice evaluation.
To help everyone understand the "front stage and backstage" of real-time audio and video development more conveniently and deeply, we will publish more detailed write-ups of all the talks in this track. For details, please click [Read original text] and follow the latest news from the community!