According to research, roughly 13 minutes of an average 38-minute video conference are spent dealing with distractions. Research also shows that when joining online meetings, many people prefer audio-only calls, and one of the key reasons is that they do not want their private surroundings exposed to other participants.
How to highlight the speaker, reduce distracting information in the background, and make people more willing to turn on their cameras has therefore become a problem for real-time audio and video technology to solve, and the real-time virtual background is one such technology. Unlike traditional techniques such as a green screen, a virtual background uses machine learning inference to segment the portrait in the live video and replace everything outside it. Users can enable it without arranging anything in their physical environment, which makes it convenient and efficient.
In August 2021, Shengwang launched the first version of the virtual background plug-in based on the Web SDK, providing background replacement and background blur. A recent update improves this feature further: the plug-in now supports image backgrounds, solid-color backgrounds specified by CSS color values, and three levels of background blur. The machine learning inference engine has also been upgraded from a general-purpose machine learning framework to the Agora AI implementation. The overall package increment has dropped from 3 MB to 1 MB, computing performance has improved by more than 30%, and the new API is easier to use.
Looking back, the development of the virtual background feature in the Shengwang Web SDK went through three main stages:
The first stage: open-source model + open-source machine learning framework
At this stage, we completed the engineering of the virtual background on the Web platform based on the MediaPipe selfie segmentation model and the TFLite machine learning framework, building a complete pipeline from image acquisition through real-time processing to encoding and sending. Along the way we analyzed the key factors affecting processing performance and made targeted optimizations for those bottlenecks. We also customized both MediaPipe and TFLite for the Web portrait-segmentation scenario.
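To make the pipeline concrete, here is a minimal browser-side sketch of the capture → per-frame processing → send path. It is not Shengwang's internal implementation; the `processFrame` callback is a hypothetical placeholder for the segmentation and compositing step, and the processed canvas is turned back into a track that an RTC SDK can publish.

```typescript
// Sketch: read camera frames, run a per-frame processing callback, and re-emit the
// result as a MediaStreamTrack suitable for encoding and sending.
async function createProcessedTrack(
  processFrame: (input: HTMLVideoElement, output: CanvasRenderingContext2D) => void,
  fps = 30
): Promise<MediaStreamTrack> {
  const camera = await navigator.mediaDevices.getUserMedia({ video: true });
  const video = document.createElement("video");
  video.srcObject = camera;
  await video.play();

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;

  // Pull frames at a fixed rate and let the callback draw the processed result.
  setInterval(() => processFrame(video, ctx), 1000 / fps);

  // captureStream() turns the canvas into a live video track that can be published.
  return canvas.captureStream(fps).getVideoTracks()[0];
}
```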
MediaPipe uses TFLite as its machine learning inference engine, but the TFLite portrait segmentation model shipped with MediaPipe depends not only on general operators supported by TFLite but also on special operators provided by MediaPipe. In practice, we ported those MediaPipe-specific operators into TFLite so that the selfie segmentation model runs directly on TFLite without the MediaPipe framework, and we replaced MediaPipe's graphics-processing functions with a self-developed WebGL algorithm. This removed the project's dependency on MediaPipe, reducing the package increment MediaPipe would have added, and it decoupled machine learning inference from image processing, making the overall solution more flexible.
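The production code does this image processing on the GPU with custom WebGL shaders; the Canvas 2D sketch below only illustrates the compositing logic once the model has produced a mask: keep the pixels the mask marks as "person", then fill everything behind them with the chosen background. The `mask` input is assumed to be an image whose alpha channel encodes the per-pixel person probability.

```typescript
// Sketch: mask-based background replacement using Canvas 2D compositing modes.
function composite(
  ctx: CanvasRenderingContext2D,
  frame: CanvasImageSource,      // current camera frame
  mask: CanvasImageSource,       // segmentation mask (alpha = person)
  background: CanvasImageSource  // replacement background
): void {
  const { width, height } = ctx.canvas;

  ctx.clearRect(0, 0, width, height);
  ctx.drawImage(frame, 0, 0, width, height);

  // Keep only the pixels covered by the mask (the portrait).
  ctx.globalCompositeOperation = "destination-in";
  ctx.drawImage(mask, 0, 0, width, height);

  // Paint the virtual background behind the remaining portrait pixels.
  ctx.globalCompositeOperation = "destination-over";
  ctx.drawImage(background, 0, 0, width, height);

  ctx.globalCompositeOperation = "source-over"; // restore the default mode
}
```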
On the Web platform TFLite runs as a WebAssembly port, and the WebAssembly VM differs considerably from a native system architecture, so we needed to evaluate the matrix/vector computation backends that TFLite supports. TFLite provides three such backends: XNNPACK, Eigen, and ruy. After analysis and comparison, their single-frame inference times under WebAssembly are shown below:
■ Single-frame inference time of TFLite under different computing backends
Based on these results, we switched TFLite's computing backend on WebAssembly to XNNPACK, which greatly improved overall computing performance.
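As a rough illustration of how such backend comparisons can be made, here is a hedged timing sketch. `runInference` is a hypothetical wrapper around whichever WebAssembly build (XNNPACK, Eigen, or ruy) is being measured; the real benchmark methodology is not described in detail in this article.

```typescript
// Sketch: measure average single-frame inference time for a given inference wrapper.
async function benchmark(
  runInference: (frame: ImageData) => Promise<void>,
  frame: ImageData,
  iterations = 100
): Promise<number> {
  // Warm up so wasm instantiation and JIT effects do not skew the numbers.
  for (let i = 0; i < 10; i++) await runInference(frame);

  const start = performance.now();
  for (let i = 0; i < iterations; i++) await runInference(frame);
  return (performance.now() - start) / iterations; // average ms per frame
}
```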
The second stage: self-developed model + open-source machine learning framework
In the second stage, research and development focused on building and productizing our own models and algorithms. By covering a wide range of scenarios with a large volume of training samples, Shengwang iterated its self-developed portrait segmentation model, and the resulting model and algorithm show clear overall advantages over the open-source model in segmentation accuracy, picture stability, and computing performance. For reasons of machine learning ecosystem and model compatibility, the framework used for engineering was also switched from TFLite to ONNX Runtime. While porting ONNX Runtime to WebAssembly, the Web team applied a number of optimizations, including SIMD and multi-threading, to improve computing performance. It is worth mentioning that part of our ONNX Runtime WebAssembly SIMD optimization work was submitted to the ONNX Runtime open-source community and merged into the project's main line.
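For readers who want to try the same WebAssembly paths with the public onnxruntime-web package, the sketch below shows how SIMD and multi-threading are typically enabled there. The flag names follow the public onnxruntime-web API as I understand it, not Shengwang's internal build, and the model path is hypothetical; multi-threaded wasm also requires the page to be cross-origin isolated.

```typescript
// Sketch: load a segmentation model with onnxruntime-web, with SIMD and threads enabled.
import * as ort from "onnxruntime-web";

ort.env.wasm.simd = true;                                     // prefer the SIMD wasm binary if available
ort.env.wasm.numThreads = navigator.hardwareConcurrency ?? 4; // multi-threaded wasm backend

async function loadSegmentationModel(): Promise<ort.InferenceSession> {
  // "./selfie_segmentation.onnx" is a placeholder model path for illustration.
  return ort.InferenceSession.create("./selfie_segmentation.onnx", {
    executionProviders: ["wasm"],
  });
}
```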
At that point, Shengwang released Web SDK v4.5.0 and published a standalone Web virtual background plug-in on npmjs, becoming the first vendor in the to-B audio and video cloud industry to support this feature in a Web SDK product.
The third stage: self-developed model + Agora AI machine learning framework
While the portrait segmentation model kept evolving, the relentless pursuit of computing performance and user experience led Shengwang's high-performance computing team to re-engineer the model on top of the Agora AI framework. Through techniques including computation-graph optimization, automatic memory reuse, and WebAssembly-level operator optimization, the overall performance of the original processing algorithm on the Web platform improved by about 30%.
■ Single-frame inference time of the Agora portrait segmentation model on a test device
After the Web SDK team replaced ONNX Runtime with Agora AI, the overall package size of the virtual background plug-in dropped from 3 MB to 1 MB, which noticeably speeds up plug-in loading in the browser and greatly improves the user experience.
■ Agora Web SDK media processing pipeline
In the recently released Web SDK v4.10.0, we also updated the virtual background plug-in. Besides the improvements above, the new plug-in is built on the Web SDK's new plug-in mechanism and provides an easier-to-use API. It is published on npm under a new package name. If you are interested in this feature, you can visit the Agora official website documentation to learn more.
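As a usage sketch only, the snippet below shows what enabling the plug-in can look like with the published npm packages. The package and class names (agora-rtc-sdk-ng, agora-extension-virtual-background, VirtualBackgroundExtension) follow the public documentation; treat details such as the wasm asset path and the specific option values as illustrative, and consult the official docs for the exact API of your SDK version.

```typescript
// Sketch: register the virtual background extension, pipe it into a camera track,
// and switch on one of the supported modes (blur, CSS color, or image background).
import AgoraRTC from "agora-rtc-sdk-ng";
import VirtualBackgroundExtension from "agora-extension-virtual-background";

async function enableVirtualBackground() {
  const extension = new VirtualBackgroundExtension();
  AgoraRTC.registerExtensions([extension]);

  const processor = extension.createProcessor();
  await processor.init("./assets/wasms"); // directory hosting the plug-in's wasm assets (illustrative)

  const track = await AgoraRTC.createCameraVideoTrack();
  track.pipe(processor).pipe(track.processorDestination);

  processor.setOptions({ type: "blur", blurDegree: 2 }); // e.g. medium background blur
  await processor.enable();

  return track; // publish this track with your RTC client as usual
}
```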
■ The Shengwang Web virtual background plug-in on npmjs
Outlook
Technology never stops evolving and user demand never ends. In future research and development, the Shengwang virtual background will pursue breakthroughs in richer application scenarios such as strong light, dim backgrounds, and complex backgrounds, and will focus on preserving details such as portrait edges and hair in high-resolution scenes.
■ Detail preservation in high-definition portrait segmentation
On the algorithm side, we will move from single-frame image inference toward inference over continuous video frames, in order to meet users' virtual background experience requirements in more complex environments. Let us wait and see!
Finally, if you want to try the current Web virtual background, you can visit videocall.agora.io and enable it in the settings after creating a room.
Introduction to the Dev for Dev column
Dev for Dev (Developer for Developer) is a developer-driven innovation practice jointly initiated by Agora and the RTC developer community. Through technology sharing, exchange of ideas, and collaborative project building from an engineer's perspective, it gathers the power of developers, surfaces and delivers the most valuable technical content and projects, and fully releases the creativity of technology.