NetEase Yunxin's Technology for 10,000-Person Online Mic-Linking Revealed

Technical course series review | NetEase Yunxin's Technology for 10,000-Person Online Mic-Linking Revealed

This article is compiled from the online live talk "MCtalk Live#5: NetEase Yunxin's Technology for 10,000-Person Online Mic-Linking Revealed", given by Chen Ce, senior audio and video server development engineer at NetEase Yunxin.

Introduction

Hello everyone, I am Chen Ce from NetEase Yunxin. Mic-linking (Lianmai) is a highly interactive scenario, and high concurrency within a single room has always been a difficult problem. In this talk I will share NetEase Yunxin's exploration and practice in the 10,000-person mic-linking scenario.

Typical scenarios for 10,000-person mic-linking include video conference seminars, low-latency live streaming, large online education classes, and Clubhouse-style multi-person voice chat rooms. For this demand, the common solution on the market is "RTC+CDN": the number of broadcasters on the microphone is limited to a small scale (about 50 people) who interact over RTC, their streams are relayed to a CDN, and the large audience watches through CDN live streaming. This approach limits the number of people on the microphone at the business level, and there is a large audio and video delay between the audience and the broadcasters, so it cannot meet Yunxin's business requirements. In the game voice scenario, for example, a large number of users turn on their microphones, everyone needs to be able to go on the microphone, and the number of people on the microphone cannot be capped. So how does NetEase Yunxin solve this problem? This article shares our exploration.

Technical difficulties of signaling

Signaling concurrency, weak network and high availability issues

Our discussion on this issue is divided into several aspects: signaling, audio technology, video technology, and network transmission between servers. Let's first look at the implementation of signaling.

RTC uses long-lived connections for two-way notification. Suppose a voice chat room has 10,000 broadcasters: whenever one of them goes on or off the microphone, or joins or leaves the room, a signaling notification is triggered to the other 9,999 people. Such unpredictable user behavior can put enormous instantaneous pressure on the server.

Obviously, a traditional centralized single-point server cannot support that much concurrency in a 10,000-person room, so a distributed architecture must be used to spread the load. If a 10,000-person room is distributed across multiple servers, the N×(N-1) user states and publish-subscribe relationships must be synchronized between the servers in real time. At the same time, to achieve high availability, data must also be resynchronized after a server crashes and restarts.

On the other hand, media servers are generally built as a distributed mesh to reduce point-to-point latency. But if the signaling servers also form a mesh in which every node is a peer, each node must maintain the full set of cascade relationships, and the intricate message synchronization among them becomes extremely difficult to maintain.

NetEase Yunxin distributed tree structure

To achieve high concurrency and high availability, and given that signaling is only transmitted within a room and there is no signaling interaction between rooms, we designed the signaling server as a distributed tree structure. Each room is a separate tree, made up of the following nodes:

Root node: the room management server, which manages and stores all user states and subscription relationships.
Child node: an edge server that handles access from nearby users.

This tree structure effectively disperses the broadcast pressure on each node. The root node is allocated as close to the users as possible, to avoid excessively long links between it and the child nodes. At the same time, the child nodes act as message proxies only, as far as possible, and do not take part in business logic. The business logic is concentrated on the root node, which avoids problems such as out-of-order signaling synchronization.

The root node uses a cache plus a database to ensure the performance and reliability of business data storage. Since a child node holds no user business state, after it crashes and restarts the client only needs to reconnect its signaling long connection; no rejoin-the-room operation is needed, so the user does not notice. If a child node goes down, the client's timeout mechanism triggers a new scheduling request. Together, these measures make the service highly available.
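
The tree fan-out can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names `RoomRoot` and `EdgeNode`, the event format, and the round-robin placement of users on 20 edges are all hypothetical, and plain Python lists stand in for long connections. It shows the root sending one message per edge while the edges relay to their local users.

```python
class EdgeNode:
    """Hypothetical edge server: a stateless proxy that relays messages to local clients."""

    def __init__(self, name):
        self.name = name
        self.clients = {}  # user_id -> inbox list (stands in for a long connection)

    def attach(self, user_id):
        self.clients[user_id] = []

    def broadcast(self, message, exclude=None):
        # Relay to every locally connected user except the one who caused the event.
        for user_id, inbox in self.clients.items():
            if user_id != exclude:
                inbox.append(message)


class RoomRoot:
    """Hypothetical root node: the only place that stores user state for one room."""

    def __init__(self):
        self.edges = {}
        self.user_edge = {}  # user_id -> edge name
        self.on_mic = set()
        self.root_sends = 0  # number of messages the root itself sends

    def add_edge(self, edge):
        self.edges[edge.name] = edge

    def join(self, user_id, edge_name):
        self.user_edge[user_id] = edge_name
        self.edges[edge_name].attach(user_id)

    def set_mic(self, user_id, on):
        # State changes happen only here; edges never interpret the event.
        if on:
            self.on_mic.add(user_id)
        else:
            self.on_mic.discard(user_id)
        event = {"type": "mic", "user": user_id, "on": on}
        for edge in self.edges.values():
            self.root_sends += 1
            edge.broadcast(event, exclude=user_id)


def build_room(num_users=10_000, num_edges=20):
    """Build a room whose users are spread round-robin over the edges (an assumption)."""
    root = RoomRoot()
    for i in range(num_edges):
        root.add_edge(EdgeNode(f"edge-{i}"))
    for uid in range(num_users):
        root.join(uid, f"edge-{uid % num_edges}")
    return root
```

With `root = build_room()` followed by `root.set_mic(0, True)`, the root sends 20 messages (one per edge) and the edges deliver the notification to the other 9,999 users between them.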

Signaling weak network problem

In an RTC architecture, because signaling is high-volume and its interactions are complex, the signaling and media channels are generally separated. This introduces a problem: the two channels differ in their resistance to weak networks. Media is generally transmitted over UDP and can still be watched smoothly even under 30% packet loss, whereas signaling is generally transmitted over TCP and is basically unusable under 30% packet loss.

To solve this problem, we use QUIC as an acceleration channel for signaling, so that signaling and media resist weak networks equally well. When QUIC cannot connect (the user's network may block some UDP ports), the client falls back to WebSocket, which keeps signaling highly available under weak networks.
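
The fallback order can be sketched as a small helper that tries transports in priority order. This is an illustration only: `connect_signaling`, `quic_blocked`, and `websocket_ok` are hypothetical names, and the two connect functions are stand-ins rather than real QUIC or WebSocket clients.

```python
def connect_signaling(transports, timeout_s=3.0):
    """Try each (name, connect) pair in order; return (name, connection) on first success."""
    errors = []
    for name, connect in transports:
        try:
            return name, connect(timeout_s)
        except (OSError, TimeoutError) as exc:
            # Record the failure and move on to the next transport in the list.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all signaling transports failed: " + "; ".join(errors))


# Hypothetical stand-ins for real clients, used only to exercise the fallback.
def quic_blocked(timeout_s):
    raise TimeoutError("UDP port unreachable")


def websocket_ok(timeout_s):
    return object()  # placeholder for an open connection
```

Calling `connect_signaling([("quic", quic_blocked), ("websocket", websocket_ok)])` returns a connection tagged `"websocket"`, because the simulated QUIC attempt times out first.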

Technical difficulties of audio

Mixing and routing

In the multi-person mic-link scenario, the most complicated problem is audio concurrency when multiple users are talking at the same time.

Suppose every broadcaster in a 10,000-person room has the microphone on. Because of the nature of voice, each broadcaster theoretically has to subscribe to everyone else (video can be subscribed to on demand, but audio should in theory be fully subscribed). If each client subscribes to N-1 (9,999) streams, traffic is wasted and the client cannot sustain that many downlinks. Moreover, in a real scene, once more than 3 people speak at the same time, listeners can hardly make out the content.

There are generally two ways to solve this problem: audio routing and server-side mixing. But in the scenario of a 10,000-person room, these two solutions have their shortcomings:

Audio routing selects the loudest channels among the N audio channels for delivery (generally 2~3 channels). This does solve the problems above, but it has a precondition: the edge server directly connected to the client must collect the full set of audio streams before it can make the selection. Even a 48-core server cannot support 10,000 concurrent streams.
Server-side mixing either decodes all N channels and mixes them into 1 channel, or first selects channels and then decodes 3 channels and mixes them into 1. The main problem with the former is that an MCU server can hardly withstand such heavy transcoding pressure, and the processing takes too long. The latter still has the audio routing defect described above. In fact, the biggest shortcoming of an MCU is that it is a single point: a crash affects all users, and it easily becomes the system bottleneck.
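
To make the mixing step concrete, here is a minimal sketch of mixing already-decoded audio, assuming 16-bit PCM samples held in Python lists. The function name `mix_pcm16` is hypothetical. A real MCU also has to decode every input and re-encode the output, which is where most of the transcoding cost mentioned above comes from.

```python
def mix_pcm16(channels):
    """Mix equal-length lists of 16-bit PCM samples into one channel by summing and clamping."""
    if not channels:
        return []
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    mixed = []
    for samples in zip(*channels):
        total = sum(samples)
        # Clamp to the signed 16-bit range so loud passages saturate instead of wrapping.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

For example, `mix_pcm16([[1000, -2000, 30000], [500, -500, 10000], [0, 100, 0]])` returns `[1500, -2400, 32767]`; the last sample is clamped because 40000 does not fit in 16 bits.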

NetEase Yunxin's distributed routing

To solve the audio problems above, we adopted pre-selection before server cascading. Assume that a 10,000-person room is evenly distributed across 20 edge servers, each carrying 500 upstream streams. Without cascading pre-selection, once the servers are cascaded with one another, each server must pull the full set of 10,000 streams to have candidates for downlink route selection.

With cascading pre-selection (3 channels by default), each server only needs to pull 3×(20-1) streams. It then performs a second selection among its local 500 streams plus the 3×(20-1) cascaded streams, and sends the final 3 loudest channels to the client. In this way, full audio subscription is achieved. Assuming N audio streams are carried on M servers, the amount of data transmitted between servers drops from the order of N^2 to the order of M^2.
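
The two-stage selection can be sketched as follows. The sketch assumes each stream is a `(stream_id, volume)` pair with distinct volumes generated at random; the function names `top_k`, `select_for_edge`, and `make_room` are hypothetical, and real servers would compare audio levels measured per packet rather than static numbers.

```python
import heapq
import random


def top_k(streams, k=3):
    """Return the k loudest (stream_id, volume) pairs."""
    return heapq.nlargest(k, streams, key=lambda s: s[1])


def select_for_edge(edge_index, edges, k=3):
    """Second-stage selection on one edge: local streams plus each peer's pre-selected top k."""
    candidates = list(edges[edge_index])
    pulled = 0  # streams pulled over the cascade links
    for i, peer in enumerate(edges):
        if i != edge_index:
            preselected = top_k(peer, k)  # first stage runs on the peer edge
            pulled += len(preselected)
            candidates.extend(preselected)
    return top_k(candidates, k), pulled


def make_room(num_edges=20, per_edge=500, seed=7):
    """Build a room of (stream_id, volume) pairs with distinct volumes (illustration data)."""
    rng = random.Random(seed)
    total = num_edges * per_edge
    volumes = rng.sample(range(total * 10), total)
    return [
        [(e * per_edge + j, volumes[e * per_edge + j]) for j in range(per_edge)]
        for e in range(num_edges)
    ]
```

With `edges = make_room()`, `select_for_edge(0, edges)` pulls 3×(20-1) = 57 cascaded streams instead of the 9,500 streams held by the other 19 servers, and the 3 streams it selects are the same as the 3 loudest in the whole room, because any globally loudest stream is also among the top 3 on its own server.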

Technical difficulties of video

Limitations of Simulcast/SVC and bitrate suppression in large rooms

Because of client performance limits, the number of video streams that can be decoded and rendered at the same time is generally small, so the media concurrency problem described above can be avoided for video by subscribing on demand.

The main technical difficulty of video in a 10,000-person room lies in QoS. In RTC there are two main server-side QoS methods:

Switching streams with Simulcast/SVC
Suppressing the sender's encoding bitrate through RTCP

In essence, Simulcast and SVC divide users' downlink bandwidth into tiers and distribute the matching stream to each tier. But in a 10,000-person room, user bandwidth is widely scattered, and mechanical tiering does not give most users the best experience. RTCP bitrate suppression computes the most suitable bitrate from the receivers' bandwidth and feeds it back to the sender's encoder. It suits 1v1 calls best; in a 10,000-person room it is hard to decide which bitrate to feed back.

QoS strategy

To let as many users as possible receive a video stream that matches their network, we formulated the following QoS strategy. First, we divide users into 4 levels, from high to low, according to their downlink bandwidth. The sender encodes with Simulcast and SVC at the same time, for example 720p/30fps, 720p/15fps, 720p/8fps, and 180p/30fps, and the server delivers the corresponding stream according to each user's level.
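
A level lookup of this kind might look like the sketch below. The four layer names come from the example above; the minimum bandwidths in kbps are made-up illustration values, not NetEase Yunxin's real thresholds, and `pick_layer` is a hypothetical name.

```python
# Layers from the example above, highest first. The minimum bandwidths (kbps)
# are invented for illustration and are NOT NetEase Yunxin's real thresholds.
LAYERS = [
    ("720p/30fps", 1500),
    ("720p/15fps", 1000),
    ("720p/8fps", 600),
    ("180p/30fps", 0),
]


def pick_layer(downlink_kbps):
    """Return the highest layer whose assumed minimum bandwidth the user can sustain."""
    for name, min_kbps in LAYERS:
        if downlink_kbps >= min_kbps:
            return name
    # Below every threshold (e.g. a negative estimate): fall back to the lowest layer.
    return LAYERS[-1][0]
```

Under these assumed thresholds, a user with an estimated 1,200 kbps downlink is served the 720p/15fps layer.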

The advantage of this method is that every user gets a video stream matching their bandwidth. The obvious disadvantage is that the distribution is not even enough. For example, in a 10,000-person room, the bandwidth of most users may fall into the 720p/15fps layer while a small number of users are scattered across the other three layers. In that case, most users in the room do not get the best video experience.

To solve this problem, bitrate suppression must be combined with layered coding. First, users are sorted by bandwidth from high to low, and the lowest bandwidth among the top N% of users is fed back to the sender to guide the encoding bitrate of the highest layer (720p/30fps), so that the top N% of users can all receive the best stream. N can be set by the user, or changed dynamically according to the downlink bandwidth of the users in the room.
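
The top N% rule reduces to a sort and an index. The sketch below assumes the server already has a downlink bandwidth estimate in kbps for every receiver; the function name `top_layer_target_kbps` and the sample numbers are hypothetical.

```python
import math


def top_layer_target_kbps(downlink_kbps, top_percent=60):
    """Return the lowest downlink bandwidth among the top `top_percent`% of users.

    Feeding this value back to the sender as the ceiling for the highest layer
    means the top N% of users can all receive that layer.
    """
    if not downlink_kbps:
        raise ValueError("no receivers")
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    ranked = sorted(downlink_kbps, reverse=True)
    # Round up so that at least one user is always counted.
    count = max(1, math.ceil(len(ranked) * top_percent / 100))
    return ranked[count - 1]
```

For ten receivers with estimates `[3000, 2500, 2000, 1800, 1500, 1200, 1000, 800, 500, 300]` and `top_percent=60`, the function returns 1200: the six best-connected users can all receive a top layer encoded at or below 1,200 kbps.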

Figure: the QoS strategy with Top 60% as an example.

Another scenario is insufficient uplink bandwidth on the sender's side. If the uplink is only 1.2 Mbps, sending Simulcast+SVC is not possible at all. In this case, we let the client encode only a single 720p stream, introduce an MCU into the room, and forward the single stream to it. The MCU transcodes the stream into Simulcast+SVC and pushes it back to the SFU, so that it matches our downlink QoS strategy.

Technical difficulties of network transmission between servers

Cross-operator and cross-border transmission

During actual development we also ran into network transmission problems between servers, which we share here as well.

For example, to shorten the last mile, we use some single-line (single-carrier) data centers as edge nodes. If single-line data centers of different carriers are cascaded directly, the network transmission between them is clearly uncontrollable. Solving this at the architecture level requires introducing a three-line/BGP data center as a relay in the cascade network, which in turn means considering where relay nodes are placed and what happens when a relay server fails as a single point. This undoubtedly adds a great deal of scheduling complexity.

Another case is cascading machines across countries. The public network route between servers may not be optimal, and jitter may be very large. The network in the middle mile is completely uncontrollable.

WE-CAN

To solve problems of this kind, we abstracted the inter-server transmission module and introduced our own public network infrastructure, a large-scale distributed real-time transmission network: WE-CAN (Communications Acceleration Network). Nodes are deployed in major regions of the world, and the nodes continuously probe and report the network quality between them. After receiving the reports, the central decision-making module combines carrier, real-time network quality, bandwidth cost, and other information to compute the shortest path between any two nodes, generates a routing table, and issues it to each node as the reference for next-hop routing. WE-CAN is independent of the business and is purely a transport-layer solution. Media cascading only needs to put the destination address in the packet header and hand the packet to WE-CAN; it does not have to deal with transmission issues that lie outside the business logic.
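
The routing-table step can be illustrated with a textbook shortest-path computation. This is a sketch, not WE-CAN's actual algorithm: it uses Dijkstra's algorithm, folds carrier, link quality, and bandwidth cost into a single assumed edge cost, and the node names and costs in `GRAPH` are invented.

```python
import heapq


def next_hop_table(graph, source):
    """Run Dijkstra from `source`; return {destination: next hop on the cheapest path}.

    `graph` maps node -> {neighbor: cost}. The cost stands in for whatever the
    central decision module derives from carrier, link quality, and bandwidth price.
    """
    dist = {source: 0.0}
    first_hop = {}
    heap = [(0.0, source, None)]
    done = set()
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue  # stale entry: a cheaper path to this node was already settled
        done.add(node)
        if hop is not None:
            first_hop[node] = hop
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if neighbor not in done and new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                # When leaving the source, the first hop is the neighbor itself.
                heapq.heappush(heap, (new_cost, neighbor, neighbor if hop is None else hop))
    return first_hop


# Hypothetical topology with made-up link costs.
GRAPH = {
    "hangzhou": {"singapore": 40, "frankfurt": 100},
    "singapore": {"hangzhou": 40, "frankfurt": 90, "virginia": 170},
    "frankfurt": {"hangzhou": 100, "singapore": 90, "virginia": 70},
    "virginia": {"singapore": 170, "frankfurt": 70},
}
```

Here `next_hop_table(GRAPH, "hangzhou")` maps `"virginia"` to `"frankfurt"`: the path through Frankfurt costs 170, while the path through Singapore costs 210. A node holding this table only needs the destination address from the packet header to pick the next hop.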

Summary

That covers all the technical solutions shared in this talk. With Yunxin's 10,000-person mic-linking technology, the service has been upgraded to be stateless. It places no limit on the maximum number of people in a room or on the number of people on the microphone at the same time, and it supports elastic horizontal scaling, so it can easily cope with sudden traffic surges and match streams to users' networks within seconds.

Of course, no system is built overnight, and we stepped into countless pitfalls along the way. NetEase Yunxin will keep refining its audio and video technologies to bring better services to the industry.

Author introduction

Chen Ce is a senior audio and video server development engineer at NetEase Yunxin. He is responsible for the construction and core development of Yunxin's global RTC network. He has rich experience in media data transmission and RTC full-stack architecture design.
