
This article was shared by the NetEase Yunxin technical team. The original title, "How to guarantee a large-scale live broadcast with tens of millions of viewers?", has been revised and changed for this repost.

1. Introduction

Taking the TFBOYS seventh-anniversary online concert "Sunlight Travel" as an example, this article shares all aspects of the back-end architecture design of a large-scale live broadcast system, including the basic architecture, stability assurance, security, monitoring and alerting, and emergency plans.

The concert in this case featured real-time online interaction and multi-scene director switching at the venue, providing the host camera stream plus three artist-exclusive camera streams, each of which was transcoded in real time into four resolution levels so that users could choose what they wanted to watch. The concert peaked at 786,000 simultaneous online viewers, breaking the world record for online paid concerts.



(This article is simultaneously published at: http://www.52im.net/thread-3875-1-1.html )

2. About the author

Feynman: server development engineer at NetEase Smart Enterprise. He graduated with a master's degree from the telecommunications department of Huazhong University of Science and Technology, joined NetEase Yunxin in 2016, is keen on large-scale distributed systems and audio/video technologies, and loves literature, sports, and movies.

3. Architecture

3.1 Basic architecture

The picture above is a schematic diagram of the live media architecture of the TFBOYS online concert.

As you can see, the technical solution behind a live broadcast of a large-scale event is quite complex. In the rest of this section, we introduce this live broadcast solution in terms of the push/pull streaming link, global intelligent scheduling, accurate traffic scheduling, and unitized deployment.

3.2 Push-pull streaming link

As shown in the figure above, the live broadcast technical architecture is divided into the following parts:

1) Live Video Center (LMS, Live Manage Service): responsible for the logical management and operational control of live streams, including storing and delivering the configuration for media processing such as real-time transcoding and encryption (a config sketch follows this list);
2) Real-time interactive live service: consists of two parts, co-streaming ("Lianmai") interaction and live broadcast. The audio and video of the anchor and the co-streaming guests are mixed into a single stream by a high-performance interactive live-broadcast server and then pushed to the live streaming media server;
3) Live Source Service (LSS, Live Source Service): NetEase Yunxin's self-built live streaming media server nodes which, together with the global intelligent scheduling system, provide the best "first-kilometer" link selection and integrate support for multiple CDN vendors;
4) Media Processing Service (MPS, Media Processing Service): provides streaming media processing capabilities such as real-time watermarking, real-time transcoding, and media data encryption;
5) Converged CDN and Global Server Load Balancing (GSLB): provides agile and intelligent CDN scheduling policies and allocation algorithms, combined with full-link, end-to-end streaming media control, to deliver an excellent user experience on the client side;
6) Client SDK: provides push, pull, and upstream/downstream scheduling capabilities, so that users can quickly integrate NetEase Yunxin's one-stop audio/video solution.
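
As a concrete illustration of the kind of configuration the LMS stores and delivers (item 1 above), here is a minimal sketch in Go of one camera stream with the four transcoded resolution levels mentioned in the introduction. All field names, resolutions, and bitrates are hypothetical and not the actual Yunxin configuration schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TranscodeProfile describes one output rendition of a camera stream.
// Field names are illustrative only, not the real LMS schema.
type TranscodeProfile struct {
	Name    string `json:"name"` // e.g. "1080p"
	Width   int    `json:"width"`
	Height  int    `json:"height"`
	Bitrate int    `json:"bitrateKbps"`
}

// StreamConfig is what an LMS-like service could store for one camera stream.
type StreamConfig struct {
	StreamID   string             `json:"streamId"`
	Encrypted  bool               `json:"encrypted"`
	Renditions []TranscodeProfile `json:"renditions"`
}

func main() {
	cfg := StreamConfig{
		StreamID:  "camera-main",
		Encrypted: true,
		Renditions: []TranscodeProfile{
			{Name: "1080p", Width: 1920, Height: 1080, Bitrate: 4000},
			{Name: "720p", Width: 1280, Height: 720, Bitrate: 2000},
			{Name: "540p", Width: 960, Height: 540, Bitrate: 1200},
			{Name: "360p", Width: 640, Height: 360, Bitrate: 800},
		},
	}
	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
}
```
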
3.3 Converged CDN and intelligent scheduling

This is an end-to-end service that implements HTTPDNS-like scheduling through the platform SDK, so that each user accesses the nearest node according to their IP.

Given the relatively complex network environment of domestic carriers, on the live broadcast uplink the choice of network link can be controlled more precisely through a BGP network and through network-access cooperation with the relevant carriers.

On the downlink, a player SDK is also provided, and an appropriate nearby downlink is selected through the end-to-end scheduling strategy.

The accuracy and final effect of scheduling depend on timely and accurate data.

We have a full-link, multi-dimensional data monitoring system: on the one hand we use real-time logs from the CDNs, and on the other hand we combine data probed on the link by our self-built nodes and reported by clients, feeding all of it into real-time computation. The output underpins the entire scheduling strategy.

The converged CDN solution solves CDN network problems through scheduling, monitoring, high availability, and other techniques. For business developers it feels no different from using a traditional single CDN network; these technical details remain transparent and imperceptible to them.
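
To make the "nearest access by user IP" idea concrete, the sketch below shows one plausible scheduling decision: map the client IP to a region and carrier via a GeoIP-style lookup (stubbed out here) and prefer edge nodes that match both. This is only an assumed illustration of the general approach, not Yunxin's actual scheduling logic.

```go
package main

import "fmt"

// EdgeNode is a candidate pull node; fields are illustrative.
type EdgeNode struct {
	Host   string
	Region string
	ISP    string
}

// lookupGeo stands in for a real GeoIP/carrier database lookup (hypothetical).
func lookupGeo(clientIP string) (region, isp string) {
	return "zhejiang", "CT" // placeholder result
}

// pickNode prefers a node matching both region and carrier, then region only.
func pickNode(clientIP string, nodes []EdgeNode) EdgeNode {
	region, isp := lookupGeo(clientIP)
	var regionOnly *EdgeNode
	for i := range nodes {
		n := &nodes[i]
		if n.Region == region && n.ISP == isp {
			return *n // best match: same region and same carrier
		}
		if n.Region == region && regionOnly == nil {
			regionOnly = n
		}
	}
	if regionOnly != nil {
		return *regionOnly
	}
	return nodes[0] // fall back to a default node
}

func main() {
	nodes := []EdgeNode{
		{"edge-bj-cu.example.com", "beijing", "CU"},
		{"edge-zj-ct.example.com", "zhejiang", "CT"},
	}
	fmt.Println(pickNode("1.2.3.4", nodes).Host)
}
```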

3.4 Accurate Traffic Scheduling

For a large-scale live concert, especially at the moment the broadcast officially starts and the audience floods into the room, the burst peak traffic is very high, which requires a real-time and accurate intelligent scheduling strategy.

The intelligent scheduling of the converged CDN consists of two parts: CDN allocation scheduling and node scheduling.

Node scheduling: the most common approaches are DNS-based scheduling and IP-based scheduling (302 redirect / HTTPDNS). Because of how the DNS protocol works, the former takes effect more slowly, while the latter can schedule at the granularity of individual requests, i.e. it supports load balancing at any ratio and is more timely and accurate. In our intelligent scheduling scenario, IP scheduling is used under normal circumstances; when IP scheduling resolution fails, the client falls back to local DNS resolution. The combination of the two ensures accurate, stable, and reliable scheduling.
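
A minimal sketch of the client-side behavior described above: ask an HTTPDNS-style scheduling endpoint for the pull addresses first, and fall back to local DNS resolution if that fails. The endpoint URL and response format are assumptions made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"time"
)

// resolvePullHost tries IP (HTTPDNS-style) scheduling first, then local DNS.
func resolvePullHost(domain string) ([]string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	// Hypothetical scheduling endpoint; not the real Yunxin API.
	resp, err := client.Get("https://httpdns.example.com/d?host=" + domain)
	if err == nil {
		defer resp.Body.Close()
		var ips []string
		if json.NewDecoder(resp.Body).Decode(&ips) == nil && len(ips) > 0 {
			return ips, nil // request-level, proportion-aware scheduling result
		}
	}
	// Fallback: local DNS resolution keeps playback available.
	return net.LookupHost(domain)
}

func main() {
	ips, err := resolvePullHost("pull.example.com")
	fmt.Println(ips, err)
}
```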

"Don't put all your eggs in one basket."

From a risk-management perspective: the CDN resources needed to guarantee a large-scale event usually cannot be met by a single CDN vendor. The converged CDN solution therefore integrates multiple CDN vendors and allocates and schedules traffic among them.

In a large-scale live broadcast, the capacity (regional bandwidth, peak bandwidth) and quality offered by the different CDN vendors usually vary. By dynamically adjusting the scheduling ratios, we allocate traffic precisely in proportion to each vendor's peak bandwidth, guaranteeing that no vendor's peak bandwidth is exceeded while delivering the best possible experience.

We designed a scoring algorithm for the CDN vendors; its factors include current bandwidth, guaranteed bandwidth, maximum bandwidth, bandwidth prediction, and bandwidth quality.

The algorithm follows these principles:

1) Bandwidth below the guaranteed minimum scores higher than bandwidth above it;
2) Below the guaranteed minimum, the larger the remaining guaranteed bandwidth and the remaining total bandwidth, the higher the score;
3) Above the guaranteed minimum, the larger the remaining total bandwidth and the better the quality, the higher the score.
The ratio of the vendors' scores determines the scheduling ratio. The scoring algorithm is iterated and recalculated continuously, making maximum use of each CDN's bandwidth before allocating traffic beyond any vendor's guaranteed capacity, and giving priority to vendors with better quality so as to avoid over-allocating above-guarantee traffic to vendors with a higher unit price.
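
The sketch below captures the spirit of these principles: bandwidth under the guarantee always scores higher than bandwidth over it, more remaining headroom raises the score, above the guarantee quality dominates, and the score ratios become the scheduling ratios. The concrete weights and formula are invented for illustration; the production algorithm is iterated continuously as described above.

```go
package main

import "fmt"

// CDNStat carries the factors mentioned above; the numbers are illustrative.
type CDNStat struct {
	Name       string
	Current    float64 // current bandwidth in Gbps
	Guaranteed float64 // guaranteed (minimum purchase) bandwidth
	Max        float64 // maximum allowed bandwidth
	Quality    float64 // 0..1, higher is better (from QoS monitoring)
}

// score follows the three principles: under-guarantee beats over-guarantee;
// more remaining guarantee and total headroom scores higher; over guarantee,
// quality and remaining headroom decide.
func score(s CDNStat) float64 {
	remainingTotal := (s.Max - s.Current) / s.Max
	if s.Current <= s.Guaranteed {
		remainingGuarantee := (s.Guaranteed - s.Current) / s.Guaranteed
		return 100 + 50*remainingGuarantee + 30*remainingTotal + 20*s.Quality
	}
	return 50*remainingTotal + 50*s.Quality
}

// ratios turns scores into scheduling proportions.
func ratios(stats []CDNStat) map[string]float64 {
	total := 0.0
	for _, s := range stats {
		total += score(s)
	}
	out := make(map[string]float64)
	for _, s := range stats {
		out[s.Name] = score(s) / total
	}
	return out
}

func main() {
	stats := []CDNStat{
		{"vendorA", 80, 100, 200, 0.95},
		{"vendorB", 120, 100, 150, 0.90},
	}
	fmt.Println(ratios(stats))
}
```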

3.5 Unitized deployment

As mentioned above, in a large-scale live event, a massive influx of user requests in a short period of time also poses a higher concurrency challenge to the non-media-streaming applications built around the global intelligent scheduling service.

In addition to deploying active and standby units for the upstream push link, the services on the non-media data link also adopt a unitized deployment scheme.

Under this deployment scheme, the failure of any single unit's machine room does not affect overall availability; in other words, it is a multi-site active-active setup.

Unitized deployment follows these principles:

1) A unit's dependencies must also be unitized (for core business);
2) The unit granularity is the application, not the API;
3) The unitized technology stack should avoid being invasive to the application as much as possible.

As shown in the figure above, non-unitized services are deployed only in the main machine room, while unitized services are deployed in both the main machine room and the unit machine room.
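
Here is a minimal sketch of the routing idea behind this deployment: unitized services are addressed per unit at application granularity (here by hashing the user ID, an assumption for illustration), while non-unitized services always go to the main machine room. The unit names are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

var units = []string{"unit-main", "unit-b"} // assumed unit (machine-room) names

// routeUnitized picks a unit for a unitized application by user ID.
func routeUnitized(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return units[int(h.Sum32())%len(units)]
}

// routeNonUnitized: non-unitized services live only in the main machine room.
func routeNonUnitized() string { return "unit-main" }

func main() {
	fmt.Println(routeUnitized("user-42")) // unitized service, per-unit
	fmt.Println(routeNonUnitized())       // non-unitized, main room only
}
```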

4. Stability guarantee

4.1 Uplink Stability

The core requirement of a super-large-scale live broadcast solution is stability. Below, taking this online concert as an example, we focus on the full-link stability architecture of the live broadcast.

The above picture is a schematic diagram of our live media streaming link: the overall solution can tolerate the failure of any single node, single line, or single machine-room network egress.

Take the live broadcast source station as an example: a multi-line strategy is used to collect the streams, including a dedicated machine-room line and a 4G-backpack solution, one active and one standby. In addition, the source cluster in each unit sits behind layer-4 load balancing, so the downtime of a single machine does not affect overall availability. LMS, LSS, and MPS are all deployed across machine rooms, and every service module can be configured with a dedicated resource pool, ensuring it is not affected by other tenants.

The entire push link uses dual hot streams, one active and one standby, deployed as two independent units, which supports rack-level disaster recovery. The dual hot streams switch between active and standby automatically, with no link-switching logic needed at the application layer on the client. When any link fails, the audience's live stream is unaffected, and the average perceived stall time on the client is within 1s.

In addition to the active/standby disaster recovery of the push link as a whole, each unit's services have their own disaster-recovery measures. For example, UPS access can tolerate a 30-minute power outage, and when the real-time interactive stream has a problem, the director station pushes a padding stream to keep the link data uninterrupted.
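
A sketch of the automatic active/standby switching idea for the dual hot streams: the distribution side watches the health of the stream currently being distributed and switches to the standby when it goes stale beyond a threshold. The structure and threshold are assumptions for illustration; the production system keeps perceived stall on the client within about 1s, as noted above.

```go
package main

import (
	"fmt"
	"time"
)

// StreamHealth is the monitored state of one hot stream (illustrative).
type StreamHealth struct {
	Name      string
	LastFrame time.Time // time the last media frame was received
}

// chooseStream returns the stream to distribute: stay on the active one
// unless its data has gone stale longer than the threshold, then switch.
func chooseStream(active, standby StreamHealth, threshold time.Duration) StreamHealth {
	if time.Since(active.LastFrame) > threshold &&
		time.Since(standby.LastFrame) <= threshold {
		return standby // automatic active/standby switch
	}
	return active
}

func main() {
	active := StreamHealth{"primary", time.Now().Add(-2 * time.Second)}
	standby := StreamHealth{"backup", time.Now()}
	fmt.Println(chooseStream(active, standby, 800*time.Millisecond).Name)
}
```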

4.2 Downlink Stability

During the live event, the global intelligent scheduling service bears a very high peak load. On top of the unitized deployment, we carried out multiple rounds of stress testing and performance tuning; the model can support tens of millions of users entering the live room within half a minute.

In addition to the high availability of the push link mentioned above, the downlink also has its own disaster-recovery strategies. If the GSLB intelligent scheduling service becomes unavailable as a whole, the local DNS disaster-recovery logic and the proportional CDN configuration embedded in the client SDK take over, failing over from the cloud's global intelligent scheduling to the client's local bottom-line scheduling while keeping the traffic distribution across CDN vendors balanced at the statistical level.

At the same time, the client also has disaster-recovery strategies for the playback experience, such as resolution downgrade and line switching.
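
To illustrate the bottom-line scheduling embedded in the SDK, the sketch below does a weighted random pick over a locally configured vendor ratio; at the population level this keeps traffic distributed across CDN vendors roughly in the configured proportions even when the GSLB is unreachable. Vendor names and weights are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// localWeights is the bottom-line config embedded in the SDK (illustrative).
var localWeights = map[string]float64{
	"vendorA": 0.5,
	"vendorB": 0.3,
	"vendorC": 0.2,
}

// pickCDNLocally does a weighted random choice so that, statistically,
// traffic still follows the configured ratios without the GSLB.
func pickCDNLocally() string {
	r := rand.Float64()
	acc := 0.0
	for name, w := range localWeights {
		acc += w
		if r < acc {
			return name
		}
	}
	return "vendorA" // guard against floating-point edge cases
}

func main() {
	fmt.Println(pickCDNLocally())
}
```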

5. Security guarantee

In addition to the stability of the entire live broadcast link, live broadcast security is also very important.

For this live broadcast, security mechanisms (such as anti-leech authentication, IP black/white lists, HTTPS, etc.) were applied to multiple links of the TFBOYS event, together with dynamic restrictions on downlink scheduling by region and carrier, to achieve full-link security.

On top of this, the event used end-to-end encryption of the video stream data.

Encryption in a live broadcast scenario has several basic requirements: a constant compression ratio, real-time performance, and low computational complexity.

In addition, in a converged multi-CDN context, the encryption of the video stream must also take compatibility with the CDN vendors into account.

For example, the following requirements must be met:

1) Do not break the streaming-media protocol format or the video container format;
2) The header part of metadata/video/audio tags is not encrypted;
3) The avcSequenceHeader and aacSequenceHeader tags are not encrypted at all.

As for the specific encryption algorithm, any of a number of stream ciphers can be used; we will not go into the details here.
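
A sketch of the compatibility rules above, assuming a simplified FLV-like tag stream: script/metadata tags, tag headers, and the avcSequenceHeader/aacSequenceHeader tags stay in the clear, while other audio/video tag bodies pass through a length-preserving transform (a placeholder XOR here, standing in for a real stream cipher), so the container format and compression ratio are untouched.

```go
package main

import "fmt"

// Tag is a simplified FLV-like tag; fields are illustrative.
type Tag struct {
	Type          byte   // 8 = audio, 9 = video, 18 = script/metadata
	IsSequenceHdr bool   // avcSequenceHeader / aacSequenceHeader
	Header        []byte // tag header, never encrypted
	Body          []byte
}

// encryptBody stands in for a length-preserving stream cipher (hypothetical).
func encryptBody(b []byte) []byte {
	out := make([]byte, len(b))
	for i, v := range b {
		out[i] = v ^ 0x5A // placeholder only, NOT a real cipher
	}
	return out
}

// protectTag applies the rules: metadata tags, tag headers, and sequence
// headers stay in the clear; other audio/video bodies are encrypted.
func protectTag(t Tag) Tag {
	if t.Type == 18 || t.IsSequenceHdr {
		return t
	}
	t.Body = encryptBody(t.Body)
	return t
}

func main() {
	video := Tag{Type: 9, Header: []byte{0x09}, Body: []byte("frame-data")}
	fmt.Printf("%x\n", protectTag(video).Body)
}
```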

6. Monitoring and alerting

6.1 Overview

A large-scale live broadcast involves a huge number of computing nodes: in addition to the various server nodes for media data processing and distribution, there are also a large number of clients distributed at home and abroad.

Our perception of the health and quality of network links, service nodes, and end devices is inseparable from a data monitoring system.

At the same time, for failure scenarios where the existing system cannot fail over automatically, manual intervention is required, and the decisions behind it rely heavily on a complete full-link data-quality monitoring and alerting system.

6.2 Full link monitoring

The monitoring of the entire live link includes:

1) Stream quality of the upstream push link;
2) Real-time transcoding of the media streams;
3) Playback quality on the device;
4) Availability of the intelligent scheduling system;
5) Business metrics such as traffic volume and water levels.

Common QoS indicators for the uplink are frame rate, bitrate, RTT, and so on; the dimensions include active/standby line, egress carrier, CDN vendor node, etc.

QoS indicators on the client side include stream-pulling success rate, time to first frame, stall rate, and HTTPDNS cache hit rate; the dimensions cover CDN vendor, country, province, carrier, live stream, resolution level, client type, and so on.

In this live broadcast, the content included multiple camera streams and transcoded output streams at several resolutions, distributed through multiple CDN vendors at the same time. All of them were displayed on a single dashboard page as N indicator cards, with abnormal values highlighted and pop-up message alarms triggered by configured warning thresholds. In the on-site war room, we used several large screens to display the real-time frame rate and bitrate of the active and standby push links very intuitively, providing strong data support for on-site command and decision making.

The figure below is an example: blue indicates the uplink frame rate, green indicates a normal uplink bitrate, red indicates that the bitrate is too low, and N/A indicates that there is currently no uplink streaming data.
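
A small sketch of the indicator-card logic just described: classify the reported uplink bitrate against a warning threshold and render it as normal (green), too low (red), or N/A when no data arrives. The threshold value is an assumption.

```go
package main

import "fmt"

// classifyBitrate mirrors the dashboard semantics: green for normal,
// red for a bitrate below the warning threshold, N/A when no data arrives.
func classifyBitrate(kbps float64, hasData bool, warnKbps float64) string {
	switch {
	case !hasData:
		return "N/A"
	case kbps < warnKbps:
		return "red (bitrate too low)"
	default:
		return "green (normal)"
	}
}

func main() {
	fmt.Println(classifyBitrate(3800, true, 2000)) // green (normal)
	fmt.Println(classifyBitrate(900, true, 2000))  // red (bitrate too low)
	fmt.Println(classifyBitrate(0, false, 2000))   // N/A
}
```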

On the downlink playback side, the most commonly used indicator is the stall rate.

Here is how we define stalls:

1) A stall: the player's buffer is empty for 2s, i.e. the player receives no stream data for 2s;
2) One-minute stalled user: a user who stalls at least once within a 1-minute window is counted as a stalled user;
3) One-minute user stall rate: the number of stalled users / the total number of users within the 1-minute window;
4) One-minute user zero-stall rate: (total number of users - number of stalled users) / total number of users within the 1-minute window.

Why use the user stall rate rather than the overall number of stalled samples / total number of samples?

Because we want to see how many users did not experience a single stall, which reflects the overall proportion of users with a good network experience more intuitively. By looking at the zero-stall rate and the number of users per province together with the per-province user stall rate, we can quickly identify regions with severe stalling and focus resource scheduling and optimization there.
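
The definitions above translate directly into per-minute aggregation: a user who stalls at least once in the one-minute window counts as a stalled user, and the zero-stall rate is the complement. A minimal sketch:

```go
package main

import "fmt"

// stallRates aggregates per-user stall counts within a 1-minute window.
// stallsByUser maps user ID -> number of stalls (buffer empty for 2s) seen.
func stallRates(stallsByUser map[string]int) (stallRate, zeroStallRate float64) {
	total := len(stallsByUser)
	if total == 0 {
		return 0, 1
	}
	stalled := 0
	for _, n := range stallsByUser {
		if n > 0 {
			stalled++ // one stall is enough to count the user as stalled
		}
	}
	stallRate = float64(stalled) / float64(total)
	zeroStallRate = float64(total-stalled) / float64(total)
	return
}

func main() {
	window := map[string]int{"u1": 0, "u2": 2, "u3": 0, "u4": 1}
	s, z := stallRates(window)
	fmt.Printf("one-minute user stall rate: %.2f, zero-stall rate: %.2f\n", s, z)
}
```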

7. Emergency plan

No matter how robust a system claims to be by design, it will still go down at some point.

Hardware failures, software bugs, human error, and so on are unavoidable. They are not necessarily problems that must be completely eliminated within a given time frame; they are facts we have to recognize, accept, and live with.

Therefore, contingency-plan management is an indispensable part of guaranteeing a large-scale live event.

We follow these planning principles:

1) Plan information is unambiguous: the automatic dashboard monitoring leaves no room for ambiguity, the source of the information that triggers a plan is reliable, and the conditions for executing a plan are clearly and numerically bounded;
2) Plan operations are simple: every plan action has a simple, clear (switch-style) operation input;
3) Plan operations are safe: every plan must be fully rehearsed, and the rehearsal itself needs an explicit confirmation mechanism to ensure it cannot be triggered by mistake under normal circumstances;
4) The impact of each plan is verified: the impact of executing a plan is clearly specified, and QA must fully verify that impact during the rehearsal stage.

In the preparations for this event, we conducted three full-link immersive drills of the live broadcast and, together with the interactive site and the director's site, two full-process rehearsals of the event, as well as drills for the various risk plans. Every problem found during the drills was specifically resolved.

The risk plans cover scenarios such as resource failures of all kinds, uplink and downlink quality degradation, regional network failures, and abnormal CDN traffic levels. Resource failures include machine downtime, whole-rack power failure, stacked-switch downtime, and unavailability of a machine room's external network egress; we ran drills to cover each of them.

The following is a partial list of the plan mechanisms in this live broadcast solution:

1) If decryption fails due to misoperation or other causes, stream encryption can be disabled dynamically without interrupting the push stream and without any perceptible impact on the client;
2) If a CDN suffers a large-scale outage on a particular carrier in a particular region, the QoS indicators for that carrier's lines in that region drop significantly and trigger an alarm; the faulty CDN is blacklisted for that region and carrier, its traffic scheduling is dynamically stopped, and the traffic is dispatched to the CDN vendors that are still serving normally;
3) If both hot streams are healthy but the one being distributed has quality problems, the solution supports a manually triggered active/standby switch so that the stream with better monitoring data takes over distribution, with a client-perceived switch time within 1s;
4) If, due to force majeure, a machine room suffers a large-scale failure and becomes unavailable as a whole, a link alarm is triggered; we then urgently switch the streams to another machine room, with fault discovery and recovery within one minute.



