This article is based on the talk given by Yang Kuan (audio and video technology expert on Alibaba's Taobao technology team) on June 26, 2021 at ECUG Meetup Phase 1 | 2021 Audio and Video Technology Best Practices · Hangzhou Station. The article runs to more than 5,500 words, and its main structure is as follows:
- Pain Points of Traditional Live Broadcast Technology
- Low-latency architecture evolution
- Interactive experience upgrade
- Key technologies
To get the full version of the speaker's PPT, please add the ECUG assistant on WeChat (WeChat ID: ECUGCON) and note "ECUG PPT". Talks by the other speakers will also be released later, so stay tuned.
The following is the transcript of the talk:
Good afternoon, everyone! Let me first introduce myself. Since I started working, I have mainly been in audio and video related fields, such as security surveillance, interactive live streaming, and CDN. Currently, I work on low-latency live broadcast architecture and QoS at Taobao.
1. Pain Points of Traditional Live Broadcast Technology
These are some live broadcast scenes from our platform. The first picture is mainly a clothing scene, with comment interaction between the audience and the anchor below. In the traditional live broadcast scenario, the audience may be watching a picture that is already 5-10 seconds old. When a viewer comments and asks a question, the anchor may no longer be introducing that product, so there is a delay gap and communication becomes difficult. The second is a jewelry scene: after introducing one bracelet, the anchor may switch to the next, while the audience is still looking at the previous one. The third is a live pet scene.
There are three main pain points of traditional live broadcasting:
First, high latency, on the order of 5-10 seconds;
Second, interaction between the anchor and the audience is not timely;
Third, switching between live broadcast and mic-connect (Lianmai) scenes is complicated.
2. Low-Latency Architecture Evolution
This is the traditional live broadcast architecture. In the middle is the CDN distribution network, and on the far left is the self-built SFU and MCU cluster. On the far right are the supporting systems, such as logging, monitoring, configuration, and scheduling. At the bottom are the SDKs for pushing and pulling streams. At the top is the live broadcast center with services such as screenshots, transcoding, recording, and slicing.
The traditional live broadcast distribution network is a tree structure. Because the volume of live broadcasts is large, tree-structured distribution keeps costs under control. The uplink protocol is mainly RTMP/WebRTC/private RTC, pushed to the self-built cluster and then forwarded to the CDN over RTMP. The downstream protocols are mainly HTTP-FLV/RTMP/HLS.
This is the low-latency live broadcast architecture transformation we built jointly with the CDN. The main point of the transformation is using RTC-related technologies between the CDN and the audience player, reducing the delay from about 7 seconds to about 1.5 seconds. In terms of business, not only is the delay reduced, but communication between viewers and anchors is easier. Moreover, online per-industry A/B tests show that it also delivers good results in driving e-commerce transaction conversion.
This is roughly the architecture we started building in 2019 together with the CDN, video cloud, enterprise intelligence, and XG Lab teams, and it is currently in large-scale use. As you can see, the intermediate CDN nodes are all connected through RTC links; it is no longer a tree structure but a decentralized architecture. L1 nodes can communicate with each other, and if communication between two L1 nodes has problems, traffic can be relayed through L2. The MCU stream-mixing service connects to the RTC network just like a client: it pulls down the streams that need processing and pushes the result back into the RTC network after processing.
The anchor can communicate with the network directly through RTC. If real-time stream processing is needed, such as adding AI special effects, the stream can be processed in real time and then pushed straight back into the RTC network. The audience likewise interacts with this network directly over RTC links.
The full-link RTC architecture has several advantages. First, it is a network shared by live broadcasts, calls, mic-connect, and conferences. Different businesses peak at different times: live broadcasts generally have more viewers at night, while meetings mostly happen during the day, so different businesses can use the network at staggered times.
Because CDN is generally billed by daily peak bandwidth, call and conference traffic can run during the day while live broadcast bandwidth comes up at night; staggering the peaks of different businesses reduces the overall cost. Second, it supports two-way real-time communication: because RTC can both push and pull streams, multiple streams can share the same link without restriction, for example pushing one stream while pulling ten.
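As a rough, purely hypothetical illustration of why staggered peaks lower a peak-bandwidth bill (the numbers below are made up, not figures from the talk):

```python
# Hypothetical hourly peak bandwidths in Gbps; none of these figures are from the talk.
live = [20] * 10 + [40] * 8 + [100] * 6        # live streaming peaks at night
conference = [60] * 10 + [30] * 8 + [5] * 6    # calls/conferences peak in the daytime

separate_peaks = max(live) + max(conference)                 # two networks billed separately: 100 + 60 = 160
shared_peak = max(l + c for l, c in zip(live, conference))   # one shared network: 105

print(separate_peaks, shared_peak)   # 160 vs 105 -> the shared network has a lower billing basis
```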
3. Interactive Experience Upgrade
The interactive experience in the live broadcast scene was mainly upgraded in two major respects.
The first point is improving the efficiency of interaction. When consumers communicate with the anchor through the live broadcast room, the old system showed the audience a picture from 5-10 seconds earlier, so the anchor's interaction with the audience was out of sync. After the upgrade, the delay is optimized to about 1 second, making interaction between consumers and the anchor much more timely.
The second point is metric optimization. The delay can be kept within 1 second, and within 200 milliseconds in call and conference scenes. Compared with the original traditional live broadcast system, the second-open (instant start) rate is increased by 32%, and the freeze metrics are reduced by 79% and 44% respectively.
Here I want to talk about the difference between the original and the current architecture of interactive mic-connect (Lianmai) live broadcast.
In the original architecture, anchors used the self-built RTC cluster to forward RTC streams, which guaranteed low latency for mic-connect, while the audience pulled streams from the CDN to watch, over an RTMP link in the middle. For anchor-to-anchor mic-connect, one anchor pulls the other's stream, mixes it, pushes the result to the CDN, and the CDN distributes it to the audience. For anchor-to-viewer mic-connect, the viewer is originally on the CDN link, where the delay difference is about 6-7 seconds. When the viewer connects to the mic through RTC to the self-built cluster and then communicates with the anchor, the picture is delayed and some waiting-and-switching logic is needed, so the experience is very poor.
Under the new architecture, the self-built cluster and the CDN are fully integrated. In other words, the CDN also runs full-link RTC: whether it is anchor-to-anchor communication or viewer-to-anchor communication, everything goes over RTC links. With transmission optimization and other technical optimizations on the RTC link, the whole experience is unified in terms of delay, freezes, smoothness, and so on.
Below is the MCU, the stream-mixing server. If the anchor's or viewer's device is a low-end device with limited performance, the streams need to be mixed by the MCU. The mixing server also accesses the network like a client: it pulls down the streams and pushes the mixed result back out. Because our delay is already optimized to be very low, the switch is basically imperceptible.
This architecture blurs the distinction between the roles of viewers, anchors, and other services. To the RTC network, each of them is a unified accessor that can both push and pull streams. In terms of protocol, everything is unified on a private RTC protocol; our private protocol can establish a connection in one RTT. As for scenes, the current live broadcast, mic-connect, conference, and call scenes can be switched seamlessly, and the configuration system enables different strategies for different scenarios.
Another advantage is that the self-built cluster and the CDN are unified into one cluster, which saves the cost of the self-built cluster, and the overall delay is relatively low. As for converging businesses onto one network, as just introduced, different businesses can run on the same network; because their busy periods differ, staggered peak usage reduces the overall bandwidth cost.
4. Key Technologies
We have a co-construction relationship with the CDN. The RTC module provides the RTC-related core processing capabilities and is embedded into the CDN system in a modular way.
The internal modules are roughly as follows: RTP/RTCP, BWE, QoS, PACE, PACKER, trickle/jitter, frame buffer, SRTP, settings, and so on. BWE mainly contains the congestion control algorithms. QoS contains strategies such as retransmission, SVC, FEC, and large/small streams (simulcast). The main function of trickle/jitter is de-jittering and frame assembly. PACKER handles scenes where video or audio frames need to be packed into RTP format. SRTP provides the encryption and decryption module. Currently, interoperation between the GRTN private protocol and the standard WebRTC protocol is supported.
At the management level, there are mainly the following four types:
- Callback management
- Subscription management
- Session management
- Push-pull stream management
Callback management sits between the host system and the RTC core processing layer; interactions between them are handled through callback functions, which keeps the two layers isolated from each other.
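As a minimal sketch of this callback-based isolation (the class and event names here are illustrative, not the actual interface between the CDN system and the RTC module):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class RtcModule:
    """Core RTC processing layer: it never calls the host system directly,
    it only fires callbacks the host registered, keeping the two layers isolated."""
    callbacks: Dict[str, List[Callable]] = field(default_factory=dict)

    def on(self, event: str, handler: Callable) -> None:
        self.callbacks.setdefault(event, []).append(handler)

    def _emit(self, event: str, **kwargs) -> None:
        for handler in self.callbacks.get(event, []):
            handler(**kwargs)

    def handle_incoming_rtp(self, stream_id: str, payload: bytes) -> None:
        # ...de-packetization, jitter handling, etc. would happen here...
        self._emit("frame_ready", stream_id=stream_id, size=len(payload))

# The host (CDN) system wires in its own forwarding/logging logic via callbacks.
rtc = RtcModule()
rtc.on("frame_ready", lambda stream_id, size: print(f"forward {size}B frame of {stream_id}"))
rtc.handle_incoming_rtp("room-123", b"\x00" * 1200)
```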
To summarize, the basic capabilities fall into five main points:
First, a multi-protocol framework. It supports the WebRTC and GRTN protocols. Accessing through the GRTN private protocol guarantees the best overall effect, and it has private extension functions that are much more efficient than standard WebRTC. RTMP and HTTP-FLV are also supported.
Second, atomic audio and video Pub/Sub capabilities that are easy to use and extend.
Third, the Data Channel is managed by its own Pub/Sub, and businesses can customize it flexibly. The Data Channel is currently used mainly for control signaling and some cloud gaming scenarios; for example, the audience and the anchor send control signaling through the Data Channel to control camera movements or the characters in a game. Flexibility was considered in the design: the Data Channel provides standalone Pub/Sub capability and can be used on its own, that is, a URL may carry nothing but a data channel. We provide a separate transmission guarantee for the Data Channel; the delay can be kept at about 100-200 milliseconds nationwide, and within the same region it is basically one RTT. Online, the average RTT is about 20 milliseconds.
Another application scenario is live-room messaging. With a traditional messaging system, it is basically second-level delay with a subscription or push-pull model; we can now achieve delays within a hundred milliseconds. The business side can be flexibly customized: for example, a server can connect to a message channel on its own, and each client can subscribe to that message channel separately, for business activities and real-time message delivery.
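To make the standalone Pub/Sub usage of the Data Channel concrete, here is a toy in-process sketch; the broker class and method names are hypothetical and only illustrate the model of a URL carrying nothing but a message channel:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class DataChannelBroker:
    """Toy stand-in for a data-channel message service with per-URL publish/subscribe."""
    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, url: str, on_message: Callable[[bytes], None]) -> None:
        self._subs[url].append(on_message)

    def publish(self, url: str, payload: bytes) -> None:
        # In the real system this would travel over the RTC data channel,
        # which has its own transmission guarantees.
        for cb in self._subs[url]:
            cb(payload)

broker = DataChannelBroker()
# A client subscribes to the room's message channel only, without pulling audio/video.
broker.subscribe("grtn://room-123/msg", lambda m: print("room message:", m.decode()))
# A business server pushes a real-time activity message to every subscriber.
broker.publish("grtn://room-123/msg", b"flash sale starts in 10 seconds")
```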
Fourth, the media and signaling channels are unified, and signaling delivery is reliable. We guarantee that signaling reliably arrives: as everyone knows, packets are easily lost over UDP on the public network, so we have strategies such as fast retransmission and redundancy to make sure signaling is quickly reachable.
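A simplified sketch of the kind of redundancy-plus-fast-retransmission logic described here; the parameters (two redundant copies, a 50 ms ACK timeout, five retries) are assumptions for illustration:

```python
def send_signaling(message: bytes, send_udp, wait_for_ack,
                   redundancy: int = 2, retries: int = 5, rto_s: float = 0.05) -> bool:
    """Deliver one signaling message over an unreliable (UDP-like) transport.

    redundancy -- send each attempt several times back-to-back, so a single
                  packet loss does not cost a whole retransmission round trip.
    retries/rto_s -- fast retransmission with a short ACK timeout.
    """
    for _ in range(retries):
        for _ in range(redundancy):          # redundant copies of this attempt
            send_udp(message)
        if wait_for_ack(timeout=rto_s):      # ACK came back -> delivered
            return True
    return False                             # report failure after all retries

# Wiring with dummy transport hooks just to show the call shape:
ok = send_signaling(b"SUBSCRIBE grtn://room-123",
                    send_udp=lambda m: None,
                    wait_for_ack=lambda timeout: True)
print("delivered:", ok)
```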
Fifth, network-wide flexible stream-switching capability at the URL/Stream level. As everyone knows, in the mic-connect case the anchor is already pushing a stream in the live room. In the original approach, mic-connect went through the self-built RTC cluster, and that cluster handled the stream switch: because all of the anchor's streams passed through the self-built cluster, the data source was controllable, so the stream could be switched without affecting downstream playback. With full-link RTC, however, the system is distributed: when two anchors want to connect, their access points are different. The stream-switching capability must guarantee a clean switch during the handover and ensure that the audience across the whole network cuts to the mixed picture at the same moment. This works at both the URL and the Stream level, which means different streams can also be switched freely.
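One way to make "every viewer cuts to the mixed picture at the same moment" workable in a distributed network is to anchor the switch to a position in the stream rather than to wall-clock time. The sketch below illustrates that idea; the field names and the use of a presentation timestamp as the cut point are assumptions, not the actual GRTN mechanism:

```python
from dataclasses import dataclass

@dataclass
class SwitchCommand:
    url: str              # the playback URL whose source is being swapped
    from_stream: str      # e.g. the anchor's single stream
    to_stream: str        # e.g. the MCU's mixed stream
    switch_at_pts: int    # stream timestamp at which every edge node cuts over

def maybe_switch(current_stream: str, frame_pts: int, cmd: SwitchCommand) -> str:
    """Run on every edge node: keep forwarding the old stream until the agreed
    timestamp, then cut to the new stream, so all viewers switch together."""
    if current_stream == cmd.from_stream and frame_pts >= cmd.switch_at_pts:
        return cmd.to_stream
    return current_stream

cmd = SwitchCommand("grtn://room-123/main", "anchor-A", "mcu-mixed", switch_at_pts=90_000)
print(maybe_switch("anchor-A", 89_000, cmd))   # still "anchor-A"
print(maybe_switch("anchor-A", 90_010, cmd))   # now "mcu-mixed"
```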
The traditional live broadcast link uses TCP. Because TCP lives in the kernel, its strategies are provided by the system and are hard to customize and optimize. We now use full-link RTC over UDP, which is unreliable, so the QoS strategy across the whole link becomes very important.
The audio and video system has a characteristic: from capture to pre-processing to encoding, then to transmission, decoding, and rendering, it is a serial process, and a problem at any stage affects the overall experience.
In terms of indicators, there are success rate, freeze rate, second-open (instant start) rate, delay, and so on.
The scenes include live broadcast, mic-connect, calls, and conferences. Conferences are also something we plan to do, as well as multi-person interaction within live e-commerce. When we do QoS, we must consider multiple scenarios, the overall RTC transmission link, and the optimization of the core indicators.
The congestion control algorithms currently used are mainly BBR and GCC. These two were originally taken from WebRTC as modules; later we almost completely rewrote and refactored them and improved their performance, with deep customization and optimization for the live broadcast scene. Live broadcast emphasizes throughput, while the conference scene emphasizes smoothness, so the tuning is somewhat different.
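As a rough illustration of what a GCC-family, delay-gradient-based estimator does (a minimal sketch with made-up thresholds, not the production algorithm):

```python
def update_bwe(estimate_bps: float, delay_gradient_ms: float, loss_ratio: float,
               overuse_threshold_ms: float = 2.0) -> float:
    """Toy GCC-style rate update: back off on loss or a rising delay gradient,
    otherwise probe gently upward."""
    if loss_ratio > 0.10:                        # heavy loss -> multiplicative decrease
        return estimate_bps * (1 - 0.5 * loss_ratio)
    if delay_gradient_ms > overuse_threshold_ms: # queue delay is growing -> overuse
        return estimate_bps * 0.85
    return estimate_bps * 1.05                   # stable or underuse -> increase

bwe = 2_000_000.0
for gradient_ms, loss in [(0.5, 0.0), (0.8, 0.0), (3.5, 0.02), (0.2, 0.0)]:
    bwe = update_bwe(bwe, gradient_ms, loss)
    print(f"estimate: {bwe / 1e6:.2f} Mbps")
```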
In addition, we run large-scale A/B tests of different algorithms in different business scenarios, and dynamically adjust their parameters according to the indicators and delays collected by the data system, continuously optimizing and improving. There is also the bandwidth allocation algorithm: on the server there are audio bandwidth, retransmission bandwidth, video bandwidth, the layered bandwidth of SVC, and small streams. How to allocate these bandwidths and which strategies to use in which scenarios are the problems the bandwidth allocation algorithm solves.
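A minimal sketch of one simple way to do this, a priority-ordered ("waterfall") allocation; the traffic classes, priorities, and numbers are hypothetical:

```python
from typing import Dict, List, Tuple

def allocate_bandwidth(total_bps: int, demands: List[Tuple[str, int]]) -> Dict[str, int]:
    """Walk the traffic classes in priority order and give each one as much of
    its demand as the remaining budget allows."""
    allocation: Dict[str, int] = {}
    remaining = total_bps
    for name, wanted in demands:
        granted = min(wanted, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation

# Hypothetical demands in bits per second, highest priority first.
demands = [
    ("audio",         64_000),
    ("retransmit",   200_000),
    ("video_base", 1_200_000),
    ("svc_enhance",  800_000),
    ("small_stream", 300_000),
]
print(allocate_bandwidth(1_500_000, demands))
# Audio, retransmission, and the base video layer are served first; the SVC
# enhancement layer only gets the leftover and the small stream gets nothing.
```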
The strategy control part mainly includes FEC, ARQ, SVC, and so on. RED is used to keep audio latency low; the JitterBuffer has also been optimized; and the Pacer has been improved for the high throughput of live broadcast, for example with a multi-packet sending strategy. There are also overall improvements to the intermediate links, as well as NetEQ, delay control, and so on.
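A minimal sketch of a pacer that drains the send queue in small bursts per pacing tick, the "multi-packet transmission" idea mentioned above; the tick interval and burst size are assumptions:

```python
from collections import deque

class BurstPacer:
    """Drain the send queue in small bursts at a fixed pacing interval, so a
    high-bitrate live stream does not fall behind a one-packet-per-tick pacer."""
    def __init__(self, send_fn, burst_size: int = 4):
        self.queue = deque()
        self.send_fn = send_fn
        self.burst_size = burst_size

    def enqueue(self, packet: bytes) -> None:
        self.queue.append(packet)

    def on_pacing_tick(self) -> None:
        # Called by a timer (e.g. every 5 ms); sends up to `burst_size` packets.
        for _ in range(min(self.burst_size, len(self.queue))):
            self.send_fn(self.queue.popleft())

pacer = BurstPacer(send_fn=lambda p: print("send", len(p), "bytes"), burst_size=4)
for _ in range(6):
    pacer.enqueue(b"\x00" * 1200)
pacer.on_pacing_tick()   # sends 4 packets
pacer.on_pacing_tick()   # sends the remaining 2
```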
Next, let me talk about delay optimization stage by stage.
For delay optimization, a simple scene with a relatively small amount of data is easy to handle. But our user numbers, counting both anchors and viewers, amount to hundreds of millions of views every day, so ensuring that the average delay of the entire network goes down is a great challenge.
We think about delay this way: the live broadcast system is mainly divided into the push side and the player, with the transmission network in the middle, and each stage needs to be optimized separately. In the transmission network, when the network is good, the delay is relatively controllable; if the network is not good, QoS policies are activated. On the push side there are capture, pre-processing, and encoding; encoding is divided into hardware encoding, software encoding, and some self-developed encoders.
There are also strategies for dynamically adjusting the send buffer. On the receiving side, the larger delay comes mainly from the receive buffer: if there is packet loss or jitter on the upstream network, the receive buffer is dynamically enlarged, and the added delay absorbs those problems. Decoding involves decoding speed, the decoding buffer, and different decoders, such as hardware and software decoders. There is also post-processing, that is, the delay of the algorithms that run after decoding, and finally the delay of rendering.
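A toy version of the receive-buffer adaptation described here: the target buffering delay grows when loss or jitter is observed and shrinks slowly when the network is calm. The thresholds and scaling factors are illustrative assumptions:

```python
def adapt_receive_buffer_ms(current_ms: float, observed_jitter_ms: float,
                            loss_ratio: float,
                            min_ms: float = 80.0, max_ms: float = 1000.0) -> float:
    """Grow the target receive-buffer delay under jitter/loss, shrink it slowly
    when the network is stable, and clamp it to [min_ms, max_ms]."""
    if loss_ratio > 0.02 or observed_jitter_ms > current_ms * 0.5:
        target = current_ms * 1.25            # network is rough -> buffer more
    else:
        target = current_ms * 0.98            # stable -> slowly trade buffer for latency
    return max(min_ms, min(max_ms, target))

buf = 200.0
for jitter_ms, loss in [(10, 0.0), (150, 0.03), (160, 0.03), (12, 0.0), (8, 0.0)]:
    buf = adapt_receive_buffer_ms(buf, jitter_ms, loss)
    print(f"target buffer: {buf:.0f} ms")
```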
Different platforms behave differently: iOS, Android, and PC. For example, what are the pre-processing and encoding delays on iOS, and what is the approximate delay distribution? There must be data reports, collected online, for targeted optimization. The same goes for Android and PC.
There is targeted optimization at each stage. For encoding, for example, we optimize together with the algorithm team and handle the encoding part separately on different platforms. Encoding data is also reported online and broken down by platform, encoder, and version according to the data instrumentation system.
The second point is the intermediate links, where the QoS strategies are mainly applied. There is also scheduling, done in cooperation with the CDN: planning the shortest transmission path according to the network quality and cost between links, a combined quality-and-cost strategy. Together with the QoS strategy, this keeps the transmission delay under control.
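A sketch of what "shortest path by combined quality and cost" could look like: Dijkstra over the node graph with an edge weight that blends measured RTT and unit bandwidth cost. The node names, numbers, and the 0.3 weighting factor are assumptions:

```python
import heapq

def best_path(graph: dict, src: str, dst: str, cost_weight: float = 0.3):
    """graph[u][v] = (rtt_ms, unit_cost); edge weight = rtt + cost_weight * unit_cost.
    Returns (total_weight, path) computed with Dijkstra."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        weight, node, path = heapq.heappop(pq)
        if node == dst:
            return weight, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, (rtt, cost) in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(pq, (weight + rtt + cost_weight * cost, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical nodes: anchor-side L1, two L2 relays, viewer-side L1.
graph = {
    "L1-hz": {"L2-a": (8, 5), "L2-b": (20, 1)},
    "L2-a":  {"L1-bj": (12, 5)},
    "L2-b":  {"L1-bj": (15, 1)},
    "L1-bj": {},
}
print(best_path(graph, "L1-hz", "L1-bj"))   # picks the lower-weight relay L2-a
```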
The third point is that, since the number of users is very large, the delay distribution of the whole network at each stage is statistically analyzed, and different reports are produced every day.
For the data system, I will mainly introduce four parts:
First, full-link tracing with quality display and analysis. With this system, every server hop that each link passes through can be traced, along with the link status between hops, including error codes and bit rates; it is a full-link analysis. For the link from each user back to the anchor, the quality of every hop can be checked via a unique ID, so different online problems can be analyzed in detail.
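A sketch of the kind of per-hop record such a full-link trace could aggregate under one unique ID; the field and hop names are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HopReport:
    trace_id: str      # unique ID shared by every hop of one viewer-to-anchor link
    hop: str           # e.g. "player", "L1-edge", "L2-relay", "anchor-push"
    rtt_ms: float
    bitrate_kbps: int
    error_code: int    # 0 means healthy

def print_trace(reports: List[HopReport]) -> None:
    """Walk the hops of one trace in order and flag any unhealthy hop."""
    for r in reports:
        status = "OK" if r.error_code == 0 else f"ERR {r.error_code}"
        print(f"{r.trace_id} {r.hop:10s} rtt={r.rtt_ms:5.1f}ms "
              f"bitrate={r.bitrate_kbps}kbps {status}")

print_trace([
    HopReport("t-42", "player",   35.0, 1800, 0),
    HopReport("t-42", "L1-edge",  12.0, 1850, 0),
    HopReport("t-42", "L2-relay", 18.0, 1400, 1002),   # the problematic hop
])
```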
Second, segmented delay analysis and display. This system breaks down delay by platform, by anchor, and by stage, such as encoding, decoding, buffering, rendering, and transmission.
Third, the configuration and A/B system. Both the push side and the player side on the device have a large number of configurations, such as delay algorithms and business strategies, which can be analyzed and compared across different businesses. For example, to prove whether our delay optimization has a positive effect on e-commerce transactions, we pick some jewelry live broadcasts, some clothing live broadcasts, and so on, and run A/B tests by industry to see how much improvement there is.
Fourth, the RTC quality dashboard. There is an overall quality display by region and by domain name, including push-stream quality, pull-stream quality, the quality of the intermediate links, and so on.
The above is what I shared today, thank you all!
About Qiniu Cloud, ECUG and ECUG Meetup
Qiniu Cloud: Qiniu Cloud was established in 2011. As a well-known domestic cloud computing and data service provider, Qiniu Cloud continues to invest deeply in core technologies in the fields of massive file storage, CDN content distribution, video on demand, interactive live streaming, and large-scale heterogeneous data analysis and processing, and is committed to fully driving the digital future with data technology and empowering all industries to enter the data age.
ECUG: Its full name is Effective Cloud User Group. Formerly CN Erlounge II, it was established in 2007 and initiated by Xu Shiwei, and it is an indispensable high-end frontier group in the field of technology. As a window onto the industry's technological progress, ECUG brings together many technical people, pays attention to current hot technologies and cutting-edge practices, and jointly leads the technological transformation of the industry.
ECUG Meetup: A series of technology-sharing events jointly created by ECUG and Qiniu Cloud, positioned as offline gatherings for developers and technical practitioners. The goal is to create a high-quality learning and social platform for developers, where participants co-create, share, and influence each other's knowledge, generating new insights that promote cognitive development and technological progress, and advancing the industry as a whole through exchange.