Editor's note: The Global Software Case Study Summit ("TOP100Summit" for short) was recently held in Beijing. It is an annual list of case studies in the technology industry that selects the 100 cases most worth learning from each year. It aims to reveal the practices and thinking behind excellent R&D teams, distill the best learning paths for readers, and draw out the long-tail value of each case.
In the "Architecture Evolution / Engineering Practice / Open Source Adoption" track of the TOP100 summit, Liu Yong, chief architect of Agora, gave a talk titled "QoE-Driven Distributed Real-Time Network Construction: The Evolution of Agora SD-RTN". He shared how SD-RTN and the Agora RTC system have been upgraded and expanded incrementally, with continuously improving real-time interactive experience, without downtime or major failures while online, supporting billions of minutes of customer communication per day.
Liu Yong received his bachelor's degree from Zhejiang University in 2001 and his Ph.D. from Tsinghua University in 2013. He joined Agora in 2014, where he works on overall system design and development. He proposed, designed, and led the development of the SD-RTN system and built the Agora RTC system on top of it. He is a loyal fan of the C++ language and a perpetual newcomer to the Internet field.
Abstract
With the development of network and video technology, interactive applications built on real-time audio and video place new demands on long-distance network transmission: low latency and high reliability. Unlike traditional network architectures and SDN technology, SD-RTN builds a low-latency, high-reliability network around the idea of an overlay network. Together with AUT, a UDP-based multiplexed transport protocol, it provides the underlying network guarantees for real-time services such as RTC and ensures the end-to-end experience of Agora RTE users worldwide.
• RTC services and network transmission are layered and decoupled. Unlike traditional RTC protocols such as RTP, which tightly couple audio and video media streams with the network transport protocol, this gives a scalable, flexible, and specialized system architecture for ensuring quality of service
• Overlay network and SDN ideas are used to guarantee in-network transmission quality for a globally deployed real-time network cloud of considerable scale
• A Layer 4 multiplexed real-time transport protocol, AUT, is proposed. Based on it, the quality of experience of Agora RTC/RTE on the last mile is improved, and upper-layer applications get an abstract, flexible control mechanism
SD-RTN and Agora RTC service architecture
RTC system architecture
Real-Time Communication (RTC) is, as the name implies, communication, so two or more parties must initiate a connection or handshake. In the common two-party scenario, the two parties can set up a P2P connection through a signaling service and establish a data channel directly.
RTC system architecture evolution (P2P/Mesh architecture)
Advantage: the server is lightly loaded and only handles signaling logic
Disadvantages:
• Interconnection depends on the network environment of both parties. The NAT traversal success rate is low, and availability cannot be guaranteed
• The two parties are in different network environments, and communication quality depends on the quality of the Internet path between them. Quality is unstable across regions and across network autonomous systems, so a stable QoE guarantee cannot be provided
• In multi-party conference scenarios, P2P channels must be established pairwise, which significantly reduces availability and also causes uplink waste and performance problems
• Poor scalability of system architecture
Conclusion: given the factors above, the P2P architecture is not suitable as the underlying architecture for a global provider of basic RTC services. It can only be applied in small, restricted domains, or as a useful supplement that improves coverage quality in specific scenarios.
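The scaling problem of the mesh can be made concrete with a back-of-the-envelope sketch. The function names are illustrative, and the availability estimate assumes that every pairwise channel is established independently with the same success probability `p`; both are assumptions for illustration, not figures from the talk.

```python
def mesh_channels(n: int) -> int:
    """Number of pairwise P2P channels in a full mesh of n participants."""
    return n * (n - 1) // 2


def mesh_uplink_copies(n: int) -> int:
    """Copies of its own stream each participant must upload in a full mesh."""
    return n - 1


def mesh_all_connected_probability(n: int, p: float) -> float:
    """Probability that every pairwise channel succeeds.

    Assumes each channel succeeds independently with probability p.
    """
    return p ** mesh_channels(n)


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, mesh_channels(n), mesh_uplink_copies(n),
              round(mesh_all_connected_probability(n, 0.95), 3))
```

The channel count grows quadratically with the number of participants, while a server-relayed star needs only one connection and one uplink stream per participant.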
RTC System Architecture (MCU Architecture)
The MCU (Multipoint Control Unit) solution consists of one server and multiple terminals arranged in a star topology. Each terminal sends its audio and video stream to the server. The server mixes the streams of all terminals in the same room into a single mixed audio and video stream and sends it to each terminal.
Disadvantages:
• The mixed-stream server consumes a lot of resources
• Large delay
• Poor scalability
• Poor flexibility
RTC system architecture (SFU architecture)
The SFU (Selective Forwarding Unit) solution consists of a server and multiple terminals. The SFU does not mix audio and video. After receiving the audio and video stream published by a terminal, it forwards that stream directly to the other terminals in the room according to their subscriptions.
This publish-subscribe model enables flexible multi-party interaction scenarios. The simplified model in the figure above still has two problems: 1) in a multi-party scenario, if a single server serves geographically distributed participants, coverage quality cannot be guaranteed; 2) a single server limits the number of concurrent participants in a session, so the system scales poorly.
So the industry naturally arrived at a distributed, multi-server collaborative solution in which each participant connects to the nearest server where possible. It offers:
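A minimal sketch of the publish-subscribe forwarding idea follows. The `Sfu` class and its method names are hypothetical and exist only for illustration; they are not Agora's API. The point is that the server forwards each packet unchanged to the subscribers of its publisher and never mixes streams.

```python
from collections import defaultdict


class Sfu:
    """Toy selective forwarding unit: forwards packets, never mixes them."""

    def __init__(self):
        # publisher id -> set of subscriber ids
        self._subscribers = defaultdict(set)

    def subscribe(self, subscriber: str, publisher: str) -> None:
        self._subscribers[publisher].add(subscriber)

    def unsubscribe(self, subscriber: str, publisher: str) -> None:
        self._subscribers[publisher].discard(subscriber)

    def forward(self, publisher: str, packet: bytes):
        """Return the (subscriber, packet) deliveries for one published packet."""
        targets = self._subscribers.get(publisher, set())
        return [(subscriber, packet) for subscriber in sorted(targets)]


if __name__ == "__main__":
    sfu = Sfu()
    sfu.subscribe("bob", "alice")
    sfu.subscribe("carol", "alice")
    print(sfu.forward("alice", b"frame-1"))
```

Because each terminal chooses what it subscribes to, the server's work is selection and copying rather than decoding and mixing, which is what distinguishes it from the MCU approach.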
• Good scalability
• Good coverage
• but poses a challenge to the network quality between servers
Extending the SFU into a distributed architecture has the following advantages:
• Servers are close to the participants they serve, avoiding the last-mile degradation caused by long-distance access
• Because RTC communication is regional in nature, converging a region's communication sessions on that region's edge service center helps reduce communication latency
Disadvantages:
• Traffic between different edge clusters crosses networks, which increases cost
• Data exchange between different edge clusters challenges network quality, for example across regions, countries, and operators
With a distributed edge computing architecture, optimizing the network quality between edge computing centers minimizes the impact of public Internet instability on user experience. Based on this idea, Agora proposed the concept of SD-RTN.
Agora SD-RTN
Design goals:
Unlike protocols such as RTP/RTCP and WebRTC and the server-side architectures derived from them, our design aims to reduce the complexity caused by system coupling through a horizontally layered system. With a network transport protocol and service architecture layer that is independent of the audio and video media protocols, RTC services can focus on their own business logic, while engineers working on network algorithms, protocol design, and network hardware architecture can apply their respective expertise to meet the QoS requirements of upper-layer services:
• Protocol decoupling
• Service decoupling
• Fully and flexibly use the existing network infrastructure, such as public Internet, dedicated lines, etc.
• Security
Agora SD-RTN abstracts the network transmission requirements of the distributed RTC architecture (low latency, high reliability), adopts a layered protocol design, and decouples RTC services from network transmission, achieving layering and decoupling of protocols, modules, and services:
• SD-RTN presents a Layer 3 interface of an overlay network to the upper layer
• SD-RTN is a UDP-based distributed network system that runs on heterogeneous networks and does not depend on specific hardware or software. It performs real-time routing and traffic scheduling according to different QoS requirements
SD-RTN and Agora hierarchical service architecture
The following figure shows Agora's overall service architecture. SD-RTN and the Layer 4 transport protocol AUT form the network foundation of Agora's real-time cloud:
Agora SD-RTN architecture
The SD-RTN system consists of a control plane and a forwarding plane:
• Control plane
Link detection and capacity assessment system
Edge node information collection system
Routing scheduling system
Management system
• Forwarding plane
The link detection and capacity assessment system, and the routing scheduling system are detailed below.
1. Link detection and capacity assessment system: periodically measures network quality between server clusters according to a scheduling strategy, analyzes the network model (especially quality under lossy conditions), and produces an aggregated assessment
2. Routing scheduling system: the routing analysis and scheduling system is similar to an SDN controller. It is a set of real-time, intelligent parallel computing services responsible for route planning and load balancing. It computes and distributes the routes of data flows in the network based on network-wide link quality, real-time transmission bandwidth between nodes, QoS requirements, forwarding-node load, and other inputs
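To illustrate how measured link quality can feed route computation, here is a deliberately simplified sketch. It is not Agora's scheduling algorithm: it assumes a hypothetical `links` map of measured latency and loss per directed link, drops links whose loss exceeds a hypothetical `max_loss` threshold, and runs a plain Dijkstra search on latency. A production scheduler would also weigh bandwidth, node load, and QoS class.

```python
import heapq


def best_route(links, src, dst, max_loss=0.05):
    """Lowest-latency path over links whose measured loss is acceptable.

    links: {(from_node, to_node): (latency_ms, loss_rate)} directed measurements.
    Returns (total_latency_ms, [nodes]) or None if dst is unreachable.
    """
    graph = {}
    for (a, b), (latency, loss) in links.items():
        if loss <= max_loss:  # ignore links judged too lossy
            graph.setdefault(a, []).append((b, latency))

    heap = [(0.0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, latency in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (cost + latency, nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    measured = {
        ("A", "B"): (120, 0.001),  # direct but slow
        ("A", "C"): (30, 0.002),
        ("C", "B"): (40, 0.001),
        ("A", "D"): (10, 0.200),   # fast but too lossy
        ("D", "B"): (10, 0.000),
    }
    print(best_route(measured, "A", "B"))
```

In this toy data the relay through C wins over both the direct link and the lossy shortcut through D, which is the kind of decision an overlay network can make and the underlying Internet routing does not.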
Agora SD-RTN and SDN
In the design and continuous evolution of the RTN system, some ideas have been borrowed from existing network design practices, especially the architecture of SDN.
The design ideas of SD-RTN and SDN are generally similar, mainly as follows:
• Separate the complex control plane logic of the router from the forwarding plane logic
• The control plane's routing strategy is configured or computed by a centralized controller (the SDN controller in SDN)
Differences:
• The SDN forwarding plane relies on flow tables to control forwarding logic. As the network grows, querying, maintaining, and updating flow tables becomes complicated, especially with multiple hops; SD-RTN uses SR (segment routing) and similar techniques to simplify the forwarding logic
• Based on the assessed link state and the required QoS level, SD-RTN applies FEC or multi-path redundancy at the lower layer to achieve real-time, reliable packet-level delivery
• SD-RTN is an overlay network design that does not depend on specific hardware or software, and can use the public Internet and dedicated lines simultaneously for link computation and traffic distribution
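The contrast with flow tables can be sketched as follows, assuming that "SR" here refers to segment routing in the source-routing sense. The `Packet` type and function names are hypothetical. The ingress node writes the controller-computed hop list into the packet, and every relay simply pops the next hop, so relays hold no per-flow state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    payload: bytes
    segments: List[str]  # remaining hops, written once by the ingress node


def next_hop(packet: Packet) -> Optional[str]:
    """Pop and return the next hop, or None if this node is the destination.

    No flow-table lookup is needed: the route travels with the packet.
    """
    if not packet.segments:
        return None
    return packet.segments.pop(0)


def deliver(path: List[str], payload: bytes) -> List[str]:
    """Simulate forwarding along a controller-computed path; return nodes visited."""
    packet = Packet(payload, list(path[1:]))
    visited = [path[0]]
    while (hop := next_hop(packet)) is not None:
        visited.append(hop)
    return visited


if __name__ == "__main__":
    print(deliver(["A", "C", "B"], b"audio-frame"))
```

Changing a route then only requires updating the ingress node, rather than synchronizing table entries on every hop along the path.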
Agora SD-RTN
The evolution of SD-RTN is divided into three stages:
• Initial stage
SD-RTN and RTC services were tightly coupled. Apart from link evaluation and routing algorithms, the protocol itself and the services were built into the RTC access and forwarding servers
• More mature stage
The RTC and RTN protocols are layered and modularized, most services are decoupled, and RTN provides dedicated services for Agora RTC
• Current and future directions:
RTN is becoming a service, providing service-oriented interfaces for Agora cloud services (in progress)
Agora SD-RTN
• Development efficiency
The introduction of SD-RTN and AUT (see below) means upper-layer services no longer need to care about the quality of the underlying network transmission and can focus on their own business logic. This reduces system complexity, simplifies the business model, and shortens the iteration cycle of RTC service development
• Transmission quality
For different QoS requirements, SD-RTN provides correspondingly different transmission strategies and works with the AUT protocol to meet the required quality.
SD-RTN focuses on and optimizes two technical indicators of network quality:
• Latency
• Packet delivery success rate
Agora SD-RTN quality index (time delay)
The packet delivery success rate is further subdivided. For common Agora RTC requirements, RTN tracks the following indicators and continuously optimizes them:
1. Compliant service time with a packet arrival rate above 99.9% within a 2 s delay. This indicator targets the latency requirements of audiences in general live streaming. When it is met, most live-stream viewers get smooth playback, other factors aside. (This is already broadly better than CDN-based live streaming solutions.)
2. Compliant service time with a packet arrival rate above 99.9% within an 800 ms delay. This indicator targets the quality requirements of audiences in Agora's ultra-low-latency live streaming scenario
3. Compliant service time with a packet arrival rate above 99.9% within a 200 ms delay. This indicator targets ordinary RTC communication. When it is met, the two parties can converse smoothly, without noticeable delay or talking over each other
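One plausible way to compute such an indicator is sketched below. How Agora actually windows and aggregates its measurements is not described in the talk, so the per-window evaluation, the function names, and the convention that a lost packet is recorded as `None` are all assumptions.

```python
def arrival_rate(delays_ms, threshold_ms):
    """Fraction of sent packets that arrived within threshold_ms.

    delays_ms holds one entry per sent packet; None marks a lost packet.
    """
    if not delays_ms:
        return 0.0
    on_time = sum(1 for d in delays_ms if d is not None and d <= threshold_ms)
    return on_time / len(delays_ms)


def compliant_time_ratio(windows, threshold_ms, target=0.999):
    """Share of measurement windows whose arrival rate meets the target."""
    if not windows:
        return 0.0
    ok = sum(1 for w in windows if arrival_rate(w, threshold_ms) >= target)
    return ok / len(windows)


if __name__ == "__main__":
    good = [50] * 1000                 # every packet arrives in 50 ms
    bad = [50] * 998 + [250, None]     # one late packet, one lost packet
    for threshold in (200, 800, 2000):
        print(threshold, compliant_time_ratio([good, bad], threshold))
```

The same traffic can pass the 800 ms and 2 s indicators while failing the 200 ms one, which is why the three thresholds are tracked separately.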
Agora SD-RTN quality index (jitter 200ms arrival rate)
Challenges and problems faced by Agora SD-RTN
• Scalability (through the cooperation with the RTC system to achieve horizontal expansion)
• Link quality assessment and capacity assessment under lossy divergent networks
• Fast traffic scheduling algorithm (NP problem)
• Security (IPsec)
Agora RTC over RTN architecture
On top of the transport layer, the Agora RTC system has the following main services:
• AP/LBS service
• RTC SFU service
1. Native
2. WebRTC
3. RTMP
• Channel synchronization service and subscription service
• Capacity negotiation and arbitration services
Agora Universal UDP-based Transport Protocol (Aut)
An RTC scenario requires:
• A reliable network channel for sending and receiving control messages
• Multiple real-time channels, as reliable as possible, for sending and receiving multiple data streams (audio, video, etc.)
• Priority management of upstream flows when bandwidth is limited
• In To B scenarios, the ability for customers to determine stream priorities and transmission degradation strategies independently and flexibly
Challenges to network transmission in RTC scenarios
For example, when bandwidth is limited, control commands often need high-priority transmission. Scenario-specific applications add their own rules, such as guaranteeing audio first, or prioritizing the teacher's streams over the students'. Consider a conventional implementation in which control commands go over a TCP channel and each audio or video stream goes over its own RTP/RTCP channel. When multiple streams compete:
• If the RTP/RTCP channels adopt a TCP-friendly congestion control strategy, the audio and video streams are treated like any other data stream on the network and cannot obtain high-priority guarantees
• If an aggressive congestion control strategy is adopted, it may starve the RTC control command channel
• It is unclear how to coordinate the congestion control strategies of multiple RTP/RTCP channels so that high-priority streams are protected
In this scenario, we need a multiplexed transport channel that manages stream priorities and schedules all streams as a whole under a single congestion control module:
• The biggest problem with logical multiplexing over a TCP channel, as in RTMP, is that TCP causes head-of-line blocking
• QUIC targets web application scenarios. At the protocol level it implements multiplexing, priority management, and avoidance of head-of-line blocking in the transport channel, but it does not support real-time unreliable data streams
• Users of real-time streams need more control over the underlying transport than users of reliable data streams. Designing and implementing the media and network transport layers is not trivial
• Flexibility versus customization for major customers: if flexibility is not designed in well, many major customers' requirements turn into custom requirements that have to be solved with large amounts of hard-coding
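The idea of managing stream priorities under one shared congestion budget can be sketched as a strict-priority allocation. This is an illustration only, not the AUT scheduler: the `allocate` function, the tuple layout, and the convention that a lower number means higher priority are assumptions.

```python
def allocate(budget_kbps, streams):
    """Split one congestion-controlled budget across streams by strict priority.

    streams: list of (name, priority, demand_kbps); lower priority number wins.
    Returns {name: granted_kbps}.
    """
    allocation = {}
    remaining = budget_kbps
    for name, _priority, demand in sorted(streams, key=lambda s: s[1]):
        granted = min(demand, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


if __name__ == "__main__":
    streams = [
        ("video", 2, 800),
        ("control", 0, 20),
        ("audio", 1, 50),
    ]
    print(allocate(500, streams))  # video absorbs the shortfall
    print(allocate(60, streams))   # control first, then audio, video starved
```

Because all streams share a single budget, the sender as a whole can stay fair to other traffic on the network while still deciding internally which of its own streams degrades first.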
Design goals:
• Versatility: one protocol design meets the needs of different scenarios, covering not only RTC but also reliable data channels
• Native stream support in the transport protocol:
1. Multiplexing, flexible priority management
2. Custom stream metadata (Stream Meta) piggybacked on the stream lets users make their own stream management decisions
• Flexible congestion control module interface, which can be extended to realize different congestion control algorithms
• An underlying network interface that supports SD-RTN, UDP sockets, and any virtual network
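A pluggable congestion control interface of the kind listed above might look like the following sketch. The `CongestionController` interface and the `Aimd` implementation are hypothetical and do not reflect AUT's real interface. The classic additive-increase, multiplicative-decrease rule is used only because it is the simplest algorithm to plug in.

```python
from abc import ABC, abstractmethod


class CongestionController(ABC):
    """Hypothetical extension point: the transport only talks to this interface."""

    @abstractmethod
    def on_ack(self, acked_bytes: int) -> None: ...

    @abstractmethod
    def on_loss(self) -> None: ...

    @property
    @abstractmethod
    def window_bytes(self) -> int: ...


class Aimd(CongestionController):
    """Additive increase, multiplicative decrease."""

    def __init__(self, mss: int = 1200, initial_window: int = 12000):
        self._mss = mss
        self._window = initial_window

    def on_ack(self, acked_bytes: int) -> None:
        # Grow by roughly one MSS per full window of acknowledged data.
        self._window += self._mss * acked_bytes // self._window

    def on_loss(self) -> None:
        # Halve the window, but never drop below one MSS.
        self._window = max(self._mss, self._window // 2)

    @property
    def window_bytes(self) -> int:
        return self._window


if __name__ == "__main__":
    cc: CongestionController = Aimd()
    cc.on_ack(12000)
    cc.on_loss()
    print(cc.window_bytes)
```

A different algorithm, for example a delay-based one tuned for real-time media, could be swapped in by implementing the same three members without touching the stream or multiplexing code.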
The Aut protocol design draws on the QUIC protocol design but has been substantially redesigned:
• Removed some version management and negotiation mechanisms
• Added information mechanisms such as Stream Option/Meta in support of real-time streaming scenarios
• Designed the interface and implementation of real-time streaming
In addition to the support for real-time streaming, the Aut protocol also includes:
• Encryption
• Connection migration
• FEC support
• MultiPath (in experiment)
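To show what FEC support means in principle, the sketch below uses the simplest possible scheme: one XOR parity packet per group, which can repair a single lost packet without retransmission. The talk does not say which FEC code Aut uses, so this is a generic illustration with hypothetical function names, and it assumes packets in a group have equal length.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of all packets in the group."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) from the rest plus parity."""
    missing = bytearray(parity)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                missing[i] ^= byte
    return bytes(missing)


if __name__ == "__main__":
    group = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(group)
    # The second packet is lost in transit; the receiver repairs it locally.
    print(recover([b"abcd", None, b"ijkl"], parity))
```

The trade-off is extra bandwidth for the parity packet in exchange for avoiding a retransmission round trip, which matters most on long-distance, lossy links.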
Aut in RTC
The Aut protocol has been technically validated as the underlying transport in the Agora RTC SDK Nasa2 release (currently 3.0.0.18), providing high-quality transmission guarantees and flexible control mechanisms for upper-layer applications. Its responsive control and feedback mechanisms open up room for optimization in the upper-layer engine or application.
Aut in RTM
Aut Over Aut: Point-to-point network acceleration (RTNS service)
Summary
• System architecture should evolve gradually, with gray-release (staged rollout) iteration. It must fit customer needs and production scale, adopting the most reasonable and cost-effective solution at each stage while keeping the technical direction continuous and consistent
• System design should thoroughly study existing systems, designs, and implementations, and track the latest technical developments in both academia and industry
• In the To B industry, one must try to meet customer and product needs while avoiding turning the technical product into project work or outsourcing. Work with the product team to extract customers' common pain points and factor them into the overall evolution of the system
• During system iteration, two properties need to be watched:
1. Iterative linearity: control the complexity of iterating on the online system
2. Observability: use a data-driven approach to observe and confirm that system improvements are effective
ROI analysis
• For more than 6 years, SD-RTN and the Agora RTC system have been continuously upgraded and expanded, with continuously improving real-time interactive experience, without downtime or major failures while online. They support billions of minutes of customer communication per day
• As SD-RTN and AUT are gradually delivered, they provide a fast, consistent foundation for building Agora's cloud services, and the AUT protocol's capabilities, exposed through the RTC SDK, give customers flexible self-service options for their customized needs
The above content is from Liu Yong's talk.