
Communication is the eternal pursuit of mankind. We are always eager to break through the limitations of time and space and narrow the distance between people.

As technologies such as RTC and live streaming have matured, real-time, high-quality communication has become increasingly accessible. Combined with traditional IM messaging, "converged communications" has become a hot area in recent years, and Yunxin is committed to building the industry's "first brand of converged communications cloud." To achieve this goal, the transmission quality of communication data is of utmost importance, yet guaranteeing that quality over long distances and under complex network conditions has always been difficult. Beyond higher quality, lower cost and a more general, more flexible architecture are also important criteria for judging a communication system.

Against this background, WE-CAN (Communications Acceleration Network), a new generation of large-scale distributed transmission network developed by Yunxin, was born. It greatly improves end-to-end communication quality, reduces communication costs, and can be applied to a wide variety of scenarios.

WE-CAN is not only ambitious in its goals, advanced in its architecture and solid in its engineering; it has also been fully deployed and verified in Yunxin's live production business:

  1. It transmits hundreds of billions of messages and hundreds of millions of minutes of media stream data every day;
  2. It has multiple nodes in the major countries of the Asia-Pacific region, and node coverage in India, the Middle East, Europe, North America, North Africa and other regions. Every provincial-level region in China has a large number of edge nodes, and in total the network covers 200+ regions around the world;
  3. For domestic audio and video transmission, the high-quality transmission rate within the network exceeds 99.9%, and the end-to-end high-quality transmission rate exceeds 99%;
  4. Cross-border communication approaches the quality of dedicated lines, and global delay does not exceed 250 ms.

This article will analyze the architecture design and key technical difficulties of WE-CAN from various angles.

Definition of WE-CAN

WE-CAN (Communications Acceleration Network) is a complex network system built on top of the public Internet. Through intelligent scheduling of various resources, it improves data transmission quality and reduces data transmission cost.

WE-CAN's goals

As the foundation for building Yunxin's "first brand of converged communications", WE-CAN's fundamental goal is to establish a universal transmission network that can send any data stably, quickly and efficiently from any point in the world to any other. This network is built on the public Internet: it achieves its goal through software alone, without any special hardware or dedicated lines.

Compared with similar products, WE-CAN's goals are:

  • Faster than CDN
  • Cheaper than SD-WAN
  • More versatile than RTN

Advantages of WE-CAN

Yunxin's is neither the first transmission network nor will it be the last, but WE-CAN has unique advantages over similar products. Compared with a CDN, WE-CAN can likewise distribute content at large scale and at the edge, and it is faster. Compared with an RTN, WE-CAN supports not only the streaming-media RTC business but also a variety of other transmission modes.

  • WE-CAN can transmit streaming media with a high arrival rate and low latency. On top of the media's own QoS strategies, WE-CAN can optionally apply ARQ, FEC and other redundancy strategies that are transparent to the business. These strategies are shared by all of WE-CAN's other transmission modes;
  • WE-CAN can also distribute live video at large scale. Path cascading and multiplexing remove the bottleneck on the number of people in a room and reduce bandwidth cost, so that cost approaches that of a CDN while latency approaches that of RTC, which better supports low-latency live streaming scenarios;
  • WE-CAN can also reliably transmit signaling, IM or other data. "Reliable transmission" here means guaranteeing both that data arrives and that it is delivered in order;
  • WE-CAN's services and protocols are thoroughly decoupled and layered, which keeps them simple to use and flexible. For example, the reliable transmission protocol is abstracted and encapsulated behind a minimal interface that we call MessageBus. The goal of MessageBus is to provide a globally deployed distributed message queue service.

The effect of WE-CAN

The following data come from the WE-CAN production environment. For A/B testing, some Yunxin RTC channels go through WE-CAN, while others connect directly to media servers without WE-CAN.

The figure below shows the end-to-end high-quality transmission rate over the past 13 days. The arrival rate of the WE-CAN channels is significantly higher. (Here, the high-quality transmission rate is the proportion of all statistical windows in which the arrival rate is greater than 95%.)

image.png
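The definition of the high-quality transmission rate given above can be written down in a few lines. This is only a sketch of the stated definition: the window representation as `(packets_sent, packets_received)` pairs, the function name and the choice to skip empty windows are assumptions, not details of Yunxin's monitoring system.

```python
def high_quality_rate(windows, threshold=0.95):
    """Fraction of statistical windows whose arrival rate exceeds `threshold`.

    `windows` is an iterable of (packets_sent, packets_received) pairs,
    one per statistical window. Windows with nothing sent are skipped,
    which is an assumption of this sketch.
    """
    good = 0
    total = 0
    for sent, received in windows:
        if sent == 0:
            continue
        total += 1
        if received / sent > threshold:
            good += 1
    return good / total if total else 0.0


# Made-up sample: three of the four non-empty windows exceed 95%.
sample = [(100, 100), (100, 96), (100, 90), (0, 0), (200, 199)]
rate = high_quality_rate(sample)
```

Note that the metric counts windows, not packets: a single bad window lowers the rate by the same amount regardless of how much traffic it carried.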

Freeze rate comparison

The following figure shows the audio freeze rate across all Yunxin RTC traffic over the past 13 days:

image.png

Delay comparison

The following figure shows the distribution of delay by tier across all traffic over 24 hours:

image.png

WE-CAN architecture

Consider the process of sending a message from A to B. From WE-CAN's perspective, this process has 3 stages:

  1. From client A to the server A' that it is connected to;
  2. From server A' to B's server B';
  3. From server B' to client B.

image.png

Therefore, WE-CAN needs to optimize two different transmission scenarios:

  1. S2S (Server to Server), which is intra-network transmission;
  2. Last-mile, which is edge access.

The optimization of each scenario can be further divided along two dimensions: quality and cost. Intra-network transmission and edge access each have their own means of optimizing quality and cost and of balancing the two. From this perspective, WE-CAN provides high-quality, low-cost edge access and intra-network transmission for all kinds of data transmission requirements.

Intra-network transmission

Core nodes

The core system of WE-CAN's intra-network transmission consists of three types of nodes: access nodes (Edge), relay nodes (Relay) and control nodes (Controller).

image.png

Access node: Edge, the access node, is the part of WE-CAN that is physically deployed closest to the client.

Relay node: Relay is responsible for relaying data within the network. Relay nodes form an (approximately) full-mesh network with each other:

image.png

  • Relaying is most common in long-distance transmission, especially cross-border transmission, and sometimes several hops are needed. For example, when data from a Guangzhou data center is sent to Los Angeles, WE-CAN may send it to Hong Kong first and then from Hong Kong to Los Angeles, or even take the route Guangzhou-Hong Kong-Singapore-Los Angeles. The actual path is determined by current network conditions;
  • The second case is when single-line access nodes sit on different ISPs. For example, data from a Jiangsu Mobile access node bound for a Zhejiang Telecom node must be relayed through a three-line (multi-ISP) Relay in Jiangsu or Zhejiang;
  • The third common case is when one line of a multi-line data center fails or suffers network jitter. For example, when the Unicom port of a three-line node in Hangzhou fails, other domestic Unicom data centers can first send data to the nearest three-line Relay, which then sends it to the Telecom or Mobile port of the Hangzhou node. The Hangzhou node can thus keep providing service while its Unicom port is down.

Apart from a few preset rules (for example, two nodes on different ISPs must go through a relay), whether data between any two points in WE-CAN needs to be relayed, and along which path, is decided by dynamic routing based on current network conditions.

Control node: Relays probe the quality of the links between each other and report the results to the Controller.

image.png

If two Relays exchange direct data traffic (without relaying), those data packets and their Acks are used to measure link quality. For Relay-Relay links without direct traffic, WE-CAN constructs artificial probe traffic in a specific pattern and uses it to measure link quality. Whether a given Relay-Relay pair needs link quality probing at all is decided by the Controller, and some links are never probed. For example, traffic between China Telecom and China Unicom nodes always goes through a relay, so probing the direct link would be meaningless. The Controller therefore computes each Relay's networking and probing strategy from the preset configuration and delivers it to the Relay.
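As a rough illustration of how a Controller might derive such a probing strategy, the sketch below walks all Relay pairs, skips pairs that share no ISP line, and chooses between passive measurement and active probing. The rule "no shared ISP means no probing", the data layout and the names `probe_plan`, `"passive"` and `"active"` are assumptions made for this example; the article does not describe the real configuration format.

```python
from itertools import combinations


def probe_plan(relays, pairs_with_traffic):
    """Decide how each Relay-Relay link is measured.

    relays: {relay_name: set of ISP lines the relay is attached to}
    pairs_with_traffic: set of (name_a, name_b) tuples, names in sorted
        order, for pairs that currently exchange direct data traffic.
    Returns {(name_a, name_b): "passive" | "active"}; pairs that are
    never probed are simply absent from the result.
    """
    plan = {}
    for a, b in combinations(sorted(relays), 2):
        if not relays[a] & relays[b]:
            # No common ISP: traffic is always relayed, so the direct
            # link is not worth probing.
            continue
        pair = (a, b)
        # Existing data packets and Acks suffice when traffic flows;
        # otherwise artificial probe traffic is needed.
        plan[pair] = "passive" if pair in pairs_with_traffic else "active"
    return plan


relays = {
    "hz-3line": {"telecom", "unicom", "mobile"},
    "js-mobile": {"mobile"},
    "zj-telecom": {"telecom"},
}
plan = probe_plan(relays, {("hz-3line", "js-mobile")})
```

In this made-up topology, the `js-mobile`/`zj-telecom` pair is left out of the plan, while the two links to the three-line node are measured passively and actively respectively.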

Quality optimization

WE-CAN improves transmission quality within the network through real-time intelligent routing between relay nodes.

WE-CAN relay nodes regularly probe the quality of the links between each other and report the results to the control node. Once the control node has received quality information for the entire network, it computes the best routes over the whole network graph and delivers them to each relay node. When the quality of the direct connection between two access nodes is poor, the traffic between them is forwarded through relay nodes, and the routes determine which relay nodes are used and in what order.
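The route computation described here can be pictured as a shortest-path search over the probed link qualities. The following is a minimal sketch using Dijkstra's algorithm, not WE-CAN's actual routing algorithm, which the article does not specify. The cost function (latency plus a fixed penalty per unit of loss), the `loss_penalty_ms` value and all link figures are invented for illustration.

```python
import heapq


def link_cost(latency_ms, loss_rate, loss_penalty_ms=1000.0):
    """Fold latency and loss into one number (a hypothetical weighting)."""
    return latency_ms + loss_penalty_ms * loss_rate


def best_route(links, src, dst):
    """Lowest-cost path from src to dst.

    links: {(node_a, node_b): (latency_ms, loss_rate)}, bidirectional.
    Returns a list of node names, or None if dst is unreachable.
    """
    graph = {}
    for (a, b), (latency, loss) in links.items():
        cost = link_cost(latency, loss)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Invented probe results: the direct link is lossy, the detour is clean.
links = {
    ("Guangzhou", "Los Angeles"): (180.0, 0.08),
    ("Guangzhou", "Hong Kong"): (10.0, 0.0),
    ("Hong Kong", "Los Angeles"): (150.0, 0.01),
}
route = best_route(links, "Guangzhou", "Los Angeles")
```

With these numbers the direct link costs 260 while the detour through Hong Kong costs 170, so the route goes Guangzhou, Hong Kong, Los Angeles. A production router would also have to respect relay capacity and the preset cross-ISP rules, which this sketch ignores.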

Because the carrying capacity of each relay node is limited, routing must avoid overloading any relay node. When a relay node fails or its network jitter worsens, its traffic should be spread evenly across other relay nodes so that none of them is overwhelmed and triggers a network-wide avalanche. In addition, the control node collects quality information and computes and distributes routes periodically, so when a relay node experiences network jitter, route switching and avoidance lag behind to some degree. WE-CAN relay nodes therefore also modify their routing tables dynamically based on their own latest probe results, in order to respond quickly to network congestion and node failures. In short, WE-CAN improves intra-network transmission quality and reduces packet loss and delay through close cooperation among nodes: the control node periodically issues routing table updates, and the access and relay nodes modify those tables in real time.
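One simple way to picture this two-level scheme is a node that receives a ranked list of candidate next hops from the Controller and overrides the top choice locally when its own latest probes show trouble. This is a sketch under assumptions: the ranked-list table format, the `max_loss` threshold and the function name are hypothetical, and the even redistribution of load across relays is not modeled.

```python
def next_hop(dst, controller_table, local_probe, max_loss=0.05):
    """Pick the next hop toward `dst`.

    controller_table: {dst: [candidate next hops, best first]}, as
        periodically issued by the Controller (assumed format).
    local_probe: {next_hop: most recent locally measured loss rate};
        hops without a recent measurement are treated as healthy.
    Returns the first candidate whose loss is within `max_loss`,
    or None when every candidate currently looks bad.
    """
    for hop in controller_table.get(dst, []):
        if local_probe.get(hop, 0.0) <= max_loss:
            return hop
    return None


table = {"los-angeles": ["hong-kong", "singapore"]}

# Normal conditions: follow the Controller's first choice.
normal = next_hop("los-angeles", table, {})

# Hong Kong is jittering right now: switch before the next table update.
degraded = next_hop("los-angeles", table, {"hong-kong": 0.20, "singapore": 0.01})
```

The point of the split is latency of reaction: the Controller's table reflects the whole network but arrives periodically, while the local check uses only the node's own view but reacts immediately.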

WE-CAN also performs message-level ARQ (retransmission on timeout) and FEC (recovery of lost packets) between nodes to improve transmission quality. These ARQ and FEC strategies are transparent to the transmitted content. Because they consume additional bandwidth, whether they are enabled, and to what degree, is optional. WE-CAN can also send data redundantly over multiple paths between two points, improving transmission quality at the cost of several times the bandwidth. This strategy is optional as well, and WE-CAN ensures that the redundant transmission paths do not overlap with each other.
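To make the bandwidth-for-quality trade-off of FEC concrete, here is a sketch of the simplest possible scheme: one XOR parity packet per group, which can rebuild any single lost packet in that group. The article does not say which FEC code WE-CAN uses, so this is an illustration of the general idea rather than its implementation, and the function names are made up.

```python
def xor_parity(packets):
    """XOR all packets together, padding shorter ones with zero bytes."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity, lengths):
    """Rebuild the single missing packet of a group.

    received: the group's packets in order, with None for the lost one.
    parity: the parity packet produced by xor_parity for the full group.
    lengths: original length of every packet in the group, so the
        zero padding can be trimmed from the rebuilt packet.
    """
    missing = received.index(None)
    present = [p for p in received if p is not None]
    # XOR of the parity with all surviving packets leaves the lost one.
    rebuilt = xor_parity(present + [parity])
    return rebuilt[:lengths[missing]]


group = [b"alpha", b"be", b"gamma!"]
parity = xor_parity(group)

# The second packet is lost in transit; the receiver still has the parity.
restored = recover([b"alpha", None, b"gamma!"], parity, [len(p) for p in group])
```

Here one extra packet per three data packets buys protection against a single loss without waiting for a retransmission. Losing two packets of the same group is beyond this scheme, which is where ARQ or a stronger code would take over.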

Cost optimization

WE-CAN reduces the cost of intra-network transmission through public network transmission and by pushing edge nodes closer to users.

WE-CAN does not rely on dedicated lines to guarantee transmission quality between nodes. Instead, it uses intelligent real-time routing over the ordinary public network, which reduces bandwidth cost. WE-CAN also avoids expensive BGP nodes or three-line (multi-ISP) nodes for access: access nodes are generally deployed only in single-line (single-ISP) data centers, while relay nodes are mainly deployed in three-line data centers, which greatly reduces bandwidth cost. In addition, when computing routes WE-CAN takes each relay node's historical peak bandwidth into account, so that different relay nodes do not set new, higher peaks within the same billing period (month), and it distributes bandwidth across relay nodes as steadily as possible without affecting routing quality.

WE-CAN supports delivering a single message to multiple destinations. The destination access nodes of a multicast message and the relay nodes on the paths are organized into a tree-like cascade, and path multiplexing reduces traffic. This works well in low-latency live streaming and similar scenarios.
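The saving from path multiplexing can be seen by merging the per-destination paths of one multicast message into a tree and counting how often each link is used. The sketch below assumes all paths start at the same source and come from one consistent routing table, so that merging them yields a tree. The node names and helper functions are invented for the example.

```python
def multicast_tree(paths):
    """Merge per-destination paths into {node: sorted list of children}."""
    children = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, set()).add(child)
    return {node: sorted(kids) for node, kids in children.items()}


def link_sends(paths):
    """Return (sends without multiplexing, sends with multiplexing).

    Without multiplexing every destination gets its own copy on every
    hop; with multiplexing each distinct link carries the message once.
    """
    unicast = sum(len(path) - 1 for path in paths)
    shared = len({(a, b) for path in paths for a, b in zip(path, path[1:])})
    return unicast, shared


paths = [
    ["edge-src", "relay-hk", "relay-la", "edge-la1"],
    ["edge-src", "relay-hk", "relay-la", "edge-la2"],
    ["edge-src", "relay-hk", "edge-hk1"],
]
tree = multicast_tree(paths)
unicast_sends, shared_sends = link_sends(paths)
```

For these three destinations, separate copies would cross 8 links in total, while the cascaded tree crosses only 5, and the expensive long-haul hop from `relay-hk` to `relay-la` is used once instead of twice.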

Edge access

Access node

From the viewpoint of intra-network transmission, a WE-CAN access node is a single service called Edge. In reality, it is a service cluster (Edge Cluster) composed of a set of services:

image.png

  • Gateway: Responsible for receiving and caching data, as well as various possible protocol conversions in the future, so that downstream services can be hot-plugged and upgraded without worrying about data loss;
  • Broker: Responsible for Topic management, MessageBus reliable transmission protocol encapsulation, etc.;
  • Driver: Responsible for various processing of data, including unpacking, grouping, retransmission, sorting, session management, etc.;
  • Edge: Responsible for managing the status of other Edge and Relay;
  • Monitor: Responsible for the monitoring of the entire Edge Cluster, including health status, load status, etc. At the same time, the quality of the intra-network links between nodes will be evaluated and reported;
  • ENS-Registrar: Responsible for Topic registration and query. ENS (Edge Name Service) is part of the Edge Cluster, while the Registrar is a centralized service.

Control platform

The WE-CAN control platform (Dashboard) includes a Web page, a configuration database and a set of API interfaces, which is responsible for the configuration, monitoring and management of various resources of WE-CAN.

Unified scheduling

The unified scheduling system obtains the static configuration of each edge node from the control platform's configuration database and combines it with the dynamic load information that each Edge reports through MessageBus. It then schedules and allocates edge nodes according to preset rules and feedback from historical data.

The long-term goal of the unified scheduling system is to manage and allocate all Yunxin resources in accordance with general rules.

Quality optimization

WE-CAN improves the quality of edge access through real-time intelligent scheduling of access nodes, and it deploys enough access nodes around the world for the scheduling system to allocate a nearby one.

Each access node reports its real-time load and aggregation information to the scheduling system. For each access request, the scheduling system assigns the best access node of the same ISP, which guarantees access quality. Generally, the best access node is the geographically closest one or one in the same province or country (not necessarily the node at the shortest straight-line distance).

When selecting the optimal node, the scheduling system also refers to the aggregation information. For example, it tries to allocate users on the same channel to the same server, provided that this server is relatively "close" to each of them.

For users of small ISPs and overseas users (for whom no single-line node of the same ISP exists), the scheduling system assigns BGP nodes. The scheduling system also corrects its allocation results using historical data, in three ways: positive feedback correction, negative feedback correction and static lookup-table correction. After correction, users of small ISPs may also be assigned to single-line nodes, avoiding BGP nodes that are expensive and may not even perform well.
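The selection rules in this section (same ISP first, nearby region, prefer a node that already serves the channel, fall back to BGP) can be sketched as a small ranking function. Everything below is hypothetical: the `EdgeNode` fields, the `max_load` cutoff and the exact priority order are assumptions, and the three feedback corrections based on historical data are left out.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    name: str
    isp: str      # e.g. "telecom", "mobile", or "bgp" for multi-ISP nodes
    region: str
    load: float   # reported real-time load, 0.0 to 1.0
    channels: set = field(default_factory=set)  # aggregation information


def pick_edge(nodes, user_isp, user_region, channel, max_load=0.8):
    """Choose an access node for one access request, or None."""
    usable = [n for n in nodes if n.load < max_load]
    same_isp = [n for n in usable if n.isp == user_isp]
    # Users without a single-line node of their own ISP get BGP nodes.
    pool = same_isp or [n for n in usable if n.isp == "bgp"]
    if not pool:
        return None

    def rank(node):
        # Lower tuples win: same region first, then nodes already
        # serving the channel, then the least loaded node.
        return (node.region != user_region,
                channel not in node.channels,
                node.load)

    return min(pool, key=rank)


nodes = [
    EdgeNode("zj-telecom-1", "telecom", "zhejiang", 0.30),
    EdgeNode("zj-telecom-2", "telecom", "zhejiang", 0.50, {"room-42"}),
    EdgeNode("js-telecom-1", "telecom", "jiangsu", 0.10, {"room-42"}),
    EdgeNode("hk-bgp-1", "bgp", "hongkong", 0.20),
]
local_user = pick_edge(nodes, "telecom", "zhejiang", "room-42")
small_isp_user = pick_edge(nodes, "small-isp", "beijing", "room-42")
```

The Zhejiang Telecom user lands on `zj-telecom-2`: it is busier than `zj-telecom-1`, but it already serves the channel and is in the user's region, which matches the "same channel, same server, if close enough" rule. The user of a small ISP has no single-line match and falls back to the BGP node.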

Cost optimization

WE-CAN reduces access cost through aggregation of allocation results and tiered selection of access nodes.

Aggregating allocations not only improves transmission quality (by avoiding transmission between nodes) but also reduces transmission cost. The access node that is closest to the user or performs best may also cost more, so WE-CAN lowers access cost by selectively assigning lower-cost nodes depending on the business model. For example, in a low-latency live streaming scenario, sacrificing a little delay makes it possible both to aggregate allocations further and to select lower-cost nodes.

Hierarchical decoupling

WE-CAN's transmission protocol is fully decoupled in layers. First, it is decoupled from the business, so it can support all kinds of data and transmission modes. Second, the functions and service guarantees that WE-CAN itself provides are implemented in three layers. Each layer is responsible for different services and uses its own protocol header:

  • Application layer: The application layer currently provides MessageBus protocol encapsulation, including mechanisms such as Topic subscription and consumption, and will be extended for different business scenarios in the future.
  • Transport layer: The transport layer is responsible for Session management, message ordering, retransmission, fragmentation, reassembly and so on. The WE-CAN transport layer implements its own reliable transmission mechanism on top of UDP.
  • Network layer: The network layer is responsible for data routing, traffic scheduling and congestion control. It also applies hop-by-hop ARQ, FEC and other redundancy strategies between forwarding nodes to increase the arrival rate and reduce delay.

Because WE-CAN abstracts and encapsulates the protocol of each layer, the layers work independently without affecting each other, which improves system stability, speeds up iteration and lowers development difficulty. Thorough layering also lets WE-CAN offer more flexible and diverse tiered services. For example, in addition to the software-defined intelligent routing network, the network layer can provide dedicated-line service and can even switch flexibly between dedicated lines and the public network. Likewise, the transport layer can offer multiple message retransmission strategies and data redundancy strategies.
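The idea of one protocol header per layer can be illustrated with nested encapsulation. The header layouts below are entirely invented for this sketch (the article does not publish WE-CAN's wire format); what matters is that a forwarding node only reads and rewrites the outermost network header and treats everything inside it as opaque bytes.

```python
import struct

# Hypothetical header layouts, network byte order, no padding.
NET_HDR = struct.Struct("!HHB")    # source node id, destination node id, hops left
TRANS_HDR = struct.Struct("!IIH")  # session id, sequence number, fragment index
APP_HDR = struct.Struct("!B")      # length of the topic name that follows


def encode(topic, payload, session, seq, src, dst, hops=8):
    """Wrap a payload in application, transport and network headers."""
    name = topic.encode("utf-8")
    app = APP_HDR.pack(len(name)) + name + payload
    transport = TRANS_HDR.pack(session, seq, 0) + app
    return NET_HDR.pack(src, dst, hops) + transport


def forward(packet):
    """What a relay does: touch only the network header."""
    src, dst, hops = NET_HDR.unpack_from(packet, 0)
    return NET_HDR.pack(src, dst, hops - 1) + packet[NET_HDR.size:]


def decode(packet):
    """Peel the three headers off again at the receiving edge."""
    src, dst, hops = NET_HDR.unpack_from(packet, 0)
    session, seq, frag = TRANS_HDR.unpack_from(packet, NET_HDR.size)
    offset = NET_HDR.size + TRANS_HDR.size
    (name_len,) = APP_HDR.unpack_from(packet, offset)
    offset += APP_HDR.size
    topic = packet[offset:offset + name_len].decode("utf-8")
    return {
        "src": src, "dst": dst, "hops": hops,
        "session": session, "seq": seq, "frag": frag,
        "topic": topic, "payload": packet[offset + name_len:],
    }


packet = encode("chat", b"hello", session=7, seq=1, src=10, dst=20)
message = decode(forward(packet))
```

Because `forward` never parses past the first five bytes, the transport and application layers could change their formats, or be replaced entirely, without any change to the relays.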

Concluding remarks

Relying on design concepts such as "layered decoupling, tiered services and path multiplexing", WE-CAN has become the industry's first transmission foundation that is independent of business logic. NetEase Yunxin's ambitions do not stop there, however. As an expert in converged communication cloud services, NetEase Yunxin continues to polish its audio, video and instant messaging capabilities while also working to become a global leader in intelligent routing networks. By going deep in every direction, NetEase Yunxin will use continuous technological innovation to support its customers' growth and help them realize value.


