
Guide: On October 21, 2021, the QCon Global Software Development Conference will be held in Shanghai. As producer, Chen Gong, VP of NetEase Intelligent Enterprise Technology, launched the special session "Converged Communication Technology in the AI Era" and invited several technical experts to share related topics.

Starting with this issue, we will introduce the four talks one by one. This first issue covers the trends and evolution direction of converged communication technology.



Guest introduction: Jiajun is a senior server development engineer at NetEase Yunxin. He graduated from the Chinese Academy of Sciences with a master's degree and then joined NetEase, where he is responsible for server development of NetEase Yunxin IM, RTC signaling, and other services. He focuses on instant messaging, RTC signaling, and related middleware, and is the author of the Yunxin open source project Camellia.

Foreword

In recent years, communication technology has matured, become closely integrated with artificial intelligence and 5G, and, spurred by the COVID-19 pandemic, been applied across many industries. NetEase Yunxin, positioned as a one-stop converged communication cloud service, has seen new developments at the business level as well. This QCon 2021 Yunxin special session presents cutting-edge exploration and production practice in converged communication technology.

This talk has three parts. It first introduces the concept of converged communication, then the direction of its technical evolution and some of the technical explorations Yunxin has made while evolving its architecture, and finally NetEase Yunxin's practical experience applying converged communication across multiple industries.

The concept of converged communication

The concept of converged communication first appeared as unified communication in the corporate office setting. Unified communication is a communication method produced by combining traditional communication technologies (such as SMS, fax, and e-mail) with Internet technology. Its core idea is that people can obtain the information they want, such as text, pictures, sound, and video, over the network and from any device. As communication technology has developed, the concept of converged communication has kept expanding. Today it has gradually become a very rich concept, with IM and RTC at its core and multiple communication methods integrated around them.

The core idea of converged communication is, of course, convergence, which can be understood from two aspects. The first is business convergence: for any application, whether in entertainment, social networking, or finance, implementing the business logic usually relies on combining several underlying communication technologies. The second is technical convergence: only when the underlying technologies are well integrated and connected can businesses implement their own logic easily and efficiently on top of our converged communication technology.

What are the specific contents of converged communications? From the perspective of NetEase Yunxin or converged communications cloud service providers, converged communications can be divided into six aspects:

The first is the business scenario. Business scenarios are settings such as online education, music teaching, and corporate office. We provide a variety of one-stop solutions for different industries to reduce customers' integration costs.

The second is the application component layer. An application component can be understood as a further refinement of a general scenario-based solution. For common business scenarios such as video conferencing, NetEase Yunxin provides a complete one-stop component service from client to server. By hiding some of the underlying technical details, it reduces customers' integration costs and improves R&D efficiency.

The third layer is the client layer. NetEase Yunxin provides SDKs covering all platforms, and they are designed to offer consistent functionality and a consistent experience.

The fourth layer is the service layer. I have roughly divided it into several types, including instant messaging, RTC, live streaming, and operator capabilities. These service modules are not a simple division of functions; they are deeply integrated with one another.

For example, instant messaging provides RTC with signaling capabilities. Integrating RTC with live streaming and video on demand produces converged capabilities such as interactive live streaming and low-latency live streaming. Another example is the integration of operator capabilities with RTC: by connecting the underlying capabilities and combining them with instant messaging, we can effectively improve the call success rate in audio and video call scenarios.

Below the service layer is our core competency layer. Here, I would like to mention the AI part in particular. NetEase Yunxin has established a dedicated AI algorithm team to be responsible for the implementation of AI algorithms in various business service scenarios.

Take the operator call center as an example: it may include intelligent robots, which rely on the NLP capabilities of the underlying AI. As another example, this year NetEase Yunxin launched a major product called "Secure Communication", a one-stop content security solution for converged communication. With it, customers get the convenience of converged communication technology together with a content security guarantee.

Finally, there is the infrastructure layer, which includes components and platforms developed in-house by NetEase and NetEase Yunxin as well as adopted open source projects. In addition, NetEase Yunxin plans to open-source some of its underlying common capabilities to give back to the technical community. You can follow our WeChat official account, where we will publish the relevant information as soon as it is available.

 

The technological evolution direction of converged communications

The core of this talk is the direction of technical evolution in converged communication. That is a broad topic, so I have listed some keywords here. For example, the development of 5G and AI clearly drives converged communication technology forward. Another is the Internet of Things: with the rise of smart hardware, IoT has become a new arena for converged communication. This talk focuses on two keywords, globalization and unitization, and introduces some of NetEase Yunxin's technical explorations in evolving its converged communication architecture.

Why unitization and globalization

For this question, three key words are listed here, which can partially answer this question.

The first is capacity. As the overall converged communication market grows, NetEase Yunxin's scale keeps growing too. As the system expands, a single data center, or even a single city, can become a bottleneck that limits horizontal scaling. Unitization and globalization are a good solution to this problem.

The second is risk. As the system grows, so does the population of terminals it serves, so any jitter in our system has a very large impact. For example, you may have noticed that during this year's National Day holiday Facebook's services were down for more than six hours, with very wide impact. Against this kind of global risk, unitization is a good strategy: it avoids single points of failure and keeps a failure from spreading to the entire system.

The last is quality. For a communication service, how fast messages are delivered is one of the key indicators of quality. If two terminals communicate across countries or over long distances, physical distance is a hard limit. Unitization offers a very good answer: since distance cannot be overcome, move the service closer to the user.

After talking about the pain points, what are the specific advantages of unitization and globalization?

Improved overall carrying capacity. Under a unitized architecture, horizontal scaling across multiple units effectively increases the overall capacity of the system.

Risk dispersion. One of the design principles of the unitized architecture is to keep units logically, or even physically, isolated. This avoids single points of risk and prevents a risk from spreading through the whole system. If a unit goes down, traffic can be switched away from it.

Disaster tolerance. In the unitized architecture, units back each other up to provide remote disaster tolerance, improving the overall availability of the service.

Last-mile optimization. Only a unitized architecture can achieve true nearby access.

Unitization and globalization

Specifically, how do we build unitized and globalized communication services? Before getting to that, let me first cover two points that I consider critical, the prerequisites for unitization and globalization.

The first is the transmission network. Under a unitized architecture, our edge nodes are spread around the world and our data centers across continents, and the quality of the network between them directly affects the transmission quality of the entire communication network.

The second is the service quality monitoring system. In a unitized global communication system, terminals are massive in number and spread all over the world, as are the servers. Collecting data from these terminals and service nodes is essential for monitoring and improving the quality of the whole service.

Transmission network

For the transmission network, NetEase Yunxin has built WE-CAN, a global real-time transmission network. WE-CAN is a complex network system built on top of the public Internet; through intelligent scheduling of various resources, it aims to improve data transmission quality and reduce data transmission cost. The figure below is a simplified architecture diagram of WE-CAN.


 

A simple example helps explain how it works. There is a terminal on the left side of the figure and another on the right. Suppose a message needs to travel from Shanghai to Singapore: what path does it take?

First, the client obtains an edge acceleration node, which we call an edge, from the unified scheduling system, the dispatcher. After the edge receives the message, it passes it to a relay node in the same data center or the same unit. The message then travels over one or more routing hops to the edge node on the Singapore side, which finally delivers it to the target terminal.

The transmission process has two parts. The first, from the terminal to the edge, is what we call the last mile; the best access point is chosen mainly through the scheduling system mentioned above and the heartbeat and load information reported by the edges. The second is transmission inside the network, that is, intelligent routing between relay nodes. This relies mainly on the controller service, which collects network probe data between relay nodes to build a dynamic intelligent routing table.
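As a rough illustration of how a routing table could be derived from probe data, the sketch below runs Dijkstra's shortest-path algorithm over measured latencies between relay nodes. The node names, the latency figures, and the `best_route` function are hypothetical; the talk does not say which algorithm or metrics WE-CAN actually uses.

```python
import heapq

def best_route(latency_ms, src, dst):
    """Dijkstra over probed latencies; returns (path, total latency) or (None, inf)."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in latency_ms.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None, float("inf")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Hypothetical probe results (milliseconds) between relay nodes.
probes = {
    "shanghai": {"hongkong": 30.0, "singapore": 95.0},
    "hongkong": {"singapore": 35.0},
    "singapore": {},
}
path, cost = best_route(probes, "shanghai", "singapore")
```

Recomputing the route whenever fresh probe data arrives is what makes such a table dynamic: if the Hong Kong relay disappears from the probe data, the same function falls back to the direct link.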

The design principles and goals of WE-CAN can be summarized in a few sentences:

Faster than CDN

Cheaper than SD-WAN

More versatile than RTN

The first two are easy to understand. What about the last one: what does more versatile mean? This is where WE-CAN differs from a typical RTN. WE-CAN can not only carry streaming media data but also provide a reliable transmission mode, which we call msgbus. Reliable transmission means, first, that a message is guaranteed to be delivered and, second, that messages are delivered in order. msgbus is already widely used to carry Yunxin's messages, data, and signaling.
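The two guarantees named above, delivery and ordering, are commonly built from sequence numbers, cumulative acknowledgements, and a reorder buffer. The sketch below shows the receiving side of that generic technique. It is not msgbus's actual protocol, which the talk does not describe, and the class and method names are made up for illustration.

```python
class OrderedReceiver:
    """Delivers each message exactly once and in sequence order, buffering gaps."""

    def __init__(self):
        self.next_seq = 0      # next sequence number we expect to deliver
        self.pending = {}      # out-of-order messages waiting for a gap to fill
        self.delivered = []    # messages handed to the application, in order

    def on_message(self, seq, payload):
        """Accept a possibly duplicated or reordered message; return the cumulative ack."""
        if seq >= self.next_seq:
            self.pending[seq] = payload   # duplicates simply overwrite themselves
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        # The sender can discard everything below this number and
        # should retransmit anything at or above it that stays unacknowledged.
        return self.next_seq
```

The sender side would keep unacknowledged messages and resend them after a timeout; the receiver above tolerates those resends because anything below `next_seq` is ignored.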

For unitization and globalization, WE-CAN solves two key problems. The first is network acceleration and dynamic routing. WE-CAN's intelligent scheduling helps us choose the best path between nodes, and between edge nodes and the central data center, and automatically routes around single points of failure in the network.

The second is the last mile. WE-CAN achieves nearby access through static configuration combined with reported dynamic load information, improving access quality. In addition, in some special scenarios such as a large RTC room, the scheduling system also aggregates users onto fewer nodes to reduce cross-node traffic and thus cost.
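A minimal sketch of nearby access, assuming the static configuration is a region-to-neighbour map and the dynamic information is a load figure from each edge's latest heartbeat. The `Edge` fields, the `max_load` threshold, and the region names are illustrative assumptions, not details from the talk.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    name: str
    region: str
    load: float          # 0.0 (idle) to 1.0 (full), taken from the latest heartbeat
    alive: bool = True

def pick_edge(edges, client_region, neighbors, max_load=0.8):
    """Prefer the least-loaded healthy edge in the client's region, then in neighbouring regions."""
    for region in [client_region, *neighbors.get(client_region, [])]:
        candidates = [
            e for e in edges
            if e.region == region and e.alive and e.load < max_load
        ]
        if candidates:
            return min(candidates, key=lambda e: e.load)
    return None  # nothing suitable; a real scheduler would need a global fallback
```

For example, a client in a region whose only edges are overloaded or down is sent to the least-loaded edge of the first neighbouring region that has capacity.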

Service quality monitoring system

Under a globalized architecture, a key point, and also a difficulty, in building a service quality monitoring system is how to collect data in real time from terminals and servers all over the world. The figure below is a simplified diagram of the architecture.


 

The data sources are shown on the left: the SDK, and our servers, including edge nodes and remote data centers.

The SDK reports data over HTTP or WebSocket, chosen according to the type and characteristics of the data. Servers collect data mainly through logs plus an agent, and report it over WebSocket.

Before reporting, a data source asks the scheduling system for the access address of an edge collection cluster. Once data reaches the edge collection cluster, it is routed to the central cluster through WE-CAN's msgbus, then cleaned and diverted to different data processing units. For example, some data enters our indicator calculation system, which computes metrics in real time and feeds our monitoring and alerting system; other data goes to offline and online systems for data analysis and troubleshooting.
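The clean-then-divert step in the central cluster might look roughly like the sketch below. The record fields (`source`, `kind`, `value`) and the processing-unit names are invented for illustration; the talk does not describe the real schema.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("source", "kind", "value")   # hypothetical schema

def clean(record):
    """Drop records that miss required fields; normalise the kind of the rest."""
    if not all(field in record for field in REQUIRED_FIELDS):
        return None
    return {**record, "kind": record["kind"].lower()}

def divert(records, routes, default="offline_analysis"):
    """Group cleaned records by the processing unit responsible for their kind."""
    out = defaultdict(list)
    for record in map(clean, records):
        if record is not None:
            out[routes.get(record["kind"], default)].append(record)
    return dict(out)
```

Here `routes` maps a record kind to a processing unit, for instance sending metric records to the indicator calculation system while everything else goes to offline analysis.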

With the service quality monitoring system we can see the health of the entire communication system in real time. It is like the dashboard of a sports car: it shows the operating state of the system at a glance, including when to refuel and when to turn.

IM/RTC server unitization/globalization solution

Having covered the two prerequisites for unitization and globalization, I will now describe how we build a unitized, globalized system, covering the architecture of two kinds of communication system, RTC and IM, in turn.

RTC server

First, the unitization of the RTC server. I will cover it in two parts: the unitized deployment architecture and the disaster recovery solution.

Deployment architecture


 

This is a simplified diagram of the RTC server architecture. For its deployment architecture, the three keywords on the right can vividly describe its overall design concept.

The first is layered decoupling. The RTC server can be divided into three layers. The first is the signaling access layer, the entrance to the entire RTC server. The second is the media signaling layer, the control center of the RTC server, which exchanges a large amount of signaling with the third layer, the underlying media service layer.

Each service layer supports deployment in multiple units. The signaling access layer mainly authenticates client requests and delivers global or application-level configuration. Its units are divided at the application level: each application is handled by exactly one signaling access unit. After receiving a request, the signaling access layer forwards it to a suitable media signaling unit based on request parameters such as the client IP. Media signaling units are divided at the room level, and their main functions include room management, stream management, and scheduling and distribution. Each media signaling unit corresponds one-to-one with a media unit. Media units are logically independent of each other but share physical resources.
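The two division rules above (application-level for signaling access, room-level for media signaling) can be sketched as two lookups. The class, the unit names, and the use of a client region in place of the client IP are all illustrative assumptions.

```python
class SignalingAccess:
    """Toy router: one access unit per application, one media signaling unit per room."""

    def __init__(self, app_units, region_units):
        self.app_units = app_units        # app_id -> signaling access unit
        self.region_units = region_units  # client region -> media signaling unit
        self.room_units = {}              # (app_id, room_id) -> media signaling unit

    def access_unit(self, app_id):
        """Application-level division: every request of an app goes to the same unit."""
        return self.app_units[app_id]

    def media_signaling_unit(self, app_id, room_id, client_region):
        """Room-level division: the first join picks a unit, later joins reuse it."""
        key = (app_id, room_id)
        if key not in self.room_units:
            self.room_units[key] = self.region_units[client_region]
        return self.room_units[key]
```

Keeping the room-to-unit mapping sticky is what lets all participants of a room, wherever they connect from, land on the same media signaling unit.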

The second is data isolation and synchronization. For media signaling and media services, data is isolated between units at the room level. The signaling access layer, being the entrance for requests, also involves data synchronization and forwarding.

There are two kinds of data synchronization. The first is synchronization of global and application-level configuration, which uses a single-writer, multiple-reader model to keep data consistent across units. The second is synchronization of room-related data, mainly room authentication information and the one-to-one mapping between each room and its media signaling unit. This data is synchronized between paired units, and each unit has a logical backup unit.

In addition, the signaling access layer forwards requests between units, which ensures that the requests of each application are handled by only one unit.

The last is mutual backup between units. Each service layer supports unitized deployment, and units back each other up so that a single point of failure does not affect the whole system.

Disaster recovery solution

The RTC server has disaster tolerance at three levels: the signaling layer, the media layer, and the link.

The signaling layer can be subdivided into the signaling access layer and the media signaling layer. Because media signaling units are divided at the room level, when a unit becomes unavailable the signaling access layer simply shields it: no new requests are forwarded to that unit, which allows quick recovery.

In the signaling access layer, each unit has a logical backup unit. When a unit becomes unavailable, we switch the request entry points (including gateway configuration, DNS, and so on) and, together with the corresponding data changes, quickly propagate the switch to all units, completing the disaster recovery switchover.

For the media service layer, each media server will periodically report heartbeat information to the media signaling service. When a node fails, the media signaling service will quickly detect and automatically offline the node.
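Heartbeat-based failure detection of the kind described for media servers can be sketched as follows. The timeout value, the class name, and passing the clock in explicitly (so the example is deterministic) are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and takes overdue nodes offline."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # node name -> timestamp of its latest heartbeat

    def beat(self, node, now):
        """Record a heartbeat from a node at time `now` (seconds)."""
        self.last_seen[node] = now

    def online(self, now):
        """Nodes whose latest heartbeat is within the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout_s)

    def sweep(self, now):
        """Remove and return nodes whose heartbeat is overdue."""
        dead = sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
        for node in dead:
            del self.last_seen[node]
        return dead
```

A scheduler would call `sweep` periodically and stop assigning new streams to any node it returns; a node that resumes sending heartbeats is simply registered again by `beat`.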

The last is link disaster recovery. In a communication system, all service nodes may be healthy while a network link is down. We rely mainly on WE-CAN's intelligent routing to detect and handle such link failures.

Here is a real example. The backbone network of one operator failed in the province where one of our data centers is located, so that operator's data centers and terminals elsewhere could not connect to it. Before WE-CAN this might have been a major incident, but WE-CAN's intelligent routing detected the situation automatically and rerouted through other triple-line data centers, detouring over other operators' lines and avoiding an incident.


 
IM server

After RTC, let's turn to IM. Unlike RTC, IM depends more heavily on the data center, so its unitized architecture is somewhat different. I will cover it in three parts: first, how to build a global instant messaging network; second, how to support multiple data centers on top of that network; and finally, how to handle disaster recovery and redundancy.

Global communication network


 

In the IM global communication network, our services fall into two categories. One is the edge nodes, which mainly host our long-connection servers, which we call link. Link servers can be deployed in multiple units, and clients are assigned to a nearby one by the scheduling system. The other is our data center, which carries the core capabilities of instant messaging.

Edge nodes and the data center are interconnected through the WE-CAN network. When a link server receives a request from a client, the request is routed through WE-CAN's msgbus to the bridge cluster in the central data center, forwarded to our protocol routing service, and then routed to our service clusters, including the message service, the push service, and others.

What are the benefits of this architecture? First, placing long connections at the edge effectively improves access quality. Second, pushing the fan-out of broadcast messages down to the edge link nodes reduces bandwidth pressure on the central data center and improves the horizontal scalability of the system. The effect is especially noticeable in chat room and large group scenarios.
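To see why edge fan-out relieves the center, consider this toy model: the center sends one copy of a broadcast to each link node that has members in the room, and each link node copies it to its local connections. The `LinkNode` class and `center_broadcast` function are hypothetical stand-ins, not Yunxin's implementation.

```python
from collections import defaultdict

class LinkNode:
    """Edge long-connection server that knows which local clients are in which room."""

    def __init__(self, name):
        self.name = name
        self.rooms = defaultdict(set)   # room_id -> ids of locally connected clients

    def join(self, room_id, client_id):
        self.rooms[room_id].add(client_id)

    def broadcast(self, room_id, message):
        """Copy one incoming message to every local member of the room."""
        return [(client, message) for client in sorted(self.rooms.get(room_id, ()))]

def center_broadcast(link_nodes, room_id, message):
    """Return (copies sent by the center, copies delivered to clients)."""
    sent_from_center = 0
    delivered = 0
    for node in link_nodes:
        if node.rooms.get(room_id):
            sent_from_center += 1
            delivered += len(node.broadcast(room_id, message))
    return sent_from_center, delivered
```

With, say, 1,000 link nodes serving a chat room of one million members, the center's outbound copies per message drop from one million to at most 1,000 in this model.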

Multi-data center support

How to support multiple data centers under such a global instant messaging network architecture?

To support multiple data centers in this architecture, we extracted two dedicated services. One is the tenant service, which holds global configuration, application-level configuration, switches, and so on. Every data center and every edge site deploys the tenant service. When a client request arrives at a link node, the link service queries the nearest tenant service for the unit that the request's application belongs to, and then uses the topic mechanism of WE-CAN's msgbus to route the request to the correct data center.

The other is the lbs service, which is also deployed in multiple data centers and allocates and schedules resources consistently based on unified configuration.
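The tenant lookup followed by topic-based routing can be sketched with an in-memory stand-in for a message bus. The `TopicBus` class, the `im.<unit>` topic naming, and the request shape are assumptions for illustration; msgbus's real topic API is not described in the talk.

```python
from collections import defaultdict

class TopicBus:
    """In-memory stand-in for a topic-based message bus."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> handlers

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic; return how many received it."""
        handlers = self.subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)

def forward(bus, tenant_units, request):
    """Look up the unit that owns the request's application, then publish to that unit's topic."""
    unit = tenant_units[request["app_id"]]   # plays the role of the tenant service lookup
    return bus.publish(f"im.{unit}", request)
```

Each data center subscribes only to its own topic, so a request for an application owned by one unit never reaches the others.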

In the multi-data center architecture, a very important point is unit isolation.

It can be understood from two aspects. First, each application belongs to one unit, which means an application's requests are routed to only one data center. Each data center therefore holds data for a different set of applications, which provides isolation.

The second is global uniqueness of data. Take sending messages as an example: each message has a message ID that is unique not only within a single data center but globally. The advantage is that no data conflicts arise when units are split or merged in the future.
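One common way to get IDs that are unique across units is to embed a unit identifier in each ID alongside a timestamp and a per-millisecond sequence. The sketch below shows that generic scheme; the bit widths and the class name are assumptions, and the talk does not say how Yunxin actually generates message IDs.

```python
import threading
import time

class MessageIdGenerator:
    """Hypothetical id layout: timestamp in ms | unit id | per-millisecond sequence."""

    UNIT_BITS = 6     # assumed: up to 64 units
    SEQ_BITS = 12     # assumed: up to 4096 ids per unit per millisecond

    def __init__(self, unit_id, clock=None):
        if not 0 <= unit_id < (1 << self.UNIT_BITS):
            raise ValueError("unit_id out of range")
        self.unit_id = unit_id
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                now = self.last_ms        # never move backwards, even if the clock does
            if now == self.last_ms:
                self.seq += 1
                if self.seq >= (1 << self.SEQ_BITS):
                    now += 1              # sequence exhausted: borrow the next millisecond
                    self.seq = 0
            else:
                self.seq = 0
            self.last_ms = now
            shift = self.UNIT_BITS + self.SEQ_BITS
            return (now << shift) | (self.unit_id << self.SEQ_BITS) | self.seq
```

Because the unit bits differ, two units generating IDs in the same millisecond can never collide, which is the property that makes later splitting or merging of units safe.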

Disaster recovery and redundancy

How do we achieve disaster recovery and redundancy in a multi-data-center architecture? The most important thing is data synchronization. In our architecture each application belongs to one data center, but each application also has a logical backup unit, and data is synchronized between the primary and backup units so that a switchover is possible.

We use two data synchronization strategies, chosen according to the type and characteristics of the data. The first is dual-write synchronization, mainly for time-sensitive data such as online status, roaming, and caches. The dual-write logic lives in a proxy service to reduce intrusion into business code: the proxy forwards each write request to msgbus, which routes it to the backup unit to complete the dual write.

The second, for persistent data, is cross-unit replication, mainly done by DTS subscribing to the binlog. To reduce latency, the binlog is subscribed to remotely and the synchronization writes are performed inside the target unit.
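The first strategy, dual write through a proxy, can be sketched as follows. Business code calls the proxy as if it were the store; the proxy writes locally and publishes the same write for the backup unit. The class, the `sync.<unit>` topic name, and the use of plain dictionaries as stores are illustrative assumptions.

```python
class DualWriteProxy:
    """Writes to the local store and forwards the same write toward the backup unit."""

    def __init__(self, local_store, publish):
        self.local = local_store
        self.publish = publish   # callable(topic, message); stands in for the message bus

    def set(self, key, value, backup_unit):
        self.local[key] = value
        self.publish(f"sync.{backup_unit}", {"key": key, "value": value})

def apply_sync(store, message):
    """Runs inside the backup unit: apply a forwarded write to its own store."""
    store[message["key"]] = message["value"]
```

Because the forwarding happens inside the proxy, the calling business code does not need to know that a backup unit exists.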

At present we use one-way synchronization with switchover driven by automated scripts. In the future we will move to two-way synchronization, which will further reduce the cost and complexity of operations.


 

IM/RTC server unitized architecture

What are the specific benefits of the IM and RTC server unitized architecture of NetEase Yunxin?

Optimizing the last mile. For RTC, nearby access for both signaling and media significantly reduces latency across the whole flow, for example speeding up first-frame rendering. For IM, under the unitized solution both access nodes and data centers are closer to the terminal, which effectively reduces message latency.

Improving availability. The unitized, globalized architecture effectively improves the ability of the whole communication network to withstand failures at the data center or even city level.

Data isolation and risk isolation. Each unit is logically independent and physically isolated, which can reduce risks and avoid the spread of faults; in particular, the isolation of data between units can also avoid policy risks in some cases.

Practical experience of converged communication
Having covered the unitized/globalized technical architecture, what does it look like in practice?

Take the large-scale live event solution as an example. It integrates multiple communication methods. Under the unitized architecture, the live streaming service is deployed across multiple data centers to keep streams stable and highly available. For co-hosting interaction, where guests join by microphone, globally distributed nodes allow real-time interaction between multiple locations around the world. For bullet-comment (barrage) interaction, decentralized access nodes under the unitized architecture let Yunxin support up to tens of millions of concurrent online users.

The picture below shows an event at the end of August last year: the NetEase Cloud Music TFBOYS 7th Anniversary Concert, for which Yunxin provided the underlying communication capabilities. Peak concurrent online users reached 786,000, breaking the Guinness World Record for an online paid concert and fully validating the horizontal scalability and stability of the unitized, global architecture.


 

The picture below is a social scenario: a social app expanding overseas. Connecting to the nearest unit effectively improved the quality and stability of its communication services.

For IM, message latency fell by 30% compared with connecting to the central data center, and first-frame speed in RTC one-to-one calls improved by more than 20%.


 

Summary and outlook

As converged communication technology keeps developing, the market keeps expanding. For Yunxin as a converged communication cloud service provider, globalization and unitization are a must. NetEase Yunxin will continue to refine its technology to better serve our customers.

Unitization and globalization are by no means the only directions of technical evolution for converged communication. AI, 5G, and the Internet of Things are all important directions as well.

The above is all the content shared today, thank you all!

