
Preface

The chat room is an important type of IM system. Unlike one-to-one chat and group chat, a chat room is a large-scale, real-time message distribution system.

There are a variety of technical implementation solutions for chat rooms, and there are also some open source implementations in the industry. Each implementation has its own characteristics and application scenarios. As a PaaS platform, NetEase Yunxin has several outstanding features in its chat room system architecture and solution:

  • Horizontal expansion capability: This shows in two aspects: the number of chat rooms, and the number of people in a single chat room.
  • Rich functions: As a platform, the chat room provides low-level communication capabilities and a rich set of functions to adapt to a variety of business scenarios. Users can use them on demand according to their business requirements.
  • Support for globalization: Yunxin currently provides a global communication network. By connecting to the WE-CAN network developed by Yunxin, global latency does not exceed 250 ms.

In this article, we introduce NetEase Yunxin's large-scale chat room system in detail, along with practical application cases.

NetEase Yunxin chat room system architecture

First of all, let's take a look at the detailed technical architecture of NetEase Yunxin's current chat room and some of the things we have done in the process of architecture upgrade and optimization.

Overall architecture diagram

The following figure shows the technical architecture of NetEase Yunxin chat room:

image.png

It mainly includes the following parts:

  • Access layer: ChatLink
  • Network transport layer: WE-CAN and WE-CAN Bridge
  • Scheduling layer: Dispatcher
  • Service layer: Callback, Queue, Presence, Tag, History, etc.
  • CDN distribution layer: CDN Manager, CDN Pusher, and CDN Source

Below, we analyze each layer in detail.

Access layer

image.png

The access layer is implemented differently depending on the type of client. For example, common clients (iOS, Android, Windows, Mac, etc.) use a private binary protocol, while the Web client uses the WebSocket protocol. Because the access layer is the last mile to the client, its access speed, quality, and data security are all critical:

Access speed and quality

At present, we have built edge nodes covering all provinces across the country and all continents around the world, shortening the last mile, reducing uncertainty, and improving service stability.

Data security

Using a combination of symmetric and asymmetric encryption, the client and server complete key exchange and login with 0-RTT. Multiple encryption algorithms are supported, including RSA, AES, SM2, and SM4.

In addition to accepting requests from clients, the access layer is also responsible for unicasting and broadcasting messages. It therefore needs to manage all the long connections on its node, including which chat room each connection belongs to and the tag attributes of each connection. The access layer also reports its own load information to the back-end service so that the scheduling layer can make reasonable scheduling decisions.

When traffic peaks arrive, the access layer is often under the most pressure because it has to broadcast messages. To keep the service stable, we have adopted several optimization strategies:

Adaptive flow control strategy

  • Single-machine flow control: The access layer service monitors the machine's overall network bandwidth usage and sets two thresholds, T1 and T2. When bandwidth usage exceeds T1, flow control is triggered. If usage goes on to exceed T2, the service not only applies flow control but also keeps adjusting its intensity. The ultimate goal is to keep bandwidth usage stable between T1 and T2.
  • Single-connection flow control: The access layer service also records the message distribution rate of each long connection and makes fine-grained adjustments, so that coarse-grained single-machine flow control does not leave an individual connection with too few or too many messages. Smoother message distribution reduces spikes in bandwidth usage and improves the client-side experience.
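The single-machine strategy above can be sketched as a small feedback controller. This is only an illustration under stated assumptions, not ChatLink's actual code: the class name, the fixed adjustment step, and the idea of expressing flow control "intensity" as a value between 0 and 1 are all hypothetical.

```python
class BandwidthFlowController:
    """Two-threshold flow control sketch (hypothetical, not ChatLink's code)."""

    def __init__(self, t1: float, t2: float, step: float = 0.1):
        assert 0 < t1 < t2
        self.t1 = t1          # usage above T1 triggers flow control
        self.t2 = t2          # usage above T2 keeps raising the intensity
        self.step = step      # assumed adjustment per bandwidth sample
        self.intensity = 0.0  # 0.0 = no throttling, 1.0 = maximum throttling

    def update(self, usage: float) -> float:
        """Feed one bandwidth usage sample and return the new intensity."""
        if usage > self.t2:
            # Above T2: tighten further on every sample.
            self.intensity = min(1.0, self.intensity + self.step)
        elif usage > self.t1:
            # Inside the target band: make sure throttling is on, then hold.
            self.intensity = max(self.intensity, self.step)
        else:
            # Below T1: relax gradually.
            self.intensity = max(0.0, self.intensity - self.step)
        return self.intensity
```

A caller would sample bandwidth usage periodically, call `update`, and use the returned intensity to decide how aggressively to delay or drop low-priority broadcasts. What the intensity maps to is left open here because the article does not specify it.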

Performance optimization

When ChatLink runs under high load, every step in the call chain, not just network bandwidth, can become a performance bottleneck. We have significantly improved service performance by reducing the number of encoding and decoding passes (including serialization, compression, etc.), using multi-threaded concurrency, reducing memory copies, and merging messages.
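Two of these ideas, message merging and encoding once per broadcast rather than once per connection, can be illustrated with a short sketch. The JSON plus zlib wire format and the `FakeConnection` class are assumptions made for the example; the real clients use a private binary protocol.

```python
import json
import zlib


def encode_batch(messages: list) -> bytes:
    """Merge several messages into one frame, then serialize and compress once."""
    raw = json.dumps(messages, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


class FakeConnection:
    """Stand-in for a long connection; records what was written to it."""

    def __init__(self):
        self.received = []

    def write(self, payload: bytes) -> int:
        self.received.append(payload)
        return len(payload)


def broadcast(messages: list, connections: list) -> int:
    """Encode the batch once and reuse the same bytes object for every connection."""
    payload = encode_batch(messages)  # one encode, not one per connection
    return sum(conn.write(payload) for conn in connections)
```

Because the payload is an immutable `bytes` object, every connection shares the same buffer instead of receiving its own copy.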

Network transport layer

In the initial architecture of the NetEase Yunxin chat room system, the access layer and the back-end service layer were deployed in the same data center. Most users connected directly to ChatLink in the BGP data center, while users in remote areas or overseas were accelerated through proxy nodes deployed over dedicated lines. The obvious disadvantage of this scheme is that the upper limit of service capacity is restricted by the capacity of a single data center. The dedicated lines are also a considerable expense.

After connecting to the WE-CAN network, the access layer ChatLink lets clients connect to a nearby node, which improves quality of service and reduces cost. The multi-data-center structure also raises our service capacity to a higher level.

image.png

To adapt to the WE-CAN network, we designed the WE-CAN Bridge layer as a bridge between the WE-CAN access protocol and the chat room protocol. It is responsible for protocol conversion, session management, and forwarding and receiving messages. With this layered architecture, the access layer and the back-end business layer need little or no modification, which reduces both the cost of transforming the existing system and the risk of the architecture upgrade.

Scheduling layer

The scheduling layer is the prerequisite for a client to access the chat room system. The client needs to obtain an access address before logging in to a chat room, and the service that hands out these addresses is called Dispatcher.

image.png

Dispatcher is centralized. It receives heartbeat information from WE-CAN and ChatLink and selects the best access point based on those heartbeats. The key points of the scheduling system design are:

Scheduling accuracy

The scheduling system determines the requester's region and network operator from its IP address. It compares these against the region and operator of each edge node, takes into account the load of the node itself (CPU, network bandwidth, etc.) and the state of the link from the edge node to the central data center (reported by WE-CAN), calculates a composite score, and returns the top-scoring nodes as the scheduling result.
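A composite score of this kind might look like the following sketch. The weights, field names, and the 0 to 1 normalization are all assumptions chosen for illustration; the article does not disclose the actual formula.

```python
from dataclasses import dataclass


@dataclass
class EdgeNode:
    name: str
    region: str
    operator: str
    cpu: float           # 0.0-1.0, from the node's heartbeat
    bandwidth: float     # 0.0-1.0, from the node's heartbeat
    link_quality: float  # 0.0-1.0, higher is better (edge to central data center)


def score(node: EdgeNode, region: str, operator: str) -> float:
    """Weighted sum of locality, load headroom, and backbone link quality."""
    s = 0.0
    if node.region == region:
        s += 30  # assumed weight for region match
    if node.operator == operator:
        s += 20  # assumed weight for operator match
    s += 30 * (1 - max(node.cpu, node.bandwidth))  # penalize the busiest resource
    s += 20 * node.link_quality
    return s


def pick_nodes(nodes: list, region: str, operator: str, top_n: int = 2) -> list:
    """Return the names of the top_n highest-scoring nodes."""
    ranked = sorted(nodes, key=lambda n: score(n, region, operator), reverse=True)
    return [n.name for n in ranked[:top_n]]
```

With these weights, a nearby node that is almost saturated can lose to a lightly loaded node on a different operator, which is the kind of trade-off a composite score is meant to capture.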

Scheduling performance

In high-concurrency scenarios, such as the start of an event in a large chat room, a large number of people often enter at the same time, and the scheduling system needs to respond quickly. To this end, we optimize the scheduling rules described above and cache the raw data locally. In addition, because heartbeat information lags behind real load, allocating purely on heartbeats could overload a node, so we dynamically adjust each node's load factor as we allocate. This preserves scheduling performance while keeping the distribution as smooth as possible.
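One way to read the dynamic load factor adjustment is sketched below: between heartbeats, the dispatcher bumps its locally cached load estimate each time it assigns a client to a node, so a burst of logins is not all sent to the node that looked idle in the last heartbeat. The class name and the fixed per-assignment cost are assumptions, not details from the article.

```python
class LoadTracker:
    """Locally cached load estimates, corrected by heartbeats (hypothetical)."""

    def __init__(self, per_assignment_cost: float = 0.01):
        self.cost = per_assignment_cost  # assumed load added per assigned client
        self.load = {}                   # node name -> estimated load

    def on_heartbeat(self, node: str, reported_load: float) -> None:
        # The reported value is authoritative and replaces the local estimate.
        self.load[node] = reported_load

    def assign(self) -> str:
        # Pick the node with the lowest estimate, then charge it immediately
        # so the next request sees the updated estimate.
        node = min(self.load, key=self.load.get)
        self.load[node] += self.cost
        return node
```

Without the local adjustment, every request arriving between two heartbeats would pick the same node.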

Service layer

The service layer implements various business functions, including online status, room management, cloud history, third-party callbacks, chat room queues, and chat room tags. The most basic ones are online status management and room management:

  • Online status management: manage the login status of an account, including which chat rooms are logged in, which access points are logged in, etc.
  • Room management: Manage the state of a chat room, including which access points the room is distributed to, and which members are in the room, etc.

The difficulty of online status management and room management lies in managing a massive number of users and rooms effectively. The nature of NetEase Yunxin as a PaaS platform allows us to partition by tenant and thereby scale horizontally. In addition, because state data changes quickly (short TTL), when key users or a particular customer report an upcoming large-scale event, Yunxin can split off and isolate the related resources within a short time.

Besides supporting massive customer access and horizontal expansion, the service layer has another important capability: providing extensibility functions that adapt to customers' different application scenarios. Yunxin provides a rich set of such functions, for example:

  • Third-party callback: Lets customers intervene in core flows such as login and message sending. Because this involves service calls that cross data centers or even regions, we have designed isolation, circuit breaker, and other mechanisms so that a failure in a third-party service does not cause Yunxin's own service to misbehave, reducing the impact on key processes.
  • Chat room queue: Helps users implement business scenarios such as mic queues and grabbing the mic.
  • Chat room tag: An industry-first feature from Yunxin that supports personalized distribution of messages. It works by setting a tag group when the client logs in and a tag expression when a message is sent, which together define the rules for distributing and receiving messages. Tag information is stored in both the service layer and the access layer. Pushing part of the tag computation down to the access layer saves bandwidth and computing resources in the central service.
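The circuit breaker mentioned for third-party callbacks can be sketched as follows. This is a generic textbook circuit breaker, not Yunxin's implementation; the thresholds, the injected clock, and the fallback value returned when the customer's service is unavailable are all assumptions.

```python
import time


class CircuitBreaker:
    """Skip a failing third-party callback for a cooldown period (hypothetical)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback         # open: do not touch the remote service
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback
        self.failures = 0
        self.opened_at = None
        return result
```

Here `fn` would wrap the HTTP call to the customer's callback endpoint, and `fallback` is whatever default decision the platform applies when the callback cannot be reached.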

CDN distribution layer

When evaluating a chat room system, a commonly used phrase is "no upper limit". But an architecture that supports unlimited scale does not mean there is truly no limit. Logically, every component of a chat room system can be scaled horizontally, but the physical machines, switches, and data center bandwidth that each service depends on all have capacity ceilings. The real upper limit of a chat room system is therefore determined by how well it can deploy across multiple data centers in multiple regions, and even draw on external resources.

In NetEase Yunxin's chat room system, user access points are spread across data centers in many locations, which naturally pools resources from all over the country. The capacity ceiling is therefore higher than that of a single data center, or of multiple data centers in a single region.

Furthermore, for even larger chat rooms, making use of general-purpose external capabilities is a suitable choice. The converged CDN barrage solution is such an approach: it uses the edge nodes that major CDN vendors have deployed in various places, and the general capability of static acceleration, to support message distribution for ultra-large-scale chat rooms.

The core of the converged CDN barrage distribution solution is the distribution and management of barrage messages. It is an optional module encapsulated inside Yunxin: you can decide whether to enable it based on your business characteristics, without modifying any business code.

When the converged CDN barrage distribution solution is enabled, all barrage broadcasts are split across two links:

  • Important messages that need to be delivered in real time reach the client through a long connection.
  • All other high-volume messages enter the CDN Pusher, are aggregated through various strategies, and are then delivered to the CDN Source.

The client SDK periodically fetches barrage messages from CDN edge nodes according to a configured strategy. The SDK aggregates the messages from the different sources, sorts them, and passes them to the user through callbacks. The App layer does not need to know where a message came from and only needs to process it according to its own business needs.
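The SDK-side aggregation can be pictured with the sketch below. The message fields (`id`, `ts`) and the de-duplication step are assumptions for the example; the article only states that messages from different sources are aggregated and sorted.

```python
def merge_messages(long_conn_msgs: list, cdn_msgs: list) -> list:
    """Combine messages from both links, order by timestamp, drop repeated ids."""
    seen = set()
    merged = []
    combined = long_conn_msgs + cdn_msgs
    for msg in sorted(combined, key=lambda m: (m["ts"], m["id"])):
        if msg["id"] in seen:
            continue  # assumed: the same message may arrive over both links
        seen.add(msg["id"])
        merged.append(msg)
    return merged
```

The application then receives a single ordered stream and never needs to know which link delivered a given message.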

image.png

The figure above shows the message flow of the CDN barrage distribution link. CDN Manager is responsible for managing the distribution strategies of the different CDN vendors (a strategy is issued over the long connection at login and can be adjusted dynamically). It is also responsible for turning the converged CDN mode on and off for each chat room on the management platform, and for deploying and reclaiming the corresponding CDN Pusher resources. CDN Pusher is the component that actually receives messages from clients. Based on message type, message priority, and so on, it assembles them into individual static resources, pushes them to the CDN Source, and waits for the CDN to pull them when it goes back to the source.

Practical application cases

Below, we introduce typical application scenarios of the NetEase Yunxin chat room system.
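As a rough picture of how messages might be assembled into static resources, the sketch below groups messages into fixed time windows and writes each window as one JSON document that a CDN could cache. The URL layout, the window size, and the per-window cap that keeps the highest-priority messages are all hypothetical.

```python
import json
from collections import defaultdict


def build_static_resources(room_id, messages, window_ms=1000, max_per_window=100):
    """Return a mapping of resource path -> JSON body, one entry per time window."""
    buckets = defaultdict(list)
    for msg in messages:
        buckets[msg["ts"] // window_ms].append(msg)

    resources = {}
    for index, batch in buckets.items():
        # Keep the highest-priority messages when a window is over the cap.
        batch.sort(key=lambda m: (-m.get("priority", 0), m["ts"]))
        kept = sorted(batch[:max_per_window], key=lambda m: m["ts"])
        resources[f"/{room_id}/{index}.json"] = json.dumps(kept)
    return resources
```

A client that knows the window size can compute the next path to poll without any per-client state on the server, which is what makes the content cacheable as a static resource.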

Large-scale application cases

The TFBoys 7th anniversary online concert on NetEase Cloud Music in August 2020 is a typical large-scale chat room case. The event set a world record for an online paid concert with more than 780,000 viewers, and its barrage interaction was implemented with NetEase Yunxin's converged CDN barrage distribution solution. In fact, during preparation, our chat room system reached a performance level of going from 0 to 10 million online users within 20 minutes, with upstream message throughput reaching 1 million TPS.

image.png

The figure above shows the architecture that supported barrage distribution for this event. Ordinary barrage messages and gift messages reach the Yunxin server through the client SDK and the server API respectively, and finally enter the barrage broadcast service. There they are split between the long connection and the CDN, and then delivered to clients through a mix of push and pull.

Featured function: chat room tag application case

In recent years, online education has become more and more popular, and the "super small class" model has recently emerged. A super small class combines a large-scale, multi-person class with small-class interaction.

In online live broadcast scenarios, text interaction is an important component and a typical use case for chat rooms. In the super small class model, however, a conventional chat room system runs into serious problems, whether you create multiple chat rooms or filter messages within a single chat room.

The chat room tag function pioneered by NetEase Yunxin supports these business scenarios well. With chat room tags, we can flexibly support personalized functions such as targeted sending and receiving of chat room messages, targeted management of chat room permissions, and targeted queries of chat room members. This enables group interaction in many scenarios within a large-scale live broadcast, such as grouping and tagging students to tailor teaching to their level, discussions within a group, and competitions (PK) between groups.
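Targeted delivery by tag can be illustrated with a deliberately simplified sketch. Here each connection carries a set of tags assigned at login and a message targets a set of tags, matching any connection that shares at least one. The real tag expression syntax is not described in this article, so this "any of" rule and all names below are assumptions.

```python
def recipients(connections: dict, target_tags: set) -> list:
    """connections maps connection id -> set of tags given at login.

    An empty target means an ordinary broadcast to the whole room.
    """
    if not target_tags:
        return sorted(connections)
    return sorted(cid for cid, tags in connections.items() if tags & target_tags)


# Example room: one lecturer, two students in different groups, one assistant.
room = {
    "lecturer": {"teacher"},
    "student-1": {"group-1"},
    "student-2": {"group-2"},
    "assistant-1": {"group-1", "assistant"},
}
group_1_only = recipients(room, {"group-1"})
```

In this example, a message tagged for `group-1` reaches `student-1` and `assistant-1` only, while an untagged message reaches everyone in the room.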

image.png

The picture above shows a super small class scenario: 1 lecturer + N interactive small classes + N teaching assistants. All students are divided into small classes. The assistant of each class handles preview reminders, after-class Q&A, homework supervision, and feedback on students' progress, while the students also receive the live video from the lecturer. This achieves the scale of a large class with the effect of a small class.

Summary

This article has introduced the main technologies and architectural principles behind NetEase Yunxin's large-scale chat room system. No system is built overnight. Yunxin will keep polishing the underlying technology, as with the introduction of WE-CAN to improve network transmission, and will keep enriching and improving its feature set, as with the industry-first chat room tag function. NetEase Yunxin will continue to invest deeply in the IM field to provide high-quality services to users across scenarios and industries.

About the author

Cao Jiajun, a senior server development engineer at NetEase Yunxin, joined NetEase after graduating from the Chinese Academy of Sciences, and has been responsible for the development of IM servers at NetEase Yunxin. He has rich practical experience in IM system construction and related middleware development.

For more technical articles, follow the [NetEase Smart Enterprise Technology+] WeChat official account.

