
This article was shared by the Xiaomi technical team. The original title, "The Evolution of the Xiao Ai Access Layer to One Million Long Connections per Machine", has been revised.

1. Introduction

The Xiao Ai access layer is the first service in the Xiao Ai cloud that devices connect to, and one of its most important services. This article describes the optimizations and experiments the Xiaomi technical team carried out on this service from 2020 to 2021, which ultimately raised the number of long connections a single machine can carry from 300,000 (30w) to more than 1.2 million (120w+), saving 30+ machines.

Tips: What is "Xiao Ai"?

Xiao Ai is an artificial-intelligence voice interaction engine from Xiaomi. It is built into Xiaomi phones, Xiaomi AI speakers, Xiaomi TVs, and other devices, and serves eight categories of scenarios: personal mobile, smart home, smart wearables, smart office, children's entertainment, smart travel, smart hotels, and smart learning.

(This article was simultaneously published at: http://www.52im.net/thread-3860-1-1.html )

2. Series catalogue

This article is the seventh in a series on long-connection gateway technology. The overall catalogue is as follows:

"Long Connection Gateway Technology Topic (1): Summary of JD JingMai's Production-Grade TCP Gateway Technology Practice"
"Long Connection Gateway Technology Topic (2): Zhihu's Technology Practice of a High-Performance Long-Connection Gateway with Tens of Millions of Concurrent Connections"
"Long Connection Gateway Technology Topic (3): The Technical Evolution of a Hundred-Million-Level Mobile Access Layer Gateway"
"Long Connection Gateway Technology Topic (4): iQIYI's WebSocket Real-Time Push Gateway Technology Practice"
"Long Connection Gateway Technology Topic (5): Ximalaya's Self-Developed Billion-Level API Gateway Technology Practice"
"Long Connection Gateway Technology Topic (6): Graphite Docs' Single-Machine 500,000 WebSocket Long Connection Architecture Practice"
"Long Connection Gateway Technology Topic (7): Architecture Evolution of Xiaomi Xiao Ai's Single-Machine 1.2-Million Long Connection Access Layer" (* this article)

3. What is the Xiao Ai access layer

The overall Xiao Ai architecture is layered as follows:

The access layer's main work spans the authentication/authorization layer and the transport layer, and it is the first service through which every Xiao Ai device interacts with the Xiao Ai Brain.

From the figure above, the key functions of the Xiao Ai access layer are:

1) Secure transport and authentication: maintain a secure channel between devices and the brain, ensuring valid identity authentication and secure data transmission;
2) Long-connection maintenance: keep the long connection (WebSocket, etc.) between device and brain alive, including connection-state storage and heartbeat keep-alive;
3) Request forwarding: forward every request from a Xiao Ai device, ensuring the stability of each request.

4. Technical implementation of the early access layer

The earliest implementation of the Xiao Ai access layer was built on Akka and Play; we used them to construct the first version, whose characteristics were:

1) Based on Akka, it achieved a basic level of asynchrony, ensuring the core threads were never blocked and performance was acceptable.
2) The Play framework supports WebSocket natively, so with limited manpower we could build and ship quickly while keeping the protocol implementation standard-compliant.

5. Problems with the early access layer

As the number of Xiao Ai long connections passed the ten-million mark, we found several problems with the early access layer design.

The main problems were as follows:

1) As the number of long connections grew, more and more in-memory data had to be maintained. JVM GC became a performance bottleneck that could not be ignored, and poorly written code carried the risk of GC trouble. From analysis of past incidents, the upper limit for the Akka+Play version of the access layer was about 280,000 (28w) long connections per instance.

2) The old access layer implementation was rather ad hoc: there were many state dependencies between Akka Actors instead of immutable message passing, which turned Actor communication into ordinary function calls. This made the code hard to read and maintain, and failed to exploit the advantages of Akka Actors for building concurrent programs.

3) As an access layer service, the old version was tightly coupled to protocol parsing, so it had to be redeployed whenever the protocol changed. Each deployment caused mass reconnection of long connections, with an avalanche risk at any time.

4) Because it relied on the Play framework, its long-connection bookkeeping was inaccurate (the underlying TCP connection data could not be accessed), which hurt our routine inspections and capacity assessments. And as connection counts rose, depending on an external framework for long-connection management left us no room for finer-grained optimization.

6. Design goals of the new access layer

Given the various problems of the early access layer, we decided to rebuild it.

Our goals for the new version of the access layer were:

1) Sufficiently stable: deployments should drop as few existing connections as possible, and the service must stay stable;
2) Extreme performance: target at least 1,000,000 (100w) long connections per machine, ideally unaffected by GC;
3) Maximum controllability: apart from the system calls for underlying network I/O, all other code must be written by ourselves or use in-house components, giving us full autonomy.

And so we set out on the long road toward one million long connections on a single machine...

7. Optimization ideas for the new access layer

7.1 Access Layer Dependencies
The relationships between the access layer and external services are as follows:

7.2 Functional division of the access layer
The main functions of the access layer are divided as follows:

1) WebSocket parsing: parse the byte stream received from the client according to the WebSocket protocol;
2) Socket state keeping: store the basic state information of each connection;
3) Encryption/decryption: all data exchanged with the client is encrypted, while transmission to the backend modules is JSON plaintext;
4) Serialization (ordering): on the same physical connection, requests A and B arrive at the server one after another. In the backend, B may be answered before A, but we cannot send B's response to the client as soon as it arrives; we must wait for A to complete and then send the responses in the order A, then B;
5) Backend message distribution: the access layer connects to more than one service and may forward different messages to different services;
6) Authentication: security checks, identity verification, and so on.
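The serialization requirement in point 4 can be illustrated with a small reorder buffer. The sketch below is a simplified, hypothetical C++ version (class and method names are ours, not Xiao Ai's): a response is held until every earlier request on the same connection has been answered, then everything releasable is flushed in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified sketch (not the production code): responses for requests that
// arrived on one connection are released strictly in arrival order, even if
// the backend answers them out of order.
class ResponseSerializer {
public:
    // Called when request `seq` (0, 1, 2, ... in arrival order) receives a
    // response. Returns every response that can now be sent to the client.
    std::vector<std::string> onResponse(uint64_t seq, std::string body) {
        pending_[seq] = std::move(body);
        std::vector<std::string> ready;
        // Flush the contiguous run starting at the next expected sequence.
        while (true) {
            auto it = pending_.find(next_);
            if (it == pending_.end()) break;
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++next_;
        }
        return ready;
    }

private:
    uint64_t next_ = 0;                        // next sequence to send
    std::map<uint64_t, std::string> pending_;  // out-of-order responses
};
```

With requests A (seq 0) and B (seq 1), B's response is buffered until A's arrives; both are then flushed in the order A, B.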

7.3 The idea of splitting the access layer
We split the previous single module into two sub-modules according to whether they hold state.

The details:

1) Front end: stateful, with minimal functionality, deployed as rarely as possible;
2) Back end: stateless, with maximal functionality, deployable without users noticing.

Following these principles, the theoretical functional division is a small front end and a large back end, as the diagram below illustrates.

8. Technical implementation of the new access layer

8.1 Overview

The module was split into a front end and a back end:

1) The front end is stateful, the back end stateless;
2) They are independent processes deployed on the same machine.

Note: the front end is responsible for establishing long connections to devices and maintaining their state, making it a stateful service; the back end handles the concrete business requests and is stateless. A back-end deployment therefore causes no device disconnection, reconnection, or re-authentication, avoiding the needless long-connection churn that version upgrades or logic changes would otherwise bring.

The front end is implemented in C++:

1) The WebSocket protocol is parsed entirely by our own code: every piece of information is available at the socket level, and any bug can be dealt with;
2) Higher CPU efficiency: no extra JVM cost and no GC dragging down performance;
3) Higher memory efficiency: per-connection memory overhead grows with the connection count, and managing it ourselves allows extreme optimization.

The back end is, for now, implemented in Scala:

1) Already-implemented functionality can be migrated directly, far cheaper than rewriting it;
2) Some external dependencies (such as the authentication service) provide Scala (Java) SDKs that can be used directly, but no C++ version; rewriting them in C++ would be very costly;
3) All functionality has been made stateless, so it can be restarted at any time without users noticing.

Communication uses ZeroMQ:
The most efficient form of inter-process communication is shared memory; ZeroMQ's local IPC is likewise extremely fast, so communication speed is not a concern.

8.2 Front-end implementation
Overall structure:

As shown in the figure above, the front end consists of four sub-modules:

1) Transport layer: WebSocket protocol parsing and XMD protocol parsing;
2) Distribution layer: hides the differences between transports; whichever interface the transport layer uses, events are converted into a unified form here and delivered to the state machine;
3) State machine layer: to keep the service purely asynchronous, we built XMFSM, a self-developed Akka-like state machine framework based on the Actor model that implements the single-threaded Actor abstraction;
4) ZeroMQ communication layer: because the ZeroMQ interface is blocking, this layer uses two threads, one for sending and one for receiving.
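XMFSM itself is internal to Xiaomi, but the single-threaded Actor idea it implements can be sketched as follows (all names and states are invented for illustration): each connection is an actor whose mailbox serializes the unified events coming from the distribution layer, so state-transition code never needs locks.

```cpp
#include <cassert>
#include <deque>

// Illustrative sketch of an Actor-style connection state machine; the real
// XMFSM framework is not public, so states, events, and names are invented.
enum class State { Connected, Authed, Closed };
enum class Event { AuthOk, Data, Close };

class ConnectionActor {
public:
    // Events from any transport (ws/wss/xmd) are posted to a mailbox and
    // handled one at a time on a single thread, so handlers need no locks.
    void post(Event e) { mailbox_.push_back(e); }

    void runOnce() {
        while (!mailbox_.empty()) {
            Event e = mailbox_.front();
            mailbox_.pop_front();
            handle(e);
        }
    }

    State state() const { return state_; }

private:
    void handle(Event e) {
        switch (state_) {
        case State::Connected:
            if (e == Event::AuthOk)      state_ = State::Authed;
            else if (e == Event::Close)  state_ = State::Closed;
            break;  // Data before auth is ignored in this sketch
        case State::Authed:
            if (e == Event::Close)       state_ = State::Closed;
            break;  // Data would be forwarded to the backend here
        case State::Closed:
            break;  // terminal state
        }
    }

    State state_ = State::Connected;
    std::deque<Event> mailbox_;
};
```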

8.2.1) Transport layer:

For the WebSocket part, we implemented our own long-connection library, websocket-lib, in C++ on top of ASIO, since Xiao Ai long connections are based on the WebSocket protocol.

The features of this long-connection library are:

a. Lock-free design, ensuring excellent performance;
b. Built on Boost ASIO, ensuring underlying network performance.

The pressure test shows that the performance of the library is very good:

Besides the original WebSocket channel, this layer also handles sending and receiving for the other two channels.

Currently, the transport layer supports three client-facing interfaces:

a. WebSocket over TCP: ws for short;
b. SSL-encrypted WebSocket over TCP: wss for short;
c. XMD over UDP: xmd for short.

8.2.2) Distribution layer:

This layer converts the different transport-layer events into unified events and delivers them to the state machine. It acts as an adapter, guaranteeing that whichever transport sits in front, events arrive at the state machine in a consistent form.

8.2.3) State machine processing layer:

The main processing logic lives in this layer, and a particularly important part of it is the encapsulation of the sending channels.

For the Xiao Ai application-layer protocol, the processing logic of the different channels is identical, but each channel differs in its handling details and security-related logic.

for example:

a. wss traffic needs no application-level encryption or decryption, since Nginx in front of us handles that, while ws traffic must be encrypted with AES before sending;
b. after successful authentication, wss need not send a challenge text to the client, because wss itself performs no encryption or decryption;
c. what xmd sends differs from the other two: it is a private protocol encapsulated with protobuf, and xmd must also handle send-failure logic, whereas ws/wss need not worry about send failures, which the underlying TCP protocol takes care of.

To handle this, we use C++ polymorphism: we abstract a Channel interface whose methods cover the key points of difference in handling a request, such as how to send a message to the client, how to close the connection, and how to handle a send failure. Each of the three sending channels (ws/wss/xmd) has its own Channel implementation.

As soon as a client connection object is created, the concrete Channel object of the matching type is instantiated. The state machine's main logic then implements only the business logic common to all channels, and calls the Channel interface wherever the channels differ. This simple use of polymorphism cleanly isolates the differences and keeps the code tidy.
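The Channel abstraction described above can be sketched roughly like this. Everything here is illustrative (the real interface, the AES step, and the protobuf framing are internal, so the "encryption" below is just a stub): the state machine's common logic talks only to the interface, and each transport supplies its own differences.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the per-transport Channel abstraction; all names
// and the fake "AES"/"PB" framing are invented for demonstration only.
struct Channel {
    virtual ~Channel() = default;
    virtual std::string send(const std::string& payload) = 0;  // frame to emit
    virtual bool needsRetryOnFailure() const = 0;              // xmd only
};

// ws: payload must be AES-encrypted before sending (stubbed here).
struct WsChannel : Channel {
    std::string send(const std::string& p) override { return "AES(" + p + ")"; }
    bool needsRetryOnFailure() const override { return false; }  // TCP handles it
};

// wss: Nginx terminates TLS in front, so the payload goes out as-is.
struct WssChannel : Channel {
    std::string send(const std::string& p) override { return p; }
    bool needsRetryOnFailure() const override { return false; }
};

// xmd (UDP): private protobuf-based framing; send failures must be handled.
struct XmdChannel : Channel {
    std::string send(const std::string& p) override { return "PB(" + p + ")"; }
    bool needsRetryOnFailure() const override { return true; }
};

// The state machine's common logic only ever talks to the interface.
std::string deliver(Channel& ch, const std::string& payload) {
    return ch.send(payload);
}
```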

8.2.4) ZeroMQ communication layer:

This layer makes ZeroMQ's read and write operations asynchronous via two dedicated threads, and also handles the encapsulation and parsing of several private commands.

8.3 Backend Implementation
8.3.1) Stateless transformation:

The most important change to the back end was stripping out all information related to connection state.

The whole service revolves around the Request (one connection can carry N requests), performing all forwarding and processing per request, with each request independent of any previous one. Multiple requests on the same connection are treated by the backend module as separate, unrelated requests.

8.3.2) Architecture:

The Scala service implements its business logic on the Akka Actor architecture.

When the service receives a message from ZeroMQ, it is delivered straight to the Dispatcher for data parsing and request processing. In the Dispatcher, each request is sent to its corresponding RequestActor for Event protocol parsing and then distributed to the business Actor matching that event. Finally, the processed request data is sent onward to the backend AIMS & XMQ services through the XmqActor.

The processing flow of a request in multiple Actors in the backend:

8.3.3) Dispatcher request distribution:

The front and back ends interact via Protobuf, which avoids the performance cost of JSON parsing and makes the protocol more standardized.

After the backend service receives a message from ZeroMQ, the DispatcherActor parses the PB protocol and processes the data according to its command type (CMD for short). The command types are as follows.

  • BIND command:

Authentication. Because the authentication logic is complex and hard to implement in C++, it remains in the Scala business layer. This part parses the HTTP headers of the device's request, extracts the token, performs authentication, and returns the result to the front end.

  • LOGIN command:

Device login. Once the device has been authenticated, the connection is considered established. The Login command then sends the long-connection information to AIMS and records it in the Varys service, enabling later server-initiated push and similar features. During Login, the service first asks the Account service for the long connection's uuid (used for routing during the connection's lifetime), then sends the device information together with the uuid to AIMS to perform the device login.

  • LOGOUT command:

Device logout. When a device disconnects from the server, it performs the Logout operation so that its long-connection record is deleted from the Varys service.

  • UPDATE and PING commands:

a. The Update command updates device status information stored in the database;
b. The Ping command keeps the connection alive, confirming that the device is still online.

  • TEXT_MESSAGE and BINARY_MESSAGE:

Text and binary messages. When a text or binary message is received, it is handed to the RequestActor matching its requestId for processing.

8.3.4) Request parsing:

For received text and binary messages, the DispatcherActor forwards each to its corresponding RequestActor according to the requestId.

A text message is parsed as an Event request and distributed to the designated business Actor according to its namespace and name; a binary message is distributed to the business Actor matching the request's current business scene.
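The routing of parsed Events can be sketched as a dispatch table (the namespace, name, and handler below are invented examples, not Xiao Ai's real protocol): once the header yields a "namespace.name" pair, it selects the business handler directly.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Illustrative sketch of Event dispatch by namespace/name; all names are
// invented for demonstration. Binary messages, which route by the request's
// current business scene, are not modeled here.
class EventDispatcher {
public:
    using Handler = std::function<std::string(const std::string&)>;

    void registerBusiness(const std::string& ns, const std::string& name,
                          Handler h) {
        business_[ns + "." + name] = std::move(h);
    }

    // Route a parsed Event to its business handler by "namespace.name".
    std::string dispatch(const std::string& ns, const std::string& name,
                         const std::string& payload) {
        auto it = business_.find(ns + "." + name);
        if (it == business_.end()) return "no-handler";
        return it->second(payload);
    }

private:
    std::map<std::string, Handler> business_;
};
```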

8.4 Other optimizations
While completing the 1.0 adjustment of the new architecture, we also continuously stress-tested long-connection capacity and summarized several factors with a large impact on capacity.

8.4.1) Protocol optimization:

a. Replacing JSON with Protobuf: early front-end/back-end communication used a JSON text protocol, and we later found that JSON serialization and deserialization consumed a lot of CPU. After switching to Protobuf, CPU usage dropped markedly.

b. Partial JSON parsing: the business-layer protocol is JSON-based and cannot simply be replaced. Instead we "partially parse" the JSON: only the small header is parsed to obtain the namespace and name, most messages are then forwarded as-is, and only a small number of JSON messages are fully deserialized into objects. This optimization cut CPU usage by 10%.
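A naive version of this partial-parsing trick can be sketched as below (assuming, unlike a production implementation, that field values contain no escaped quotes; the header layout is also our invented example): only the two header fields are extracted as raw strings, and the rest of the message is never deserialized.

```cpp
#include <cassert>
#include <string>

// Naive sketch of "partial JSON parsing": pull one string field out of a
// JSON text without deserializing the whole message. Assumes the value has
// no escaped quotes; a production version would use a streaming parser that
// stops after the header.
std::string extractField(const std::string& json, const std::string& key) {
    const std::string marker = "\"" + key + "\"";
    size_t k = json.find(marker);
    if (k == std::string::npos) return "";
    size_t colon = json.find(':', k + marker.size());
    if (colon == std::string::npos) return "";
    size_t open = json.find('"', colon + 1);      // opening quote of the value
    if (open == std::string::npos) return "";
    size_t close = json.find('"', open + 1);      // closing quote of the value
    if (close == std::string::npos) return "";
    return json.substr(open + 1, close - open - 1);
}
```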

8.4.2) Extend heartbeat time:

When we first tested 200,000 (20w) connections, we found that among the messages exchanged between the front and back ends, the heartbeat PING used to keep users online accounted for 75% of total message volume, and sending and receiving it consumed a great deal of CPU. We therefore extended the heartbeat interval to reduce CPU consumption.
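Assuming business traffic stays constant (our assumption, not stated in the article), the effect of stretching the heartbeat interval can be worked out directly: a 75% ping share means pings arrive three times as often as business messages, so stretching the interval by a factor k reduces the ping share to 3/(3+k), e.g. tripling the interval cuts pings from 75% to 50% of all messages. A small sketch of that back-of-the-envelope model:

```cpp
#include <cassert>
#include <cmath>

// Back-of-the-envelope model (our assumption, not from the article): with an
// initial ping share s, pings arrive at r = s/(1-s) times the business rate.
// Stretching the heartbeat interval by factor k divides the ping rate by k,
// giving a new share of (r/k) / (r/k + 1).
double pingShareAfterStretch(double initialShare, double k) {
    double r = initialShare / (1.0 - initialShare);  // pings per business msg
    return (r / k) / (r / k + 1.0);
}
```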

8.4.3) Self-developed intranet communication library:

To improve the performance of communication with backend services, we use a self-developed TCP communication library: a purely asynchronous, multi-threaded TCP network library built on Boost ASIO. Its excellent performance helped us push the connection count to 120w+.

9. Future planning
The optimizations of version 1.0 of the new architecture confirmed that our split was the right direction, because the preset goals were met:

1) Single-machine connection capacity went from 28w to 120w+ (on an ordinary server with 16 GB memory and 40 cores, peak request QPS exceeds 10,000), and decommissioning access layer machines cut machine costs by more than 50%;
2) The back end can be deployed with zero loss.

Re-examining our ideal goal with this as the direction, we arrived at the outline of version 2.0:

Specifically:

1) Rewrite the back-end module in C++ to further improve performance and stability, and run the parts that cannot be rewritten in C++ as independent service modules called by the back end over the network library;
2) Move non-essential functions from the front-end module into the back end wherever possible, so the front end does less and stays more stable;
3) If the transformed front and back ends differ greatly in processing capacity, then, given that ZeroMQ actually has performance to spare, consider replacing ZeroMQ with the network library, changing deployment from 1:1 on a single machine to 1:N across machines and making better use of machine resources.

The goal of version 2.0: after the above transformation, a single front-end module is expected to reach a capacity of 200w+ connections.





JackJiang

Focused on the study and research of instant messaging (IM/push) technology.