Long-Connection Gateway Technology Series (6): Graphite Documents' Architecture Practice for 500,000 WebSocket Long Connections on a Single Machine

This article was shared by Du Minxiang of the Graphite Documents technology team. The original title was "Practice of Graphite Document Websocket Million-Long Connection Technology"; this version has been revised.

1. Introduction

Several Graphite Documents features, such as document sharing, comments, slide presentations, and follow mode in documents and tables, require real-time data synchronization across multiple clients and online push of bulk data from the server. Plain HTTP cannot satisfy scenarios in which the server actively pushes data, so we chose WebSocket for this business development.

As the Graphite Documents business has grown, the daily peak number of connections has reached the order of one million. The growing number of user connections, combined with an architecture that was not designed for this order of magnitude, caused memory and CPU usage to rise sharply, so we decided to refactor the long-connection gateway.

This article shares the evolution of the Graphite Documents long-connection gateway from the 1.0 architecture to 2.0 and summarizes the performance optimization practice along the way.

(This article was published synchronously at: http://www.52im.net/thread-3757-1-1.html)

2. Series catalogue

This article is the sixth in a series. The series so far:

"Long-Connection Gateway Technology Series (1): Summary of Jingdongmai's Production-Grade TCP Gateway Technology Practice"
"Long-Connection Gateway Technology Series (2): Zhihu's High-Performance Long-Connection Gateway Practice with Tens of Millions of Concurrent Connections"
"Long-Connection Gateway Technology Series (3): The Technical Evolution of Mobile Taobao's Mobile Access-Layer Gateway"
"Long-Connection Gateway Technology Series (4): iQIYI's WebSocket Real-Time Push Gateway Technology Practice"
"Long-Connection Gateway Technology Series (5): Himalaya's Self-Developed Billion-Level API Gateway Technology Practice"
"Long-Connection Gateway Technology Series (6): Graphite Documents' Architecture Practice for 500,000 WebSocket Long Connections on a Single Machine" (* this article)

3. Problems faced by v1.0 architecture

Version 1.0 of the long-connection gateway was a modified Socket.IO implementation running on Node.js, which met the needs of the business scenarios at the user scale of the time.

3.1 Architecture introduction
Architecture design diagram of version 1.0:

Version 1.0 client connection process:

1) The user connects to the gateway through Nginx, and the business service is notified of the connection;
2) After the business service detects the user connection, it queries the related user data and publishes a message to Redis;
3) The gateway service receives the message through a Redis subscription;
4) The gateway queries the user session data in the gateway cluster and pushes the message to the client.

3.2 Problems faced
Although the version 1.0 long-connection gateway ran well in production, it could not support the subsequent expansion of the business.

Several issues needed to be resolved:

1) Resource consumption: Nginx was used only for TLS decryption and request pass-through, which wasted a lot of resources. The Node.js gateway also performed poorly and consumed a lot of CPU and memory;
2) Maintenance and observability: the gateway was not connected to the Graphite monitoring system and could not use the existing monitoring and alerting, which made maintenance difficult;
3) Business coupling: the business service and the gateway function were integrated into the same service, so the business part could not be scaled horizontally to address its own performance cost. Solving the performance problem and enabling later module expansion both required decoupling the services.

4. Practice of v2.0 architecture evolution

4.1 Overview
Version 2.0 of the long-connection gateway had to solve many problems.

For example, Graphite Documents contains many components (documents, tables, slides, forms, and so on). In version 1.0, components could make business calls to the gateway through Redis, Kafka, and HTTP interfaces, so the origin of a call could not be traced and was hard to control.

In addition, from the perspective of performance optimization, the original service had to be decoupled by splitting the version 1.0 gateway into a gateway function part and a business processing part.

Specifically:

1) The gateway function part is WS-Gateway: it integrates user authentication, TLS certificate verification, WebSocket connection management, and so on;
2) The business processing part is WS-API: component services communicate with this service directly via gRPC.

As a result:

1) Capacity can be expanded for specific modules;
2) With the service refactored and Nginx removed, overall hardware consumption is significantly reduced;
3) The services are integrated into the Graphite monitoring system.

4.2 Overall architecture
Architecture design diagram of version 2.0:

The 2.0 version client connection process:

1) The client establishes a WebSocket connection with the WS-Gateway service through a handshake process;
2) After the connection is established, the WS-Gateway service stores the session on the node, caches the connection mapping in Redis, and pushes a client-online message to WS-API through Kafka;
3) WS-API receives client-online messages and client uplink messages through Kafka;
4) The WS-API service preprocesses and assembles the message, which includes fetching the data needed for the push from Redis and applying the filtering logic for the push, and then publishes the message to Kafka;
5) WS-Gateway subscribes to Kafka to obtain the messages the server needs to return, and pushes them to the clients one by one.

4.3 Handshake process
If the network is in good condition, the client completes steps 1 to 6 shown in the figure below and then enters the WebSocket flow directly. If the network is poor, WebSocket communication degrades to HTTP: the client pushes messages to the server with POST requests and reads the data returned by the server through GET long polling.

The handshake process when the client first requests a connection to the server:

The process is as follows:

1) The client sends a GET request to try to establish a connection;
2) The server returns the connection data. sid is the unique Socket ID generated for this connection and serves as the credential for subsequent interactions:
{"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}
3) The client sends another request carrying the sid from step 2;
4) The server returns 40, indicating that the request was received successfully;
5) The client sends a POST request to confirm the state of the fallback path;
6) The server returns ok, which completes the first phase of the handshake;
7) The client tries to initiate a WebSocket connection. It first performs the 2probe and 3probe request and response; once the channel is confirmed to be clear, normal WebSocket communication can begin.
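
As a rough illustration of step 2, the sketch below decodes the handshake body on the client side and checks whether the WebSocket upgrade was offered. The type and function names (OpenPacket, ParseOpenPacket, CanUpgrade) are hypothetical, the sample values are made up, and treating the two ping fields as milliseconds is an assumption of this sketch rather than something stated in the article.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// OpenPacket mirrors the JSON keys of the handshake response shown in step 2.
// The struct name and the millisecond interpretation are assumptions.
type OpenPacket struct {
	SID          string   `json:"sid"`
	Upgrades     []string `json:"upgrades"`
	PingInterval int64    `json:"pingInterval"`
	PingTimeout  int64    `json:"pingTimeout"`
}

// ParseOpenPacket decodes the handshake body and rejects a response without a sid.
func ParseOpenPacket(body []byte) (OpenPacket, error) {
	var p OpenPacket
	if err := json.Unmarshal(body, &p); err != nil {
		return OpenPacket{}, err
	}
	if p.SID == "" {
		return OpenPacket{}, fmt.Errorf("handshake response has no sid")
	}
	return p, nil
}

// CanUpgrade reports whether the server offered the WebSocket upgrade.
// If it did not, the client would stay on the HTTP long-polling path.
func (p OpenPacket) CanUpgrade() bool {
	for _, u := range p.Upgrades {
		if u == "websocket" {
			return true
		}
	}
	return false
}

// Interval converts the ping interval to a time.Duration (assumes milliseconds).
func (p OpenPacket) Interval() time.Duration {
	return time.Duration(p.PingInterval) * time.Millisecond
}

func main() {
	// Sample body with made-up values in place of the "xxx" placeholders.
	body := []byte(`{"sid":"abc123","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}`)
	p, err := ParseOpenPacket(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.SID, p.CanUpgrade(), p.Interval())
}
```

A client built this way would carry p.SID in every later request and only attempt the 2probe/3probe exchange when CanUpgrade returns true.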

4.4 TLS memory consumption optimization
The client and server establish connections over the wss protocol. In version 1.0, the TLS certificate was mounted on Nginx, and Nginx completed the HTTPS handshake. To reduce the machine cost of Nginx, version 2.0 mounts the certificate on the service itself.

Analysis of the service's memory, shown in the figure below, found that the TLS handshake accounts for about 30% of total memory consumption.

This memory consumption cannot be avoided, which leaves two options:

1) Use layer-7 load balancing, mount the TLS certificate on the layer-7 load balancer, and hand the TLS handshake over to a better-performing tool;
2) Optimize the performance of Go's TLS handshake. In a conversation with industry expert Cao Chunhui (Cao Da), I learned that he had recently submitted a PR to the official Go library, along with related performance test data.

4.5 Socket ID design
Each connection must be given a unique ID. Duplicate IDs would cause connections to be mixed up and messages to be pushed to the wrong clients. We chose the SnowFlake algorithm to generate the unique IDs.

In the physical machine scenario, assigning a fixed number to the physical machine that hosts each replica ensures that the Socket IDs generated by the service on each replica are unique.

In the K8S scenario this approach is not feasible, so the number is issued through registration. After starting, every WS-Gateway replica writes its startup information to the database and obtains a replica number, which is passed to the SnowFlake algorithm as the replica-number parameter for Socket ID generation. A restarted service inherits its existing replica number, and when a new version is released, new replica numbers are issued from the auto-increment ID.

At the same time, each WS-Gateway replica writes heartbeat information to the database, which serves as the basis for the gateway service's own health check.
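
To make the role of the replica number concrete, here is a minimal SnowFlake-style generator. It is a sketch under stated assumptions, not the gateway's actual code: the bit layout (10 replica bits, 12 sequence bits), the custom epoch, and the names Generator, NewGenerator and Next are all choices made for this example. The replica number passed to NewGenerator stands in for the number obtained from the database registration described above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	replicaBits  = 10                   // assumed layout: up to 1024 replicas
	sequenceBits = 12                   // assumed layout: up to 4096 IDs per replica per millisecond
	maxReplica   = 1<<replicaBits - 1
	maxSequence  = 1<<sequenceBits - 1
	epochMs      = int64(1609459200000) // 2021-01-01 00:00:00 UTC, an arbitrary custom epoch
)

// Generator produces IDs that are unique as long as every replica
// is started with a different replica number.
type Generator struct {
	mu       sync.Mutex
	replica  int64
	lastMs   int64
	sequence int64
}

// NewGenerator validates the replica number issued at registration time.
func NewGenerator(replica int64) (*Generator, error) {
	if replica < 0 || replica > maxReplica {
		return nil, fmt.Errorf("replica number %d out of range [0, %d]", replica, maxReplica)
	}
	return &Generator{replica: replica}, nil
}

func nowMs() int64 { return time.Now().UnixNano() / int64(time.Millisecond) }

// Next returns the next ID: timestamp | replica | sequence.
func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	ms := nowMs()
	if ms < g.lastMs {
		ms = g.lastMs // clock went backwards: stay on the last timestamp
	}
	if ms == g.lastMs {
		g.sequence = (g.sequence + 1) & maxSequence
		if g.sequence == 0 {
			for ms <= g.lastMs { // sequence exhausted: spin until the next millisecond
				ms = nowMs()
			}
		}
	} else {
		g.sequence = 0
	}
	g.lastMs = ms
	return (ms-epochMs)<<(replicaBits+sequenceBits) | g.replica<<sequenceBits | g.sequence
}

func main() {
	g, err := NewGenerator(7) // 7 stands in for the registered replica number
	if err != nil {
		panic(err)
	}
	fmt.Println(g.Next(), g.Next())
}
```

Because the replica number occupies its own bit range, two replicas can never produce the same ID even within the same millisecond, which is why issuing distinct replica numbers matters more than the exact layout.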

4.6 Cluster Session Management Solution: Event Broadcast
After the client completes the handshake, the session data is stored in the memory of the current gateway node, and the serializable part of the data is stored in Redis. The storage structure is shown in the figure below.

When a message push is triggered by a client or a component service, the WS-API service uses the data structures stored in Redis to look up the Socket IDs of the target clients for the message body, and the WS-Gateway service then consumes the messages as a cluster. If a Socket ID is not on the current node, the relationship between nodes and sessions must be resolved to find the WS-Gateway node that actually holds that client's Socket ID. There are usually two solutions (shown in the figure below).

After deciding to use event broadcast to pass messages between gateway nodes, we still had to choose the specific message middleware, and listed three candidates (shown in the figure below).

We therefore ran 1,000,000 enqueue and dequeue operations against Redis and the other MQ middleware. The test showed that Redis performed very well when the data was smaller than 10K.

Combined with the actual situation (the broadcast content is about 1K in size, the business scenario is simple and fixed, and compatibility with historical business logic is required), we finally selected Redis for message broadcast.
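
The event-broadcast idea can be sketched without any middleware: every gateway node receives every event, and only the node that holds the target Socket ID in its local session table delivers it. In the sketch below the Redis channel is replaced by a plain loop, and Event, Node, Handle and Broadcast are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical broadcast message: a target Socket ID plus an opaque payload.
type Event struct {
	SocketID int64
	Payload  []byte
}

// Node stands in for one WS-Gateway replica and its in-memory session table.
type Node struct {
	Name     string
	sessions sync.Map // Socket ID (int64) -> chan []byte, a stand-in for the real connection
}

// Register records that this node holds the connection for the given Socket ID.
func (n *Node) Register(id int64, out chan []byte) { n.sessions.Store(id, out) }

// Handle delivers the event if this node holds the session and reports whether it did.
func (n *Node) Handle(ev Event) bool {
	v, ok := n.sessions.Load(ev.SocketID)
	if !ok {
		return false // session lives on another node: ignore the event
	}
	select {
	case v.(chan []byte) <- ev.Payload:
		return true
	default:
		return false // outbound buffer full: this sketch simply drops the message
	}
}

// Broadcast simulates the pub/sub fan-out: every node sees every event.
func Broadcast(nodes []*Node, ev Event) int {
	delivered := 0
	for _, n := range nodes {
		if n.Handle(ev) {
			delivered++
		}
	}
	return delivered
}

func main() {
	a, b := &Node{Name: "gateway-a"}, &Node{Name: "gateway-b"}
	out := make(chan []byte, 1)
	b.Register(42, out)

	n := Broadcast([]*Node{a, b}, Event{SocketID: 42, Payload: []byte("hello")})
	fmt.Println(n, string(<-out))
}
```

The trade-off is that every node pays the cost of receiving every event, in exchange for not having to maintain and query a node-to-session routing table.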

In the future, WS-API and WS-Gateway could also be connected pairwise and use gRPC bidirectional streaming to save intranet traffic.

4.7 Heartbeat mechanism
After the session is stored in node memory and Redis, the client must keep updating the session timestamp by reporting heartbeats. The client reports heartbeats at the interval issued by the server. The reported timestamp is first updated in memory and then synchronized to Redis on a separate cycle, which prevents a large number of clients reporting heartbeats at the same time from putting pressure on Redis.

The specific process:

1) After the client establishes a WebSocket connection, the server sends down the heartbeat reporting parameters;
2) The client sends heartbeat packets according to these parameters, and the server updates the session timestamp when it receives a heartbeat;
3) Any other uplink data from the client also triggers an update of the corresponding session timestamp;
4) The server periodically cleans up timed-out sessions and runs the active close process;
5) The timestamp data updated in Redis is used to clean up the relationships between WebSocket connections and users and files.
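
The two-level update described above (memory first, Redis on a separate cycle) might look roughly like the following. This is only a sketch: Session, Touch and Flush are hypothetical names, and the Redis write is represented by the map that Flush returns, which a real service would send to Redis in one batch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Session keeps the heartbeat timestamp in memory.
type Session struct {
	ID        int64
	lastTS    int64 // updated on every heartbeat or uplink message (memory only)
	flushedTS int64 // last value handed over to the Redis synchronization
}

// Touch is the cheap per-heartbeat path: one atomic store, no network call.
func (s *Session) Touch(now int64) { atomic.StoreInt64(&s.lastTS, now) }

// Flush runs on its own, slower cycle. It collects only the sessions whose
// timestamp changed since the previous flush, so many heartbeats from the
// same client between two flushes collapse into a single Redis update.
func Flush(sessions []*Session) map[int64]int64 {
	dirty := make(map[int64]int64)
	for _, s := range sessions {
		ts := atomic.LoadInt64(&s.lastTS)
		if ts > atomic.LoadInt64(&s.flushedTS) {
			atomic.StoreInt64(&s.flushedTS, ts)
			dirty[s.ID] = ts
		}
	}
	return dirty
}

func main() {
	a, b := &Session{ID: 1}, &Session{ID: 2}
	a.Touch(100)
	a.Touch(101) // a second heartbeat before the flush costs nothing extra
	fmt.Println(Flush([]*Session{a, b}))
	fmt.Println(Flush([]*Session{a, b})) // nothing changed since the last flush
}
```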

Session data memory and Redis cache cleaning logic:

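// Fragment: t is a ticker; dispatcher, expireTime and clearTimeout are defined elsewhere in the service.
for {
	select {
	case <-t.C:
		var now = time.Now().Unix()
		var clients = make([]*Connection, 0)
		dispatcher.clients.Range(func(_, v interface{}) bool {
			client := v.(*Connection)
			lastTs := atomic.LoadInt64(&client.LastMessageTS)
			if now-lastTs > int64(expireTime) {
				// No heartbeat or uplink data within expireTime: collect for closing.
				clients = append(clients, client)
			} else {
				// Session is still alive: pass its latest timestamp to the Redis mapping cleanup.
				dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
			}
			return true
		})

		// Close the expired connections after the Range call has finished.
		for _, cli := range clients {
			cli.WsClose()
		}
	}
}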

On top of the existing two-level cache refresh mechanism, a dynamic heartbeat reporting frequency further reduces the server load caused by heartbeats. By default, the client reports a heartbeat to the server every 1s. Assuming a single machine currently carries 500,000 connections, the heartbeat QPS is: QPS1 = 500000/1.

To optimize server performance, the interval is adjusted dynamically while heartbeats are normal: after every x normal heartbeat reports, the heartbeat interval increases by a, up to an upper limit of y. The minimum dynamic QPS is then: QPS2 = 500000/y.

In the extreme case, the QPS generated by heartbeats is reduced by a factor of y. After a single heartbeat timeout, the server immediately resets the interval to 1s and retries. This strategy reduces the performance cost of heartbeats on the server while preserving connection quality.

4.8 Custom Headers
Kafka custom headers are used to avoid the performance cost of decoding the message body at the gateway layer.

After the client's WebSocket connection is established, a series of business operations follows. We put the operation instructions and necessary parameters exchanged between WS-Gateway and WS-API into Kafka headers. For example, when X-XX-Operator is set to Broadcast, the gateway reads the file number from X-XX-Guid and pushes the message to all users in that file.

A trace ID and timestamps are also written into the Kafka headers, which makes it possible to trace the complete consumption path of a message and the time spent in each stage.
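
A minimal sketch of header-based routing follows. To stay independent of any particular Kafka client library, Header and Message here are simplified stand-in types defined in the example itself, and Route and RouteOf are hypothetical names. The header keys X-XX-Operator and X-XX-Guid and the Broadcast value are the placeholders used in the text above.

```go
package main

import "fmt"

// Header and Message are simplified stand-ins for a Kafka client's record types.
type Header struct {
	Key   string
	Value []byte
}

type Message struct {
	Headers []Header
	Value   []byte // opaque body: the gateway forwards it without decoding
}

// headerValue returns the first header with the given key.
func headerValue(m Message, key string) (string, bool) {
	for _, h := range m.Headers {
		if h.Key == key {
			return string(h.Value), true
		}
	}
	return "", false
}

// Route is what the gateway needs to know in order to dispatch a message.
type Route struct {
	Operator string // e.g. "Broadcast"
	GUID     string // file number whose users should receive the message
}

// RouteOf derives the routing decision from headers only; m.Value is never touched.
func RouteOf(m Message) (Route, error) {
	op, ok := headerValue(m, "X-XX-Operator")
	if !ok {
		return Route{}, fmt.Errorf("missing X-XX-Operator header")
	}
	guid, _ := headerValue(m, "X-XX-Guid")
	return Route{Operator: op, GUID: guid}, nil
}

func main() {
	m := Message{
		Headers: []Header{
			{Key: "X-XX-Operator", Value: []byte("Broadcast")},
			{Key: "X-XX-Guid", Value: []byte("file-1")},
		},
		Value: []byte("already-serialized body"),
	}
	r, err := RouteOf(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Operator, r.GUID, len(m.Value))
}
```

Since the body is passed through untouched, the gateway pays only for a few short header lookups per message instead of a full deserialization.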

4.9 Message receiving and sending
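type Packet struct {
	...
}

type Connect struct {
	*websocket.Conn
	send chan Packet // outbound queue drained by writer()
}

func NewConnect(conn net.Conn) *Connect {
	c := &Connect{
		send: make(chan Packet, N),
	}

	go c.reader() // one goroutine reads from the connection
	go c.writer() // one goroutine waits on c.send and writes to the connection
	return c
}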

The first version of the message exchange between client and server was written roughly as shown above.

A stress test of the demo found that each WebSocket connection occupied 3 goroutines. Each goroutine needs its own stack memory, so the capacity of a single machine was very limited.

The main constraint was the large memory footprint, and c.writer() was idle most of the time, so we considered whether the interaction could be handled with only 2 goroutines.

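type Packet struct {
	...
}

type Connect struct {
	*websocket.Conn
	mux sync.RWMutex // serializes writes now that there is no writer goroutine
}

func NewConnect(conn net.Conn) *Connect {
	// The send channel from the first version is gone, so it is no longer allocated here.
	c := &Connect{}

	go c.reader() // only the reader goroutine remains
	return c
}

// Write is called directly by whoever needs to push data to the client.
func (c *Connect) Write(data []byte) (err error) {
	c.mux.Lock()
	defer c.mux.Unlock()
	...
	return nil
}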

The c.reader() goroutine is kept, because reading data from the buffer by polling could cause read delays or locking problems. The c.writer() operation is changed to be called on demand instead of running in a goroutine that listens continuously, which reduces memory consumption.

We also investigated event-driven, lightweight, high-performance network libraries such as gev and gnet. Actual tests showed that message delays could occur with a large number of connections, so they are not used in production.

4.10 Core Object Cache
With the data receiving and sending logic settled, the core object of the gateway is the Connection object, and functions such as run, read, write, and close are built around it.

sync.Pool is used to cache this object and reduce GC pressure. When a connection is created, the Connection object is taken from the object pool.

When its life cycle ends, the Connection object is reset and put back into the pool.

In practice, it is advisable to wrap this in GetConn() and PutConn() functions that centralize data initialization, object reset, and similar operations.

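var ConnectionPool = sync.Pool{
	New: func() interface{} {
		return &Connection{}
	},
}

// GetConn takes a Connection from the pool (New allocates one if the pool is empty).
func GetConn() *Connection {
	cli := ConnectionPool.Get().(*Connection)
	return cli
}

// PutConn resets the Connection and returns it to the pool.
func PutConn(cli *Connection) {
	cli.Reset()
	ConnectionPool.Put(cli) // put back into the connection pool
}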

4.11 Optimization of data transmission process
During message flow, the transmission efficiency of the message body must also be optimized. MessagePack is used to serialize the message body and reduce its size, and the MTU value is tuned to avoid packet fragmentation. Let a be the probe packet size, and use the following command to probe the MTU limit toward the target service IP:

ping -s {a} {ip}

When a = 1400, the actual transmitted packet size is 1428.

Of the extra 28 bytes, 8 are the ICMP echo request/echo reply header and 20 are the IP header.

If a is set too large, the response times out. When packets in the real environment exceed this value, they are fragmented.

While tuning for a suitable MTU value, the message body is serialized with MessagePack to further reduce the packet size and CPU consumption.
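
The 28-byte overhead makes it easy to convert between the ping payload size and the on-wire packet size. The helper names below are hypothetical, and the 1500-byte MTU in main is only the common Ethernet default used as an example, not a value from the tests.

```go
package main

import "fmt"

const (
	ipv4HeaderBytes = 20 // IPv4 header without options
	icmpHeaderBytes = 8  // ICMP echo request / echo reply header
)

// OnWireSize returns the IP packet size produced by `ping -s payload`.
func OnWireSize(payload int) int {
	return payload + icmpHeaderBytes + ipv4HeaderBytes
}

// MaxPayload returns the largest `ping -s` value that still fits into one
// packet for the given MTU, i.e. without fragmentation.
func MaxPayload(mtu int) int {
	return mtu - icmpHeaderBytes - ipv4HeaderBytes
}

func main() {
	fmt.Println(OnWireSize(1400)) // the a = 1400 case from the text
	fmt.Println(MaxPayload(1500)) // example: common Ethernet MTU
}
```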

4.12 Infrastructure support
The service is developed with the EGO framework, which provides business log printing, asynchronous log output, dynamic log level adjustment, and other functions that ease online troubleshooting and improve logging efficiency, together with a microservice monitoring system that monitors CPU, P99, memory, goroutines, and so on.

Client Redis monitoring:

Client Kafka monitoring:

Custom monitoring dashboard:

5. Time to check the results: performance stress test

5.1 Stress test preparation
The test platforms prepared were:

1) One virtual machine with 4 cores and 8G of memory as the server, with a target load of 480,000 connections;
2) Eight virtual machines with 4 cores and 8G of memory as clients, each opening 60,000 ports.

5.2 Simulation scenario one
Users come online, with 500,000 online users.

The peak number of connections established per second on a single WS-Gateway was 16,000/s, and each user occupied 47K of memory.

5.3 Simulation scenario two
The test lasts 15 minutes with 500,000 online users. A message is pushed to all users every 5s, and users send receipts.

The push content is:

42["message",{"type":"xx","data":{"type":"xx","clients":[{"id":xx,"name":"xx","email":"xx@xx.xx","avatar":"ZgG5kEjCkT6mZla6.png","created_at":1623811084000,"name_pinyin":"","team_id":13,"team_role":"member","merged_into":0,"team_time":1623811084000,"mobile":"+xxxx","mobile_account":"","status":1,"has_password":true,"team":null,"membership":null,"is_seat":true,"team_role_enum":3,"register_time":1623811084000,"alias":"","type":"anoymous"}],"userCount":1,"from":"ws"}}]

Five minutes into the test, the service restarted abnormally because memory usage exceeded the limit.

Analysis of why memory exceeded the limit:

The newly added broadcast code used 9.32% of the memory:

The part that receives user receipt messages consumed 10.38% of the memory:

The test rules were then adjusted: the test lasts 15 minutes with 480,000 online users, a message is pushed to all users every 5s, and users send receipts.

The push content is:

Peak connection establishment rate: 10,000/s; peak received data: 96,000/s; peak sent data: 96,000/s.

5.4 Simulation scenario three
The test lasts 15 minutes with 500,000 online users. A message is pushed to all users every 5s, and users do not need to send receipts.

The push content is:

Peak connection establishment rate: 11,000/s; peak sent data: 100,000/s. Nothing abnormal occurred apart from high memory usage.

Memory consumption was extremely high. The flame graph shows that most of it was spent in the broadcast operation that runs every 5s.

5.5 Simulation scenario four
The test lasts 15 minutes with 500,000 online users. A message is pushed to all users every 5s, and users send receipts. In addition, 40,000 users go online and offline every second.

The push content is:

Peak connection establishment rate: 18,570/s; peak received data: 329,949/s; peak sent data: 393,542/s. Nothing abnormal occurred.

5.6 Summary of the stress tests
On hardware with 16 cores and 32G of memory and 500,000 connections on a single machine, the four scenarios above, covering users going online and offline, message receipts, and so on, were tested. Memory and CPU consumption met expectations, and the service remained stable under long-running stress tests.

The test results basically meet the resource-saving requirements at the current scale, and we believe feature development can continue on this basis.

6. Summary of this article

With the number of users growing, refactoring the gateway service was imperative.

The refactoring mainly involved:

1) Decoupling the gateway service from the business service and removing the dependency on Nginx, which makes the overall architecture clearer;
2) Analyzing the whole flow from the user establishing a connection to the underlying business pushing messages, and optimizing each step specifically.

Version 2.0 of the long-connection gateway consumes fewer resources, uses less memory per user, and has a more complete monitoring and alerting system, which makes the gateway service itself more reliable.

The optimizations above fall mainly into the following areas:

1) A degradable handshake process;
2) Socket ID generation;
3) Optimization of client heartbeat handling;
4) Custom headers that avoid message decoding and strengthen link tracing and monitoring;
5) Optimized code structure for receiving and sending messages;
6) Use of an object pool as a cache to reduce GC frequency;
7) Serialization and compression of the message body;
8) Integration with the service observability infrastructure to ensure service stability.

Besides ensuring gateway performance, the refactoring also consolidates how the underlying component services call the gateway: the previous HTTP, Redis, Kafka, and other methods are unified into gRPC calls, so that the origin of every call can be traced and controlled. This lays a better foundation for future business integration.

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".
