Alibaba IM technology sharing (5): Timeliness optimization practice of Xianyu billion-level IM messaging system

This article was shared by Youyou of the Alibaba Xianyu technical team. The original title was "Bye-bye to message delays: Xianyu's plan for timely message arrival"; it has been revised and edited here. Thanks to the author for sharing.

1. Introduction

IM messaging is an important tool for Xianyu users to consult each other about transactions. Its core objectives are:

1) First, to ensure that users' messages are not lost;
2) Second, to ensure that users' messages are delivered to the recipient in time.

IM messages are delivered through either the offline push channel or the online push channel, depending on whether the recipient's device is online. Data shows that more than half of Xianyu's daily IM messages go through the online channel, so the arrival rate and timeliness of online messages directly affect the user experience.

Based on the Xianyu IM messaging system's optimization practice around message timeliness, this article analyzes in detail the technical problems faced by the IM online channel and describes the techniques used to address them, so that user messages arrive in time.

PS: If you are not yet familiar with the reliability of IM messages, it is recommended to first read the introductory article "Introduction to zero-based IM development (2): What is the real-time nature of the IM system?".

(This article was published simultaneously at: http://www.52im.net/thread-3726-1-1.html)

2. Series of articles

This article is the fifth in a series. The full series is as follows:

"Alibaba IM Technology Sharing (1): The Enterprise-level IM King: What Sets DingTalk's Back-end Architecture Apart"
"Alibaba IM Technology Sharing (2): Xianyu IM's Flutter-based mobile terminal cross-terminal transformation practice"
"Alibaba IM Technology Sharing (3): The Road to the Evolution of the Architecture of Xianyu's Billion-level IM Message System"
"Alibaba IM Technology Sharing (4): Reliable Delivery Optimization Practice of Xianyu's 100-million-level IM Message System"
"Alibaba IM Technology Sharing (5): Timeliness Optimization Practice of Xianyu's Billion-level IM Message System" (* This article)

3. Current problems

3.1 Long connection interrupted on the device side
In the IM scenario, the user's device communicates with the cloud frequently, and for messages to arrive in time the cloud must push them down to the user. While the user is online, the device and the cloud therefore maintain a long-lived TCP connection, which lets the device interact with the server in a more lightweight way. In modern IM systems, downlink messages are all sent through long connections.

The current Xianyu IM messaging system uses ACCS long connections. ACCS is a full-duplex, low-latency, high-security channel service provided by Taobao Wireless.

However, because the network state of user devices is unpredictable, various network anomalies can interrupt the ACCS long connection. Once the long connection is interrupted unexpectedly, the user cannot receive online messages in time.

To address this, we need to detect the interruption as early as possible and try to reconnect. The specific optimizations are described later in this article.

3.2 Pushed messages do not arrive
Detecting an interrupted long connection and reconnecting only keeps the long connection valid most of the time. Messages pushed down while the long connection is invalid or unstable may not be received by the client at all.

Simply put, the reconnection mechanism alone cannot guarantee that downlink messages arrive. The following scenarios can all cause a downlink message to fail:

1) The long connection breaks while the server is sending a downlink message; the message is lost in transit and the client never receives it;
2) The device's online status is updated with a delay: the server believes the device is online when it pushes the message, but the device is actually offline and cannot receive it;
3) The client receives the downlink message, but subsequent client-side processing fails (for example, writing the message to the local database fails, so it is never displayed to the user).

According to our instrumentation data, the downlink success rate of the ACCS long connection is about 97%.

The downlink success rate of the ACCS long connection is calculated as follows:

ACCS downlink success rate = number of messages the client actually received via ACCS / number of messages the server considers successfully sent via ACCS

The impatient reader may ask: does that mean 3% of the messages are lost?

Not at all! Those 3% of messages are not lost; they are just not guaranteed to reach users in time.

Our message synchronization model combines push and pull. When the client pulls, it fetches all messages between the device's current position and the server's latest position. Messages whose ACCS downlink failed are therefore still obtained through an active pull, but the occasions on which the client actively pulls are limited.

Currently, the client actively pulls messages mainly in the following cases:

1) The user cold-starts the app, which actively synchronizes messages;
2) The user pulls down to refresh;
3) The app switches from the background to the foreground;
4) A pushed message arrives and the client finds a gap between the new message's position and its latest local position, which triggers synchronization.
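
The fourth trigger can be sketched roughly as follows. This is a minimal illustration in Python, not Xianyu's client code: the class name MessageClient, the integer positions, and the pull_range stand-in for a real sync request are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MessageClient:
    """Hypothetical client state: the highest contiguous position stored locally."""
    local_position: int = 0
    pulls: list = field(default_factory=list)  # (start, end) ranges that were pulled

    def pull_range(self, start: int, end: int) -> None:
        # Stand-in for a real sync request to the server.
        self.pulls.append((start, end))
        self.local_position = end

    def on_push(self, position: int) -> str:
        """Handle one pushed message, identified by its position."""
        if position <= self.local_position:
            return "duplicate"        # already stored locally, ignore
        if position == self.local_position + 1:
            self.local_position = position
            return "applied"          # next in sequence, no gap
        # Gap: some positions after local_position were never pushed to us,
        # so pull everything from the first missing position up to this one.
        self.pull_range(self.local_position + 1, position)
        return "synced"


client = MessageClient(local_position=10)
print(client.on_push(11))   # applied
print(client.on_push(14))   # synced
print(client.pulls)         # [(12, 14)]
```

Note that the gap is only noticed when a later push actually arrives, which is exactly why this trigger cannot bound the delay on its own.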

As can be seen, these active synchronizations depend largely on user behavior or on the arrival of a new message, so it is difficult to guarantee that messages arrive in time.

For IM software that users open frequently, this is not much of a problem. However, the Xianyu app has relatively low activity and sometimes even relies on IM messages to bring users back, so a delayed message may cause a user to miss a transaction. Such delays are not acceptable for Xianyu messages.

Based on the above analysis, we first define a metric that reflects the status quo.

As described above, not every message sent through ACCS reaches the client by push; some are obtained by an active pull. A pushed message necessarily arrives in time, whereas a pulled message is limited by user behavior.

We call the latter case ACCS message compensation arrival and measure how long it takes. The scope is limited to messages that the server considers successfully sent via ACCS but that the client actually obtained through an active pull. In the previous version this figure was about 60 minutes.

Note: this figure is not the time it takes for a message to reach the user, because if the device goes from online to offline, the time of the pull depends on user behavior (when the user opens the app). It nevertheless gives a rough picture of the arrival delay of online messages.

The ACCS message compensation arrival time is calculated as follows:

ACCS message compensation arrival time = time at which the client obtained the ACCS message through a pull - time at which the server sent it via ACCS
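
As a small worked example, the two metrics defined in this section could be computed like this. The function names and all sample numbers are made up for illustration (they are chosen to match the 97% and 60-minute figures quoted above) and are not real Xianyu data.

```python
from statistics import mean


def downlink_success_rate(server_sent_ok: int, client_received: int) -> float:
    """Messages the client received via ACCS / messages the server counts as sent."""
    if server_sent_ok == 0:
        return 0.0
    return client_received / server_sent_ok


def compensation_delays(records):
    """records: (server_downlink_ts, client_pull_ts) pairs in seconds, one per
    message that the server counted as sent but the client obtained by pulling."""
    return [pulled - sent for sent, pulled in records]


# Illustrative numbers only.
rate = downlink_success_rate(server_sent_ok=1000, client_received=970)
delays = compensation_delays([(0, 1800), (100, 5500)])
print(f"success rate: {rate:.0%}")
print(f"mean compensation arrival time: {mean(delays) / 60:.0f} min")
```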

Next, this article describes in detail how we improved the stability of the online channel from two angles, long connection reconnection and retransmission of undelivered messages, to ensure that messages arrive in time.

4. Optimization method 1: Add a long connection reconnection mechanism

4.1 Why is the long connection interrupted?
Every effect has a cause. Let's first analyze what causes the connection to be interrupted.

In the IM scenario, the possible causes include:

1) The user's device loses its network connection;
2) The device switches networks;
3) The device is in a weak network environment and the network is unstable;
4) The device's network is normal, but the carrier drops the TCP connection because of a NAT timeout.

For the app, if a user operation changes the network state, a network state change event is delivered. In that case the app can listen for the event and proactively try to reconnect. In reality, however, most interruptions are "unexpected" (like the causes listed above).

Since "unexpected" interruptions cannot be predicted, how can we detect these abnormal conditions effectively?

PS: A thorough treatment of disconnection, weak networks, and TCP link validity is beyond the scope of this article. The following materials cover these topics in depth and are well worth studying.

Regarding the validity of the TCP link itself, you can read the following two articles:

"Why does the mobile terminal IM based on the TCP protocol still need a heartbeat keep-alive mechanism?"
"Unknown Network Programming (12): Thoroughly understand the KeepAlive keepalive mechanism of the TCP protocol layer"

Regarding the complexity of mobile networks, the following introductory popular science articles are a good start:

"Introduction to Zero-Basic Communication Technology for IM Developers (11): Why is the WiFi signal poor? Understand in one article!"
"Introduction to Zero-Basic Communication Technology for IM Developers (12): Is the Internet stuck? Internet disconnected? Understand in one article!"
"Introduction to Zero-Basic Communication Technology for IM Developers (13): Why is the mobile phone signal poor? Understand in one article!"
"Introduction to Zero-Basic Communication Technology for IM Developers (14): How difficult is it to surf the Internet on high-speed rail? Understand in one article!"

Regarding the problems caused by weak mobile networks and how to optimize for them, the following articles give a systematic treatment:

"Summary of optimization methods for short connection of modern mobile network: request speed, weak network adaptation, security assurance"
"Mobile IM Developers Must-Read (1): Easy to understand, understand the "weak" and "slow" of mobile networks"
"A Must-Read for Mobile IM Developers (2): Summary of the Most Complete Mobile Weak Network Optimization Method in History"
"Baidu APP mobile terminal network deep optimization practice sharing (3): mobile terminal weak network optimization article"

4.2 Heartbeat detection mechanism
Like most link keep-alive scenarios, the most effective detection method in the IM scenario is heartbeat detection (see "Why does the mobile terminal IM based on the TCP protocol still need a heartbeat keep-alive mechanism?").

The principle is as follows: the client sends heartbeat packets periodically, and the server replies to the client after receiving each one. Through this cooperation, both the client and the server can detect whether the connection has been interrupted.

From a timeliness perspective, the shorter the heartbeat interval the better. However, frequent heartbeats inevitably cost the user traffic and battery power. Our goal is therefore to detect unexpected long connection interruptions as promptly as possible with as few heartbeats as possible.

State machine + message heartbeat queue:

When designing the heartbeat protocol, keep in mind that the core purpose of a heartbeat packet is to check whether the long connection channel is still open. If the client sends a heartbeat uplink and receives the server's reply, the channel is considered healthy. The heartbeat request and its reply should therefore be as small as possible; generally, it is enough to identify heartbeat requests and responses in the protocol header, which keeps the packets small.
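
To make the "header only" idea concrete, here is a minimal sketch of a framing scheme in which a heartbeat is nothing more than a header with a type byte and an empty payload. The 3-byte header layout and the type codes are assumptions invented for this example; they are not the ACCS wire format.

```python
import struct
from typing import Optional, Tuple

# Hypothetical header: 1 byte frame type + 2 bytes payload length (big-endian).
HEADER = struct.Struct("!BH")
TYPE_DATA, TYPE_PING, TYPE_PONG = 0, 1, 2


def encode(frame_type: int, payload: bytes = b"") -> bytes:
    """Build a frame: header followed by the (possibly empty) payload."""
    return HEADER.pack(frame_type, len(payload)) + payload


def decode(frame: bytes) -> Tuple[int, bytes]:
    """Split a frame back into its type and payload."""
    frame_type, length = HEADER.unpack_from(frame)
    return frame_type, frame[HEADER.size:HEADER.size + length]


def server_handle(frame: bytes) -> Optional[bytes]:
    """Reply to a ping with a pong; other frame types are ignored here."""
    frame_type, _ = decode(frame)
    if frame_type == TYPE_PING:
        return encode(TYPE_PONG)
    return None


ping = encode(TYPE_PING)
reply = server_handle(ping)
print(len(ping), decode(reply))   # 3 (2, b'')
```

Both the ping and the pong are 3 bytes here, compared with a data frame that carries a full message body.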

PS: For an introductory article on the heartbeat mechanism, you can read "An article to understand the network heartbeat packet mechanism in instant messaging applications: functions, principles, implementation ideas, etc.".

4.3 Heartbeat strategy
The heartbeat strategy is the core mechanism for achieving the goal above. This article only briefly lists a few heartbeat strategies.

For example:

1) Short heartbeats in the initial state: once 3 consecutive pings have been acknowledged, the connection is considered stable;
2) Regular fixed-interval heartbeats (the frequency can be adjusted according to the app state: Mid+, Mid-, Long);
3) Adaptive heartbeats: the heartbeat interval adapts automatically to changes in the device's network state;
4) Redundant heartbeats: when the app switches from the background to the foreground, it proactively sends one heartbeat.
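
Strategies 1, 2, and 4 can be combined into a small scheduler, sketched below. The class name and every interval value are hypothetical and chosen only for illustration; the adaptive strategy (3) is deliberately left out because it depends on network measurements that this article does not describe.

```python
class HeartbeatScheduler:
    """Picks the next heartbeat interval. All interval values are illustrative."""

    SHORT = 5                 # seconds, used while probing a fresh connection
    STABLE_AFTER = 3          # consecutive ACKed pings before "stable"
    INTERVALS = {"foreground": 30, "background": 120}  # hypothetical fixed intervals

    def __init__(self) -> None:
        self.acked_in_a_row = 0
        self.app_state = "foreground"

    @property
    def stable(self) -> bool:
        return self.acked_in_a_row >= self.STABLE_AFTER

    def on_ack(self) -> None:
        self.acked_in_a_row += 1

    def on_timeout(self) -> None:
        self.acked_in_a_row = 0   # fall back to short probing heartbeats

    def on_app_state(self, state: str) -> bool:
        """Return True when a redundant heartbeat should be sent immediately."""
        came_to_foreground = self.app_state == "background" and state == "foreground"
        self.app_state = state
        return came_to_foreground

    def next_interval(self) -> int:
        return self.INTERVALS[self.app_state] if self.stable else self.SHORT


s = HeartbeatScheduler()
print(s.next_interval())            # 5: still probing
for _ in range(3):
    s.on_ack()
print(s.next_interval())            # 30: stable, app in foreground
s.on_app_state("background")
print(s.next_interval())            # 120
print(s.on_app_state("foreground")) # True: send one heartbeat right away
```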

The detailed design of a heartbeat strategy could fill an article of its own. Interested readers can continue with the recommended articles below.

"WeChat team original sharing: Android version of WeChat background keep-alive actual sharing (network keep-alive)"
"Mobile IM Practice: Realizing the Smart Heartbeat Mechanism of Android WeChat"
"Mobile IM Practice: Analysis of the Heartbeat Strategy of WhatsApp, Line and WeChat"
"Rongyun Technology Sharing: Practice of Network Link Keep Alive Technology of Rongyun Android IM Products"
"Web-side instant messaging practice dry goods: How to make your WebSocket disconnect and reconnect faster?"
"Correctly understand the heartbeat and reconnection mechanism of the IM long connection, and implement it by hand (there is a complete IM source code)"
"A Discussion on the Design and Implementation of an IM Intelligent Heartbeat Algorithm on Android (with sample code)"
"Teach you how to use Netty to realize the heartbeat mechanism and disconnection reconnection mechanism of network communication programs"

5. Optimization method 2: Message ACK response and retransmission mechanism

5.1 Overview
To solve the problem of pushed messages not arriving, we also introduced a message ACK and retransmission mechanism.

The overall idea is as follows: after the client receives an ACCS message and processes it successfully, it returns an ACK packet to the server. When the server sends an ACCS message, it adds the message to a retry queue; when it receives the ACK, it updates the message's arrival status and stops retrying.
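
The client half of this idea can be sketched as below. The important detail is the ordering: the ACK is sent only after processing has succeeded, so a failed local write leaves the message in the server's retry queue. The function and callback names are hypothetical, not the real client API.

```python
def handle_push(msg_id, body, persist, send_ack) -> bool:
    """Process one pushed message; ACK only if processing succeeded."""
    try:
        persist(msg_id, body)     # e.g. write to the local database
    except Exception:
        return False              # no ACK, so the server will retransmit
    send_ack(msg_id)
    return True


# Tiny demonstration with in-memory stand-ins.
local_db, acks = {}, []


def failing_persist(msg_id, body):
    raise OSError("disk full")    # simulated local storage failure


print(handle_push("m-1", "hello", local_db.__setitem__, acks.append))  # True
print(handle_push("m-2", "world", failing_persist, acks.append))       # False
print(acks)                                                            # ['m-1']
```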

The overall design flow chart is as follows:

The difficulty of this scheme lies in the design of the retry processor, so the following sections focus on the detailed design of that part.

5.2 Retry queue storage design
We use the Timeline model of Alibaba Cloud Table Store to store the arrival status of downlink messages. The Timeline model is a data model designed for messaging scenarios; it meets their special requirements for message ordering, massive message storage, and real-time synchronization, and is widely used in IM, feed streams, and similar scenarios. (For more on the Timeline model, see "Discussion on the Synchronization and Storage Scheme of Chat Messages in Modern IM Systems".)

We define one Timeline per user device: the timeline-id is userId_deviceId, and the sequenceId is customized to be the message position.

The storage structure is as follows:

Each time a message is successfully sent down through ACCS, it is inserted into the Timeline of the receiving device; when the ACK is received, the message's arrival status is updated by message id.

Because retries only happen within a short period after a message is sent, we can also set a relatively short global expiration time to keep the data from growing.
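
The data layout described above can be illustrated with an in-memory stand-in. This is only a sketch of the shape of the data (one timeline per userId_deviceId, rows keyed by position, an arrival flag, and a short expiration); it does not use or imitate the Table Store API, and the class name, field names, and TTL value are assumptions.

```python
import time


class ArrivalStore:
    """In-memory stand-in for the per-device timeline (not the Table Store API)."""

    def __init__(self, ttl_seconds: float, clock=time.time) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.rows = {}   # (timeline_id, sequence_id) -> row

    @staticmethod
    def timeline_id(user_id: str, device_id: str) -> str:
        return f"{user_id}_{device_id}"

    def insert(self, user_id, device_id, sequence_id, msg_id) -> None:
        """Record a message that was just sent down via the long connection."""
        key = (self.timeline_id(user_id, device_id), sequence_id)
        self.rows[key] = {"msg_id": msg_id, "arrived": False,
                          "expires_at": self.clock() + self.ttl}

    def mark_arrived(self, user_id, device_id, msg_id) -> bool:
        """Update the arrival status by message id when an ACK comes in."""
        tid = self.timeline_id(user_id, device_id)
        for (row_tid, _), row in self.rows.items():
            if row_tid == tid and row["msg_id"] == msg_id:
                row["arrived"] = True
                return True
        return False

    def purge_expired(self) -> int:
        """Drop rows past the global expiration time; returns how many were dropped."""
        now = self.clock()
        expired = [k for k, r in self.rows.items() if r["expires_at"] <= now]
        for k in expired:
            del self.rows[k]
        return len(expired)


now = [0.0]                                     # fake clock for the demo
store = ArrivalStore(ttl_seconds=600, clock=lambda: now[0])
store.insert("u1", "d1", sequence_id=101, msg_id="m-1")
print(store.mark_arrived("u1", "d1", "m-1"))    # True
now[0] = 601
print(store.purge_expired())                    # 1
```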

5.3 Delayed Retry Design

As shown in the figure:

1) Each time a message is sent through ACCS, it is first inserted into the Timeline with an initial status of "not arrived", and then a delayed message with a delay of N seconds is produced;
2) Each time a delayed message is consumed, the message's arrival status is read from Table Store; if the message has arrived, the retry ends, otherwise it continues;
3) On each retry, first check whether the device is online. If it is not, forward the message to the offline channel and stop retrying. If it is, push the unacknowledged message again and schedule another delayed consumption in N seconds;
4) Over a message's retry life cycle, the same delayed message is retried at most M times. Once that limit is exceeded, no further retries are made and the event is logged (so that this case can be monitored and optimized later based on the data).
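
The four steps above can be simulated on a virtual clock as follows. This is a sketch under stated assumptions: a real system would use a delayed message queue rather than a loop, and the callback names (has_arrived, is_online, push, to_offline, log) are hypothetical.

```python
def run_retries(msg_id, delay_n, max_m, has_arrived, is_online, push, to_offline, log):
    """Simulate the delayed-retry life cycle of one message on a virtual clock."""
    now, retries = delay_n, 0          # first delayed message fires after N seconds
    while True:
        if has_arrived(msg_id, now):   # step 2: ACK already recorded, stop
            return "acked"
        if retries >= max_m:           # step 4: retry budget used up
            log(f"{msg_id}: no ACK after {retries} retries")
            return "exhausted"
        if not is_online(now):         # step 3: device offline, use offline channel
            to_offline(msg_id)
            return "offline"
        push(msg_id)                   # step 3: device online, push again
        retries += 1
        now += delay_n                 # consume again N seconds later


pushes, offline, logs = [], [], []

# ACK is recorded at t=25s; with N=10 the message is re-pushed at t=10 and t=20.
print(run_retries("m-1", 10, 6, lambda m, t: t >= 25, lambda t: True,
                  pushes.append, offline.append, logs.append))   # acked
print(len(pushes))                                               # 2

# The ACK never arrives: exactly M=6 retries, then a log entry.
print(run_retries("m-2", 10, 6, lambda m, t: False, lambda t: True,
                  pushes.append, offline.append, logs.append))   # exhausted
```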

5.4 Delayed retransmission strategy
The delayed retransmission strategy is about choosing an appropriate delay during retransmission so that retransmission is as efficient as possible.

Users' network environments differ considerably across times and places, and so does the time their networks need to return to a stable state. An appropriate delay strategy is needed to keep retransmission efficient.

The goal of the optimal delay strategy is to deliver the message successfully in the shortest time with the fewest retransmissions. Several options follow.

5.4.1) Fixed delay time:

The optimal delay strategy has to come from analyzing real data; guesswork is often far from reality.

We first collected data using a fixed delay (10s) and a maximum of 6 retries:

The data shows that about 85% of the messages are delivered successfully after being retransmitted within 40s (with a 10s delay, that is the first 4 retries), while 12% still have not received an ACK after reaching the maximum number of retries. After the fourth retry the gains drop off sharply: the fifth retry accounts for only 2.03% and the sixth for only 0.92%, so the benefit of continuing to retransmit is very low.

After 6 retries some messages still have not been acknowledged. Continuing to use a fixed delay for them is not cost-effective, and frequent retransmission wastes system resources, so the strategy needs to be improved.

5.4.2) Fixed delay + fixed step increment:

Considering that some users' networks cannot recover in a short time, and that frequent short-interval retransmissions are of little value in that case, we adopted the following strategy: the first 4 retries use a fixed short delay of N seconds, and after that each delay is the previous delay plus a fixed step of M seconds. This continues until an ACK is received, the user's device goes offline, or the maximum delay MAX(N) is reached.

This strategy solves the problem of the fixed-delay retransmission strategy to some extent. However, if the user's network cannot recover in a short time, every new message has to start incrementing from the initial delay again, so it is still not an optimal solution.
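
The delay sequence produced by this strategy can be written as a small generator. The concrete values of N, M, and MAX(N) in the example are made up; the caller is assumed to stop consuming the sequence as soon as an ACK arrives or the device goes offline.

```python
def delay_schedule(n, m, max_delay, fixed_rounds=4):
    """Yield retry delays: fixed_rounds delays of n, then n+m, n+2m, ... up to max_delay."""
    for _ in range(fixed_rounds):
        yield n
    delay = n + m
    while delay <= max_delay:
        yield delay
        delay += m


# Illustrative values: N=10s, M=5s, MAX(N)=30s.
print(list(delay_schedule(10, 5, 30)))   # [10, 10, 10, 10, 15, 20, 25, 30]
```

Every message walks through the same sequence from the start, which is the weakness pointed out above: a device that has been unreachable for a while still gets four short retries and a fresh ramp-up for each new message.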

5.4.3) Adaptive delay:

Design flow chart:

As shown above, we finally arrived at an adaptive delay strategy.

Adaptive delay means that the delay is adjusted automatically according to the user's network conditions in order to achieve the highest retransmission efficiency.

Specifically, a new message first probes the device's network status with retries at a fixed short delay of N seconds. Once the network has recovered, we clear the device's N value. (The device N value is the shortest delay that, according to previous retransmission experience, the device's current network needs in order to return an ACK. It is empty by default, which means the device's network is normal.) If no ACK has been received after 4 retransmissions, we try to read the device N value and fall back to the initial value if it is empty. From then on, each delay is increased by the fixed step M, and the device N value is updated after each retransmission, until the message receives an ACK or the maximum delay MAX(N) is reached.
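
One possible reading of this description is sketched below. The original flow chart is not available here, so the exact point at which the step M is added is an interpretation, and the class name, method names, and numeric values are all assumptions. The key property it illustrates is that a second message to the same unreachable device resumes from the remembered device N value instead of ramping up from scratch.

```python
from typing import Optional


class AdaptiveDelay:
    """Per-device adaptive retry delay (one interpretation of the description)."""

    def __init__(self, initial_n, step_m, max_n, fixed_rounds=4) -> None:
        self.initial_n, self.step_m, self.max_n = initial_n, step_m, max_n
        self.fixed_rounds = fixed_rounds
        self.device_n = {}   # device_id -> last delay tried; absent means "network normal"

    def on_ack(self, device_id) -> None:
        """The network has recovered: clear the device N value."""
        self.device_n.pop(device_id, None)

    def next_delay(self, device_id, retries_done: int) -> Optional[float]:
        """Delay before the next retransmission, or None once MAX(N) would be exceeded."""
        if retries_done < self.fixed_rounds:
            return self.initial_n                 # probe with the fixed short delay
        previous = self.device_n.get(device_id, self.initial_n)
        delay = previous + self.step_m            # grow by the fixed step M
        if delay > self.max_n:
            return None
        self.device_n[device_id] = delay          # remember it for later messages
        return delay


ad = AdaptiveDelay(initial_n=10, step_m=5, max_n=30)
first = [ad.next_delay("d1", i) for i in range(7)]
second = [ad.next_delay("d1", i) for i in range(6)]   # a new message, network still down
print(first)    # [10, 10, 10, 10, 15, 20, 25]
print(second)   # [10, 10, 10, 10, 30, None]
```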

5.5 New and old version compatibility
Note that old versions of the app do not return an ACK. If messages sent to devices running an old version were also added to the retry queue, they would be retried until the maximum number of retries was reached, consuming resources for nothing.

We therefore designed the following: after the ACCS long connection is established, the client proactively uploads a piece of device information that includes the app version number, and the server stores it for a certain period of time. Before adding a message to the retry queue, the server first checks the app version of the receiving device and only enqueues the message if the version meets the requirement.
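
A minimal sketch of this version gate is shown below. The minimum version "7.0.0", the class name, and the dotted numeric version format are assumptions for illustration; the expiry of stored device information mentioned above is omitted to keep the example short.

```python
def parse_version(version: str) -> tuple:
    """Turn '7.1.20' into (7, 1, 20); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


MIN_ACK_VERSION = "7.0.0"   # hypothetical first version whose client returns ACKs


class DeviceRegistry:
    """Remembers the app version each device reported after connecting."""

    def __init__(self) -> None:
        self.versions = {}

    def report(self, device_id: str, app_version: str) -> None:
        self.versions[device_id] = app_version

    def supports_ack(self, device_id: str) -> bool:
        version = self.versions.get(device_id)
        if version is None:
            return False            # unknown device: do not enqueue for retry
        return parse_version(version) >= parse_version(MIN_ACK_VERSION)


registry = DeviceRegistry()
registry.report("old-phone", "6.9.80")
registry.report("new-phone", "7.1.20")
print(registry.supports_ack("old-phone"))   # False
print(registry.supports_ack("new-phone"))   # True
```

Comparing parsed tuples rather than raw strings avoids mistakes such as "6.10.0" sorting before "6.9.0".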

6. The final optimized effect

After the long connection reconnection and message retransmission scheme went live, the ACCS message compensation arrival time defined above dropped significantly, from 60 minutes to 15 minutes, a decrease of 75%.

This confirms our technical analysis. At the same time, user complaints about message delays have dropped significantly, which shows that the message retransmission mechanism is effective in making sure user messages arrive in time.

7. Future prospects

The stability optimization of the online message channel has now reached a milestone. Going forward, we will keep improving the experience of using Xianyu messages, in terms of both basic functions and basic experience.

Basic functions: recent versions already support message recall and drafts; we will gradually add features such as sending locations, conversation grouping, remarks, and message search.

Basic experience: we have upgraded the UI style of messages and optimized the CPU and memory usage of the app's message tab. We will continue to optimize the messaging experience in terms of traffic, power consumption, and performance.

Appendix: Reference Materials

[1] Why does the mobile terminal IM based on the TCP protocol still need a heartbeat keep-alive mechanism?
[2] Unknown network programming (12): Thoroughly understand the KeepAlive keepalive mechanism of the TCP protocol layer
[3] Discussion on the synchronization and storage of chat messages in modern IM systems
[4] Summary of optimization methods for short connection of modern mobile network: request speed, weak network adaptation, security assurance
[5] A must-read for mobile IM developers (2): Summary of the most comprehensive mobile weak network optimization method in history
[6] Introduction to Zero-Basic Communication Technology for IM Developers (12): Is the Internet stuck? Internet disconnected? Understand in one article!
[7] Introduction to zero-based communication technology for IM developers (13): Why is the mobile phone signal poor? Understand in one article!
[8] Mobile IM Practice: Realize the smart heartbeat mechanism of Android version of WeChat
[9] Rongyun technology sharing: Rongyun Android IM product network link keep-alive technology practice
[10] Web-side instant messaging practice dry goods: How to make your WebSocket disconnect and reconnect faster?

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".
The synchronous publishing link is: http://www.52im.net/thread-3726-1-1.html
