
This article was shared by Jing Song of the Alibaba Xianyu technical team. The original title was "Arrival Rate of 99.9%: Xianyu Messaging Changes Engines at High Speed (A Synthesis)"; it has been revised and edited here. Thanks to the author for sharing.

1 Introduction

At the beginning of 2020, I took over Xianyu's IM (instant messaging) system. Messaging had all kinds of problems at the time, and negative user feedback kept coming in.

Typical problems, such as:

1) "Chat messages are often lost";
2) "Message avatars are mixed up";
3) "The order status is wrong" (perhaps some of you reading this are complaining about these very issues).

Therefore, the stability and reliability of Xianyu's instant messaging system was an urgent problem to solve.

We investigated several solutions within the group, such as DingTalk's IMPass. Migrating to it outright would carry relatively high technical cost and risk, including double-writing server-side data, compatibility between old and new versions, and so on.

So, starting from Xianyu's existing instant messaging architecture and technology stack, how should we optimize message stability and reliability? Where should the governance work start? What is the current state of the system, and how can it be measured objectively? I hope this article shows you a different side of Xianyu's instant messaging system.

PS: If you have no background in IM message reliability, it is recommended to first read the introductory article "Introduction to Zero-based IM Development (3): What Is the Reliability of an IM System?".

Study and exchange:

  • Instant messaging/push technology development and exchange group 5: 215477170 [recommended]
  • Introduction to mobile IM development: "One Entry Is Enough for Novices: Developing Mobile IM from Scratch"
  • Open-source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article was published simultaneously at: http://www.52im.net/thread-3706-1-1.html)

2. Series of articles

This article is the fourth in a series of articles. The general content is as follows:

"Alibaba IM Technology Sharing (1): Enterprise-level IM King, DingTalk's Excellence in Back-End Architecture"
"Alibaba IM Technology Sharing (2): Xianyu IM's Flutter-based mobile terminal cross-terminal transformation practice"
"Alibaba IM Technology Sharing (3): The Road to the Evolution of the Architecture of Xianyu's Billion-level IM Message System"
"Alibaba IM Technology Sharing (4): Reliable Delivery Optimization Practice of Xianyu Billion-level IM Message System" (* This article)

3. Industry plan

After reviewing the mainstream reliable-message-delivery solutions shared online, I made a brief summary.

Usually the delivery link of an IM message is roughly divided into three steps:

1) The sender sends the message;
2) The server receives it and persists it to the database;
3) The server notifies the receiver.

The mobile network environment in particular is more complicated:

1) The network may suddenly disconnect while you are sending a message;
2) The network may drop and then recover mid-send, so the message needs to be resent.

The technical schematic diagram is as follows:

PS: Many people may lack a systematic understanding of how complex mobile networks are. The following articles are worth reading systematically:

"Easy to Understand: The 'Weak' and 'Slow' of Mobile Networks"
"The Most Comprehensive Summary of Mobile Weak-Network Optimization Methods in History"
"Why Is the WiFi Signal Poor? Understand in One Article!"
"Why Is the Cell Phone Signal Poor? Understand in One Article!"
"How Hard Is Wireless Internet Access on High-Speed Rail? Understand in One Article!"

So, how can IM messages be delivered stably and reliably in such a complicated network environment?

The sender cannot know on its own whether a message has been delivered; to be sure of delivery, an acknowledgment (response) mechanism has to be added.

This mechanism is similar to the following response logic:

1) The sender sends the message "Hello" and enters a waiting state;
2) The receiver receives "Hello" and tells the sender it has received the message (a confirmation);
3) After the sender receives the confirmation, the flow completes; otherwise, it retries.

The process above looks simple, but the key is that there is a server-side forwarding step in the middle. The questions are: who sends back which confirmation, and when is the confirmation returned?
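The response logic above can be sketched roughly as follows. This is a minimal illustration in Python; the transport object and its methods are hypothetical, not the actual Xianyu client code:

```python
def send_reliably(transport, text, max_retries=3):
    """Send, wait for the receiver's confirmation, and retry on timeout."""
    for _ in range(max_retries):
        transport.send(text)              # 1) send "Hello" and enter the waiting state
        if transport.wait_for_confirm():  # 2) receiver confirmed receipt
            return True                   # 3) confirmation received: flow complete
    return False                          # all retries failed; report an error

class LossyTransport:
    """Toy transport whose first send is 'lost', forcing one retry."""
    def __init__(self):
        self.attempts = 0
    def send(self, text):
        self.attempts += 1
    def wait_for_confirm(self):
        return self.attempts > 1

t = LossyTransport()
assert send_reliably(t, "Hello") is True   # succeeds on the second attempt
assert t.attempts == 2
```

The retry bound matters: without it, a permanently dead link would make the sender wait forever instead of surfacing a send failure to the user.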

The model most frequently found online is the following guaranteed-delivery model:

Each message type is explained as follows:

As shown in the above two figures, the sending process is:

1) A sends a message request packet to IM-server, namely msg:R1;
2) After successful processing, IM-server will reply A with a message response packet, namely msg:A1;
3) If B is online at this time, IM-server actively sends a message notification package to B, namely msg:N1 (of course, if B is not online, the message will be stored offline).

As shown in the above two figures, the receiving process is:

1) B sends an ack request packet to IM-server, namely ack:R2;
2) After successful processing, IM-server will reply B with an ack response packet, namely ack:A2;
3) IM-server actively sends an ack notification packet to A, namely ack:N2.

As the model above shows, reliable delivery of one message is guaranteed by six packets. If anything goes wrong at an intermediate step, the request-ack mechanism makes it possible to detect the failure and retry.
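To make the six packets concrete, here is a toy trace of one successful delivery. The packet names follow the figures above; everything else is illustrative, not a real protocol implementation:

```python
def deliver(msg):
    """Return the six packets, in order, for one reliably delivered message."""
    return [
        ("A -> IM-server", "msg:R1", msg),   # A's message request
        ("IM-server -> A", "msg:A1", msg),   # server's response: message stored
        ("IM-server -> B", "msg:N1", msg),   # notification pushed to online B
        ("B -> IM-server", "ack:R2", msg),   # B's ack request
        ("IM-server -> B", "ack:A2", msg),   # server's ack response to B
        ("IM-server -> A", "ack:N2", msg),   # ack notification: A knows B got it
    ]

trace = deliver("Hello")
assert len(trace) == 6
assert trace[0][1] == "msg:R1" and trace[-1][1] == "ack:N2"
```

If any packet after msg:R1 fails to arrive, the side that is still waiting re-sends its request; that request/ack pairing is what makes the model reliable.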

The solution we finally adopted is also based on this model. The client's sending logic goes directly over HTTP, so no retry is needed there for the time being; retry logic is mainly added where the server pushes to the client.

Due to space limitations, this article will not be expanded in detail. If you are interested, you can study the following articles systematically:

"Talking About Mobile IM Message Reliability and Delivery Mechanisms from the Client's Perspective"
"Implementation of IM Message Delivery Guarantee Mechanisms (1): Guaranteeing Reliable Delivery of Online Real-Time Messages"
"Implementation of IM Message Delivery Guarantee Mechanisms (2): Guaranteeing Reliable Delivery of Offline Messages"
"How Should a Fully Self-Developed IM Design Its 'Failure Retry' Mechanism?"
"A Set of IM Architecture Techniques for Hundreds of Millions of Users (Part 2): Reliability, Ordering, Weak-Network Optimization, etc."
"Understanding the 'Reliability' and 'Consistency' Issues of IM Messages, with a Discussion of Solutions"
"Rongyun Technology Sharing: Fully Revealing the Reliable Delivery Mechanism of 100-Million-Level IM Messages"

4. Current specific problems

4.1 Overview
Before solving the problem of reliable message delivery, we first had to find out exactly which problems we were facing.

However, when we took over this instant messaging system, there was no accurate data to refer to. So the first step was a complete investigation of the messaging system: we instrumented the full message link with tracking points.

The specific tracking points along the link are as follows:

Based on the entire message link, we sorted out several key metrics:

1) Send success rate;
2) Message arrival rate;
3) Client-side message loss rate.

All of the statistics this time are based on these tracking points, but a big problem was discovered while adding them: the current instant messaging system has no globally unique message ID. This makes it impossible to uniquely identify a message's life cycle when tracking it across the full link.

4.2 Message uniqueness problem

As shown in the figure above, a message is currently identified by the combination of three variables:

1) SessionID: the ID of the current session;
2) SeqID: the local sequence number of the message on the sender's device; the server ignores this field and passes it through untouched;
3) Version: this one matters most. It is the sequence number of the message within the current session. The server's value is authoritative, but the client also generates a provisional (fake) version locally.

The figure above gives an example: when A and B send messages at the same time, the key fields above are generated locally on each device. The message sent by A (yellow) reaches the server first; since no other message with that version exists yet, the server returns the original data to A. When client A receives it, it merges it with the local message, keeping only one copy. The server also delivers this message to B, but B already has a local message with version=1, so the server's message is filtered out: the message is lost.

When B's message then reaches the server, a message with version=1 already exists, so the server increments B's message to version=2. This message is delivered to A, where it merges normally with the local data. But when it is returned to B and merged locally, two identical messages appear: the message is duplicated. This was the main reason Xianyu had always suffered from message loss and message duplication.
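The loss-then-duplication behavior described above can be reproduced with a toy version-based merge (illustrative Python, not Xianyu's real code):

```python
def merge_by_version(local, incoming):
    """Flawed merge: keep at most one message per session-local version number."""
    merged = dict(local)                 # version -> message text
    for version, text in incoming.items():
        if version not in merged:        # same version is assumed to be a duplicate
            merged[version] = text
    return merged

# A and B send at the same time; each assigns version=1 locally.
b_local = {1: "hi from B"}
server_push = {1: "hello from A"}        # A's message, as delivered to B

# B drops A's message because the versions collide -> message lost.
state = merge_by_version(b_local, server_push)
assert state == {1: "hi from B"}

# The server re-numbers B's own message to version=2 and echoes it back;
# B now stores the same message twice -> message duplicated.
state = merge_by_version(state, {2: "hi from B"})
assert list(state.values()) == ["hi from B", "hi from B"]
```

A globally unique message ID (the fix in section 5.1) removes both failure modes, because de-duplication can then key on the message's identity rather than on a per-session counter.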

4.3 Message push logic problem
The current message push logic also has a big problem. Because the sender uses HTTP requests, the sent content itself rarely goes wrong; the problems occur when the server pushes to the other end.

As shown below:

As shown in the figure above: when the server pushes to the client, it first checks whether the client is online. If online, it pushes the message; if not, it stores the message as an offline message.

This approach is simple and crude: if the long connection is unstable, the client's real state and the state stored on the server diverge, and the message never reaches the client.

4.4 Client logic issues
Besides the server-side issues above, there is another class of problems rooted in the design of the client itself.

They can be boiled down to the following situations:

1) Multithreading problems: the layout of the message list page gets disordered because the UI is rendered before local data has been fully initialized;
2) Inaccurate unread counts and red-dot badges: the data shown locally is inconsistent with what is stored in the database;
3) Message merging problems: local messages are merged segment by segment, so the continuity and uniqueness of the messages cannot be guaranteed.

For situations like these, we first sorted out and refactored the client code.

The architecture is shown in the figure below:

5. Our optimization work 1: Upgrading the messaging core

The first step in solving these problems was to fix message uniqueness in the current system.

We also investigated DingTalk's solution, in which the server maintains a globally unique ID for each message. Considering the historical baggage of Xianyu's instant messaging system, we instead use a UUID as the message's unique ID, which greatly improves tracking and de-duplication along the message link.

5.1 Solving message uniqueness
On new versions of the app, the client generates the UUID; for old versions that cannot, the server fills in the relevant information.

A message ID looks like a1a3ffa118834033ac7a8b8353b7c6d9. After the client receives messages, it first de-duplicates by MessageID and then sorts by Timestamp. Client clocks may differ, but the probability of a collision is still fairly small.

Take the iOS side as an example, the code is roughly as follows:

    - (void)combineMessages:(NSArray<PMessage *> *)messages {
        ...
        // 1. De-duplicate by messageId
        NSMutableDictionary *messageMaps = [self containerMessageMap];
        for (PMessage *message in messages) {
            [messageMaps setObject:message forKey:message.messageId];
        }

        // 2. Sort the messages after merging
        NSMutableArray *tempMsgs = [NSMutableArray array];
        [tempMsgs addObjectsFromArray:messageMaps.allValues];
        [tempMsgs sortUsingComparator:^NSComparisonResult(PMessage * _Nonnull obj1, PMessage * _Nonnull obj2) {
            // Sort by the message timestamp
            if (obj1.timestamp < obj2.timestamp) return NSOrderedAscending;
            if (obj1.timestamp > obj2.timestamp) return NSOrderedDescending;
            return NSOrderedSame;
        }];
        ...
    }

5.2 Implementing message retransmission and disconnection/reconnection

Based on the retransmission and reconnection model in section 3 ("Industry plan") of this article, we improved the message retransmission logic on the server side and the disconnection/reconnection logic on the client side.

The specific measures are:

1) The client periodically checks whether the ACCS long connection is alive;
2) The server checks whether the device is online; if it is, it pushes the message and waits for an ACK with a timeout;
3) After receiving a message, the client returns an ACK.
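The push-with-timeout-and-retry loop above can be sketched as follows. The connection object, constants, and offline-storage fallback are assumptions for illustration, not the real server code:

```python
import time

ACK_TIMEOUT = 0.05   # seconds to wait for the client's ACK (short for the demo)
MAX_RETRIES = 3

def push_with_retry(conn, message):
    """Push over the long connection; retry until ACKed or fall back to offline."""
    for _ in range(MAX_RETRIES):
        if not conn.is_online():
            conn.store_offline(message)      # device offline: store for later sync
            return False
        conn.send(message)
        deadline = time.time() + ACK_TIMEOUT
        while time.time() < deadline:        # timeout-bounded wait for the ACK
            if conn.acked(message):
                return True
            time.sleep(0.005)
    conn.store_offline(message)              # all retries failed: deliver on next sync
    return False

class FakeConn:
    """Toy connection that loses the first push, then ACKs the second."""
    def __init__(self, online=True):
        self.online, self.sends, self.offline_box = online, 0, []
    def is_online(self): return self.online
    def send(self, message): self.sends += 1
    def acked(self, message): return self.sends >= 2
    def store_offline(self, message): self.offline_box.append(message)

conn = FakeConn()
assert push_with_retry(conn, "m1") is True and conn.sends == 2
```

Note how the offline-message store doubles as the safety net: whenever the online push path gives up, the message is handed to the synchronization path instead of being dropped.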

5.3 Optimize data synchronization logic
Retransmission and reconnection solve the basic network-layer problems; the next step is to look at the business layer.

In the existing message system, many complex situations are handled by adding compatibility code in the business layer, and message data synchronization is a very typical scenario.

Before improving the data synchronization logic, we also investigated DingTalk's complete data synchronization solution. It is mainly guaranteed by the server, backed by a stable long connection.

The general flow of DingTalk's data synchronization solution is as follows:

Our server does not have this capability yet, so Xianyu can only control the data synchronization logic from the client side.

The data synchronization methods include:

1) Pulling sessions;
2) Pulling messages;
3) Receiving pushed messages, etc.

Because the scenarios involved are complex, there used to be a case where a push would trigger an incremental sync; when there were too many pushes, multiple network requests would be fired at the same time. To solve this, we isolated pushes and pulls into separate queues.

The client-side strategy is: if a pull is in progress, pushed messages are first added to a cache queue; when the pull result comes back, it is merged with the locally cached data. This avoids firing multiple network requests.
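The push/pull isolation strategy above can be sketched like this (illustrative Python; the class and field names are made up, not the actual client code):

```python
class MessageStore:
    """Buffers pushes that arrive while a pull is in flight, merging afterwards."""
    def __init__(self):
        self.messages = {}        # message_id -> message
        self.pulling = False
        self.push_buffer = []

    def on_push(self, msg_id, msg):
        if self.pulling:
            self.push_buffer.append((msg_id, msg))   # buffer; no extra request fired
        else:
            self.messages[msg_id] = msg

    def begin_pull(self):
        self.pulling = True

    def end_pull(self, pulled):
        # Merge the pull result first, then replay the buffered pushes on top,
        # so the pushed (newer) copy of a message wins.
        self.messages.update(pulled)
        for msg_id, msg in self.push_buffer:
            self.messages[msg_id] = msg
        self.push_buffer.clear()
        self.pulling = False

store = MessageStore()
store.begin_pull()
store.on_push("m2", "pushed while pulling")
store.end_pull({"m1": "from pull", "m2": "from pull"})
assert store.messages["m2"] == "pushed while pulling"
assert len(store.messages) == 2
```

The key design point is that a push never triggers a network request while a pull is outstanding; it only mutates local state once the pull completes.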

5.4 Client data model optimization
In terms of data organization, the client is mainly divided into two categories: sessions and messages, and sessions are further divided into: virtual nodes, session nodes, and folder nodes.

The client builds a tree like the one in the figure above. The tree mainly stores the information displayed for each session, such as unread counts, red dots, and the latest message summary. When a child node is updated, the change is propagated up to its parent node, so building the tree is also the process of updating read/unread state.

One of the more complicated scenarios is the "Xianyu Intelligence Agency". It is actually a folder node containing many sub-sessions, which makes its message sorting, red-dot counting, and summary-update logic more complicated. The server informs the client of the list of sub-sessions, and the client then assembles these data models.
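The unread-count propagation described above can be sketched as follows (hypothetical field names, not the actual client data model):

```python
class SessionNode:
    """A node in the session tree: virtual root, folder, or session."""
    def __init__(self, name, unread=0):
        self.name, self.unread = name, unread
        self.children, self.parent = [], None

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        if child.unread:
            self.add_unread(child.unread)   # folder inherits the child's count

    def add_unread(self, n):
        # Updating a node bubbles the delta up through every ancestor,
        # so building the tree is also the process of updating unread state.
        self.unread += n
        if self.parent:
            self.parent.add_unread(n)

root = SessionNode("virtual-root")
folder = SessionNode("folder-node")          # e.g. a folder with many sub-sessions
root.add_child(folder)
chat = SessionNode("chat-with-seller", unread=2)
folder.add_child(chat)
chat.add_unread(1)                           # one more unread message arrives
assert (chat.unread, folder.unread, root.unread) == (3, 3, 3)
```

Bubbling deltas (rather than recomputing whole subtrees) keeps each update O(depth), which matters when a folder node contains many sub-sessions.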

5.5 Server-side storage model optimization

In the preceding sections I roughly described the client's request logic: historical messages are divided into incremental synchronization and full synchronization within a "domain".

This "domain" is actually a server-side concept. It is essentially a cache layer for user messages: when a message arrives, it is temporarily stored in the cache to speed up message reads.

However, this design also has a flaw: the domain is a ring of fixed length that can store at most 256 messages. Once a user's messages exceed 256, the older ones can only be read from the database.
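The domain ring plus database fallback can be sketched as follows (a simplification with assumed names; the real cache lives on the server):

```python
from collections import OrderedDict

RING_CAPACITY = 256

class DomainCache:
    """Fixed-size ring of recent messages, with the database as fallback."""
    def __init__(self, db):
        self.db = db                          # fallback store: message_id -> message
        self.ring = OrderedDict()

    def append(self, msg_id, msg):
        self.db[msg_id] = msg                 # always persisted to the database
        self.ring[msg_id] = msg
        if len(self.ring) > RING_CAPACITY:
            self.ring.popitem(last=False)     # evict the oldest cached message

    def read(self, msg_id):
        if msg_id in self.ring:
            return self.ring[msg_id]          # fast path: served from the cache
        return self.db.get(msg_id)            # slow path: read from the database

db = {}
cache = DomainCache(db)
for i in range(300):                          # 300 messages: 44 overflow the ring
    cache.append(f"m{i}", f"message {i}")
assert len(cache.ring) == 256
assert cache.read("m299") == "message 299"    # recent: cache hit
assert cache.read("m0") == "message 0"        # evicted: served from the DB
```

This makes the flaw concrete: the 257th and older messages are still correct, but every read of them pays the database round-trip.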

Regarding server-side storage, we also investigated DingTalk's solution: write fan-out, in which each recipient gets their own copy of the message. Its advantage is that each user's message view can be customized well; its disadvantage is the very large storage cost.

Our solution sits somewhere between read fan-out and write fan-out. This design complicates the client-side logic and also makes server-side reads slower; it is something we can optimize further in the future.

6. Our optimization work 2: Adding a quality control system

While transforming the full link between client and server, we also added monitoring and auditing of the message pipeline's behavior.

6.1 Full link troubleshooting

Full-link troubleshooting is based on users' real-time behavior logs: client tracking events are flushed into SLS through the group's real-time processing engine, Flink.

User behavior includes:

1) The message engine's processing of each message;
2) The user's clicks and page visits;
3) The user's network requests.

On the server side, long-connection push and retry logs are also cleaned into SLS, which together form a troubleshooting view of the entire link from server to client.

6.2 Reconciliation system
Of course, to verify message accuracy, we also built a reconciliation system:

When a user leaves a session, we take a certain number of messages in the current session, generate an MD5 check code from them, and report it to the server. The server uses the check code to determine whether the messages are correct.

After sampling-based data verification, message accuracy is basically 99.99%.
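The reconciliation check can be sketched as follows (illustrative Python; the message count and the fields being hashed are assumptions, since the article does not specify them):

```python
import hashlib

CHECK_COUNT = 20   # how many trailing messages to checksum (assumed value)

def session_checksum(messages):
    """MD5 over the IDs of the newest CHECK_COUNT messages, oldest first."""
    tail = messages[-CHECK_COUNT:]
    joined = "|".join(m["id"] for m in tail)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

client_msgs = [{"id": f"m{i}"} for i in range(30)]
server_msgs = [{"id": f"m{i}"} for i in range(30)]

# Client and server agree -> checksums match.
assert session_checksum(client_msgs) == session_checksum(server_msgs)

# Server is missing the newest message -> checksums diverge, flagging the session.
server_msgs.pop()
assert session_checksum(client_msgs) != session_checksum(server_msgs)
```

A single checksum per session keeps the report tiny; only sessions whose checksums diverge need a detailed message-by-message comparison.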

7. Optimize the statistical method of data indicators

We ran into a problem when computing the key message metrics: we used to compute them from user tracking events, and found a 3% to 5% discrepancy in the data.

Later, we used sampled real-time reported data to compute the metrics:

Message arrival rate = messages actually received by the client / messages the client should have received

A message counts as "actually received by the client" once it has been persisted to the client's local database.

This metric does not distinguish between offline and online. It takes the user's last device-update time on a given day; in theory, all messages sent that day before that time should have been received.
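As a quick worked example of the metric (the numbers here are made up):

```python
def arrival_rate(received, expected):
    """received: messages persisted on the client; expected: messages it should get."""
    return received / expected

# e.g. 9,990 of 10,000 expected messages persisted on the client that day:
assert round(arrival_rate(9990, 10000), 3) == 0.999   # i.e. 99.9%
```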

After the optimization work above, the message arrival rate of our latest version has basically reached 99.9%. Judging from user feedback, reports of missing messages have indeed dropped a lot.

8. Future planning

Overall, after a year of optimization and governance, the metrics of our instant messaging system are gradually improving.

But there are still some areas to be optimized:

1) Message security is insufficient: it is easy for malicious actors to abuse messages to send illegal content;
2) Message extensibility is weak: adding new cards or capabilities requires releasing a new app version, lacking dynamic extension ability;
3) The underlying protocol is not extensible enough: it is hard to extend today, and we still need to standardize it in the future.

From a business perspective, messaging should be a horizontal, platform-style capability that first- and third-party businesses can integrate with quickly.

Next, we will continue to watch user feedback about messaging, and we hope Xianyu's instant messaging system can help users complete their transactions better.


