Alibaba IM Technology Sharing (7): Optimization Practice of Xianyu IM's Online and Offline Chat Data Synchronization Mechanism

This article was shared by Shu Xian of the Alibaba Xianyu technical team. The original title was "How to effectively shorten the processing time of Xianyu messages"; the text has been revised for this edition.

1. Introduction

The Xianyu technical team has shared several practical summary articles on IM technology. This article shares some problems encountered in the mechanism that synchronizes online and offline chat message data in the Xianyu IM system, along with practical solutions.

Learning and exchange:

Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article has been published simultaneously at: http://www.52im.net/thread-3856-1-1.html )

2. Series of articles

This article is the seventh in the series. The full list is as follows:

"Alibaba IM Technology Sharing (1): The King of Enterprise-level IM - Dingding's Superiority in Back-end Architecture"
"Alibaba IM Technology Sharing (2): Xianyu IM based on Flutter's mobile terminal cross-terminal transformation practice"
"Alibaba IM Technology Sharing (3): Architecture Evolution of Xianyu Billion-level IM Messaging System"
"Alibaba IM Technology Sharing (4): Reliable Delivery Optimization Practice of Xianyu Billion-level IM Message System"
"Alibaba IM Technology Sharing (5): Timeliness Optimization Practice of Xianyu Billion-level IM Messaging System"
"Alibaba IM Technology Sharing (6): Optimization of Offline Push Reach Rate of Xianyu Billion-level IM Messaging System"
"Alibaba IM Technology Sharing (7): Optimization Practice of Xianyu IM's Online and Offline Chat Data Synchronization Mechanism" (* this article)

3. Problem background

With the rapid growth of the number of users, the Xianyu IM system has also faced unprecedented challenges.

After years of business iteration, the client-side IM code has become less clear than it should be, and some previously hidden chat data synchronization problems have been magnified as the number of users has grown.

The synchronization process works as follows: the server divides the data packets that need to be synchronized to the client into different data domains according to their business type, and each data packet has a unique, continuous number (version) within its domain. After a data packet is delivered to the client and successfully consumed, the client records the latest synchronized version number of each data domain. The next synchronization starts from the locally recorded version of each data domain and continues from there.

Of course, users are not always online waiting for messages, so the client combines push and pull to ensure data synchronization.

Specifically:

1) When the client is online: ACCS pushes the latest data to the client in real time (ACCS is a full-duplex, low-latency, high-security channel service that Taobao Wireless provides to developers);
2) After the client starts from the offline state: it pulls the data it missed while offline, based on the local data domain version numbers;
3) When a black hole appears in the received data: a data synchronization pull is triggered (a "black hole" means that the versions of the received data packets are not continuous).
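
The pull side of this strategy can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `build_pull_request` and `is_black_hole`, the request shape, and the domain names are invented for the example and are not taken from the Xianyu code base.

```python
from typing import Dict, List


def build_pull_request(local_versions: Dict[str, int]) -> List[dict]:
    """Build one incremental pull entry per data domain, starting right
    after the version recorded locally (hypothetical request shape)."""
    return [
        {"domain": domain, "from_version": version + 1}
        for domain, version in sorted(local_versions.items())
    ]


def is_black_hole(last_consumed: int, pushed_version: int) -> bool:
    """A pushed packet reveals a black hole when it skips past the
    version that should directly follow the last consumed one."""
    return pushed_version > last_consumed + 1
```

For example, with local versions `{"chat": 10, "ops": 3}` the client would ask for "chat" from version 11 and "ops" from version 4, and a pushed "chat" packet with version 13 would count as a black hole.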

4. Problem analysis

The current chat data synchronization strategy can basically guarantee the data synchronization of IM, but it is also accompanied by some hidden problems.

These implicit problems are mainly:

1) When a burst of data is pushed in a short time, multiple data domain synchronizations are triggered in quick succession. If the data returned by a domain synchronization is itself problematic, another round of synchronization is triggered, which wastes network resources. Redundant data packets and invalid data content also take processing resources away from valid content and waste CPU and memory;
2) The server has no way of knowing whether the data packets in a data domain were actually consumed by the client, and can only passively return data based on the current data domain information;
3) Data collection, message body parsing, and persistence to the database are not cleanly separated, so it is impossible to split out and replace the code of a single layer for an A/B test.

In response to these problems, we restructured Xianyu IM into layers, that is, we extracted the data synchronization layer. Beyond using this synchronization layer in IM, we also hope that, as its stability improves, it can support other business scenarios.

In the following content, we will focus on some practical ideas for solving the problem of data synchronization in Xianyu IM chat on the client side.

5. Optimization ideas

5.1 Hierarchical split
For the server: after the business side generates a data packet, it attaches the current data domain information and then pushes the data to the client through the data synchronization layer.

For the client: after receiving a data packet, it uses the data domain information to determine which business party should consume the packet, makes sure the packets are complete and continuous within the data domain, then unpacks the data body, hands it to the business side for consumption, and reports the consumption status.

Extracting the data synchronization layer: the packing, unpacking, verification, and retry steps of data synchronization are encapsulated, so that upper-layer business code only needs to care about the data domains it listens to. When those data domains receive new data, the business code gets the data for consumption and no longer needs to care about whether the data packets are complete.

As a result:

1) The business side only needs to care about the protocol it integrates with;
2) The data side only needs to care about the protocol used to package the data;
3) The network layer is responsible for the actual data transmission.

The overall architecture principle is as follows:

To sum it up:

1) Align the data transmission protocol of the data layer so that it describes the data domain information of the current packet body;
2) Move the processing, merging, and persistence of messages into the data consumers;
3) Have the upper and lower layers depend on abstractions, and remove dependencies on concrete implementations.
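
One way to picture this layering is a synchronization layer that only knows two abstract interfaces: a consumer above it and a transport below it. The Python sketch below is an assumption-laden illustration, not the actual Xianyu design; `DataConsumer`, `Transport`, `SyncLayer`, and the `pull:<domain>:<version>` payload are all hypothetical names and formats.

```python
from typing import Dict, List, Protocol


class DataConsumer(Protocol):
    """Business side: only knows how to consume an unpacked body."""

    def consume(self, domain_id: str, body: bytes) -> bool: ...


class Transport(Protocol):
    """Network layer: only knows how to send bytes upstream."""

    def send(self, payload: bytes) -> None: ...


class SyncLayer:
    """Data side: routes packets by data domain and depends only on
    the two abstractions above, never on concrete classes."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self._consumers: Dict[str, DataConsumer] = {}

    def register(self, consumer: DataConsumer, domain_ids: List[str]) -> None:
        # One consumer may listen to many domains; each domain has one consumer.
        for domain_id in domain_ids:
            self._consumers[domain_id] = consumer

    def dispatch(self, domain_id: str, body: bytes) -> bool:
        consumer = self._consumers.get(domain_id)
        if consumer is None:
            return False  # nobody listens to this domain
        return consumer.consume(domain_id, body)

    def request_sync(self, domain_id: str, from_version: int) -> None:
        # The payload format is made up for this sketch.
        self._transport.send(f"pull:{domain_id}:{from_version}".encode())
```

Because `SyncLayer` depends only on the two protocols, a test double or an alternative implementation of either side can be swapped in without touching the synchronization code, which is what makes per-layer A/B testing feasible.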

5.2 Data Layer Structure Model
Based on the extracted data model and on generalizing the solutions to the problems encountered so far, the data synchronization layer is split into the following architecture.

The specific implementation ideas are:

1) When the App starts, the ACCS long connection is established to keep the push channel available, and a data pull is triggered based on the current local data domain information;
2) Data consumers register their consumer information together with the data domains they want to listen to; this is a one-to-many relationship;
3) When new data arrives at the client, each data packet is put into the buffer pool of its data domain, and data reading is restarted once the batch has been collected;
4) The best candidate data packet is popped according to the current data domain priorities, and its data domain version is checked against what the consumer requires. If it does not meet the requirement, an incremental data domain synchronization pull is triggered based on the current domain information;
5) While a data domain synchronization pull is in progress, data reading is blocked. Data arriving through ACCS during this time continues to be collected into the queue of its data domain. When the pull result arrives, the data packets are sorted, deduplicated, and merged into the corresponding data domain queue, and then data reading is reactivated;
6) After a data packet body has been correctly consumed by the consumer, the domain information is updated and the server is notified, through the uplink channel, of the data domain information that was correctly processed.
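
Steps 4 and 6 can be illustrated with a small Python function for a single data domain. It is a sketch under assumptions: the name `consume_in_order`, the `(version, body)` tuple shape, and the returned triple are invented for the example; the real client presumably sends the acknowledgement over the uplink channel rather than returning it.

```python
from typing import Callable, List, Tuple

Packet = Tuple[int, bytes]  # (version, body); hypothetical shape


def consume_in_order(
    last_consumed: int,
    packets: List[Packet],
    consume: Callable[[bytes], bool],
) -> Tuple[int, List[int], bool]:
    """Consume packets of one domain strictly in version order.

    Returns (new_last_consumed, versions_to_ack, need_pull).
    """
    acked: List[int] = []
    need_pull = False
    for version, body in sorted(packets, key=lambda p: p[0]):
        if version <= last_consumed:
            continue  # duplicate of something already consumed
        if version != last_consumed + 1:
            need_pull = True  # version gap: an incremental pull is needed
            break
        if not consume(body):
            break  # consumer failed: do not advance the local version
        last_consumed = version  # advance only after successful consumption
        acked.append(version)
    return last_consumed, acked, need_pull
```

Starting from version 5 with packets 4, 6, 6, 7, and 9, the function consumes 6 and 7, ignores the stale 4 and the duplicate 6, and reports that a pull is needed because 8 is missing.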

Data Domain Synchronization Protocol:

The domain information carried in a packet does not need to be large, but it must describe the packet clearly. Specifically:

1) The ID of the target user, used to check that the packet was delivered to the right user;
2) The data domain ID and priority information;
3) The version of the current packet within its data domain.
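
As a rough illustration of such a header, the three items could be modeled as a small immutable record. The field names below are assumptions made for this sketch; the article does not give the real wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHeader:
    """Hypothetical per-packet header; field names are illustrative only."""

    target_user_id: str  # who the packet is meant for
    domain_id: str       # which data domain the packet belongs to
    priority: int        # current priority of that domain
    version: int         # version of this packet within the domain


def is_for_user(header: DomainHeader, current_user_id: str) -> bool:
    """Accept a packet only if it targets the logged-in user."""
    return header.target_user_id == current_user_id
```

A packet whose `target_user_id` does not match the logged-in user can then be discarded before it ever reaches a domain queue.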

Sorting strategy:

When collecting domain data, a sorting operation is needed either when writing the data or when looking it up on read, and the best achievable time complexity is on the order of O(log n).

In actual coding, we found that within a data domain the version numbers of data packets are continuous and unique, with no gaps, so the version of the next data packet is simply the version of the last successfully consumed data body plus one. We therefore store the packets in a Map keyed by Version. This not only reduces the time complexity, but also lets a packet that arrives later with the same version overwrite the earlier content.
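
The idea can be sketched with a Python dictionary standing in for the Map. Since the next version is always the last consumed version plus one, the lookup is a plain hash lookup (O(1) on average) and no sorting is needed. The class and method names are hypothetical.

```python
from typing import Dict, Optional


class DomainBuffer:
    """Pending packets of one data domain, keyed by version."""

    def __init__(self, last_consumed: int) -> None:
        self.last_consumed = last_consumed
        self._packets: Dict[int, bytes] = {}

    def put(self, version: int, body: bytes) -> None:
        if version <= self.last_consumed:
            return  # already consumed, nothing to store
        # A later arrival with the same version overwrites the earlier one.
        self._packets[version] = body

    def peek_next(self) -> Optional[bytes]:
        """Return the body for version last_consumed + 1, if present."""
        return self._packets.get(self.last_consumed + 1)

    def commit(self) -> None:
        """Call after the consumer succeeded: drop the packet and advance."""
        if self._packets.pop(self.last_consumed + 1, None) is not None:
            self.last_consumed += 1
```

Splitting `peek_next` from `commit` keeps the version from advancing until the consumer has confirmed success, which matches the rule that the local domain version is updated only after correct consumption.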

6. New problems and solutions

6.1 Balance of Multiple Data Sources and Unique Data Consumption
Whenever a packet is generated for the current user:

1) If the ACCS long connection currently exists, the data packet is pushed to the client through ACCS;
2) If the App has been in the background for a while, or has been killed, and the ACCS connection is broken, the packet can only be delivered to the user's notification panel as an offline push.

Therefore, whenever the App switches to the active state, it needs to trigger a data synchronization with the server based on the data domain information currently stored locally.

Data packets reach the client from two main sources: pushes over the ACCS long connection and domain synchronization pulls. Consumption, however, is exclusive: consumers are assigned by the data domains they listen to, so only one consumer can consume a given data packet at a time.

In the stress test: when the server pushes a burst of data packets to the client through ACCS in a short period of time, the packets do not arrive in order. The discontinuous data domain versions trigger a new data domain synchronization, so the same data packet reaches the client multiple times through two different channels, wasting traffic.

While a data domain is being synchronized: new data packets generated during that time are still pushed to the client, and their data bodies are valid and must be consumed correctly.

Solution to these problems:

A data intermediate layer is added between data acquisition and data consumption. When a data domain synchronization is triggered, data reading is blocked and the data packets pushed by ACCS are held in a data transfer station. When the data domain pull returns, the data is merged and the data reading process is restarted.
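
A minimal Python sketch of such a transfer station for one data domain might look like this. The class name, its methods, and the choice to let a pulled packet replace a staged packet with the same version are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Packet = Tuple[int, bytes]  # (version, body); hypothetical shape


class TransferStation:
    """Holds packets of one domain while a domain sync pull is in flight."""

    def __init__(self) -> None:
        self.pulling = False
        self._staged: Dict[int, bytes] = {}

    def begin_pull(self) -> None:
        self.pulling = True  # readers should stop consuming from now on

    def on_push(self, version: int, body: bytes) -> None:
        # Pushes keep arriving during the pull; stage them instead of consuming.
        self._staged[version] = body

    def finish_pull(self, pulled: List[Packet], last_consumed: int) -> List[Packet]:
        """Merge pulled and staged packets, dropping duplicates and versions
        that were already consumed; the result is sorted by version."""
        for version, body in pulled:
            self._staged[version] = body
        merged = sorted(
            (v, b) for v, b in self._staged.items() if v > last_consumed
        )
        self._staged.clear()
        self.pulling = False  # reading may resume
        return merged
```

Keying the staged packets by version means a packet that arrives through both channels is kept only once, so the consumer never sees the duplicate.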

6.2 Data domain priority
The data packets that need to be pushed to the client are assigned different priorities according to their business type.

Data packets generated by chats between users have a higher priority than data packets for operational messages. Therefore, when data packets of several priorities arrive at the client in quick succession, the packets in the high-priority data domain must be consumed first. The priority of a data domain also needs to be adjustable dynamically, since the priority strategy changes over time.

Solution to this problem:

Each data domain has its own data queue, and the data packets in the higher-priority queue are read and consumed first.

The data domain information carried in each data packet body can be marked with the current data domain priority. When the priority of a data domain changes, the data packet consumption priority policy can be adjusted accordingly.
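
The two ideas, one queue per data domain and a priority that follows the latest packet header, can be sketched as follows. This is a hypothetical illustration: the class name, the convention that a larger number means higher priority, and the domain names are assumptions.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class PriorityReader:
    """One FIFO queue per data domain; the domain with the highest
    priority value that has pending packets is read first."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[bytes]] = {}
        self._priority: Dict[str, int] = {}

    def push(self, domain_id: str, priority: int, body: bytes) -> None:
        # The priority carried by the latest packet replaces the stored one,
        # so the consumption policy can be adjusted dynamically.
        self._priority[domain_id] = priority
        self._queues.setdefault(domain_id, deque()).append(body)

    def pop(self) -> Optional[Tuple[str, bytes]]:
        ready = [d for d, q in self._queues.items() if q]
        if not ready:
            return None
        domain_id = max(ready, key=lambda d: self._priority[d])
        return domain_id, self._queues[domain_id].popleft()
```

If a "chat" packet with priority 5 and an "ops" packet with priority 1 are both pending, "chat" is read first; if a later "chat" packet arrives with priority 0, the remaining "ops" packets move ahead of it.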

7. The optimized effect

Besides cleaning up the layering of the structure, the data synchronization layer and the services it depends on are now easily decoupled and pluggable at every link. In the stress test scenario, data synchronization also shows a clear improvement in message consumption time and traffic savings.

In the stress test scenario of "100 out-of-order data packets pushed within 500 ms":

1) The message processing time (from receipt to on-screen display) is shortened by 31%;
2) Traffic consumption (the cumulative size of the data packets finally pulled to the client) is reduced by 35%.

8. Follow-up optimization plan

8.1 Data synchronization layer capability improvement
The goal of the data synchronization layer is not only to make sure that data packets reach the client completely, but also to pull data as rarely as possible while staying stable, so that every data acquisition is effective.

Going forward, the data synchronization layer will further optimize the effective data rate and the arrival rate.

The priority strategy of data synchronization will be adjusted dynamically and intelligently for different scenarios.

Blocking push over the long connection ensures that only one mode, push or pull, is active at any given time, further reducing redundant data packets.

8.2 IM terminal side overall architecture upgrade
Upgrading the data synchronization layer strategy is mainly about improving the capabilities of IM. Now that data synchronization has been separated into its own layer, the next step is the message processing pipeline: making every step monitorable and traceable, to raise the rate at which IM data packets are correctly parsed and persisted.

In more detail:

1) With the data source side separated out, subsequent IM refactoring will gradually split message processing into layers as well;
2) Report on the key nodes of message processing and build a complete monitoring system, so that problems are discovered before users complain;
3) Dynamically self-check message integrity to minimize data compensation and backfilling.

JackJiang
