
This article was shared by "current" and "youyou" of the Alibaba Xianyu technical team, and has been revised for this edition.

1. Introduction

Xianyu's instant messaging system has gone through several generations of iteration and can now stably support billions of messages.

In building this messaging system, we went from simple to complex and from being stuck to breaking through. Every technical change was made to better solve the problems the business faced at the time.

This article describes how the architecture of Xianyu's instant messaging system evolved from scratch, in the hope that peers can draw on this experience and find useful inspiration in it.

Study and exchange:

  • Group 5 for instant messaging/push technology development and discussion: 215477170 [recommended]
  • Introduction to Mobile IM Development: "One entry is enough for novices: Develop mobile IM from scratch"
  • Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article was published synchronously at: http://www.52im.net/thread-3699-1-1.html )

2. Series of articles

This is the third article in the series. The series outline is as follows:

"Alibaba IM Technology Sharing (1): Enterprise-level IM King-DingTalk's outstanding features in the back-end architecture"
"Alibaba IM Technology Sharing (2): Xianyu IM's Flutter-based mobile terminal cross-terminal transformation practice"
"Alibaba IM Technology Sharing (3): The Road to the Evolution of the Architecture of Xianyu's 100-Million-Class IM Message System" (* This article)
"Alibaba IM Technology Sharing (4): Practice of Reliable Delivery Technology of Xianyu Billion-level IM Message System" (* to be released later)

3. Version 1.0: Business start-up period, minimum viable system

3.1 Technical background
In 2014, Xianyu was launched as a standalone app for trading idle (second-hand) goods, and the first phase of the app's main flow was completed: listing → search → item details → IM conversation → transaction.

As a start-up app, the business needed to launch as quickly as possible to validate the idea, so the technical team had to build Xianyu's messaging system from scratch.

3.2 Technical solution
As an instant messaging system, the minimum set of capabilities includes:

1) Message storage: sessions, summaries, messages;
2) Message synchronization: push, pull;
3) Message channel: long-lived connection, vendor push.

Unlike the general IM conversation model, a Xianyu conversation is centered on the item: a session is built from "person + person + item".

Because of this difference in the session model, the existing Taobao-family (Taoxi) messaging system could not meet business needs in the short term, while building a fully independent messaging system for Xianyu would have taken a long time.

To get the business launched efficiently, the technology selection maximized reuse of existing system capabilities and avoided reinventing the wheel.

Therefore, our technical solution was:

1) The data model and underlying storage were built on the Taoxi private messaging system;
2) For message data retrieval, the client pulled the full set of message data from the server;
3) The communication protocol used the SDK and mtop.

The overall architecture is shown in the figure below. Fast delivery in this mode gave the business a minimum viable system:

4. Version 2.0: Rapid user growth, the messaging system needs to be rebuilt

4.1 Technical background
The number of Xianyu users quickly passed 1 million, and calls to the instant messaging service soared. Against this background, users routinely reported stuttering and white screens when loading message data, and system alarms fired frequently whenever large volumes of push messages were sent.

The cause: in the version 1.0 architecture, all message data is fetched in pull mode, and the client is a pure UI layer that stores no data.

Specifically:

1) When a user views message data, the pull succeeds only as fast as the network and data access allow, which occasionally causes stuttering and white screens;
2) Data storage is centralized and reads far outnumber writes, so under high concurrency the server load becomes too high.

Regarding point 2): with 10,000 users chatting online at the same time and pulling all messages concurrently under the current architecture, the estimated QPS is 50,000. With 100,000 simultaneous online chat users, the pressure on the server can be imagined.

4.2 Technical solution
Given the problems above, we decided to rebuild the messaging system architecture to cope with greater user growth in the future.

Going back to the core functions of an IM system:

4.2.1) Message storage model:

1) Session model: a session is uniquely identified by owner, itemid, user, and sessionType, with extended attributes added to support personalization;
2) Summary model: the user's view of a session, so that different users in the same session can see personalized presentations; a summary is uniquely identified by userid and sid;
3) Message model: consists of the sender, message content, message version, and sid; sid + message version uniquely identifies a message;
4) Instruction model: a protocol agreed between the two ends, consisting of a set of instructions issued by the server and executed by the client, such as do-not-disturb and delete instructions.
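
The four models above can be sketched as plain data classes. This is only an illustration of the identifying fields named in the text, not Xianyu's actual schema: the class names, the `ext` dictionary, the `unread` counter, and the instruction `args` are assumptions added for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class SessionKey:
    """The four fields that uniquely identify a session."""
    owner: str
    item_id: str
    user: str
    session_type: int


@dataclass
class Session:
    sid: str                                           # session id referenced by summaries and messages
    key: SessionKey
    ext: Dict[str, str] = field(default_factory=dict)  # extended attributes (hypothetical shape)


@dataclass
class Summary:
    """One user's view of a session; unique per (user_id, sid)."""
    user_id: str
    sid: str
    unread: int = 0                                    # hypothetical personalization field


@dataclass
class Message:
    sid: str
    version: int                                       # message version within the session
    sender: str
    content: str

    @property
    def key(self) -> Tuple[str, int]:
        # sid + message version uniquely identifies a message
        return (self.sid, self.version)


@dataclass
class Instruction:
    """A command issued by the server and executed by the client."""
    name: str                                          # e.g. "do_not_disturb" or "delete"
    args: Dict[str, str] = field(default_factory=dict)
```

Because `SessionKey` is frozen, it is hashable and can serve directly as a dictionary key when looking up an existing session for a given "person + person + item" combination.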

4.2.2) Message channel:

1) Online channel: uses the full-duplex, low-latency, high-security channel service provided by Taobao's wireless ACCS long-lived connection;
2) Offline channel: uses AGOO, Taobao's message push platform, which hides the complexity of integrating with mainstream device vendors and serves business systems directly.

4.2.3) Message synchronization model:

1) The client builds a local database to store message data: once message data is stored on the device, synchronization is optimized from pulling everything to a combination of full and incremental synchronization.

Incremental and full synchronization mean the following:

a. Incremental synchronization: the client stores its message position and synchronizes only the incremental messages by comparing it with the latest position on the server;
b. Full synchronization: when the user uninstalls and reinstalls the app, or the position gap is too large, the client pulls all historical message data to rebuild the data on the device.

2) The server builds a personal message domain ring (inbox model) to synchronize incremental data with the client. It also addresses the read-heavy, write-light problem of the version 1.0 architecture: write diffusion into personal domain rings balances the read and write pressure.

The following figure shows the path of a message from sending to receiving, and the processing on the server and the client:

As shown in the figure above, suppose Ua sends a message to Ub. The message write is diffused to the domain rings of both Ua and Ub:

1) When the client is online and the position of the received message equals the client's current domain ring version + 1, the message can be merged directly into the local message database;
2) When the client is offline, only an offline push notification is sent. When the user comes back online, data synchronization is performed, and the server decides whether to trigger incremental or full synchronization.

For point 2), the specific logic is:

1) If the domain ring version difference is less than the threshold, incremental synchronization is performed and the result is merged into the local message database;
2) If the domain ring version difference is greater than the threshold, the full set of messages is pulled and the data is rebuilt on the device.
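
The decision rules above can be written down as two small functions. This is a minimal sketch under stated assumptions: the function and enum names are invented for illustration, a difference exactly equal to the threshold is treated as incremental (the text does not say), and an online push that is not the next expected version is assumed to fall back to the same decision used on reconnect.

```python
from enum import Enum


class SyncAction(Enum):
    NONE = "none"                # nothing to do, client is up to date
    MERGE = "merge"              # merge the pushed message into the local database
    INCREMENTAL = "incremental"  # fetch only the missing versions
    FULL = "full"                # pull everything and rebuild local data


def choose_sync(local_version: int, server_version: int, threshold: int) -> SyncAction:
    """Reconnect case: compare the client's version with the server's latest version."""
    gap = server_version - local_version
    if gap <= 0:
        return SyncAction.NONE
    if gap > threshold:
        return SyncAction.FULL
    return SyncAction.INCREMENTAL  # assumption: gap == threshold is still incremental


def on_online_push(local_version: int, received_version: int, threshold: int) -> SyncAction:
    """Online case: merge directly only when the message is the next expected one."""
    if received_version == local_version + 1:
        return SyncAction.MERGE
    # Assumption: any other version is handled like a reconnect.
    return choose_sync(local_version, received_version, threshold)
```

For example, with a threshold of 100, a client at version 10 that reconnects to a server at version 15 would perform an incremental sync, while a server at version 500 would trigger a full rebuild.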

The entire synchronization logic is built on Xianyu's instant messaging domain ring. The domain ring can be thought of as a fixed-capacity message inbox for a user: every message sent to a user is synchronized to that user's domain ring.

Specifically:

1) Domain ring storage: the domain ring must support highly concurrent reads and writes, and is implemented on tair, Alibaba's distributed KV storage system;
2) Domain ring capacity: to reduce the total volume of message synchronization, the capacity of a personal domain ring is planned from the average number of messages a user needs to synchronize the next time they open Xianyu. Historical data is overwritten in a FIFO loop;
3) Domain ring version: the user's current message position. When a message enters a personal domain ring, the version is incremented strictly and continuously through a tair counter, and it is used to decide between full and incremental synchronization.
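
To make the domain ring concrete, here is a small in-memory sketch. It is not the tair-backed implementation: a Python `deque` stands in for the KV store, a plain integer stands in for the tair counter, and the `DomainRing` class, its `since` method, and the `send` helper are hypothetical names. Returning `None` when needed entries have been overwritten is also an assumption about how a full synchronization could be signalled.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class DomainRing:
    """In-memory stand-in for one user's fixed-capacity inbox."""

    def __init__(self, capacity: int) -> None:
        self.version = 0  # strictly and continuously increasing
        self._slots: Deque[Tuple[int, str]] = deque(maxlen=capacity)

    def append(self, payload: str) -> int:
        """Store a message and return the version assigned to it."""
        self.version += 1
        self._slots.append((self.version, payload))  # oldest entry is dropped when full (FIFO)
        return self.version

    def since(self, client_version: int) -> Optional[List[Tuple[int, str]]]:
        """Entries newer than client_version, or None if some were already overwritten."""
        if client_version >= self.version or not self._slots:
            return []
        oldest = self._slots[0][0]
        if client_version + 1 < oldest:
            return None  # gap cannot be filled from the ring: caller falls back to full sync
        return [entry for entry in self._slots if entry[0] > client_version]


def send(rings: Dict[str, DomainRing], sender: str, receiver: str, payload: str) -> Tuple[int, int]:
    """Write diffusion: one message is written to both users' rings."""
    return rings[sender].append(payload), rings[receiver].append(payload)
```

With a capacity of 3 and five appended messages, a client at version 3 can still catch up incrementally, while a client at version 0 cannot, because versions 1 and 2 have been overwritten.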

With this work complete, Xianyu had its own independent instant messaging system. The problems of the time were alleviated, and the user experience improved greatly.

5. Version 3.0: Rapid business growth, system stability needs to be guaranteed

5.1 Technical background
As Xianyu's business ecosystem grew richer, the types of IM sessions and message content kept expanding. At the same time, with the rapid increase in users, complaints such as messages not being received and messages being delayed became increasingly prominent.

5.2 Problem analysis
Problem 1: The Xianyu app process has no effective keep-alive mechanism. Soon after the app moves to the background, the system suspends the process and the long-lived connection is interrupted. Messages are then pushed through vendor channels, whose real-time performance is poor and whose push priority settings differ, so users perceive message delays.

Problem 2: When messages are pushed online through ACCS, the average delay is short, but false connections occur. Moreover, the current push path has no ack mechanism, so the server believes a message has been delivered when the client has not actually received it. The user sees the message only the next time the app is opened, and perceives it as delayed.

PS: The main cause of false connections is that the user moves the app to the background and the ACCS long-lived connection is interrupted, but the device status update is delayed.

Problem 3: The current push mode (ACCS push) and pull mode (mtop) for message synchronization are not isolated or processed asynchronously on the client, so in some extreme cases the message database is processed incorrectly and messages are lost.

For example, a user receives several messages in a row after coming online, and one of them triggers a domain ring black hole. While synchronization rebuilds the data on the device, there is a small probability of a processing error.

Problem 4: Most online message problems, such as disordered messages, are discovered through user feedback. The system has no awareness of them and no remedial measures, troubleshooting afterwards is difficult, and fixes can only ship with a new app version.

Problem 5: As the business keeps growing, service accounts, mini-program content marketing, and message groups have been built on the messaging system. These message sending paths share the domain ring and data storage, which causes stability problems.

For example, a personal domain ring carries both IM chat and marketing messages. IM chat is triggered by users and requires strong delivery guarantees, while marketing messages are generally sent by the system in scheduled batches; their volume is large and TPS is high, which affects the stability of the IM service.

5.3 Solutions
Based on the analysis above, we addressed the problems one by one.

1) Message retransmission and push-pull isolation:

As shown in the figure:

a. ACK: ensures that messages arrive promptly. When the server pushes a message down through ACCS, it adds the message to a retry queue for delayed retry. After the client receives the ACCS message and processes it successfully, it returns an ack to the server; on receiving the ack, the server updates the message's delivery status and stops retrying. This guards against false device connections and unstable networks;
b. Retransmission: a delayed retransmission strategy decides when to retransmit, to ensure that the message is delivered. Under the adaptive delayed retransmission strategy, a new message first probes the device's network status with a fixed short delay of N seconds, and the delay is then increased by a fixed step M according to the network status. This strategy aims to deliver the message successfully in the shortest time with the fewest retransmissions;
c. Message queue: a message queue is introduced on the client to process messages in order and keep message processing accurate. Push and pull are also isolated so that the queue is consumed in order, which solves the data merge errors caused by concurrent message processing in complex situations.
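
The ACK and delayed retransmission steps can be sketched as a server-side retry queue. This is a simplified illustration, not the production strategy: N and M come from the text, but the `RetryQueue` class, the `push` callback, the `max_attempts` cap, and the explicit `now` timestamps are assumptions, and the sketch always grows the delay by M instead of adapting to probed network status.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def retry_delay(attempt: int, n: float, m: float) -> float:
    """Delay before retry number `attempt` (1-based): N, N+M, N+2M, ..."""
    return n + (attempt - 1) * m


class RetryQueue:
    def __init__(self, push: Callable[[str], None],
                 n: float = 2.0, m: float = 3.0, max_attempts: int = 5) -> None:
        self._push = push                          # hypothetical function that pushes one message
        self._n, self._m, self._max = n, m, max_attempts
        self._heap: List[Tuple[float, str]] = []   # (due_time, msg_id)
        self._attempts: Dict[str, int] = {}        # msg_id -> retries done so far

    def send(self, msg_id: str, now: float) -> None:
        """Push once and schedule the first delayed retry."""
        self._push(msg_id)
        self._attempts[msg_id] = 0
        heapq.heappush(self._heap, (now + retry_delay(1, self._n, self._m), msg_id))

    def ack(self, msg_id: str) -> None:
        """Client confirmed receipt: stop retrying this message."""
        self._attempts.pop(msg_id, None)           # stale heap entries are skipped in tick()

    def tick(self, now: float) -> None:
        """Retransmit every unacked message whose retry time has come."""
        while self._heap and self._heap[0][0] <= now:
            _, msg_id = heapq.heappop(self._heap)
            if msg_id not in self._attempts:
                continue                           # already acked
            done = self._attempts[msg_id] + 1
            self._push(msg_id)
            if done >= self._max:
                del self._attempts[msg_id]         # assumption: give up after max_attempts
                continue
            self._attempts[msg_id] = done
            heapq.heappush(self._heap, (now + retry_delay(done + 1, self._n, self._m), msg_id))
```

Passing `now` explicitly keeps the sketch deterministic; a real service would presumably drive `tick` from a timer or a delay queue.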

2) Data storage split:

More than half of the instant messages Xianyu sends each day are marketing messages. Marketing traffic has pronounced peaks and troughs, and the peaks cause the message database to jitter and affect IM messages. We therefore isolated the storage of messages, summaries, and domain rings by business, to meet the different stability requirements of different business scenarios.

The specific approach is:

1) IM messages require extremely high stability, so their messages and summaries continue to be stored in MySQL;
2) Marketing messages have a short retention period and lower stability requirements than IM, so they are stored in Lindorm;
3) Domain rings are isolated at the instance level, so that the capacity of the IM domain ring is not consumed by other messages in a way that would affect message synchronization.

PS: Lindorm is a multi-model cloud-native database service with advantages such as low cost, custom TTL, and horizontal scalability.
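
The storage split amounts to routing each message by business type. The sketch below only illustrates that idea: the store names follow the text, while the `MessageKind` enum, the `Route` tuple, and the ring instance names are made up for the example.

```python
from enum import Enum
from typing import Dict, NamedTuple


class MessageKind(Enum):
    IM = "im"
    MARKETING = "marketing"


class Route(NamedTuple):
    store: str          # where messages and summaries are persisted
    ring_instance: str  # which isolated domain ring instance is used


# Store names follow the text; the ring instance names are hypothetical.
ROUTES: Dict[MessageKind, Route] = {
    MessageKind.IM: Route(store="mysql", ring_instance="ring-im"),
    MessageKind.MARKETING: Route(store="lindorm", ring_instance="ring-marketing"),
}


def route(kind: MessageKind) -> Route:
    """Pick storage and domain ring instance by business type."""
    return ROUTES[kind]
```

Keeping the mapping in one table means a marketing peak can only touch the marketing store and ring instance, which is the isolation property the split is after.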

3) Online problem discovery and recovery:

The key to ensuring stability is monitoring the core metrics, and monitoring first needs a data source. We instrument the key nodes of the path on both the server and the client. Based on the group's UT and SLS, the data is cleaned and computed in real time through blink, and unified, standardized log data is finally delivered to SLS for real-time monitoring and path troubleshooting.

The core goal of the messaging system is to ensure that user messages are sent successfully, delivered, and delivered in time. We therefore monitor the stability of the system by calculating the send success rate, arrival rate, and message delay.
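
As a rough illustration of the three metrics, the sketch below computes them from per-message trace records. The `MessageTrace` fields and the exact definitions (arrival rate as acked messages over successfully sent messages, delay as ack time minus send time) are assumptions; the real pipeline computes its metrics in blink from UT and SLS logs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class MessageTrace:
    msg_id: str
    sent_ok: bool                     # server accepted and pushed the message
    sent_at: float                    # send timestamp in seconds
    acked_at: Optional[float] = None  # client ack timestamp, None if never acked


def summarize(traces: Iterable[MessageTrace]) -> Dict[str, float]:
    """Compute send success rate, arrival rate, and average delay."""
    all_traces = list(traces)
    sent = [t for t in all_traces if t.sent_ok]
    arrived = [t for t in sent if t.acked_at is not None]
    delays = [t.acked_at - t.sent_at for t in arrived]
    return {
        "send_success_rate": len(sent) / len(all_traces) if all_traces else 0.0,
        "arrival_rate": len(arrived) / len(sent) if sent else 0.0,
        "avg_delay_s": sum(delays) / len(delays) if delays else 0.0,
    }
```

In practice a percentile of the delay is often more telling than the average, but the average keeps the example short.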

In addition, to make user complaints easier to investigate:

1) We designed a set of instructions. Through the agreed instruction protocol, the server sends instructions to designated users, and the client executes them and reports abnormal data, which makes investigation more efficient;
2) We extended the instruction set with commands such as forced full synchronization and data correction, to repair a specific user's message data. Previously, a serious bug could only be resolved by having the user uninstall and reinstall the app; this approach is clearly more user-friendly.

After this series of targeted improvements, technical complaints dropped by 50%, a message stability system was built from 0 to 1, and the user experience improved further.

6. Looking to the future

Xianyu is an e-commerce transaction app in which IM is the step that leads into a transaction, so the IM product experience greatly affects how efficiently users trade.

A user survey conducted some time ago found that the NPS of Xianyu IM was lower than expected (NPS is a measure of user loyalty: promoters% - detractors%).

From user feedback:

1) Some users have strong demands for product features such as message search and grouping;
2) Most users find violation notices during message sending hard to understand;
3) There are still many complaints about messages not being received or being delayed.

Mapped onto Xianyu's current instant messaging system, our architecture still has many areas that need continued improvement.

Typical examples: redundancy in the synchronization protocol, which easily causes problems as requirements iterate; the lack of an effective keep-alive mechanism, which hurts timely message delivery; offline messages not being received on less common device models; and an online database bloated by years of accumulated data. These problems affect the speed of Xianyu's business iteration and its NPS.

As a technical team, our next step is to make improving NPS a core technical goal. Version 4.0 of Xianyu's instant messaging architecture is on the way...

This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".

JackJiang