
This article is adapted from the InfoQ community article "How can 500 million users communicate efficiently? DingTalk reveals its instant messaging service DTIM for the first time", written by Chen Wanhong et al. and planned by Chu Xingjuan, with revisions and changes.

1. Introduction

This article is the first public deep dive into the technical design practice of DTIM (short for DingTalk IM), the instant messaging service of DingTalk, the de facto leader of domestic enterprise IM. It presents the technical challenges DTIM has encountered in real production use and the corresponding solutions, from model design principles to the concrete technical architecture, and from the lowest-level storage model to cross-region unitization. We hope the sharing in this article can bring thinking and inspiration to the development of domestic enterprise-level IM.
Learning and exchange:

  • Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
  • Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK
    (This article has been published simultaneously at: http://www.52im.net/thread-4012-1-1.html )

2. A series of articles

This article is the eighth in a series of articles. The overall directory is as follows:
    "Alibaba IM Technology Sharing (1): The King of Enterprise-level IM - Dingding's Superiority in Back-end Architecture"
    "Alibaba IM Technology Sharing (2): Xianyu IM based on Flutter's mobile terminal cross-terminal transformation practice"
    "Alibaba IM Technology Sharing (3): Architecture Evolution of Xianyu Billion-level IM Messaging System"
    "Alibaba IM Technology Sharing (4): Reliable Delivery Optimization Practice of Xianyu Billion-level IM Message System"
    "Alibaba IM Technology Sharing (5): Timeliness Optimization Practice of Xianyu Billion-level IM Messaging System"
    "Alibaba IM Technology Sharing (6): Optimization of Offline Push Reach Rate of Xianyu Billion-level IM Messaging System"
    "Alibaba IM Technology Sharing (7): Optimization Practice of Xianyu IM's Online and Offline Chat Data Synchronization Mechanism"
    "Alibaba IM Technology Sharing (8): Deep Decryption of the Technical Design of DingTalk Instant Messaging Service DTIM" (* this article)

Related articles:
"Discussion on Synchronization and Storage Scheme of Chat Messages in Modern IM System"
"DingTalk——Technical Challenges of the New Generation Enterprise OA Platform Based on IM Technology (Video + PPT)"
"The Secret of Enterprise WeChat's IM Architecture Design: Message Model, Ten-Thousand-Member Groups, Read Receipts, Message Recall"

3. Technical challenges of DingTalk

DingTalk has been adopted by more than 21 million organizations and more than 500 million registered users. DTIM provides the instant messaging service that DingTalk users rely on for communication inside and outside their organizations, including companies, governments, and schools, ranging in size from a few people to millions. DTIM offers a rich set of features, such as single chat, various types of group chat, message read status, text and emoji, multi-terminal synchronization, dynamic cards, dedicated security and storage, and so on.

At the same time, DingTalk contains many business modules, such as Docs, DING, Teambition, audio/video, attendance, and approval. Each of them uses DTIM for business process notifications, operational message pushes, business signaling delivery, and so on, and each has a different traffic peak model and different availability requirements when calling DTIM. DTIM must face these complex scenarios while maintaining good usability and experience and balancing performance and cost.

A general-purpose instant messaging system already has high requirements on sending success rate, latency, and arrival rate; because of its ToB nature, enterprise IM additionally has extreme requirements on data security and reliability, system availability, multi-terminal experience, and open customization. To build a stable and efficient enterprise IM service, the main challenges DTIM faces are:
1) The extreme experience requirements of enterprise IM challenge the system architecture design: for example, long-term data storage, message roaming, multi-terminal data synchronization, and dynamic messages bring storage efficiency and cost pressure, and multi-terminal data synchronization brings consistency problems.
2) Extreme scenarios and availability problems caused by system errors: for example, messages in very large groups, and the high-concurrency traffic of online office work and online teaching triggered by the sudden epidemic. The system must withstand such traffic shocks and remain highly available; moreover, while the general availability of middleware is below 99.99%, DTIM needs to guarantee 99.995% availability for its core functions.
3) The ever-expanding business scale challenges the deployment architecture: for example, with continuous user growth and emergencies such as the global epidemic, a single-region architecture can no longer meet the needs of business development.

In DTIM's system design:
1) to balance sending and receiving experience, performance, and cost, an efficient hybrid read/write diffusion model and synchronization service were designed, together with customized NoSQL storage;
2) based on analysis of DTIM's traffic, scenarios such as large-group messages, heavy message hotspots on a single account, and message-update hotspots are handled with merging, peak shaving and valley filling, and similar techniques;
3) disaster tolerance is implemented for the application middleware that the core links depend on, so that the failure of any single middleware does not affect core messaging and the basic user experience is preserved. During message storage, if a write to the storage system fails, the system buffers the data and retries, and once the service recovers the data is actively synchronized to the devices.

As the number of users keeps growing, a single region can no longer carry DTIM's capacity and disaster recovery requirements.
DTIM therefore implements a cloud-native, multi-unit, geo-distributed elastic architecture. The layering principle is "heavy cloud, light client": complex work such as computing, storing, and synchronizing business data is moved to the cloud as much as possible, and the client only receives and renders the final data. Reducing the complexity of client-side business logic maximizes the speed of client iteration, so that client development can focus on improving the user's interactive experience. All functional requirements and the system architecture are designed and extended around this principle. The following chapters introduce DTIM in more detail.

4. Model Design

4.1. DTIM system architecture

Low latency, high reach, and high availability have always been the first principles of DTIM's design. Following these principles, DTIM divides the system into three services:
1) Message service: responsible for the core IM message model and the open API. Its basic capabilities include message sending, single-chat relationship maintenance, group meta-information management, historical message pulling, read-status notification, IM data storage, and cross-region traffic forwarding.
2) Synchronization service: responsible for synchronizing users' message data and state data to their devices, exchanging data in real time over the client-server long connection channel while the user's DingTalk devices are online.
3) Notification service: responsible for maintaining users' third-party push channels and for the notification function. When DingTalk's self-built channel cannot synchronize data to a device, DingTalk pushes messages through the notification and pass-through capabilities provided by third-party vendors to ensure the timeliness and effectiveness of DingTalk messages.
Besides serving the message service, the synchronization service and the notification service also provide multi-terminal (multi-device) data synchronization for other DingTalk services such as audio/video, live streaming, DING, and documents.

Figure 1: DTIM Architecture Diagram ▼
The figure above shows the DTIM system architecture. The message link is described in detail below.

4.2. Message sending and receiving link

Figure 2: DTIM message processing architecture ▼
1) Message sending: the sending interface is provided by the Receiver. The DingTalk unified access layer forwards the send request from the client to the Receiver module. The Receiver verifies the legitimacy of the message (content security review of text and images, whether group mute is enabled, whether the conversation's rate-limit rules are triggered, etc.) and the validity of the membership (for a single chat, that the two parties are in a chat relationship; for a group chat, that the sender is in the group member list). After the checks pass, a globally unique MessageId is generated for the message, and the message body and the receiver list are packaged into a message event and delivered to an asynchronous queue for processing by the downstream Processor. Once the event is delivered successfully, the Receiver returns a "sent successfully" receipt to the client.
2) Message processing: when the Processor consumes a message event, it first distributes the event according to the receiver's region (DTIM supports cross-region, i.e. Geo, deployment), stores the data of the receivers in the local region (message body, receiver-dimension data, read status, red-dot updates for the personal conversation list), and finally packages the message body and the in-region receiver list into an IM sync event that is forwarded to the synchronization service through an asynchronous queue.
3) Message receiving: the synchronization service writes the event into each receiver's synchronization queue and checks the online status of the receiver's devices. If the receiver is online, the unsynchronized messages in the queue are retrieved and pushed to each device over the long connection at the access layer. If the receiver is offline, the packaged message data and the offline user list are turned into an IM notification event and forwarded to the PNS module of the notification service.
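To make the send path above concrete, here is a minimal Go sketch under simplifying assumptions: an in-memory channel stands in for the asynchronous queue, validation is reduced to a trivial check, and names such as Receive, Process and regionOf are illustrative, not DTIM's actual interfaces.

```go
// Sketch of the DTIM send path described in 4.2. Names and structures are
// illustrative; an in-memory channel stands in for the real async queue.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type Message struct {
	ID        int64
	ConvID    string
	Sender    string
	Receivers []string
	Body      string
}

var nextID atomic.Int64 // stand-in for a globally unique MessageId generator

// Receive: validate, assign an ID, enqueue, then ack the sender.
func Receive(queue chan<- Message, convID, sender, body string, receivers []string) (int64, error) {
	if body == "" || len(receivers) == 0 {
		return 0, errors.New("invalid message") // legitimacy / membership checks go here
	}
	msg := Message{ID: nextID.Add(1), ConvID: convID, Sender: sender, Receivers: receivers, Body: body}
	queue <- msg // hand off to the async queue; downstream work no longer blocks the ack
	return msg.ID, nil
}

// Process: group receivers by region, persist locally, forward a sync event.
func Process(msg Message, regionOf func(string) string) {
	byRegion := map[string][]string{}
	for _, r := range msg.Receivers {
		byRegion[regionOf(r)] = append(byRegion[regionOf(r)], r)
	}
	for region, users := range byRegion {
		// In the real system: write message body, inbox rows, read status, red dots,
		// then publish an IM sync event for the synchronization service.
		fmt.Printf("region=%s store+sync msg=%d for %v\n", region, msg.ID, users)
	}
}

func main() {
	queue := make(chan Message, 1)
	id, _ := Receive(queue, "cid-1", "bob", "Hi! Alice", []string{"alice"})
	fmt.Println("acked message", id)
	Process(<-queue, func(string) string { return "cn-hangzhou" })
}
```

The property the sketch tries to show is that the sender is acknowledged as soon as the event is safely queued, while regional storage and sync-event fan-out happen asynchronously in the Processor.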
4.3. Storage model design

The quickest way to understand an IM service is to master its storage model. Although mainstream IM services in the industry organize messages and conversations differently, the designs can be summed up in two forms: write diffusion with aggregation on read, and read diffusion with aggregation on write. The so-called read/write diffusion actually defines how messages in a group conversation are stored, as shown below.

Figure 3: Read Mode and Write Mode ▼

As the figure shows:
1) Read diffusion: the message belongs to the conversation. The corresponding table is conversation_message, which stores all messages generated in the conversation (cid -> msgid -> message, where cid is the conversation ID, msgid the message ID, and message the message body). The advantage is high storage efficiency: only the binding between the conversation and the message needs to be stored.
2) Write diffusion: every message generated in a conversation is delivered to each receiver's personal "mailbox", the message_inbox table, which stores all of a user's messages (uid -> msgid -> message, where uid is the user ID). With this implementation, each message in a conversation can present a different state to each receiver.

DTIM has extremely strict requirements on the timeliness of IM messages and on the consistency between client-side and server-side state, especially for historical message roaming. Current IM products in the industry are unsatisfactory at long-term message storage and multi-terminal roaming of historical messages, mainly because the storage cost is too high, so a balance must be found between product experience and cost.

Adopting pure read diffusion places strong constraints on personalized message extensions and their implementation. The problem with pure write diffusion is also obvious: a conversation with N members diffuses N records for every message. If the sending volume and diffusion ratio are small, this is simpler than read diffusion and storage cost is not an issue; however, DTIM conversations are highly active, the average diffusion ratio of a message can reach 1:30, and super-large groups are the core communication scenario of enterprise IM. With full write diffusion, storage cost would inevitably constrain the development of the business.

Therefore, in its storage implementation DTIM adopts a mixed solution, applying read diffusion and write diffusion to different scenarios, and the system merges the two into a unified view from the user's perspective, as shown in the following figure.

Figure 4: DTIM Read/Write Mixed Mode ▼
DTIM's storage is thus a mixture of read diffusion and write diffusion: messages in the user dimension are fetched from both the conversation_message table and the message_inbox table and merged on the application side by message send time. The conversation_message table records the ordinary messages N (Normal) that all group members receive in the conversation, while the message_inbox table is written in the following two scenarios:
1) The first is the directed message P (Private): a message sent in a group conversation with a specified receiver range is written directly into those receivers' message_inbox. For example, a "red envelope opened" status message that only the red envelope sender should see is defined as a directed message.
2) The second is the state transition NP (Normal to Private) of a conversation message: when some behavior changes a message's state for some of its receivers, the changed state is written into message_inbox. For example, if a user deletes a message on the receiving side, the deletion state is written into message_inbox; when the user later pulls messages, the record in conversation_message is overridden by the deletion state in message_inbox, so the user no longer sees the deleted message.
When a user initiates a historical message roaming request for a conversation on the client, the server pulls message lists from the conversation_message table and the message_inbox table according to the message cut-off time (message_create_time) uploaded by the client, merges the states at the application layer, and returns the merged result to the user. The three message types N, P, and NP strike a good balance between personalized message handling and storage cost.
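As a concrete illustration of this hybrid read path, the following Go sketch merges a conversation's shared messages with a user's inbox at pull time; the table access is faked with in-memory slices and the field names are assumptions, not DTIM's real schema.

```go
// Sketch of the hybrid read path from 4.3: session-level messages (N) come from
// conversation_message, personal messages (P) and per-user state overrides (NP)
// come from message_inbox, and the two are merged at pull time.
package main

import (
	"fmt"
	"sort"
)

type Msg struct {
	ID      int64
	Time    int64 // message_create_time
	Body    string
	Deleted bool // NP override, e.g. "deleted on the receiving side"
}

// MergeHistory merges session messages with the user's inbox for one conversation.
func MergeHistory(convMsgs, inboxMsgs []Msg) []Msg {
	byID := map[int64]Msg{}
	for _, m := range convMsgs { // N: one copy shared by all group members
		byID[m.ID] = m
	}
	for _, m := range inboxMsgs { // P adds private messages, NP overrides state
		byID[m.ID] = m
	}
	out := make([]Msg, 0, len(byID))
	for _, m := range byID {
		if m.Deleted {
			continue // an NP deletion hides the shared copy for this user only
		}
		out = append(out, m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Time < out[j].Time })
	return out
}

func main() {
	conv := []Msg{{ID: 1, Time: 100, Body: "Hi! Alice"}, {ID: 2, Time: 101, Body: "red envelope"}}
	inbox := []Msg{{ID: 2, Time: 101, Deleted: true}, {ID: 3, Time: 102, Body: "you opened the red envelope"}}
	for _, m := range MergeHistory(conv, inbox) {
		fmt.Println(m.ID, m.Body)
	}
}
```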
4.4. Synchronization model design

4.4.1 Push model

How are the messages sent in a conversation, and events such as message state changes, synchronized to the client? There are roughly three synchronization models in the industry:
1) client pull;
2) server push;
3) a push-pull combination, in which the server pushes a notification and the client then pulls.
Each has its pros and cons, briefly summarized here:
1) Client pull is simple to implement and cheap to develop; it is the traditional B/S architecture. Its drawback is low efficiency: the pull interval is controlled by the client, and for a real-time scenario such as IM it is hard to choose an effective interval. Too short an interval puts heavy pressure on the server, and too long an interval hurts timeliness.
2) Server push offers low latency and real-time behavior, and, most importantly, keeps the initiative on the server side. Its drawback, compared with pull, is the need to coordinate the processing capabilities of the server and the client.
3) Push-pull combines the advantages of both, but the scheme is more complex and costs one extra RTT compared with pure push; on mobile networks in particular it has to face power consumption and push success rate problems.

Compared with traditional ToC scenarios, DTIM has clear differences:
1) Real-time requirements: in enterprise services, employee chat messages, various system alerts, shared whiteboards in audio/video calls, and so on all require real-time event synchronization, hence a low-latency synchronization scheme.
2) Weak-network access: DTIM serves tens of millions of enterprise organizations across all industries, from 5G in big cities to weak networks in remote mountainous areas, and DTIM messages must be deliverable everywhere. For complex network environments, the server needs to judge the access environment and adjust its data synchronization strategy accordingly.
3) Controllable power consumption and cost: in DTIM's enterprise scenario, the frequency of message sending and receiving is an order of magnitude higher than in traditional IM, so DTIM must keep power consumption controllable, especially on mobile devices. Under this requirement, DTIM needs to reduce the number of IOs as much as possible and merge synchronizations by message priority, which preserves real-time behavior while keeping power consumption low.

From these three points, the server-push model fits DTIM best:
1) it achieves extremely low latency, keeping push times at the millisecond level;
2) the server can judge the quality of the user's access environment from access information and apply corresponding packet-splitting optimizations to guarantee the success rate over weak links;
3) active push saves one IO compared with push-pull; for a service handling over a hundred million messages per minute, this greatly reduces device power consumption, and combined with priority-based message merging it reduces client power consumption further.

Although active push has many advantages, clients go offline, and sometimes a client cannot process data as fast as the server pushes it, which inevitably leads to message accumulation. To reconcile the mismatch between server and client processing capabilities, DTIM supports rebasing: when the number of messages accumulated on the server reaches a threshold, a Rebase is triggered, the client pulls the latest messages from DTIM, and the server skips the accumulated portion and resumes pushing from the latest position. DTIM calls this synchronization model the Preferred-Push Model (PPM).

On top of the PPM push scheme, DTIM makes a series of optimizations to guarantee reliable delivery:
1) Message re-entrancy: the server may push the same message more than once, so the client deduplicates by msgId to avoid displaying a message twice.
2) Message ordering: especially in very active groups, a message may fail to push because of link problems or client-side network jitter while newer messages are pushed normally; if messages were displayed in arrival order, the list would be out of order. The server therefore assigns a timestamp to each message, and whenever the client renders the message list it sorts by timestamp before handing the data to the UI layer.
3) Missing-data back-filling: in extreme cases, a group message event may arrive at the client earlier than the group creation event, so the group's basic information cannot be displayed. The client then actively pulls the group information from the server, stores it locally, and only afterwards reveals the message.

4.4.2 Multi-terminal data consistency

Multi-terminal data consistency has always been the core problem of multi-device synchronization. A single user may be logged in on PC, Pad, and Mobile at the same time; message and conversation red-dot states must be consistent across devices, and when the user switches devices the messages must be fully recoverable. Based on these business requirements and many system-level challenges, DingTalk built a dedicated synchronization service to solve the consistency problem. Its design concepts and principles are:
1) A unified message model abstraction: new messages produced by the DTIM service, as well as events such as read events, conversation additions/deletions/changes, and multi-device red-dot clearing, are all abstracted into synchronization-service events.
2) The synchronization service is unaware of the event type and of how the data is serialized. It assigns each user's events an auto-incrementing ID (note: increasing but not necessarily contiguous), so that events can be traversed and queried in order by ID.
3) A unified synchronization queue: the synchronization service allocates a FIFO queue per user, and the auto-incrementing ID and the event are written into the queue serially. When an event needs to be pushed, the server computes the increment for each of the user's currently online devices and pushes the incremental events to each device.
4) Depending on the device and the network quality, different packet-splitting push strategies can be applied, so that messages are delivered to clients in an orderly, reliable, and efficient way.

The above introduced the design and thinking behind DTIM's storage model and synchronization model:
1) on the storage side, the next section analyzes how storage is deeply optimized based on the characteristics of DTIM messages, including the principles and implementation details;
2) on the synchronization side, a later section further introduces how the multi-terminal synchronization mechanism guarantees that messages must arrive and that every device stays consistent.
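As a small illustration of the client-side reliability rules in 4.4.1 (deduplication by msgId and ordering by the server-assigned timestamp), here is a hedged Go sketch; the in-memory store and type names are invented for the example.

```go
// Sketch of the client-side handling behind the PPM reliability rules in 4.4.1:
// deduplicate pushed messages by msgId and keep the list ordered by the
// server-assigned timestamp. The store is a plain in-memory map for illustration.
package main

import (
	"fmt"
	"sort"
)

type PushedMsg struct {
	MsgID     string
	Timestamp int64 // assigned by the server, used as the sort key on the client
	Body      string
}

type LocalStore struct {
	seen map[string]PushedMsg
}

func NewLocalStore() *LocalStore { return &LocalStore{seen: map[string]PushedMsg{}} }

// OnPush is idempotent: a message pushed twice is applied only once.
func (s *LocalStore) OnPush(m PushedMsg) {
	if _, dup := s.seen[m.MsgID]; dup {
		return
	}
	s.seen[m.MsgID] = m
}

// Timeline sorts by server timestamp before the list reaches the UI layer,
// so a retried push that arrives late does not leave the list out of order.
func (s *LocalStore) Timeline() []PushedMsg {
	out := make([]PushedMsg, 0, len(s.seen))
	for _, m := range s.seen {
		out = append(out, m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp < out[j].Timestamp })
	return out
}

func main() {
	s := NewLocalStore()
	s.OnPush(PushedMsg{MsgID: "m2", Timestamp: 2, Body: "second"})
	s.OnPush(PushedMsg{MsgID: "m1", Timestamp: 1, Body: "first (retried, arrived late)"})
	s.OnPush(PushedMsg{MsgID: "m2", Timestamp: 2, Body: "second"}) // duplicate push, ignored
	for _, m := range s.Timeline() {
		fmt.Println(m.MsgID, m.Body)
	}
}
```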

5. Message storage design

The bottom layer of DTIM uses table storage as the core storage system of the message service. Table storage is a typical LSM architecture, and read/write amplification is a typical problem for such systems. An LSM system uses Major Compaction to reduce the number of data levels and the impact of read amplification. In the DTIM scenario, real-time message writes exceed one million TPS; the system must allocate a lot of computing resources to Compaction, yet latency spikes when reading online messages remained serious. Performance analysis of the storage layer showed the following characteristics:
1) 6% of users contributed about 50% of the message volume, and 20% of users contributed 74%; high-frequency users generate far more messages than low-frequency users, and when a MemTable is flushed, the messages of high-frequency users occupy most of the file;
2) for high-frequency users, precisely because they are "high frequency", when a message enters DTIM the system usually finds their devices online and pushes immediately, so most messages that need to be pushed are still in the in-memory MemTable;
3) high-frequency users generate many messages, and Compaction consumes a lot of the system's computing and IO resources;
4) the messages of low-frequency users are usually scattered across multiple files.
From these four observations we can draw the following conclusions:
1) the vast majority of Compaction work is wasted computation and IO: a large number of files are produced by the heavy write volume, but high-frequency users' messages have in fact already been pushed down to their devices, so Compaction contributes little read acceleration and instead preempts computing resources and causes IO jitter;
2) because low-frequency users write rarely, their data makes up a small proportion of each flushed file; at the same time they come online rarely, so a large backlog of messages accumulates, and when such a user comes online, the continuous historical messages are very likely scattered across multiple files, causing latency spikes when traversing and reading messages, and in severe cases read timeouts.
To solve this we take a divide-and-conquer approach and treat the messages of high-frequency and low-frequency users differently. We borrow the idea of WiscKey's KV separation, which separates out values above a certain threshold to reduce their proportion in the key files and thereby effectively reduce write amplification.
Attachment: the original WiscKey paper download: WiscKey: Separating Keys from Values in SSD-conscious Storage(52im.net).pdf (2.24 MB)
However, WiscKey's KV separation only considers single keys, and in DTIM the size gap between individual values is not large, so applying that KV separation directly does not solve the problems above. We therefore improved on it by aggregating and judging multiple keys that share the same prefix: if the combined value size of those keys exceeds the threshold, their values are packed into a value-block and written out to a separate value file.
The data shows that this approach moves about 70% of the messages in the MemTable into value files during Minor Compaction, greatly shrinking the key files and yielding lower write amplification. High-frequency users send more messages, so their data rows are more easily separated into value files. On the read side, high-frequency users generally read recent messages; since DTIM message IDs are generated incrementally, the keys being read are contiguous and the data within one value-block is read sequentially, so prefetching value-blocks through IO read-ahead can further improve performance. Low-frequency users send fewer messages, so KV separation does not take effect for them, which avoids the extra IO that a value file would add to their reads.

Figure 5: KV separation and read-ahead ▼
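The prefix-aggregated KV separation described above can be sketched in Go as follows; the threshold, layout, and names are illustrative assumptions rather than the real storage engine's format.

```go
// Sketch of prefix-aggregated KV separation: at flush time, rows that share a
// key prefix (here, the user) are grouped; if their combined value size crosses
// a threshold the values are packed into a value-block and only small references
// stay in the key file. Thresholds and layouts are made up for the example.
package main

import "fmt"

type Row struct {
	Key   string // e.g. "uid#msgid"
	Value []byte
}

type KeyEntry struct {
	Key      string
	Inline   []byte // small values stay inline in the key file
	BlockID  int    // otherwise: reference into a value file
	BlockOff int
}

const separateThreshold = 4096 // bytes per prefix group, illustrative

// FlushGroup decides, per prefix group, whether to separate values out.
func FlushGroup(rows []Row, nextBlockID int) (keys []KeyEntry, valueBlock []byte, blockID int) {
	total := 0
	for _, r := range rows {
		total += len(r.Value)
	}
	if total < separateThreshold {
		for _, r := range rows { // low-frequency user: keep values inline
			keys = append(keys, KeyEntry{Key: r.Key, Inline: r.Value})
		}
		return keys, nil, 0
	}
	// High-frequency user: pack all values of this prefix into one value-block.
	for _, r := range rows {
		keys = append(keys, KeyEntry{Key: r.Key, BlockID: nextBlockID, BlockOff: len(valueBlock)})
		valueBlock = append(valueBlock, r.Value...)
	}
	return keys, valueBlock, nextBlockID
}

func main() {
	rows := []Row{{Key: "bob#10005", Value: make([]byte, 3000)}, {Key: "bob#10006", Value: make([]byte, 3000)}}
	keys, block, id := FlushGroup(rows, 1)
	fmt.Printf("separated=%v block=%d blockBytes=%d keyEntries=%d\n", block != nil, id, len(block), len(keys))
}
```

Because all values of one hot prefix land in one contiguous value-block, a later read of that user's recent, contiguously keyed messages can prefetch the block in one IO, which is the read-ahead effect mentioned above.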

6. Multi-terminal synchronization mechanism design

6.1. Overview

The biggest difference between DTIM and typical consumer products in server-to-client data synchronization lies in its huge message volume, complex change events, and strong demand for multi-terminal synchronization. DTIM builds a complete synchronization flow on top of the synchronization service, a server-to-client data synchronization service that serves as a unified downlink platform supporting multiple DingTalk application services.

Figure 6: Microservice Architecture ▼
The synchronization service is a multi-terminal data synchronization service consisting of two parts:
1) the synchronization service deployed on the server side;
2) the synchronization service SDK integrated into the client app.
Its working principle is similar to an MQ message queue:
1) the user ID is analogous to the topic in a message queue;
2) a user device is analogous to a Consumer Group.
Each user device, as a consumer, can obtain a copy of that user's data on demand, which realizes multi-terminal synchronization. When a business needs to synchronize a piece of changed data to a specified user or device, it calls the data synchronization interface; the server persists the data to the storage system and then pushes it to the client while the client is online. When each piece of data is stored, a monotonically increasing position in the user dimension is generated atomically, and the server pushes data to the client in ascending position order. After the client acknowledges, the largest position among the pushed data is saved in the position management storage, and the next push starts from this new position. The synchronization SDK receives the data, writes it to the local database, and then asynchronously dispatches it to the client business modules; after a business module processes the data successfully, the corresponding local record is deleted.

The previous chapter already introduced the synchronization service's push model and the considerations around multi-terminal consistency. This chapter mainly introduces how the synchronization service designs its storage, how data consistency is achieved across devices, and finally the Rebase solution for server-side message backlogs.
* Recommended reading: if you have no concept of message synchronization mechanisms, read "On the Principles of Multi-point Login and Message Roaming of Mobile IM" first.

6.2. Full message storage logic

In the synchronization service, the user is the center: all messages to be pushed to a user are aggregated together, and each message is assigned a unique, increasing PTS (position, short for Point To Sequence); the server stores the position already pushed to each device. The logical form of messages in the storage system can be shown through two users, Bob and Alice.

Example: Bob sends Alice the message "Hi! Alice", and Alice replies to Bob with "Hi! Bob". When Bob sends the first message to Alice, the receivers are Bob and Alice. The system appends the message to the end of both Bob's and Alice's storage areas; once the write succeeds, each of the two rows is assigned a unique, increasing position (Bob's position is 10005, Alice's is 23001). After the write succeeds, the push is triggered. Suppose the last pushed position of Bob's PC is 10000 and that of Alice's phone is 23000. The push process then creates two push tasks. The first is Bob's: the task queries data starting from the last pushed position (10000) + 1, obtains the "Hi" message at position 10005, and pushes it to Bob's device.
After the push succeeds, the pushed position (10005) is saved. Alice's push process is the same. After Alice receives Bob's message, she replies to Bob; similarly to the above, the write succeeds and positions are allocated (Bob's position 10009, Alice's 23003).

Figure 7: Storage Design of the Sync Service ▼
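A minimal Go sketch of this position (PTS) bookkeeping is shown below; in-memory maps stand in for the event storage and the site management system, and the PTS values are contiguous here only for simplicity (DTIM's IDs are increasing but not necessarily contiguous).

```go
// Sketch of the synchronization-service storage from 6.2: every event appended
// to a user's queue gets a monotonically increasing PTS, each device keeps its
// own pushed position, and a push starts from position+1.
package main

import "fmt"

type Event struct {
	PTS  int64
	Data string
}

type SyncService struct {
	queues  map[string][]Event // user -> ordered events
	nextPTS map[string]int64   // user -> last allocated PTS
	sites   map[string]int64   // user+device -> last pushed PTS
}

func NewSyncService() *SyncService {
	return &SyncService{queues: map[string][]Event{}, nextPTS: map[string]int64{}, sites: map[string]int64{}}
}

// Append stores an event for a user and allocates its PTS.
func (s *SyncService) Append(user, data string) int64 {
	s.nextPTS[user]++
	pts := s.nextPTS[user]
	s.queues[user] = append(s.queues[user], Event{PTS: pts, Data: data})
	return pts
}

// Push sends everything after the device's position and advances it on success.
func (s *SyncService) Push(user, device string) []Event {
	site := s.sites[user+"/"+device]
	var batch []Event
	for _, e := range s.queues[user] {
		if e.PTS > site {
			batch = append(batch, e)
		}
	}
	if len(batch) > 0 { // assume the device acked the whole batch
		s.sites[user+"/"+device] = batch[len(batch)-1].PTS
	}
	return batch
}

func main() {
	s := NewSyncService()
	s.Append("bob", "Hi! Alice")         // appended to Bob's queue when he sends
	s.Append("bob", "Hi! Bob")           // Alice's reply, also appended for Bob
	fmt.Println(s.Push("bob", "pc"))     // PC gets both events
	fmt.Println(s.Push("bob", "pc"))     // nothing new: position already advanced
	fmt.Println(s.Push("bob", "mobile")) // a second device replays from its own position
}
```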
6.3. Multi-terminal synchronization logic

Multi-terminal synchronization is a typical characteristic of DTIM; keeping every device's data timely and consistent is the biggest challenge of the synchronization service. The event storage model was introduced above: messages to be pushed are aggregated per user. When a user has multiple devices, each device's position is stored in the position management system, with user + device ID as the key and the last pushed position as the value; for a device's first login, the position defaults to 0. Each device therefore has its own position, while the server keeps only one copy of the data per user; what is pushed to each device is a snapshot of the data from that device's position onward, which keeps the data on every device consistent.

For example, when Bob logs in on his phone (a device that has logged in to DingTalk before), the synchronization service receives the device login event, which carries the position up to which the device last received data (say 10000). The synchronization service queries data from 10000 + 1 onward, obtains the five messages (10005-10017), pushes them to the phone, and updates the server-side position. At this point the messages on Bob's phone and PC are consistent. When Alice sends another message, the synchronization service pushes it to both of Bob's devices, always keeping their message data consistent.

6.4. Optimization for large offline backlogs

As described above, we use a push-first model to deliver data downstream and guarantee real-time events, and use position management to achieve multi-terminal synchronization, but reality is far more complex. The most common case is a device that has been offline and logs in again, by which time the user may have accumulated a large amount of unreceived data, for example tens of thousands of messages. If the scheme above were followed, the server would push a huge number of messages to the client in a short time, very likely exhausting the client's CPU and making the whole device appear frozen. In fact, for an IM scenario, pushing data from several days or even several hours ago has already lost the meaning of instant messaging and only drains the battery of the user's mobile device, which is not worth it; holiday groups, for example, can generate floods of activity messages.

For this situation the synchronization service provides a Rebase solution: when the number of messages to be pushed exceeds a threshold, the synchronization service sends the client a Rebase event. On receiving it, the client fetches the latest message (LastMsg) from the message service, so the large backlog in between can be skipped. When the user later wants to view historical messages, the client backtracks from LastMsg, which saves power and improves the experience.

Take Bob as an example: Bob logs in on a Pad (a brand-new device). The synchronization service receives the Pad login event and finds the login position is 0. Querying from 0 to the present shows 10,000 accumulated messages.
Since this backlog exceeds the synchronization service's threshold, the service sends a Rebase event to the client. The client fetches the latest message "Tks!!!" from the message service, obtains the latest position 10017 from the synchronization service, and tells the synchronization service to push from position 10017 onward. When Bob opens the conversation with Alice, the client only needs to backtrack a few historical messages from LastMsg to fill the chat view.
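The Rebase decision itself can be sketched as a simple threshold check; the threshold value and the latestMsg call below are assumptions for illustration.

```go
// Sketch of the Rebase decision from 6.4: if the backlog between a device's
// position and the newest PTS exceeds a threshold, the device is told to rebase
// to the latest message instead of replaying the whole backlog.
package main

import "fmt"

const rebaseThreshold = 5000 // accumulated events, made-up value

type LoginResult struct {
	Rebase  bool
	FromPTS int64  // where pushes will resume
	LastMsg string // only set when rebasing: fetched from the message service
}

// OnDeviceLogin decides between normal incremental push and a Rebase.
func OnDeviceLogin(deviceSite, latestPTS int64, latestMsg func() string) LoginResult {
	if latestPTS-deviceSite <= rebaseThreshold {
		return LoginResult{FromPTS: deviceSite} // normal path: push deviceSite+1 .. latestPTS
	}
	// Too much backlog: skip it, hand the client the latest message, and move the
	// position forward so pushes continue from "now". Older history is pulled on demand.
	return LoginResult{Rebase: true, FromPTS: latestPTS, LastMsg: latestMsg()}
}

func main() {
	// Bob's brand-new Pad logs in with position 0 while 10017 events exist.
	r := OnDeviceLogin(0, 10017, func() string { return "Tks!!!" })
	fmt.Printf("rebase=%v resumeFrom=%d lastMsg=%q\n", r.Rebase, r.FromPTS, r.LastMsg)
}
```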

7. High Availability Design

7.1. Overview

DTIM provides a 99.995% availability SLA. Millions of organizations use DingTalk as their digital office infrastructure, so even a small amount of jitter in DTIM affects a large number of enterprises, institutions, and schools and may even turn into a public incident. DTIM therefore faces great stability challenges, and high availability is one of its core capabilities. High availability should be viewed from two dimensions. The first is self-protection: when facing traffic floods, attacks, or business exceptions, the service must have flow control and fault-tolerance capabilities so that it can still provide basic service under extreme traffic. The second is scaling capability, such as the expansion of ordinary computing resources and of storage resources, so that resources grow and shrink with traffic, providing high-quality service while striking a better balance with cost.

7.2. Self-protection

7.2.1 Background

DTIM regularly faces sudden bursts of traffic, such as red-envelope battles in groups of tens of thousands, morning rush-hour clock-in reminders, and Spring Festival New Year's Eve red envelopes, all of which generate a large number of chat messages in a short time and put great pressure on the system. We have adopted several measures for this.

The first is flow control. Rate limiting is the simplest and most effective way to protect a system. DTIM protects itself and its downstream dependencies through rate limiting in multiple dimensions, the most important being protection of downstream storage. Storage in a distributed system is sharded, and the most likely problem is a hot single shard. DTIM's data has two dimensions, users and conversations (messages belong to conversations), and sharding follows the same two dimensions, so rate limiting in the conversation and user dimensions both protects individual downstream storage partitions and limits overall traffic to some extent. These two dimensions alone are not enough to control the system's total traffic, so DTIM also rate-limits along two more dimensions, client type and application (server-side IM upstream business), to avoid overall traffic growth impacting the system.

The second is peak shaving and valley filling. Rate limiting is simple and effective but has a larger impact on users. Many messages in DTIM do not require high real-time performance, such as operational pushes; for these, the asynchronous nature of the message system can be fully exploited, and asynchronous message queues are used to shave peaks and fill valleys, which reduces the impact on users and at the same time reduces the instantaneous pressure on downstream systems. DTIM classifies messages by dimensions such as business type (for example operational push) and message type (for example "like" messages), and guarantees that low-priority messages will be processed within a certain period (for example one hour).
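The multi-dimension flow control described above might look roughly like the following Go sketch, with a fixed one-second window and made-up limits per dimension; DTIM's real limiter is certainly more sophisticated.

```go
// Sketch of multi-dimension flow control: every send is checked against
// independent limits keyed by conversation, user, client type and calling
// application, so one hot shard or one noisy upstream cannot exhaust the system.
// The fixed-window counter and the limits are simplified illustrations.
package main

import (
	"fmt"
	"time"
)

type window struct {
	start time.Time
	count int
}

type Limiter struct {
	limits  map[string]int // dimension name -> max requests per second
	windows map[string]*window
}

func NewLimiter(limits map[string]int) *Limiter {
	return &Limiter{limits: limits, windows: map[string]*window{}}
}

// allow checks a single dimension ("user:bob", "conv:cid-1", ...) in a 1s window.
func (l *Limiter) allow(dim, key string, now time.Time) bool {
	k := dim + ":" + key
	w := l.windows[k]
	if w == nil || now.Sub(w.start) >= time.Second {
		w = &window{start: now}
		l.windows[k] = w
	}
	if w.count >= l.limits[dim] {
		return false
	}
	w.count++
	return true
}

// Allow admits a send only if every dimension still has budget.
func (l *Limiter) Allow(conv, user, client, app string) bool {
	now := time.Now()
	return l.allow("conv", conv, now) && l.allow("user", user, now) &&
		l.allow("client", client, now) && l.allow("app", app, now)
}

func main() {
	l := NewLimiter(map[string]int{"conv": 2, "user": 5, "client": 100, "app": 100})
	for i := 0; i < 3; i++ {
		fmt.Println("send", i, "allowed:", l.Allow("cid-1", "bob", "android", "im"))
	}
}
```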
The third is hotspot optimization. DTIM faces all kinds of hotspots, and rate limiting alone is not enough for them. For example, the DingTalk secretary account pushes upgrade reminders to a huge number of users; because one account establishes conversations with a huge number of accounts, a conversation_inbox hotspot appears. If this were handled by rate limiting alone, the push would be too slow and the business would be affected, so the problem has to be solved architecturally. In general there are two classes of problems: hotspots caused by large accounts and by large groups.

For the large-account problem: conversation_inbox is partitioned by user, so all requests of the system account fall into one partition and create a hotspot. The solution is hotspot splitting: conversation_inbox data is merged into conversation_member (partitioned by conversation), converting user-dimension operations into conversation-dimension operations, so that the system account's requests are scattered across all partitions and the hotspot is eliminated.

For the large-group problem, the pressure comes from the heavy volume of message sends, read receipts, and emoji reactions, multiplied by the large number of receivers, so we divide and conquer across these three scenarios.

7.2.2 Deferred computation and on-demand pulling

For message sending, ordinary messages are the same for everyone in the group, so read diffusion can be used: no matter how large the group is, a message is stored only once. On the other hand, everyone has a red-dot count and a LastMsg per conversation, which normally need to be updated on every message; in a large-group scenario this causes enormous diffusion and puts huge pressure on the system. Our solution is that for large groups the red dot and LastMsg are not updated at send time but computed in real time when the user pulls the (folded) conversation list. Since pulling the conversation list is a low-frequency operation and each person belongs to only one or two very large groups, the cost of this real-time computation is small. This removes 99.99% of the related storage operations during peak periods and thus eliminates the impact of large-group messages on DTIM.

7.2.3 Request merging

In the large-group message scenario, if the members are all online, a huge number of read-receipt requests arrive almost simultaneously; handling each one individually would produce M*N updates (M messages, N group members), which is terrifying. DTIM's solution is that the client merges multiple read receipts in one conversation and sends them to the server in a single request, and the server merges the read requests for each message over a time window, for example combining all requests within one minute into one. Likewise, when a message in a large group is liked, a large number of updates to the same message are generated in a short time, each of which has to be diffused to everyone in the group, producing a huge overall diffusion volume. We found that for messages updated many times, multiple updates within a period can be merged, which greatly reduces the diffusion volume; actual optimization data shows the system's diffusion volume during peak periods dropped by 96% year on year.
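The server-side read-receipt merging in 7.2.3 can be sketched as a buffer that is flushed once per window; the window length and the flush callback are illustrative assumptions.

```go
// Sketch of server-side request merging: read receipts for the same message are
// accumulated for a short window and flushed as one batched update instead of
// fanning out every single receipt.
package main

import (
	"fmt"
	"sync"
	"time"
)

type ReadMerger struct {
	mu      sync.Mutex
	pending map[string][]string // msgId -> readers collected in the current window
}

func NewReadMerger(window time.Duration, flush func(msgID string, readers []string)) *ReadMerger {
	m := &ReadMerger{pending: map[string][]string{}}
	go func() {
		for range time.Tick(window) { // one batched update per message per window
			m.mu.Lock()
			batch := m.pending
			m.pending = map[string][]string{}
			m.mu.Unlock()
			for id, readers := range batch {
				flush(id, readers)
			}
		}
	}()
	return m
}

// MarkRead only buffers; the diffusion to other group members happens at flush time.
func (m *ReadMerger) MarkRead(msgID, reader string) {
	m.mu.Lock()
	m.pending[msgID] = append(m.pending[msgID], reader)
	m.mu.Unlock()
}

func main() {
	m := NewReadMerger(200*time.Millisecond, func(id string, readers []string) {
		fmt.Printf("msg %s: +%d read receipts in one update\n", id, len(readers))
	})
	for i := 0; i < 1000; i++ {
		m.MarkRead("m-1", fmt.Sprintf("user-%d", i))
	}
	time.Sleep(300 * time.Millisecond) // let one window elapse for the demo
}
```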
Even with all of the above, it is still difficult to deliver the promised SLA. Besides preventing problems in its own services, DTIM also needs disaster tolerance for the components it depends on. Overall, we adopt a combination of redundant heterogeneous storage and asynchronous queues alongside RPC, so that when any single dependency has problems, DTIM keeps working. Due to space limits, this is not expanded here.

7.3. Horizontal expansion capability

The elastic expansion capability of a service also needs to be viewed in two dimensions:
1) first, elasticity within the service, such as the expansion of computing and storage resources, which is the usual focus when building elastic scaling;
2) second, expansion across regions: the service can spin up a service cluster in another region as needed, the new cluster takes over part of the traffic, and across regions the clusters form one logically unified distributed service. We call this distributed form unitization.

7.3.1 Elastic application architecture

Because DTIM was built and has grown on the cloud, it has more cloud-native features and options when building elasticity. For computing nodes, the application scales horizontally: the system can detect a surge in traffic within a short time and expand quickly, easily absorbing the traffic increases brought by the various activities mentioned above; it also supports scheduled scaling, achieving a better balance between flexibility and cost. For storage, the bottom layer uses a serverless storage service that scales horizontally; the storage service schedules itself dynamically according to read/write traffic, and the application layer is completely unaware of it. For the scalability of the service itself, we also apply immutable infrastructure, stateless applications, single-point removal, loose coupling, load balancing, and so on, giving DTIM an efficient, elastic application architecture.

7.3.2 Region-level elasticity (unitization)

Once efficient elasticity within the application was achieved, business traffic kept growing, and a single region could no longer meet the elastic expansion requirements of DTIM's 100-million-level DAU. Because of DTIM's nature (any two users can chat once they add each other as contacts), it is not possible to simply build isolated, island-style DTIM deployments per region. To achieve elasticity at this scale, on top of the cloud's regional architecture we built, within a Geo, an elastic architecture of multiple active units in different locations that remains logically one system; we call it unitization. The following figure shows DTIM's unitized architecture.

Figure 8: DTIM unitized architecture ▼
For the unitized elastic expansion architecture, the core elements are dynamic traffic scheduling, data self-closure within a single region, and unit-level degradation.

7.3.3 Elastic and dynamic traffic scheduling

Routing determines where data flows. With this capability we can schedule large portions of traffic to new units to absorb rapidly growing business traffic, and at the same time aggregate traffic by enterprise to enable nearby deployment and thus excellent RT. The mainstream unit scheduling solution in the industry is static routing segments based on the user dimension. Its algorithm is simple and reliable, but dynamic route scheduling is hard to achieve: once a user's route is fixed, the serving unit cannot be adjusted. In the DTIM scenario, as enterprises (and their users) grow over time and a single region can no longer effectively support multiple very large enterprise customers, the traditional static scheme can hardly move an enterprise elastically to another unit, and forced migration carries very high operational costs; in other words, the traditional routing scheme lacks flexible scheduling capability.

DTIM provides a globally consistent, highly available routing service (RoutingService). It stores the unit in which each conversation resides, and the message service routes traffic to different units based on it. When an application updates the routing data, the routing information changes, and the routing service simultaneously triggers a data correction event to migrate the conversation's historical message data; after the migration completes, the route is formally switched. The routing service relies on the storage layer's GlobalTable capability underneath, which guarantees cross-region consistency after routing updates.

7.3.4 Elastic unit self-closure

Unit self-closure of data means confining the receiving, processing, persistence, and pushing of "message data", the most important and by far the largest data in DTIM, to the current unit, removing dependencies on other units. This allows units to be expanded effectively and achieves efficient elasticity at the cross-region level. To achieve self-closure of business data within a unit, the most important step is to identify which data the elastic expansion must address. In the DTIM scenario, user profiles, conversation data, and message data are the core assets; the scale of message data far exceeds the others, and the elastic expansion capability is built around the processing of message data. The key question is which dimension to partition messages by so that a unit is self-closed. In an IM scenario, message data comes from chats between people, so it could be partitioned by person; but if the two people chatting belong to different units, the message data would have to be copied or made redundant in both units, so partitioning by person is not a good dimension. DTIM partitions by conversation: people and conversations are metadata whose scale is limited, while message data is nearly unlimited; messages belong to conversations, conversations do not intersect, and processing a message requires no cross-unit calls.
Therefore, assigning different conversations to different units ensures that message data is processed and persisted in exactly one unit and that no cross-unit request calls are generated, thereby realizing unit self-closure.

7.3.5 Elastic unit degradation

In a unitized architecture, multiple units are the basic form used to support horizontal expansion at the service level. Abnormal traffic in a unit, or maintenance and version upgrades of its services, would otherwise enlarge the blast radius and affect DTIM as a whole. DTIM therefore focuses on the ability to degrade a unit: after a single unit loses its service capability, DTIM switches business traffic to a new unit, new messages are pushed down from the healthy units, the DingTalk client's data rendering is not affected by the faulty unit, and the user does not perceive the unit failover.
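Finally, the conversation-dimension routing and unit failover described in 7.3.3-7.3.5 can be sketched as follows; consistency details such as GlobalTable replication and historical data migration are omitted, and the structures are assumptions for illustration.

```go
// Sketch of conversation-dimension routing for unitization: a routing service
// maps each conversation to a unit, the message service looks up the unit before
// processing, and an unhealthy unit can be failed over by rewriting its routes.
package main

import (
	"errors"
	"fmt"
)

type RoutingService struct {
	routes  map[string]string // conversation id -> unit
	healthy map[string]bool   // unit -> health status
	defUnit string
}

func NewRoutingService(defUnit string) *RoutingService {
	return &RoutingService{routes: map[string]string{}, healthy: map[string]bool{defUnit: true}, defUnit: defUnit}
}

func (r *RoutingService) Bind(convID, unit string) { r.routes[convID] = unit; r.healthy[unit] = true }

func (r *RoutingService) SetHealth(unit string, up bool) { r.healthy[unit] = up }

// Route returns the unit that owns a conversation, failing over when a unit is down.
func (r *RoutingService) Route(convID string) (string, error) {
	unit, ok := r.routes[convID]
	if !ok {
		unit = r.defUnit
	}
	if r.healthy[unit] {
		return unit, nil
	}
	// Degradation: switch the conversation to a healthy unit; new messages are
	// served there while the faulty unit recovers (history migrates afterwards).
	for u, up := range r.healthy {
		if up {
			r.routes[convID] = u
			return u, nil
		}
	}
	return "", errors.New("no healthy unit")
}

func main() {
	rs := NewRoutingService("unit-A")
	rs.Bind("cid-42", "unit-B")
	fmt.Println(rs.Route("cid-42")) // unit-B
	rs.SetHealth("unit-B", false)
	fmt.Println(rs.Route("cid-42")) // failed over to a healthy unit
}
```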

8. Summary of this article

Through the dimensions of model design, storage optimization, synchronization mechanism, and high availability, this article has presented the core of contemporary enterprise-level IM design. It is also a technical summary of DTIM over the recent period. As the number of users keeps growing, DTIM continues to advance with the times, iterating and optimizing: for example, supporting conditional indexes to achieve index acceleration with controllable cost, making message positions contiguous to enable on-demand pulling and efficient integrity verification, and providing multiple uplink and downlink channels to further improve the success rate and experience under weak networks.

9. Related Articles

[1] Enterprise-level IM King - Dingding's excellence in back-end architecture
[2] Discussion on Synchronization and Storage Scheme of Chat Messages in Modern IM System
[3] Dingding——Technical Challenges of the New Generation of Enterprise OA Platform Based on IM Technology (Video + PPT)
[5] A set of IM architecture technology dry goods for hundreds of millions of users (Part 1): overall architecture, service splitting, etc.
[6] A set of IM architecture technology dry goods for hundreds of millions of users (Part II): reliability, orderliness, weak network optimization, etc.
[7] From novice to expert: how to design a distributed IM system with hundreds of millions of messages
[8] The secret of IM architecture design of enterprise WeChat: message model, 10,000 people, read receipt, message withdrawal, etc.
[9] Comprehensively reveal the reliable delivery mechanism of billion-level IM messages
[10] A set of high-availability, easy-to-scale, high-concurrency IM group chat, single chat architecture scheme design practice
[11] A set of practice sharing of mobile IM architecture design for massive online users (including detailed pictures and texts)
[12] A set of original distributed instant messaging (IM) system theoretical architecture scheme

