This article was shared by the author jhon_11 and has been substantially revised and edited.
1. Introduction
How to design a high-performance, high-concurrency, high-availability IM integrated messaging platform is a problem that many companies encounter and must solve as they grow. For example, a company's internal communication system and the customer service consulting systems of Internet platforms all depend on an easy-to-use, maintainable IM integrated messaging system.
So how should we design an IM system that has these three "high" characteristics and can also support access from multiple business lines at the same time (for example internal OA communication, customer service consultation, and message push)?
Below I walk through the architectural design process of the company's IM integrated messaging system that I am responsible for, along with some ideas and lessons from that process, in the hope that they are useful to you.
Further reading:
- Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
- Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK
(This article has been published simultaneously at: http://www.52im.net/thread-3954-1-1.html )
2. The first version of IM architecture
2.1 Overview
The first version of IM came about because the company needed IM message middleware to support the customer service consulting business.
However, to make it easy for other business lines to connect to the messaging platform later, the requirements for the whole message center were handed to the middleware team from the start, so that business lines other than customer service could also connect to the integrated message center and real-time access for multiple kinds of messages could be supported.
2.2 Introduction to the architecture of the first version
The architecture diagram of the first version is shown in the following figure:
For the above architecture diagram, we explain the function of each module one by one.
1) Storage side:
In the first version of the architecture, we used TiDB and Redis as the main storage:
[1] Redis stores message read/unread state and caches connection information, among other uses;
[2] TiDB, an open source distributed database, was chosen to store the messages themselves.
2) MQ message bus:
We use RocketMQ to implement the message bus (that is, in a distributed deployment, different IM instances exchange messages through MQ).
The message bus is the core of the entire IM system, and RocketMQ can support TPS on the order of 100,000. Almost all services consume messages from the message bus for business processing.
3) ZooKeeper registration center: every service registers itself in ZooKeeper (zk), which makes internal calls between services easier and also lets services be exposed to external callers.
4) Link service:
The link service mainly accepts client connections over WebSocket (ws), TCP, UDP and other protocols.
On connection it calls the user service for authentication, delivers a connection-success message to the location service for consumption, and the connection information is stored.
A message arriving over WebSocket first reaches the link service and is then delivered to the message bus.
5) Message distribution service:
The message distribution service mainly receives and processes messages pushed by the message bus. After building the message body according to IM's internal message protocol, it pushes the message back to the message bus (from where it goes, for example, to the conversation service, the message box, or the link service).
6) Location service:
Stores information about the connections held by the link service (WebSocket connections, TCP connections and so on) and caches it in Redis with the userId as the key, so that it is easy to look up by userId which link instance a user's logged-in client is connected to.
A user can be logged in on only one device per client type, but login from multiple client types at the same time is supported.
7) User service: stores all users and provides an authentication and query interface.
8) Message box: stores all messages and provides message query, read/unread status, unread counts, message retrieval and other functions.
9) Conversation service: manages conversations, including group chat and single chat conversations.
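The location service lookup described in item 6) can be sketched as a small registry keyed by userId. This is only an illustrative sketch under stated assumptions: a plain dict stands in for Redis, the names LocationRegistry, register, unregister and lookup are hypothetical, and it assumes that the login rule means one connection per client type (for example one "web" and one "mobile" connection per user).

```python
from typing import Dict, Optional


class LocationRegistry:
    """Maps userId -> {client_type: link_id}. A dict stands in for Redis (assumption)."""

    def __init__(self) -> None:
        self._connections: Dict[str, Dict[str, str]] = {}

    def register(self, user_id: str, client_type: str, link_id: str) -> Optional[str]:
        """Record a connection; return the link id it replaced for this client type, if any."""
        per_user = self._connections.setdefault(user_id, {})
        previous = per_user.get(client_type)
        per_user[client_type] = link_id
        return previous

    def unregister(self, user_id: str, client_type: str) -> None:
        """Forget a connection; drop the user entry once no connections remain."""
        per_user = self._connections.get(user_id, {})
        per_user.pop(client_type, None)
        if not per_user:
            self._connections.pop(user_id, None)

    def lookup(self, user_id: str) -> Dict[str, str]:
        """Return a copy of {client_type: link_id} for the user (empty if offline)."""
        return dict(self._connections.get(user_id, {}))


registry = LocationRegistry()
registry.register("u1", "web", "link-1")
registry.register("u1", "mobile", "link-2")
# A second web login replaces the first one; the caller could then close "link-1".
replaced = registry.register("u1", "web", "link-3")
```

Returning the replaced link id is one possible way to let the caller disconnect the older session; the article does not say how the original system handles this case.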
2.3 Overall Timing Diagram
The timing diagram of the overall architecture is as follows:
3. Problems in the first version of the IM architecture, and our thinking
The previous section described the design ideas and concrete flow of the first version of the IM system architecture in detail.
So what problems remain in the first version of the design, and how can they be optimized? Let's go through them one by one.
3.1 The problem of using the MQ message bus
As described in the previous section, in the first version of the IM architecture the link service passes messages to the message distribution service through the MQ message bus (in plain terms, communication between different IM instances in the IM cluster relies on MQ for message delivery).
An MQ message bus inevitably introduces some delay, and how much depends on the implementation and technical strategy of the MQ itself.
For example:
When two clients A and B connected to different IM instances chat, the path is: user A --> link --> message bus --> message distribution service --> message bus --> link --> user B.
As this example shows, the IM message delivery path is too long, which greatly reduces the throughput of the system.
3.2 The problem of message storage: write diffusion
In the first implementation, we used the same write diffusion strategy as WeChat (see "The secret of IM architecture design of enterprise WeChat: message model, 10,000 people, read receipt, message withdrawal, etc.", reference [11]).
So why is write diffusion not a defect for WeChat, yet a real defect for our IM architecture?
Technical features of WeChat:
1) WeChat states that it does not store users' chat records; messages are all pushed in real time;
2) All WeChat chat records are stored on the user's phone, and the chat records on two mobile devices are not synchronized with, or visible to, each other.
Technical features of our IM integrated message center:
1) The integrated message center must support pulling historical chat records from the server, so it stores all messages in full;
2) The clients of the integrated message center need to include a web version.
In summary:
1) Write diffusion suits WeChat well, since it is an instant messaging product with a rich mobile client. When a message is distributed, it is pushed to the client of every user in the conversation (single chat or group chat); if a user has no connection, an offline cached message is written for that user, and the next time the user logs in the message is pulled from the cache and the cache is cleared;
2) Write diffusion does not suit a general integrated messaging platform like ours. Most of the systems that connect to us use web clients, which cannot cache messages locally: a browser refresh loses the messages, so historical messages must be pulled from the server in real time. Suppose we used write diffusion and a group chat had 500 users. Each message in that conversation would have to be written 500 times, once per user, which greatly increases write I/O, and these writes cannot go to a cache; they have to go to the database.
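The write amplification in the 500-user example can be made concrete with a short sketch. It is only an illustration under assumptions: an in-memory dict of per-user inboxes stands in for the database, and the function name write_diffusion_store is hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List


def write_diffusion_store(
    inboxes: DefaultDict[str, List[str]], members: List[str], message: str
) -> int:
    """Write one copy of the message into every member's inbox; return the write count."""
    writes = 0
    for user_id in members:
        inboxes[user_id].append(message)  # one storage write per recipient
        writes += 1
    return writes


inboxes: DefaultDict[str, List[str]] = defaultdict(list)
group_members = [f"user-{i}" for i in range(500)]
writes = write_diffusion_store(inboxes, group_members, "hello")
print(writes)  # 500: a single group message costs one write per member
```

With write diffusion the number of storage writes grows with the group size, which is the cost described in point 2) above.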
3.3 TiDB has problems with instability and concurrent transactions
TiDB is a mainstream open source distributed database with high query efficiency and no need for manual database and table sharding.
But TiDB also brought some hidden problems for us:
1) Under high concurrency, concurrent transactions in TiDB led to transaction failures, and we did not identify the specific cause;
2) Troubleshooting TiDB is costly, the company has few dedicated TiDB operations staff, and we often ran into queries that did not use an index.
3.4 The problem of group chat and single chat coupled in the same service
In our first version of the IM architecture design, single chat and group chat were handled together in the conversation service and stored together in the same table.
From a data point of view, however, single chat and group chat differ somewhat (for example in their business attributes). Although both are conversations, the two should be split into separate services. A finer-grained service split gives better control over the overall logic.
4. Upgraded IM architecture
4.1 Initial architecture issues
As described in the previous two sections, we gradually found that the initial IM architecture had serious shortcomings.
The following problems were exposed in production:
1) TPS did not meet expectations, and throughput could not keep up with the growth of the company's business;
2) The storage middleware (mainly TiDB) was hard to maintain, trial and error was costly, problems were often exposed in production, and it kept getting slower;
3) Write diffusion for messages was unnecessary and greatly increased the number of I/O operations (see the previous section for the reasons);
4) Some features could not be supported, such as retrieval of image and text messages, and message read/unread status.
4.2 Introduction to the upgraded IM architecture
The upgraded IM architecture is shown in the following figure:
As shown in the figure above, the modified modules are as follows:
1) Storage side: we switched to MySQL, and the message service uses its own master-slave MySQL cluster (the master node handles message writes, and the slave nodes handle message retrieval);
2) MQ message bus: no change compared with the first version;
3) Link service: compared with the first version, the way the link service pushes messages to the message distribution service has changed (from the MQ bus to real-time TCP push);
4) Message distribution service: it combines message processing and routing capabilities, and each message distribution service holds TCP connections to all link services;
5) Single chat service: responsible for managing single chat conversations;
6) Group chat service: responsible for managing group chat conversations;
7) User service: provides user authentication and login/registration capabilities.
5. Detailed comparison with the first version of the IM architecture
Compared with the initial version, the upgraded IM architecture mainly includes the following changes.
5.1 Improved message distribution between different IM instances
To address the problem with the MQ message bus in the initial version, in the upgraded architecture we changed the path from the link service to the message distribution service to a real-time TCP connection. With millions of clients connected to the same link machine, messages are delivered in real time, and throughput reached 160,000 TPS.
Reworking the path from the link service to the message distribution service is one of the highlights of this design. It completely eliminates the MQ push delay on this path, routing is simple, and delivery is almost real-time.
For example (when two clients A and B connected to different IM instances chat):
1) First version: user A --> link --> message bus --> message distribution service --> message bus --> link --> user B;
2) Upgraded version: user A --> link --> message distribution service --> link --> user B.
In addition, the link service pushes messages to the message distribution service cluster using round-robin load balancing, which keeps the distribution fair and avoids overloading individual machines.
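Round-robin selection of a message distribution node can be sketched in a few lines. This is a minimal sketch rather than the system's actual code: the class name RoundRobinBalancer and the node names are hypothetical, and it assumes a fixed node list with no health checks.

```python
import itertools
from typing import Iterator, List


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotating order so each gets an equal share."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        # Copy the list so later changes by the caller do not affect the rotation.
        self._cycle: Iterator[str] = itertools.cycle(list(nodes))

    def next_node(self) -> str:
        """Return the next node in the rotation."""
        return next(self._cycle)


balancer = RoundRobinBalancer(["dispatch-1", "dispatch-2", "dispatch-3"])
picks = [balancer.next_node() for _ in range(6)]
# picks == ["dispatch-1", "dispatch-2", "dispatch-3", "dispatch-1", "dispatch-2", "dispatch-3"]
```

A real deployment would also need to add and remove nodes as they come and go (for example from the registration center), which this sketch leaves out.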
5.2 Removal of the location service
The location service was removed (location here does not mean geographic location messages in IM), and its capabilities were integrated into the message distribution service.
The message distribution service's own logic is simple, so there is no need for a separate location service, which would only add network I/O. The message distribution service is also directly connected to the link service, so it is more convenient to make it responsible for routing.
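One way to picture a message distribution service that also owns routing is the sketch below. It rests on several assumptions: the names Dispatcher, on_connect, on_disconnect and dispatch are hypothetical, a Python callable stands in for each TCP connection to a link server, and the routing table lives in a single process. How the real system shares routing state across a cluster of distribution nodes is not described in the article.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a TCP connection to one link server:
# a callable that takes (user_id, payload) and writes to that user's socket.
LinkSender = Callable[[str, str], None]


class Dispatcher:
    """Message distribution with the routing table kept in-process."""

    def __init__(self, links: Dict[str, LinkSender]) -> None:
        self._links = links                # link_id -> sender for that link server
        self._routes: Dict[str, str] = {}  # user_id -> link_id the user is connected to

    def on_connect(self, user_id: str, link_id: str) -> None:
        """Remember which link server holds the user's connection."""
        self._routes[user_id] = link_id

    def on_disconnect(self, user_id: str) -> None:
        self._routes.pop(user_id, None)

    def dispatch(self, to_user: str, payload: str) -> bool:
        """Push to the recipient's link server; return False if the user is offline."""
        link_id = self._routes.get(to_user)
        if link_id is None:
            return False
        self._links[link_id](to_user, payload)
        return True


delivered = []
dispatcher = Dispatcher({"link-1": lambda user, body: delivered.append(("link-1", user, body))})
dispatcher.on_connect("userB", "link-1")
dispatcher.dispatch("userB", "hello from userA")
```

The routing lookup is a local dictionary access here, with no extra network call to a separate location service, which is the saving the section describes.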
5.3 Storage changed from TiDB to MySQL
The storage side was changed from TiDB to MySQL, which improves maintainability. The message service uses MySQL master-slave read-write separation, which speeds up message storage and retrieval and reduces pressure on the database.
As mentioned earlier, using a distributed database that is costly to maintain and hard to troubleshoot was very painful.
MySQL has been more stable for us, and the learning cost for the team is relatively low. Using read-write separation for the message service can greatly improve message throughput.
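Read-write separation can be sketched as a small router that sends writes to the master and spreads reads across the slaves. This is a simplified illustration with assumed names (ReadWriteRouter, route, and the node labels are hypothetical); it classifies statements only by their first keyword and does not open real database connections.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Chooses a database node: SELECTs go to slaves in turn, everything else to the master."""

    def __init__(self, master: str, slaves: List[str]) -> None:
        self._master = master
        # Fall back to the master for reads when no slave is configured.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def route(self, sql: str) -> str:
        """Return the name of the node that should run this statement."""
        words = sql.lstrip().split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._slaves)
        return self._master


router = ReadWriteRouter("mysql-master", ["mysql-slave-1", "mysql-slave-2"])
router.route("INSERT INTO message (body) VALUES ('hi')")  # goes to mysql-master
router.route("SELECT body FROM message WHERE id = 1")     # goes to mysql-slave-1
```

One design point to keep in mind: with asynchronous replication a slave can lag behind the master, so a read issued right after a write may not see it yet. Reads that must be fresh may need to be sent to the master.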
5.4 Features that the first version could not provide
In the upgraded architecture, we implemented features that the first version could not provide, such as message read/unread status, red envelope push, and product link push.
The new version of the integrated message center adds message read/unread status, sending red envelopes, link push and similar functions. These features carry some business-specific characteristics, and not every IM deployment needs them, so they can be turned off through configuration.
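Turning business-specific features on or off through configuration might look like the sketch below. The article does not describe the actual configuration format, so everything here is an assumption: the FeatureConfig fields, the message type names, and the allowed_message_types helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class FeatureConfig:
    """Per-deployment switches for optional, business-specific features (hypothetical)."""
    read_receipts: bool = False
    red_envelopes: bool = False
    link_push: bool = False


def allowed_message_types(config: FeatureConfig) -> Set[str]:
    """Plain text is always allowed; optional types depend on the configuration."""
    types = {"text"}
    if config.read_receipts:
        types.add("read_receipt")
    if config.red_envelopes:
        types.add("red_envelope")
    if config.link_push:
        types.add("link")
    return types


# A customer service deployment might enable link push but not red envelopes.
customer_service = FeatureConfig(read_receipts=True, link_push=True)
print(sorted(allowed_message_types(customer_service)))  # ['link', 'read_receipt', 'text']
```

Keeping the switches in one immutable object makes it easy to see which optional behavior a given business line has enabled.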
5.5 Messages changed from write diffusion to read diffusion
In the upgraded IM architecture, message storage was changed from write diffusion to read diffusion.
The trade-offs of write diffusion were discussed earlier. For web-based IM, read diffusion is the better fit: each message only needs to be persisted once, which greatly improves the throughput of the message service.
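Read diffusion can be sketched as one shared message log per conversation plus a small read cursor per user. This is a minimal in-memory sketch, not the article's implementation: the class ConversationStore and its method names are hypothetical, and lists and dicts stand in for MySQL tables.

```python
from typing import Dict, List, Tuple


class ConversationStore:
    """Read diffusion: each message is stored once; every user keeps only a read cursor."""

    def __init__(self) -> None:
        self._messages: Dict[str, List[str]] = {}       # conversation_id -> ordered messages
        self._cursors: Dict[Tuple[str, str], int] = {}  # (conversation_id, user_id) -> messages read

    def append(self, conversation_id: str, message: str) -> int:
        """Store the message once; return its 1-based sequence number in the conversation."""
        log = self._messages.setdefault(conversation_id, [])
        log.append(message)
        return len(log)

    def pull_unread(self, conversation_id: str, user_id: str) -> List[str]:
        """Return messages after the user's cursor and advance the cursor to the end."""
        log = self._messages.get(conversation_id, [])
        start = self._cursors.get((conversation_id, user_id), 0)
        self._cursors[(conversation_id, user_id)] = len(log)
        return log[start:]

    def history(self, conversation_id: str) -> List[str]:
        """Full history, for example for a web client after a browser refresh."""
        return list(self._messages.get(conversation_id, []))


store = ConversationStore()
store.append("group-1", "hello")         # one write, however many members the group has
store.pull_unread("group-1", "user-7")   # each reader pulls from the shared log
```

The cost moves to the read side: every member reads from the same log, and the only per-user state is the cursor.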
5.6 Added a facade service
In the upgraded IM architecture, we added a facade service, im-logic, which exposes interfaces for third-party business lines to call.
In the initial architecture, each IM service exposed its own interfaces to external callers; in the upgraded version, external calls go through the logic service.
The logic service can do some processing on each call without affecting the generality of IM as a whole or adding complexity to IM's underlying code, which decouples business logic from the underlying layer.
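The facade idea can be illustrated with a thin wrapper that business lines call instead of the core services. This is a hypothetical sketch: the class ImLogicFacade, the app_id registration step, and the core_send signature are assumptions made for illustration and are not the actual im-logic interface.

```python
from typing import Callable, Set

# Hypothetical signature of the core IM send call: (from_user, to_user, text) -> message id.
CoreSend = Callable[[str, str, str], str]


class ImLogicFacade:
    """Single entry point for business lines; the core IM services stay generic."""

    def __init__(self, core_send: CoreSend) -> None:
        self._core_send = core_send
        self._registered_apps: Set[str] = set()

    def register_app(self, app_id: str) -> None:
        """Allow a business line (identified by app_id) to call the facade."""
        self._registered_apps.add(app_id)

    def send_text(self, app_id: str, from_user: str, to_user: str, text: str) -> str:
        # Business-facing checks live in the facade, not in the core services.
        if app_id not in self._registered_apps:
            raise PermissionError(f"unknown business line: {app_id}")
        if not text.strip():
            raise ValueError("message text must not be empty")
        return self._core_send(from_user, to_user, text)


def fake_core_send(from_user: str, to_user: str, text: str) -> str:
    """Stand-in for the real core service, used only for this example."""
    return f"{from_user}->{to_user}:{len(text)}"


facade = ImLogicFacade(fake_core_send)
facade.register_app("customer-service")
facade.send_text("customer-service", "agent-1", "visitor-9", "How can I help?")
```

Because per-business checks sit in the facade, a new business line can be added or restricted without touching the core send path.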
6. Comparison of optimization results
We also ran some comparative tests between the upgraded and initial versions of the IM architecture; the specific test process is not described in detail here.
Here are the test results:
7. Thoughts on dividing business responsibilities when business lines access the IM integrated messaging system
7.1 How to design a high-performance general-purpose IM integrated messaging system
I have also summarized some thoughts on how to divide responsibilities when business lines connect to the IM integrated messaging system. To make this more concrete and easier to follow, I will use a customer service system and Enterprise WeChat as examples.
If I develop a general IM integrated messaging system and many business parties need to connect to it, it becomes particularly important to divide the business domains clearly, and to find a balance between where to compromise and where not to.
The open source IM messaging platforms currently on the market share a common problem: each one either bundles in a lot of business logic, or is just a simple customer service system, or is just an IM friend-chat system. The division of business responsibilities is not clear. Of course, this has its benefit too: they can be used immediately, without a second round of business encapsulation.
So how do we design IM as a truly high-performance, general-purpose integrated messaging system?
A general integrated messaging platform only needs to have general underlying capabilities:
The following cases assume that I have designed a version of the IM integrated message center according to the above architecture.
7.2 Take the customer service system as an example
The customer service system:
The customer service system not only has to implement its own business, it also has to integrate IM's messaging capabilities (consuming IM messages) to analyze scenarios and implement logic such as conversation changes and signaling message push.
Building on IM's underlying capabilities, the customer service system must add its own business encapsulation and maintain its own pool of customer service users. How to initialize the C-side (consumer) user pool into IM's user center also needs to be considered.
7.3 Internal OA communication as an example
Internal OA communication:
The internal OA communication system for employees needs to integrate IM's friend function, and needs to build the organizational structure, user permissions and similar functions on top of IM's user center.
At the same time, the internal communication system needs to implement functions such as message read/unread status, the group chat list, and pulling the conversation list on top of IM.
8. Summary of this article
An IM integrated messaging platform is a middleware system that needs to be closely integrated with the business. It deals with the business directly, which makes it fundamentally different from ordinary middleware.
Whether an IM integrated messaging platform is easy to use depends directly on its generality, scalability and system throughput.
I hope the content shared in this article gives you some ideas when developing IM.
9. References
[1] From zero to excellence: the evolution of the technical architecture of the JD.com customer service instant messaging system
[2] From guerrilla to regular army (1): The evolution of IM system architecture of Mafengwo Travel Network
[3] Data architecture design of Guazi IM intelligent customer service system (organized from the on-site speech, with supporting PPT)
[4] Alibaba DingTalk Technology Sharing: The King of Enterprise IM——DingTalk’s Superiority in Back-end Architecture
[5] One entry for beginners is enough: developing mobile IM from scratch
[6] Introduction to zero-based IM development (1): What is an IM system?
[7] Based on practice: Summary of technical points of a small-scale IM system with millions of messages
[8] A set of IM architecture technology dry goods for hundreds of millions of users (Part 1): overall architecture, service splitting, etc.
[9] A set of IM architecture technology dry goods for hundreds of millions of users (Part II): reliability, orderliness, weak network optimization, etc.
[10] From novice to expert: how to design a distributed IM system with hundreds of millions of messages
[11] The secret of IM architecture design of enterprise WeChat: message model, 10,000 people, read receipt, message withdrawal, etc.
[12] Alibaba IM Technology Sharing (3): Architecture Evolution of Xianyu Billion-level IM Messaging System
[13] A set of high-availability, easy-to-scale, high-concurrency IM group chat, single chat architecture scheme design practice