1. Background
The core business of customer service IM is online communication: customer service agents talk to users in real time to resolve their problems as quickly as possible. In the early stage, to support business needs quickly, the system was built as secondary development on top of a third-party SDK. That choice also planted hidden problems: issues were hard to locate, and special features were costly to implement. As the company's business grew rapidly, customer service placed higher demands on the performance and experience of IM chat, and message communication through the third-party SDK gradually hit bottlenecks. To achieve stability and high scalability, building a controllable, stable, and flexible IM system in-house became unavoidable. The rest of this article focuses on the customer service client (web).
2. Thinking
Intuitively, a chat between customer service and a user is simple: the agent types some text and it is sent to the user over the network. The key design question for the SDK is how to keep the agent from perceiving any delay while a message is being sent. Avoiding stutter requires a reasonable sending strategy and avoiding the execution of large amounts of JS on the sending path. Here is an example of customer service chatting with a user:
The agent sends the text "Customer service Xiaoice serves you". The business side calls the SDK interface and passes the text in. The SDK first creates the message body, that is, it wraps the string in a custom structure model;
It then stores the message in the data pool, serializes the data object, passes it to the socket interface, and sends it to the gateway over the network channel;
When a message arrives from the gateway, the SDK deserializes it, hands it to the data pool for processing, assembles it into a model the business side can recognize, and pushes it to the business side for use.
The chat flow is as follows:
The figure above shows the full path a message takes when it is sent and received. If the SDK is poorly designed, sending and receiving will stutter, which directly affects the user experience.
3. Architecture of the self-developed framework
The overall technical transformation has two aspects:
- Abstracting the message link: mainly restructuring message data storage and message ordering.
- Abstracting the business access side: mainly decoupling business logic from the SDK source code so that the code layering is clearer.
4. Publish and subscribe on the message link
During SDK development, one question was how to decouple framework code from business code so that messages can be listened to flexibly. After a preliminary investigation we chose RxJS. Here are a few core RxJS concepts:
- Observable: represents an invokable collection of future values or events.
- Observer: listens to the values delivered by an Observable.
- Subscription: represents the execution of an Observable. Its key method, unsubscribe, takes no arguments and releases the resources held by the Subscription, mainly by cancelling the Observable's execution.
After the bottom layer of the SDK receives data, it must synchronize that data to the business side. Previously this was done with plain listeners, an approach that has no way to cancel a subscription and is relatively expensive to maintain. RxJS makes the data flow explicit and implements data communication through publish and subscribe. The publish/subscribe process with RxJS is as follows:
As the figure above shows, the message-processing flow is clear: the bottom layer of the framework receives messages, and subscribers consume them.
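The publish/subscribe idea described here can be sketched as follows. This is a tiny hand-rolled stand-in written for illustration, not RxJS and not the SDK's code: `MessageBus`, `BusSubscription`, `subscribe`, and `publish` are hypothetical names. In RxJS the same roles would be played by a `Subject` and the `Subscription` returned from its `subscribe` method.

```typescript
// Hypothetical stand-in for an RxJS Subject: the framework publishes, the business side subscribes.
type Listener<T> = (value: T) => void;

interface BusSubscription {
  unsubscribe(): void;
}

class MessageBus<T> {
  private listeners: Listener<T>[] = [];

  // Register a consumer and hand back a handle that can cancel it later.
  subscribe(listener: Listener<T>): BusSubscription {
    this.listeners.push(listener);
    return {
      unsubscribe: () => {
        this.listeners = this.listeners.filter((l) => l !== listener);
      },
    };
  }

  // Called by the bottom layer of the framework when data arrives.
  publish(value: T): void {
    // Iterate over a copy so a listener may unsubscribe during delivery.
    this.listeners.slice().forEach((listener) => listener(value));
  }
}

const bus = new MessageBus<string>();
const received: string[] = [];
const sub = bus.subscribe((text) => received.push(text));
bus.publish("hello");
sub.unsubscribe(); // the ability the plain-listener approach lacked
bus.publish("ignored after unsubscribe");
```

The point of the handle returned by `subscribe` is that the business side can release its listener without the framework knowing who registered it.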
5. Layered implementation of the message framework
The IM message communication framework has three main layers: the network layer, the data link layer, and the application layer, as follows:
1. Network layer
As the bottom layer of message sending, the network layer is responsible for the TCP connection and for sending and receiving messages. We chose TCP as the transport protocol. Why not UDP? UDP is connectionless and does not provide reliable delivery, whereas data sent over a TCP connection arrives without errors, without loss, without duplication, and in order.
The SDK communicates using WebSocket + JSON and gRPC + protobuf. The first step is to establish a WebSocket connection. At the code level, we first create an abstract Connection class that handles connection-related configuration, timeouts, and compensating reconnection, and declares the abstract methods that subclasses must implement.
The core of the Connection class is handling timeouts and reconnection. The traditional retry strategy retries at a fixed interval, so when many clients fail together, a large number of requests arrive at the same moment on every retry and keep triggering rate limiting. We use exponential backoff instead: an algorithm that uses feedback to multiplicatively decrease the rate of a process until a suitable rate is found. The retry delay is determined by the slot time and the number of retry attempts so far. The algorithm is roughly as follows:
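A minimal sketch of one common form of exponential backoff, assuming the classic "random number of slots" variant. The function name `backoffDelay` and the parameters `slotMs`, `maxMs`, and `random` are hypothetical; the article does not give the SDK's actual constants or formula.

```typescript
// Delay before retry number `attempt` (0-based): a random number of slots in
// [0, 2^attempt - 1], multiplied by the slot time and capped at maxMs.
function backoffDelay(
  attempt: number,
  slotMs: number,
  maxMs: number,
  random: () => number = Math.random, // injectable so the result can be made deterministic
): number {
  const maxSlots = Math.pow(2, attempt) - 1;
  const slots = Math.floor(random() * (maxSlots + 1));
  return Math.min(slots * slotMs, maxMs);
}

// Worst-case delays for the first five attempts with a 100 ms slot.
const worstCase = [0, 1, 2, 3, 4].map((n) => backoffDelay(n, 100, 30000, () => 0.999999));
// worstCase is [0, 100, 300, 700, 1500]
```

Because each client draws its own random slot count, clients that failed at the same moment spread their retries out instead of arriving together, which is the problem with fixed-interval retries described above.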
We implement the WebSocket connection by inheriting from the Connection class, as follows:
At this point the network layer connection is complete. It is relatively simple, essentially a wrapper around the socket APIs. The key point is using exponential backoff to reconnect when message sending fails.
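The split between an abstract connection class and a concrete transport might look like the sketch below. All names (`BaseConnection`, `open`, `onOpen`, `onGiveUp`, `FlakyConnection`) are hypothetical, and two things are simplified on purpose: `open` returns synchronously, whereas a real WebSocket reports success or failure through events, and the delay uses the deterministic upper bound `slotMs * 2^attempt` rather than a randomized one.

```typescript
type Schedule = (fn: () => void, ms: number) => void;

// Hypothetical base class: subclasses supply the transport, the base class owns the retry policy.
abstract class BaseConnection {
  private attempt = 0;
  private readonly maxAttempts: number;
  private readonly slotMs: number;
  private readonly schedule: Schedule;

  constructor(
    maxAttempts: number,
    slotMs: number,
    // Injected so the example runs without real timers; defaults to setTimeout.
    schedule: Schedule = (fn, ms) => { setTimeout(fn, ms); },
  ) {
    this.maxAttempts = maxAttempts;
    this.slotMs = slotMs;
    this.schedule = schedule;
  }

  // Transport-specific: return true when the connection was established.
  protected abstract open(): boolean;
  protected onOpen(): void {}
  protected onGiveUp(): void {}

  connect(): void {
    if (this.open()) { this.attempt = 0; this.onOpen(); return; }
    if (this.attempt >= this.maxAttempts) { this.onGiveUp(); return; }
    const delay = this.slotMs * Math.pow(2, this.attempt); // 100, 200, 400, ...
    this.attempt += 1;
    this.schedule(() => this.connect(), delay);
  }
}

// Fake transport that fails twice before succeeding, standing in for a WebSocket subclass.
class FlakyConnection extends BaseConnection {
  calls = 0;
  opened = false;
  protected open(): boolean { this.calls += 1; return this.calls > 2; }
  protected onOpen(): void { this.opened = true; }
}

const delays: number[] = [];
const conn = new FlakyConnection(5, 100, (fn, ms) => { delays.push(ms); fn(); });
conn.connect();
```

In a browser, the subclass would wrap the standard `WebSocket` object and call back into the base class from its `onopen`, `onclose`, and `onerror` handlers.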
2. Data link layer
The data link layer is the core of the SDK. It mainly covers user information, messages, the data pool, and so on. We analyze each module in turn. First, let us walk through the whole process of an agent sending and receiving messages, from logging in to a user coming online. The process has the following stages:
2.1 Protocol Type
The message protocol types are the cornerstone of message sending. Once the protocol data bodies are initialized, they can be used to send all subsequent messages and events. The main protocol types in the self-developed IM SDK are as follows:
- Hi: Send the basic information of the client, telling the server the current client version, device type, language and other information
- Login: login, token verification, get or create current user topic information
- Sub: Subscribe to topic or update topic data
- Leave: cancel the subscription and unbind the previous subscription relationship
- Pub: send data messages to subscribers of the specified topic
- Get: Get the metadata information of the topic, for example: get the subscriber list, historical data, etc.
- Set: Update the metadata information of the topic, for example: delete the message or delete the topic
- Del: used for delete operations, including deleting messages, deleting subscription relationships, deleting topics, etc.
- Note: The client sends notifications to the subscribers of the topic, such as message received, message read, currently typing, etc.
- Action: Triggered events, such as: switching customer service status, getting robot questions, etc.
- Datares: ack mechanism that tells the gateway that the message has been received
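The protocol types listed above could be modeled as a closed set of packet kinds with JSON encoding, as in this sketch. Only the type names come from the list; the `Packet` fields (`id`, `topic`, `payload`) and the helpers `encodePacket` and `decodePacket` are assumptions made for illustration, not the SDK's wire format.

```typescript
// Packet names taken from the list above; everything else here is hypothetical.
const PACKET_TYPES = [
  "hi", "login", "sub", "leave", "pub", "get", "set", "del", "note", "action", "datares",
] as const;
type PacketType = typeof PACKET_TYPES[number];

interface Packet {
  type: PacketType;
  id: string;        // hypothetical client-generated packet id
  topic?: string;    // target topic, for packet types that address one
  payload?: unknown; // type-specific body
}

function encodePacket(packet: Packet): string {
  return JSON.stringify(packet);
}

function decodePacket(raw: string): Packet {
  const parsed = JSON.parse(raw) as Packet;
  // Reject anything whose type is not one of the known protocol types.
  const known = (PACKET_TYPES as readonly string[]).indexOf(parsed.type) !== -1;
  if (!known) throw new Error("unknown packet type: " + String(parsed.type));
  return parsed;
}

const wire = encodePacket({ type: "pub", id: "1", topic: "usr42", payload: { text: "hi" } });
const decoded = decodePacket(wire);
```

Keeping the set of types closed lets the compiler flag any handler that forgets a packet kind.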
2.2 Create a connection
Instantiate the network-layer connection so that messages can be sent and received. The implementation is as follows:
2.3 Message Definition
To send a message, there must be a corresponding message structure model, so the message body has to be designed. We define a message class: each new message body is a new instance, and operating on that instance updates the message, its status, and so on, as follows:
Each individual message also needs a message state, which is updated as the chat progresses, as follows:
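A minimal sketch of such a message class and its states. The state names (`sending`, `sent`, `failed`) and the fields and methods of `ChatMessage` are assumptions chosen for illustration; the article does not list the SDK's actual states.

```typescript
// Hypothetical message states; the real SDK may define more (for example, read or recalled).
type MessageStatus = "sending" | "sent" | "failed";

class ChatMessage {
  readonly msgid: string;
  readonly topic: string;
  readonly content: string;
  status: MessageStatus = "sending";
  seq: number | null = null; // assigned by the gateway once the message is accepted

  constructor(msgid: string, topic: string, content: string) {
    this.msgid = msgid;
    this.topic = topic;
    this.content = content;
  }

  // The gateway confirmed the message and assigned its sequence number.
  markSent(seq: number): void {
    this.seq = seq;
    this.status = "sent";
  }

  // Sending failed or timed out; the UI can offer a resend.
  markFailed(): void {
    this.status = "failed";
  }
}

const draft = new ChatMessage("msg-1", "usr42", "Customer service Xiaoice serves you");
const statusBefore = draft.status;
draft.markSent(7);
```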
2.4 Data Pool
Once the message class exists, a message data pool is needed to store the messages. The message pool structure is defined as follows:
The pool also provides some basic methods for operating on the message bodies it holds, which are not covered in detail here.
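One possible shape for such a pool, keyed by topic, is sketched below. The names `MessagePool`, `PooledMessage`, `add`, `list`, and `remove` are hypothetical; the article does not show the real structure or its methods.

```typescript
interface PooledMessage {
  msgid: string;
  topic: string;
  seq: number;
  content: string;
}

// Hypothetical pool: one message list per topic (that is, per conversation).
class MessagePool {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  private byTopic: { [topic: string]: PooledMessage[] } = Object.create(null);

  add(message: PooledMessage): void {
    const existing = this.byTopic[message.topic] || [];
    this.byTopic[message.topic] = existing.concat([message]);
  }

  // Return a copy so callers cannot mutate the pool by accident.
  list(topic: string): PooledMessage[] {
    return (this.byTopic[topic] || []).slice();
  }

  // Returns true if a message was actually removed.
  remove(topic: string, msgid: string): boolean {
    const before = this.list(topic);
    const after = before.filter((m) => m.msgid !== msgid);
    this.byTopic[topic] = after;
    return after.length !== before.length;
  }
}

const pool = new MessagePool();
pool.add({ msgid: "m1", topic: "usr42", seq: 1, content: "hello" });
pool.add({ msgid: "m2", topic: "usr42", seq: 2, content: "are you there?" });
const removed = pool.remove("usr42", "m1");
```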
2.5 User dimension
Everything above concerns shared modules, but customer service agents and users are in a one-to-many relationship, so a user-dimension module is needed. Most subsequent operations on the business side happen per user, so subscription relationships, message sending, deletion, and so on must be designed around a single user. The implementation is roughly as follows:
2.5.1 Send message link analysis
For messages sent by customer service, the first question from the agent's point of view is whether the message has gone out. The message should appear on the chat page first, rather than after the gateway replies. Experience shows that a message must be displayed on the chat page as soon as it is entered; otherwise the agent perceives lag and the experience suffers. The message link must take this into account, so the design process is as follows:
As the figure above shows, the message is first processed inside the SDK, then returned to the business side for rendering, and only then sent to the gateway. Normally this all happens within one frame, so the agent cannot perceive any delay. Pay attention to when the message body is serialized and deserialized, to avoid wasted work. The virtual seq in the figure above is used for ordering before a response arrives from the IM gateway. Pictures, videos, messages sent while offline, messages that failed to send, and messages whose response from the IM gateway carries no seq (for example, when sensitive words are detected) all need to be ordered accurately by virtual seq.
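The "render first, send second" order and the role of a virtual seq can be sketched as below. How the real SDK derives its virtual seq is not described in the article; here it is assumed to be "one more than the largest ordering key currently known", and `OptimisticSender`, `render`, `transmit`, and `confirm` are hypothetical names.

```typescript
interface OutgoingMessage {
  msgid: string;
  content: string;
  seq: number | null; // real seq, set once the gateway confirms
  virtualSeq: number; // local ordering key used until then
}

class OptimisticSender {
  private messages: OutgoingMessage[] = [];
  private render: (list: OutgoingMessage[]) => void;
  private transmit: (message: OutgoingMessage) => void;

  constructor(
    render: (list: OutgoingMessage[]) => void,
    transmit: (message: OutgoingMessage) => void,
  ) {
    this.render = render;
    this.transmit = transmit;
  }

  // Confirmed messages sort by real seq, unconfirmed ones by virtual seq.
  private sortKey(message: OutgoingMessage): number {
    return message.seq !== null ? message.seq : message.virtualSeq;
  }

  private sorted(): OutgoingMessage[] {
    return this.messages.slice().sort((a, b) => this.sortKey(a) - this.sortKey(b));
  }

  send(msgid: string, content: string): void {
    const keys = this.messages.map((m) => this.sortKey(m));
    const virtualSeq = Math.max(0, ...keys) + 1; // place it after everything known so far
    const message: OutgoingMessage = { msgid, content, seq: null, virtualSeq };
    this.messages.push(message);
    this.render(this.sorted()); // show it on the chat page immediately
    this.transmit(message);     // only then hand it to the network
  }

  confirm(msgid: string, seq: number): void {
    for (const message of this.messages) {
      if (message.msgid === msgid) message.seq = seq;
    }
    this.render(this.sorted());
  }
}

const steps: string[] = [];
const sender = new OptimisticSender(
  (list) => steps.push("render:" + list.map((m) => m.msgid).join(",")),
  (message) => steps.push("transmit:" + message.msgid),
);
sender.send("a", "hello");
sender.send("b", "are you there?");
sender.confirm("a", 1);
```

A message that never receives a real seq, such as one that failed to send, simply keeps its virtual seq and therefore its position in the list.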
2.5.2 Receive message link analysis
Receiving a message is relatively simple. After a message arrives, the SDK deserializes it and updates the related data, then deduplicates it in the data pool (duplicates arise from the retry mechanism), sorts it, and delivers the update to the business side for rendering.
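The receive path (deserialize, deduplicate, sort) can be sketched as follows, assuming messages arrive as JSON strings carrying `msgid`, `seq`, and `content`. `ReceiveBuffer`, `receive`, and `snapshot` are hypothetical names, and the wire shape is an assumption.

```typescript
interface IncomingMessage {
  msgid: string;
  seq: number;
  content: string;
}

class ReceiveBuffer {
  private messages: IncomingMessage[] = [];

  // Returns true if the message was new, false if it was a duplicate re-push.
  receive(raw: string): boolean {
    const message = JSON.parse(raw) as IncomingMessage; // deserialize
    const duplicate = this.messages.some((m) => m.msgid === message.msgid);
    if (duplicate) return false;                         // deduplicate by msgid
    this.messages.push(message);
    this.messages.sort((a, b) => a.seq - b.seq);         // order by gateway seq
    return true;
  }

  // Copy handed to the business side for rendering.
  snapshot(): IncomingMessage[] {
    return this.messages.slice();
  }
}

const inbox = new ReceiveBuffer();
inbox.receive('{"msgid":"b","seq":2,"content":"second"}');
inbox.receive('{"msgid":"a","seq":1,"content":"first"}');
const wasNew = inbox.receive('{"msgid":"b","seq":2,"content":"second"}'); // retried push
```

The linear scan keeps the sketch short; a production pool would more likely keep an index of seen msgids.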
2.5.3 Reliable delivery of messages
Reliable delivery of IM messages means that, while messages are sent and received, none are lost, none are duplicated, and their order is preserved. Consider the following two situations:
The first: customer service A fails to send the message to the IM gateway because of a network failure or similar; or the IM gateway fails while storing the message it received; or the IM gateway does not return a result in time and the request times out. In all these cases, customer service A is told that the message failed to send.
The second: after the message is stored by the IM gateway, customer service A is told that it was sent successfully, and the IM gateway then pushes it to user A's online device. If the server loses power while preparing the push, or after the message has been written to memory, the message cannot be pushed to user A. Even if user A's device receives the message, a failure in later processing also loses it. For example, if an exception occurs while user A's device writes the message to its local DB and the write fails, user A cannot see the message even though transmission at the network layer succeeded. Our customer service IM addresses message loss as follows: following the ACK mechanism of TCP, we implement a set of ACK protocols at the business layer.
The sequence diagram of message sending before adding ACK is as follows:
- ACK mechanism -
TCP provides an ACK mechanism by default: a standard ACK packet built into the protocol confirms the data received and tells the sender that it arrived successfully. Our business-layer ACK mechanism is similar. The problem it solves is how to confirm that, after the IM gateway pushes a message, the message was actually delivered to and received by the receiver. The sequence diagram is as follows:
When sending a message, the customer service agent or user attaches a msgid (a 32-character UUID, playing a role similar to the sequence number in TCP). After receiving the message, the IM gateway queries the database by msgid to check whether the message already exists. If it does not, the gateway persists it and then pushes it to the receiver. The receiver replies with an ACK after receiving the message, and the ACK packet carries the receiver's latest seqid. When the IM gateway receives the ACK, it updates the maximum seqid recorded for that receiver. Why record the maximum seqid? Because after receiving a message from the sender, the IM gateway not only checks whether the message exists in the database, it also compares the seq of the current message with the maximum seqid and pushes every message in between to the receiver. Normally the two differ by 1, so exactly one message is pushed. If the gateway has not received an ACK from the receiver, the maximum seqid is not updated, so more than one message is pushed. If seq and seqid are equal, the sender has pushed the same message again, and nothing is pushed to the receiver. This involves message retry, which is analyzed next.
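The gateway-side bookkeeping can be sketched as below. This is one reading of the description, not the gateway's actual code: `GatewayTopic`, `onSend`, and `onAck` are hypothetical names, the "database" is an in-memory array, and seq values are assumed to auto-increment from 1.

```typescript
interface StoredMessage {
  msgid: string;
  seq: number;
  content: string;
}

class GatewayTopic {
  private stored: StoredMessage[] = []; // stands in for the database
  private maxAckedSeq = 0;              // highest seqid the receiver has acknowledged

  // Handle a message from the sender; returns the messages to push to the receiver.
  onSend(msgid: string, content: string): StoredMessage[] {
    // A msgid that is already stored means the sender retried: push nothing.
    if (this.stored.some((m) => m.msgid === msgid)) return [];
    const message: StoredMessage = { msgid, content, seq: this.stored.length + 1 };
    this.stored.push(message); // persist first, then push
    // Push everything the receiver has not acknowledged yet, up to this message.
    return this.stored.filter((m) => m.seq > this.maxAckedSeq && m.seq <= message.seq);
  }

  // Handle the receiver's ACK, which carries its latest seqid.
  onAck(seqid: number): void {
    this.maxAckedSeq = Math.max(this.maxAckedSeq, seqid);
  }
}

const gateway = new GatewayTopic();
const firstPush = gateway.onSend("m1", "hi").map((m) => m.msgid);        // ["m1"]
const secondPush = gateway.onSend("m2", "hello?").map((m) => m.msgid);   // no ACK yet: ["m1", "m2"]
gateway.onAck(2);
const thirdPush = gateway.onSend("m3", "still there?").map((m) => m.msgid); // ["m3"]
const retryPush = gateway.onSend("m3", "still there?").map((m) => m.msgid); // duplicate: []
```

The second push shows why the maximum acknowledged seqid matters: an unacknowledged message rides along with the next one instead of being lost.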
- Message retry in ACK mechanism -
What happens if a message is lost while being pushed to A? For example:
- A's network is actually unreachable, but the IM gateway has not detected it (a heartbeat problem);
- The message was dropped by an intermediate device somewhere along the network path.
The solution again follows the retransmission mechanism of TCP. We maintain a timeout timer on the customer service side, on the IM gateway, and on the user side. If the ACK packet from the other party is not received within a certain period, the message is fetched and pushed again. If the ACK still has not arrived after a certain number of retries, delivery is abandoned. The front-end code structure and effects are as follows:
The data in the figure above only simulates message retry; in a real scenario the retry interval would be longer than shown.
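A minimal sketch of a timeout-driven retry loop with a retry limit. `AckRetrier`, `Scheduler`, `push`, and `onGiveUp` are hypothetical names, and the scheduler is injectable so the example can run without waiting on real timers; the SDK's actual timeout and retry count are not given in the article.

```typescript
type Cancel = () => void;
type Scheduler = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by real timers.
const timerScheduler: Scheduler = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

class AckRetrier {
  private cancels: { [msgid: string]: Cancel } = Object.create(null);
  private push: (msgid: string) => void;
  private onGiveUp: (msgid: string) => void;
  private timeoutMs: number;
  private maxRetries: number;
  private schedule: Scheduler;

  constructor(push: (msgid: string) => void, onGiveUp: (msgid: string) => void,
              timeoutMs: number, maxRetries: number, schedule: Scheduler = timerScheduler) {
    this.push = push; this.onGiveUp = onGiveUp;
    this.timeoutMs = timeoutMs; this.maxRetries = maxRetries; this.schedule = schedule;
  }

  send(msgid: string): void { this.attempt(msgid, 0); }

  // The ACK arrived: stop the pending timer for this message.
  ack(msgid: string): void {
    const cancel = this.cancels[msgid];
    if (cancel) { cancel(); delete this.cancels[msgid]; }
  }

  private attempt(msgid: string, retries: number): void {
    this.push(msgid);
    this.cancels[msgid] = this.schedule(() => {
      delete this.cancels[msgid];
      if (retries >= this.maxRetries) { this.onGiveUp(msgid); return; }
      this.attempt(msgid, retries + 1); // no ACK in time: push the same msgid again
    }, this.timeoutMs);
  }
}

// Manual scheduler for the demo: timers fire only when the queue is drained.
const queue: Array<() => void> = [];
const manual: Scheduler = (fn) => {
  let cancelled = false;
  queue.push(() => { if (!cancelled) fn(); });
  return () => { cancelled = true; };
};
const pushes: string[] = [];
const gaveUp: string[] = [];
const retrier = new AckRetrier((id) => pushes.push(id), (id) => gaveUp.push(id), 1000, 2, manual);
retrier.send("m1");                                 // never acknowledged
while (queue.length > 0) queue.shift()!();
retrier.send("m2");
retrier.ack("m2");                                  // acknowledged before the timeout
while (queue.length > 0) queue.shift()!();
```

Note that every retry reuses the same msgid, which is what allows the receiving side to discard duplicates.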
- The problem of repeated push messages -
If no ACK packet arrives within a certain period, the retry mechanism is triggered. An ACK can be missing for two reasons: the pushed message really was lost, so A never returned an ACK, or the ACK packet that A returned was lost. In the second case A receives the same message again.
The solution: the sender attaches a msgid to every message. The msgid is globally unique, and a re-pushed message keeps the same msgid, so the receiver can deduplicate by msgid. As a result, no repeated messages appear in the chat interface and the user experience is unaffected.
- Guarantee that messages will not be out of order -
Message consistency is very important: message order must not be scrambled during a chat.
There are two candidate approaches. The first uses the sender's local timestamp as the sequence number, but this has a major flaw: the sender's clock can be changed, so it is not advisable. In the second, the IM gateway service, which is deployed as a cluster, uses topic plus seqid as a unique index; it generates the seqid when a message arrives, before the message is stored in the database. When the customer service side and the user side receive the send receipt, they sort messages by the returned seqid (auto-incremented by the IM gateway). This approach is preferable.
To summarize the analysis above, the reliability of customer service IM messages rests on the ACK, retry, deduplication, and ordering mechanisms, which together ensure that every message is delivered and correctly ordered.
3. Application layer
On the business side, the SDK can be instantiated directly. RxJS was introduced in the publish/subscribe section above; the business side subscribes to the SDK's messages through it. Note that a filterMsgItem method can be passed when instantiating the SDK. It is intended for special business scenarios. In our customer service business, for example, certain messages should not be displayed on the chat page, such as messages sent by users that have been tampered with. The business side could also filter the data itself when processing or rendering it. That works, but it is unnecessary and adds avoidable overhead. The parameter is optional; the SDK works without it.
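How an optional filterMsgItem hook might be wired in is sketched below. Only the name `filterMsgItem` comes from the article; its signature (return `true` to keep a message), the `ChatSdk` class, the `tampered` flag, and the `onMessage` and `deliver` methods are assumptions made for illustration.

```typescript
interface DisplayMessage {
  msgid: string;
  content: string;
  tampered?: boolean; // hypothetical flag for messages that should not be shown
}

interface SdkOptions {
  // Assumed contract: return true to deliver the message to the business side.
  filterMsgItem?: (message: DisplayMessage) => boolean;
}

class ChatSdk {
  private options: SdkOptions;
  private listeners: Array<(message: DisplayMessage) => void> = [];

  constructor(options: SdkOptions = {}) {
    this.options = options;
  }

  // Business side subscribes to messages it should render.
  onMessage(listener: (message: DisplayMessage) => void): void {
    this.listeners.push(listener);
  }

  // Called by the lower layers once a message is ready for the business side.
  deliver(message: DisplayMessage): void {
    const keep = this.options.filterMsgItem;
    if (keep && !keep(message)) return; // filtered out before it reaches any listener
    this.listeners.forEach((listener) => listener(message));
  }
}

const sdk = new ChatSdk({ filterMsgItem: (message) => !message.tampered });
const shown: string[] = [];
sdk.onMessage((message) => shown.push(message.content));
sdk.deliver({ msgid: "1", content: "hello" });
sdk.deliver({ msgid: "2", content: "forged", tampered: true });
```

Filtering inside the SDK means rejected messages never reach subscribers, so the business side does not have to repeat the check in every rendering path.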
So far, we have completed the implementation of the entire SDK and its use on the business side, and the message sending and receiving are also normal. The effect is as follows:
6. Summary
Developing our own SDK was quite a challenge: we went from simple secondary development on a third-party SDK to a self-developed SDK that fits our actual business scenarios well. Neither the overall design of the SDK nor its integration with the business side was achieved overnight; both came from accumulating experience in real business scenarios and looking for the best available solution. Take message sending as a simple case. How should messages be displayed, ordered, and resent when the network is disconnected? How should they be displayed and ordered when a resend fails again after a send failure? When sending on a weak network triggers the retry mechanism, how can deduplication and ordering be done optimally? How should sensitive words triggered by a message be handled? What about messages that failed to send and triggered sensitive words, once the network reconnects? What if files are involved? ... During development, besides focusing on business scenarios, we also studied how some of the best web applications in the industry handle certain special scenarios. From many excellent solutions one can only borrow the core ideas; the focus has to stay on the business, and what matters most is solving real business pain points through technical means.
Developing the SDK ourselves has been very rewarding, and we have accumulated a lot of IM experience. Completing the SDK is only the beginning. Going forward, we will continue to work on time-consuming tasks and data security.
Reference documentation:
- RxJS
- TCP/UDP protocol
- Application of Exponential backoff in network requests
*Text/Wang Weiqiang
@德物科技public account