Follow the source code to learn IM (ten): Based on Netty, build a high-performance IM cluster (including technical ideas + source code)

The original title of this article is "Building a High-Performance IM System" by Liu Li; the content has been revised. Out of respect for the original work, please contact the author for authorization before reprinting.

1. Introduction

I believe many readers are curious about how chat software such as WeChat and QQ is implemented, and so is the author. The author also works on IM at his company, whose IM system carries hundreds of millions of messages every day.

With these technical resources at hand, I spent some spare time a while ago developing a Netty-based IM system with reasonably complete basic functions. The system supports private chat, group chat, session management, heartbeat detection, service registration, load balancing, and horizontal scaling of any node.

Some readers have also asked the author to share knowledge about Netty or IM, so today I am sharing the IM system I developed.

Based on this spare-time practice, this article explains how to build a high-performance IM cluster with Netty + ZK + Redis, covering the technical principles and example code of the cluster, in the hope that it inspires you.

Learning and exchange:

Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article has been simultaneously published on: http://www.52im.net/thread-3816-1-1.html )

2. The source code of this article

Main address: https://github.com/nicoliuli/chat
Backup address: https://github.com/52im/chat

The directory structure of the source code is shown in the following figure:

3. Knowledge preparation

Important note: this article is not a theoretical piece on instant messaging; its content comes from hands-on coding. If you know little about instant messaging (IM) theory, it is recommended that you first read "One entry is enough for beginners: developing mobile IM from scratch".

Some people may not know what Netty is. Here is a brief introduction:

Netty is an open source framework for Java. Netty provides an asynchronous, event-driven network application framework and tools for rapid development of high-performance, high-reliability network server and client programs.
In other words, Netty is a NIO-based client and server-side programming framework. Using Netty can ensure that you can quickly and easily develop a network application, such as a client and server-side application that implements a certain protocol.
Netty considerably simplifies and streamlines the programming and development process of network applications, such as TCP and UDP socket service development.

Here are some introductory articles on Netty:

1) Getting started: The most thorough analysis of Netty's high-performance principles and framework architecture so far
2) For beginners: learning methods and advanced strategies of Java high-performance NIO framework Netty
3) The most popular Netty framework entry long article in history: basic introduction, environment construction, hands-on combat

If you are not yet familiar with NIO in Java, it is recommended that you read the following articles first:

1) Less verbose! Take you to understand the difference between Java's NIO and classic IO in one minute
2) Introduction to the strongest Java NIO in history: If you are worried about getting started and giving up, please read this!
3) Java's BIO and NIO are difficult to understand? Use code practice to show you, if you don't understand me again, I will change careers!

Online access to Netty source code and API:

1) Netty-4.1.x complete source code (online reading version)
2) Netty-4.1.x API documentation (online version)

4. System Architecture

The system architecture is shown in the figure above. The whole system follows a client/server (C/S) model: the client has no graphical interface and runs as a Java console program, and each server-side IM instance is a socket service written with Netty.

ZK (ZooKeeper) serves as the service registry, and Redis is used to cache distributed sessions, store user information, and act as a lightweight message queue.

The following sections explain how each part of the architecture works.

5. How the server works

In the architecture above, each time a NettyServer node starts, it registers its own node information (such as IP and port) in ZK as an ephemeral (temporary) node.

As shown in the architecture diagram in the previous section, two NettyServers are started, so ZK holds the information of both servers.

At the same time, ZK monitors each server node. If a server goes down, ZK deletes the information that server registered (that is, its ephemeral node), which gives us a simple service registration mechanism.

6. How the client works

When a Client starts, it randomly selects an available NettyServer from ZK (random selection provides simple load balancing), obtains that NettyServer's IP and port, and establishes a connection with it.
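The registration and random selection described above can be sketched in plain Java. This is only a minimal sketch: it replaces ZooKeeper with an in-memory map so that it runs without dependencies, and the names `ServerNode`, `InMemoryRegistry`, `unregister`, and `pickRandom` are hypothetical, not taken from the author's repository. In a real deployment, `register` would create an ephemeral node and ZK itself would remove it when the server goes down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

/** Address of one NettyServer node (hypothetical type). */
final class ServerNode {
    final String ip;
    final int port;

    ServerNode(String ip, int port) {
        this.ip = ip;
        this.port = port;
    }

    @Override
    public String toString() {
        return ip + ":" + port;
    }
}

/** In-memory stand-in for the ZK registry; real code would use ephemeral nodes. */
final class InMemoryRegistry {
    private final Map<String, ServerNode> nodes = new ConcurrentHashMap<>();

    /** Called by a server when it starts. */
    void register(ServerNode node) {
        nodes.put(node.toString(), node);
    }

    /** Simulates ZK deleting the ephemeral node after the server goes down. */
    void unregister(ServerNode node) {
        nodes.remove(node.toString());
    }

    /** Snapshot of the servers that are currently registered. */
    List<ServerNode> available() {
        return new ArrayList<>(nodes.values());
    }

    /** Client side: pick one registered server uniformly at random. */
    ServerNode pickRandom(Random random) {
        List<ServerNode> candidates = available();
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no NettyServer is registered");
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```

Random selection spreads clients evenly only on average; it does not take the current load of each node into account.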

After the connection is established, the NettyServer creates a Session object that bundles the current client's Channel and related information. The Session is stored in a SessionMap, and session information is also stored in Redis.

The Session is particularly important: through it, we can obtain information such as the Channel between the current Client and the NettyServer.
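To make the relationship between Session, Channel, and SessionMap concrete, here is a minimal sketch in plain Java. The `ClientChannel` interface is a stand-in for Netty's `Channel` so that the example needs no dependency, and the class shapes are assumptions for illustration rather than the classes in the author's repository.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Stand-in for a Netty Channel (assumption, keeps the sketch dependency-free). */
interface ClientChannel {
    void write(String payload);
}

/** One connected client: its uid plus the channel used to reach it. */
final class Session {
    final long uid;
    final ClientChannel channel;

    Session(long uid, ClientChannel channel) {
        this.uid = uid;
        this.channel = channel;
    }
}

/** Per-node, in-memory index from uid to Session. */
final class SessionMap {
    private final ConcurrentMap<Long, Session> sessions = new ConcurrentHashMap<>();

    /** Stores the session created after a client connects; returns the session it replaced, if any. */
    Session put(Session session) {
        return sessions.put(session.uid, session);
    }

    /** Returns the session for uid, or null if that user is not connected to this node. */
    Session get(long uid) {
        return sessions.get(uid);
    }

    int size() {
        return sessions.size();
    }
}
```

A concurrent map is used because, in a Netty server, connections are typically handled by several event loop threads at once.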

7. The role of Session

Suppose we start multiple Clients. Because each Client randomly obtains NettyServer information from ZK at startup, multiple Clients may end up connected to different NettyServers.

Readers familiar with Netty know that a Channel is created once the Client connects to the Server, and the Client and the Server use this Channel for network data transmission.

If Client1 and Client2 are connected to the same Server, the Server obtains the Sessions of Client1 and Client2 from the SessionMap. Each Session contains the Channel information, and with both Channels the Server can relay messages between Client1 and Client2.

What if Client1 and Client2 are connected to different NettyServers and want to communicate? This question is answered later.

8. Efficient data transmission

Whether it is an IM system or a distributed RPC framework, efficient network data transmission will undoubtedly greatly improve the performance of the system.

When data is transmitted over the network, an object is generally serialized into a binary byte array, which is then sent to the peer through the socket.

In the Java world, the built-in object serialization has serious performance problems, so the industry often uses Google's protobuf for serialization and deserialization (see "Protobuf Communication Protocol Details: Code Demonstration, Detailed Principles, etc.").

Protobuf supports different programming languages, can implement cross-language system calls, and has extremely high serialization and deserialization performance. This system also uses protobuf for data serialization.
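To get a feel for why the encoding matters, the following sketch compares the size of Java's built-in serialization with a hand-written compact binary layout for the same small message. Note that the compact encoder here is a simplified stand-in written for this comparison, not protobuf itself, and `SimpleMsg` and `EncodingSizes` are hypothetical names. Protobuf goes further than this layout, for example by encoding integers with a variable number of bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Minimal message used only for the size comparison (hypothetical type). */
final class SimpleMsg implements Serializable {
    private static final long serialVersionUID = 1L;
    final long fromUid;
    final long toUid;
    final String body;

    SimpleMsg(long fromUid, long toUid, String body) {
        this.fromUid = fromUid;
        this.toUid = toUid;
        this.body = body;
    }
}

final class EncodingSizes {
    /** Java built-in serialization: writes class metadata along with the field values. */
    static byte[] javaSerialize(SimpleMsg msg) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(msg);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Hand-written layout: two 8-byte longs, a 4-byte length, then the UTF-8 body. */
    static byte[] compactEncode(SimpleMsg msg) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            byte[] body = msg.body.getBytes(StandardCharsets.UTF_8);
            out.writeLong(msg.fromUid);
            out.writeLong(msg.toUid);
            out.writeInt(body.length);
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a message whose body is "hello", `compactEncode` produces 8 + 8 + 4 + 5 = 25 bytes, while the built-in serialization is noticeably larger because it also writes the class name and field descriptors.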

For a basic understanding of Protobuf, the following articles are worth reading:

"Strongly recommend Protobuf as your instant messaging application data transfer format"
"All-round evaluation: Is Protobuf performance 5 times faster than JSON?"
"Kingdee Sui Notes Team Sharing: Still Using JSON? Protobuf Makes Data Transmission Cheaper and Faster (Principle)"

In addition, the section "3. Protocol design" of the article "A set of mobile terminal IM architecture design practice sharing for massive online users (including detailed graphics)" covers the actual design and use of protobuf in IM and can be studied alongside this article.

9. Definition of chat protocol

When we use chat apps, we send many kinds of messages, and each kind corresponds to a message format (i.e., the "chat protocol").

The chat protocol mainly contains several important pieces of information:

1) Message type;
2) Sending time;
3) The sender and receiver of the message;
4) Chat type (group chat or private chat).

In my IM system, the chat protocol is defined as follows:

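syntax = "proto3";
option java_package = "model.chat";
option java_outer_classname = "RpcMsg";

// One chat protocol message exchanged between Client and NettyServer.
message Msg {
    string msg_id = 1;               // unique message id, e.g. a UUID
    int64 from_uid = 2;              // uid of the sender
    int64 to_uid = 3;                // uid of the receiver
    int32 format = 4;                // content format (text, voice, picture, ...)
    int32 msg_type = 5;              // login, chat, ack, ping, pong, ...
    int32 chat_type = 6;             // group chat or private chat
    int64 timestamp = 7;             // time the message was sent
    string body = 8;                 // message content
    repeated int64 to_uid_list = 9;  // receivers of a group chat message
}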

The fields in the protobuf definition above have the following meanings:

1) msg_id: the unique id of the message, which can be a UUID;
2) from_uid: the uid of the message sender;
3) to_uid: the uid of the message receiver;
4) format: the message format. Chat software sends text, voice, picture, and other kinds of messages, each with its own format, which this field represents (because this system's client is a Java console program, the format field has little practical meaning here and is optional);
5) msg_type: the message type, such as login, chat, ack, ping, or pong;
6) chat_type: the chat type, such as group chat or private chat;
7) timestamp: the time at which the message was sent;
8) body: the actual content (payload) of the message;
9) to_uid_list: used by group chat messages to improve group chat performance; its role is explained in detail in the group chat section.

10. Principle of sending private chat messages

When Client1 sends a message to Client2, we need to construct a message body as defined in the previous section.

Specifically: from_uid is the uid of Client1, and to_uid is the uid of Client2.

After NettyServer receives the message, it processes it as follows:

1) Parse the to_uid field;
2) Use to_uid to obtain Client2's Session from the SessionMap or from Redis;
3) Take Client2's Channel out of the Session;
4) Send the message to Client2 through that Channel.
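The four steps above can be sketched as a small router. This is a simplified sketch under stated assumptions: the Session is collapsed into a direct uid-to-channel map, `Outbound` stands in for a Netty `Channel`, and the names `PrivateChatRouter`, `bind`, and `route` are hypothetical. The Redis lookup for users on other nodes is left out here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Stand-in for the Channel stored inside a Session (assumption). */
interface Outbound {
    void send(String payload);
}

final class PrivateChatRouter {
    /** uid -> channel of the client connected to this node. */
    private final Map<Long, Outbound> channelsByUid = new ConcurrentHashMap<>();

    /** Records the channel of a client after it connects. */
    void bind(long uid, Outbound channel) {
        channelsByUid.put(uid, channel);
    }

    /**
     * Delivers a private chat message to the receiver if it is connected to this node.
     * Returns false when no local session exists for toUid.
     */
    boolean route(long fromUid, long toUid, String body) {
        // Steps 1 to 3: use to_uid to find the receiver's channel.
        Outbound target = channelsByUid.get(toUid);
        if (target == null) {
            return false;
        }
        // Step 4: write the message to the receiver's channel.
        target.send(fromUid + ": " + body);
        return true;
    }
}
```

Returning a boolean lets the caller decide what to do when the receiver is not on this node, for example forwarding to another node or storing the message offline.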

11. Principle of group chat message sending

There are usually two ways to implement the distribution of group chat messages. Let's look at them one by one.

Method 1: suppose a group has 100 members. When Client1 sends a message to the group, it is in effect sending a message to each of the other 99 members. The client could simply loop over those 99 members and send the same message to NettyServer 99 times (only to_uid differs).

This approach has a serious performance problem: Client1 sends 99 messages to NettyServer, and NettyServer forwards each of them to the corresponding group member. Even setting aside the particularities of mobile devices (for example, the app may be moved to the background and suspended by the system before the loop finishes), sending 99 messages from Client1 to NettyServer is clearly unreasonable.

Method 2: the to_uid_list field in the message body from the previous section exists to solve this performance problem. Client1 puts the uids of the other 99 group members into to_uid_list and sends only one message to NettyServer. On receiving it, NettyServer reads the 99 uids from to_uid_list and sends the message to each of those Clients in a loop.

As a result, in Method 2 Client1 transmits only one message to NettyServer instead of 99, so the client-to-server traffic is far lower than in Method 1.
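Method 2 can be sketched as a server-side fan-out loop. The sketch below is only illustrative: `GroupFanOut` and `fanOut` are hypothetical names, the `deliver` callback stands in for looking up each receiver's Session and writing to its Channel, and skipping the sender is a defensive assumption in case the sender's own uid appears in the list.

```java
import java.util.List;
import java.util.function.BiConsumer;

final class GroupFanOut {
    /**
     * Expands one group message into one delivery per receiver in toUidList.
     * deliver is called with (toUid, body); the return value is the number of deliveries.
     */
    static int fanOut(long fromUid, List<Long> toUidList, String body,
                      BiConsumer<Long, String> deliver) {
        int delivered = 0;
        for (Long toUid : toUidList) {
            if (toUid.longValue() == fromUid) {
                continue; // do not echo the message back to the sender
            }
            deliver.accept(toUid, body);
            delivered++;
        }
        return delivered;
    }
}
```

With a group of 100 members, the client uploads a single message and the loop on the server performs the 99 deliveries.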

11. Technical key point 1: How do clients communicate when they are connected to different IM instances?

In the architecture described in this article, how do clients communicate when they are connected to different servers?

To answer this question, we must first understand the role of the Session.

Readers who have done Java web development know that a Session is used to store a user's login information.

The same idea applies in an IM system: the Session stores the user's Channel information. When a client connects to the server successfully, a Channel is created, and the two sides use it to transmit data. At that point the Server constructs a Session object holding the uid, the Channel, and related information, and stores it in a SessionMap (in NettyServer's memory) keyed by uid, so that the Session can be looked up by uid.

However, the SessionMap alone is not enough. We also need Redis, whose role is to record every user connected anywhere in the NettyServer cluster. This record is also a kind of session, but instead of the mapping between uid and Channel it stores which NettyServer the Client is connected to, such as that NettyServer's IP and port. Given a uid, we can therefore look up in Redis the NettyServer that the Client is currently connected to. It is this information that allows the NettyServer cluster to scale horizontally by adding nodes.

When the number of users is small, a single NettyServer node can carry the traffic. All Clients connect to the same NettyServer, and each Client's Session is stored in that server's SessionMap. When Client1 sends a message to Client2, Client1 sends it to NettyServer, which retrieves Client2's Session and Channel from the SessionMap and delivers the message.

As the number of users grows, one NettyServer is no longer enough, so we add more. Now Client1 is connected to NettyServer1, with its Session in NettyServer1's SessionMap and its connection information in Redis, while Client2 is connected to NettyServer2, with its Session in NettyServer2's SessionMap and its connection information in Redis. When Client1 sends a message to Client2, NettyServer1 cannot find Client2's Session in its own SessionMap and so cannot deliver the message directly. Instead, it looks up in Redis which NettyServer Client2 is connected to and forwards the message to NettyServer2. NettyServer2 then retrieves Client2's Session and Channel from its own SessionMap and delivers the message.

So how is a message forwarded from NettyServer1 to NettyServer2? The answer is a message queue, for example the list data structure in Redis. After startup, each NettyServer listens on its own queue in Redis, which is used to receive the messages that other NettyServers forward to it.
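The forwarding flow can be sketched as follows. To keep the example runnable without a Redis server, `SharedStore` is an in-memory stand-in for the two roles Redis plays here (the uid-to-node lookup and one list-based queue per node). All class and method names are hypothetical, and the "uid|body" entry format is an assumption made for this sketch. With a real Redis list, a sender might push with a command such as LPUSH and the owning node might block on BRPOP, but the source does not say which commands the author uses.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

/** In-memory stand-in for Redis: uid -> node lookup plus one queue per node. */
final class SharedStore {
    final Map<Long, String> nodeByUid = new ConcurrentHashMap<>();
    private final Map<String, Queue<String>> queueByNode = new ConcurrentHashMap<>();

    Queue<String> queueOf(String nodeId) {
        return queueByNode.computeIfAbsent(nodeId, id -> new ConcurrentLinkedQueue<>());
    }
}

final class ImNode {
    private final String nodeId;
    private final SharedStore store;
    /** uid -> messages written to that client (stands in for SessionMap plus Channel). */
    private final Map<Long, List<String>> localClients = new ConcurrentHashMap<>();

    ImNode(String nodeId, SharedStore store) {
        this.nodeId = nodeId;
        this.store = store;
    }

    /** Registers a client on this node, both locally and in the shared store. */
    void connect(long uid) {
        localClients.put(uid, new CopyOnWriteArrayList<>());
        store.nodeByUid.put(uid, nodeId);
    }

    /** Messages delivered so far to a locally connected client, or null if unknown. */
    List<String> received(long uid) {
        return localClients.get(uid);
    }

    /** Delivers locally if possible, otherwise forwards through the target node's queue. */
    String send(long toUid, String body) {
        List<String> local = localClients.get(toUid);
        if (local != null) {
            local.add(body);
            return "local";
        }
        String targetNode = store.nodeByUid.get(toUid);
        if (targetNode == null) {
            return "offline";
        }
        store.queueOf(targetNode).add(toUid + "|" + body);
        return "forwarded";
    }

    /** Drains this node's own queue; a real node would listen on its queue continuously. */
    int drainQueue() {
        Queue<String> queue = store.queueOf(nodeId);
        int delivered = 0;
        String entry;
        while ((entry = queue.poll()) != null) {
            int sep = entry.indexOf('|');
            long toUid = Long.parseLong(entry.substring(0, sep));
            List<String> local = localClients.get(toUid);
            if (local != null) {
                local.add(entry.substring(sep + 1));
                delivered++;
            }
        }
        return delivered;
    }
}
```

For example, with Client1 on node "n1" and Client2 on node "n2", calling `send(2, "hi")` on n1 returns "forwarded", and the message reaches Client2 only after n2 drains its queue.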

Jack Jiang commented: in the cluster solution above, Redis serves not only as the storage center for the online user list but also as the message relay between the different IM long-connection instances in the cluster (in this role Redis is equivalent to an MQ). Doesn't Redis then become a single-point bottleneck for the entire distributed cluster?

12. Technical key point 2: The link is broken, how to deal with it?

If the Client and NettyServer are disconnected for some reason (client exit, server restart, network problems, etc.), we must delete the Session from the SessionMap and delete the corresponding data kept in Redis.

If these two kinds of data are not cleared, a message that Client1 sends to Client2 may be delivered to another user, or Client2 may fail to receive messages even though it is logged in.

We can perform this session cleanup in Netty's channelInactive method, which is called after the connection is closed.
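The cleanup itself can be sketched without Netty. In the sketch below, `sessionMap` and `redisRoutes` are in-memory stand-ins, the names are hypothetical, and `onDisconnect` is the method that a handler's channelInactive callback would invoke. Removing the entry only when the connection id still matches is my own assumption, added so that a late disconnect event from an old connection does not wipe the session of a client that has already reconnected.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class DisconnectCleaner {
    /** uid -> id of the current connection (stands in for SessionMap). */
    final ConcurrentMap<Long, String> sessionMap = new ConcurrentHashMap<>();
    /** uid -> node address (stands in for the data kept in Redis). */
    final Map<Long, String> redisRoutes = new ConcurrentHashMap<>();

    /**
     * Removes both records for uid, but only if connectionId is still the
     * connection stored for that uid. Returns true when cleanup happened.
     */
    boolean onDisconnect(long uid, String connectionId) {
        boolean removed = sessionMap.remove(uid, connectionId);
        if (removed) {
            redisRoutes.remove(uid);
        }
        return removed;
    }
}
```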

13. Technical key point 3: The role of ping and pong

After a Client connects to NettyServer, a poor network on either side may break the connection without NettyServer noticing. In that case NettyServer does not clear the data in SessionMap and Redis, which causes serious problems: from the server's point of view, the Client's session is in a "zombie" state, and messages cannot be delivered in real time.

This is where a ping/pong mechanism (that is, a heartbeat mechanism) is needed.

It works as follows: using a scheduled task, the Client sends a ping message to NettyServer at regular intervals, and NettyServer replies with a pong message, so both sides can confirm that the connection is still alive. If the Client is disconnected, NettyServer stops receiving pings, detects the loss once the idle timeout elapses, and clears the session data. In Netty, we can add an IdleStateHandler to the Pipeline for this purpose.
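The server side of this mechanism can be sketched in plain Java as a table of last-ping timestamps plus a periodic sweep. This is an illustrative sketch, not the author's code: `HeartbeatMonitor`, `onPing`, and `evictIdle` are hypothetical names, and the current time is passed in explicitly so that the behavior is easy to test. In a Netty server, IdleStateHandler takes care of the timing and fires an idle event instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HeartbeatMonitor {
    private final long timeoutMillis;
    /** uid -> time of the most recent ping, in milliseconds. */
    private final Map<Long, Long> lastPingAt = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a ping from uid and returns the reply to send back. */
    String onPing(long uid, long nowMillis) {
        lastPingAt.put(uid, nowMillis);
        return "pong";
    }

    /** Removes and returns every uid whose last ping is older than the timeout. */
    List<Long> evictIdle(long nowMillis) {
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : lastPingAt.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                expired.add(entry.getKey());
            }
        }
        for (Long uid : expired) {
            lastPingAt.remove(uid);
        }
        return expired;
    }
}
```

The uids returned by `evictIdle` are the ones whose Session and Redis data should be cleared. The timeout should be longer than the client's ping interval, so that a single delayed ping does not cause an eviction.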

If you don't understand the role of heartbeat, be sure to read the following articles:

"Why does the mobile IM based on the TCP protocol still need the heartbeat keep-alive mechanism?"
"Understanding the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Function, Principle, Implementation Ideas, etc."

You can also learn the heartbeat logic of mainstream IM:

"WeChat team original sharing: Android version WeChat background keep alive actual combat sharing (process keep alive)"
"WeChat team original sharing: Android version WeChat background keep-alive actual combat sharing (network keep-alive)"
"Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat"
"Mobile IM Practice: Analysis of Heartbeat Strategy of WhatsApp, Line and WeChat"

If the theory is not intuitive enough, the following articles include code examples you can learn from:

"Correctly understand the heartbeat and reconnection mechanism of IM long connection, and implement it by hand (with complete IM source code)"
"Discussion on the design and implementation of an Android-side IM intelligent heartbeat algorithm (including sample code)"
"Is it so difficult to develop IM by yourself? Teach you to create a simple Android version of IM by yourself (with source code)"
"Teach you to use Netty to implement the heartbeat mechanism and disconnection reconnection mechanism of network communication programs"

In practice, making a heartbeat algorithm work well involves some subtleties. The following two articles are strongly recommended:

"Web-side instant messaging practice dry goods: how to make your WebSocket disconnect and reconnect faster?"
"Rongyun Technology Sharing: Network Link Keep-Alive Technology Practice of Rongyun Android IM Products"

14. Technical key point 4: Add Hook to Server and Client

If a NettyServer restarts or its process is killed, we need to clear the Client connection information that this node saved in Redis (the node's SessionMap lives in memory and disappears on restart, so it does not need to be cleared explicitly).

To do this, we traverse the SessionMap to collect all uids, delete the corresponding Redis data one by one, and then exit gracefully. This requires adding a shutdown Hook to NettyServer for data cleanup.
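A minimal sketch of such a hook, using the JVM's standard `Runtime.addShutdownHook`, is shown below. The maps are in-memory stand-ins for the SessionMap and for Redis, and `ShutdownCleanup`, `cleanUp`, and `installHook` are hypothetical names. Keep in mind that JVM shutdown hooks run on normal exit and on signals such as SIGTERM, but not when the process is killed with SIGKILL, so stale Redis entries can still occur and may need an additional safeguard.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ShutdownCleanup {
    /** uid -> channel description for clients on this node (stands in for SessionMap). */
    final Map<Long, String> sessionMap = new ConcurrentHashMap<>();
    /** uid -> node address shared by the whole cluster (stands in for Redis). */
    private final Map<Long, String> redisRoutes;

    ShutdownCleanup(Map<Long, String> redisRoutes) {
        this.redisRoutes = redisRoutes;
    }

    /** Deletes the shared entry of every locally connected uid, then clears the local map. */
    int cleanUp() {
        int removed = 0;
        for (Long uid : sessionMap.keySet()) {
            if (redisRoutes.remove(uid) != null) {
                removed++;
            }
        }
        sessionMap.clear();
        return removed;
    }

    /** Registers cleanUp to run when the JVM shuts down; returns the hook thread. */
    Thread installHook() {
        Thread hook = new Thread(this::cleanUp, "im-shutdown-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Only the entries of this node's own uids are deleted, so the connection information of clients on other nodes stays untouched.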

15. Technical key point 5: How to deal with messages when the other party is not online?

When Client1 sends a message and we cannot find the recipient's session data in either SessionMap or Redis, the recipient is not online.

In that case, we store the message in the offline message table. The next time the recipient logs in, NettyServer queries the offline message table and sends the stored messages to the user (preferably in batches to improve performance).
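The store-then-deliver-in-batches idea can be sketched as follows. The offline message table is modeled as an in-memory map of queues purely for illustration; `OfflineStore`, `save`, and `nextBatch` are hypothetical names, and a real system would use a persistent table.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

final class OfflineStore {
    /** uid -> pending messages in arrival order (stands in for the offline message table). */
    private final Map<Long, Deque<String>> pending = new ConcurrentHashMap<>();

    /** Stores a message for a receiver that is currently offline. */
    void save(long toUid, String body) {
        pending.computeIfAbsent(toUid, uid -> new ConcurrentLinkedDeque<>()).addLast(body);
    }

    /** Removes and returns up to batchSize of the oldest pending messages for uid. */
    List<String> nextBatch(long uid, int batchSize) {
        List<String> batch = new ArrayList<>();
        Deque<String> queue = pending.get(uid);
        if (queue == null) {
            return batch;
        }
        String body;
        while (batch.size() < batchSize && (body = queue.pollFirst()) != null) {
            batch.add(body);
        }
        return batch;
    }
}
```

After a user logs in, the server would call `nextBatch` repeatedly and send each batch until an empty list is returned.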

Offline message processing in IM is not a simple technical point. If you are interested, you can study it in depth:

"Implementation of IM Message Delivery Guarantee Mechanism (2): Ensuring Reliable Delivery of Offline Messages"
"Alibaba IM Technology Sharing (6): Optimization of Offline Push Reach Rate of Xianyu Billion-level IM Messaging System"
"IM development dry goods sharing: how do I solve a large number of offline messages that cause the client to freeze"
"IM development dry goods sharing: how to elegantly achieve reliable delivery of a large number of offline messages"
"Architecture Design Practice of Offline Message Push System with 100 Million Users in Himalaya"

16. Final words

With the code written, I have fulfilled my wish to build an IM system by hand. My only regret is that time was tight and I have not yet implemented a message ack mechanism to guarantee delivery; I will add it in the future.

Well, this is the simple chat system I developed. Small as it is, it has all the essential parts. If you have any questions, you can leave a message below, and the author will reply to each one. Thank you.

17. Series of articles

"Learn IM from the source code (1): teach you to use Netty to implement the heartbeat mechanism, disconnection and reconnection mechanism"
"Learn IM with source code (2): Is it difficult to develop IM by yourself? Teach you to play an Android version of IM"
"Learn IM with the source code (3): Based on Netty, develop an IM server from scratch"
"Learn IM with the source code (4): Pick up the keyboard and do it, teach you to develop a distributed IM system with your bare hands"
"Learn IM from the source code (5): correctly understand the IM long connection, heartbeat and reconnection mechanism, and implement it by hand"
"Learn IM from the source code (6): teach you how to quickly build a high-performance and scalable IM system with Go"
"Learn IM from the source code (7): teach you how to use WebSocket to create web-side IM chat"
"Learn IM from the source code (8): 10,000-character long text, teach you how to use Netty to create IM chat"
"Learn IM with Source Code (9): Implementing a Distributed IM System Based on Netty"
"Learn IM with source code (ten): Based on Netty, build a high-performance IM cluster (including technical ideas + source code)" (* this article)

