
This article was written by "Carson", who currently works at Tencent. It was originally titled "Efficient Keep-Alive and Long-Term Connections: Hand-in-Hand Teach You to Implement an Adaptive Heartbeat Keep-Alive Mechanism" and has been substantially revised for this edition.

1. Introduction

Features with strict real-time requirements, such as IM instant messaging and message push, are generally built on long connections.

In practice, however, long connections bring a number of technical problems, the most common being how to keep the connection alive.

In this article I walk through the implementation of an adaptive heartbeat keep-alive mechanism that maintains long connections, such as those used for IM chat, efficiently and stably.

(This article is simultaneously published at: http://www.52im.net/thread-3908-1-1.html )

2. Related articles

"Why does mobile IM based on TCP protocol still need heartbeat keep-alive mechanism?"
"Understanding the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Function, Principle, Implementation Ideas, etc."
"Discussion on the design and implementation of an Android-side IM intelligent heartbeat algorithm (including sample code)"
"Is it so difficult to develop IM by yourself? Teach you to create a simple Android version of IM by yourself (with source code)"
"Learn IM from the source code (1): teach you to use Netty to implement the heartbeat mechanism, disconnection and reconnection mechanism"
"Learn IM from the source code (5): correctly understand the IM long connection, heartbeat and reconnection mechanism, and implement it by hand"

3. What is a long connection

An overview of long connections:

The main function of a long connection is to maintain the connection between the two parties for a long time, thereby:

1) Improve the communication speed;
2) Ensure real-time performance;
3) Avoid the waste of channel resources and network resources caused by repeated connections in a short time.

The difference between long connection and short connection:

PS: Among IM developers, the HTTP protocol is usually referred to as a "short connection", while a socket built directly on TCP, UDP, or WebSocket is referred to as a "long connection".

4. Reasons for long connection disconnection

4.1 Basic concepts

As can be seen from the previous section, when long connections are used, all communication between the two parties takes place over one long connection (such as one TCP connection). The long connection therefore has to stay up so that the two parties can continue to communicate.

However, the reality is that long connections will be disconnected.

The main reasons for these disconnections are:

1) The process where the long connection is located is killed (this mainly refers to the mobile terminal);
2) NAT timeout;
3) The network status changes;
4) Other force majeure factors (poor network status, DHCP lease expiry, etc.).

Below, I will analyze each reason.

4.2 Specific analysis
1) Reason 1: Process is killed

When the process is killed, the long connection is disconnected with it. Processes being killed is the most common problem on Android. Due to space limitations I will not expand on the topic here. If you are interested, you can read "The official version of Android P is coming: the real nightmare of background application keep alive and message push".

2) Reason 2: NAT timeout (key point)

The NAT timeout phenomenon is as follows:

The NAT timeout period for each operator and region is as follows:

PS: The above data comes from the article "Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat" by the WeChat team. With the spread of 4G and 5G these figures may have changed; rely on your own test results.

Special note: excluding external factors (network switching, NAT timeout, human intervention), a TCP long connection is never interrupted automatically as long as neither party closes it; that is, no heartbeat packets are needed to maintain it. This can be verified as follows: connect two computers to the same Wi-Fi network, run a server on one and a client on the other, and connect the client to the server without setting KeepAlive. As long as the computers and the router stay connected to the network, the long connection between the two computers is not interrupted automatically.

Jack Jiang's Note: The above discussion may not be accurate. Interested readers can read "Unplug the network cable and plug it in again, is the TCP connection still there? Understand it in one sentence!".

3) Reason 3: The network state has changed

When the network state of the mobile client changes (for example, switching between the mobile network and Wi-Fi, disconnecting, or reconnecting), the long connection is also disconnected.

4) Reason 4: Other Force Majeure Factors

For example, a poor network or an expired DHCP lease can cause occasional disconnection of long connections. DHCP lease expiry: on Android, DHCP does not automatically renew the lease after it expires (the device keeps using the expired IP address), which causes the long connection to drop.

5. Solutions to maintain long-term connections efficiently

5.1 Basic introduction

With the causes of disconnection understood, here is my scheme for maintaining long connections efficiently, addressing each of those causes (as shown in the figure below).

To maintain a long connection effectively, you therefore need to do the following:

To put it simply, the key to maintaining a long connection efficiently is:

1) Keep-alive: try not to break when connected;
2) Reconnection: After the connection is broken, it must be able to continue to reconnect.

5.2 Specific measures
1) Measure 1: Process keep alive

The overall summary is as follows:

PS: Keeping Android processes alive is a widely discussed topic. If you are interested, you can read about it in detail in the following articles:

"Ultimate Summary of Application Keep Alive (1): Dual-process Guardian Keep Alive Practices Below Android 6.0"
"Ultimate Summary of Application Keep Alive (2): Keep Alive Practice of Android 6.0 and Above (Process Killing)"
"Ultimate Summary of Application Keep Alive (3): Keep Alive Practice for Android 6.0 and Above (Killed and Resurrected)"
"Detailed explanation of Android process keeping alive: an article to solve all your doubts"
"WeChat team original sharing: Android version WeChat background keep alive actual combat sharing (process keep alive)"
"The official version of Android P is coming: the real nightmare of background application keep-alive and message push"
"Comprehensive inventory of the real operating effects of the current Android background keep-alive solution (before 2019)"
"In 2020, is there still a drama to keep the Android background alive? See how I do it elegantly!"
"The Strongest Android Keep Alive Ideas in History: In-depth Analysis of Tencent TIM's Process Immortality Technology"
"The ultimate reveal of Android process immortality technology: the underlying principle of the process being killed, and the skills of APP to deal with being killed"
"Android Keep Alive from Entry to Abandonment: Obediently guide users to add whitelist (with 7 models of whitelisting examples)"

2) Measure 2: Heartbeat Keep Alive Mechanism

This is the focus of this article and is analyzed in detail starting from the next section.

3) Measure 3: disconnection and reconnection mechanism

The principle is to detect changes in network status and to judge the validity of the connection in time.

Specific implementation: reconnection forms one complete piece of logic together with the heartbeat keep-alive mechanism, so it is explained together with that mechanism in the sections that follow.

6. Introduction to the heartbeat keep-alive mechanism

The overall introduction of the heartbeat keep-alive mechanism is as follows:

However, many people tend to confuse the heartbeat mechanism with the traditional HTTP polling mechanism.

The difference between the two is given below:

7. Analysis and comparison of the heartbeat mechanism of mainstream IM

This section briefly analyzes and compares the heartbeat mechanisms of mainstream domestic and foreign mobile IM products (WhatsApp, Line, WeChat).

Please see the following figure for details:

PS: The above data comes from the article "Mobile IM Practice: Analysis of WhatsApp, Line, and WeChat's Heartbeat Strategy" shared by the WeChat team.

8. Overall design of the heartbeat keep-alive mechanism scheme

Below, I will design a heartbeat mechanism scheme based on the mainstream heartbeat mechanism on the market.

The basic process of the heartbeat mechanism scheme:

The main considerations for the design of the heartbeat mechanism scheme are:

1) To ensure the real-time nature of the message;
2) Consider the resource consumption of the device (network traffic, power, CPU, etc.).

As can be seen from the above figure, the main points of the design of the heartbeat mechanism scheme are:

1) The specification of the heartbeat packet (content & size);
2) The interval time of heartbeat sending;
3) Disconnection and reconnection mechanism (core = how to judge the validity of a long connection).

In the following scheme design, detailed solutions will be given for these three problems.

9. Detailed design of the heartbeat mechanism scheme
9.1 Specifications of heartbeat packets

To reduce traffic and improve transmission efficiency, the heartbeat packet should be kept as simple as possible.

The design principles concern mainly the content and the size of the heartbeat packet, as follows:

Design:

Heartbeat packet = one packet that carries a small amount of information and is no larger than 10 bytes
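
As an illustration of how small such a packet can be, here is a minimal Python sketch. The wire layout (a 1-byte type plus a 4-byte sequence number) and the names `HEARTBEAT_PING`, `HEARTBEAT_PONG`, `encode_heartbeat` and `decode_heartbeat` are assumptions made for this example; the article does not prescribe a concrete format.

```python
import struct
from typing import Tuple

# Hypothetical message types (not defined by the article).
HEARTBEAT_PING = 0x01
HEARTBEAT_PONG = 0x02

# Network byte order, 1-byte type + 4-byte sequence number = 5 bytes in total.
_FRAME_FORMAT = "!BI"


def encode_heartbeat(kind: int, seq: int) -> bytes:
    """Pack a heartbeat into a fixed 5-byte frame."""
    return struct.pack(_FRAME_FORMAT, kind, seq & 0xFFFFFFFF)


def decode_heartbeat(frame: bytes) -> Tuple[int, int]:
    """Unpack a frame produced by encode_heartbeat into (kind, seq)."""
    kind, seq = struct.unpack(_FRAME_FORMAT, frame)
    return kind, seq
```

A sequence number is included so that a response can be matched to the heartbeat it answers; a design that does not need this could shrink the frame to a single byte.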

9.2 Heartbeat sending interval

The heartbeat has to prevent NAT timeouts while keeping the consumption of device resources (network traffic, power, CPU, etc.) low, so the sending interval is the focus of the whole heartbeat mechanism design.

The design principles of the heartbeat sending interval are as follows:

9.3 The most commonly used heartbeat interval scheme

The most direct and most commonly used scheme is a fixed interval: "send a heartbeat packet every x minutes", where x is an estimated value. Any x < 5 minutes is sufficient (based on mainstream mobile IM products, x = 4 minutes is recommended here).
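
The fixed-interval scheme can be sketched in a few lines of Python. The class name `FixedHeartbeatTimer` and its methods are hypothetical, and time is passed in explicitly (in seconds) so that the logic does not depend on any particular scheduler.

```python
from typing import Optional

FIXED_INTERVAL_SECONDS = 4 * 60  # x = 4 minutes, the value recommended in the text


class FixedHeartbeatTimer:
    """Decides when the next heartbeat is due under a fixed interval."""

    def __init__(self, interval: float = FIXED_INTERVAL_SECONDS):
        self.interval = interval
        self.last_sent: Optional[float] = None

    def due(self, now: float) -> bool:
        """True if no heartbeat was sent yet or the interval has elapsed."""
        return self.last_sent is None or now - self.last_sent >= self.interval

    def mark_sent(self, now: float) -> None:
        """Record the time at which a heartbeat was sent."""
        self.last_sent = now
```

The caller would poll `due(now)` from its own loop or alarm callback and call `mark_sent(now)` after each send.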

However, there are some problems with this scheme:

PS: For the specific implementation of the fixed heartbeat interval, you can refer to:

"Learn IM with the source code (1): teach you to use Netty to implement the heartbeat mechanism, disconnection and reconnection mechanism";
"Learn IM from the source code (5): Correctly understand the IM long connection, heartbeat and reconnection mechanism, and implement it by hand";
"Is it so difficult to develop IM by yourself? Teach you to create a simple Android version of IM by yourself (with source code)".

9.4 Adaptive heartbeat interval scheme

Next, I will explain the design of the adaptive heartbeat interval in detail.

Basic logic:

There are two core problems that the solution needs to solve.

1) How to adaptively calculate the heartbeat interval so that the heartbeat interval is close to the current NAT timeout?

Answer: Keep increasing the heartbeat interval and testing for a heartbeat response until the heartbeat fails 5 times; the largest interval that still succeeded is then taken as the one closest to the current NAT timeout.

Please see the following figure for details:

Note: Only when the heartbeat interval is close to the NAT timeout can the long connection be kept uninterrupted at the lowest cost in device resources.
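
The probing logic described in the answer above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the starting interval, the step and the upper bound are all hypothetical, `min_interval` is assumed to be already known to work, and `try_heartbeat(interval)` is an assumed callback that waits `interval` seconds, sends one heartbeat (reconnecting first if necessary) and returns whether a response arrived.

```python
from typing import Callable

MAX_FAILURES = 5  # failure threshold taken from the text


def probe_heartbeat_interval(
    try_heartbeat: Callable[[int], bool],
    min_interval: int = 60,
    max_interval: int = 600,
    step: int = 30,
) -> int:
    """Return the largest tested interval (seconds) that still got a response."""
    best = min_interval
    candidate = min_interval + step
    while candidate <= max_interval:
        succeeded = False
        # Give each candidate interval up to MAX_FAILURES attempts.
        for _ in range(MAX_FAILURES):
            if try_heartbeat(candidate):
                succeeded = True
                break
        if not succeeded:
            break  # candidate is beyond the NAT timeout; keep the last good value
        best = candidate
        candidate += step
    return best
```

With a simulated NAT timeout of 200 seconds and the defaults above, the function tests 90, 120, 150 and 180 successfully, fails five times at 210, and settles on 180.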

2) How to detect that the NAT timeout time of the current network environment has changed?

Answer: If heartbeats sent at the current maximum successful interval (that is, the heartbeat interval closest to the NAT timeout) fail 5 times, the NAT timeout of the current network environment is judged to have changed.

Please see the following figure for details:

Note: After a change in the NAT timeout is detected, the heartbeat interval is recalculated adaptively so that it is again close to the NAT timeout.

To sum up: combining the answers to the two core problems above, the adaptive heartbeat interval design is summarized in the following figure:
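
Detecting a changed NAT timeout then comes down to counting failures at the stable interval. The sketch below assumes that the 5 failures must be consecutive (the text does not say so explicitly for this step); the class name `NatChangeDetector` is hypothetical.

```python
class NatChangeDetector:
    """Counts heartbeat failures at the stable (largest known-good) interval."""

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat result; return True when re-probing should start."""
        if heartbeat_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Reset so the detector can be reused after the interval is re-probed.
            self.consecutive_failures = 0
            return True
        return False
```

When `record` returns True, the client would drop back to a short interval and run the interval search again.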

PS: For the design and implementation of the adaptive heartbeat mechanism, you can refer to:

"Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat";
"A discussion on the design and implementation of an Android-side IM intelligent heartbeat algorithm (including sample code)".

10. Implementation of disconnection and reconnection mechanism

Technically, keeping a long connection alive depends on the heartbeat mechanism, which starts the disconnection and reconnection mechanism promptly when it is needed; true keep-alive is achieved only when the two mechanisms work together. To keep the logic clear, however, I explain them in separate sections. This section covers the disconnection and reconnection mechanism.

The core of this mechanism is how to judge the validity of a long connection. That is: under what circumstances is a long connection disconnected?

1) Design principles:

The basic logic is: the criterion for judging whether the long connection is valid = whether the server returns a heartbeat response.

Here we need to distinguish between the "alive" and "valid" states of a long connection:

2) Specific plans:

Implementation idea: keep a counter; if 5 consecutive heartbeats are sent without any heartbeat response from the server, the long connection is considered invalid.
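
A minimal sketch of this counter, with hypothetical names, could look like the following. It assumes the caller checks `connection_invalid()` at the moment the next heartbeat falls due, so that the most recent heartbeat has had a full interval in which to be answered.

```python
class HeartbeatValidityTracker:
    """Tracks how many heartbeats in a row have gone unanswered."""

    def __init__(self, max_unanswered: int = 5):
        self.max_unanswered = max_unanswered
        self.unanswered = 0

    def on_heartbeat_sent(self) -> None:
        """Call each time a heartbeat packet is written to the connection."""
        self.unanswered += 1

    def on_heartbeat_response(self) -> None:
        """Call when the server's heartbeat response arrives."""
        self.unanswered = 0

    def connection_invalid(self) -> bool:
        """True once max_unanswered heartbeats in a row got no response."""
        return self.unanswered >= self.max_unanswered
```

If `connection_invalid()` returns True, the client would close the socket and hand over to the reconnection logic instead of sending another heartbeat.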

Judgment process:

3) Schemes circulating on the Internet:

Several other schemes for judging whether a long connection is valid circulate on the Internet. The details are as follows:

So far, the heartbeat keep-alive mechanism has been explained.

11. Scheme summary

This section summarizes the heartbeat mechanism and the disconnection and reconnection mechanism described in the previous two sections. Together, these two mechanisms make up the complete long-connection heartbeat keep-alive logic of this article.

Design:

Process Design:

Note: For the judgment steps marked in gray, refer to the descriptions above.

12. Further optimize and improve the heartbeat keep-alive scheme

12.1 Basic situation

The schemes in the previous two sections still have technical defects that can leave the long connection disconnected (for example, when the long connection itself is unavailable, reconnecting many times is useless).

The following will optimize and improve the above scheme, so as to ensure that the client and the server still maintain a communication state.

The optimization points are mainly:

1) Ensure the validity and stability of the current network before starting a long connection;
2) Adaptively calculate the timing of the heartbeat packet interval.
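
For optimization point 1, one possible pre-check is to attempt a few short TCP connections to the server before opening the long connection. This is purely an assumption of the sketch below; the article does not prescribe a method. The function name `network_is_usable` and its default attempt count and timeout are hypothetical.

```python
import socket


def network_is_usable(host: str, port: int, attempts: int = 3, timeout: float = 2.0) -> bool:
    """Return True only if every one of `attempts` TCP connects succeeds."""
    for _ in range(attempts):
        try:
            # create_connection raises OSError on refusal, timeout or no route.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            return False
    return True
```

Requiring several consecutive successes is a crude stand-in for "stable"; a real client might also listen for platform network-state notifications.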

12.2 Ensure the validity and stability of the network before starting the long connection

Problem description:

Solution:

Add to the main process of the original heartbeat keep-alive mechanism:

12.3 Timing of adaptive calculation of the heartbeat packet interval

Description:

Design:

Added to the main process of the original heartbeat keep-alive mechanism:

12.4 Summary

13. Additional thinking: Can the KeepAlive mechanism that comes with the TCP protocol replace the heartbeat mechanism?

Many people ask: since the TCP protocol has its own KeepAlive mechanism, why implement an additional heartbeat keep-alive mechanism at the application layer on top of it?

The conclusion: it cannot be replaced.
The reason: the TCP KeepAlive mechanism detects whether a connection is alive, but it cannot detect whether the connection is valid.
Note: a "valid connection" is defined as one on which both parties are able to send and receive messages.

Let's take a look at what the KeepAlive mechanism is:

The specific reasons why the KeepAlive mechanism cannot replace the heartbeat mechanism are as follows:

Note:

1) The KeepAlive mechanism is only a passive mechanism at the bottom of the operating system and should not be used by the upper application layer;
2) When the system closes a dead connection found by the KeepAlive mechanism, it does not actively notify the upper-layer application; the application can only find out through the return value of the corresponding I/O operation.

To sum up, the KeepAlive mechanism cannot replace the heartbeat mechanism. It is necessary to implement the heartbeat mechanism at the application layer to detect the validity of the long connection, so as to maintain the long connection efficiently.
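
For reference, this is roughly how TCP-level KeepAlive is switched on for a socket in Python. The helper name `enable_tcp_keepalive` and the timing values are illustrative assumptions. `SO_KEEPALIVE` is widely available, while `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are platform-specific, so the sketch sets them only where Python's `socket` module exposes them.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle: int = 60, interval: int = 10, count: int = 5) -> None:
    """Turn on kernel-level TCP keep-alive probes for `sock`."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning knobs exist only on some platforms (for example Linux); skip the rest.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

The probes are sent and answered by the operating systems on both ends without involving either application, which is consistent with the point above: they show that the peer's TCP stack is reachable, not that the server application can still process messages.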

Jack Jiang's Note: For more on TCP's own KeepAlive mechanism, see:

"Why does the mobile IM based on the TCP protocol still need the heartbeat keep-alive mechanism?"
"Comprehend the KeepAlive mechanism of the TCP protocol layer thoroughly"

14. Summary of this article

After reading this article, I believe you will be able to handle the task of maintaining long connections efficiently!

The main design of the scheme is:

The optimizations and improvements to the scheme are as follows:

15. References

[1] TCP/IP Detailed Explanation Volume 1: Protocol
[2] Why does the TCP-based mobile IM still need the heartbeat keep-alive mechanism?
[3] Thoroughly understand the KeepAlive mechanism of the TCP protocol layer
[4] 10,000-character long article, an article to understand WebSocket: concept, principle, error-prone common sense, hands-on practice
[5] Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat
[6] Mobile IM practice: heartbeat strategy analysis of WhatsApp, Line and WeChat
[7] WeChat team original sharing: Android version WeChat background keep-alive actual combat sharing (network keep-alive)
[8] Rongyun Technology Sharing: Network Link Keep-Alive Technology Practice of Rongyun Android IM Products
[9] Alibaba IM Technology Sharing (5): Timeliness Optimization Practice of Xianyu Billion-level IM Messaging System
[10] In 2020, is there still a drama to keep the Android background alive? See how I do it elegantly!


