Author: Bao Fengqi

A developer on the dble team at Aikesheng, mainly responsible for dble feature development, troubleshooting, and answering community questions. Cut the chatter and show me the code.

Source of this article: original contribution

*Original content is produced by the Aikesheng open source community and may not be used without authorization. To reprint, please contact the editor and credit the source.


Background

In our stable environment, after dble initialized the back-end connection pool, the pool's connection counter (totalConnections) did not match the actual number of connections (allConnections). In theory, the two variables should be eventually consistent.

Searching online turned up a relevant write-up: https://mp.weixin.qq.com/s/FEybf-jRL8gMUYPBDH3aQg . Please refer to it for the detailed analysis.

In short, while dble initializes the back-end connection pool, too many connections may be created at the same instant. This triggers the TCP syn_cookie mechanism during some of the handshakes, and the ACK of the third handshake step is lost, which leads to the mismatch described above.

After TCP syn_cookie was turned off in the stable environment, the problem was resolved for the time being.

But consider what happens if one of the following three failures occurs during an ordinary TCP three-way handshake:

  • The SYN of the first handshake step is lost
  • The SYN + ACK of the second handshake step is lost
  • The ACK of the third handshake step is lost

How do the client and the server handle each case? If they retransmit, how many times, and at what intervals?

Lab environment

MySQL server: the MySQL service runs on one server and listens on port 3306, IP address 10.186.60.69

MySQL client: the MySQL client connects to the service from another server, IP address 10.186.60.60

First scenario

What happens when the SYN of the first handshake step is lost?

On the MySQL server, use iptables to drop all TCP packets that the client sends to port 3306:

 $ iptables -i eth0 -A INPUT -p tcp --dport 3306 -j DROP

Start capturing packets on the MySQL server:

 $ tcpdump -i eth0 tcp and port 3306 -w tcp_syn_timeout.cap

Connect to the MySQL server with the MySQL client. After more than a minute the client returns the error below; stop the packet capture at that point:

 $ mysql -h10.186.60.69 -uroot -p123456 -P3306
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2003 (HY000): Can't connect to MySQL server on '10.186.60.69' (110)

Analyze the capture file with Wireshark:

When the client receives no SYN + ACK in response to its SYN, it keeps retrying. In this example it retries six times, with a different RTO (retransmission timeout) each time:

  • The first retry is sent about 1 second after the original SYN
  • The second retry is sent at about 3 seconds, roughly 2s after the first
  • The third retry is sent at about 7 seconds, roughly 4s after the second
  • The fourth retry is sent at about 15 seconds, roughly 8s after the third
  • The fifth retry is sent at about 31 seconds, roughly 16s after the fourth
  • The sixth retry is sent at about 65 seconds, roughly 32s after the fifth

The RTO doubles after each timeout, so it grows exponentially. The number of retries is configurable through the following kernel parameter on the client machine:

 $ cat /proc/sys/net/ipv4/tcp_syn_retries
6 
 
# The value may differ between distributions
$ uname -a
Linux ubuntu 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

You can try to modify this parameter to see the effect:

 $ echo 2 > /proc/sys/net/ipv4/tcp_syn_retries

To summarize:

What happens when the SYN of the first handshake step is lost?

The client retransmits the SYN until it receives a SYN + ACK or reaches the maximum number of retries, and the wait before each retry doubles.
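
As a rough illustration of the doubling described above, the following shell sketch prints an idealised retransmission schedule for a given tcp_syn_retries value. It assumes an initial RTO of 1 second and exact doubling; syn_retry_schedule is a hypothetical helper name, and timings in a real capture will drift slightly from these figures.

```shell
# Idealised SYN retransmission schedule.
# Assumptions: initial RTO of 1 second, doubled after every timeout, no jitter.
# syn_retry_schedule is a hypothetical helper, not a system command.
syn_retry_schedule() (
    retries=$1      # value of tcp_syn_retries
    rto=1           # assumed initial RTO in seconds
    elapsed=0       # seconds since the original SYN
    out=""
    i=0
    while [ "$i" -lt "$retries" ]; do
        elapsed=$((elapsed + rto))
        out="$out${out:+ }$elapsed"
        rto=$((rto * 2))
        i=$((i + 1))
    done
    # After the last retransmission the client waits one more RTO, then gives up.
    echo "retransmit at (s): $out"
    echo "give up after (s): $((elapsed + rto))"
)

syn_retry_schedule 6
# retransmit at (s): 1 3 7 15 31 63
# give up after (s): 127
syn_retry_schedule 2
# retransmit at (s): 1 3
# give up after (s): 7
```

Under these assumptions, six retries put the last SYN about 63 seconds after the first and fail the connection attempt at about 127 seconds, which is consistent with the client error appearing after more than a minute.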

Second scenario

What happens when the SYN + ACK of the second handshake step is lost?

To simulate loss of the SYN + ACK, set up a firewall rule on the client that drops all packets coming from the MySQL server:

 $ iptables -A INPUT -p tcp -s 10.186.60.69 -j DROP

Capture packets on the MySQL server side:

 $ tcpdump -i eth0 tcp and port 3306 -w tcp_syn_ack_timeout.cap

Analyze the TCP flow with Wireshark's Flow Graph (under the Statistics menu):

The flow graph can be read from two perspectives:

Client perspective: After sending the SYN, the client never receives the SYN + ACK because of the firewall rule. It therefore keeps retrying until it receives a SYN + ACK or reaches the maximum number of retries.

Server perspective: After receiving the SYN, the server sends a SYN + ACK but never receives the ACK of the final handshake step, so it keeps retransmitting the SYN + ACK.

When a SYN that the client retransmitted after a timeout reaches the server, the server answers with another SYN + ACK, but the SYN + ACK retransmission timer is not reset and retransmission continues on its existing schedule.

As the figure shows, after the server receives the third SYN and sends a SYN + ACK, it retries four times, counting the earlier retry.

This number of retries is also controlled by a kernel parameter:

 $ cat /proc/sys/net/ipv4/tcp_synack_retries
5

After setting the kernel parameter tcp_synack_retries to 1 on the server (the side that retransmits the SYN + ACK), the TCP interaction looks like this:

This makes the number of retries easier to see.

Third scenario

What happens when the ACK of the third handshake step is lost?

On the MySQL server, set up a firewall rule to drop the ACK of the third handshake step:

 $ iptables -A INPUT -p tcp --tcp-flags ACK ACK --dport 3306 -j DROP

Capture packets on the MySQL server side:

 $ tcpdump -i eth0 tcp and port 3306 -w tcp_3th_ack_timeout.cap

Analyze the capture file:

The capture can be read from two perspectives:

Client perspective: As far as the client is concerned, the connection is established.

Check the connection state with netstat:

 $ netstat -napt|grep 3306
tcp 0 0 10.186.60.60:42490 10.186.60.69:3306 ESTABLISHED 14391/mysql

The connection is in the ESTABLISHED state.

Server perspective: Because the ACK of the third step never arrives, the server keeps retransmitting the SYN + ACK, as in the second scenario, until the maximum number of retries is reached.

While it is retrying, the connection on the server stays in the SYN_RECV state:

 $ netstat -napt|grep 3306
tcp        0      0 10.186.60.69:3306       10.186.60.60:42868      SYN_RECV    -

After about half a minute, once the server has exhausted its retries, the connection that was in the SYN_RECV state disappears from the server.

On the client, however, the connection still exists.
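
The disagreement between the two ends can be made visible by tallying socket states on each machine. The sketch below is a minimal example that counts states from netstat-style output; count_tcp_states is a hypothetical helper, and the sample input is simply the two netstat lines shown above.

```shell
# count_tcp_states is a hypothetical helper: it reads `netstat -nat`-style
# lines on stdin and prints how many sockets are in each TCP state.
# Field 6 of a netstat line holds the state (ESTABLISHED, SYN_RECV, ...).
count_tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the netstat outputs above; on a live host,
# pipe in `netstat -nat | grep 3306` instead.
count_tcp_states <<'EOF'
tcp 0 0 10.186.60.60:42490 10.186.60.69:3306 ESTABLISHED 14391/mysql
tcp        0      0 10.186.60.69:3306       10.186.60.60:42868      SYN_RECV    -
EOF
# ESTABLISHED 1
# SYN_RECV 1
```

Run on the client and on the server during the test, a tally like this shows ESTABLISHED on one side and SYN_RECV (and later nothing) on the other.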

What happens next on the client's "established" connection?

There are two cases to consider:

In the first, the client sends data as soon as the TCP connection is established.

In the second, the client does nothing. Both cases are discussed below.

Client sends data

To simulate this scenario, first connect to the MySQL server from the client.

Once connected, drop the client's packets with a firewall rule on the MySQL server:

 $ iptables -A INPUT -p tcp -s 10.186.60.60 -j DROP

Capture packets on the MySQL server:

 $ tcpdump -i eth0 tcp and port 3306 -w tcp_data.cap

Then, over the connection just established, execute the statement use test;.

Now analyze the capture file:

Analysis:

We found that the client retransmitted eleven times in total.

For data packets sent after the TCP connection is established, the maximum number of timeout retransmissions is specified by tcp_retries2. The default is 15. To make observation easier, the value was lowered to 10 here:

 $ cat /proc/sys/net/ipv4/tcp_retries2
10

However, the capture file shows eleven packets rather than ten. For an explanation, see this article: https://perthcharles.github.io/2015/09/07/wiki-tcp-retries/
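
As a hedged summary of that article: on Linux, tcp_retries2 is not enforced as a strict packet count. The kernel turns it into a time budget and gives up at the first RTO that expires past that budget, so the number of packets seen on the wire can differ from the configured value. The sketch below estimates that budget under assumed constants (minimum RTO of 200 ms, maximum RTO of 120 s, doubling each time); retries2_timeout_ms is a hypothetical helper.

```shell
# retries2_timeout_ms is a hypothetical helper that estimates the time budget
# implied by a tcp_retries2 value.
# Assumptions: RTO starts at 200 ms, doubles after each timeout, capped at 120 s.
retries2_timeout_ms() (
    retries=$1
    rto=200         # assumed minimum RTO in milliseconds
    max=120000      # assumed maximum RTO in milliseconds
    total=0
    i=0
    # retries + 1 waits: one after the original send, one after each retransmission
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt "$max" ]; then
            rto=$max
        fi
        i=$((i + 1))
    done
    echo "$total"
)

retries2_timeout_ms 15   # 924600
retries2_timeout_ms 10   # 324600
```

For the default of 15 this gives 924600 ms (about 924.6 seconds); for the value of 10 used here it gives 324600 ms. The real RTO depends on the measured round-trip time, so the actual packet count and total duration in a capture can differ from this estimate.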

No action

In the MySQL protocol, the MySQL server sends a handshake packet once the TCP connection is established. Because the connection no longer exists on the server side, no handshake packet is sent and the client just hangs.

 $ mysql -h10.186.60.69 -uroot -p123456 -P3306
mysql: [Warning] Using a password on the command line interface can be insecure.

At this point, whether the client's connection is still alive is determined by the TCP keep-alive mechanism.

Keep-alive mechanism:

  1. The precondition: if a connection sees no activity for a certain period of time, the TCP keep-alive mechanism starts to work.
  2. The mechanism then sends a probe packet at a fixed interval. If several consecutive probes go unanswered, the TCP connection is considered dead and the kernel reports the error to the application.

The Linux kernel has parameters for the keep-alive idle time, the number of probes, and the probe interval. The defaults are:

 $ sysctl -a|grep keepalive
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
  
# The values can also be viewed as follows:
$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
$ cat /proc/sys/net/ipv4/tcp_keepalive_time

Parameter explanation:

  • tcp_keepalive_intvl=75: the interval between probes is 75 seconds
  • tcp_keepalive_probes=9: if 9 consecutive probes get no response, the peer is considered unreachable and the connection is closed
  • tcp_keepalive_time=7200: the idle time before keep-alive starts is 7200 seconds (2 hours); that is, if the connection sees no activity for 2 hours, the keep-alive mechanism kicks in

We can modify the parameters to see the effect:

 $ echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
$ echo 40 > /proc/sys/net/ipv4/tcp_keepalive_time
$ echo 2 > /proc/sys/net/ipv4/tcp_keepalive_probes

In the capture file:

Keep-alive probing started at the 40s mark and two probes were sent 10s apart, which matches the modified parameters.

With the modified parameters, the client reported an error about a minute later:

 $ mysql -h10.186.60.69 -uroot -p123456 -P3306
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 110
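
The timing follows from the three parameters: the connection must be idle for tcp_keepalive_time, after which up to tcp_keepalive_probes probes are sent tcp_keepalive_intvl apart. A minimal sketch of that arithmetic is shown below; keepalive_detect_seconds is a hypothetical helper, and the formula is an approximation that ignores timer jitter.

```shell
# keepalive_detect_seconds is a hypothetical helper that estimates how long an
# idle, dead connection survives before keep-alive declares it lost.
# Arguments: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 7200 75 9   # defaults: 7875
keepalive_detect_seconds 40 10 2     # modified values: 60
```

With the modified values the estimate is 60 seconds, which agrees with the error appearing about a minute after the connection was opened. With the defaults, the same dead connection would linger for roughly 7875 seconds, a little over two hours.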
