
The title sounds sensational, but it is not an exaggeration; for us this really was a "bloody" incident. The problem left a core underlying application unavailable for about 15 minutes and indirectly affected many upstream applications, especially some payment-related services. Because of the scale of the impact, the incident received a high internal severity rating. The troubleshooting process was full of twists and turns, with plenty to criticize along the way, and readers who know networking well may spot the problem at a glance. I was deeply involved in the investigation, so I am writing this summary in the hope that it offers some useful reference.

0. Article guide

This article runs to about 7,000 words and 26 figures. Since it is fairly long, here is a quick overview of its structure. The article is divided into 5 chapters, summarized as follows:

  1. Failure phenomenon: introduces what the failure looked like; read this before the follow-up content.
  2. Troubleshooting process: describes how the problem was investigated and gives a preliminary conclusion.
  3. Fault reproduction: reproduces and verifies the failure based on that conclusion, and gives preventive measures.
  4. Exploring again: re-examines the troubleshooting process, revisits the remaining doubts, and tries to find the "truth".
  5. Summary and reflections: reviews the problems exposed during the troubleshooting and prevention work.

Note that, to keep the screenshots small, some exception stack frames have been trimmed, but the core calls are all preserved.

1. Failure phenomenon

At around 2 p.m. on a Sunday in April, a core application raised a large number of alarms and then recovered by itself shortly afterwards. According to the monitoring, the failure lasted about 15 minutes. Looking through the business logs, I found many druid-related errors, which suggested that something was wrong with druid.

Figure 1: A large number of business threads throw a connection timeout exception

Figure 2: The druid connection producer thread throws a network exception

Later we learned from the DBA that Alibaba Cloud RDS had performed an automatic active/standby switchover at 13:57 because of a physical machine failure. Since the switchover happened very close to the time our application failed, we preliminarily judged that the failure was related to the RDS switchover. Judging from the symptoms, after RDS switched between active and standby, our application did not seem to switch over successfully and was still connected to the faulty machine. The DBAs had rehearsed active/standby switchovers many times before without incident, so the problem seemed quite puzzling at the time.

The above is the background and the symptoms of the failure. Now let's start analyzing its cause.

2. Troubleshooting process

Before starting the analysis, let me introduce druid's concurrency model as some technical groundwork.

2.1 Introduction to background knowledge

The druid data source maintains its connection pool with a producer-consumer model. When there is no connection in the pool, a consumer thread notifies the producer thread to create one. After the producer creates a connection, it puts the connection into the pool and notifies the consumer threads to take it. If a consumer thread fails to obtain a connection within the configured time, it throws a timeout exception.

Figure 3: Druid concurrency model

Note that the producer is a single thread. If this thread becomes blocked for some reason, a large number of consumer threads will be unable to obtain connections.
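To make this model concrete, here is a minimal usage sketch of the consumer side (the URL, credentials and maxWait value are placeholders, not the production configuration): a thread calling getConnection() waits at most maxWait milliseconds for the producer thread to hand over a connection, and receives a GetConnectionTimeoutException otherwise.

import java.sql.Connection;

import com.alibaba.druid.pool.DruidDataSource;
import com.alibaba.druid.pool.GetConnectionTimeoutException;

public class DruidConsumerSketch {
    public static void main(String[] args) throws Exception {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://127.0.0.1:3306/test");  // placeholder URL
        ds.setUsername("test");                         // placeholder credentials
        ds.setPassword("test");
        ds.setMaxWait(6000);  // a consumer thread waits at most 6 s for a pooled connection

        try (Connection conn = ds.getConnection()) {
            // the consumer thread obtained a connection from the pool; use it here
        } catch (GetConnectionTimeoutException e) {
            // the single producer thread failed to supply a connection within maxWait
            e.printStackTrace();
        } finally {
            ds.close();
        }
    }
}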

2.2 Investigation process

2.2.1 Preliminary investigation

I was the first to take over this problem. At that point very little information was available apart from the exception logs. When I started, I checked the logs on one of the machines and found only a large number of GetConnectionTimeoutException exceptions; there were no exceptions thrown by the druid producer thread.

Figure 4: Consumer thread throws an exception

The exception thrown by the consumer thread carries two pieces of data related to the producer, from which the producer's state can be inferred. The first is creating, which indicates that the producer is currently creating a connection. The second is createElapseMillis, which is how long the producer has been creating the current connection at the moment the consumer times out. In the figure above, createElapseMillis is about 900 seconds, which is clearly abnormal and indicates that the producer thread is probably blocked. Based on this information, I gave a preliminary conclusion:

The producer thread is stuck, most likely because no timeout is configured for connection creation. The problem can be fixed by appending a connectTimeout parameter to the database URL.

The investigation could seemingly have stopped here, but many questions remained unanswered: which method exactly was the producer stuck in? Would configuring this parameter really help, and could that be verified? Without answers to these questions, the conclusion was obviously not convincing. So more people joined the investigation afterwards, and more and more information was collected.

2.2.2 In-depth investigation

At this point our DBA had started communicating with Alibaba Cloud technical support and got the reply that the RDS physical machine had failed, which triggered the automatic active/standby switchover mechanism. Meanwhile, other colleagues went through the error logs of more machines in detail and found that the producer thread had also thrown an exception.

Figure 5: The druid producer thread throws a network exception

The strange thing was that the logs of two machines contained no producer exceptions at all. I later asked the person in charge of the application and learned that he had restarted the application at 14:10, and the service recovered before the remaining machines could be restarted. Unfortunately, the logs I had checked at the beginning came from one of the restarted machines. This highlights the importance of collecting information broadly when troubleshooting. Besides the producer thread, some consumer threads also threw the same exception:

Figure 6: Consumer thread throws network exception

As above, the producer thread and some consumer threads all threw exceptions after being stuck for more than 900 seconds. After consulting TCP-related material, the colleagues investigating with me concluded that this duration should be determined by the maximum timeout of the underlying network stack. The inference was that after the MySQL physical machine failed, it no longer responded to any packets from the client; after many retransmissions, the client machine's network stack gave up and closed the connection on its own, and the upper-layer application then threw a SocketException. From the documentation we learned that TCP retransmission is controlled by the kernel parameter net.ipv4.tcp_retries2, which the official Linux documentation describes as follows:

tcp_retries2 - INTEGER

This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.

The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.

Figure 7: Linux TCP timeout retransmission situation; picture source: Marco Pracucci
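As a quick sanity check on the 924.6-second figure (a back-of-the-envelope calculation, assuming the commonly documented TCP_RTO_MIN of 200 ms and TCP_RTO_MAX of 120 seconds), the hypothetical timeout is simply the sum of the N + 1 = 16 exponentially growing RTO intervals:

public class Tcpretries2Timeout {
    public static void main(String[] args) {
        double rto = 0.2;        // TCP_RTO_MIN, in seconds
        double rtoMax = 120.0;   // TCP_RTO_MAX, in seconds
        double total = 0.0;
        for (int i = 0; i < 16; i++) {        // N + 1 = 16 RTO intervals for tcp_retries2 = 15
            total += Math.min(rto, rtoMax);   // each wait doubles but is capped at TCP_RTO_MAX
            rto *= 2;
        }
        // prints "hypothetical timeout: 924.6 s", matching the kernel documentation
        System.out.printf("hypothetical timeout: %.1f s%n", total);
    }
}

This also lines up with the producer and consumer threads being stuck for a bit more than 900 seconds before the SocketException surfaced.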

After this round of investigation many things could be explained, and it seemed the case could be "closed". But was that really so? Since we had no packet capture from the time of the incident, nobody could be certain that this was the cause. So next, I set out to reproduce the failure based on this conclusion.

3. Fault reproduction

3.1 Preparation

For a more faithful reproduction, we deployed the business application on an ECS instance with the same configuration as production. On another ECS in the intranet we set up a MySQL server and synchronized production data to it. Since we cannot make a MySQL physical machine fail, we used iptables firewall rules to simulate the network-level symptoms of a machine failure, such as not responding to client requests. Finally, we used arthas to monitor the druid producer thread and observe how long connection creation takes.

3.2 Introduction to background knowledge

Before moving on, let's first walk through how the MySQL driver establishes a connection with the server. The schematic diagram is as follows:

Figure 8: MySQL client and server communication process

A brief explanation: when a business application establishes a MySQL connection, it must first complete the TCP three-way handshake. Once the TCP connection is up, an application-layer handshake is performed on top of it, mainly to exchange some information and carry out login authentication. The server's handshake packet mainly carries the server version, the authentication methods the server accepts, and other data used for authentication. After receiving the server's handshake packet, the client encrypts the user name and password for login authentication, and the server then returns the authentication result. Let's look at the actual interaction between the two sides below.

Client IP: 172.***.***.141

Server IP: 172.***.***.251

To capture packets on the client machine, the command is as follows:

tcpdump -i eth0 host 172.***.***.251 -w mysql.pcap

The results of the packet capture are as follows:

Figure 9: Packet capture result 1 of the MySQL connection establishment process

Figure 10: Packet capture result 2 of the MySQL connection establishment process

Later on we will use the server version string (Version) as the characteristic string of the server's handshake packet, and configure firewall rules that match on it.

3.3 Fault reproduction

In this section, besides reproducing the suspected failure scenarios, we also need to verify whether our fix is effective. Before reproducing anything, we modify the druid configuration so that druid closes idle connections quickly, which makes it easier to observe the connection creation process.

<bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" init-method="init" destroy-method="close">
    <!-- 不配置超时时间,与线上配置保持一致 -->
    <property name="url" value="jdbc:mysql://172.***.***.251:3306/***db" />
    <!-- 最小空闲连接数 -->
    <property name="minIdle" value="0"/>
    <!-- 初始连接数 -->
    <property name="initialSize" value="0"/>
    <!-- 每隔1.5秒检测连接是否需要被驱逐 -->
    <property name="timeBetweenEvictionRunsMillis" value="1500" />
    <!-- 最小驱逐时间,连接空闲时间一旦超过这个时间,就会被关闭 -->
    <property name="minEvictableIdleTimeMillis" value="3000" />
</bean>

3.3.1 Fault scenario inference

Judging from the exception thrown by the producer thread, the producer got stuck during the application-layer handshake with the MySQL server. The exception log is as follows:

Figure 11: Exception thrown by druid producer thread

From this error log we can determine that the problem occurred during the MySQL application-layer handshake, after the TCP connection had been established. The reasoning is that the code had already reached the application-layer handshake stage, so the TCP-level connection must have been completed; otherwise we would see something more like a connection failure error. Strictly speaking, we only need to verify the scenario where the application-layer handshake between the MySQL client and the server fails, but verifying a few other scenarios as well helps deepen our understanding of the underlying network behavior. Below we simulate three scenarios:

Scenario 1: The server does not respond to the client's SYN segment, so the TCP three-way handshake cannot complete

Scenario 2: The TCP connection is established normally, but the server does not perform the application-layer handshake with the client

Scenario 3: A connection already established in the pool can no longer communicate with the server, causing consumer threads to block

3.3.2 Failure reproduction and verification

3.3.2.1 Failure scenario one

Expected phenomenon: TCP packets are dropped and the client keeps retransmitting; the druid producer thread is blocked and the database connection cannot be established

Step 1: Simulate the scenario

Log in to the MySQL server and set up a firewall rule that discards all packets from the client:

iptables -I INPUT -s 172.***.***.141 -j DROP

Step 2: Trigger database access and observe the phenomenon

Log in to the business machine and use arthas to observe the execution of the druid producer thread. The command is as follows:

trace com.alibaba.druid.pool.DruidAbstractDataSource createPhysicalConnection -n 1000 -j

Trigger a database access. Since there is no connection in the pool, druid will create one. The observed behavior is as follows:

Figure 12: Underlying network failure, druid new connection time-consuming situation

As can be seen, when the TCP connection cannot be established, druid's DruidAbstractDataSource#createPhysicalConnection method takes about 127 seconds before throwing an exception. Remove the firewall rule on the server and observe how long creating a new connection takes under normal conditions:

iptables -D INPUT -s 172.***.***.141 -j DROP

Figure 13: Under normal circumstances, druid's new connection time-consuming situation

As above, the difference in elapsed time between the normal and abnormal cases is obvious. So where does the 127-second figure come from? Let's analyze it below.

Abnormal situation analysis

During TCP connection establishment, if the first SYN segment is lost (no ACK is received from the peer), Linux retransmits it and gives up after a certain number of retries. The number of SYN retransmissions is controlled by the kernel parameter tcp_syn_retries, which defaults to 6. The kernel documentation describes it as follows:

tcp_syn_retries - INTEGER

Number of times initial SYNs for an active TCP connection attempt

will be retransmitted. Should not be higher than 127. Default value
is 6, which corresponds to 63 seconds till the last retransmission
with the current initial RTO of 1 second. With this the final timeout
for an active TCP connection attempt will happen after 127 seconds.

Source: ip-sysctl.txt
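To see where a number like 127 comes from: with the documented initial RTO of 1 second, each successive wait doubles, so the waits after the initial SYN and its 6 retransmissions add up to 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds before the attempt is abandoned.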

Note that the last sentence of the documentation matches the result we got with the arthas trace command almost exactly. Let's capture packets and take another look:

Figure 14: TCP SYN retransmission situation

As above, the first packet is the client's initial SYN, and the next 6 packets are all retransmissions. By the time the eighth packet was sent, exactly 127 seconds had passed, which is consistent with the kernel parameter description. At this point, the case where the TCP connection cannot be established is clear. Since our application was stuck for more than 900 seconds, a TCP handshake failure can clearly be ruled out. Still, even though this is not the cause, a druid producer thread stuck for 127 seconds is dangerous in its own right. How can we prevent it?

Preventing connection timeout problems

By configuring the connectTimeout parameter, the time spent on TCP SYN retries can be bounded. The configuration is as follows:

jdbc:mysql://172.***.***.251:3306/***db?connectTimeout=3000

Initiate another database request to verify that the configuration is valid.

Figure 15: After configuring the connectTimeout parameter, the druid producer thread timeout situation

As shown in the figure above, the druid producer thread's call to DruidAbstractDataSource#createPhysicalConnection now takes 3 seconds, which is in line with expectations and shows that the configuration works. The exception the client throws is as follows:

Figure 16: Details of the exception thrown by the client

3.3.2.2 Failure scenario two

This time we simulate the situation where the underlying network is normal but the application layer is not, i.e. the MySQL server does not perform the handshake with the client.

Expected phenomenon: the druid producer thread is stuck for about 900 seconds and returns with an error

Step 1: Simulate the scenario

Configure a firewall rule on the MySQL server that blocks the MySQL handshake packet from going out.

# Use the iptables string module to match the handshake packet by its version string
iptables -I OUTPUT -m string --algo bm --string "5.7.34-log" -j DROP

View firewall rules

Figure 17: List of firewall rules

Step 2: Observe the phenomenon

This time we observe how long the druid producer thread has been creating a connection through the exception log thrown by a consumer thread. The exception log is as follows:

Figure 18: Consumer thread exception log

As above, the consumer thread's exception shows that the producer thread had already been blocked for 1247 seconds, and that value kept growing over time. It was far longer than 900 seconds, which did not match expectations, and we did not dig into the reason at the time. Still, we seemed to have found a problem even more serious than the original fault, so it looked like we could stop here.

Preventing the MySQL application-layer handshake from hanging

Since connectTimeout only applies to establishing the connection at the network layer and has no effect on application-layer communication, we need a different parameter this time: socketTimeout. The configuration is as follows:

jdbc:mysql://172.***.***.251:3306/***db?connectTimeout=3000&socketTimeout=5000

We set socketTimeout to 5000 milliseconds to verify that the client times out and returns after 5 seconds when the server does not perform the handshake. The arthas monitoring output is as follows:

Figure 19: arthas trace monitoring

The result is in line with expectations: the druid producer thread reported an error and returned after 5 seconds. The error message is as follows:

Figure 20: Druid producer thread timeout returns

As above, configuring socketTimeout ensures that druid times out and returns when the server does not perform the handshake.
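For completeness, the same two parameters can also be applied programmatically instead of in the URL. Below is a minimal sketch (credentials are placeholders; connectionProperties is druid's generic way of passing extra properties through to the JDBC driver):

// connectTimeout bounds the TCP connect phase; socketTimeout bounds every
// subsequent read, including the application-layer handshake (both in milliseconds)
DruidDataSource ds = new DruidDataSource();
ds.setUrl("jdbc:mysql://172.***.***.251:3306/***db");
ds.setUsername("app_user");   // placeholder
ds.setPassword("app_pass");   // placeholder
ds.setConnectionProperties("connectTimeout=3000;socketTimeout=5000");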

3.3.2.3 Failure scenario three

The last scenario reproduces the case where the MySQL physical machine fails and no longer responds to the client's SQL requests on an established connection.

Expected phenomenon: the client keeps retransmitting until it times out and returns after about 924 seconds

To simulate this scenario, discard all packets from the client on the MySQL server; the firewall rule is the same as above. Let's look directly at the result.

Figure 21: Consumer thread timeout returns

As above, the consumer thread's timeout is very close to 924 seconds, so we can basically conclude that the connection timed out after many failed retransmissions. The packet capture confirms it:

Figure 22: TCP connection retry situation

The timestamp of the last packet confirms that our judgment is correct. As the capture shows, a total of 16 retransmissions were performed. As mentioned earlier, TCP non-SYN retransmission is governed by the kernel parameter net.ipv4.tcp_retries2, which defaults to 15. So why 16 retransmissions? Briefly, the value 15 is used to compute the total timeout (the hypothetical 924.6 seconds derived earlier) rather than being a literal retransmission count; if you are interested in the details, the kernel documentation and the references at the end are a good starting point.

The prevention for scenario three is the same as for scenario two, namely configuring socketTimeout, so I won't repeat it here.

3.4 Summary

At this point the whole reproduction process is complete. Reproducing the fault seems to confirm that our earlier conclusion was correct, and we also verified that the connectTimeout and socketTimeout settings behave as expected. However, during the reproduction the producer was stuck far longer than 900 seconds, which did not match the original failure and remained puzzling. At the time everyone felt this looked even more serious than the original problem and could be avoided with socketTimeout anyway, so the cause was not investigated further. With that, the investigation wrapped up; the DBAs then pushed every application to configure the two timeout parameters, and the matter came to an end for the time being.

4. Exploring again

A week before I started writing this article, I spent some time going back over the troubleshooting notes and error logs, and one thing kept nagging at me: the exceptions thrown by the producer and some consumer threads in the business logs say "Connection reset", not the "Read timed out" we saw during the reproduction. A connection reset and a read timeout are clearly not the same thing, so what was going on? Was there some difference between our reproduction environment and Alibaba Cloud RDS that made the reproduced behavior diverge from reality? So I looked up some material on Alibaba Cloud RDS, and one architecture diagram caught my attention.

Figure 23: Alibaba Cloud RDS High Availability Architecture; Source: Cloud RDS Technical White Paper

As the diagram shows, the environment we built for the reproduction lacks the SLB. But the SLB is just a load balancer; would it break the connection? The honest answer at that point was: I didn't know. So I went looking for the circumstances under which RST packets are generated. There are many; some of them are as follows:

  1. The target port is not listening
  2. One of the two communicating parties restarts after a crash or power failure, which essentially comes down to the target port no longer listening
  3. A firewall on the communication path terminates the connection according to its policy
  4. The server's listen queue is full and it can no longer accept new connections (the server may also simply drop the SYN and let the client retry)
  5. TCP buffer overflow

With these RST causes in mind, let's reason about what Alibaba Cloud RDS might do after such a failure. Considering that their engineers also need to investigate the problem, the most likely strategy is to simply remove the faulty machine from the SLB without restarting it, preserving the scene. If so, the faulty machine would not send any RST. Other strategies, such as killing the MySQL process or shutting the machine down, would send a FIN and close the connection normally; I verified this by capturing packets. At this point the train of thought broke down again: I really could not figure out what would send an RST after more than 900 seconds. The only remaining suspect was the SLB, but the SLB is, after all, just a load balancer, and if it cuts connections at all it should do so based on policy, like a firewall. With no other leads, I could only focus on the SLB. Since I was not familiar with it, I went looking for information yet again, and this time something looked promising.

At first I searched Google for SLB RST and found nothing useful. Since SLB is implemented on top of LVS, I replaced SLB with LVS and kept searching, and this time I found something useful. From the article "Load Balancing Timeout Time" I learned that Alibaba Cloud SLB interrupts a connection once it has been idle longer than a configured threshold. I then went to the official SLB documentation and found the information I was looking for in a FAQ.

Figure 24: Alibaba Cloud official document about load balancing FAQ

Now consider this scenario: the faulty machine's underlying network is fine, but MySQL at the application layer is effectively dead and does not respond to client requests. Once the TCP connection has been idle for more than 900 seconds, the SLB sends the client an RST to tear the connection down. I was delighted when I read that, so I bought two ECS instances and one SLB on Alibaba Cloud to verify it, keeping the SLB idle timeout at its default of 900 seconds. On one ECS I used the nc tool to start a TCP server; on the other I used nc to connect to that server, not directly but through the SLB. The schematic diagram is as follows:

Figure 25: SLB verification environment
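For readers who prefer code to nc, the client side of this check can also be sketched in Java (the SLB address, listener port, and the 900-second idle window are assumptions based on the default configuration):

import java.io.OutputStream;
import java.net.Socket;

public class SlbIdleProbe {
    public static void main(String[] args) throws Exception {
        // connect to the nc server through the SLB listener, not the backend ECS directly
        try (Socket socket = new Socket("slb.example.internal", 80)) {
            // stay completely idle for longer than the 900-second SLB idle timeout
            Thread.sleep(920_000);
            // if the SLB has torn the connection down with an RST, this write
            // (or the next read) fails with "Connection reset"
            OutputStream out = socket.getOutputStream();
            out.write("ping\n".getBytes());
            out.flush();
            System.out.println("connection still alive");
        }
    }
}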

After setting up the environment, we captured packets to verify the behavior.

Figure 26: SLB timeout mechanism verification

At this point I had the result I wanted, and it seemed we could conclude that our application's connections were torn down by an RST from the SLB after being idle for more than 900 seconds. However, further verification showed that when a connection is idle, the TCP layer on the RDS side enables keepalive and sends a probe packet roughly every minute to keep the connection alive, in which case the SLB will not proactively cut the connection, unless the faulty machine has been isolated and can no longer talk to the SLB. After that I basically gave up: if the RST really came from the SLB, the connection must have been genuinely idle, with no keepalive traffic at all. Lacking packet captures from the incident and knowledge of how Alibaba Cloud handles faulty machines internally, I cannot give a more solid conclusion for now. Let this conclusion stand as a stopgap for the time being; I would also welcome everyone's ideas so we can discuss and learn together.

5. Summary and reflections

This investigation took a great deal of time and energy, and even so I have not found the final root cause, which I find rather regrettable. Still, whatever the outcome, the process taught me a lot: I learned plenty of new things and also discovered some of my own shortcomings. Below is a brief wrap-up of the incident.

First of all, this failure was very sporadic; it was the first time something like this had happened in the several years the company has been using Alibaba Cloud services. Secondly, the cause of the MySQL server failure was itself complicated: according to Alibaba Cloud's reply, the active/standby switchover was triggered because a file-system check on the physical machine hosting the RDS instance turned up an anomaly. So for us, the cost of reproducing the failure was very high. The failure was also closely tied to the network, and unfortunately we have no packet captures from the time; given the high cost of reproduction and the fact that it may not reproduce at all, that key information is simply missing. As a result, we do not know what actually happened at the network level, for example whether retransmissions occurred or whether the server sent TCP keepalive packets. Finally, constrained by all of the above, we could only collect whatever information was available, make guesses that fit the symptoms, and then verify the plausibility of each guess one by one. Of course, limited by the boundaries of our knowledge, there may be situations we have not considered, or our current conjectures may simply be unreasonable; corrections are welcome.

Because of my limited experience, this investigation also exposed quite a few problems. I summarize them at the end of the article in the hope that they provide some reference.

  1. In the early stage of the investigation, information was not collected broadly enough; guesses made and conclusions drawn at that point are unreliable
  2. Too much attention was paid to individual symptoms, so I got lost in the details and failed to look at the whole picture, doing a lot of useless work with nothing to show for it
  3. My knowledge of computer networking was insufficient, so I was blind to some obvious signs and made some wrong guesses; this is knowledge I need to catch up on
  4. Some important differences were ignored: what appeared during the reproduction was a timeout, while the real incident was a connection reset, and this discrepancy was not investigated at the time
  5. There is only one truth: if some phenomena cannot be explained reasonably, the root cause has not been found, and it should not be papered over with a forced explanation

This is the end of this article, thanks for reading.

Reference content

  1. Linux TCP_RTO_MIN, TCP_RTO_MAX and the tcp_retries2
  2. ip-sysctl.txt
  3. about the number of
  4. The troubleshooting knowledge you need to master
  5. Cloud Database RDS Edition Technical White Paper
  6. Load balancing timeout problem
  7. Cloud Load Balancing Service FAQ
  8. TCP RESET/RST Reasons
This article is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license; please credit the source when reposting.
Author: Tian Xiaobo
Original articles are published first on my personal website, welcome to visit: https://www.tianxiaobo.com

