1. Introduction
This article records the process of troubleshooting an occasional slow-interface problem. The symptoms were quite misleading, and because the problem occurred only rarely, it was also hard to investigate. The investigation started from a slowdown that appeared to be "caused" by the druid data source. After analyzing it step by step and ruling out many possibilities, we still had no answer. Only after re-examining the failure symptoms and approaching them from a different angle did we find the root cause. Finally, the cause was verified by reproducing the problem, and the result matched expectations. With that, the troubleshooting was complete, and this article records and archives the process.
2. Problem description
Some time ago, business colleagues reported that on one of the machines they use, several interfaces had responded slowly twice on two consecutive days (highly occasional). Each episode was short-lived and the interfaces quickly recovered. The Pinpoint call-chain information is as follows:
Figure 1: Long RT interface call chain information
The picture above is the call chain provided by the business colleagues. It shows a gap of 2111 milliseconds between MyBatis's SqlSessionTemplate#selectList method and the MySQL driver's ConnectionImpl#prepareStatement method, and this gap is what drove the interface RT up. The next step is to analyze this phenomenon.
3. Investigation process
3.1 Go straight to the topic
From the link information given by the full-link tracing system, the cause looks obvious: there is a long-running operation somewhere between selectList and prepareStatement. In theory, once we find out where that time went and why, the problem is solved. So, without further ado, let's go straight into the MyBatis source code.
public class SimpleExecutor extends BaseExecutor {

    public <E> List<E> doQuery(MappedStatement ms, Object parameter, RowBounds rowBounds, ResultHandler resultHandler, BoundSql boundSql) throws SQLException {
        Statement stmt = null;
        try {
            Configuration configuration = ms.getConfiguration();
            StatementHandler handler = configuration.newStatementHandler(wrapper, ms, parameter, rowBounds, resultHandler, boundSql);
            stmt = prepareStatement(handler, ms.getStatementLog()); // precompile the statement
            return handler.<E>query(stmt, resultHandler);
        } finally {
            closeStatement(stmt);
        }
    }

    private Statement prepareStatement(StatementHandler handler, Log statementLog) throws SQLException {
        Statement stmt;
        Connection connection = getConnection(statementLog); // obtain a database connection
        stmt = handler.prepare(connection, transaction.getTimeout()); // perform the precompilation
        handler.parameterize(stmt);
        return stmt;
    }
}
The call chain from MyBatis's SqlSessionTemplate#selectList is fairly long, so I won't walk through it step by step here; if you are interested, see my article MyBatis Source Code Analysis - SQL Execution Process. SqlSessionTemplate#selectList eventually reaches SimpleExecutor#doQuery, which calls SimpleExecutor#prepareStatement to precompile the SQL before running the rest of the logic. The prepareStatement method involves two external operations: obtaining a database connection, and invoking the MySQL driver to perform the precompilation.
From the monitoring in Figure 1, the precompilation itself is very fast, so we can rule it out. Attention now turns to the getConnection method, which ultimately requests a database connection from the underlying druid data source. Druid maintains its connection pool with a producer-consumer model; for a more detailed introduction, see my other article. If there is no available connection in the pool, the business thread has to wait for the druid producer to create one. If the producer creates connections slowly, or the number of active connections has reached the maximum, the business thread must wait. Fortunately, the application has druid monitoring enabled, so we can look at the state of the connection pool.
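For context, here is a minimal druid configuration sketch showing the pool parameters at play (the values and connection details are placeholders, not the application's actual settings):

import com.alibaba.druid.pool.DruidDataSource;
import java.sql.Connection;

public class DruidPoolSketch {
    public static void main(String[] args) throws Exception {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://rds-host:3306/db"); // placeholder connection info
        ds.setUsername("user");
        ds.setPassword("pass");
        ds.setInitialSize(5);   // connections created eagerly at startup
        ds.setMinIdle(5);       // idle connections the producer tries to keep around
        ds.setMaxActive(20);    // upper bound on connections in the pool
        ds.setMaxWait(3000);    // ms a borrower waits for a connection before timing out

        // The borrower blocks here until an idle connection is available,
        // the producer thread creates a new one, or maxWait elapses.
        try (Connection conn = ds.getConnection()) {
            // execute SQL ...
        }
        ds.close();
    }
}

maxActive caps how many connections can be lent out at once, and maxWait bounds how long a borrower waits; both come up again later in the investigation.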
Figure 2: Druid monitoring diagram
The chart above was redrawn in Excel and the data has been edited, so it deviates somewhat from the real situation, but that does not affect the analysis. From the monitoring, all idle connections in the pool were lent out and still weren't enough, so the producer kept creating new ones. That raised a question: why did the number of active connections suddenly shoot up? Perhaps a slow query. At the same time, the number of waiting business threads and their wait times rose sharply, which suggests the druid connection producer's "capacity" was insufficient, so some business threads had to wait. Under what circumstances would the producer struggle to create connections? The situations we considered at the time were as follows:
- There is a delay or packet loss on the network, causing retransmission
- The Alibaba Cloud RDS instance load is very high and cannot respond to the client's connection establishment request in a timely manner
Let's take the second situation first. The RDS load is easy to verify, since RDS monitoring is available on Alibaba Cloud. We checked the monitoring curves for the two incidents and found that the instance load was not high and showed no obvious fluctuation, so the second situation can be ruled out. What about situation 1: could the network be at fault? This conjecture is not so easy to rule out, for the following reasons:
Under normal circumstances the number of active connections is very low, and a sudden spike is generally abnormal. If a Java thread takes 5 ms from issuing a SQL request to receiving the data, a single connection can sustain about 200 QPS. The QPS of this machine is below 200, so it should not need that many connections unless there are slow queries. However, we found no slow SQL in Alibaba Cloud's SQL Insight service, so slow queries can be ruled out. If it isn't a slow query, it seems we can only blame the network: surely the network was acting up at the time (or so we thought).
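To make that back-of-the-envelope estimate concrete, here is the arithmetic as a small sketch (the 5 ms latency and the sub-200 machine QPS come from the paragraph above; everything else is illustrative):

public class PoolSizeEstimate {
    public static void main(String[] args) {
        double avgSqlLatencyMs = 5.0;   // assumed round trip per SQL, as estimated above
        double machineQps = 200.0;      // upper bound on this machine's QPS, as noted above

        // One connection handles one request every avgSqlLatencyMs milliseconds.
        double qpsPerConnection = 1000.0 / avgSqlLatencyMs;               // = 200
        // Little's-law style estimate: concurrent connections = arrival rate * service time.
        double connectionsNeeded = machineQps * avgSqlLatencyMs / 1000.0; // = 1
        System.out.printf("one connection sustains ~%.0f QPS, so ~%.0f connection(s) should be enough%n",
                qpsPerConnection, connectionsNeeded);
    }
}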
If there really were a network problem, the druid producer's insufficient "capacity" would be reasonably explained. But after further analysis we found some contradictions, as follows:
Figure 3: Link tracking data for a long RT request
From the link data, the SQL precompilation and execution are both very fast, with no obvious waiting. If that were a one-off it could be dismissed as coincidence, but we looked at a lot of link data and SQL precompilation and execution were consistently fast, with no obvious delay. So blaming the network doesn't hold up either.
At this point the trail went cold. No slow queries were found, the database had plenty of headroom, and the network seemed fine. Nothing was obviously wrong, so where was the problem?
3.2 Expanding the information area
3.2.1 Basic information
First, I checked the QPS of the problematic machine and found no burst traffic; although there were some fluctuations, it was stable overall.
Figure 4: QPS line chart
Then I looked at the application's CPU usage and noticed something: the usage had suddenly risen quite a lot.
Figure 5: Line graph of CPU usage
I asked the business colleagues: there were no scheduled tasks at that time, QPS was similar to usual, and nothing looked unusual, so they didn't know why the CPU suddenly rose so much. This information couldn't help for the moment, so I set it aside and looked at the network I/O monitoring.
Figure 6: Network traffic monitoring diagram
Both inbound and outbound traffic increased during the period when the problem occurred, with the outbound traffic rising sharply, indicating that a large amount of data was being sent out at the time. Which interface was responsible, however, was not yet clear.
3.2.2 Business log information
After analyzing the business log, we found some WARN level information:
# WARN log from a business thread: waiting for a connection timed out, retrying once
2021-07-20 09:56:42.422 [DubboServerHandler-thread-239] WARN com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1
2021-07-20 09:56:42.427 [DubboServerHandler-thread-242] WARN com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1
2021-07-20 09:56:42.431 [DubboServerHandler-thread-240] WARN com.alibaba.druid.pool.DruidDataSource [DruidDataSource.java:1400] - get connection timeout retry : 1
# Dubbo TimeoutFilter invoke-timeout log
2021-07-20 09:56:42.432 [DubboServerHandler-thread-254] WARN org.apache.dubbo.rpc.filter.TimeoutFilter [TimeoutFilter.java:60] - [DUBBO] invoke time out. method: xxx arguments: [yyy] , url is dubbo://172.***.***.***:20880/****
2021-07-20 09:56:42.489 [DubboServerHandler-thread-288] WARN org.apache.dubbo.rpc.filter.TimeoutFilter [TimeoutFilter.java:60] - [DUBBO] invoke time out. method: **** arguments: [***, ***] , url is dubbo://172.***.***.***:20880/****
To be honest, these two kinds of log entries don't help much, because they are consequences rather than causes and were bound to appear. Apart from the WARN entries, no exceptions were found in the business log. One thing worth noting: we never configured a retry count for acquiring connections, and the default is 0, yet a retry happened here. We expected that when a business thread fails to get a connection within the configured time, a GetConnectionTimeoutException would be thrown. This looks like a bug in druid's code; I filed an issue (#4381), but nobody responded. The fix is simple and would make a good first issue, so anyone interested is welcome to submit a Pull Request.
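For reference, the behavior we expected is roughly the following; a sketch assuming druid's GetConnectionTimeoutException, with the data source set up as sketched earlier:

import com.alibaba.druid.pool.DruidDataSource;
import com.alibaba.druid.pool.GetConnectionTimeoutException;
import java.sql.Connection;
import java.sql.SQLException;

public class ConnectionTimeoutDemo {
    public static void main(String[] args) {
        DruidDataSource ds = new DruidDataSource();
        // ... url/username/password/maxWait configured as in the earlier sketch
        try (Connection conn = ds.getConnection()) {
            // use the connection
        } catch (GetConnectionTimeoutException e) {
            // expected failure when the pool is exhausted, maxWait elapses, and retries are 0
            System.err.println("timed out waiting for a connection: " + e.getMessage());
        } catch (SQLException e) {
            System.err.println("other JDBC failure: " + e.getMessage());
        }
    }
}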
There is not much usable information in the business log, so we kept looking for other clues. This time the clue came from JVM monitoring.
3.2.3 GC information
During the short period when the problem occurred, the JVM performed multiple YGCs and FGCs, and some of the GCs took a long time.
Figure 7: GC frequency and time-consuming monitoring
So now the question is: is the GC the cause or the effect? Both the GCs and the slow interface calls occurred after 9:56:30, so the two overlap in time and it is unclear which affected which, or whether they affected each other. Bringing GC into the picture seemed to make the investigation more complicated.
That was about all the information we could collect, and we still could not infer the cause of the problem, which was quite frustrating.
3.3 Revisit
3.3.1 Comprehensive analysis
Continuing to dig in the druid direction was unlikely to turn up the cause. Time is precious, so it was worth rethinking the information collected so far. A GC occurred, which means the application's memory consumption must have risen. Combined with the network I/O, which fluctuated considerably within a short period, this seems partly explainable; but the traffic at the time was not particularly large and hardly seemed enough to trigger GC, so that theory was set aside for lack of support. In addition, the application's QPS did not change much, yet the CPU load suddenly rose a lot, and several GCs took a long time; the two may be related, that is, long GCs drove up the CPU load. That left two directions to investigate: network I/O and GC. Given the symptoms, GC looked like the bigger suspect, so we chose to pursue the GC direction.
3.3.2 Another way of thinking
A GC means memory usage must have gone up, and memory growth is cumulative. If we wind the investigation window back one minute from 9:57:00, when the long GC occurred, we might find something. So I went to the full-link tracing platform, pulled the list of long-RT calls for the problem application during the minute starting at 9:56:00, sorted by duration in descending order, and found that the dozen or so slowest calls were all to the queryAll method, as follows:
Figure 8: QueryAll method time-consuming reverse sequence list
Let's take a look at the call chain information of the longest request:
Figure 9: Link information for the longest request
First, the execute call in the MySQL driver takes extremely long; the reason is analyzed later. Second, the Redis read is very fast and fine, but the Redis client takes a very long time to write data, which is highly abnormal.
So I immediately asked the application developer for code permissions and analyzed the code. The pseudo code is as follows:
public class XxxService {

    // Note: the caching in the real method actually uses Spring's @Cacheable annotation
    // rather than operating on the cache explicitly; it is spelled out here for clarity.
    public List<XxxDTO> queryAll(int xxxId) {
        // 1. Check the cache; return immediately on a hit
        List<XxxDTO> xxxDTOs = customRedisClient.get(xxxId);
        if (xxxDTOs != null) {
            return xxxDTOs;
        }
        // 2. Cache miss: query the database
        List<XxxDO> xxxDOs = XxxDao.queryAll(xxxId);
        xxxDTOs = convert(xxxDOs);
        // 3. Write the result back to the cache with a 300-second TTL
        customRedisClient.set(xxxId, xxxDTOs, 300);
        return xxxDTOs;
    }
}
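As the comment above notes, the real method reportedly used Spring's @Cacheable rather than explicit cache calls; a rough equivalent is sketched below (the cache name, key expression, and cache-manager setup are assumptions, not the application's actual configuration):

@Service
public class XxxServiceImpl {

    @Resource
    private XxxDao xxxDao;

    // Cache hit: Spring returns the cached list and the method body never runs.
    // Cache miss: the body runs, and the returned list is serialized and written to Redis.
    @Cacheable(cacheNames = "xxxQueryAll", key = "#xxxId")
    public List<XxxDTO> queryAll(int xxxId) {
        return convert(xxxDao.queryAll(xxxId)); // convert(...) maps DO -> DTO as in the pseudo code
    }
}
// The 300-second expiry lives on the cache manager rather than on the annotation, e.g.
// RedisCacheConfiguration.defaultCacheConfig().entryTtl(Duration.ofSeconds(300)).

Either way, the logic is the same: check the cache, fall back to the database on a miss, and write the result back.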
What the code does is very simple, so why is it so slow? The reason is this: when the cache misses, queryAll pulls tens of thousands of rows from the database in one go, and the table contains some large fields such as business rules and contact addresses, so a single row is fairly big. On top of that, the data goes through two rounds of model conversion (DO → DTO → TO); about 15,000 DTOs come out of the conversion, more than half the number of original records, and the number of TOs equals the number of DTOs. After the conversion, the result is written to the cache, which involves serialization. So a single queryAll call creates roughly five large chunks of data in the young generation: the JDBC ResultSet, the DO list, the DTO list, the TO list, and the serialized form of the DTO list. Add to that more than 20 such calls within two seconds, and memory consumption is greatly amplified, which should explain why the number of GCs suddenly spiked. A few remaining questions are answered below in FAQ form:
Q: How do we explain the long GC pauses?
A: My guess is that they were caused by the garbage collector having to copy and compact a large amount of live data.
-------------------------------------------------✂-------------------------------------------------
Q: Why is there such a long interval between the execute method and the set method?
A: My current guess is that the model conversion and the serialization themselves take some time, and several serializations were probably running concurrently during that period, but that alone doesn't explain such a long gap. If we factor in GC, however, the explanation becomes reasonable: between 9:56:33 and 9:56:52 there were multiple GCs, some with long stop-the-world pauses, which produced the very long gap between the two methods. In fact, if we look at the call chain of the first queryAll request at 9:56:31, the gap is not that long:
Figure 10: The time-consuming situation of queryAll under normal conditions
Therefore, the excessively long gap between the execute and set methods in the later call chains can be attributed to high CPU usage and GC pauses.
-------------------------------------------------✂-------------------------------------------------
Q: Why is the interval between the first set and the second set so long?
A: The first set is our own custom logic; there seemed to be nothing special happening before the second set, and we couldn't spot the problem at the time. Fortunately, a clue turned up when the problem was reproduced, which is explained in a later section.
-------------------------------------------------✂-------------------------------------------------
Finally, back to the original problem: the RT increase on several interfaces reported by the business colleagues. Again in FAQ form:
Q: Why did the RT of multiple interfaces rise? See the call chain in the figure below:
Figure 11: Link information of a long RT interface
A: Some business threads were waiting for druid to hand them a database connection. When a GC occurs it stops the world, and that interferes with timed waits: for example, if a thread calls awaitNanos to wait up to 3 seconds and a 5-second GC pause (STW) occurs in the meantime, the thread finds itself already timed out the moment the GC ends. In the druid data source, the maxWait parameter controls how long a business thread waits; the code is as follows:
public class DruidDataSource extends DruidAbstractDataSource {

    private DruidPooledConnection getConnectionInternal(long maxWait) throws SQLException {
        // ...
        final long nanos = TimeUnit.MILLISECONDS.toNanos(maxWait);
        if (maxWait > 0) {
            // try to obtain a connection from the pool within the configured time
            holder = pollLast(nanos);
        } else {
            holder = takeLast();
        }
        // ...
    }
}
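The pollLast path boils down to a timed wait on a condition. The simplified sketch below (not druid's actual code; `connections` is assumed to be the pool's deque of idle connections and `notEmpty` a Condition signalled by the producer) shows why a long stop-the-world pause makes the borrower time out the moment the JVM resumes:

private DruidConnectionHolder pollLast(long nanos) throws InterruptedException {
    long estimate = nanos; // remaining wait budget, e.g. maxWait = 3 seconds
    while (connections.isEmpty()) {
        if (estimate <= 0) {
            return null; // the caller treats this as "get connection timeout"
        }
        // awaitNanos measures elapsed wall-clock time. If a 5-second GC pause (STW)
        // happens while the thread is parked here, it wakes up after the pause with
        // the budget already exhausted and times out immediately.
        estimate = notEmpty.awaitNanos(estimate);
    }
    return connections.pollLast();
}

So the wait did not time out because druid was slow to produce connections; it timed out because the clock kept running while every thread in the JVM, producer included, was paused.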
3.4 Preliminary conclusion
After the investigation above, many of the phenomena can be reasonably explained; it is time to state a conclusion:
In application xxx, multiple interfaces saw elevated RT twice on two consecutive days. The preliminary conclusion is that the cache for the queryAll method expired, and within a short period dozens of requests performed large database queries, model conversions, and serialization, consuming a great deal of memory and triggering several long GCs. As a result, some business threads timed out waiting for connections, which in turn filled up the Dubbo thread pool.
Next, we need to reproduce the problem to prove that this conclusion is correct.
4. Reproducing the problem
To reproduce the problem, we need to simulate the conditions at the time as closely as possible, otherwise the result may deviate too much. To simulate the interface call pattern more accurately, I wrote a piece of verification logic that can control both the QPS and the total number of requests.
// Verification endpoint: drives queryAll at a fixed QPS until `total` requests have been submitted.
// Guava's RateLimiter controls the request rate; the web-layer annotations are assumed here,
// since the original pseudo code omitted them.
@RestController
public class XxxController {

    @Resource
    private XxxApi xxxApi;

    @GetMapping("/invokeQueryAll")
    public Object invokeQueryAll(Integer qps, Integer total) {
        RateLimiter rl = RateLimiter.create(qps.doubleValue());
        ExecutorService es = Executors.newFixedThreadPool(50);
        for (int i = 0; i < total; i++) {
            es.submit(() -> {
                rl.acquire(); // block until a permit is available, keeping the rate at `qps`
                xxxApi.queryAll(0);
            });
        }
        es.shutdown(); // let the submitted tasks finish, then release the pool
        return "OK";
    }
}
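Assuming the controller above is exposed over HTTP (the mapping and host in the command are hypothetical), the reproduction can be driven with something like:

curl 'http://app-host:8080/invokeQueryAll?qps=15&total=30'

The qps and total values here are illustrative, chosen to approximate the 20-odd queryAll calls within two seconds observed in the incident.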
The reproduction matched expectations: CPU usage and network I/O both rose (the monitoring belongs to the company's internal system, so no screenshots here), GCs occurred, and some of them took a long time. The GC log is as follows:
2021-07-29T19:09:07.655+0800: 631609.822: [GC (Allocation Failure) 2021-07-29T19:09:07.656+0800: 631609.823: [ParNew: 2797465K->314560K(2831168K), 2.0130187 secs] 3285781K->1362568K(4928320K), 2.0145223 secs] [Times: user=3.62 sys=0.07, real=2.02 secs]
2021-07-29T19:09:11.550+0800: 631613.717: [GC (Allocation Failure) 2021-07-29T19:09:11.551+0800: 631613.718: [ParNew: 2831168K->314560K(2831168K), 1.7428491 secs] 3879176K->1961168K(4928320K), 1.7443725 secs] [Times: user=3.21 sys=0.04, real=1.74 secs]
2021-07-29T19:09:13.300+0800: 631615.467: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1646608K(2097152K)] 1965708K(4928320K), 0.0647481 secs] [Times: user=0.19 sys=0.00, real=0.06 secs]
2021-07-29T19:09:13.366+0800: 631615.533: [CMS-concurrent-mark-start]
2021-07-29T19:09:15.934+0800: 631618.100: [GC (Allocation Failure) 2021-07-29T19:09:15.934+0800: 631618.101: [ParNew: 2831168K->2831168K(2831168K), 0.0000388 secs]2021-07-29T19:09:15.934+0800: 631618.101: [CMS2021-07-29T19:09:17.305+0800: 631619.471: [CMS-concurrent-mark: 3.668/3.938 secs] [Times: user=6.49 sys=0.01, real=3.94 secs]
(concurrent mode failure): 1646608K->1722401K(2097152K), 6.7005795 secs] 4477776K->1722401K(4928320K), [Metaspace: 224031K->224031K(1269760K)], 6.7028302 secs] [Times: user=6.71 sys=0.00, real=6.70 secs]
2021-07-29T19:09:24.732+0800: 631626.899: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1722401K(2097152K)] 3131004K(4928320K), 0.3961644 secs] [Times: user=0.69 sys=0.00, real=0.40 secs]
2021-07-29T19:09:25.129+0800: 631627.296: [CMS-concurrent-mark-start]
2021-07-29T19:09:29.012+0800: 631631.179: [GC (Allocation Failure) 2021-07-29T19:09:29.013+0800: 631631.180: [ParNew: 2516608K->2516608K(2831168K), 0.0000292 secs]2021-07-29T19:09:29.013+0800: 631631.180: [CMS2021-07-29T19:09:30.733+0800: 631632.900: [CMS-concurrent-mark: 5.487/5.603 secs] [Times: user=9.29 sys=0.00, real=5.60 secs]
(concurrent mode failure): 1722401K->1519344K(2097152K), 6.6845837 secs] 4239009K->1519344K(4928320K), [Metaspace: 223389K->223389K(1269760K)], 6.6863578 secs] [Times: user=6.70 sys=0.00, real=6.69 secs]
Next, let's look at the interface latency during that period:
Figure 12: QueryAll time-consuming reverse sequence list when the problem reappears
All the calls took a very long time, which matches expectations. Finally, look at the link information of one long-RT call:
Figure 13: Link information of a long RT interface
It is consistent with Figure 1, the problem originally reported by the business colleagues, which shows the reproduction matched expectations.
This verification proves that our conclusion is correct. The root cause has been found, and the problem can be closed and archived.
5. Further exploration
5.1 Time breakdown
For performance reasons, the link information given by Pinpoint is too coarse-grained to show how the time inside the queryAll method breaks down. To clarify the details, I measured the queryAll call path more precisely: trigger a request while the application load is relatively low, and use Arthas's trace command to time the internal calls.
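For reference, the trace invocation looks roughly like this (the class name, method name, and 200 ms threshold are placeholders, not the application's real names):

trace com.xxx.service.XxxService queryAll '#cost > 200' -n 5

The '#cost > 200' condition keeps only invocations slower than 200 milliseconds, and -n 5 stops tracing after five matches. The link information captured for the triggered request is as follows: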
Figure 14: queryAll method link information under normal conditions
I have numbered three method calls in the figure: there is a gap of 252 milliseconds between methods ① and ②, and a gap of 294 milliseconds between methods ② and ③. The call timings printed by Arthas are as follows:
Figure 15: Time-consuming measurement of queryAll method
The numbers here correspond one-to-one with the figure above, and the more expensive calls are highlighted in color. First, the composition of gap 1: there are two time-consuming operations between methods ① and ②. One is the batch model conversion from DO to DTO, which takes about 79.6 milliseconds. The other is Object#toString, which takes about 171.6 milliseconds. The two add up to 251.2 milliseconds, consistent with the figure given by the full-link tracing system. You may well wonder why toString takes so much time; the answer lies in the following code:
public void put(Object key, Object value) {
    // null checks omitted
    // convert key and value to strings first, then check whether they are blank
    if (StringUtils.isBlank(key.toString()) || StringUtils.isBlank(value.toString())) {
        return;
    }
    CacheClient.set(key, value);
}
This method sits on the call path between methods ① and ②. The code looks perfectly harmless, but the problem lies in the blank check (historical code; its rationale is not discussed here). When value is a very large collection or array, toString takes a long time, and it also produces a huge string that quietly eats memory. This seemingly innocuous piece of code is in fact a performance black hole, and it reminds us to pay attention to both time and space costs when handling large volumes of data.
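For comparison, a cheaper guard might look like the sketch below; it keeps the original intent of rejecting blank string keys and values but avoids calling toString on large objects (StringUtils and CacheClient are the same helpers used above):

public void put(Object key, Object value) {
    if (key == null || value == null) {
        return;
    }
    // Only strings can meaningfully be "blank"; don't stringify collections or arrays
    // just to test for blankness, since on a 15,000-element list that costs both CPU and heap.
    if (key instanceof String && StringUtils.isBlank((String) key)) {
        return;
    }
    if (value instanceof String && StringUtils.isBlank((String) value)) {
        return;
    }
    CacheClient.set(key, value);
}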
Finally, the gap between methods ② and ③: the reason is now obvious, namely that serializing the value takes a lot of time. In addition, the serialized byte array is held temporarily in the heap, which also consumes a fair amount of memory.
That concludes the analysis. As the above shows, one simple query can lead to a surprising number of problems; logic that used to look plain and ordinary can occasionally turn into a performance killer. In daily work we should keep an eye on application performance to keep the application running stably.
5.2 Memory breakdown
Since the JVM only performed FGC when the problem occurred and memory never actually overflowed, there was no heap dump from the original incident. Fortunately the problem reproduces reliably, and by configuring the JVM we can have it dump the heap automatically before an FGC. Here the jinfo command sets the parameter on the running JVM process:
jinfo -flag +HeapDumpBeforeFullGC $JVM_PID
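If restarting the JVM is an option, the same behavior can also be enabled at startup; a hedged equivalent (the dump path and jar name are illustrative):

java -XX:+HeapDumpBeforeFullGC -XX:HeapDumpPath=/data/dumps -jar xxx-app.jar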
With the heap dump in hand, let's analyze it with the MAT tool, starting with the leak suspects report:
Figure 16: Application memory leak analysis
In terms of memory share there are indeed some problems, mainly around the Dubbo threads. Pick one of those threads at random and look at the objects it retains in the dominator tree view:
Figure 17: Dubbo thread dominance tree situation
As the figure shows, the Dubbo thread's retained heap is about 66 MB, made up mainly of two large objects: an ArrayList and a StringBuilder. The ArrayList holds the DTOs, each with a retained heap of about 5.1 KB. The StringBuilder comes mainly from the toString call and accounts for nearly 26 MB. The DO list does not appear in the dump; it was presumably already collected by a YGC.
That wraps up the memory analysis; knowing where the memory goes is enough here, so I won't dig any deeper.
6. Summary
Due to my limited troubleshooting experience, this problem took quite a lot of time on and off. There was frustration when the cause eluded us, and delight when a conjecture panned out. In any case, the cause was found in the end, and in helping others I also learned a lot; overall, the effort was worth it. To close, here is a brief recap of the troubleshooting process.
At first I dove straight into the specific symptom, hoping to find its cause, and made various conjectures but could not reach a reasonable conclusion. Then I widened the information gathering, still to no avail. After pulling all the information together and rethinking it, I switched to a different direction and found the cause. Finally, I verified it and explained the remaining doubts, and the process ended.
Finally, a word on the problems in this investigation. The first is that I did not screen and quickly verify the information reported by others, but dove straight in and spent a lot of time on various conjectures with nothing to show for it. Information reported by others is usually fragmentary and sometimes even wrong; verify it quickly first, and with a bit of luck you may find the cause directly. If a given path clearly isn't working, don't tunnel-vision on it, because the symptom you are staring at may be only one of many. In this case the long interface RT looked like a druid problem, but it was actually caused by GC-induced STW pauses; had we kept digging in the druid direction, we would have ended up far off course. The other problems are minor and I won't go into them. One more thing: during an investigation, save the relevant logs and data promptly to avoid losing them to retention limits; for example, Alibaba Cloud RDS SQL Insight retains data only for a limited time.
This is the end of this article, thanks for reading!
This article is published under the Creative Commons License 4.0, please indicate the source for reprinting
Author: Tian Xiaobo
The original articles will be published to the personal website first, welcome to visit: https://www.tianxiaobo.com