Anyone who has read Refactoring: Improving the Design of Existing Code will be familiar with the idea of "code smells". Of course, what counts as a bad smell is subjective and varies with language, developer, and development method. At work, much of our time goes into maintaining existing projects and adding new features on top of them. To keep a project easy to understand and maintain, we should always watch for bad smells in the code and, when we find them, refactor promptly to turn the code into clean code. Today we will look at the impact of bad-smelling code on system performance. I will walk you through a few cases and hope they are inspiring and helpful.
FGC in practice: analyzing how bad code caused frequent Full GC and an unresponsive service
Problem
A network problem?
Starting at 7 p.m. I began receiving a steady stream of alarm emails showing that several of the probed interfaces had timed out. Most of the execution stacks looked like this:
java.io.BufferedReader.readLine(BufferedReader.java:371)
java.io.BufferedReader.readLine(BufferedReader.java:389)
java_io_BufferedReader$readLine.call(Unknown Source)
com.domain.detect.http.HttpClient.getResponse(HttpClient.groovy:122)
com.domain.detect.http.HttpClient.this$2$getResponse(HttpClient.groovy)
I have seen this kind of stack trace many times. Our HTTP DNS timeout is 1s, the connect timeout is 2s, and the read timeout is 3s. This kind of error means the probe service sent the HTTP request normally and the server processed it and responded normally, but the packet was lost somewhere in network-layer forwarding, so the request thread's stack stays at the point where it reads the response. The typical signature of this situation is that the corresponding log entries can be found on the server, and they show the server responded perfectly normally. By contrast, when the thread stack stays at the Socket connect, the connection itself failed to be established and the server is completely unaware of the request.
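The probe client in the stack above is Groovy, but for illustration here is a minimal Java sketch of the timeout configuration just described. The class and method names are hypothetical, and the 1s DNS timeout is normally configured at the resolver or HTTP-library level rather than on the connection itself.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ProbeClient {
    // Hypothetical helper: issue a GET with the connect/read timeouts mentioned above.
    public static String getResponse(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2s
        conn.setReadTimeout(3_000);    // read timeout: 3s, guards the readLine() calls in the stack above
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```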
I noticed that one interface reported errors far more often than the others. It uploads a 4 MB file to the server, runs it through a series of business logic, and then returns 2 MB of text data, while the other interfaces have simple business logic. My guess was that with so much data to upload and download, a timeout caused by packet loss was simply more likely for it.
Following this conjecture, I logged on to the server and searched the recent service logs with the request_id of a failed request. As expected, it looked like an interface timeout caused by network packet loss.
Of course, the leader would not be satisfied with that; a conclusion like this needs someone to answer for it. So I hurried to contact the ops and network teams to confirm the state of the network at the time. The network team replied that the switch in the data center hosting our probe service was old and had an unknown forwarding bottleneck that they were working to optimize, which actually put me more at ease.
The problem erupts
I thought this would be just a small blip on this on-call shift, but at 8 p.m. alarm emails from all kinds of interfaces came flooding in, catching me off guard just as I was packing up for a quiet Sunday off.
This time almost every interface was timing out, and our interface with heavy network I/O timed out on every single probe. Could the whole data center be having problems?
Checking the servers and the monitoring again, I saw that every interface's metrics looked normal, and calling the interfaces by hand worked fine. Since the online service was not affected, I planned to stop the probe tasks through the probe service's own interface and then investigate at leisure.
But the request I sent to the interface that suspends probe tasks got no response for a long time. At that point I realized things were not that simple.
Solving it
A memory leak
So I hurried to log in to the probe server and ran the usual trio of top, free, and df, and sure enough found something abnormal.
The CPU usage of our probe process was extremely high, reaching 900%.
Our Java process does not do heavy CPU computation; under normal circumstances CPU usage sits between 100% and 200%. When the CPU surges like this, the process is either stuck in an infinite loop or doing a lot of GC.
I used jstat -gc pid [interval] to check the GC status of the Java process, and sure enough, it was running a Full GC about once per second.
With that many Full GCs, a memory leak was the prime suspect. So I saved the thread-stack scene with jstack pid > jstack.log, saved the heap scene with jmap -dump:format=b,file=heap.log pid, then restarted the probe service, and the alarm emails finally stopped.
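As an aside, if running jmap against the process is not an option, the same kind of heap snapshot can be triggered from inside the JVM through the HotSpot diagnostic MXBean. This was not part of the original troubleshooting, just a small sketch; the output path is only an example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean mxBean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, like jmap's :live option
        mxBean.dumpHeap("/tmp/heap.hprof", true);
    }
}
```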
jstat
jstat is a very handy JVM monitoring tool. Its general usage is: jstat [-options] pid interval
The options it supports include:
- class: class loading information
- compile: JIT compilation statistics
- gc: garbage collection information
- gcXXX: detailed GC information for a specific region, e.g. -gcold
Using it is very helpful for locating JVM memory problems.
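If you prefer watching GC from inside the application rather than shelling out to jstat, a rough in-process counterpart (a sketch, not part of the original investigation) is the GarbageCollectorMXBean API:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Cumulative collection count and time per collector (young/old)
                System.out.printf("%s: count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(1000); // roughly like "jstat -gc pid 1000"
        }
    }
}
```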
Troubleshooting
Although the symptom was gone, to prevent it from happening again we still needed to find the root cause.
Analyzing the stack
Analyzing the thread stack is straightforward: check whether there are too many threads and what most of them are doing.
> grep 'java.lang.Thread.State' jstack.log | wc -l
> 464
Only a little over 400 threads, nothing abnormal there.
> grep -A 1 'java.lang.Thread.State' jstack.log | grep -v 'java.lang.Thread.State' | sort | uniq -c |sort -n
10 at java.lang.Class.forName0(Native Method)
10 at java.lang.Object.wait(Native Method)
16 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
44 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
344 at sun.misc.Unsafe.park(Native Method)
The thread states looked normal, so the next step was to analyze the heap file.
Downloading the heap dump file
The heap dump is binary data and painful to inspect on the command line, and the Java tools for it are all graphical and cannot be run on the Linux server, so the first step is to download the file to a local machine.
Since we set the heap to 4 GB, the dumped heap file is also very large and genuinely troublesome to download, but we can compress it first.
gzip is a powerful compression command; in particular, -1 through -9 set the compression level. The higher the level, the better the compression ratio but the longer it takes. I recommend -6 or -7: -9 is too slow for too little gain, and in the time it spends compressing you could have downloaded the extra bytes anyway.
Using MAT to analyze the JVM heap
MAT is a powerful tool for analyzing Java heap memory. Open the heap file with it (rename the suffix to .hprof) and it will ask what kind of analysis to run; for this problem, go straight for the memory leak suspects report.
The resulting pie chart showed that most of the heap was occupied by one and the same set of objects. Checking the details of that memory and tracing back up the reference chain quickly revealed the culprit.
Analyzing the code
Having found the leaking object, a global search for its name in the project showed that it was a Bean object, and one of its attributes of type Map was the problem.
This Map stored the response of each probed interface in an ArrayList, keyed by type; after every probe the result was stuffed into the ArrayList for later analysis. Since the Bean is never garbage collected and this attribute has no clearing logic, after ten days without an online restart the Map kept growing until it filled the heap.
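The article does not show the original Bean, but the leaking pattern presumably looks roughly like the sketch below; the class, field, and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the leak: a long-lived bean that
// accumulates every probe result and never evicts anything.
public class DetectResultHolder {
    // keyed by interface type; each value grows by one entry per probe, forever
    private final Map<String, List<String>> resultsByType = new HashMap<>();

    public void record(String type, String response) {
        resultsByType.computeIfAbsent(type, k -> new ArrayList<>()).add(response);
        // no size cap and no clearing logic -> unbounded growth on a singleton bean
    }
}
```

The fix is correspondingly simple: clear or bound the collection once each batch of results has been analyzed, or replace it with a cache that evicts old entries.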
Once memory was full, no more memory could be allocated for HTTP response data, so threads got stuck at readLine. And our interface with heavy I/O raised far more alarms, probably because its larger responses needed more memory.
I sent a PR to the code owner, and the problem was solved satisfactorily.
Summary
In fact, I still have to reflect on my own handling of this. At the very beginning, the alarm emails already contained a thread stack like this:
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:166)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
groovy.json.internal.JsonParserCharArray.decodeJsonObject(JsonParserCharArray.java:132)
groovy.json.internal.JsonParserCharArray.decodeValueInternal(JsonParserCharArray.java:186)
I saw this stack trace but did not think it through. TCP guarantees message integrity, and a variable is not assigned until the message has been fully received, so this was clearly an internal error of the service. If I had paid attention to it, I could have found the real problem ahead of time. When troubleshooting, you really cannot afford to be careless about any detail.
Source | https://zhenbianshu.github.io/
Troubleshooting a case where runaway JVM GC sent the CPU soaring
Background
The online web server would hang from time to time. Logging in and running top showed very high CPU, and after restarting Tomcat the CPU went back to normal, only for the same problem to recur after half a day or a day. To solve a problem you first have to find the point where it breaks, and an intermittent problem like this is very hard to pin down.
After restarting the server, all I could do was wait for it to happen again. My first suspicion was that some scheduled task was doing heavy computation, or that some request was stuck in an infinite loop, so I went through all the suspicious places in the code and added a bit of logging. The problem reappeared the next afternoon.
The strategy this time was to preserve the crime scene first: since there are two nodes online, one was restarted to restore service while the other was only taken out of rotation, not restarted, to keep the scene intact.
Troubleshooting
First I checked the business logs on the problem server but found no large volume of repeated entries, which preliminarily ruled out an infinite loop. The only thing left was to analyze the JVM.
Step 1: use top to find the pid that is hogging the CPU.
(The screenshot here was taken after the fact.) At the time the CPU had soared above 500% and the pid was 27683.
Then run ps aux | grep 27683 to confirm that it is our Tomcat occupying the CPU. This was basically certain anyway, because the CPU dropped immediately every time Tomcat was restarted.
You can also use jps to list the pids of Java processes.
Step 2: run top -H -p 27683 to list the threads under pid 27683 along with each thread's CPU time and CPU share. It turned out that quite a few threads had high CPU usage, so they had to be checked one by one.
Step 3: use jstack to view the thread information: jstack 27683 >> aaaa.txt
Dump the output to a text file and then search it, or search directly through a pipe: jstack 27683 | grep "6c23"
The thread ids in the jstack output are in hexadecimal, so the decimal id from top needs converting; you can do that with printf "%x\n" tid, or just use a calculator.
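For completeness, the same decimal-to-hex conversion in Java (the id below is just an example value):

```java
public class TidToHex {
    public static void main(String[] args) {
        // top -H prints thread ids in decimal; jstack prints them as hex in "nid=0x..."
        System.out.println(Long.toHexString(27686)); // prints "6c26"
    }
}
```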
The tragedy is that the investigation led me down the wrong path. When I looked up thread 6c26, I found it was a GC thread: the runaway GC was what was burning the CPU, but I could not find what was producing so many objects, so I kept hunting for possible infinite loops and memory leaks.
Then I ran jstat -gcutil [PID] 1000 100 to check the GC state every second.
This screenshot is from the post-incident review; the one taken at the time is gone.
It showed objects being allocated and collected continuously, with the survivor space S0 churning through one GC after another. The heap contents revealed nothing useful, mostly String and Map objects, so I could not pin down any looping code or find the breaking point, and I fell into complete confusion. After more digging I confirmed it was not a memory leak, and was left deep in thought.
The CPU stayed extremely high. In desperation I ran jstack 27683 again and browsed the thread stacks aimlessly, and then noticed something: this node was offline, with no user traffic at all, yet the CPU was still this high and thread stacks were still being printed constantly, which meant the currently running threads were very likely the culprit. Analyzing those thread stacks immediately led to a major discovery.
A lot of the stacks pointed at the same thing: threads for httpProxy_jsp were constantly active. What on earth was this JSP? Was the server being attacked? I went to the code immediately and found the JSP did exist. The git history showed it had been committed by a colleague a few days earlier, and the timing matched exactly when the problem first appeared. Feeling I had probably found the culprit, I called the colleague over to go through his code. The JSP itself is very simple: it acts as a proxy and issues an HTTP request to a back-end service.
HttpRequestUtil is the utility class it uses, written by the colleague himself rather than built on any shared utility; in one of its post methods the connection sets neither a connect timeout nor a read timeout.
There is a fatal problem here: the HTTP request sets no timeout. If a connection's timeout is left unset or set to 0, it is treated as infinite, meaning it will never time out. So if the requested third-party page does not respond or responds very slowly, the request waits forever, and new requests keep arriving before the old ones return, so threads pile up there without crashing while continuously generating objects. That in turn triggered the JVM's constant, frantic GC, pushed the server CPU to its limit, and made the server's responses extremely slow. The problem was finally found, and it matched the symptoms perfectly. Sure enough, after changing this spot to add a 2-second timeout, the problem never reappeared.
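The article does not reproduce HttpRequestUtil, but the essence of the fix looks something like the sketch below; the 2-second values follow the description above, and everything else is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpRequestUtil {
    // Sketch of a post method with the previously missing timeouts added.
    public static byte[] post(String url, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setConnectTimeout(2_000); // previously unset: 0 means wait forever
        conn.setReadTimeout(2_000);    // previously unset: 0 means wait forever
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}
```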
A few takeaways from solving this problem:
1. JVM problems are complicated, and what you can see in the logs is often not the source of the problem. There is some luck involved in solving them; log analysis plus business context plus a bit of blind probing are all legitimate tools. Don't expect to crack a problem in one pass; combine the scene in front of you with some educated guesses.
2. The presenting symptom here was soaring CPU. At first I was convinced there was an infinite loop in the code, then I thought it was a memory leak, so I took many detours and in the end found it with the dumbest method: reading the thread-stack dumps. There is no one-size-fits-all way to troubleshoot; each problem has to be handled on its own terms, and you cannot lean only on past experience.
3. When writing code, prefer the shared utilities already widely used in the project, and avoid introducing home-grown code the project has not battle-tested. Even a seemingly trivial piece of code can bring disaster to a project unless you are confident you understand what it does at the lower levels, such as this timeout setting.
Troubleshooting a case where unreasonable use of the synchronized keyword caused thread blocking under concurrency
While doing performance diagnosis and tuning for a customer, I ran into a case where the synchronized keyword was used unreasonably and caused thread blocking under multi-threading. I have written up the whole process from discovery through investigation and analysis to optimization. The investigation used our commercial product, the XLand performance analysis platform. What I mainly want to share are the analysis and optimization ideas and the points to watch out for; anyone interested in digging deeper is welcome to comment and discuss.
Phenomenon
During a single-interface load test of the "check whether login is normal" interface, we found that as concurrency rose from 10 to 50 users, TPS stayed flat while response time kept climbing; application CPU was at 27%, database CPU at 3%, and resource consumption remained stable. This pointed to a bottleneck in the application itself.
Analysis
Thread analysis on the XLand platform showed threads waiting on a lock. Further analysis in XLand showed that the getAccessor method of the AuthProvider class carries the synchronized keyword. When two or more threads call a synchronized method at the same time, only one thread can enter the method at a time; the others must wait for the previous thread to finish executing it before they get a chance to enter.
Risk point
The synchronized keyword handles synchronization of access to shared resources between threads: it guarantees that the method or block it modifies is executed by only one thread at any given time. Use it carefully to avoid unnecessary thread blocking and the response-time cost that comes with it.
Optimization measures
After removing the synchronized keyword from the AuthProvider class, the TPS of the "check whether login is normal" interface under 10 concurrent users rose from 174 transactions/sec to 624 transactions/sec, more than a three-fold increase. Use synchronized with caution in everyday programming: if multiple threads do not actually need to coordinate writes to a static variable or a singleton's attribute, do not use it at all, and if you do need it, lock only the necessary code block rather than the whole method, as in the sketch below.
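For illustration, here is a hedged sketch of what "lock only the necessary block" can look like when a synchronized method exists mainly to guard lazy initialization; the class and method names follow the article, while the body is assumed.

```java
public class AuthProvider {

    // Hypothetical accessor type; the real one is not shown in the article.
    static class Accessor { }

    private volatile Accessor accessor;

    // Instead of marking the whole method synchronized, only the one-time
    // initialization is guarded, so the common read path takes no lock.
    public Accessor getAccessor() {
        Accessor local = accessor;
        if (local == null) {
            synchronized (this) {
                if (accessor == null) {
                    accessor = createAccessor(); // assumed expensive one-time setup
                }
                local = accessor;
            }
        }
        return local;
    }

    private Accessor createAccessor() {
        return new Accessor();
    }
}
```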
Postscript
There are many possible bottlenecks in Java application performance: system factors such as disk, memory, and network I/O, as well as application code, JVM GC, the database, caches, and so on. Java performance optimization is usually divided into four levels: the application layer, the database layer, the framework layer, and the JVM layer. The difficulty rises with each level, and the knowledge involved and the problems to solve differ. To be honest, though, most problems do not require you to understand framework source code, JVM parameters, or the GC's inner workings; you only need to analyze the SQL, understand the code logic, locate the problematic Java code, and fix it. After all, as the saying goes, 80% of performance problems are caused by bad code you wrote yourself. That may be a bit blunt, but keeping good coding habits, using potentially problematic keywords sensibly, and being careful with memory really can head off a large share of problems. Finally, I wish you all plenty of shipped features and not a single bug!