Notes on troubleshooting an OOM

Hello everyone, my name is Dabin~

Today I'd like to share an OOM problem I ran into recently.

Last Friday morning, the testers reported that a subsystem service in the test environment kept timing out and requests were getting no response.

When I got this report I was a little puzzled: the system's code logic had not changed recently, so why would the service suddenly start timing out? To avoid holding up testing, I quickly logged in to the bastion host and checked the logs to see what had happened.

I first looked at the system load with the top command and found that one Java process was consistently using between 100% and 200% CPU. Since this system does not do much heavy computation, my guess was that the cause was either an infinite loop or frequent full GC.
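One way to tell the two guesses apart from inside the JVM is to look at per-thread CPU time. The sketch below is only an illustration and not what was run on the server: the class name HotThreadProbe is made up, and it inspects the JVM it runs in rather than another process. It lists Java threads by cumulative CPU time. GC worker threads are VM-internal and do not show up in this list, so if the process burns CPU while no Java thread accounts for it, GC becomes the more likely suspect.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the original service.
class HotThreadProbe {

    /** Name and cumulative CPU time (in ms) of one live Java thread. */
    static final class Sample {
        final String name;
        final long cpuMillis;

        Sample(String name, long cpuMillis) {
            this.name = name;
            this.cpuMillis = cpuMillis;
        }
    }

    /** Returns live Java threads sorted by cumulative CPU time, busiest first. */
    static List<Sample> busiestThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<Sample> samples = new ArrayList<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return samples; // CPU timing is not available on this JVM
        }
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if the thread is no longer alive
            ThreadInfo info = threads.getThreadInfo(id);  // null if the thread is no longer alive
            if (cpuNanos >= 0 && info != null) {
                samples.add(new Sample(info.getThreadName(), cpuNanos / 1_000_000));
            }
        }
        samples.sort(Comparator.comparingLong((Sample s) -> s.cpuMillis).reversed());
        return samples;
    }

    public static void main(String[] args) {
        // Print at most the five busiest Java threads.
        List<Sample> samples = busiestThreads();
        for (int i = 0; i < Math.min(5, samples.size()); i++) {
            System.out.println(samples.get(i).name + ": " + samples.get(i).cpuMillis + " ms CPU");
        }
    }
}
```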

Looking at the application log, I found java.lang.OutOfMemoryError: Metaspace. Clearly, the metaspace had run out of memory.

Next I checked the GC activity with the following command. pid is the ID of the Java process, obtained from the top command, and the parameter 1000 means a sample is printed every 1000 ms.

jstat -gc pid 1000

The output confirmed the guess: full GC had been triggered hundreds of times between application startup and the time of sampling! That is why the CPU usage stayed at 100% or more.
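The same counters can also be read from inside a JVM through the standard management API. The following is a minimal sketch, assuming you can run code in the affected process. The class name GcCountProbe is hypothetical, and unlike jstat it only observes the JVM it runs in. Collector names depend on the GC in use, so the code prints whatever the JVM reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original service.
class GcCountProbe {

    /** One line per collector: name, cumulative collection count, cumulative time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: count=%d, timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample three times, one second apart, similar in spirit to a 1000 ms jstat interval.
        for (int i = 0; i < 3; i++) {
            for (String line : snapshot()) {
                System.out.println(line);
            }
            Thread.sleep(1000);
        }
    }
}
```

If the count of the old-generation collector keeps climbing between samples while the application is doing little work, that points to the same situation seen in the jstat output.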

Another column worth noting is MC (metaspace capacity, that is, the memory allocated to the metaspace), which was close to the configured maximum metaspace size (-XX:MaxMetaspaceSize=128m).

Here is a brief introduction to the metaspace.
Metaspace was introduced in JDK 8; in JDK 7 and earlier the corresponding area was called the permanent generation. JDK 8 removed the permanent generation and replaced it with metaspace. Metaspace is allocated in native memory (not on the heap), and by default its size is not limited. You can set an upper bound with MaxMetaspaceSize.
The metaspace consists of two parts:
Klass Metaspace, which stores klass structures. A klass is the JVM's runtime data structure for a class file.
NoKlass Metaspace, which stores the other klass-related content, such as methods and constant pools. This memory is made up of multiple memory chunks.
MC is the total memory allocated to both Klass Metaspace and NoKlass Metaspace. In the jstat output above, MC was close to the configured metaspace limit, which means the metaspace was running out of memory and full GC was being triggered over and over.
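The metaspace figures described above can also be inspected through the memory pool MXBeans. This is a sketch under assumptions: the pool names "Metaspace" and "Compressed Class Space" are what HotSpot reports on JDK 8 and later (the latter only when compressed class pointers are enabled), other JVMs may name them differently, and MetaspaceProbe is a made-up class name. The committed value is in bytes and is only roughly comparable to the MC column of jstat, which is in KB.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; pool names are a HotSpot assumption.
class MetaspaceProbe {

    /** Returns the usage of every non-heap pool whose name mentions Metaspace or Class Space. */
    static Map<String, MemoryUsage> metaspacePools() {
        Map<String, MemoryUsage> pools = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null
                    && pool.getType() == MemoryType.NON_HEAP
                    && (name.contains("Metaspace") || name.contains("Class Space"))) {
                pools.put(name, usage);
            }
        }
        return pools;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, MemoryUsage> e : metaspacePools().entrySet()) {
            MemoryUsage u = e.getValue();
            // getMax() returns -1 when no limit is defined, e.g. MaxMetaspaceSize is not set.
            System.out.printf("%s: used=%d, committed=%d, max=%d (bytes)%n",
                    e.getKey(), u.getUsed(), u.getCommitted(), u.getMax());
        }
    }
}
```

When used stays close to max and full GC cannot bring it down, the JVM is in the situation described in this article.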

Next I dumped the heap to find out what was causing the metaspace overflow. The command ./jmap -dump:live,format=b,file=/xxx pid exports the heap of the process to /xxx (in hprof format), which can then be analyzed with the MAT tool.
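If attaching jmap is not possible, a HotSpot JVM can also write an hprof dump of itself through the HotSpotDiagnosticMXBean. The sketch below is not the method used in this investigation: the class name HeapDumper and the temporary file location are made up for illustration, and the API is HotSpot-specific (com.sun.management). Passing true as the second argument dumps only live objects, matching the live option of jmap.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; dumps the heap of the JVM it runs in.
class HeapDumper {

    /** Writes a live-objects heap dump into a fresh temporary directory and returns its path. */
    static Path dumpToTempFile() {
        try {
            // A fresh directory guarantees the target file does not exist yet;
            // dumpHeap refuses to overwrite an existing file.
            Path dir = Files.createTempDirectory("heapdump-demo");
            Path target = dir.resolve("demo.hprof"); // newer JDKs require the .hprof suffix
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), true); // true = only objects reachable from GC roots
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dump = dumpToTempFile();
        System.out.println("Heap dump written to " + dump + " (" + dump.toFile().length() + " bytes)");
    }
}
```

The resulting file can be opened in MAT in the same way as a dump produced by jmap.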

Import the hprof file into MAT and open the memory leak analysis (the screenshot is mosaicked because it involves the company's internal source code):

Seeing this, you can probably guess what the problem is. The company is currently rolling out a vulnerability monitoring tool internally, which requires deploying an agent on the server. The tool collects and monitors function execution and data transmission at application runtime, and can identify common security flaws and vulnerabilities. The mosaicked part is the package name of this vulnerability monitoring tool, so the problem was very likely caused by introducing it!

To confirm this further, open the Histogram:

Shallow Heap is the memory occupied by the object itself, excluding the objects that its fields reference.
Retained Heap is the amount of memory that would be freed if the object were reclaimed by GC. It equals the sum of the Shallow Heap of all objects in the object's retained set.
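To make the two definitions concrete, here is a toy calculation on a made-up object graph. Everything in it (the class RetainedHeapDemo, the object names, and the byte sizes) is invented for illustration. It finds the retained set by brute force, comparing what is reachable from the GC roots with and without the object, whereas MAT computes retained sizes with a dominator tree.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model for illustration; object names and sizes are made up.
class RetainedHeapDemo {
    private final Map<String, Integer> shallow = new HashMap<>();
    private final Map<String, List<String>> refs = new HashMap<>();
    private final Set<String> roots = new HashSet<>();

    /** Registers an object with its shallow size and the objects it references. */
    void object(String name, int shallowBytes, String... references) {
        shallow.put(name, shallowBytes);
        refs.put(name, Arrays.asList(references));
    }

    void root(String name) {
        roots.add(name);
    }

    /** Objects reachable from the GC roots when `removed` is treated as absent (null removes nothing). */
    private Set<String> reachableWithout(String removed) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        for (String r : roots) {
            if (!r.equals(removed)) todo.push(r);
        }
        while (!todo.isEmpty()) {
            String cur = todo.pop();
            if (!seen.add(cur)) continue; // already visited
            for (String next : refs.getOrDefault(cur, Collections.<String>emptyList())) {
                if (!next.equals(removed)) todo.push(next);
            }
        }
        return seen;
    }

    int shallowHeap(String name) {
        return shallow.get(name);
    }

    /** Sum of shallow sizes of everything that becomes unreachable once `name` is gone. */
    int retainedHeap(String name) {
        Set<String> retainedSet = reachableWithout(null);
        retainedSet.removeAll(reachableWithout(name));
        int total = 0;
        for (String o : retainedSet) total += shallow.get(o);
        return total;
    }

    /** root -> A, B;  A -> C, D;  B -> D. */
    static RetainedHeapDemo sample() {
        RetainedHeapDemo g = new RetainedHeapDemo();
        g.object("root", 8, "A", "B");
        g.object("A", 16, "C", "D");
        g.object("B", 16, "D");
        g.object("C", 24);
        g.object("D", 32);
        g.root("root");
        return g;
    }

    public static void main(String[] args) {
        RetainedHeapDemo g = sample();
        System.out.println("shallow(A)  = " + g.shallowHeap("A"));  // 16
        System.out.println("retained(A) = " + g.retainedHeap("A")); // 40 = A + C
        System.out.println("retained(B) = " + g.retainedHeap("B")); // 16 = B only
    }
}
```

D is referenced by both A and B, so it survives the removal of either one and is counted in neither retained size. C is only reachable through A, so it is part of A's retained set.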

In the Histogram view, right-click one of the classes and select Merge Shortest Paths to GC Roots from the menu to see the paths from its objects to the GC roots. You can choose to exclude certain types of references, for example weak or soft references.

The result is as follows:

The classes occupying the most memory belong to the vulnerability monitoring tool, which essentially confirms the cause.

Finally, after removing the vulnerability monitoring tool and redeploying, the service timeouts no longer occurred.

That is the whole process of analyzing this OOM problem~

Writing takes effort. If you found this helpful, please leave a like as encouragement!

I am Dabin, a programmer who focuses on sharing hard-core Java back-end knowledge. You are welcome to follow me~

This article was published via the blog platform OpenWrite!
