Overview
This post summarizes the methods and lessons learned while troubleshooting a memory growth problem in our search cluster, for the record and for future reference.
Symptoms
- After a release, memory usage on the machines kept rising, and only a restart brought it back down.
Troubleshooting process
Checking heap memory with jmap
- Dump the heap in the usual way, then analyze it with JProfiler or MAT.
- A comparison approach is used here: take one heap dump immediately after the restart and another after the service has run for a day, then compare the two.
The detailed steps are not repeated here.
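As a lighter-weight illustration of the same before/after comparison, the sketch below diffs two class histograms instead of two full heap dumps. It assumes the histograms were captured with `jmap -histo:live <pid>`, which is not what was done in this investigation (full dumps were analyzed in JProfiler/MAT). The function names and the sample data are made up for illustration.

```python
import re

# Matches histogram rows such as "   1:   1000   80000  [C".
# Header, separator, and "Total" lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return {class_name: bytes} from `jmap -histo` style output."""
    sizes = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            sizes[m.group(3)] = int(m.group(2))
    return sizes


def growth(before, after, top=10):
    """Classes whose byte usage grew between two histograms, largest first."""
    b, a = parse_histogram(before), parse_histogram(after)
    deltas = [(name, size - b.get(name, 0)) for name, size in a.items()]
    deltas = [d for d in deltas if d[1] > 0]
    return sorted(deltas, key=lambda d: d[1], reverse=True)[:top]


# Made-up sample data, for illustration only.
BEFORE = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:          1000          80000  [C
   2:           500          12000  java.lang.String
"""

AFTER = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:          1500         120000  [C
   2:           750          18000  java.lang.String
   3:            10            400  java.util.HashMap
"""

if __name__ == "__main__":
    for name, delta in growth(BEFORE, AFTER):
        print(f"{name}: +{delta} bytes")
```

A histogram only shows which classes grew, not who holds the references, so it complements a full dump analysis rather than replacing it.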
Using NMT to troubleshoot off-heap (native) memory
- Add the following JVM parameter:
-XX:NativeMemoryTracking=detail
Enabling it is said to cost 5% to 10% in performance. I did not observe a drop after enabling it, but it is still advisable to be cautious in production and to enable it on a single machine dedicated to troubleshooting.
- Set the NMT baseline:
jcmd <pid> VM.native_memory baseline
The baseline marks a reference memory state, so that later you can see how memory has changed and which part has grown the most.
- After some time, run
jcmd <pid> VM.native_memory detail.diff scale=MB
to check the memory changes.
- After some time, the changes were as follows:
Native Memory Tracking:
Total: reserved=15048MB +73MB, committed=13993MB +74MB
- Java Heap (reserved=10240MB, committed=10240MB)
(mmap: reserved=10240MB, committed=10240MB)
- Class (reserved=1224MB, committed=223MB)
(classes #30779 +1)
(malloc=6MB #100242 +153)
(mmap: reserved=1218MB, committed=218MB)
- Thread (reserved=1457MB +5MB, committed=1457MB +5MB)
(thread #1444 +4)
(stack: reserved=1449MB +5MB, committed=1449MB +5MB)
(malloc=5MB #7227 +20)
(arena=3MB #2887 +8)
- Code (reserved=286MB, committed=251MB)
(malloc=42MB #41476 +94)
(mmap: reserved=244MB, committed=209MB)
- GC (reserved=520MB +16MB, committed=520MB +16MB)
(malloc=108MB +16MB #153709 +200)
(mmap: reserved=412MB, committed=412MB)
- Compiler (reserved=5MB, committed=5MB)
(malloc=5MB #5673 +3)
- Internal (reserved=610MB +3MB, committed=610MB +3MB)
(malloc=609MB +3MB #168363 +191)
- Symbol (reserved=672MB +50MB, committed=672MB +50MB)
(malloc=667MB +50MB #465680 +6396)
(arena=5MB #1)
- Native Memory Tracking (reserved=15MB, committed=15MB)
(malloc=1MB #9771 +3390)
(tracking overhead=15MB)
- Unknown (reserved=20MB, committed=0MB)
(mmap: reserved=20MB, committed=0MB)
The Symbol category has grown significantly (+50MB of the +74MB total committed growth). This category mainly holds data such as interned strings (String.intern), so the preliminary conclusion is that this is where the leak is.
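Reading the diff by eye works for one snapshot, but ranking the categories by growth can also be scripted. The following is a minimal sketch that assumes the output was produced with `scale=MB`, as in the command above. The sample text is a subset of the output shown above, and the function name `rank_growth` is hypothetical.

```python
import re

# Matches category lines such as
# "- Symbol (reserved=672MB +50MB, committed=672MB +50MB)".
# Assumes scale=MB; the delta after "committed=" is optional.
CATEGORY = re.compile(
    r"^-\s*(?P<name>.+?) \(reserved=\d+MB(?: [+-]\d+MB)?, "
    r"committed=(?P<committed>\d+)MB(?: (?P<delta>[+-]\d+)MB)?\)"
)


def rank_growth(diff_text):
    """Return (category, committed_mb, delta_mb) tuples, largest growth first.

    Categories without a committed delta (no change since baseline) are skipped.
    """
    rows = []
    for line in diff_text.splitlines():
        m = CATEGORY.match(line.strip())
        if m and m.group("delta"):
            rows.append(
                (m.group("name"), int(m.group("committed")), int(m.group("delta")))
            )
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Subset of the NMT diff output shown above.
SAMPLE = """\
Total: reserved=15048MB +73MB, committed=13993MB +74MB
- Java Heap (reserved=10240MB, committed=10240MB)
- Class (reserved=1224MB, committed=223MB)
- Thread (reserved=1457MB +5MB, committed=1457MB +5MB)
- GC (reserved=520MB +16MB, committed=520MB +16MB)
- Internal (reserved=610MB +3MB, committed=610MB +3MB)
- Symbol (reserved=672MB +50MB, committed=672MB +50MB)
(malloc=667MB +50MB #465680 +6396)
"""

if __name__ == "__main__":
    for name, committed, delta in rank_growth(SAMPLE):
        print(f"{name}: committed={committed}MB, change={delta:+d}MB")
```

On this sample, Symbol comes first, followed by GC, Thread, and Internal, which matches the reading above.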
Finding a solution
- Some background is needed here: the changes JDK 8 made around the metaspace. Starting with JDK 8, the permanent generation was removed, and the metaspace that replaced it lives in native memory. The details are easy to look up.
- Searching for keywords such as "NMT memory leak" turned up a JDK bug that appears to cause native memory leaks: https://bugs.openjdk.java.net/browse/JDK-8180048
- The symptoms described there are basically the same as ours.
- The bug was reported against JDK 1.8.0_131 and we were running 1.8.0_101, so we suspected that our version was affected as well and decided to try changing the JDK version.
- The fix was to upgrade the JDK, so we upgraded to jdk1.8.0_202.
Result verification
- As before, a comparison approach was used: one machine kept the original JDK (1.8.0_101) and another ran 1.8.0_202. After two days of running, a clear gap was visible between the two.
- The applications' memory usage clearly differed, and the difference came mainly from Symbol, which is consistent with the JDK bug.
- You can also set
-XX:MaxMetaspaceSize
to limit the metaspace, but this was not tested, because the leak was caused by the bug rather than by too much metadata or too many interned strings.
Overall troubleshooting approach
- Use jmap to look at memory usage and how memory is allocated.
- Take heap dump snapshots to analyze the heap and check for leaks. Note that dumping triggers a full GC (FGC), so take the instance out of online traffic first. On reflection, a heap leak is also unlikely to make physical memory grow continuously, because the heap size is fixed (here, reserved and committed are both 10240MB).
- Once the heap is ruled out, check off-heap memory with NMT: set a baseline, then compare against it at intervals.
- After locating the problem, look for a solution and try it.
- After applying the fix, run a controlled comparison to confirm that the problem is really solved.
- After the change, let production run stably for 48 hours to confirm that it brings no other side effects.
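The NMT step in this checklist (set a baseline, wait, then diff) could be scripted roughly as follows. This is a sketch, not the procedure actually used: it assumes `jcmd` is on the PATH and that the target JVM was started with `-XX:NativeMemoryTracking=detail`. The helper names and the injectable `run`/`sleep` parameters are hypothetical and exist only to make the function easy to try out without a live JVM.

```python
import subprocess
import time


def nmt_command(pid, *args):
    """Build the argv for `jcmd <pid> VM.native_memory <args...>`."""
    return ["jcmd", str(pid), "VM.native_memory", *args]


def baseline_then_diff(pid, wait_seconds, run=subprocess.run, sleep=time.sleep):
    """Set an NMT baseline, wait, then return the detail.diff output in MB.

    `run` and `sleep` are injectable so the flow can be exercised in a test.
    """
    # Mark the reference memory state.
    run(nmt_command(pid, "baseline"), check=True)
    # Let the service run; the investigation above compared after long intervals.
    sleep(wait_seconds)
    # Ask the JVM what changed relative to the baseline.
    result = run(
        nmt_command(pid, "detail.diff", "scale=MB"),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `baseline_then_diff(12345, 3600)` would return the diff text after one hour for a JVM with PID 12345 (a placeholder PID), which can then be saved or compared across machines.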
References
- Oracle official bug information: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8180048
- JDK bug info: https://bugs.openjdk.java.net/browse/JDK-8180048
- The content stored in SymbolTable: https://blog.csdn.net/weixin_34360651/article/details/91460994
- Another example that also uses NMT to troubleshoot the problem: https://blog.csdn.net/qianshangding0708/article/details/100978730
- PermGen and MetaSpace: https://segmentfault.com/a/1190000012577387
- NMT usage example: https://blog.51cto.com/u_15127657/4321515