Like it first, then watch it, develop a good habit
friendly last time, I stopped for a few days. Today I came to (greet) me again...
Operation and maintenance: This service is yours too. Look at this JAVA process, the memory usage is almost 3G, this machine is only 4G, and your stack is only configured with 2G, so you must be alerted! This time, the memory leaked, right? It's not that I am ignorant!
Come on again... Does this big brother have an opinion on me? With the previous experience, this time I feel more confident. Or according to the usual practice, go back first
Me: "Impossible, our service is very stable, there will be no such problem!"
Operation and maintenance: Where does your confidence and courage come from? Did Liang Jingru give it? You go back and check it before loading
It seems that the eldest brother came here prepared this time, could it really be a problem? A little panicked...
But I still can’t persuade me, let’s just play the operation and maintenance brother first, and find an excuse to go back and check the monitoring first.
Me: OK, I'll see later, I'll go and have a meeting with someone first...
Analyze monitoring data
The heap memory of this service is configured to 2G, and the historical memory has indeed reached 2G, although it is only a few hundred megabytes now... It seems that there is no problem.
In addition to some of the VM's own overhead, it occupies a total of more than 2 G... It seems to be said in the past
Then I found the operation and maintenance brother again, (friendly) communicate...
Me: It’s the same as last time, no difference. I explained the memory management mechanism last time. Why do you still say we have problems?
Operation and maintenance: Look at your service, the CMS+ParNew garbage collector is configured, right? Last time you said that the combination of the collector will release memory to the operating system, right? So why does it still occupy 2G, where is it released?
Me: Although the result of the last test is that it will be released, there are some other sayings that it will be released incrementally when idle, and the release cost is so high, what if it is not released?
O&M: Didn't you just slap yourself in the face? Last time I said it could be released, but now you say it’s normal if you haven’t released it. Do you think I’m stupid?
The operation and maintenance brother seems to see that I am quibbling...
It’s normal if you don’t release it. The release cost is so high. After release, you have to reapply and reorganize the memory structure balabalabala...
I don’t have the confidence to say that...After all, I tested CMS+ParNew last time and it will indeed be released. It’s just a matter of time.
Operation and maintenance: You continue to quibble, the memory of this service follows this trend, it is estimated that you will have OOM tomorrow, and then the system will give you an OOM Killer lore, you will be happy!
Me: No problem, this memory is normal, my own service, can I still understand
At this point, I'm already a little uneasy, what my eldest brother said makes sense, in case OOM Killer is over, it's not over!
Me: I'll take a closer look when I'm free later, there should be no problem, you will be busy with you first, don't worry!
Check real-time indicators on the server
After dismissing the operation and maintenance brother, I thought about it for a while. The data in this memory does not seem to be normal. This service belongs to that kind of back-end application. Although the business is complicated, it is only used by the business leaders. But this memory usage is indeed a bit unreasonable, where is the memory going...
Let's talk about the data, go to the host to find out if there are any clues
- Memory-ok
- CPU utilization-ok
- CPU load-ok
There is nothing unusual, CPU utilization and load are very low...Where is the problem?
It's too hard to look at the indicators so little, let's take Arthas to see, it basically shows the JVM-related indicators, which are more comprehensive:
[arthas@20727]$ dashboard
Memory used total max usage GC
heap 498M 1963M 1963M 25.1% gc.ps_scavenge.count 3
ps_eden_space 98M 512M 512M 19.5% gc.ps_scavenge.time(ms) 44
ps_survivor_space 0K 87040K 87040K 0.00% gc.ps_marksweep.count 1
ps_old_gen 39M 1365M 1365M 2.88% gc.ps_marksweep.time(ms) 87
nonheap 32M 180M -1 17.7%
code_cache 10M 50M 240M 20%
metaspace 20M 128M -1 15.6%
compressed_class_space 2M 2M 1024M 0.25%
It seems that there is no problem with JVM-level memory. Let's take a look at threads:
[arthas@20727]$ thread
Threads Total: 9831, NEW: 0, RUNNABLE: 8, BLOCKED: 0, WAITING: 9789, TIMED_WAITING: 2, TERMINATED: 0, Internal threads
: 17
What the hell is this! 1w threads faster! It's basically WAITING!
Quickly take a look at these WAITING threads:
[arthas@20727]$ thread --state WAITING
ID NAME GROUP PRIORITY STATE %CPU DELTA_TIM TIME INTERRUPT DAEMON
# 此处省略 9000+ 行……
9822 pool-9813-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9823 pool-9814-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9824 pool-9815-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9825 pool-9816-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9826 pool-9817-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9827 pool-9818-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9828 pool-9819-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9829 pool-9820-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9830 pool-9821-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9831 pool-9822-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9832 pool-9823-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9833 pool-9824-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9834 pool-9825-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9835 pool-9826-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9836 pool-9827-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9837 pool-9828-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9838 pool-9829-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9839 pool-9830-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
9840 pool-9831-thread-1 main 5 WAITING 0.0 0.000 0:0.000 false false
Seeing this thread name, I probably understand that it must be some little genius who poisoned the code.
From the thread name, it must be the thread in the thread pool, and it is the default thread name generation rule. The threads in the thread pool are created through ThreadFactory, and the default ThreadFactory generation rule is as follows:
DefaultThreadFactory() {
SecurityManager s = System.getSecurityManager();
group = (s != null) ? s.getThreadGroup() :
Thread.currentThread().getThreadGroup();
// 前缀,每一个新的 ThreadFactory 都有一个新的前缀
// 每一个线程池,都只有一个 ThreadFactory
namePrefix = "pool-" +
poolNumber.getAndIncrement() +
"-thread-";
}
public Thread newThread(Runnable r) {
// 每个线程都会使用 factory的前缀,然后加上自增的线程数
Thread t = new Thread(group, r,
namePrefix + threadNumber.getAndIncrement(),
0);
if (t.isDaemon())
t.setDaemon(false);
if (t.getPriority() != Thread.NORM_PRIORITY)
t.setPriority(Thread.NORM_PRIORITY);
return t;
}
So this problem must be caused by a little genius who creates a new thread pool every time in the code and then does not shut down! Just find a thread and look at its stack:
"pool-1-thread-1" Id=10 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6ba7592
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6ba7592
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The real hammer is that every time the new thread pool is not shut down... Now you only need to find the place where the thread pool was created, and then flip through the submission records, you can know which little genius it is...
But with so much code, it is still a bit troublesome to find the place of this newly created thread pool, and searching for the code is not very convenient. Here we still use Arthas to take a look, stack will know at a glance
However, if the stack ThreadPoolExecutor#execute method is directly used here, there may be too much interference information. After all, there are too many places to call execute, which is not easy to catch the point.
So it is better to initialize the stack thread pool:
[arthas@20727]$ stack java.util.concurrent.ThreadPoolExecutor <init>
Affect(class count: 0 , method count: 0) cost in 4 ms, listenerId: 5
No class or method is affected, try:
1. Execute `sm CLASS_NAME METHOD_NAME` to make sure the method you are tracing actually exists (it might be in your parent class).
2. Execute `options unsafe true`, if you want to enhance the classes under the `java.*` package.
3. Execute `reset CLASS_NAME` and try again, your method body might be too large.
4. Check arthas log: /home/jiangxin/logs/arthas/arthas.log
5. Visit https://github.com/alibaba/arthas/issues/47 for more details.
Is this... not supported?
Arthas's enhancement strategy is relatively conservative. Some system-level classes are not enhanced. java.*
package are not enhanced by default and need to be manually enabled:
[arthas@20727]$ options unsafe true
NAME BEFORE-VALUE AFTER-VALUE
-----------------------------------
unsafe false true
After executing stack again, it can be used. After two minutes, finally there is a call:
ts=2021-06-12 12:04:03;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@4e0e2f2a
@java.util.concurrent.ThreadPoolExecutor.<init>()
at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
at XXXService.sendSms(XXXService.java:782)
//...
Finally found this Sao operation code... It is written like this:
ExecutorService executorService = Executors.newFixedThreadPool(8);
executorService.submit(smsTask);
//...
As I guessed, it is true every time it is new, and then it is not shut down.
This can also be compared with the above situation. The extra memory usage is due to this 10,000 threads smaller... And if there are no tasks to be executed in these thread pools, the threads must be in the WAITING state and will not occupy the CPU utilization. , There will be no impact on the load. I really can't see the problem if I'm not careful
, We still have to combine various indicators to make a comprehensive analysis.
Solving this problem is simple, let the person who wrote this shit code change it, and then pull it out to sacrifice to the heavens.
But the operation and maintenance eldest brother...has been pretended to be out, and now he has lost his face. Last time I finally got back the face, but this time I lost everything.
Account for operation and maintenance
Me: There is indeed a problem with the service. We will post an emergency repair version, which will be solved by going online at night.
Operation and maintenance: Didn't you say that there is no problem? Where is the confidence
Me: It’s not that I didn’t eat breakfast, I was hungry and fainted, my mind was not clear enough... I didn’t see the problem, my pot was my pot.
Operation and maintenance: It must be caused by your shit mountain code. If you have nothing to do, do code review and check the code. In the future, there will be fewer low-level problems, otherwise I will deal with you all day long.
Unexpectedly, this operation and maintenance is still panting, and if I give some face, I will go to heaven... But who will let me take care of it, I can only respond
Me: Yes, yes, we must do more code review in the future, strengthen code review, to avoid this kind of shit code and then submit it, affecting production. Don't trouble the boss
Originality is not easy, unauthorized reprinting is prohibited. If my article is helpful to you, please like/favorite/follow to encourage and support it ❤❤❤❤❤❤
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。