Like it first, then read it: develop a good habit

Previous article: "Ops: Your JAVA service's memory usage is too high, and it just keeps growing! Warning, come and deal with it"

After the "friendly" exchange last time, things were quiet for a few days. Today the ops guy came to (greet) me again...

Ops: This service is yours too, right? Look at this JAVA process: memory usage is almost 3G, the machine only has 4G, and your heap is only configured at 2G. You'd better pay attention! This time it really is a memory leak, right? Don't tell me I don't know what I'm talking about!

Here we go again... Does this guy have something against me? With last time's experience under my belt, I felt more confident this time. As usual, push back first.

Me: "Impossible, our service is very stable, there will be no such problem!"

Ops: Where does that confidence come from? Did Liang Jingru give it to you? Go back and check before you talk big.

It looked like the ops guy came prepared this time. Could there really be a problem? I was starting to panic a little...

But I couldn't admit it out loud, so I stalled the ops guy with an excuse and went back to check the monitoring first.

Me: OK, I'll take a look later. I have to run to a meeting first...

Analyzing the monitoring data

The service's heap is configured at 2G, and heap usage has indeed reached 2G historically, although it's only a few hundred megabytes right now... That looks fine.

Add the JVM's own overhead on top of that, and a total of a bit over 2G... seems explainable.

So I went back to the ops guy for another (friendly) chat...

Me: It's the same as last time, nothing different. I already explained the memory management mechanism. Why do you still say we have a problem?

Ops: Look at your service. It's configured with the CMS + ParNew garbage collectors, right? Last time you said that collector combination releases memory back to the operating system, right? So why is it still holding 2G? Where did the release go?

Me: My test last time did show it gets released, but there are also claims that it's only released incrementally when the JVM is idle. Releasing is expensive anyway; is it really a problem if it isn't released?

O&M: Didn't you just slap yourself in the face? Last time I said it could be released, but now you say it’s normal if you haven’t released it. Do you think I’m stupid?

The ops guy could clearly tell I was quibbling...

"Not releasing is normal, releasing is expensive, after releasing you'd have to re-request and re-lay-out the memory, blah blah blah..."

But I didn't have the nerve to say that... After all, I tested CMS + ParNew myself last time and it really does release memory; it's only a matter of time.

Ops: Keep quibbling. If this service's memory keeps trending like this, it'll probably OOM by tomorrow, and then the system will finish it off with the OOM Killer. Won't that be fun!

Me: No problem, this memory usage is normal. It's my own service; I think I understand it well enough.

By this point I was already a bit uneasy. What the ops guy said made sense: if the OOM Killer really struck, it would all be over!

Me: I'll take a closer look when I get a moment. There shouldn't be a problem. Go get on with your own work; don't worry!

Checking real-time metrics on the server

After sending the ops guy away, I thought it over for a while. The memory numbers really didn't look normal. This service is a back-office application: the business logic is complicated, but it's only used internally by the business side. The memory usage really is a bit unreasonable. Where is all that memory going...

The monitoring data alone won't settle it, so let's go onto the host and look for clues:

  1. Memory: OK
  2. CPU utilization: OK
  3. CPU load: OK

Nothing unusual: CPU utilization and load are both very low... So where is the problem?

These few metrics are too thin to tell much, so let's pull out Arthas. Its dashboard shows pretty much all the JVM-level metrics, which is much more comprehensive:

[arthas@20727]$ dashboard

Memory                    used    total    max     usage    GC
heap                      498M     1963M    1963M   25.1%    gc.ps_scavenge.count          3
ps_eden_space             98M      512M     512M    19.5%    gc.ps_scavenge.time(ms)       44
ps_survivor_space         0K      87040K   87040K  0.00%    gc.ps_marksweep.count         1
ps_old_gen                39M     1365M    1365M   2.88%    gc.ps_marksweep.time(ms)      87
nonheap                   32M     180M      -1      17.7%
code_cache                10M      50M      240M    20%
metaspace                 20M     128M      -1      15.6%
compressed_class_space    2M      2M       1024M   0.25%

It seems that there is no problem with JVM-level memory. Let's take a look at threads:

[arthas@20727]$ thread

Threads Total: 9831, NEW: 0, RUNNABLE: 8, BLOCKED: 0, WAITING: 9789, TIMED_WAITING: 2, TERMINATED: 0, Internal threads: 17

What the hell is this! Almost 10,000 threads! And nearly all of them WAITING!

Quickly take a look at these WAITING threads:

[arthas@20727]$ thread --state WAITING

ID   NAME                          GROUP          PRIORITY  STATE     %CPU      DELTA_TIM TIME      INTERRUPT DAEMON
# 9,000+ lines omitted here...
9822 pool-9813-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9823 pool-9814-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9824 pool-9815-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9825 pool-9816-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9826 pool-9817-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9827 pool-9818-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9828 pool-9819-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9829 pool-9820-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9830 pool-9821-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9831 pool-9822-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9832 pool-9823-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9833 pool-9824-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9834 pool-9825-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9835 pool-9826-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9836 pool-9827-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9837 pool-9828-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9838 pool-9829-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9839 pool-9830-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false
9840 pool-9831-thread-1            main           5         WAITING   0.0       0.000     0:0.000   false     false

Seeing these thread names, I had a pretty good idea what happened: some little genius must have planted a landmine in the code.

The names tell us these are thread-pool threads, using the default naming rule. Threads in a thread pool are created through a ThreadFactory, and the default ThreadFactory generates names like this:

DefaultThreadFactory() {
    SecurityManager s = System.getSecurityManager();
    group = (s != null) ? s.getThreadGroup() :
                          Thread.currentThread().getThreadGroup();

    // Prefix: every new ThreadFactory gets a new, incremented prefix,
    // and every thread pool has exactly one ThreadFactory
    namePrefix = "pool-" +
        poolNumber.getAndIncrement() +
        "-thread-";
}

public Thread newThread(Runnable r) {
    
    // Each thread gets the factory's prefix plus an auto-incrementing thread number
    Thread t = new Thread(group, r,
                          namePrefix + threadNumber.getAndIncrement(),
                          0);
    if (t.isDaemon())
        t.setDaemon(false);
    if (t.getPriority() != Thread.NORM_PRIORITY)
        t.setPriority(Thread.NORM_PRIORITY);
    return t;
}
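
A quick way to see this naming rule in action (a throwaway sketch of mine, not code from the service): every call to Executors.newFixedThreadPool gets its own DefaultThreadFactory, so the pool number keeps climbing while the thread number inside each pool restarts at 1.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolNameDemo {
    public static void main(String[] args) throws InterruptedException {
        // Create three separate pools and deliberately never shut them down,
        // mimicking the bug: each pool gets its own "pool-N-" prefix.
        for (int i = 0; i < 3; i++) {
            ExecutorService pool = Executors.newFixedThreadPool(1);
            pool.submit(() -> System.out.println(Thread.currentThread().getName()));
        }
        Thread.sleep(500);
        // Typical output: pool-1-thread-1, pool-2-thread-1, pool-3-thread-1,
        // exactly the pattern in the Arthas thread listing above.
    }
}

So a name like pool-9813-thread-1 means the 9,813th thread pool created in this JVM, each of which apparently only ever spun up its first thread. Thousands of pools, all still alive.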

So the problem has to be that some little genius keeps creating a new thread pool in the code and never shuts it down! Let's pick one of the threads and look at its stack:

"pool-1-thread-1" Id=10 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6ba7592
    at sun.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6ba7592
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

That nails it: a new thread pool is created each time and never shut down... Now I just need to find where the pool is created, dig through the commit history, and I'll know exactly which little genius did it...

But with this much code, finding where the thread pool gets created would be tedious, and searching the codebase isn't that convenient either. Let's stick with Arthas: the stack command will show it at a glance.

However, running stack directly on the ThreadPoolExecutor#execute method would produce too much noise; execute is called from far too many places, so it would be hard to spot the culprit.

So it's better to stack the thread pool's constructor instead:

[arthas@20727]$ stack java.util.concurrent.ThreadPoolExecutor <init>

Affect(class count: 0 , method count: 0) cost in 4 ms, listenerId: 5
No class or method is affected, try:
1. Execute `sm CLASS_NAME METHOD_NAME` to make sure the method you are tracing actually exists (it might be in your parent class).
2. Execute `options unsafe true`, if you want to enhance the classes under the `java.*` package.
3. Execute `reset CLASS_NAME` and try again, your method body might be too large.
4. Check arthas log: /home/jiangxin/logs/arthas/arthas.log
5. Visit https://github.com/alibaba/arthas/issues/47 for more details.

Is this... not supported?

Arthas's enhancement strategy is fairly conservative: system-level classes under the java.* packages are not enhanced by default, and this has to be enabled manually:

[arthas@20727]$ options unsafe true

 NAME    BEFORE-VALUE  AFTER-VALUE
-----------------------------------
 unsafe  false         true

After that, running stack again works. Two minutes later, a call finally shows up:

ts=2021-06-12 12:04:03;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@4e0e2f2a
    @java.util.concurrent.ThreadPoolExecutor.<init>()
        at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:89)
        at XXXService.sendSms(XXXService.java:782)
        //...

And there it is, the offending code... It looks like this:

ExecutorService executorService = Executors.newFixedThreadPool(8);
        
executorService.submit(smsTask);

//...

Just as I guessed: a new pool is created every single call, and it's never shut down.

This also lines up with the symptoms above. The extra memory usage comes from these nearly 10,000 threads: each thread's stack lives outside the heap, and since newFixedThreadPool creates core threads that never time out (and live threads keep their abandoned pools reachable), they just pile up. With no tasks to run, the threads sit in the WAITING state, so they show up in neither CPU utilization nor load. It's easy to miss if you aren't careful; you really have to combine multiple metrics to get the full picture.
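
A rough back-of-the-envelope check (assuming the common 64-bit HotSpot default of a 1 MB thread stack via -Xss; the article never shows the service's actual setting):

~9,800 idle pool threads × 1 MB stack ≈ ~9.8 GB of reserved (virtual) memory.
Resident usage is far smaller, since only the touched stack pages plus some
per-thread native overhead actually count, but with this many threads it can
still easily add up to several hundred megabytes on top of the 2 GB heap,
roughly the gap the ops guy was complaining about.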

Fixing it is simple: make whoever wrote this garbage code change it, then drag them out and sacrifice them to the heavens.
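
For reference, a minimal sketch of what the fix might look like (class and field names here are my own, not from the actual service): either reuse one shared pool for all sendSms calls, or, if a throwaway pool is really needed, shut it down when done.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SmsSender {

    // One shared pool for the whole service, created once and reused.
    // Size 8 mirrors the original newFixedThreadPool(8); tune as needed.
    private static final ExecutorService SMS_POOL = Executors.newFixedThreadPool(8);

    public void sendSms(Runnable smsTask) {
        SMS_POOL.submit(smsTask);
    }

    // If a one-off pool really is required, make sure it gets shut down:
    public void sendSmsWithThrowawayPool(Runnable smsTask) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            pool.submit(smsTask);
        } finally {
            pool.shutdown(); // lets the submitted task finish, then releases the threads
        }
    }
}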

But I'd already talked big in front of the ops guy... The face I finally won back last time is now completely gone.

Owning up to ops

Me: There is indeed a problem with the service. We'll cut an emergency fix and deploy it tonight.

Ops: Didn't you say there was no problem? Where did that confidence go?

Me: That was only because I'd skipped breakfast; I was so hungry I couldn't think straight... I missed the problem. My fault, totally my fault.

Ops: It's definitely that mountain of garbage code of yours. When you have spare time, do code reviews and go over the code, so there are fewer low-level problems in the future. Otherwise I'll be on your case all day.

Who knew the ops guy would push his luck like this: give him an inch and he takes a mile... But since I'm the one at fault, all I can do is go along with it.

Me: Yes, yes, we'll definitely do more code reviews from now on and tighten them up, so this kind of garbage code never gets committed again and affects production. Sorry for the trouble, boss.


Original content is not easy to produce, and unauthorized reprinting is prohibited. If this article helped you, please like/favorite/follow to show your support ❤❤❤❤❤❤

空无

Committed to original writing, focused on sharing practical content on JAVA, networking, IO, JVM, GC, and more.