4

01 Background

In view of the CPU spikes, deadlocks, and suspended threads that may occur in some services, it is very important to summarize and refine the ideas and solutions for troubleshooting. The above problems will involve thread stacks. This article will explain the function of thread stacks in combination with actual cases.

02 Basic knowledge

2.1 What is a thread stack

The thread stack is the running state of the thread at a certain moment in the system (ie, an instant snapshot).

Thread stack information contains

  • Thread name, ID, number of threads
  • The running state of the thread, the state of the lock (which thread is the lock held by which thread is waiting for the lock)

<!---->

  • The call stack (that is, the calling hierarchy of functions). The call stack contains the full class name, the method executed, the number of lines of source code

2.2 Thread stack can analyze problem types

Thread stack positioning problems can only locate problems that leave traces on the current thread

2.3 The problem that the thread stack cannot be analyzed

The thread stack cannot be located on the thread stack without leaving a trace:

  • Data confusion caused by concurrency bugs does not have any track in the thread stack, so this kind of problem thread stack cannot provide any help.
  • The problem of database lock table, the table is locked, often because a transaction is not committed/rolled back, but this information cannot be represented in the stack

Problems that leave no trace on the thread can only be located by other means. In a real system, the problems of the system are divided into several types:

  • If the problem can be shown in the stack, use the thread stack to locate
  • The problem location that cannot leave traces in the thread needs to rely on a good log design
  • Very hidden problems that can only rely on rich code experience, such as data confusion caused by multi-threading, and the ghost code mentioned later

2.4 How to output thread stack

The kill -3 pid command can only print the stack information of the java process at that moment. It is suitable for use in abnormal situations such as slow server response, rapid cpu and memory surge, etc. It can easily locate the java class that caused the abnormality, and solve problems such as deadlocks , system abnormality caused by connection timeout and other reasons. The command does not kill the process.

Remark:

Write screen output to a file, overwriting file contents

Add screen output to end of file

Linux common commands

Orderdescribe
psFind the pid of a process
pstackPrint the stack information of a process or thread
straceCount the time spent in each step of the system call

2.5 Analysis of thread stack - analysis of thread state

2.5.1 Overview

2.5.2 Runnable

The thread is in a running state from the point of view of the virtual machine.

So does a thread in RUNNABLE necessarily consume CPU? Actually not necessarily. The thread stack below indicates that the thread is reading data from the network. Although the thread below shows the RUNNABLE state, in fact, the network IO and thread are suspended most of the time. Only when the data arrives, the thread is reawakened. . The suspension occurs in the native code (Native), and the virtual machine does not know it at all. Unlike methods such as Java's sleep() or wait() that are explicitly called, the virtual machine can know the true state of the thread, but for the native code. Suspended, the virtual machine has no way of really knowing the thread state, so it simply says RUNNABLE. Socket IO operations like this do not consume a lot of CPU, because most of the time is waiting, and only after the data arrives, it consumes a little CPU.

Thread-39" daemon prio=1 tid=0x08646590 nid=0x666d runnable [5beb7000..5beb88b8]
      java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:201)
        - locked <0x47bfb940> (a java.io.BufferedInputStream)
        at org.postgresql.PG_Stream.ReceiveChar(PG_Stream.java:141)
        at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:68)
        - locked <0x47bfb758> (a org.postgresql.PG_Stream)
        at org.postgresql.Connection.ExecSQL(Connection.java:398)

The thread below is executing pure Java code instructions and is really a thread that consumes CPU.

"Thread-444" prio=1 tid=0xa4853568 nid=0x7ade runnable [0xafcf7000..0xafcf8680]
      java.lang.Thread.State: RUNNABLE
//实实在在再对应CPU运算指令
at org.apache.commons.collections.ReferenceMap.getEntry(Unknown Source) at org.apache.commons.collections.ReferenceMap.get(Unknown Source)
at org.hibernate.util.SoftLimitMRUCache.get(SoftLimitMRUCache.java:51) at org.hibernate.engine.query.QueryPlanCache.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.getNativeSQLQueryPlan()
at org.hibernate.impl.AbstractSessionImpl.list()
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:164)
at com.mogoko.struts.logic.user.LeaveMesManager.getCommentByShopId()
at com.mogoko.struts.action.shop.ShopIndexBaseInfoAction.execute() ......

The following thread is making a JNI native method call. Whether it consumes CPU depends on the implementation of TcpRecvExt. If TcpRecvExt is pure operation code, it is actually consuming CPU. If there is suspended code in TcpRecvExt(), then the thread Although it is displayed as RUNNABLE, it actually does not consume CPU.

 "ClientReceiveThread" daemon prio=1 tid=0x99dbacf8 nid=0x7988 runnable [...]
      java.lang.Thread.State: RUNNABLE
        at com.pangu.network.icdcomm.htcpapijni.TcpRecvExt(Native Method)
        at com.pangu.network.icdcomm.IcdComm.receive(IcdComm.java:60)
        at com.msp.client.MspFactory$ClientReceiveThread.task(MspFactory.java:333)
        at com.msp.system.TaskThread.run(TaskThread.java:94)

2.5.3 TIMED_WAITING(on object monitor)

indicates that the current thread is suspended for a period of time, indicating that the thread is executing the obj.wait(int time) method.

The thread stack below indicates that the current thread is in the TIMED_WAITING state and is currently being suspended for the duration specified in the parameter, such as obj.wait(2000). So the thread is not currently consuming CPU.

 "JMX server" daemon prio=6 tid=0x0ad2c800 nid=0xdec in Object.wait() [...]
      java.lang.Thread.State: TIMED_WAITING (on  object  monitor)
       at java.lang.Object.wait(Native Method)
       - waiting on <0x03129da0> (a [I)
       at com.sun.jmx.remote.internal.ServerComm$Timeout.run(ServerComm.java:150)
       - locked <0x03129da0> (a [I)
       at java.lang.Thread.run(Thread.java:620)

2.5.4 TIMED_WAITING(sleeping)

indicates that the current thread is suspended for a period of time, that is, the Thread.sleep(int time) method is being executed.

The following thread is in the TIMED_WAITING state, which means that it is currently suspended for a period of time, the duration is the duration specified in the parameter, such as Thread.sleep(100000). So the thread is not currently consuming CPU.

 "Comm thread" daemon prio=10 tid=0x00002aaad4107400 nid=0x649f waiting on condition
    [0x000000004133b000..0x000000004133ba00]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
            at java.lang.Thread.sleep(Native Method)
            at org.apache.hadoop.mapred.Task$1.run(Task.java:282)
            at java.lang.Thread.run(Thread.java:619)

2.5.5 TIMED_WAITING(parking)

The current thread is suspended for a period of time, that is, the Thread.sleep(int time) method is being executed.

The following thread is in the TIMED_WAITING state, which means that it is currently suspended for a period of time as specified in the parameter, such as LockSupport.parkNanos(blocker, l10000). So the thread is not currently consuming CPU.

"RMI TCP" daemon prio=6 tid=0x0ae3b800 nid=0x958 waiting on condition [0x17eff000..0x17effa94]
 java.lang.Thread.State: TIMED_WAITING (parking)
       at sun.misc.Unsafe.park(Native Method)
       - parking to wait for  <0x02f49f58> (a java.util.concurrent.SynchronousQueue$TransferStack)
       at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:179)
       at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
       at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
       at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:871)
       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:495)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:693)
       at java.lang.Thread.run(Thread.java:620)

2.5.6 WAINTING(on object monitor)

The current thread is suspended, that is, executing the obj.wait() method (wait() method without parameters).

The following thread is in the WAITING state, indicating that the current thread is suspended, such as obj.wait() (can only be woken up by notify()). So the thread is not currently consuming CPU.

 "IPC Client" daemon prio=10 tid=0x00002aaad4129800 nid=0x649d in Object.wait() [0x039000..0x039d00]
       java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            - waiting on <0x00002aaab3acad18>; (aorg.apache.hadoop.ipc.Client$Connection)
            at java.lang.Object.wait(Object.java:485)
            at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
            - locked <0x00002aaab3acad18> (aorg.apache.hadoop.ipc.Client$Connection)
            at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

2.5.7 Summary

Threads in the TIMED_WAITING and WAITING states must not consume CPU. Threads in Runable must be judged based on the nature of the current thread code to determine whether they consume CPU.

  • If pure java operation code, it consumes CPU
  • If it is network IO, it rarely consumes CPU
  • If it is native code, it can be judged based on the nature of the native code (the native thread stack can be obtained through pstack/gstack), if it is pure operation code, it will consume CPU, if it is suspended, it will not consume CPU, if it is IO, it will not consume much. consumes CPU.

03 CPU soaring

3.1 Overview

There are many reasons for the high CPU of the system. Here are some common scenarios:

3.2 The infinite loop causes the CPU to soar

An infinite loop does not necessarily lead to 100% utilization of the CPU. If the code in the infinite loop is not CPU-intensive, but like Socket or database IO operations, these operations do not consume much CPU. But if there are CPU-intensive operations in the looping code, that's the likely cause of 100% CPU utilization.

The method for locate the problem is: by printing the thread stack multiple times

The main process of is as follows:

  1. Get the first stack information
  2. Wait for a certain period of time, and then obtain the stack information for the second time.
  3. Preprocess the stack information twice, first remove the thread in the sleeping or waiting state, because this thread does not consume CPU
  4. Compare the preprocessed threads of the first stack and the second stack, and find out the threads that have been active during this period. If the same thread in the two stacks is in the same calling context, it should be listed as a key suspect. The specific situation needs to be analyzed in detail in combination with the code. A guess as to possible reasons for this is:

Guess 1: Check whether the code corresponding to the execution context of the thread is long-running code.

Guess 2: If it does not belong to the long-running code, then why this thread cannot execute the code for a long time, the possible reason is that the code has an infinite loop.

Causes an infinite loop:

  • Multithreading: An infinite loop caused by the exit condition in the for, while loop is never satisfied.
  • Multithreading: data structures such as linked lists are connected end to end, resulting in traversal that can never be stopped
  • its problem

3.3 Non-infinite loop causes CPU soaring

The possible cause of the CPU soaring:

The method for locate the problem is: by printing the thread stack

The main process of is as follows:

Step 1: Get the target process id (pid), use the command to get the process ID, jps -l or ps -ef | grep java

Step 2: Obtain the local thread ID that consumes the most CPU under the process through top -Hp pid.

Step 3: Print the thread stack, use the following command, jstack -l pid

Step 4: Convert the decimal local thread ID to hexadecimal, use the following command: printf "0x%x\n" 53841 (local thread ID), output Oxd251.

Step 5: Find nid=<the thread ID that consumes the most CPU time obtained in the first step (the local thread ID converted from decimal to hexadecimal)> in the java thread stack. The specific analysis is as follows

  • If the corresponding thread ID is found in the java thread stack, and the thread is executing pure java code, it means that the CPU is too high due to the java code.
  • If the corresponding thread ID is found in the java thread stack, and the thread is executing the native code code, it means that the high CPU problem caused by the code is in the JNI call.
  • If the corresponding thread ID cannot be found in the java thread stack, there are two possibilities:
    • To execute the re-created thread in the JNI call, there is no information about the thread in the java thread stack.
    • The CPU is too high due to the code of the virtual machine itself, such as frequent Full GC caused by exhaustion of heap memory, or virtual machine bugs, etc.
    • At this time, you can also obtain all the local thread stacks through the pstack pid command, and find the corresponding thread in the local thread stack according to the thread ID that took the most CPU time previously obtained, that is, the thread with high CPU consumption. With the help of native thread stack information, you can directly locate the native code to find the problem.

04 System suspended animation

4.1 Overview

System suspended animation is only a superficial phenomenon. The system suspended animation, ostensibly because the system did not handle the response. For the web system, the http request returns no data. All in all, the system is dead. There are many reasons for the suspended animation of the system, and specific problems need to be analyzed in specific scenarios, but the general reasons are as follows:

Remarks: In special cases, the system is suspended. When a service provides http and rpc services to the outside world, the two use different thread pools. Even if the threads in one of the thread pools are deadlocked, the other thread pool can still provide services.

4.2 Thread deadlock

4.2.1 Deadlock and positioning methods

what deadlock

A deadlock occurs when two or more threads are waiting for a lock held by each other. Deadlock will cause two threads to be unable to continue running and be suspended forever.

The way to locate the problem is: print the thread stack. If a deadlock occurs, the virtual machine will automatically give a deadlock prompt when printing the thread stack.

4.2.2 Background

On the refund-interfaces service on 2021.12.13 11:15, the machine 10.240.49.153 operation command mounts the sandbox, and operates the recording on the traffic playback platform at 11:32:19. At this time, the return operation fails, and the problem of the recording failure is solved. Investigation, at 11:34, the order service has an interface reporting service timeout, but from the refund-interface monitoring, the monitoring of this 10.240.49.153 does not exist, and the others are normal. It was only at 11:59 that the machine was found still In dubbo's registration center, there was no better way to save the scene at that time, and the machine could only be restarted, which broke the scene. The initial guess after the fact is that in the case of high concurrency, the recording function of the operating platform leads to a deadlock. When a deadlock occurs, in order to restore the system, the temporary and only way to avoid it is to restart the system.

In order to ensure the availability of the service, in the event of an abnormal situation, the downgrade plan and troubleshooting ideas

The usual downgrade scenarios are:

[Option 1] Preservation on-site method - removal of flow method, currently our company's plan is not perfect

1. For http, traffic needs to be removed from the gateway, that is, consul

2. For rpc, traffic needs to be removed from dubbo's registration center, which is nacos

3. For mq, the consumer side of the service access MQ removes the traffic

[Option 2] All-purpose method - machine restart

The usual problem-solving idea is: try to save the site as much as possible to facilitate troubleshooting afterwards! ! ! !

Note: For the occurrence of deadlock, if the site is not reserved, the method to be found is through stress testing. Because the potential for deadlock in the code does not mean that deadlock occurs every time, it only happens when it should happen. When deadlock occurs, it is often the most unfortunate time - in high-load production under the environment

4.2.3 Reproducing deadlock through stress testing

4.2.3.1 Phenomenon

There is no monitoring at 2021-12-27 11:58:15, the pressure test exists, the service process exists, and the thread stack dump is observed at 12:02, as above

4.2.3.2 dump thread stack
Found one Java-level deadlock:
=============================
"DubboServerHandler-10.246.144.236:20888-thread-200":
  waiting to lock monitor 0x00007faa6c00c128 (object 0x00000006bd2a0000, a com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader),
  which is held by "com.alibaba.nacos.client.naming.updater"
"com.alibaba.nacos.client.naming.updater":
  waiting to lock monitor 0x00007fa91c0568b8 (object 0x00000006bcf0de00, a com.alibaba.jvm.sandbox.core.classloader.ModuleJarClassLoader),
  which is held by "XNIO-1 task-44"
"XNIO-1 task-44":
  waiting to lock monitor 0x00007fa904004848 (object 0x0000000722c10c48, a java.lang.Object),
  which is held by "com.alibaba.nacos.client.naming.updater"

Java stack information for the threads listed above:
===================================================
"DubboServerHandler-10.246.144.236:20888-thread-200":
    at com.pandora.jvm.sandbox.repeater.plugin.dubbo.DubboProviderInvocationProcessor.assembleIdentity(DubboProviderInvocationProcessor.java:39)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.api.DefaultEventListener.sample(DefaultEventListener.java:194)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.api.DefaultEventListener.onEvent(DefaultEventListener.java:89)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleEvent(EventListenerHandler.java:117)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleOnBefore(EventListenerHandler.java:353)
    at java.com.alibaba.jvm.sandbox.spy.Spy.spyMethodOnBefore(Spy.java:164)
    at org.apache.dubbo.rpc.filter.ContextFilter.invoke(ContextFilter.java)
    at org.apache.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:81)
    at org.apache.dubbo.rpc.filter.GenericFilter.invoke(GenericFilter.java:147)
    at org.apache.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:81)
    at org.apache.dubbo.rpc.filter.ClassLoaderFilter.invoke(ClassLoaderFilter.java:38)
    at org.apache.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:81)
    at org.apache.dubbo.rpc.filter.EchoFilter.invoke(EchoFilter.java:41)
    at org.apache.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:81)
    at org.apache.dubbo.rpc.protocol.dubbo.DubboProtocol$1.reply(DubboProtocol.java:145)
    at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.handleRequest(HeaderExchangeHandler.java:100)
    at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.received(HeaderExchangeHandler.java:175)
    at org.apache.dubbo.remoting.transport.DecodeHandler.received(DecodeHandler.java:51)
    at org.apache.dubbo.remoting.transport.dispatcher.ChannelEventRunnable.run(ChannelEventRunnable.java:57)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:51)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:55)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:775)
"com.alibaba.nacos.client.naming.updater":
    at java.lang.ClassLoader.checkCerts(ClassLoader.java:881)
    - waiting to lock <0x00000006bcf0de00> (a com.alibaba.jvm.sandbox.core.classloader.ModuleJarClassLoader)
    at java.lang.ClassLoader.preDefineClass(ClassLoader.java:662)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:755)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.access$400(RoutingURLClassLoader.java:22)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader$1.loadClass(RoutingURLClassLoader.java:91)
    at com.alibaba.jvm.sandbox.core.classloader.ClassLoadingLock.loadingInLock(ClassLoadingLock.java:55)
    - locked <0x0000000722c10c48> (a java.lang.Object)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.loadClass(RoutingURLClassLoader.java:64)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
    - locked <0x00000006bd2a0000> (a com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader)
    at com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader.loadClass(PluginClassLoader.java:47)
    - locked <0x00000006bd2a0000> (a com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at com.pandora.jvm.sandbox.repeater.plugin.exception.ExceptionEventListener.sample(ExceptionEventListener.java:34)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.api.DefaultEventListener.onEvent(DefaultEventListener.java:89)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleEvent(EventListenerHandler.java:117)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleOnBefore(EventListenerHandler.java:353)
    at java.com.alibaba.jvm.sandbox.spy.Spy.spyMethodOnBefore(Spy.java:164)
    at java.lang.RuntimeException.<init>(RuntimeException.java:62)
    at java.lang.IllegalArgumentException.<init>(IllegalArgumentException.java:52)
    at java.lang.NumberFormatException.<init>(NumberFormatException.java:55)
    at java.lang.Long.parseLong(Long.java:552)
    at java.lang.Long.parseLong(Long.java:631)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1845)
    - locked <0x0000000722cc9928> (a sun.net.www.protocol.http.HttpURLConnection)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    - locked <0x0000000722cc9928> (a sun.net.www.protocol.http.HttpURLConnection)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at com.alibaba.nacos.common.http.client.response.JdkHttpClientResponse.getStatusCode(JdkHttpClientResponse.java:75)
    at com.alibaba.nacos.common.http.client.handler.AbstractResponseHandler.handle(AbstractResponseHandler.java:43)
    at com.alibaba.nacos.common.http.client.NacosRestTemplate.execute(NacosRestTemplate.java:483)
    at com.alibaba.nacos.common.http.client.NacosRestTemplate.exchangeForm(NacosRestTemplate.java:427)
    at com.alibaba.nacos.client.naming.net.NamingProxy.callServer(NamingProxy.java:599)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:524)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:491)
    at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:486)
    at com.alibaba.nacos.client.naming.net.NamingProxy.queryList(NamingProxy.java:400)
    at com.alibaba.nacos.client.naming.core.HostReactor.updateService(HostReactor.java:339)
    at com.alibaba.nacos.client.naming.core.HostReactor$UpdateTask.run(HostReactor.java:420)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:51)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:55)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:775)
"XNIO-1 task-44":
    at com.alibaba.jvm.sandbox.core.classloader.ClassLoadingLock.loadingInLock(ClassLoadingLock.java:53)
    - waiting to lock <0x0000000722c10c48> (a java.lang.Object)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.loadClass(RoutingURLClassLoader.java:64)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.dewuext.TraceExtension.dewuTrace(TraceExtension.java:51)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.api.DefaultEventListener.onEvent(DefaultEventListener.java:76)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleEvent(EventListenerHandler.java:117)
    at com.alibaba.jvm.sandbox.core.enhance.weaver.EventListenerHandler.handleOnBefore(EventListenerHandler.java:353)
    at java.com.alibaba.jvm.sandbox.spy.Spy.spyMethodOnBefore(Spy.java:164)
    at io.jaegertracing.internal.JaegerSpan.finish(JaegerSpan.java)
    at com.poizon.fusion.monitor.instrument.jdbc.MonitorPreparedStatement.execute(MonitorPreparedStatement.java:97)
    at io.shardingsphere.shardingjdbc.jdbc.core.statement.MasterSlavePreparedStatement$3.executeSQL(MasterSlavePreparedStatement.java:152)
    at io.shardingsphere.shardingjdbc.jdbc.core.statement.MasterSlavePreparedStatement$3.executeSQL(MasterSlavePreparedStatement.java:149)
    at io.shardingsphere.spi.executor.MasterSlaveSQLExecuteCallback.execute0(MasterSlaveSQLExecuteCallback.java:17)
    at io.shardingsphere.shardingjdbc.jdbc.core.statement.MasterSlavePreparedStatement.execute(MasterSlavePreparedStatement.java:155)
    at com.poizon.fusion.monitor.instrument.jdbc.MonitorPreparedStatement.execute(MonitorPreparedStatement.java:89)
    at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:64)
    at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:79)
    at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:63)
    at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:324)
    at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:156)
    at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:109)
    at com.github.pagehelper.PageInterceptor.intercept(PageInterceptor.java:108)
    at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
    at com.sun.proxy.$Proxy419.query(Unknown Source)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:147)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:140)
    at sun.reflect.GeneratedMethodAccessor490.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
    at com.sun.proxy.$Proxy192.selectList(Unknown Source)
    at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:223)
    at org.apache.ibatis.binding.MapperMethod.executeForMany(MapperMethod.java:147)
    at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:80)
    at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:93)
    at com.sun.proxy.$Proxy193.selectByExample(Unknown Source)
    at com.shizhuang.duapp.refund.infrastructure.db.dao.RefundOrderDao.getReturnBySubOrderNoList(RefundOrderDao.java:1049)
    at com.shizhuang.duapp.refund.infrastructure.db.dao.RefundOrderDao$$FastClassBySpringCGLIB$$41f7ca21.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:750)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.adapter.MethodBeforeAdviceInterceptor.invoke(MethodBeforeAdviceInterceptor.java:56)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:93)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:689)
    at com.shizhuang.duapp.refund.infrastructure.db.dao.RefundOrderDao$$EnhancerBySpringCGLIB$$28f9b058.getReturnBySubOrderNoList(<generated>)
    at com.shizhuang.duapp.refund.application.service.refund.app.impl.QueryRefundDetailInBuyerOrderDetailService.hasRefund(QueryRefundDetailInBuyerOrderDetailService.java:147)
    at com.shizhuang.duapp.refund.interfaces.facade.refund.app.RefundOrderDesc2OldOrderDetailController.hasRefund(RefundOrderDesc2OldOrderDetailController.java:150)
    at com.shizhuang.duapp.refund.interfaces.facade.refund.app.RefundOrderDesc2OldOrderDetailController$$FastClassBySpringCGLIB$$da692b9e.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:750)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
    at com.poizon.fusion.monitor.instrument.openfeign.aspect.FeignClientAspect.invoke(FeignClientAspect.java:109)
    at sun.reflect.GeneratedMethodAccessor486.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:93)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:689)
    at com.shizhuang.duapp.refund.interfaces.facade.refund.app.RefundOrderDesc2OldOrderDetailController$$EnhancerBySpringCGLIB$$25d2f6d1.hasRefund(<generated>)
    at sun.reflect.GeneratedMethodAccessor497.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:105)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:893)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:798)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:665)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:750)
    at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129)
    at com.shizhuang.duapp.dubbo.environment.ServletEnvironmentFilter.doFilter(ServletEnvironmentFilter.java:28)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at com.alibaba.druid.support.http.WebStatFilter.doFilter(WebStatFilter.java:124)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:88)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at com.shizhuang.duapp.sentinel.client.servlet.SentinelServletFilter.doFilter(SentinelServletFilter.java:90)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:94)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at com.poizon.fusion.webmvc.logging.AccessLoggingFilter.doFilterInternal(AccessLoggingFilter.java:63)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:114)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:104)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at com.poizon.fusion.monitor.instrument.spring.web.server.servlet.MonitorServletFilter.doFilter(MonitorServletFilter.java:170)
    at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
    at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
    at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84)
    at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62)
    at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:68)
    at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36)
    at io.undertow.servlet.handlers.RedirectDirHandler.handleRequest(RedirectDirHandler.java:68)
    at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:132)
    at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46)
    at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64)
    at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60)
    at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77)
    at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:269)
    at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:78)
    at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:133)
    at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:130)
    at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48)
    at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43)
    at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:249)
    at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:78)
    at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:99)
    at io.undertow.server.Connectors.executeRootHandler(Connectors.java:376)
    at io.uXNIO-1 task-44":
    at com.alibaba.jvm.sandbox.core.classloader.ClassLoadingLock.loadingInLock(ClassLoadingLock.java:53)
    - waiting to lock <0x0000000722c10c48> (a java.lang.Object)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.loadClass(RoutingURLClassLoader.java:64)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.dewuext.TraceExtension.dewuTrace(TraceExtension.java:51)ndertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:51)
    at com.alibaba.ttl.TtlRunnable.run(TtlRunnable.java:55)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:775)

Found 1 deadlock.

Remark:

There are some troublesome deadlock scenarios, such as the deadlock that occurs during the class loading process, especially when the framework uses a lot of custom class loading, because it is often not in the code base of the application itself, and tools such as jstack may not be able to display it. All lock information.

4.2.3.3 dump file analysis

thread XNIO-1

XNIO-1 task-44":
    at com.alibaba.jvm.sandbox.core.classloader.ClassLoadingLock.loadingInLock(ClassLoadingLock.java:53)
    - waiting to lock <0x0000000722c10c48> (a java.lang.Object)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.loadClass(RoutingURLClassLoader.java:64)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at com.pandora.jvm.sandbox.repeater.plugin.core.impl.dewuext.TraceExtension.dewuTrace(TraceExtension.java:51)

Thread com.alibaba.nacos.client.naming.updater

"com.alibaba.nacos.client.naming.updater":
    at java.lang.ClassLoader.checkCerts(ClassLoader.java:881)
    - waiting to lock <0x00000006bcf0de00> (a com.alibaba.jvm.sandbox.core.classloader.ModuleJarClassLoader)
    at java.lang.ClassLoader.preDefineClass(ClassLoader.java:662)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:755)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.access$400(RoutingURLClassLoader.java:22)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader$1.loadClass(RoutingURLClassLoader.java:91)
    at com.alibaba.jvm.sandbox.core.classloader.ClassLoadingLock.loadingInLock(ClassLoadingLock.java:55)
    - locked <0x0000000722c10c48> (a java.lang.Object)
    at com.alibaba.jvm.sandbox.core.classloader.RoutingURLClassLoader.loadClass(RoutingURLClassLoader.java:64)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
    - locked <0x00000006bd2a0000> (a com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader)
    at com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader.loadClass(PluginClassLoader.java:47)
    - locked <0x00000006bd2a0000> (a com.pandora.jvm.sandbox.repeater.module.classloader.PluginClassLoader)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at com.pandora.jvm.sandbox.repeater.plugin.exception.ExceptionEventListener.sample(ExceptionEventListener.java:34)

Combined with code analysis

thread XNIO-1

Thread com.alibaba.nacos.client.naming.updater

4.2.3.4 Conclusion

In high concurrency situations, class loading locks can lead to deadlocks

4.2.3.5 Solutions

Load RepeatCache and Tracer classes early

4.3 Thread suspended animation

Thread suspended animation means that the thread is always running and does not exit.

The way to locate the problem is: print thread stack

The business needs to use spark for data processing, so the spark task is uploaded to the server for execution, but the main thread hangs all the time. By printing the thread stack, extract some of the thread stacks, as follows:

"main" #1 prio=5 os_prio=0 tid=0x00007f0d98009800 nid=0x6240b runnable [0x00007f0d9ec49000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at com.mysql.cj.protocol.ReadAheadInputStream.fill(ReadAheadInputStream.java:107)
    at com.mysql.cj.protocol.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:150)
    at com.mysql.cj.protocol.ReadAheadInputStream.read(ReadAheadInputStream.java:180)
    - locked <0x000000072028a7b8> (a com.mysql.cj.protocol.ReadAheadInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
    at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:81)
    at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:63)
    at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:45)
    at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:514)
    at com.mysql.cj.protocol.a.NativeProtocol.readServerCapabilities(NativeProtocol.java:475)
    at com.mysql.cj.protocol.a.NativeProtocol.beforeHandshake(NativeProtocol.java:362)
    at com.mysql.cj.protocol.a.NativeProtocol.connect(NativeProtocol.java:1367)
    at com.mysql.cj.NativeSession.connect(NativeSession.java:133)
    at com.mysql.cj.jdbc.ConnectionImpl.connectWithRetries(ConnectionImpl.java:842)
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:823)
    - locked <0x0000000722c7fc40> (a com.mysql.cj.jdbc.ConnectionImpl)
    at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:448)
    at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:241)
    at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:198)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
    - locked <0x0000000722c7fe18> (a org.apache.spark.security.SecurityConfigurationLock$)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$$Lambda$1606/936828380.apply(Unknown Source)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:48)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    - locked <0x0000000722c96c18> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan$$Lambda$1492/203829039.apply(Unknown Source)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    at org.apache.spark.sql.execution.SparkPlan$$Lambda$1516/376234567.apply(Unknown Source)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
    - locked <0x0000000722ca3c78> (a org.apache.spark.sql.execution.QueryExecution)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter$$Lambda$1379/456240898.apply(Unknown Source)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$$$Lambda$1387/862090614.apply(Unknown Source)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.execution.SQLExecution$$$Lambda$1380/1008095885.apply(Unknown Source)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
    at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:817)
    at com.pandora.spark.task.job.EsSparkSqlRecordDataHandle.main(EsSparkSqlRecordDataHandle.java:111)

   Locked ownable synchronizers:
    - None

Through the above stack analysis, it can be found that the status of the main thread is: RUNNABLE, and it is blocked in the method

ReadAheadInputStream.fill(ReadAheadInputStream.java:107), by checking the configuration, it is found that the relevant parameters such as connectTimeout and socketTimeout are not configured. After configuring the above timeout parameters, other errors appear.

05 The system is running slower and slower

5.1 Overview

System slowness is generally caused by the following reasons:

  1. Insufficient memory caused by heap memory leaks, causing the system to get slower and slower until it stops.
  2. Too small Xmx setting causes insufficient heap memory, causing the system to become slower and slower until it stops.
  3. The system has an infinite loop, consuming too much CPU. Refer to the content of the previous section
  4. Insufficient resources cause the sexual system to run more and more slowly.

5.2 Memory leaks

5.2.1 Positioning method

The basis for judging whether there is a memory leak in the system is: if there is a memory leak in the system, the memory value after garbage collection should continue to rise. If this phenomenon can be observed in the field, it means that there is a memory leak in the system. When a memory leak is suspected in a system, first use the FULL GC information to perform a preliminary confirmation of the memory leak to confirm whether there is a memory leak in the system. Only check whether the available memory value after full garbage collection has been increasing again, the steps are as follows:

  • First intercept the GC information after the system is running stably (for example, the initialization has been completed), this is very important, the information during the unstable running period has no analytical value, because you cannot confirm whether the memory growth is normal growth or abnormal growth caused by memory leaks .
  • Filter out rows with FULL GC. Only FULL GC lines have analytical value. Because the memory after the GC is done is the amount of memory that the current Java object really uses. There are generally two possibilities for the system:
    • If the memory after garbage collection continues to grow and tends to grow to the value set by Xmx, then it can basically be concluded that there is a memory leak in the system at this time.
    • If the memory grows to a certain value after the current garbage collection is completed, it can fall back again, and it is generally in a dynamic balance, then the memory leak can basically be ruled out.

After analyzing the memory usage trend above, you can basically determine whether there is a heap memory leak in the system. Of course, this GC information analysis can only tell you whether there is a heap memory leak in the system, but it cannot tell you where the leak is. The precise location of the memory leak is to find the specific location of the memory leak. It is necessary to dump the memory stack and use its memory stack analysis tool to find the class or object that really causes the memory leak.

5.2.2 Cases

jvm monitor:

memory stack analysis:

Through the memory stack, locate the following code

Reason guess:

1. According to the memory monitoring of JVM, the leak is a few days, and the speed of the leak is a straight line (approximately constant speed). If the memory leak is caused by business calls, there should be ups and downs. It is guessed that it may be the logic of the scheduled task calling dubbo.

Verify

5.3 Insufficient resources

5.3.1 Features and positioning methods

Features: If the resources are insufficient, there will be a large number of threads waiting for resources. If the printed thread stack has this feature, it means that the system resources are the bottleneck. For performance degradation caused by insufficient resources, there will be a large number of threads in the printed thread stack that stop in the same calling context.

Insufficient resources, or inappropriate use of resources, often manifests as a performance issue. The system gets slower and slower and eventually stops responding. When encountering problems such as system slowdown, the way to locate the problem is to print the thread stack.

5.3.2 Possible causes

5.3.3 Cases

Note: The thread stack was not printed to the service at that time.

From the monitoring point of view, the connection pool of the database is full (the connection pool was configured at 60 at that time). From the client's point of view, the cpu and memory of the client's stand-alone machine were not fully used at that time, and from the server (mysql) point of view, there was no abnormality in the db. So it can be seen that the bottleneck is stuck on the database connection.

From the analysis of specific business scenarios, there are two solutions to this problem:

Option 1: Resource isolation method. Because from the perspective of this business, only the qps of the traffic reporting and playback reporting interfaces are high, while the qps of other interfaces of the platform are low, and they are the basic operation methods of the platform background. One database connection pool is used for traffic reporting and playback reporting, and another connection pool is used for other interfaces of the platform.

Option 2: Improve the connection pool of the database. The disadvantage of this method is that it will affect the page operation in the background, and the page will be stuck.

06 Performance Analysis

6.1 Overview

Thread stack for performance analysis Usage scenarios: Performance bottleneck location in multi-threaded situations. In particular, improper use of locks can lead to performance bottlenecks.

Improved performance means doing more with less. When the operation of a thread is blocked by a specific resource, it is called a restricted resource: restricted database, restricted processing power of the peer.

The ultimate goal of performance tuning is that the CPU utilization of the system is close to 100%. If the CPU of the system is not fully utilized, there may be the following problems: the pressure applied is insufficient, and the system under test may not be added to sufficient pressure (load). Failure rate and CPU usage. If the pressure increases, the system begins to experience partial service failures, the response time of the system slows down, or the CPU usage cannot increase any more, then the pressure at this time should be the saturation pressure of the system. That is, the capability at this time is the current maximum capability of the system.

There is a bottleneck in the system: When the system is under saturation pressure, if the CPU usage is not close to 100%, it means that there is still room for improvement in the performance of the system. If the system has the following problems, you can use the thread stack to find the performance bottleneck method to locate the problem.

  • Continues to run slowly. From time to time, the application is found to be slow. The overall response time cannot be effectively improved by changing environmental factors (such as the number of database connections, etc.)
  • System performance degrades gradually over time. When the load is stable, the longer the system runs, the slower the speed will be. It may be due to exceeding a certain threshold range, the system operates frequently with errors, resulting in system deadlock or crash.
  • The system performance gradually decreases as the load increases. As the number of users increases, the application runs more slowly. When several users log out of the system, the application can resume normal operation.

6.2 Common performance bottlenecks

  • Resource contention due to improper synchronization
    • Two unrelated functions share a lock or different shared variables share a lock, creating resource contention needlessly
    • The granularity of the lock is too large. After the access to the shared resource is completed, the subsequent code is not placed outside the synchronized synchronization code block.
  • Abuse of sleep: sleep is only suitable for waiting for a fixed period of time. If the polling code is mixed with sleep() calls, this design must be a bad design. This design can lead to serious performance bottlenecks in some cases.
  • String + abuse
  • Inappropriate threading model: In multi-threaded situations, if the threading model is not appropriate, it will also lead to poor performance.
  • Inefficient SQL statements or improper database design
  • Insufficient number of threads: When using a thread pool, if the thread configuration of the thread pool is too small, it will also lead to poor performance.
  • Frequent GC caused by memory leaks: Memory leaks will lead to more and more frequent GCs, and GC operations are CPU-intensive operations, and frequent GCs will seriously degrade the overall performance of the system.

6.3 Stress test found performance bottle

Characteristics of performance bottlenecks

  • There is only one current performance bottleneck, and only when this one is solved can the next one b

得物技术
846 声望1.5k 粉丝