More high-quality dry goods: see my GitHub: dotnetfly

One: background

1. Tell a story

This is the story of a friend who added wx to dump for help in mid-June. His program cpu burst high➕UI stuck and asked how to solve it. The screenshot is as follows:

After getting this dump, I found out that this is an MRS program for robotic arms. Haha, there is actually a use of .NET in the field of intelligent robots such as robotic arms, which is a bit beyond my cognition, I don’t know. There are several robotic arms controlled by .NET in the big strong son warehouse that treats employees as brothers. 😄😄😄.

I will not discuss the problem of interface stuck here. I will only discuss how to solve the problem of high cpu. After all, friends who chase this series are all memory leaks and memory explosions. Change your taste. Good.

Two: Windbg analysis

1. Phenomenon verification

Others say that the cpu is high, so I have to use data to prove whether this is really the case. The method is very simple, just use the !tp command.


0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 151 Running: 151 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 1
    AsyncTimerCallbackCompletion TimerInfo@000000001dc25bb0
--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 2 Free: 1 MaxFree: 8 CurrentLimit: 2 MaxLimit: 1000 MinLimit: 4

From the perspective of hexagrams, there are two indicators worthy of attention:

  1. CPU utilization: 100%

This shows that at the moment when the dump was captured, the CPU of the machine was indeed 100%, which was really bad. To be a digression, several friends said that he wanted to capture this 100% dump and found that the remote desktop of Alibaba Cloud could not be connected. It's so embarrassing. . .

  1. Total: 151 Running: 151

There are currently 151 work threads in the thread pool, and 151 are running. What does this mean? It shows that every thread is busy, and it also indicates that the current thread is not enough and urgently needs to recruit people. The current system definitely has a force pushing it.

2. View the thread list

Next, take a look at the current thread list and use the !t command.


0:000> !t
ThreadCount:      171
UnstartedThread:  1
BackgroundThread: 167
PendingThread:    1
DeadThread:       2
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1  7e0 00000000028901c0    26020 Preemptive  00000000049C9360:00000000049C98A8 0000000000602420 1     STA (GC) 
   9    2  df8 00000000028bc850    2b220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Finalizer) 
  11    3  144 000000001fdef570  102a220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Threadpool Worker) 
  14    9  dbc 0000000020703650  202b220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA 
  15   10  5a4 00000000206d5860    2b220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA 
  16   11  17c 00000000206df220    2b220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA 
  17   12  dd4 00000000205e49a0    2b220 Preemptive  00000000049A7A20:00000000049A98A8 0000000000602420 0     MTA 
  18   14  8fc 0000000020495000    2b220 Preemptive  00000000049A5A40:00000000049A78A8 0000000000602420 0     MTA 
  19   17  a84 0000000020817490  202b220 Preemptive  00000000049ADBB0:00000000049AF8A8 0000000000602420 0     MTA 
  ...
 180  167 12b8 0000000026436d70  1021220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Threadpool Worker) 
 181  168 11a4 0000000026437540  1021220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Threadpool Worker) 
 182  169  880 0000000026437d10  1021220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Threadpool Worker) 
 183  170 1334 00000000264384e0  1021220 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     MTA (Threadpool Worker) 
 184  171  278 0000000026438cb0     1400 Preemptive  0000000000000000:0000000000000000 0000000000602420 0     Ukn 

From the images look: ID=1 thread has a GC mark, I went, seems to be triggered GC, here to remind you, the worker thread in GC=Workstation mode, it can act as a recovery GC thread, and this GC=Server mode is different of.

3. View the thread stack

Since the GC is triggered, then dead horse will be a living horse doctor. According to the routines triggered by GC, the basic process is as follows:

  1. To call up all thread stacks, use the !EEStack command.

0:000> !EEStack 
---------------------------------------------
Thread   0
Current frame: ntdll!NtGetContextThread+0xa
Child-SP         RetAddr          Caller, Callee
...
  1. Look for the WaitUntilGCComplete keyword to see how many threads are waiting for GC collection

Use Ctrl+F retrieve, the screenshot is as follows:

It can be seen from the picture: 96 threads are waiting for the GC to complete. By this point, the corners of my mouth have risen, 😁😁😁. . .

  1. Find the try_allocate_more_space keyword to determine what business logic triggered the GC

I'm going, what's the matter? There is no end to this. Since there is no try_allocate_more_space keyword, it means that the current GC is probably not triggered automatically. Then who triggered it? It's a bit weird, huh?

4. How is GC triggered?

To find out, the easiest and rude way is to look at what GC does? Use !clrstack here.


0:000> !clrstack 
OS Thread Id: 0x7e0 (0)
        Child SP               IP Call Site
000000000043e470 00000000778c1fea [InlinedCallFrame: 000000000043e470] System.GC._Collect(Int32, Int32)
000000000043e470 000007feea38ce2a [InlinedCallFrame: 000000000043e470] System.GC._Collect(Int32, Int32)
000000000043e440 000007feea38ce2a System.GC.Collect()
000000000043e4f0 000007fe8bcd29ca xxx.xxx.T_Tick(System.Object, System.EventArgs)
000000000043e520 000007fee3d0ef6f System.Windows.Forms.Timer.OnTick(System.EventArgs)
000000000043e550 000007fee3d076fe System.Windows.Forms.Timer+TimerNativeWindow.WndProc(System.Windows.Forms.Message ByRef)
000000000043e580 000007fee3cea3c3 System.Windows.Forms.NativeWindow.Callback(IntPtr, Int32, IntPtr, IntPtr)
000000000043e620 000007fee43611f1 DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int64, Int32, Int64, Int64)
000000000043e890 000007feeac1221e [InlinedCallFrame: 000000000043e890] System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG ByRef)
000000000043e890 000007fee3d6a378 [InlinedCallFrame: 000000000043e890] System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG ByRef)
000000000043e860 000007fee3d6a378 DomainBoundILStubClass.IL_STUB_PInvoke(MSG ByRef)
000000000043e920 000007fee3cff23e System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)
000000000043ea10 000007fee3cfebd2 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
000000000043eab0 000007fee3cfe9df System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
000000000043eb10 000007fe8b6208a6 xxx.Program.Main()
000000000043ede0 000007feeac16bb3 [GCFrame: 000000000043ede0] 

Let me go, from the thread stack GC.Collect() manually. What kind of business is this? Next, use ip2md + savemodule to xxx.xxx.T_Tick source code of 061381930824a3 to take a look.


0:000> !ip2md 000007fe8bcd29ca
MethodDesc:   000007fe8b50ae90
Method Name:  xxx.xxx.T_Tick(System.Object, System.EventArgs)
Class:        000007fe8b6ac628
MethodTable:  000007fe8b50b080
mdToken:      00000000060002b5
Module:       000007fe8b504118
IsJitted:     yes
CodeAddr:     000007fe8bcd29a0
Transparency: Critical
0:000> !savemodule 000007fe8b504118 D:\dumps\MRS-CPU\T_Tick.dll
3 sections in file
section 0 - VA=2000, VASize=1a85fc, FileAddr=200, FileSize=1a8600
section 1 - VA=1ac000, VASize=5088, FileAddr=1a8800, FileSize=5200
section 2 - VA=1b2000, VASize=c, FileAddr=1ada00, FileSize=200

Open T_Tick.dll with ILSpy, the screenshot is as follows:

From the code logic point of view, my friend did the business logic of triggering a GC for 3 minutes. I don't know what I want to do, so I talked with my friend on wx.

.png)

5. Is it really all about GC?

In fact, many of the dumps with high CPU bursts I have shared are due to frequent triggering of GC, such as large string splicing, misuse of regular expressions, etc., but a gc once every 3 minutes can kill the cpu , If Xiaobai can still fool around, in the eyes of friends who know a little bit, he can't stand scrutiny. The implication is that the real culprit has not been found. . . To find the suspicious culprit, the best way is to perform a carpet search on all thread stacks. The screenshot is as follows:

As you can see from the above figure, there are currently 112 threads stuck at System.Collections.Generic.Dictionary2[[System.Int32, mscorlib],[System.__Canon, mscorlib]].FindEntry(Int32) . You will definitely say that the stuck is caused by the GC triggering and freezing all threads. Of course, this theory does not need to be refuted. It is indeed the case.

I believe that experienced friends will definitely find a problem, that is, a thread-unsafe Dictionary appears in a multi-threaded environment. I also mentioned this problem in a previous car networking CPU explosion analysis. It will lead to FindEntry The weird phenomenon of infinite loop occurs during operation.

So far, the problem of high CPU burst has basically been found.

Three: Summary

This cpu explosion accident is mainly due to:

  1. The use of a non-thread-safe Dictionary in a multithreaded environment leads to an infinite loop.
  2. Calling GC.Collection() periodically makes it worse.

After finding the problem, the solution is simple. It is recommended to change Dictioanry to ConcurrentDictionary and remove the manual call GC.Collection()


一线码农
369 声望1.6k 粉丝