One: Background
1. Tell a story
I have written up quite a few real-world cases of memory skyrocketing lately, which is getting a bit numbing, so let me change the flavor and share a case of a CPU spike. Some time ago a friend reached out to me on WeChat and said that one of his old projects kept receiving alerts that
CPU > 90%
which was quite embarrassing.
Since he came to me, there was nothing for it but to take the dump and analyze it with windbg; what else could I do?
Two: windbg analysis
1. Examining the scene
Since the alert says CPU > 90%, let me first verify whether that is really the case.
0:359> !tp
CPU utilization: 100%
Worker Thread: Total: 514 Running: 514 Idle: 0 MaxLimit: 2400 MinLimit: 32
Work Request in Queue: 1
Unknown Function: 00007ff874d623fc Context: 0000003261e06e40
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 2 Free: 2 MaxFree: 48 CurrentLimit: 2 MaxLimit: 2400 MinLimit: 32
From this reading it is quite a spectacle: the CPU is pegged at 100%, and all 514 threads in the thread pool are running flat out. So what are they all busy doing? My first suspicion is that these threads are blocked on some lock.
2. View the synchronization block table
To inspect the lock situation, start with the synchronization block table; after all, everyone likes to use lock for multi-threaded synchronization. It can be viewed with the !syncblk command.
0:359> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
53 000000324cafdf68 498 0 0000000000000000 none 0000002e1a2949b0 System.Object
-----------------------------
Total 1025
CCW 3
RCW 4
ComClassFactory 0
Free 620
Well, this reading looks strange: what on earth is MonitorHeld=498? The textbooks say the owner counts +1 and each waiter counts +2, so with an owner present the number you see should always be odd. What does an even number mean, then? After consulting the magical StackOverflow, it boils down to the following two situations:
- Memory corruption
Hitting this is harder than winning the lottery, and I firmly believe that kind of luck will not land on me...
- Lock convoy
Some time ago I shared a real-world case, "A recorded analysis of a .NET CPU spike at a travel agency website", which was also a CPU spike caused by a lock convoy. What a small world, I've run into it again... To make it easier to follow, I'll post the picture.
Once you have seen this picture you should get the idea: threads compete for the lock so frequently within their time slices that it is easy to catch a moment when the thread holding the lock has just released it and none of the waiting threads has actually acquired it yet; the dump happened to be captured in exactly that gap. In other words, the current 498 consists entirely of waiters, i.e. 498 / 2 = 249 waiting threads. That is easy to verify: dump out the stacks of all threads and search for the Monitor.Enter keyword.
As the figure shows, 220 threads are currently stuck at Monitor.Enter, so 29 seem to be unaccounted for; either way, a large number of threads are blocked on the lock. Judging from the stacks, they get stuck after setting the context in xxx.Global.PreProcess. To satisfy my curiosity, let's export the problem code.
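As an aside, before we look at the actual problem code, here is a minimal, self-contained sketch of the access pattern that produces a lock convoy. This is my own illustration, not code from the dump: many threads grab the same lock over and over for a trivially small piece of work, so the CPU time goes into contention and context switching rather than the work itself.
using System;
using System.Threading.Tasks;

class LockConvoySketch
{
    private static readonly object _sync = new object();
    private static long _counter;

    static void Main()
    {
        var tasks = new Task[64];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                for (int j = 0; j < 1000000; j++)
                {
                    // every iteration fights for the same lock...
                    lock (_sync)
                    {
                        _counter++; // ...to protect a trivially small critical section
                    }
                }
            });
        }
        Task.WaitAll(tasks);
        Console.WriteLine(_counter);
    }
}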
3. View the problem code
Still the old combination of commands: !ip2md + !savemodule.
0:359> !ip2md 00007ff81ae98854
MethodDesc: 00007ff819649fa0
Method Name: xxx.Global.PreProcess(xxx.JsonRequest, System.Object)
Class: 00007ff81966bdf8
MethodTable: 00007ff81964a078
mdToken: 0000000006000051
Module: 00007ff819649768
IsJitted: yes
CodeAddr: 00007ff81ae98430
Transparency: Critical
0:359> !savemodule 00007ff819649768 E:\dumps\PreProcess.dll
3 sections in file
section 0 - VA=2000, VASize=b6dc, FileAddr=200, FileSize=b800
section 1 - VA=e000, VASize=3d0, FileAddr=ba00, FileSize=400
section 2 - VA=10000, VASize=c, FileAddr=be00, FileSize=200
Then open the saved module with ILSpy; the screenshot of the problem code is as follows:
Sure enough, every DataContext.SetContextItem() call takes a lock, which perfectly matches the lock convoy pattern.
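The screenshot itself is not reproduced here, but the shape of the code is roughly the following hypothetical reconstruction; the field names and the dictionary are my assumptions, and only the SetContextItem-takes-a-lock pattern comes from the dump:
using System.Collections.Generic;

public class DataContext
{
    // a single process-wide lock guarding the shared context items (assumed)
    private static readonly object _locker = new object();
    private static readonly Dictionary<string, object> _items = new Dictionary<string, object>();

    public void SetContextItem(string key, object value)
    {
        // every request serializes on this one lock just to write a single entry,
        // which is exactly the fine-grained, high-frequency locking that breeds a lock convoy
        lock (_locker)
        {
            _items[key] = value;
        }
    }
}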
4. Is that really the end of it?
I was about to write up the report, but since I already had more than 500 thread stacks dumped out and some time on my hands, I gave them another scan. Unexpectedly, I found 134 threads stuck at ReaderWriterLockSlim.TryEnterReadLockCore, as shown in the following figure:
As the name suggests, this is the slimmed-down, optimized version of the reader-writer lock: ReaderWriterLockSlim. Why are 134 threads stuck here? Curious again, I exported this problem code as well.
internal class LocalMemoryCache : ICache
{
    private string CACHE_LOCKER_PREFIX = "xx_xx_";

    private static readonly NamedReaderWriterLocker _namedRwlocker = new NamedReaderWriterLocker();

    public T GetWithCache<T>(string cacheKey, Func<T> getter, int cacheTimeSecond, bool absoluteExpiration = true) where T : class
    {
        T val = null;
        // one ReaderWriterLockSlim per cache key
        ReaderWriterLockSlim @lock = _namedRwlocker.GetLock(cacheKey);
        try
        {
            @lock.EnterReadLock();
            val = (MemoryCache.Default.Get(cacheKey) as T);
            if (val != null)
            {
                return val;
            }
        }
        finally
        {
            @lock.ExitReadLock();
        }
        try
        {
            @lock.EnterWriteLock();
            // re-check under the write lock before building the value
            val = (MemoryCache.Default.Get(cacheKey) as T);
            if (val != null)
            {
                return val;
            }
            val = getter();
            CacheItemPolicy cacheItemPolicy = new CacheItemPolicy();
            if (absoluteExpiration)
            {
                cacheItemPolicy.AbsoluteExpiration = new DateTimeOffset(DateTime.Now.AddSeconds(cacheTimeSecond));
            }
            else
            {
                cacheItemPolicy.SlidingExpiration = TimeSpan.FromSeconds(cacheTimeSecond);
            }
            if (val != null)
            {
                MemoryCache.Default.Set(cacheKey, val, cacheItemPolicy);
            }
            return val;
        }
        finally
        {
            @lock.ExitWriteLock();
        }
    }
}
Reading the above code, the intent is apparently to implement a GetOrAdd operation on top of MemoryCache, and for safety's sake each cacheKey is given its own ReaderWriterLockSlim. The logic is a bit odd, because MemoryCache itself already provides thread-safe methods that implement exactly this, for example:
public class MemoryCache : ObjectCache, IEnumerable, IDisposable
{
    public override object AddOrGetExisting(string key, object value, DateTimeOffset absoluteExpiration, string regionName = null)
    {
        if (regionName != null)
        {
            throw new NotSupportedException(R.RegionName_not_supported);
        }
        CacheItemPolicy cacheItemPolicy = new CacheItemPolicy();
        cacheItemPolicy.AbsoluteExpiration = absoluteExpiration;
        return AddOrGetExistingInternal(key, value, cacheItemPolicy);
    }
}
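One thing worth knowing when using it as a GetOrAdd: AddOrGetExisting returns the value that was already cached, or null when the value you passed in was the one inserted. A quick illustration with made-up key and values:
// requires: using System; using System.Runtime.Caching;
object first = MemoryCache.Default.AddOrGetExisting("demo_key", "v1", DateTimeOffset.Now.AddMinutes(5));  // null: "v1" was just inserted
object second = MemoryCache.Default.AddOrGetExisting("demo_key", "v2", DateTimeOffset.Now.AddMinutes(5)); // "v1": an entry already existed, "v2" is discarded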
5. Is there any problem with ReaderWriterLockSlim?
Haha, I'm sure many friends will ask exactly that 😅😅😅. Indeed, what could be wrong with it? First, how many ReaderWriterLockSlim instances does the _namedRwlocker collection currently hold? That is easy to verify: just search the managed heap.
0:359> !dumpheap -type System.Threading.ReaderWriterLockSlim -stat
Statistics:
MT Count TotalSize Class Name
00007ff8741631e8 70234 6742464 System.Threading.ReaderWriterLockSlim
You can see that there are currently 70,000+ ReaderWriterLockSlim instances on the managed heap. So what? Don't forget that ReaderWriterLockSlim carries the word Slim precisely because it spins in user mode before blocking, and that spinning burns a bit of CPU. Amplify that a few hundred times over, and how could the CPU not shoot up?
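The NamedReaderWriterLocker class was not exported above, but judging from how it is used it is presumably something along the lines of the sketch below (the internals and names are my assumptions): one lock created on demand per cache key and never discarded, which is exactly how 70,000+ instances pile up on the heap.
using System.Collections.Concurrent;
using System.Threading;

internal class NamedReaderWriterLocker
{
    private readonly ConcurrentDictionary<string, ReaderWriterLockSlim> _locks =
        new ConcurrentDictionary<string, ReaderWriterLockSlim>();

    public ReaderWriterLockSlim GetLock(string name)
    {
        // one ReaderWriterLockSlim per distinct cache key, kept forever;
        // every Enter*Lock call on a contended lock first spins in user mode before parking the thread
        return _locks.GetOrAdd(name, _ => new ReaderWriterLockSlim());
    }
}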
Three: Summary
Overall, this dump points to two reasons why the CPU is maxed out.
- The frequent contention and context switching caused by the lock convoy gave the CPU one crit.
- The user-mode spinning of ReaderWriterLockSlim, amplified a few hundred times over, gave the CPU another crit.
Once the cause is known, the remedies are straightforward.
- Batch the work so the serialized lock is taken far less often, instead of contending on it for every tiny write.
- Drop the per-key ReaderWriterLockSlim and rely on the thread-safe methods MemoryCache already provides, as sketched below.
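For the second point, here is a minimal sketch of what a simplified GetWithCache could look like on top of MemoryCache alone; it assumes absolute expiration only and accepts that the getter may run more than once under a race, which is usually fine for a cache.
using System;
using System.Runtime.Caching;

internal class LocalMemoryCache
{
    public T GetWithCache<T>(string cacheKey, Func<T> getter, int cacheTimeSecond) where T : class
    {
        // fast path: the entry is already cached
        T cached = MemoryCache.Default.Get(cacheKey) as T;
        if (cached != null)
        {
            return cached;
        }

        T created = getter();
        if (created == null)
        {
            return null;
        }

        var policy = new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.Now.AddSeconds(cacheTimeSecond)
        };

        // AddOrGetExisting is thread-safe: it returns the value another thread
        // cached first, or null when our value was the one inserted
        T existing = MemoryCache.Default.AddOrGetExisting(cacheKey, created, policy) as T;
        return existing ?? created;
    }
}
If a single execution of the getter must be guaranteed, the usual trick is to cache a Lazy<T> wrapper instead of the raw value.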
For more quality content, see my GitHub: dotnetfly