I haven't posted for about 11 days, and it's really not out of laziness; I had wanted to sit down and write a few days ago. I don't know why so many friends have been asking for help lately: I get two or three dumps a day to analyze. What's a bit interesting is that most of them are long-standing problems that have sat in these friends' companies for months or even years. It is really tiring, but fortunately, being tired means you're going uphill 😂😂😂.

Back to the topic. About a month ago a friend reached out to me on WeChat. He has also been learning how to analyze dumps recently, but perhaps because he is not very experienced yet, he got stuck and couldn't take the analysis any further. The screenshot is as follows:

Although the problems found in dumps come in all shapes and sizes, there are still patterns to follow if you group them into broad categories, such as frequent GC triggering, a large number of locks, and so on; you can find a more detailed summary on my Knowledge Planet. Well, since his analysis couldn't go any further, let's bring out windbg.

Two: Windbg analysis

1. View CPU utilization

Since the report says the CPU is too high, I have to back that up with data, so out comes the old command !tp .


0:057> !tp
CPU utilization: 100%
Worker Thread: Total: 51 Running: 30 Idle: 0 MaxLimit: 400 MinLimit: 4
Work Request in Queue: 11
    Unknown Function: 6a0bbb30  Context: 1b4ca258
    Unknown Function: 6a0bbb30  Context: 1b4ca618
    Unknown Function: 6a0bbb30  Context: 1b4ca758
    Unknown Function: 6a0bbb30  Context: 1cb88d60
    Unknown Function: 6a0bbb30  Context: 1b4ca798
    Unknown Function: 6a0bbb30  Context: 1b5a54d0
    AsyncTimerCallbackCompletion TimerInfo@01f6e530
    Unknown Function: 6a0bbb30  Context: 1b5a5a50
    Unknown Function: 6a0bbb30  Context: 1cb892a0
    Unknown Function: 6a0bbb30  Context: 1b4ca8d8
    Unknown Function: 6a0bbb30  Context: 1cb88da0
--------------------------------------
Number of Timers: 1
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 1 MaxLimit: 400 MinLimit: 4

Wow, the CPU really is maxed out. By the way, a reminder here: CPU utilization: 100% refers to the whole machine and not the program. In other words, when the machine's CPU is at 100%, it is not necessarily caused by the program you took the dump of.
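
If you want to double-check, outside of windbg, how much of the machine a single process is really eating, here is a minimal sketch (not from the original case, purely for illustration) that compares the process's consumed CPU time against wall-clock time multiplied by the logical core count; if this number is far below 100% while the machine is pegged, the culprit is another process.

using System;
using System.Diagnostics;
using System.Threading;

class CpuShareCheck
{
    static void Main()
    {
        // Sample this process's consumed CPU time over a one-second window.
        Process proc = Process.GetCurrentProcess();
        TimeSpan before = proc.TotalProcessorTime;
        Stopwatch sw = Stopwatch.StartNew();

        Thread.Sleep(1000);

        proc.Refresh();
        TimeSpan after = proc.TotalProcessorTime;

        // Share of the whole machine = CPU time used / (wall time * logical cores).
        double share = (after - before).TotalMilliseconds
            / (sw.Elapsed.TotalMilliseconds * Environment.ProcessorCount) * 100;

        Console.WriteLine("this process used about {0:F1}% of total machine CPU", share);
    }
}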

2. Was it triggered by GC?

Facing this unfamiliar dump, start with some experience-based checks, for example: was it caused by a GC being triggered? How do we verify this hypothesis? To get a more reliable answer, use !t -special to export the thread list and see whether there is a GC SuspendEE thread.


0:057> !t -special
ThreadCount:      109
UnstartedThread:  0
BackgroundThread: 74
PendingThread:    0
DeadThread:       35
Hosted Runtime:   no

          OSID Special thread type
       14 2594 DbgHelper 
       15 2be4 GC SuspendEE 
       16  dc4 GC 
       17 2404 GC 
       18  bb4 GC 
       19 2498 Finalizer 
       20 312c ProfilingAPIAttach 
       21  858 Timer 
       22 3a78 ADUnloadHelper 
       27 290c GC 
       28 2e24 GC 
       29 28b0 GC 
       30 1e64 GC 
       38 3b24 ThreadpoolWorker 
       ...
       90 2948 Gate 

Sure enough, there it is, which indicates that a GC was indeed triggered. If you still don't believe it, you can check the coreclr source code.


size_t
GCHeap::GarbageCollectGeneration(unsigned int gen, gc_reason reason)
{
    dprintf (2, ("triggered a GC!"));

    gc_heap::gc_started = TRUE;

    {
        init_sync_log_stats();

#ifndef MULTIPLE_HEAPS
        cooperative_mode = gc_heap::enable_preemptive ();

        dprintf (2, ("Suspending EE"));
        BEGIN_TIMING(suspend_ee_during_log);
        GCToEEInterface::SuspendEE(SUSPEND_FOR_GC);
        END_TIMING(suspend_ee_during_log);
        gc_heap::proceed_with_gc_p = gc_heap::should_proceed_with_gc();
        gc_heap::disable_preemptive (cooperative_mode);
        if (gc_heap::proceed_with_gc_p)
            pGenGCHeap->settings.init_mechanisms();
        else
            gc_heap::update_collection_counts_for_no_gc();

#endif //!MULTIPLE_HEAPS
    }
}

Notice the SuspendEE above; its full name is Suspend CLR Execution Engine. Next, use ~*e !dumpstack to see which thread triggered the GarbageCollectGeneration method in the CLR.

As the figure shows, it was triggered by thread 53. Switch to that thread and look at its managed stack with !clrstack .

From the thread stack we can see the program is doing an XXX.GetAll() operation; the name alone looks a bit scary at first glance. Next, let's look at this piece of source code and see what it actually does. The simplified source is as follows:


        // Simplified (obfuscated) source: run a stored procedure and map every row
        // onto an entity object via reflection; xxxPropertyTreeInfo presumably
        // derives from xxxPropertieInfo in the original project.
        public static List<xxxPropertieInfo> GetAll()
        {
            string text = "xxxProperty_GetAll";
            SqlDatabase val = new SqlDatabase(m_strConnectionString);
            xxxPropertyTreeInfo xxxPropertyTreeInfo = null;
            List<xxxPropertieInfo> list = new List<xxxPropertieInfo>();
            DbCommand storedProcCommand = ((Database)val).GetStoredProcCommand(text);
            using (IDataReader reader = ((Database)val).ExecuteReader(storedProcCommand))
            {
                while (DataBase.DataReaderMoveNext(reader))
                {
                    // one new entity per row, populated by reflection
                    xxxPropertyTreeInfo = new xxxPropertyTreeInfo();
                    xxxPropertyTreeInfo.LoadDataReader(reader);
                    list.Add(xxxPropertyTreeInfo);
                }
            }
            return list;
        }

        public virtual void LoadDataReader(MethodBase method, object obj, IDataReader reader)
        {
            // Copy every column of the current row into a Hashtable
            // (one lowered key string and one value, often boxed, per column).
            Hashtable hashtable = new Hashtable();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                hashtable.Add(reader.GetName(i).ToLower(), reader.GetValue(i));
            }
            // Reflect over the entity's mapped properties and copy the values across.
            Hashtable fieldProperties = GetFieldProperties(method, FieldType.DBField);
            foreach (object key in fieldProperties.Keys)
            {
                PropertyInfo p = (PropertyInfo)fieldProperties[key];
                object v = null;
                if (hashtable.Contains(key))
                {
                    v = hashtable[key];
                }
                if (v != null)
                {
                    SetPropertieValue(ref obj, ref p, ref v);
                }
            }
        }

From the source logic: it executes the stored procedure xxxProperty_GetAll and then, for each row, maps the reader onto an xxxPropertyTreeInfo entity via reflection, and the GC was triggered during this mapping process.
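
To get a feel for why this mapping is so GC-heavy, here is a rough back-of-the-envelope sketch. The column count is an assumption made up for illustration; the row count of 1298 is the list size we will see later in the dump. Everything else follows from the LoadDataReader code above: each row allocates a Hashtable, a lowered key string and a (usually boxed) value per column, plus the entity itself, and all of these temporaries die almost immediately in gen0.

using System;

class MappingCostSketch
{
    static void Main()
    {
        // Purely illustrative arithmetic: 'columns' is an assumption; the row count
        // of 1298 is the list size that shows up later in the dump.
        int rows = 1298;            // rows returned by xxxProperty_GetAll
        int columns = 20;           // assumed column count
        // Per row: 1 Hashtable + (1 key string + 1 value, often boxed) per column + 1 entity.
        long temporariesPerRow = 1 + 2L * columns + 1;
        long temporariesPerCall = temporariesPerRow * rows;   // ≈ 54,500 short-lived objects
        Console.WriteLine("one GetAll() call allocates roughly {0:N0} gen0 objects",
            temporariesPerCall);
    }
}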

3. Is it caused by excessive data?

Based on past experience, this should be caused by pulling too much data out of the database, and well, we have the dump right here, don't we? To find the answer, first use the !dso command to dump all the objects on the thread stack, and then use !do xxx to inspect the List<xxxPropertieInfo> list, as shown in the following figure:

From the figure, the size is not large at all, so why does it cause frequent GC? Even if the reflection produces a lot of small objects, a list this size shouldn't matter that much... This threw me back into contemplation...

4. Find the source of the problem

After a search, I found a few suspicious points.

  1. There are 24 threads executing the XXX.GetAll() method.

  2. 123 of these lists were found in the managed heap, and the largest has a size of 1298, so the total is not small...

0:053> !dumpheap -mt 1b9eadd0
 Address       MT     Size
02572a9c 1b9eadd0       24     
026eca58 1b9eadd0       24     
0273d2a0 1b9eadd0       24 
...

Statistics:
      MT    Count    TotalSize Class Name
1b9eadd0      123         2952 System.Collections.Generic.List`1[[xxxPropertieInfo, xxx.Model]]

0:053> !DumpObj /d 28261894
Name:        System.Collections.Generic.List`1[[xxxPropertieInfo, xxx.Model]]
MethodTable: 1b9eadd0
EEClass:     6e2c6f8c
Size:        24(0x18) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
6e6ff32c  4001891        4     System.__Canon[]  0 instance 23710638 _items
6e6f1bc0  4001892        c         System.Int32  1 instance     1298 _size
6e6f1bc0  4001893       10         System.Int32  1 instance     1298 _version
6e6f0100  4001894        8        System.Object  0 instance 00000000 _syncRoot
6e6ff32c  4001895        4     System.__Canon[]  0   static  <no information>

  3. The program is 32-bit

It can be judged from the memory addresses that the current program is 32-bit, which means its GC segments are small, which in turn means more frequent GC collections.
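
As a side note, bitness is also easy to confirm from inside the process itself; a small sketch (GCSettings.IsServerGC requires .NET 4.5 or later):

using System;
using System.Runtime;

class BitnessCheck
{
    static void Main()
    {
        Console.WriteLine("64-bit process : {0}", Environment.Is64BitProcess); // false for this app
        Console.WriteLine("IntPtr.Size    : {0} bytes", IntPtr.Size);          // 4 on 32-bit
        Console.WriteLine("Server GC      : {0}", GCSettings.IsServerGC);      // needs .NET 4.5+
    }
}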

Three: Summary

The incident was caused by:

  1. Multiple threads frequently and repeatedly calling the GetAll() method, with each call returning a list of size 1298.
  2. An inefficient reflection-based model mapping that generates a lot of small objects for every row.
  3. GC segments that are too small (32 MB) because the process is 32-bit.

The combination of the three caused the GC to be triggered frequently.

The fix is also very simple.

  • The simplest and crudest way: cache a copy of the database query results (see the sketch after this list).
  • A slightly more formal way: replace the inefficient hand-rolled reflection with Dapper, and change the program to 64-bit (also sketched below).
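
As a rough illustration of the first option, here is a minimal sketch (the method name, the lock fields and the 10-minute expiry are all made up for the example) that wraps the existing GetAll() behind a time-based cache, so concurrent callers stop hammering the database and the reflection mapping:

        private static readonly object _cacheLock = new object();
        private static List<xxxPropertieInfo> _cache;
        private static DateTime _cacheExpiresAtUtc;

        public static List<xxxPropertieInfo> GetAllCached()
        {
            // Fast path: serve the cached copy while it is still fresh.
            if (_cache != null && DateTime.UtcNow < _cacheExpiresAtUtc)
                return _cache;

            lock (_cacheLock)
            {
                // Re-check inside the lock so only one thread refreshes the cache.
                if (_cache == null || DateTime.UtcNow >= _cacheExpiresAtUtc)
                {
                    _cache = GetAll();                                   // the original query
                    _cacheExpiresAtUtc = DateTime.UtcNow.AddMinutes(10); // assumed TTL
                }
                return _cache;
            }
        }

And for the second option, Dapper's Query<T> can take over the mapping entirely; something along these lines (a sketch, requiring the Dapper package plus using Dapper; using System.Data; using System.Data.SqlClient; using System.Linq;):

        public static List<xxxPropertieInfo> GetAllWithDapper()
        {
            using (var conn = new SqlConnection(m_strConnectionString))
            {
                return conn.Query<xxxPropertieInfo>("xxxProperty_GetAll",
                    commandType: CommandType.StoredProcedure).ToList();
            }
        }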

I discussed it with my friend; he went with the first option, the CPU finally came down, and everything was calm again!

