2

One: background

1. Tell a story

At the beginning of this month, a friend added to WeChat to ask him for help. There was a sporadic CPU spike in his program. I hope to solve it for a fee.

From the description, this problem should have been troubled for a long time. It is still a good friend from the hospital. door is 161bfe3b4f3eff 100 pieces of red envelopes 🤣🤣🤣, since it is an occasional burst, artificial is not possible, you have to use procdump automatically catch it, and use procdump -ma -s 5 -n 2 -c 70 w3wp , A few days later, two dumps were generated as desired, which is great, let's analyze it with windbg.

Two: Windbg analysis

1. Is the cpu really high?

We only trust the data. Here we use !tp to look at the cpu value of the machine at this time.


0:062:x86> !tp
CPU utilization: 83%
Worker Thread: Total: 37 Running: 6 Idle: 31 MaxLimit: 8191 MinLimit: 4
Work Request in Queue: 0
--------------------------------------
Number of Timers: 2
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 1 MaxLimit: 1000 MinLimit: 4

From the data point of view, at this time CPU utilization: 83% , nothing wrong.

2. View thread time-consuming

Since it is a sporadic bug, and I also said that it may be caused by what conditions triggered by the doctor's operation, there are also 2 dumps here, so compare the time-consuming of each thread, here only extract the top5.


0:062:x86> .time
Debug session time: Thu Dec 16 14:31:45.000 2021 (UTC + 8:00)
System Uptime: not available
Process Uptime: 0 days 1:20:48.000
  Kernel time: 0 days 0:08:43.000
  User time: 0 days 1:08:19.000

0:062:x86> !runaway
 User Mode Time
  Thread       Time
   62:7188     0 days 0:18:05.343
   44:6c90     0 days 0:16:16.687
   39:86e8     0 days 0:14:57.734
   32:1d8c     0 days 0:01:02.546
   35:23a4     0 days 0:00:58.250

0:062:x86> .time
Debug session time: Thu Dec 16 14:32:00.000 2021 (UTC + 8:00)
System Uptime: not available
Process Uptime: 0 days 1:21:03.000
  Kernel time: 0 days 0:08:45.000
  User time: 0 days 1:08:41.000

0:062:x86> !runaway
 User Mode Time
  Thread       Time
   62:7188     0 days 0:18:11.875
   44:6c90     0 days 0:16:23.156
   39:86e8     0 days 0:15:04.156
   32:1d8c     0 days 0:01:02.546
   35:23a4     0 days 0:00:58.250

From the information point of view, the 15s interval of the dump, relatively speaking, 62,44,39 most time, so these three threads are worth digging.

3. View the thread stack

Next, use ~62s; !clrstack;~44s; !clrstack;~39s; !clrstack cut to these three threads to see the stack situation, as shown in the following figure:

From the stack, there is no user code, which is very embarrassing. I once suspected that it was caused by the synchronization context of the webform, but I still have some experience. Since !clrstack can't be seen, then use !dumpstack .


0:062:x86> !dumpstack
OS Thread Id: 0x7188 (62)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: (MethodDesc 6b0e1b58 +0x1c System.Collections.Generic.ObjectEqualityComparer`1[[System.__Canon, mscorlib]].Equals(System.__Canon, System.__Canon))
ChildEBP RetAddr  Caller, Callee
3867ebfc 6b440484 (MethodDesc 6b0db558 +0x54 System.Collections.Generic.List`1[[System.__Canon, mscorlib]].Contains(System.__Canon))
3867ec18 24bbc3c5 (MethodDesc 25e2ba88 +0x845 xxx.bl_baseInfo.getBljl(System.String, System.String)), calling 2f23072e
3867ec84 6b466d0b (MethodDesc 6b0dcb5c +0x7b System.String.TrimHelper(Int32)), calling (MethodDesc 6b0d1cf4 +0 System.Globalization.CharUnicodeInfo.IsWhiteSpace(Char))
3867ec98 24bbba00 (MethodDesc 2a6eca54 +0x1b8 xxx_blcx.Button1_Click(System.Object, System.EventArgs)), calling (MethodDesc 25e2ba88 +0  xxx.getBljl(System.String, System.String))
3867ecb8 05b5d487 05b5d487
3867ecec 6092da13 (MethodDesc 5fdff5c0 System.Web.UI.WebControls.Button.OnClick(System.EventArgs))
3867ed04 5ffdd1cd (MethodDesc 5fdff5e8 +0xcd System.Web.UI.WebControls.Button.RaisePostBackEvent(System.String))
3867ed1c 5ffdd0fd (MethodDesc 5fdff5e0 +0xd System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(System.String))
...

It's really weird. How did the xxx.bl_baseInfo.getBljl unmanaged stack? This is the first time I encountered it. From the stack, the program xxx.bl_baseInfo.getBljl() method, and then use !dso the heap objects.


0:062:x86> !dso
Error requesting heap segment b4fe0000
Failed to retrieve segments for gc heap
Unable to determine bounds of gc heap

Let me go, the stack of this dump is destroyed. It may be caused by the high burst of cpu, or it may be badly grasped. This is too torturous, so I can only use kb to find method parameters on the unmanaged stack.


0:062:x86> kb
 # ChildEBP RetAddr      Args to Child              
00 3867ebfc 6b440484     cd0a25a8 124e2c7c 0efb330c mscorlib_ni!System.Collections.Generic.ObjectEqualityComparer`1[System.__Canon].Equals(System.__Canon, System.__Canon)$##6003913+0x1c
01 3867ec18 24bbc3c5     cd0a25a8 132b35e4 132b35cc mscorlib_ni!System.Collections.Generic.List`1[System.__Canon].Contains(System.__Canon)$##600398F+0x54
WARNING: Frame IP not in any known module. Following frames may be wrong.
02 3867ec98 24bbba00     0e3aead8 8412256c 3867ecc0 0x24bbc3c5
03 3867ecb8 05b5d487     0a3d6f00 3867f170 5381fbca 0x24bbba00
04 3867ecec 6092da13     0a3d6e48 00000000 132a20c0 0x5b5d487
05 3867ed04 5ffdd1cd     124ca1a8 80208dfc 80208dfc System_Web_ni![COLD] System.Web.UI.WebControls.Button.OnClick(System.EventArgs)$##60029E3+0xb
...

Next we look at the !do 132b35cc see what list it is.


0:062:x86> !do 132b35cc
Name:        System.Collections.Generic.List`1[[xxx.Model.me_zyblbr, xxx]]
MethodTable: 29f36c8c
EEClass:     6b0aedc4
Size:        24(0x18) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
6b4aea10  4001871        4     System.__Canon[]  0 instance 8e8054e0 _items
6b513c04  4001872        c         System.Int32  1 instance   233139 _size
6b513c04  4001873       10         System.Int32  1 instance   233139 _version
6b512104  4001874        8        System.Object  0 instance 00000000 _syncRoot
6b4aea10  4001875        4     System.__Canon[]  0   static  <no information>

As you can see from the output, this list=23w record is at list.Contains . With this information, the source code can be exported. The simplified code is as follows:


        public IList<xxx> getBljl(string as_search, string as_ztbz)
        {
            IList<me_zyblbr> list = new List<me_zyblbr>();
            using (CDataBase cDataBase = new CDataBase("xxx"))
            {
                var text = "select xxxx  from xxx";
                OracleDataReader oracleDataReader = cDataBase.SetReader(text);
                while (oracleDataReader.Read())
                {
                    if (!list.Contains(me_zyblbr))
                    {
                        list.Insert(0, me_zyblbr);
                    }
                }
                oracleDataReader.Close();
                return list;
            }
            return list;
        }

Sharp-eyed friend certainly noted that in the case of large volumes of data, where list.Insert(0, me_zyblbr); have a big problem, after all list.Insert complexity is O(N) , for 23w for the total time of the complex is:

n(n-1)/2 = 23w(23w-1)/2 = 26,450,000,000 = 26.4 billion.

Then three such threads lift up the cpu together.

4. What sql statement is causing it?

Although the problem with 161bfe3b50006e has been found, what my friend cares most is what the doctor entered that caused the sql to query such a large amount of data. I don’t know if the doctor needs to deduct the money or they have to explain to it 😂😂😂, because of the heap stack have been damaged, and find it is very troublesome, I used sos in !lno, !dumpheap are being given a thorough lying nest, and finally think the next sosex also has a !mdso command, and finally all roses found a significant OracleParameter parameter.


0:062:x86> !mdso
Thread 62:
Location          Object            Type
------------------------------------------------------------
EDI:      132b35cc  System.Collections.Generic.List`1[[xxx.me_zyblbr, xxx]]
3867ec08  124e2c7c  System.Collections.Generic.ObjectEqualityComparer`1[[xxx.me_zyblbr, xxx]]
3867ec44  132b3a5c  Oracle.DataAccess.Client.OracleParameter

0:062:x86> !mdt 132b3a5c
132b3a5c (Oracle.DataAccess.Client.OracleParameter)
    __identity:NULL (System.Object)
    m_pOpoPrmValCtx:4e691200 (System.UIntPtr)
    m_paramName:125fe6f0 (System.String) Length=5, String=":xxx"
    m_sourceColumn:NULL (System.String)
    m_sourceVersion:0x200 (System.Data.DataRowVersion)
    m_dbType:0x0 (System.Data.DbType)
    m_oraDbType:0x77 (NVarchar2) (Oracle.DataAccess.Client.OracleDbType)
    m_bOracleDbTypeExSet:false (System.Boolean)
    m_maxSize:0xffffffff (System.Int32)
    m_maxArrayBindSize:NULL (System.Int32[])
    m_nullable:false (System.Boolean)
    m_value:132b3af8 (System.String) Length=6, String="%高血压病%"

It turned out that the doctor made a vague inquiry about a hypertension. . .

But the main thing I want to tell you here is that when the memory is destroyed to a certain extent, so don't be afraid that sos is completely lying down. There may be other plug-ins that can save us from the fire. One more plug-in has one more way.

Three: Summary

In general, this occasional CPU explosion accident is relatively low-level. I List.Insert very well. It may also be due to the debt owed in order to catch up with the business. It is also relatively simple to change the issue. Use add to send to the list, and finally reorder it according to the rules.


一线码农
366 声望1.6k 粉丝