NET a travel system CPU explosion analysis

I: Background

1. Storytelling

The other day a participant in the training camp found me, said their travel background system appeared to be the case of CPU burst high, burst high after the next can not go down, their own analysis did not find the reason, things are more urgent, let me help to look at what is going on, the hands of the dump, threw me after we analyze it on windbg.

II: WinDbg analysis

1. Why is the CPU so high?

Anyone who has watched the series knows if the CPU is blowing high enough to use the!tp command to verify, the reference output is as follows:


0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 66 Running: 66 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 93
    Unknown Function: 00007ffc744f1750  Context: 000002c3acdad7d8
    AsyncTimerCallbackCompletion TimerInfo@000002bf57193d20
    Unknown Function: 00007ffc744f1750  Context: 000002c3acb2aef0
    ...
    Unknown Function: 00007ffc744f1750  Context: 000002c3ad336718
--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 1 MaxLimit: 1000 MinLimit: 4

From the trigrams ofCPU utilization: 100% It can be confirmed that the CPU is pumped up at this point, and there is also a phenomenon that thisthread pool queueThere is a buildup, and generally speaking a buildup indicates poor downstream processing, which often manifests itself in program seizures, Http timeouts, andCPU burst problem There is no direct correlation, it is a matter of experience, and we must screen it so that we do not fall into the wrong trap.

Next we look at how capable the machine's CPU is, the weaker the capability the more likely it is to be blown out of the water, which can be done with a!cpuid Observation.


0:000> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,85,7  <unavailable>   2095
 1  6,85,7  <unavailable>   2095
 2  6,85,7  <unavailable>   2095
 3  6,85,7  <unavailable>   2095

From the trigrams you can see the currentCPU=4coreIf you have 4 fully loaded Threads, you can easily break them, so where do we go from here? Yes, start with Thread.

2. What the threads are doing

To see what the threads are doing? You can use the~*e !clrstack Just observe and refer to the output as follows:


OS Thread Id: 0x968 (87)
        Child SP               IP Call Site
000000e63babb9a8 00007ffc879a6974 [GCFrame: 000000e63babb9a8] 
000000e63babba78 00007ffc879a6974 [HelperMethodFrame_1OBJ: 000000e63babba78] (Boolean, Int32, )
000000e63babbb90 00007ffc5e735bbf (Int32, ) [f:\dd\ndp\clr\src\BCL\system\threading\ @ 669]
000000e63babbc20 00007ffc5e72e9c5 (Int32, ) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\ @ 3320]
000000e63babbc90 00007ffc5f0cc188 (Int32, ) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\ @ 3259]
000000e63babbd60 00007ffc5f0c9176 `1[[System.__Canon, mscorlib]].GetResultCore(Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\Tasks\ @ 559]
000000e63babbda0 00007ffc1b98f2cf ()
...

From the trigrams, you can see a lot of Wait waits, which are actually code calls to the due to, i.e., mixing synchronization in asynchrony, although this is an issue that can lead to thread starvation, and our goal:high CPU burstNothing to do with it, so we need to look for other threads in the vast call stack that may cause the CPU to explode, after a careful and patient search, we finally found the suspected call stack, and it just so happens that 5 of them are stuck in this location. The reference is as follows:


OS Thread Id: 0x26a8 (35)
        Child SP               IP Call Site
000000e62537b048 00007ffc879a64a4 [HelperMethodFrame: 000000e62537b048] 
000000e62537b190 00007ffc5e68e04a (, ) [f:\dd\ndp\clr\src\BCL\system\ @ 3207]
000000e62537b1e0 00007ffc1dbe85eb xxx.GetParentDeptName_All(`1<>, Int64)
...
000000e62537c870 00007ffc1e0af75b DynamicClass.lambda_method(, , [])
000000e62537c8b0 00007ffc1b29b3b8 +ActionExecutor+c__DisplayClass10.b__9(, [])
000000e62537c8f0 00007ffc1b29a768 (, `2<,>, )
000000e62537c950 00007ffc1b29a18e +d__0.MoveNext()
000000e62537c9c0 00007ffc1b2996ca `1[[System.__Canon, mscorlib]].Start[[+d__0, ]](d__0 ByRef)
000000e62537ca70 00007ffc1b299611 (, )
000000e62537cb30 00007ffc1b299535 (, )
000000e62537cb60 00007ffc1b299504 .b__0(ActionInvoker)
000000e62537cb90 00007ffc1b2994a9 +c__DisplayClass10`1[[+ActionInvoker, ]].b__f()
000000e62537cbe0 00007ffc1b2622f9 +d__5.MoveNext()
000000e62537cc40 00007ffc1b261cfa `1[[System.__Canon, mscorlib]].Start[[+d__5, ]](d__5 ByRef)
000000e62537ccf0 00007ffc1b261bf4 (, , `1<`1<>>)
000000e62537cdd0 00007ffc1b2584d4 +d__0.MoveNext()

After arriving here, no matter whether there is a problem or not, the suspicion is very high, so we have to study it next.GetParentDeptName_All Methods.

3. GetParentDeptName_All Any questions?

To find out if there are any problems, we used ILSpy to observe its source code, which is slightly blurred here due to client privacy issues, as shown in the screenshot below:

From the trigrams, Nima, there really is a while logic, all of a sudden raised my interest, it looks like eight or nine, next to the thread stack GetParentDeptName_All method of thelistDept respond in singingdeptId parameter to dig out, we'll start with thread 35 and use the!clrstack -a parameter to observe the output.


0:035> !clrstack -a
OS Thread Id: 0x26a8 (35)
        Child SP               IP Call Site
000000e62537b048 00007ffc879a64a4 [HelperMethodFrame: 000000e62537b048] 
000000e62537b190 00007ffc5e68e04a (, ) [f:\dd\ndp\clr\src\BCL\system\ @ 3207]
    PARAMETERS:
        str0 (<CLR reg>) = 0x000002c05816f360
        str1 (<CLR reg>) = 0x000002c3e21eaf88
    LOCALS:
        <no data>
        <no data>

000000e62537b1e0 00007ffc1dbe85eb xxx.GetParentDeptName_All(`1<xxxDepts>, Int64)
    PARAMETERS:
        this (0x000000e62537b2c0) = 0x000002c059898590
        listDept (0x000000e62537b2c8) = 0x000002c159803390
        deptId (0x000000e62537b2d0) = 0x00000000000324fb

After getting the address of the listDept, use the.logopen classifier for objects with a handle!mdt -e:2 000002c159803390 The output of all the results are recorded to a text file, in order to protect the privacy of customers, here is not a screenshot, directly on the text.


[20828] 000002c059802b68 (xxxtDepts)
    <DeptId>k__BackingField:0x324fb (System.Int32)
    <ParentDeptId>k__BackingField:0x31347 (System.Int32)
    ...

[20643] 000002c0597f0e30 (xxxtDepts)
    <DeptId>k__BackingField:0x2f240 (System.Int32)
    <ParentDeptId>k__BackingField:0x31347 (System.Int32)
    ...

[20663] 000002c0597f2d18 (xxxtDepts)
    <DeptId>k__BackingField:0x2f254 (System.Int32)
    <ParentDeptId>k__BackingField:0x2f240 (System.Int32)
    ...

From the data in the trigram, the child nodes and parent nodes of items 20643 and 20663 of the listDept array are cyclic, which naturally leads to a dead loop, so this production accident was essentially caused by the data, and the results were fed back to my friend, and were confirmed.

Once the problem was identified, the solution was relatively simple.

Calibration data.
You can set the loop limit, if it exceeds a certain threshold, directly throw an exception, so that you can avoid high CPU explosion caused by machine-level failures

PS: The .net core version of Dictionary does this, see below:

In the .net framework there is a serious incident of a silly hit and run.

III: Summary

My students did not analyze it, I think it should be given Misleading, the real dump analysis may be true and false, false and true, just like this society, need more practice practice it.
图片名称