
.NET: Analyzing a 100% CPU Spike in a Smart Mobility System

Popularity: 360℃ / 2024-08-08 10:32:40

I: Background

1. The story

The other day a friend reached out and said their system was hitting 100% CPU, and asked me to take a look at what was going on. He had already captured a dump. In situations like this I usually ask for a few more captures, but since one was already in hand, let's analyze what we have.

II: WinDbg Analysis

1. Why is the CPU so high?

Since we are talking about 100% CPU, as debuggers we should let the data speak, and we can start with the !tp command:


0:000:x86> !tp
CPU utilization: 100%
Worker Thread: Total: 382 Running: 382 Idle: 0 MaxLimit: 8191 MinLimit: 8
Work Request in Queue: 8694
    Unknown Function: 6f62b650  Context: 4a36bbbc
    Unknown Function: 6f62b650  Context: 4a36e1d4
    Unknown Function: 6f62b650  Context: 4a372384
    Unknown Function: 6f62b650  Context: 239adfec
    Unknown Function: 6f62b650  Context: 4a374994
    Unknown Function: 6f62b650  Context: 239b9e14
    Unknown Function: 6f62b650  Context: 2399fd9c
    ...

From the output, the CPU is indeed at 100% and every thread in the thread pool is busy. On top of that, 8694 work items have piled up in the thread pool queue, so at this point the pool is completely saturated. To find out why, we need the ~*e !clrstack command to observe what each thread is doing at that moment. The output is as follows:


0:000:x86> ~*e !clrstack
OS Thread Id: 0x22f4 (429)
Child SP       IP Call Site
4bc1e060 0000002b [GCFrame: 4bc1e060] 
4bc1e110 0000002b [HelperMethodFrame_1OBJ: 4bc1e110] (Boolean, Int32, )
4bc1e19c 24aad7da (, Int32, Boolean)
4bc1e1ac 2376f0d6 ()
4bc1e1dc 2420bbc6 ()
...
4bc1e234 24206fbe (`2<,>)
4bc1e3e0 216e25f9 DynamicClass.lambda_method(, , [])
4bc1e3f0 238b86b7 (, [])
...
4bc1eee0 2353d448 (IntPtr, IntPtr, IntPtr, Int32)
4bc1efb8 00a9e3c2 [ContextTransitionFrame: 4bc1efb8] 

From the output you can see that 371 threads are currently stuck in a Wait. Why can't they get out of the Wait?

2. Exploring the source code

The only way to find the answer is to look at the source code, which simplified looks like this:


    public IRedisClient GetClient()
    {
        lock (writeClients)
        {
            AssertValidReadWritePool();
            RedisClient inActiveWriteClient;
            while ((inActiveWriteClient = GetInActiveWriteClient()) == null)
            {
                // wait until another thread returns a client to the pool
                if (!Monitor.Wait(writeClients, PoolTimeout))
                {
                    throw new TimeoutException("Redis Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use.");
                }
            }
            // mark the client as Active and hand it out (elided in this simplified version)
            ...
        }
    }

    private RedisClient GetInActiveWriteClient()
    {
        // round-robin scan over the pool, starting from the current write index
        int num = WritePoolIndex % writeClients.Length;
        for (int i = 0; i < ReadWriteHosts.Count; i++)
        {
            int num2 = (num + i) % ReadWriteHosts.Count;
            RedisEndPoint redisEndPoint = ReadWriteHosts[num2];
            for (int j = num2; j < writeClients.Length; j += ReadWriteHosts.Count)
            {
                // a client is free only if it exists, is not in use, and is healthy
                if (writeClients[j] != null && !writeClients[j].Active && !writeClients[j].HadExceptions)
                {
                    return writeClients[j];
                }
            }
        }
        return null;
    }

Reading this code carefully, the main reason threads enter the Wait is that the GetInActiveWriteClient() method returns null, and the exception message tells us this happens when the writeClients pool is exhausted. So is the pool really exhausted? Dig out the writeClients array and examine it with the !dso command.


0:429:x86> !dso
OS Thread Id: 0x22f4 (429)
ESP/REG  Object   Name
...
4BC1E0D0 0ea38d18 []
4BC1E100 0ea38bb0 
...

0:429:x86> !da 0ea38d18
Name:        []
MethodTable: 237af1c0
EEClass:     0129a224
Size:        52(0x34) bytes
Array:       Rank 1, Number of elements 10, Type CLASS
Element Methodtable: 237ae954
[0] 0ea38dd4
[1] 0a9f9f58
[2] 0296e468
[3] 0c9786a0
[4] 0a9fe768
[5] 04a21f24
[6] 0aa0d758
[7] 10946d90
[8] 04a8c8b0
[9] 02a2a2a0

0:429:x86> !DumpObj /d 0ea38dd4
Name:        
MethodTable: 237ae954
EEClass:     2375d154
Size:        152(0x98) bytes
File:        C:\Windows\xxx\
Fields:
...
0129aa48  4000169       7d         1 instance        1 <Active>k__BackingField
...

From the output, the writeClients pool only holds 10 clients, and every one of them has Active=1, so it is no surprise that the method keeps returning null.
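The exhausted-pool behavior is easy to reproduce with a minimal sketch. FakeClient and PoolSketch below are hypothetical stand-ins, not the real library types, and the scan is simplified to a single host: once every pooled client is marked Active, the scan finds nothing and returns null, which is exactly what sends the caller into Monitor.Wait.

```csharp
using System;

// Hypothetical stand-in for the pooled client: just the two flags the scan checks.
class FakeClient
{
    public bool Active;
    public bool HadExceptions;
}

class PoolSketch
{
    public FakeClient[] WriteClients;

    // single-host simplification of the round-robin scan shown above
    public FakeClient GetInActiveWriteClient()
    {
        for (int j = 0; j < WriteClients.Length; j++)
        {
            var c = WriteClients[j];
            if (c != null && !c.Active && !c.HadExceptions)
                return c;
        }
        return null; // pool exhausted -> the caller falls into Monitor.Wait
    }

    static void Main()
    {
        // 10 clients, all Active=1, exactly like the !da output above
        var pool = new PoolSketch { WriteClients = new FakeClient[10] };
        for (int i = 0; i < 10; i++)
            pool.WriteClients[i] = new FakeClient { Active = true };

        Console.WriteLine(pool.GetInActiveWriteClient() == null); // True
    }
}
```

Flip any single client back to Active=false and the scan immediately finds it again, which is why the Wait-ing threads recover as soon as clients are returned.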

3. Why are all the clients in use?

To find this answer, we need to look one level up the call stack at how the method is called; the screenshot is blurred a little here for privacy.

As you can see from the diagram, the problem is a foreach that rents a writeClient on every iteration, draining the pool and leaving a large number of requests blocked indefinitely. And do not forget that 371 threads are contending here at the same time, which makes it far worse.

Next question: how many times does this foreach iterate? Keep digging with !dso.


0:429:x86> !DumpObj /d 077cec20
Name:        `1[[xxxInfo, xxx]]
MethodTable: 241ad794
EEClass:     0193166c
Size:        24(0x18) bytes
File:        C:\Windows\\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
01860eec  4001891        4     System.__Canon[]  0 instance 077e0048 _items
0129c9b0  4001892        c         System.Int32  1 instance      307 _size
0129c9b0  4001893       10         System.Int32  1 instance      307 _version
01296780  4001894        8          0 instance 00000000 _syncRoot
01860eec  4001895        4     System.__Canon[]  0   static  <no information>

From the output, the loop currently needs to run 307 times, which once again confirms the pool exhaustion. I know some careful readers will say: fine, I accept that this causes threads to hang, but I do not see how it drives the CPU to 100%. Read the source code carefully and you will understand: this is the classic lock convoy phenomenon, which appears because the following two conditions are both satisfied:

  1. Multi-threaded foreach high-frequency calls.
  2. Wait causes the thread to pause and enter the wait queue.
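To see the convoy shape in isolation, here is a small hedged sketch (illustrative stand-in code with made-up numbers, not the real system): a group of threads piles up in Monitor.Wait on one lock object, and every release wakes all of them at once via PulseAll, so on each release the whole pack wakes up, contends for the lock, and all but one go straight back to waiting. That repeated mass wakeup and contention is where the CPU goes.

```csharp
using System;
using System.Threading;

class ConvoySketch
{
    static readonly object Gate = new object();
    static int available;   // stands in for free writeClients
    static int acquired;

    static void Worker()
    {
        lock (Gate)
        {
            // same wait-in-a-loop pattern as GetClient above
            while (available == 0)
                Monitor.Wait(Gate);
            available--;
            acquired++;
        }
    }

    public static int Run(int threadCount)
    {
        available = 0; acquired = 0;
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
            (threads[i] = new Thread(Worker)).Start();

        // release "clients" one at a time; each release wakes every waiter,
        // but only one of them can win -- the rest lose the race and re-Wait
        for (int i = 0; i < threadCount; i++)
        {
            lock (Gate) { available++; Monitor.PulseAll(Gate); }
        }
        foreach (var t in threads) t.Join();
        return acquired; // every thread eventually acquires exactly once
    }

    static void Main() => Console.WriteLine(Run(20)); // 20
}
```

With 20 threads this finishes quickly; scale the contention up to the dump's numbers (a 10-client pool, 382 threads, 307 rentals per request) and the wasted wakeups dominate.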

4. How to address the issue

Knowing the causes and consequences, the solution is relatively simple, three approaches:

  1. Change the per-item foreach into batch processing to reduce how often a writeClient is rented.
  2. Increase the writeClient pool size, as described on the official website.
  3. The library version in use is very old (and commercial), so it is best to replace it and eliminate future problems for good.
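As a hedged sketch of approach 1 (FakePool and PooledClient are hypothetical stand-ins that only count rentals, not the real client manager), compare renting a client on every iteration with renting once for the whole batch:

```csharp
using System;

// Hypothetical pool that counts how many times a client is rented.
class FakePool
{
    public int Rentals;
    public PooledClient GetClient() { Rentals++; return new PooledClient(); }
}

class PooledClient : IDisposable
{
    public void Op(int item) { /* one redis call */ }
    public void Dispose() { /* return client to the pool */ }
}

static class BatchingSketch
{
    // before: one rental per element -- 307 rentals per request
    // against a 10-client pool, which is what starved the pool
    public static int PerItem(FakePool pool, int[] items)
    {
        foreach (var item in items)
            using (var c = pool.GetClient())
                c.Op(item);
        return pool.Rentals;
    }

    // after: one rental covers the entire loop
    public static int Batched(FakePool pool, int[] items)
    {
        using (var c = pool.GetClient())
            foreach (var item in items)
                c.Op(item);
        return pool.Rentals;
    }

    static void Main()
    {
        var items = new int[307];
        Console.WriteLine(PerItem(new FakePool(), items)); // 307
        Console.WriteLine(Batched(new FakePool(), items)); // 1
    }
}
```

Dropping from 307 rentals to 1 per request takes the pressure off the 10-client pool, so the Wait (and with it the convoy) largely disappears.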

III: Summary

This production incident analysis was quite interesting: a seemingly pure blocking problem can also drive the CPU through the roof, which goes beyond some people's intuition. In fact it is the classic lock convoy phenomenon. If you have any dump you can't figure out, feel free to find me; as always, analysis is free.
