.NET crash analysis of a laboratory autosampling system

I: Background

1. The story

The other day a friend contacted me on WeChat and said that their program had crashed at a customer site, and asked me to help figure out what was going on. I got the dump, so let's get to work analyzing it.

II: WinDbg Analysis

1. Where did it crash?

If the program crashed, there is naturally a reason. As the saying goes, if someone wears leather pants over cotton pants there must be a reason: either the leather pants are too thin or the cotton pants have no fleece. Use the !analyze -v command and look at the exception information below.


0:107> !analyze -v

CONTEXT:  (.ecxr)
rax=0000005e0dc7c4a0 rbx=0000005e0dc7c400 rcx=0000005e0dc7c4a0
rdx=0000000000000000 rsi=0000005e0dc7c3f0 rdi=0000005e0dc7c4a0
rip=00007ffb1ecfc223 rsp=0000005e0dc7c3c0 rbp=0000005e0dc7c4c0
 r8=00000000000004d0  r9=0000000000000000 r10=0000000000000000
r11=0000005e0dc7c4a0 r12=0000000000000000 r13=000002079d450220
r14=000002079b93aba0 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000200
coreclr!EEPolicy::HandleFatalError+0x7f:
00007ffb`1ecfc223 488d442440      lea     rax,[rsp+40h]
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ffb1ec6d70f (coreclr!ProcessCLRException+0x00000000000d9f7f)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000001
NumberParameters: 0

From the output, this is a classic access violation, but the crash is reported inside EEPolicy::HandleFatalError. The HandleFatalError method is mainly used to tidy up the exception context before the error is raised, and it is solid code that generally does not cause problems. Either way, let's look at what is stored at rsp+40h.


0:107> dp rsp+40h L1
0000005e`0dc7c400  00000001`c0000005
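
To make the dumped quadword easier to read, here is a minimal sketch of how it splits into two 32-bit values. It assumes the memory at rsp+40h holds the start of an EXCEPTION_RECORD, whose first two fields are ExceptionCode and ExceptionFlags; that assumption agrees with the .exr output shown earlier. The names ExceptionHeader and split_quadword are made up for this illustration.

```cpp
#include <cstdint>

// First two fields of an EXCEPTION_RECORD as they sit in memory (both 32-bit).
// This struct is a hypothetical helper, not the Windows definition.
struct ExceptionHeader {
    std::uint32_t code;   // ExceptionCode
    std::uint32_t flags;  // ExceptionFlags
};

// Split a quadword, as printed by `dp`, into its two DWORDs.
// x64 is little-endian, so the low DWORD is the one at the lower address.
constexpr ExceptionHeader split_quadword(std::uint64_t q) {
    return ExceptionHeader{
        static_cast<std::uint32_t>(q & 0xFFFFFFFFu),  // low half: ExceptionCode
        static_cast<std::uint32_t>(q >> 32)           // high half: ExceptionFlags
    };
}
```

Feeding it 0x00000001c0000005 gives code c0000005 and flags 1, the same pair that .exr -1 reports as ExceptionCode and ExceptionFlags.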

The c0000005 above is clearly an access violation. The context here looks a bit muddled and this is not the first crash site, so I won't dwell on it. How do we find the real crash site? One way is to look for the user function just before the return point of RaiseException or KiUserExceptionDispatch in the call stack, as shown below:


0:107> .ecxr
0:107> k
  *** Stack trace for last set context - .thread/.cxr resets it
 # Child-SP          RetAddr               Call Site
00 0000005e`0dc7c3c0 00007ffb`1ec6d72e     coreclr!EEPolicy::HandleFatalError+0x7f [D:\a\_work\1\s\src\coreclr\vm\ @ 776] 
01 0000005e`0dc7c9d0 00007ffb`5235292f     coreclr!ProcessCLRException+0xd9f9e [D:\a\_work\1\s\src\coreclr\vm\ @ 1036] 
02 0000005e`0dc7cc00 00007ffb`52302554     ntdll!RtlpExecuteHandlerForException+0xf
03 0000005e`0dc7cc30 00007ffb`5235143e     ntdll!RtlDispatchException+0x244
04 0000005e`0dc7d340 00000000`6c942893     ntdll!KiUserExceptionDispatch+0x2e
05 0000005e`0dc7daf0 00007ffa`c066ed7b     libxxx_manage!get_clean_xxx
06 0000005e`0dc7db70 00007ffa`c06b73a4     0x00007ffa`c066ed7b
...

From the output, the program crashed at libxxx_manage!get_clean_xxx. It looks like a dynamic link library written in C++, which is a little awkward.

2. Why did the C++ library crash?

The best way to find the answer is to look at the assembly code around 000000006c942893, shown below:


0:107> ub 00000000`6c942893
libxxx_manage!get_clean_xxx:
00000000`6c942876 55              push    rbp
00000000`6c942877 53              push    rbx
00000000`6c942878 4883ec68        sub     rsp,68h
00000000`6c94287c 488dac2480000000 lea     rbp,[rsp+80h]
00000000`6c942884 48894d00        mov     qword ptr [rbp],rcx
00000000`6c942888 c745dc00000000  mov     dword ptr [rbp-24h],0
00000000`6c94288f 488b4500        mov     rax,qword ptr [rbp]

0:107> u 00000000`6c942893
00000000`6c942893 488b00          mov     rax,qword ptr [rax]

0:107> dp rbp L1
0000005e`0dc7c4c0  00000000`00000000

From the assembly above, this is the prologue of the get_clean_xxx method. The problem is that the value at rbp is 0, and that value comes from rcx. According to the x64 calling convention, rcx holds the first argument of the method, so it looks like this argument is null, as the following output suggests:


0:107> !address rcx

Usage:                  Stack
Base Address:           0000005e`0dc78000
End Address:            0000005e`0dc80000
Region Size:            00000000`00008000 (  32.000 kB)
State:                  00001000          MEM_COMMIT
Protect:                00000004          PAGE_READWRITE
Type:                   00020000          MEM_PRIVATE
Allocation Base:        0000005e`0db00000
Allocation Protect:     00000004          PAGE_READWRITE
More info:              ~107k

0:107> dp rcx L1
0000005e`0dc7c4a0  00000000`00000000
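
To tie the disassembly back to source code, here is one plausible reading of the prologue as C++. It is only a sketch: the Channel type, its impl field, and the name get_clean_sketch are assumptions, since the dump does not reveal the real signature or structure layout.

```cpp
// Hypothetical stand-in for the real channel type; its layout is an assumption.
struct Channel {
    void* impl;  // first field, the one `mov rax, qword ptr [rax]` would read
};

// Rough shape suggested by the prologue (all names and types are assumptions).
extern "C" int get_clean_sketch(Channel* channel) {
    int result = 0;              // mov dword ptr [rbp-24h], 0
    void* impl = channel->impl;  // mov rax, [rbp] ; mov rax, [rax]
                                 // this read faults if channel is null
    if (impl != nullptr) {
        result = 1;
    }
    return result;
}
```

With a valid pointer the first field is read normally. With a null pointer the same load touches address 0, which is the kind of read that surfaces as c0000005.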

3. Is the get_clean_xxx parameter really null?

This is relatively easy to check: use !clrstack to look at the C# code that sits above the P/Invoke call.


0:107> !clrstack
OS Thread Id: 0x3508 (107)
        Child SP               IP Call Site
0000005E0DC7DBA0 00007ffac066ed7b [InlinedCallFrame: 0000005e0dc7dba0] xxx_LibPInvoke.xxx_clean_query(IntPtr)
0000005E0DC7DB70 00007ffac066ed7b ILStubClass.IL_STUB_PInvoke(IntPtr)
0000005E0DC7DC30 00007ffac06b73a4 xxx+c__DisplayClass11_0.<xxxQueryClean>b__0(IntPtr)
...

The next step is to see how the C# code in the managed layer is written; the screenshot is below:

As the screenshot shows, the code never checks whether the xxxChannel passed to C++ is null, and that is what causes the crash. Is there any other corroboration? Actually, there is. If the symbols are good enough, you can also use !clrstack -a to find the value of xxxChannel that was passed down.


0:107> !clrstack -a
OS Thread Id: 0x3508 (107)
        Child SP               IP Call Site
0000005E0DC7DBA0 00007ffac066ed7b [InlinedCallFrame: 0000005e0dc7dba0] xxx_LibPInvoke.xxx_clean_query(IntPtr)
0000005E0DC7DB70 00007ffac066ed7b ILStubClass.IL_STUB_PInvoke(IntPtr)
    PARAMETERS:
        <no data>

0000005E0DC7DC30 00007ffac06b73a4 xxx+c__DisplayClass11_0.<xxxQueryClean>b__0(IntPtr)
    PARAMETERS:
        this (0x0000005E0DC7DC80) = 0x0000020a9d9ca8d8
        xxxChannel (0x0000005E0DC7DC88) = 0x0000000000000000
    LOCALS:
        0x0000005E0DC7DC6C = 0x0000000000000000
        0x0000005E0DC7DC68 = 0x0000000000000000

You can clearly see that it is indeed 0. At this point the whole picture is clear: just add a null check on the parameter. So who is responsible for this? I think both sides are at fault.

The person who wrote the managed layer was a bit careless.
The person who wrote the unmanaged layer did not program defensively, or was young and too trusting.
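
As an illustration of what defensive programming on the unmanaged side could look like, here is a minimal sketch. The Channel type, the status codes, and the function name get_clean_checked are all hypothetical; the real library's signature is not visible in the dump.

```cpp
#include <cstdint>

// Hypothetical channel type; the real layout is unknown.
struct Channel {
    std::int32_t pending;
};

// Hypothetical status codes returned across the P/Invoke boundary.
enum CleanStatus : std::int32_t {
    CLEAN_OK = 0,
    CLEAN_ERR_NULL_ARGUMENT = -1
};

// Validate every pointer that arrives from the caller before touching it,
// and report failure through a return code instead of faulting.
extern "C" std::int32_t get_clean_checked(const Channel* channel,
                                          std::int32_t* out_value) {
    if (channel == nullptr || out_value == nullptr) {
        return CLEAN_ERR_NULL_ARGUMENT;
    }
    *out_value = channel->pending;
    return CLEAN_OK;
}
```

A return code is used here because a caller on the other side of a P/Invoke boundary can test an integer result cheaply, whereas an access violation in native code takes the whole process down, as this dump shows. The managed side should still check the IntPtr before making the call.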

III: Summary

This production incident badly damaged the trust between the two language teams, and trust is hard to rebuild. As the saying goes, don't fear a god-like opponent, fear a pig-like teammate. It fits here rather well, haha. Just a little joke.