I: Background
1. The story
The other day a friend contacted me on WeChat and said that their program had crashed at a customer's site, and asked me to help figure out what was going on. I got the dump, so let's get to work analyzing it.
II: WinDbg Analysis
1. Where did it crash
As the saying goes, nobody wears leather pants over cotton pants without a reason, and a program does not crash without one either. Run the !analyze -v command and look at the exception information it reports.
0:107> !analyze -v
CONTEXT: (.ecxr)
rax=0000005e0dc7c4a0 rbx=0000005e0dc7c400 rcx=0000005e0dc7c4a0
rdx=0000000000000000 rsi=0000005e0dc7c3f0 rdi=0000005e0dc7c4a0
rip=00007ffb1ecfc223 rsp=0000005e0dc7c3c0 rbp=0000005e0dc7c4c0
r8=00000000000004d0 r9=0000000000000000 r10=0000000000000000
r11=0000005e0dc7c4a0 r12=0000000000000000 r13=000002079d450220
r14=000002079b93aba0 r15=0000000000000000
iopl=0 nv up ei pl nz na pe nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000200
coreclr!EEPolicy::HandleFatalError+0x7f:
00007ffb`1ecfc223 488d442440 lea rax,[rsp+40h]
Resetting default scope
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ffb1ec6d70f (coreclr!ProcessCLRException+0x00000000000d9f7f)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000001
NumberParameters: 0
From the output this is a classic access violation, but the crash landed in EEPolicy::HandleFatalError. The HandleFatalError method is mainly used to fix up the exception context before the exception is thrown. It is a solid method that generally doesn't cause problems, but let's look at what is stored at rsp+40h anyway.
0:107> dp rsp+40h L1
0000005e`0dc7c400 00000001`c0000005
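As a small aside, the quadword printed by dp can be split into two 32-bit halves. This is a minimal sketch, assuming (as the EXCEPTION_RECORD output of !analyze -v suggests) that rsp+40h points at the start of an exception record, whose first two fields are a 32-bit ExceptionCode followed by a 32-bit ExceptionFlags. The names Halves and split_qword are hypothetical and used only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 32-bit halves of a quadword.
struct Halves {
    std::uint32_t low;   // bytes 0..3 in memory on a little-endian machine
    std::uint32_t high;  // bytes 4..7 in memory on a little-endian machine
};

// Hypothetical helper: splits a quadword, as printed by `dp`, into its
// low and high 32-bit halves.
Halves split_qword(std::uint64_t value) {
    return Halves{static_cast<std::uint32_t>(value & 0xFFFFFFFFu),
                  static_cast<std::uint32_t>(value >> 32)};
}
```

Calling split_qword(0x00000001C0000005ULL) gives low = 0xc0000005 and high = 0x00000001, which matches the ExceptionCode and ExceptionFlags values shown by .exr -1.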
The c0000005 above is obviously an access violation. Things look a bit muddled here, and this is not the first crash site, so I won't dwell on it. So how do we find the real crash site? One way is to look on the call stack for the frame that sits right before RaiseException or KiUserExceptionDispatch, that is, the function that was executing when the exception was dispatched, as shown below:
0:107> .ecxr
0:107> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 0000005e`0dc7c3c0 00007ffb`1ec6d72e coreclr!EEPolicy::HandleFatalError+0x7f [D:\a\_work\1\s\src\coreclr\vm\ @ 776]
01 0000005e`0dc7c9d0 00007ffb`5235292f coreclr!ProcessCLRException+0xd9f9e [D:\a\_work\1\s\src\coreclr\vm\ @ 1036]
02 0000005e`0dc7cc00 00007ffb`52302554 ntdll!RtlpExecuteHandlerForException+0xf
03 0000005e`0dc7cc30 00007ffb`5235143e ntdll!RtlDispatchException+0x244
04 0000005e`0dc7d340 00000000`6c942893 ntdll!KiUserExceptionDispatch+0x2e
05 0000005e`0dc7daf0 00007ffa`c066ed7b libxxx_manage!get_clean_xxx
06 0000005e`0dc7db70 00007ffa`c06b73a4 0x00007ffa`c066ed7b
...
From the output, the program crashed at libxxx_manage!get_clean_xxx. It looks like a dynamic link library written in C++, which is a bit of a headache.
2. Why C++ libraries crash
The best way to find the answer is to look at the assembly code around 000000006c942893, shown below:
0:107> ub 00000000`6c942893
libxxx_manage!get_clean_xxx:
00000000`6c942876 55 push rbp
00000000`6c942877 53 push rbx
00000000`6c942878 4883ec68 sub rsp,68h
00000000`6c94287c 488dac2480000000 lea rbp,[rsp+80h]
00000000`6c942884 48894d00 mov qword ptr [rbp],rcx
00000000`6c942888 c745dc00000000 mov dword ptr [rbp-24h],0
00000000`6c94288f 488b4500 mov rax,qword ptr [rbp]
0:107> u 00000000`6c942893
00000000`6c942893 488b00 mov rax,qword ptr [rax]
0:107> dp rbp L1
0000005e`0dc7c4c0 00000000`00000000
From the assembly above, this is the prologue of the get_clean_xxx method. The problem is that the value stored at [rbp] is 0, and that value comes from rcx. According to the x64 calling convention, rcx carries the first argument of the method, so it looks like this argument is null and the following mov rax,qword ptr [rax] dereferences it. See below:
0:107> !address rcx
Usage: Stack
Base Address: 0000005e`0dc78000
End Address: 0000005e`0dc80000
Region Size: 00000000`00008000 ( 32.000 kB)
State: 00001000 MEM_COMMIT
Protect: 00000004 PAGE_READWRITE
Type: 00020000 MEM_PRIVATE
Allocation Base: 0000005e`0db00000
Allocation Protect: 00000004 PAGE_READWRITE
More info: ~107k
0:107> dp rcx L1
0000005e`0dc7c4a0 00000000`00000000
3. Is the get_clean_xxx parameter really null?
This is a relatively simple question. Continue with !clrstack to look at the C# code above the P/Invoke call.
0:107> !clrstack
OS Thread Id: 0x3508 (107)
Child SP IP Call Site
0000005E0DC7DBA0 00007ffac066ed7b [InlinedCallFrame: 0000005e0dc7dba0] xxx_LibPInvoke.xxx_clean_query(IntPtr)
0000005E0DC7DB70 00007ffac066ed7b ILStubClass.IL_STUB_PInvoke(IntPtr)
0000005E0DC7DC30 00007ffac06b73a4 xx+c__DisplayClass11_0.<xxxQueryClean>b__0(IntPtr)
...
The next step is to see how the C# code in the managed layer is written. The screenshot is below:
As you can clearly see from the screenshot, the xxxChannel passed to C++ was never checked for null, which caused the crash. Is there any other corroboration? Actually, there is. If the symbols are good enough, you can also use !clrstack -a to find the xxxChannel value that was passed down.
0:107> !clrstack -a
OS Thread Id: 0x3508 (107)
Child SP IP Call Site
0000005E0DC7DBA0 00007ffac066ed7b [InlinedCallFrame: 0000005e0dc7dba0] xxx_LibPInvoke.xxx_clean_query(IntPtr)
0000005E0DC7DB70 00007ffac066ed7b ILStubClass.IL_STUB_PInvoke(IntPtr)
PARAMETERS:
<no data>
0000005E0DC7DC30 00007ffac06b73a4 xxx+c__DisplayClass11_0.<xxxQueryClean>b__0(IntPtr)
PARAMETERS:
this (0x0000005E0DC7DC80) = 0x0000020a9d9ca8d8
xxxChannel (0x0000005E0DC7DC88) = 0x0000000000000000
LOCALS:
0x0000005E0DC7DC6C = 0x0000000000000000
0x0000005E0DC7DC68 = 0x0000000000000000
You can clearly see that it is indeed 0, and with that the truth is out: just add a null check on the parameter. So who is responsible for this? I think both sides share the blame.
- The person who wrote the managed layer was a bit careless.
- The person who wrote the unmanaged layer did not program defensively, or is young and too trusting.
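The defensive check on the unmanaged side could look roughly like the sketch below. Everything here is an assumption made for illustration: the real signature of get_clean_xxx, the type behind the IntPtr, and the library's error convention are not visible in the dump, so Channel, get_clean_guarded, kOk and kInvalidArgument are hypothetical names.

```cpp
// Hypothetical stand-in for the real channel type; the actual layout is unknown.
struct Channel {
    void* handle;
};

// Hypothetical status codes, not taken from the real library.
constexpr int kOk = 0;
constexpr int kInvalidArgument = -1;

// Sketch of a guarded entry point: reject a null pointer before dereferencing it.
extern "C" int get_clean_guarded(const Channel* channel) {
    if (channel == nullptr) {
        return kInvalidArgument;  // report an error instead of crashing the process
    }
    void* handle = channel->handle;  // safe: channel is known to be non-null here
    (void)handle;                    // the real cleanup work would go here
    return kOk;
}
```

With a guard like this, a null argument turns into an error code that the caller can handle rather than an access violation. The managed side should still do its own part and compare the IntPtr against IntPtr.Zero before making the P/Invoke call.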
III: Summary
This production incident badly damaged the trust between the two language teams, and rebuilding trust can be hard. As the saying goes, don't fear a godlike opponent, fear a pig-like teammate. It fits rather well here, haha, just a little joke.