.NET crash analysis of an environmental monitoring system

2024-08-09 09:29:25

I: Background

1. The story

The other day a friend reached out to say that their program had crashed. They had already done a preliminary analysis of their own and asked me to double-check the result. Since I was asked to confirm it, let's start the dump analysis.

II: WinDbg Analysis

1. Why did it crash?

One of the powerful things about WinDbg is its built-in automated analysis command, !analyze -v, which gives a quick first read of a crash. The output is as follows:


0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

CONTEXT:  (.ecxr)
rax=00007ff95c5a9877 rbx=00007ff959d6d8e0 rcx=0000000000000000
rdx=0000000000000000 rsi=000000e394b98de0 rdi=000000e394b99530
rip=00007ff959c7b699 rsp=000000e394b99510 rbp=000000e394b99d00
 r8=0000000000000000  r9=0000000000000007 r10=0000000000000000
r11=0000000000000000 r12=0000022da11451d0 r13=0000000000000000
r14=000000e394b9a9e0 r15=0000000000040ae4
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000200
KERNELBASE!RaiseException+0x69:
00007ff9`59c7b699 0f1f440000      nop     dword ptr [rax+rax]
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff959c7b699 (KERNELBASE!RaiseException+0x0000000000000069)
   ExceptionCode: c000041d
  ExceptionFlags: 00000001
NumberParameters: 0

PROCESS_NAME:  

ERROR_CODE: (NTSTATUS) 0xc000041d - <Unable to get error code text>

EXCEPTION_CODE_STR:  c000041d
...

From the output you can see that the crash code is c000041d, that is, "An unhandled exception was encountered during a user callback". This is a generic wrapper code, which implies that the real exception code is hidden underneath it. So what is the real exception code?
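To see why c000041d says so little on its own, it helps to split the value into the standard NTSTATUS bit fields. The following is a minimal Python sketch, assuming the documented NTSTATUS layout (severity in bits 31..30, customer flag in bit 29, facility in bits 27..16, code in bits 15..0). The helper name decode_ntstatus and the two-entry name table are made up for illustration and are not part of WinDbg or the CLR.

```python
# Hypothetical helper for illustration; not part of WinDbg or the CLR.
# The table only covers the two codes that appear in this article.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000041D: "STATUS_FATAL_USER_CALLBACK_EXCEPTION",
}

SEVERITIES = ("success", "informational", "warning", "error")


def decode_ntstatus(value: int) -> dict:
    """Split a 32-bit NTSTATUS value into its bit fields."""
    return {
        "severity": SEVERITIES[(value >> 30) & 0x3],  # bits 31..30
        "customer": bool((value >> 29) & 0x1),        # bit 29
        "facility": (value >> 16) & 0xFFF,            # bits 27..16
        "code": value & 0xFFFF,                       # bits 15..0
        "name": KNOWN_CODES.get(value, "unknown"),
    }


for status in (0xC000041D, 0xC0000005):
    print(hex(status), decode_ntstatus(status))
```

Both values decode to an error severity with facility 0, so the bit fields alone cannot tell the wrapper from the real fault. Only the code itself does, which is why the rest of the analysis goes looking for the original exception record.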

2. Where is the real exception code?

To find out, switch to the exception context and look at the callers of RaiseException on the stack. The output is as follows:


0:000> k 5
 # Child-SP          RetAddr               Call Site
00 000000e3`94b99510 00007ff8`eb52cb19     KERNELBASE!RaiseException+0x69
01 000000e3`94b995f0 00007ff8`eb52cb4b     coreclr!NakedThrowHelper2+0x9
02 000000e3`94b99620 00007ff8`eb52cb55     coreclr!NakedThrowHelper_RspAligned+0x1e
03 000000e3`94b99b48 00007ff8`8da3caa3     coreclr!NakedThrowHelper_FixRsp+0x5
04 000000e3`94b99b50 00007ff8`8d5a5e23     Avalonia_Base!+0x83

0:000> ub 00007ff8`eb52cb19
...
00007ff8`eb52cb14 e857910b00      call    coreclr!LinkFrameAndThrow (00007ff8`eb5e5c70)

0:000> uf coreclr!LinkFrameAndThrow
Flow analysis was incomplete, some code may be missing
coreclr!LinkFrameAndThrow [D:\a\_work\1\s\src\coreclr\vm\ @ 6934]:
 6934 00007ff8`eb5e5c70 4053            push    rbx
 6934 00007ff8`eb5e5c72 4883ec20        sub     rsp,20h
 6937 00007ff8`eb5e5c76 488d05bb771f00  lea     rax,[coreclr!FaultingExceptionFrame::`vftable' (00007ff8`eb7dd438)]
 ...
 6949 00007ff8`eb5e5cea 448b05c7682800  mov     r8d,dword ptr [coreclr!g_SavedExceptionInfo+0x18 (00007ff8`eb86c5b8)]
 6949 00007ff8`eb5e5cf1 8b15ad682800    mov     edx,dword ptr [coreclr!g_SavedExceptionInfo+0x4 (00007ff8`eb86c5a4)]
 6949 00007ff8`eb5e5cf7 8b0da3682800    mov     ecx,dword ptr [coreclr!g_SavedExceptionInfo (00007ff8`eb86c5a0)]
 6950 00007ff8`eb5e5cfd 4883c420        add     rsp,20h
 6950 00007ff8`eb5e5d01 5b              pop     rbx
 6949 00007ff8`eb5e5d02 48ff2537581b00  jmp     qword ptr [coreclr!_imp_RaiseException (00007ff8`eb79b540)]  Branch
 ...

As the output shows, the arguments passed to RaiseException are loaded from the global variable g_SavedExceptionInfo: ecx, edx and r8d are filled from it just before the jump. This variable holds the real exception record and the register context of the current crash, and is declared in the CLR as follows:


struct SavedExceptionInfo
{
    EXCEPTION_RECORD m_ExceptionRecord;   // the original exception record
    CONTEXT m_ExceptionContext;           // register context at the time of the fault
    CrstStatic m_Crst;
};
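The addresses WinDbg prints next to g_SavedExceptionInfo in the disassembly above are not stored in the instructions as absolute values. Each mov uses RIP-relative addressing: the target is the address of the next instruction plus a signed 32-bit displacement. The sketch below redoes that arithmetic for the three loads, using only the addresses and raw bytes from the uf output. The function name rip_relative_target is hypothetical, and it assumes the displacement is the last four bytes of the instruction, which holds for these three encodings.

```python
def rip_relative_target(instr_addr: int, instr_bytes: bytes) -> int:
    """Resolve the absolute address of a RIP-relative memory operand.

    Assumption: the 32-bit displacement is the final four bytes of the
    instruction (true for the three mov encodings listed below).
    """
    disp = int.from_bytes(instr_bytes[-4:], byteorder="little", signed=True)
    # RIP already points at the next instruction when the operand is resolved.
    return instr_addr + len(instr_bytes) + disp


# Address of g_SavedExceptionInfo as printed by WinDbg.
G_SAVED_EXCEPTION_INFO = 0x7FF8EB86C5A0

# (instruction address, raw bytes, destination register) from the uf output.
LOADS = [
    (0x7FF8EB5E5CEA, "448b05c7682800", "r8d"),
    (0x7FF8EB5E5CF1, "8b15ad682800", "edx"),
    (0x7FF8EB5E5CF7, "8b0da3682800", "ecx"),
]


def field_offsets() -> dict:
    """Map each destination register to the struct offset it is loaded from."""
    return {
        reg: rip_relative_target(addr, bytes.fromhex(raw)) - G_SAVED_EXCEPTION_INFO
        for addr, raw, reg in LOADS
    }


for reg, offset in field_offsets().items():
    print(f"{reg} <- g_SavedExceptionInfo+{offset:#x}")
```

In the Windows x64 calling convention the first three integer arguments travel in rcx, rdx and r8, so the offsets 0x0, 0x4 and 0x18 line up with the exception code, flags and parameter count that RaiseException expects.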

With the layout known, the next step is to dig into the variable with dt and dx. The output is as follows:


0:000> dt coreclr!g_SavedExceptionInfo 00007ff8eb86c5a0
   +0x000 m_ExceptionRecord : _EXCEPTION_RECORD
   +0x0a0 m_ExceptionContext : _CONTEXT
   +0x570 m_Crst           : CrstStatic

0:000> dx -r1 (*((coreclr!_EXCEPTION_RECORD *)0x7ff8eb86c5a0))
(*((coreclr!_EXCEPTION_RECORD *)0x7ff8eb86c5a0))                 [Type: _EXCEPTION_RECORD]
    [+0x000] ExceptionCode    : 0xc0000005 [Type: unsigned long]
    [+0x004] ExceptionFlags   : 0x0 [Type: unsigned long]
    [+0x008] ExceptionRecord  : 0x0 [Type: _EXCEPTION_RECORD *]
    [+0x010] ExceptionAddress : 0x7ff88da3caa3 [Type: void *]
    [+0x018] NumberParameters : 0x2 [Type: unsigned long]
    [+0x020] ExceptionInformation [Type: unsigned __int64 [15]]

The output shows that the real cause of the crash is 0xc0000005, an access violation, and that the faulting instruction is at RIP=0x7ff88da3caa3.
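The field offsets in the dx output follow directly from the x64 layout of EXCEPTION_RECORD. Here is a small Python sketch of that layout using the struct module, assuming little-endian x64 with 4 bytes of padding after NumberParameters. The function name parse_exception_record is hypothetical. The sample buffer is synthetic: it is built from the values shown above, and the ExceptionInformation slots are zero placeholders because the dump excerpt does not show them.

```python
import struct

# Assumed x64 layout of EXCEPTION_RECORD (little-endian):
#   I    ExceptionCode         +0x00
#   I    ExceptionFlags        +0x04
#   Q    ExceptionRecord       +0x08  (pointer to a nested record)
#   Q    ExceptionAddress      +0x10
#   I    NumberParameters      +0x18
#   4x   padding               +0x1c
#   15Q  ExceptionInformation  +0x20
EXCEPTION_RECORD_FMT = "<IIQQI4x15Q"


def parse_exception_record(raw: bytes) -> dict:
    """Decode the leading bytes of raw as an x64 EXCEPTION_RECORD."""
    code, flags, nested, address, nparams, *info = struct.unpack_from(
        EXCEPTION_RECORD_FMT, raw
    )
    return {
        "ExceptionCode": code,
        "ExceptionFlags": flags,
        "ExceptionRecord": nested,
        "ExceptionAddress": address,
        "NumberParameters": nparams,
        # Only the first NumberParameters slots are meaningful.
        "ExceptionInformation": info[:nparams],
    }


# Synthetic buffer built from the dx output; the 15 information slots are
# zero placeholders, not values read from the dump.
sample = struct.pack(
    EXCEPTION_RECORD_FMT, 0xC0000005, 0, 0, 0x7FF88DA3CAA3, 2, *([0] * 15)
)
record = parse_exception_record(sample)
print(hex(record["ExceptionCode"]), hex(record["ExceptionAddress"]))
```

The format string describes 0x98 bytes, which is consistent with dt placing m_ExceptionContext at +0x0a0 once the CONTEXT structure is aligned to 16 bytes.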

3. What logic led to the crash

This is relatively simple: disassemble the faulting address with !U or uf (try either one). The output is as follows:


0:000> !U 0x7ff88da3caa3
Normal JIT generated code
()
ilAddr is 0000022DC65AE2D4 pImport is 00000238EE6FECA0
Begin 00007FF88DA3CA20, size 96
...
00007ff8`8da3ca9b 488bce          mov     rcx,rsi
00007ff8`8da3ca9e e8cdeaa5fe      call    00007ff8`8c49b570 ((), mdToken: 00000000060009D9)
>>> 00007ff8`8da3caa3 488b4008        mov     rax,qword ptr [rax+8]
00007ff8`8da3caa7 8b4008          mov     eax,dword ptr [rax+8]
...

0:000> dt coreclr!g_SavedExceptionInfo 00007ff8eb86c5a0
   +0x000 m_ExceptionRecord : _EXCEPTION_RECORD
   +0x0a0 m_ExceptionContext : _CONTEXT
   +0x570 m_Crst           : CrstStatic

0:000> dx -r1 (*((coreclr!_CONTEXT *)0x7ff8eb86c640))
...
    [+0x078] Rax              : 0x0 [Type: unsigned __int64]
...
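The address handed to dx above is simply the base of g_SavedExceptionInfo plus the +0x0a0 offset of m_ExceptionContext, and Rax sits another 0x78 bytes into the CONTEXT. The sketch below reproduces that arithmetic and then evaluates the memory operand of the instruction marked with >>>. All names are made up for illustration, and the 64 KB limit is stated as an assumption about the Windows user-mode address space.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

G_SAVED_EXCEPTION_INFO = 0x7FF8EB86C5A0  # address used in the dt command
CONTEXT_OFFSET = 0x0A0                   # m_ExceptionContext, from dt
RAX_OFFSET = 0x078                       # Rax inside _CONTEXT, from dx

# Assumption: the lowest 64 KB of a Windows user-mode address space is
# never mapped, so any access below this limit faults.
LOW_GUARD_LIMIT = 0x10000


def effective_address(base: int, displacement: int) -> int:
    """Address touched by a [base+displacement] operand, wrapped to 64 bits."""
    return (base + displacement) & MASK64


context_addr = G_SAVED_EXCEPTION_INFO + CONTEXT_OFFSET
rax_slot = context_addr + RAX_OFFSET

saved_rax = 0x0  # value of Rax in the saved context
# mov rax, qword ptr [rax+8]
fault_addr = effective_address(saved_rax, 8)

print(hex(context_addr), hex(rax_slot), hex(fault_addr))
print("inside the unmapped low range:", fault_addr < LOW_GUARD_LIMIT)
```

With Rax equal to zero the instruction reads from address 0x8, which is the classic shape of a null dereference that reaches a field at offset 8.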

Looking at the assembly in the output, the crash happens because RequestCompositionBatchCommitAsync in the Avalonia framework returned null, that is, rax = 0, and the very next instruction dereferences [rax+8]. Isn't Avalonia that cross-platform take on WPF? Interesting. Next, let's go to the source code to see which variable is involved.

From the code logic, _nextCommit is a class field rather than a method-local variable. Under high concurrency, if any other method sets _nextCommit = null at the wrong moment, the method can end up returning null. To verify this idea I searched the class, and there is indeed a method that sets the field to null.
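The pattern behind this kind of bug can be shown in a few lines. The sketch below is not Avalonia's code: the Compositor and Batch classes, the method names, and the interleave hook are all hypothetical, written in Python only to make the interleaving deterministic. The hook stands in for the moment at which another thread runs and clears the field.

```python
class Batch:
    """Stand-in for the object the caller expects to get back."""


class Compositor:
    """Hypothetical model of a class that caches a batch in a field."""

    def __init__(self):
        self._next_commit = None
        # Called at the point where another thread could be scheduled.
        self.interleave = lambda: None

    def commit(self):
        # Another code path that resets the shared field.
        self._next_commit = None

    def request_commit_racy(self):
        if self._next_commit is None:
            self._next_commit = Batch()
        self.interleave()
        # Reads the field a second time: it may have been cleared meanwhile.
        return self._next_commit

    def request_commit_snapshot(self):
        batch = self._next_commit
        if batch is None:
            batch = Batch()
            self._next_commit = batch
        self.interleave()
        # Returns the local reference, which nobody else can clear.
        return batch


racy = Compositor()
racy.interleave = racy.commit  # simulate the other thread winning the race
print("racy result:", racy.request_commit_racy())

safe = Compositor()
safe.interleave = safe.commit
print("snapshot result:", safe.request_commit_snapshot())
```

Returning a local snapshot only guarantees that the caller never sees null. It does not make the class thread-safe, and a real fix might instead use a lock or restrict access to a single thread. Which of these applies to Avalonia is for its maintainers to decide.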

It's basically clear that this is a bug in Avalonia. Finally, let's check the Avalonia version, which turns out to be very recent. The output is as follows:


0:000> lmvm Avalonia_Base
    ...
    Timestamp:        A0BE2821 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         001CDA05
    ImageSize:        001D4000
    File version:     11.1.0.0
    Product version:  11.1.0.0
    File flags:       0 (Mask 3F)
    File OS:          4 Unknown Win32
    File type:        2.0 Dll
    File date:        00000000.00000000
    Translations:     0000.04b0
    Information from resource tables:
        CompanyName:      Avalonia Team
        ProductName:      Avalonia
        InternalName:     
        OriginalFilename: 
        ProductVersion:   11.1.0+2a8ea17985fd739234fa0d93c3437948535d35c4
        FileVersion:      11.1.0.0
        FileDescription:  
        LegalCopyright:   Copyright 2013-2024 © The AvaloniaUI Project

4. How can this be resolved?

Since this is a bug in Avalonia and the version in use is already very recent, upgrading is not an option. The only path left is to file an issue with the official repository, /AvaloniaUI/Avalonia, and wait for a fix.

III: Summary

I picked up some new things from this production incident, and I'm curious whether Avalonia is now being used as a WPF replacement in the industrial control industry. At this stage, though, its stability is not yet comparable to WPF's. I look forward to a more robust version in the future.
