This post is the blog version of a course I gave inside our company. In it, I introduce some common debugging tools on Windows, along with common ways to investigate problems. It is suitable for colleagues who are getting started with Windows debugging.
The rest of this post is assembled page by page from the original course slides, supplemented with some of what was said during the lecture.
The core topic of this course is debugging tools: the tools we use while debugging software to find its problems and fix them.
Let me open the course with a debugging story. It starts with a user reporting that the software doesn't work.
The user says the software doesn't work, so what could the problem be? Users are not professional developers, and they often cannot describe the problem accurately.
Anyone who has studied software engineering will remember that the first step of development, and a critical one, is requirements analysis. The same applies here: when a user reports that the software does not work, what are they actually saying? Did the software crash? Does it fail to start? Or is it something else?
Where can we start when a user says the software doesn't work? I split the investigation into two main directions. The first is to start from the current scene, that is, the state of the user's machine right now. If the scene is already gone, consider the second direction: reproducing the problem.
In the first direction, start by looking for traces on the user's device, and I will show you how. Of course, if the "user" is one of our testers or a coworker, this step is even more valuable.
When looking for traces on the user's device, remember that Windows is our best friend: it ships with many tools that help find the cause of a problem. I will introduce the most commonly used built-in ones.
The first stop is the Event Viewer. Start by assuming that the software may be crashing on startup. The Event Viewer comes first because it needs no remote session: we can simply ask the user to send a screenshot or an exported log of the system events, which keeps the cost for the developer low.
The Event Viewer allows a quick analysis. For example, if it contains a crash record for our software, that confirms the software really did crash, and the investigation can focus on the crash.
Sometimes the Event Viewer shows such useful information that it pinpoints the problem and ends the battle right there.
Here is an example.
Once, while I was debugging an application, a user reported that it could not start. I asked the user to send over the Event Viewer logs, which showed the following:
Faulting application name: , version: 5.1.12.63002, time stamp: 0xedd2d687
Faulting module name: , version: 10.0.40219.325, time stamp: 0x4df2be1e
Exception code: 0x40000015
Fault offset: 0x0008d6fd
Faulting process id: 0x994
Faulting application start time: 0x01d50ac3bd970061
Faulting application path: C:\Program Files\lindexi\
Faulting module path: C:\Program Files\PowerShadow\App\
Report Id: a0c5c0b1-76b7-11e9-9d20-94c69123de40
If you look carefully, you may spot the problem at a glance: the faulting module does not live in our application's directory but in the directory of an unfamiliar program named PowerShadow. At this point we can be fairly sure what happened: our process has been "poisoned" by a module that another program loaded into it.
Try a web search to find out what this software is. The search turned up this blog post: Shadow system prevents C++ programs from running
That ends the battle, because the cause is found: the software cannot start because it was poisoned by the shadow system. The fix is to ask the user to uninstall the shadow system. Since the shadow system is no longer maintained, there is nothing our software can do about it on its own.
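The "spot it at a glance" check above can be written down as a tiny helper. This is only a sketch of the idea, not a tool used in the course: the function name `module_is_foreign` and the `trusted_dirs` default are my own assumptions, and a "foreign" result is a hint to look closer, not proof of injection.

```python
import ntpath  # Windows path rules, usable on any operating system


def module_is_foreign(app_dir, module_dir, trusted_dirs=("C:\\Windows",)):
    """Return True when module_dir is outside the application folder
    and outside every trusted folder (hypothetical default: C:\\Windows)."""

    def norm(path):
        # Normalize case and separators, and end with one backslash so that
        # "C:\App" is not treated as a prefix of "C:\AppOther".
        return ntpath.normcase(ntpath.normpath(path)) + "\\"

    module = norm(module_dir)
    return not any(module.startswith(norm(d)) for d in (app_dir, *trusted_dirs))


# The two paths exactly as they appear in the log above.
app = "C:\\Program Files\\lindexi\\"
faulting_module = "C:\\Program Files\\PowerShadow\\App\\"
print(module_is_foreign(app, faulting_module))
```

A module that is neither ours nor part of Windows is worth a web search, which is exactly the step taken in the story.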
Unfortunately, on many users' devices the Event Viewer routinely fails to record anything useful. That's okay: any extra information we do get from it is a bonus.
What if the Event Viewer has nothing to offer, or does not work at all? There are plenty of other tools we can use.
Another tool commonly used when looking for traces is Task Manager. It ships with Windows and can tell us a great deal.
When looking for traces with Task Manager, you can follow the decision tree shown above to get a first idea of what is going on. If the process is not listed in Task Manager, it has most likely crashed. If it is listed, it is probably hung, and the focus moves to CPU usage. If CPU usage stays flat, suspect a deadlock; if CPU usage is very high, suspect an infinite loop or a similar problem. Look at memory usage at the same time: the figure Task Manager shows does not truly reflect a process's memory usage, but it serves as a reference. How to view a program's memory usage correctly is covered in a dedicated part later.
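The decision tree can be sketched as a small function. The thresholds below are made-up defaults for illustration, not values from the course; keep in mind that a single spinning thread on an N-core machine shows up as roughly 100/N percent, so "high" depends on the machine.

```python
from typing import Optional


def triage(process_visible: bool, cpu_percent: Optional[float] = None,
           idle_below: float = 1.0, busy_above: float = 80.0) -> str:
    """First guess based on what Task Manager shows.

    idle_below and busy_above are hypothetical thresholds.
    """
    if not process_visible:
        # The process is not in the list: most likely it crashed.
        return "crash"
    if cpu_percent is None:
        return "hang: check CPU usage next"
    if cpu_percent <= idle_below:
        # Alive but doing nothing: threads may be waiting on each other.
        return "possible deadlock"
    if cpu_percent >= busy_above:
        # Alive and saturating the CPU: some loop may never finish.
        return "possible infinite loop"
    return "unclear: capture a dump"


print(triage(process_visible=True, cpu_percent=0.0))
```

Whatever the function returns is only a starting hypothesis; the dump is what confirms it.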
Whatever the situation, try to capture a dump and bring it back for debugging. For a crash, first check whether the software can be started again and, if so, quickly capture a dump before it dies. If that is not possible, other tools for capturing dump files are introduced later.
To recap: we begin the investigation by looking for traces, using the tools that Windows provides, chiefly the Event Viewer and Task Manager. The Event Viewer can quickly show why the software crashed, and Task Manager shows how the software is currently running.
If the built-in tools give no clear answer, capture a dump and bring it back to the development machine for further analysis.
The dump file mentioned in this course is a Windows memory dump: a binary file that, simply put, stores the memory contents of a process. From a dump file we can restore the state and contents of the process's memory for further analysis. When the user's environment has no development tools, bringing a dump back lets us do the analysis on the development machine. Capturing a dump is like taking a snapshot of the process and carrying it to the development machine.
Assuming the process is still alive, the easiest way to capture a dump is to right-click the process in Task Manager and choose the Create memory dump file menu item.
Note that if the system is x64 but your process is x86, capturing the dump with the default Task Manager is not recommended. The default Task Manager is an x64 program and writes an x64 dump file, which includes the state of the WoW64 subsystem. For details, see Is there a problem with the dump file you generated? - Knowing
The correct approach is to capture the dump with the Task Manager located in C:\Windows\SysWOW64\
Once we have the dump file, the next step is to analyze it. Before that, of course, the file has to get back to your development machine, and here is a small trick: compress it. Because a dump is a copy of memory, large parts of it are typically zero-filled, so it compresses very well. If it has to travel over the network, compressing it first makes the transfer much faster.
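To get a feel for why compression pays off, here is a small sketch using Python's standard `gzip` module. The "dump" is a synthetic buffer (a little random data followed by zeros), which is only my stand-in for a real dump file; real ratios depend on what the process had in memory.

```python
import gzip
import os
import shutil


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(data)) / len(data)


def compress_file(src: str, dst: str) -> None:
    """Stream src into a gzip file at dst without loading it all into memory."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Synthetic stand-in for a dump: 4 KiB of random bytes plus 1 MiB of zeros.
fake_dump = os.urandom(4096) + bytes(1024 * 1024)
print(f"compressed to {compression_ratio(fake_dump):.2%} of the original size")
```

On the user's machine, any archive tool does the same job; the point is simply to compress before sending.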
There are many tools for analyzing dumps, but the one I want to introduce is Visual Studio, the strongest IDE in the solar system. Visual Studio is a mature IDE: just drag the dump file into it, and the clever Visual Studio analyzes it for us automatically.
In general, drag the dump into Visual Studio and click the Debug with Mixed button. Mixed-mode debugging combines managed debugging and native debugging: managed debugging is for .NET code, and native debugging is for code that is not based on .NET. We pick mixed mode because it is fairly hard to crash a .NET application at the managed level, unless the developer was careless, whereas native code written in C, C++, or assembly crashes more easily. Mixed mode lets you debug both kinds of code at once, and it still works even when the process is not a .NET program at all.
After entering mixed-mode debugging, wait for Visual Studio to analyze the dump automatically. The first time you debug a dump file, it may sit at the symbol download step for a while. Go have a cup of tea and check back later. If you cannot wait any longer, cancel the symbol loading and continue.
So far: we found a problem on the user's side and could not end the battle with the Event Viewer and the like, so we captured a dump on the user's machine and analyzed it with Visual Studio. To analyze it, drag the dump file into Visual Studio, click the Debug with Mixed button, and wait for Visual Studio to finish its automatic analysis.
What does the clever Visual Studio analyze for us, and how do we read the result? A common approach is to focus on the following three things in Visual Studio:
- the call stack
- the "three axes", described later
- local variables
Let's start with the call stack. The call stack is a very important aid to understanding how the program was running: it shows which function execution started from, how each function was called, and where each call returns. In the default Visual Studio debugging layout, the Call Stack pane is immediately visible.
How should the call stack be read? Combine it with what Task Manager showed on the user's machine. If the process was not visible in Task Manager, that is, it crashed, use the call stack to see who brought it down and which functions were called just before the crash. If the process was visible but CPU usage stayed flat, it may be a deadlock, and the call stack shows which function is blocking the main thread or waiting on a lock. If CPU usage was high, it may be an infinite loop, and the call stack shows which function is saturating a thread.
As a real-world example, here is a dump file I brought back from a user's machine. As analyzed by Visual Studio, the call stack before the crash looks like this:
> 00000000() Unknown
[Frames below may be incorrect and/or missing] Unknown
!710d0745() Unknown
!5989f2e1() Unknown
!595f1716() Unknown
!596b7827() Unknown
!598a6233() Unknown
!5989b95c() Unknown
!5989c33b() Unknown
!598816bc() Unknown
!710ca40e() Unknown
!710cbb78() Unknown
!710ca17f() Unknown
!710ca0d3() Unknown
!5ab86f81() Unknown
!_NtWaitForMultipleObjects@20 () Unknown
!76f69723() Unknown
As you can see from the call stack, the crash happened inside one module, and that module belongs to the NVIDIA graphics driver. In other words, the call stack shows that the NVIDIA driver brought the process down. The solution is to update the NVIDIA driver. For more details on this issue, please see Fixed WPF application startup flickering problem due to NVIDIA HDD error.
Driver problems are a common cause of client crashes. A typical sign is that the software works fine on many users' computers but fails to start on a few.
When repairing DirectX, I often use the DirectX Repair Tool, which can be downloaded at:/VBcom/article/details/6962388
Having covered the question of who brought the crash down, let's look at another case: CPU usage that stays flat. The call stack for this case is shown below.
Guess what the stack is telling us.
The stack above tells us that the thread has entered a lock. The usual practice here is to read the stack from top to bottom and find the first call into one of our own assemblies. That method contains logic that waits for a lock, and the wait never returns. This is a good moment to open the code: we now know the process is stuck because it is waiting on a lock that never returns, but what that lock means in business terms has to be worked out by reading the code.
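The "read from top to bottom until you reach your own code" habit can be sketched in a few lines. The frames below use the `module!function` form that the Call Stack pane shows, but every name in the sample is hypothetical; it is not the stack from the course.

```python
def first_own_frame(frames, own_modules):
    """Return the first frame, scanning from the top of the stack,
    whose module is in own_modules; None when there is no such frame."""
    own = {name.lower() for name in own_modules}
    for frame in frames:  # frames[0] is the top (most recent) call
        module, _, _function = frame.partition("!")
        if module.lower() in own:
            return frame
    return None


# Hypothetical stack of a thread stuck waiting on a lock.
sample_stack = [
    "ntdll.dll!NtWaitForSingleObject",
    "KernelBase.dll!WaitForSingleObjectEx",
    "MyApp.Core.dll!MyApp.Core.Cache.WaitForLock",
    "MyApp.exe!MyApp.Program.Main",
]
print(first_own_frame(sample_stack, ["MyApp.exe", "MyApp.Core.dll"]))
```

The frames above the returned one are system code doing the waiting; the returned frame is where to start reading your own source.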
Now a case of a CPU spike. Here the stack tells us which methods are running right now, and it is likely that the methods at the top of the call stack contain the logic that is saturating a thread. Again, this is best read alongside the code.
However, you may have no code to look at, for example when a third-party library is involved, and then the stack information alone is not enough. Think about this question for a moment: without the code, how can we analyze further? Below I introduce how to go further with the "three axes".
To recap, those were three common cases in which we dragged a dump file into Visual Studio and relied on the call stack to analyze the problem. For a crash, the call stack shows who brought the process down. For CPU usage that stays flat, it shows who is blocking the main thread. For a CPU spike, it shows who is saturating a thread.
But the call stack alone may not be enough; sometimes more information is needed. Next I will show how to analyze further with the "three axes".
The "three axes" are the registers, disassembly, and memory tools. These three tools help us analyze a problem further.
Note that these three tools are only needed when we want more state information, and they do not always reveal exactly what is causing the problem. Using them is not hard in itself. The hard part is understanding the program and the mechanisms of software execution that lie behind what the tools show. Without that understanding, their output may be hard to read, or it may send the investigation in the wrong direction.
Back to the earlier example: when CPU usage spikes, the call stack shows which method is saturating a thread. But why is that method's logic running flat out? The call stack cannot answer that question.
Start by opening the Memory, Registers, and Disassembly panes in Visual Studio. These three tools help us analyze the problem further.
With the panes open, the Visual Studio layout looks roughly as shown above.
Taking the CPU spike example from this course: I first read the disassembly to see where the problem might be, for instance by checking what the rcx register holds. The Registers pane shows the value stored in rcx, and the Memory pane shows what is stored at that address. In this case, the memory at that address turned out to hold a piece of funny code.
The difficulty of the "three axes" is not in using them but in the knowledge behind them: assembly, how registers work, and how the software itself runs. That knowledge is far beyond what this course can cover, so you need to study it on your own. Because it is expensive to learn, it may not be strictly necessary in practice, and I only dare to recommend it to those who have spare capacity. If work is already too busy, it can wait. But if you do master it, it will help when debugging.
Next is another Visual Studio debugging tool: local variables. Local variables show the values of variables at the moment captured in the dump, which helps us understand the state the program was in.
Just before the error, there is a local variable named lastErrorCode, and its value may reveal the cause of the error. But what does the error code mean, and where can we look it up? We can try Microsoft's error lookup tool, which finds the possible meanings of an error code automatically. The tool is maintained by Microsoft, and most error codes seen when calling system-level components can be found with it.
Tool download address:/en-us/windows/win32/debug/system-error-code-lookup-tool
As we can see here, the error message says that a file or folder name is invalid. Given our business logic, a bad file name is probably causing the problem. The next step is to find out why the file name is wrong, which a look at the code may well explain.
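To show what such a lookup does, here is a tiny sketch with a hand-written table of four well-known Win32 error codes. It is in no way a replacement for Microsoft's tool. The post does not say which numeric code was in lastErrorCode; I use 123 (ERROR_INVALID_NAME) as the example only because its message matches the description above, which is my assumption. The sketch also unwraps HRESULT values of the form 0x8007xxxx, which carry a Win32 code in their low 16 bits.

```python
# A deliberately tiny table; the real tool knows thousands of codes.
WIN32_ERRORS = {
    2: ("ERROR_FILE_NOT_FOUND", "The system cannot find the file specified."),
    3: ("ERROR_PATH_NOT_FOUND", "The system cannot find the path specified."),
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    123: ("ERROR_INVALID_NAME",
          "The filename, directory name, or volume label syntax is incorrect."),
}


def describe_error(code: int) -> str:
    """Describe a Win32 error code or an HRESULT that wraps one."""
    code &= 0xFFFFFFFF  # debuggers often show HRESULTs as negative numbers
    if (code >> 16) == 0x8007:
        # Failure bit set and facility 7 (FACILITY_WIN32):
        # the low 16 bits are the original Win32 error code.
        code &= 0xFFFF
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this small table"))
    return f"{code} {name}: {text}"


print(describe_error(123))
print(describe_error(0x8007007B))
```

Both calls describe the same underlying error, which is handy because .NET exceptions usually report the HRESULT form while native APIs report the plain code.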
Here is another real-world example. As you can see, the crash was caused by the exception shown above. A web search tells us that it comes from WIC, the multimedia decoding layer of Windows, so the problem is probably related to decoding images or other media.
It so happens that in this example a local variable holds the path of the image file involved, which gives the investigation a clearer direction. Besides a possible problem in the WIC layer, the image file itself may be at fault, for example a "poisoned" image file.
By extension, how can we tell whether an image, audio, or video file has been poisoned? The MediaInfo tool shows a great deal of information about a media file.
For example, this file is a WebP file pretending to be a PNG, which poisons the WIC layer and crashes it.
MediaInfo tool download address:/en/MediaInfo/Download
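A file "pretending" to be another format can often be caught just by looking at its first bytes. The sketch below checks only two signatures, PNG and WebP, and is nowhere near what MediaInfo reports; the function names are mine.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte PNG file signature


def sniff_image_type(header: bytes) -> str:
    """Guess the format from the first bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return "unknown"


def extension_matches_content(file_name: str, header: bytes) -> bool:
    """True when the file extension agrees with the sniffed format."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return sniff_image_type(header) == extension


# Hypothetical header of a WebP file that was saved with a .png name.
disguised = b"RIFF" + (1234).to_bytes(4, "little") + b"WEBPVP8 "
print(sniff_image_type(disguised), extension_matches_content("logo.png", disguised))
```

For a real file, read the header with `open(path, "rb").read(12)` and pass it in.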
Still, there seem to be some problems that cannot be debugged this way.
Even the strongest IDE in the solar system cannot handle them.
Then try WinDbg, which can debug almost anything.
This tool is very powerful, with just one problem: the barrier to getting started is enormous.
Here is a very simple way to start debugging with WinDbg right away: ask a colleague who is familiar with WinDbg to debug with you. Finding such a "tool person" to help is the fastest way to learn WinDbg.
To recap: when a problem is found on the user's side, first try to locate it quickly with the tools that come with Windows. Once we have a dump file, analyze it further in Visual Studio on the development machine by dragging it in, clicking the Debug with Mixed button, and waiting for the automatic analysis. The analysis focuses on the call stack, the "three axes", and local variables.
If Visual Studio cannot solve the problem, find a "tool person" to help you investigate it with WinDbg.
That covers the first main direction.
The second main direction is reproducing the problem when only the after-the-fact scene is left. When is reproduction needed? The simplest example is software that crashes on startup, leaving no time to open Task Manager and capture a dump. In such cases the problem has to be reproduced, and a good reproduction helps us locate it.
Reproducing a problem is not just a matter of running the program again and again. Tools can help us capture much more when the problem reproduces.
First on the list is ProcDump.
When Task Manager cannot capture the dump, or is a poor fit for the job, ProcDump does better. ProcDump is a Sysinternals tool; the download address is:/zh-cn/sysinternals/downloads/procdump
Why is Task Manager sometimes a poor way to capture a dump? Because reality is often complicated. Apart from crashes on startup, where nobody's hands are fast enough, there are many other cases. For example, the software may only misbehave at a particular moment that we want to capture: say it stops working as expected after a CPU spike, but the spike is so brief that having a person watch for it is a waste of a programmer. Or the software only crashes at midnight, when people are probably asleep, and whoever misses the moment has to wait for the next midnight. Or the problem is intermittent and needs a stress test to reproduce, perhaps once in several thousand runs, so the capture has to be automated because doing it by hand is too much work.
With ProcDump you can generate dump files for a program's "million ways to die" simply by choosing the right parameters. For what each parameter does, see the official Microsoft documentation and How to Effectively Generate a Dump in a NET Program in a Million Ways to Die (Up) - First Line Coder - Blogspot
Here is a small matching game: which option should be used in which situation?
In this line of investigation, reproducing the problem usually goes hand in hand with ProcDump, because ProcDump can capture dump files in a very wide range of situations.
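As a rough cheat sheet for the situations described above, here is how the ProcDump command lines might be assembled. This is a sketch from my reading of the ProcDump documentation, not the answer key from the course slides: the scenario names, the `C:\dumps` folder, the threshold values, and `MyApp.exe` are all placeholders, and you should confirm each option against `procdump -?` for the version you have.

```python
def procdump_args(scenario: str, target: str, dump_dir: str = "C:\\dumps"):
    """Build an argument list for procdump.exe (the command is not run here)."""
    recipes = {
        # Attach to a running process and dump on an unhandled exception.
        "crash_while_running": ["-ma", "-e", target, dump_dir],
        # Let ProcDump launch the program, for crashes right at startup.
        "crash_on_startup": ["-ma", "-e", "-x", dump_dir, target],
        # Dump when CPU stays at or above 80% for 5 seconds, up to 3 dumps.
        "cpu_spike": ["-ma", "-c", "80", "-s", "5", "-n", "3", target, dump_dir],
        # Dump when the process has a hung (unresponsive) window.
        "hung_window": ["-ma", "-h", target, dump_dir],
        # Wait for the process to start, then dump on an unhandled exception.
        "wait_for_launch": ["-ma", "-e", "-w", target, dump_dir],
        # Dump when the process terminates.
        "process_exit": ["-ma", "-t", target, dump_dir],
    }
    return ["procdump.exe", *recipes[scenario]]


print(" ".join(procdump_args("cpu_spike", "MyApp.exe")))
```

The `-ma` option asks for a full dump, which is larger but gives Visual Studio the most to work with. The returned list could be passed to `subprocess.run` on the machine where the problem reproduces.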
ProcDump is not the only tool that helps when reproducing a problem. You may also be facing an after-the-fact scene, where additional tools are needed to help locate the problem. And when you have no idea where to start, working through a list of common problems can help you find one.
Let me start with investigating an after-the-fact scene.
What is an after-the-fact scene? Here it means that the current scene, or the scene we can reproduce and capture, is no longer the scene in which the problem arose but the scene left behind after it happened.
For example, the problem has been found but it is intermittent. A common case is a dump showing that the crash was caused by a null pointer exception. But how did the null pointer come about? Answering that requires the after-the-fact investigation approach.
Or the place where the problem shows up is not the place where it was created. In the example from this course, the crash was caused by a WebP image pretending to be a PNG, so where did that image come from, and why was it being used? If the code logic does not answer that, you need to reconstruct the scene from before the problem occurred.
Or the problem is systemic: several components act together, and no single application is the cause. Such problems are hard because of their complexity, which can make it difficult to capture the right scene. Here too, you need to reproduce the problem several times to gather more information, and analyze the scene both before and after the event.
For after-the-fact scenes and problems caused by several components acting together, Process Monitor from Microsoft's Sysinternals suite, combined with DebugView, usually gives good results.
Process Monitor tool download address:/zh-cn/sysinternals/downloads/portmon
Debugview++ tool open source address:/CobaltFusion/DebugViewPP
DebugView tool download address:/en-us/sysinternals/downloads/debugview
DebugView++ is better than DebugView in terms of interface and interaction.
Let's take a real-world example of how several tools can be combined to investigate an interesting and complex problem.
This problem started when a tester reported a touch failure to me. Further investigation showed that it was actually explorer not responding, which showed up as explorer mysteriously flashing black.
The difficulty is that explorer is not our software: we are not familiar with it and do not know what could cause this. Explorer is also very large, so analyzing it purely through dumps is stressful and time-consuming. More tools are needed to help analyze the problem.
At this point, I captured the explorer process activity with Process Monitor and found something interesting, as shown above. What caught my attention was the presence of a process exit.
Searching the internet, I learned that explorer is a multi-process program, so the process exit and the mysterious black flash might be related. Since a process was exiting, I used ProcDump to capture dump files around the exit and analyzed them.
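Scrolling through a Process Monitor capture by eye gets tiring, so one option is to save the capture as CSV and filter it with a script. The sketch below assumes the default column names of a Process Monitor CSV export ("Process Name", "Operation", and so on) and that a process exit appears with the operation "Process Exit". The sample rows are made up for illustration and are not the real capture from this investigation.

```python
import csv
import io


def events_before_exit(csv_text: str, process_name: str, count: int = 3):
    """Return up to `count` events of the process that precede its first exit."""
    rows = [row for row in csv.DictReader(io.StringIO(csv_text))
            if row["Process Name"].lower() == process_name.lower()]
    for index, row in enumerate(rows):
        if row["Operation"] == "Process Exit":
            return rows[max(0, index - count):index]
    return []  # the process never exited in this capture


# Made-up sample in the assumed export format.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0000001","explorer.exe","4321","Process Start","","SUCCESS",""
"10:00:01.0000001","explorer.exe","4321","RegOpenKey","HKCR\CLSID\{hypothetical}","SUCCESS",""
"10:00:01.5000001","other.exe","999","ReadFile","C:\Example\data.bin","SUCCESS",""
"10:00:02.0000001","explorer.exe","4321","Load Image","C:\Example\HypotheticalShellExt.dll","SUCCESS",""
"10:00:03.0000001","explorer.exe","4321","Process Exit","","SUCCESS",""
'''

for event in events_before_exit(SAMPLE, "explorer.exe", count=2):
    print(event["Operation"], event["Path"])
```

When reading a real export from disk, opening it with `encoding="utf-8-sig"` avoids trouble if the file starts with a byte order mark.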
Because explorer is very large and we are not familiar with its code, several captured dumps yielded nothing. Then one capture produced an interesting dump file: the call stack just before the process exited contained some Shell32 calls.
The earlier Process Monitor capture showed that the Realtek Bluetooth module was touched before the process exited, so the focus shifted to the combination of Shell32 and Bluetooth.
With that in hand, I moved on to the ShellView tool. I used it to disable a large number of Shell32 components by bisection, disabling a batch at a time and checking whether the problem was still there, and eventually found that the Realtek Bluetooth module was causing the problem.
Bisecting means first disabling half of the components in one go and checking whether the problem goes away. If it does, the culprit is in the disabled half; if not, it is in the other half. Then repeat within the half that contains the culprit until the component is found.
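Bisection is easy to express in code. In the sketch below, `problem_gone_when_disabled` stands for the manual step of disabling a batch of components and checking whether the problem disappears; the component names are placeholders, and the sketch assumes exactly one component is at fault.

```python
def find_culprit(components, problem_gone_when_disabled):
    """Bisect a list of components down to the single one causing the problem.

    problem_gone_when_disabled(batch) must return True when disabling
    every component in `batch` makes the problem disappear.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        if problem_gone_when_disabled(half):
            candidates = half  # the culprit was in the disabled half
        else:
            candidates = candidates[len(half):]  # it is in the other half
    return candidates[0]


# Simulated check: pretend that "ext-07" is the faulty component.
extensions = [f"ext-{number:02d}" for number in range(1, 17)]
checks = []


def simulated_check(batch):
    checks.append(list(batch))
    return "ext-07" in batch


print(find_culprit(extensions, simulated_check), "found in", len(checks), "checks")
```

With 16 components this takes 4 checks instead of up to 16, which matters when every check means restarting explorer and trying to reproduce the flash.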
After this investigation, the tools had shown that the problem was related to the Bluetooth module. With a clear direction to concentrate on, debugging quickly showed that it was a Bluetooth driver problem.
The detailed debugging story is a lot more interesting than what is presented here; see Remembering a Debugging Explorer Unresponsive Experience