Saturday, July 26, 2014

Customizing the WinDbg environment

Hi everyone,

In this post, I am going to discuss some of the customizations I always make to my WinDbg environment to make it that much more comfortable and easier to use in the long run. I picked these up from reading and from videos by people such as Andrew Richards. I am using a fresh install of Windows 7 x64 on a VM so I can explain better and truly start from scratch. Do note that all of the customizations I make here also apply to Windows 8, etc, as well.

Disclaimer: This post will not teach you/show you how to install WinDbg! There are already plenty of posts and tutorials out there regarding that. This is strictly regarding its customization once it is installed.

Registering File Associations

On a clean install of WinDbg with zero modifications and/or customizations done, this is what crash dumps look like:


We can see that the kernel-dump in the picture has no icon, etc, and is just a file. This is due to the fact that it has no known association. If we attempt to double-click the crash dump in this state, Windows throws an error and says "Hey, we have no idea what type of file this is, so can you please perhaps tell us what program you'd like to associate all future crash dumps with?"

To properly circumvent this and set the file association, we're going to want to open an elevated command prompt. You can quickly do this by pressing Windows Key + R to open the run box, and then typing cmd and pressing Ctrl+Shift+Enter to execute an elevated command prompt.

Now that we have a window open, let's sidetrack to discuss something. If you navigate to the location in which you installed your WinDbg, and go into the Debuggers folder, we see the following:


We have an x86 and x64 debugger, and this is due to the fact that it's always best to debug a crash dump in its original architectural environment. With this said, if you're debugging a crash dump generated by an x64 box, you'll want to use the x64 debugger. At least 90% of all crash dumps you'll be debugging at this time in our life will be x64, therefore we want to obviously associate the x64 debugger as the main program for crash dumps. It just makes sense.

Now that we have navigated to the Debuggers folder, we can register associations by quickly going one folder deeper (x64) and typing the following:

windbg -IA

Typing the above will do the following:



-- YOU MAY NEED TO RESTART AFTERWARDS TO HAVE IT FULLY APPLY.

Great! Now all future crash dumps are associated with the x64 debugger. This raises one problem though, what if we're assisting a user and they are on an x86 box? Surely we can just navigate to the WinDbg directory, go to the x86 folder and open the debugger, but this is too much work.

With the above said, let's have some fun by pressing the Windows Key + R to open the run box, and then typing regedit to execute the Registry Editor. The key we're interested in working within is HKEY_CLASSES_ROOT. It's important to note that this is the key that controls all of the file name extension associations and COM class registration information. With that said, programs start how they're supposed to thanks to this key!

Now we want to take a peek at the extension .dmp (scroll down until you see it, or actually type .dmp until you get there):


The above is essentially a shortcut: the (Default) value tells Windows to jump to WinDbg.DumpFile.1 for everything concerning .dmp files.

Now that we know this, let's go ahead and do as we did above, but instead type windbg until we get to WinDbg.DumpFile.1. You can alternatively scroll as well if you're cautious of screwing anything up, although it'll take you a while:


WinDbg.DumpFile.1 is the key that holds the actual settings for the .dmp extension.

If we click DefaultIcon:


Resource -3002 from windbg.exe is the familiar little computer WinDbg icon.

If we go into shell, we see Open:


If we double-click the string itself:


&Open actually creates the underscore under the O in the Open command in the context menu for right-clicking a crash dump. We want to change this from &Open to Open x&64, for example:


and then click OK.

Now that we've done this, as said above, on the context menu for crash dumps, instead of saying Open, it will say Open x64. For example:


Great! We're making progress, but we also want an Open x86 option on the context menu as well. In order to do this, bring regedit back up and right-click shell, select new, and then select key. Once you've done this, name the new key Open_x86. After we've done all that, this is what regedit should look like:


With all of the above done, now we need to go ahead and set that value by double-clicking where it says (Default), and typing Open x&86 this time as opposed to Open x&64 as we did earlier above. Now we need to right-click our Open_x86 key, select new, and then select key. Once you've done this, name the new key command. After we've done all that, this is what regedit should look like:


Once we have the above, to save ourselves some time, let's navigate to Open's command key, double-click (Default), and then copy the string's Value data. For example:


With the above Value data copied, let's navigate back to our created Open_x86's command key, double-click (Default), and then paste into Value data's empty box. DO NOT PRESS OK YET!

We need to make an adjustment to the path, which is extremely simple.

Here was my pasted Value data:

"C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x64\windbg.exe" -z "%1"

Here is what it needed to be changed to:

"C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x86\windbg.exe" -z "%1"

Extremely simple as I noted, and all I had to do was change x64 to x86.

With all of the above said and done, now if we once again right-click a crash dump to bring up its context menu, here's what we now see:


Fantastic, so now we no longer need to spend time traversing through folders and such to open the x86 debugger if we need to. You may be saying to yourself "This was a bit of a pain, I really don't want to do this every time I have to reinstall Windows...". You're in luck, you don't have to!

If you now right-click WinDbg.DumpFile.1 in regedit, select Export, and save it as WinDbg_IA.reg, you're now exporting that registry key from the registry to be saved. This is extremely handy because you can throw it on your trusty USB that you use for absolutely everything debugging related, and when you're on a brand new system/different system that doesn't have file associations set that you need to debug, you can just quickly install the key, and you're good to go. This also of course saves you from having to do this process on your own system.

-- Do note that you will have to do the windbg -IA command before installing the registry key or it will not work. Once you've done the windbg -IA command, you can then simply install the registry key and all the work is done.
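If you're curious what the Open x86 portion of such a .reg file boils down to, here is a minimal Python sketch that generates the text. The function name build_open_x86_reg is hypothetical, the key names and path are the ones used in this post, and a real export from regedit will contain the rest of the WinDbg.DumpFile.1 key as well, so treat this as an illustration rather than a replacement for the export.

```python
# Hypothetical helper: builds .reg text for the extra "Open x86" context-menu
# entry described in this post. Adjust the Debuggers path for your install.

def build_open_x86_reg(debuggers_dir):
    def esc(value):
        # Inside a .reg string value, backslashes and double quotes
        # are escaped with a backslash.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    base = r"HKEY_CLASSES_ROOT\WinDbg.DumpFile.1\shell\Open_x86"
    command = '"{}\\x86\\windbg.exe" -z "%1"'.format(debuggers_dir)
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[{}]".format(base),
        '@="{}"'.format(esc("Open x&86")),   # (Default) value: menu caption
        "",
        "[{}\\command]".format(base),
        '@="{}"'.format(esc(command)),       # (Default) value: command line
        "",
    ]
    return "\r\n".join(lines)


reg_text = build_open_x86_reg(r"C:\Program Files (x86)\Windows Kits\8.1\Debuggers")
print(reg_text)
```

The sketch only prints the text, so you can compare it against your own exported WinDbg_IA.reg before trusting it.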

Symbols

So now that we have file associations registered for crash dumps, and have successfully modified our crash dump context menu to contain an option to either execute the x64 or x86 debugger, let's go one step further and automate the creation, download, etc, of symbols.

One thing to note is that everything we're doing here in this post could be done manually, but will take twice the actual effort, and will in the end waste time. When we're debugging, etc, and looking to get a box back up and running that we're debugging on, time is of the essence. With that said, all of these steps are to make our lives as debuggers much easier.

First off, it's important to understand what symbols are! Symbols (also known as PDB files -- program database) hold a variety of data which are not actually needed when running the binaries, but contain very useful debugging information. The fact that they are not needed when running the binaries themselves is the reason why PDB files exist! PDB files exist to separate the symbols from the binary, which ultimately helps limit the size of the executable, saving disk storage space and reducing the time it takes to load the data.

Symbol files contain:
  • Global variables
  • Local variables
  • Function names and the addresses of their entry points
  • Frame pointer omission (FPO) records
  • Source-line numbers
Each of these items is individually a symbol. For example, a single symbol file called Myprogram.pdb might contain several hundred symbols, including global variables and function names and hundreds of local variables.

Unfortunately, as debuggers who aren't internally within a company (Microsoft for example), we are only provided with what is known as public symbols (as opposed to private). The differences are listed here.

With all of the above said, if a debugger attempts to load a crash dump without symbols, we won't be able to successfully resolve functions, etc, and it'll instead be junk. This is why symbols are very important to have. Normally, when you first install WinDbg, you'd have to navigate to File>Symbol File Path and manually set the path. However, since time is of the essence and we love to do things automatically, we're going to make a one-time script that will set everything for us.

Let's get to work!

1. Create a new .txt file on your Desktop.

2. Within the new .txt file, paste the following code:
md c:\Symbols
md c:\Symbols\Src
md c:\Symbols\Sym
md c:\Symbols\SymCache
setx /M _NT_SOURCE_PATH SRV*C:\Symbols\Src
setx /M _NT_SYMBOL_PATH SRV*C:\Symbols\Sym*http://msdl.microsoft.com/download/symbols
setx /M _NT_SYMCACHE_PATH C:\Symbols\SymCache
You can change Symbols to whatever you want. I just find naming it Symbols is easy to remember and obvious on what's being stored there.

So, what is this script doing?

- Creating the Symbols directory in C:\.

- Inside of the Symbols directory, it's also creating three subfolders: Src, Sym, SymCache.

Src - Source code.

Sym - Symbol path.

SymCache - Symbol cache.

This ultimately creates three system-wide environment variables (_NT_SOURCE_PATH, _NT_SYMBOL_PATH and _NT_SYMCACHE_PATH) which set it all for us, therefore we have to do nothing manually.

To expand a little, the reason we're using * in the symbol path is to essentially say:

"Hey, look at C:\Symbols\Sym for the symbols. Are they there?"

If the answer is yes, then the symbols are loaded locally.

If the answer is no, then here's what it instead says:

"Hey, look at C:\Symbols\Sym for the symbols. Are they there? Oh wow, they're not?! Okay, let's grab them from http://msdl.microsoft.com/download/symbols instead!"

This is done so that, in the event your local symbols aren't available for some reason, it'll download them from the MSFT symbol server and store them in C:\Symbols\Sym for next time.
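To make the role of the * separators a bit more concrete, here is a small Python sketch that splits a symbol path of the form SRV*local store*server into its parts. The function name parse_symbol_path is made up for illustration, and it only understands the three-part SRV form used in the script above. It is not how the debugger itself parses the variable.

```python
def parse_symbol_path(path):
    """Split an _NT_SYMBOL_PATH-style string into its elements."""
    entries = []
    for element in path.split(";"):          # elements are separated by ;
        parts = element.split("*")
        if len(parts) == 3 and parts[0].lower() == "srv":
            # SRV*<local store>*<symbol server>
            entries.append({"local_store": parts[1], "server": parts[2]})
        else:
            # Anything else is kept as-is in this sketch.
            entries.append({"raw": element})
    return entries


symbol_path = r"SRV*C:\Symbols\Sym*http://msdl.microsoft.com/download/symbols"
entry = parse_symbol_path(symbol_path)[0]

# The local store is checked first, then the server.
lookup_order = [entry["local_store"], entry["server"]]
print(lookup_order)
```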

3. After pasting the code and changing anything about it to your liking (such as the directory name itself), navigate to File>Save As, change the Save as type: to All Files, and finally give it the File name: Symbols.bat or Symbols.cmd, etc.

4. Once you've done that, run it.

5. After it runs, you're all done setting up symbols and never have to worry about them ever again.

Extensions

Extensions are a really brilliant part of WinDbg that I actually didn't use for a while, but once I looked into them more, they really have made my life a lot easier. First off, one of the extensions I really can't live without now, even after using it for such a short period, is SwishDbgExt.

Some of its features, etc, are listed on that page, as well as the download link. Once you've downloaded it, installing it is really easy. When you extract it, it has a few items, but only two folders you're interested in (x64 and x86). Inside both of these folders is the extension itself, but for its specific architecture.

Now that you know the existence of the extension files, do the following:

1. Navigate to your WinDbg directory. For example, mine is C:\Program Files (x86)\Windows Kits\8.1\Debuggers

2. Once inside the Debuggers folder, double-click the x64 folder, and then the winext folder.

3. Now that you're in the winext folder, this is where you will simply drag n' drop the extension itself from the x64 extension folder. Once you've done that, you're all done for x64. Do the same for x86.

4. Now that you have both the extension files in their respective directories, load up a crash dump as an example and type !load swishdbgext to load the extension. For example:



Once it's loaded, you can then use any of the various commands that this extension holds. For example, we could run !ms_drivers to display a list of drivers loaded:



We can get in-depth IRP information this way on a driver, such as myfault.sys:



Workspace



Here's what you can get your workspace to look like with some time and some messing around. This was done based off of Tess Ferrandez's workspace. I personally prefer the all stock white/black WinDbg, but if The Matrix is your thing, go for it.

You can make all of these edits by opening up WinDbg on its own and not with a dump file, navigating to View>Options, and changing the Colors how you want. You can also change Fonts through the View menu. Ensure that after you get yours all set, you manually save your workspace by going to File>Save Workspace As..., and then creating one. This will save your workspace and make it the default for the future.

That's it for now, thanks for reading!

References/Links

- Pimp up your debugger: Creating a custom workspace for windbg debugging.
- Channel9.

Wednesday, July 16, 2014

Page Faults Explained

Hello everyone!

In this post, I'm going to do my best to go in-depth regarding page faults, but do my best to speak English at the same time. There are many page fault related articles out there, but I've noticed they either pick up from an imaginary somewhere (i.e. a rushed explanation that seems to begin and end abruptly), are incomplete, or assume you're already knowledgeable (even at a basic level) regarding Windows' memory manager, paging, page faults, etc. Recently, thanks very much to Pavel Yosifovich, I have a better understanding of page faults and would like to, as always, share my knowledge as a whole.

-- I would like to note that while double-checking my facts for this post, whenever I was not flat out correct, I was either incorrect or learned that there was far more to the topic than I thought I knew. This is one of the things I love most about making blog posts, and learning in general.

--------------------

First off, before even diving into page faults themselves, and especially since we want to do this the right way, we need to understand a few things (well, many things).

Disclaimer: I am not going to go extremely in-depth regarding Windows' memory manager (as that would take forever and a half/my knowledge is solely my knowledge), and if you are interested in that, Mark Russinovich has done a brilliant article over at TechNet, as well as many others all across the web if you do some digging (or check the reference links below). I am merely laying the groundwork for the understanding and explanation of page faults and nothing more.

If you ask me personally, Windows' memory manager (and memory management in general throughout the operating system) is one of the most complicated and in-depth parts of Windows internals. It's daunting yet extremely fascinating at the same time, as one extremely in-depth piece leads to another. It seems endless, and I highly recommend spending time reading into the memory management specifics throughout Windows, as it's truly fascinating.

Physical Memory

Physical memory is by far one of the most important resources, and one we must absolutely understand. Among many things, the memory manager within Windows is responsible for the data of all current active processes, drivers, and the operating system. Even today as of this blog post, the operating system itself accesses more code/data than can actually fit in physical memory. With this said, as said by the brilliant Mark Russinovich, think of physical memory as a window into the code/data used over time.

Now that this is known, we can understand that the amount of physical memory present on the system affects performance greatly, because if the data/code needed for a process or the operating system itself is not directly available in physical memory, it must be brought in (paged-in) from the disk which is quite the performance hit.

One of the reasons it's very important to understand physical memory before virtual memory (or in general) is because physical memory contributes to the virtual memory system limit, which interestingly enough is roughly the size of physical memory plus any page files configured on the operating system.

We can view the layout of physical memory with Meminfo (download here). Do note that you'll need to execute the program through an elevated command prompt manually. You can see the path I chose in the screenshot below, as well as the layout itself.


If you use meminfo.exe it will display the different parameters you can use. In our case, if we use meminfo.exe -r it will run Meminfo and display the valid physical memory ranges that are detected.

If you're interested in going further regarding physical memory consumption device-wise, you can use Device Manager to check what addresses devices are occupying. The image below is a simple snippet of my personal system's memory consumption as an example.


We can also take a look at the physical memory limits of Windows 7 and 8 as an example.







As we can see, the physical memory limits on the client operating systems increase drastically on x64, yet remain the same on x86. The x86 physical limit has remained 4 GB since Windows XP as far as client operating systems go. This is simply due to the fact that on x86 systems, the processor's address bus, which is 32 lines (i.e. 32 bits) wide, can only access addresses in the range 0x00000000 to 0xFFFFFFFF (totaling 4 GB).
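The 4 GB figure falls straight out of the arithmetic. A quick Python check, assuming nothing more than a 32-bit address where each address refers to one byte:

```python
ADDRESS_BITS = 32                         # width of the address in bits

addressable_bytes = 2 ** ADDRESS_BITS     # one byte per address
highest_address = addressable_bytes - 1

print(hex(highest_address))               # 0xffffffff
print(addressable_bytes // (1024 ** 3))   # 4 (GB)
```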

--------------------

Virtual Memory

Now that we understand some of the fundamentals behind physical memory, we can go ahead and discuss virtual memory. Do you have your cup of coffee? Good, you're going to need it. It's very important to first understand that virtual memory is a completely different entity than physical memory, although they both work together hand-in-hand.

An extremely important thing to note at this point is that a process does not equal (=/=) and/or mean the same thing as a program, and the same follows regarding a thread. For example, when and if you hear a user say "My Firefox (32 bit) process is running according to Task Manager", that's actually not correct. Processes do not run, threads run. Processes are solely a set of resources used to execute a program, and consist of a private virtual address space (where memory is allocated), an executable used to start the application (.exe), a table of handles to various kernel objects, a security context (otherwise known as an access token), and one (or possibly more) threads that execute the code.

With all of the above said, virtual memory is (among other things) a technique of Windows' memory manager that maps the memory addresses used by a program, namely virtual addresses, to physical addresses. In layman's terms, virtual memory exists to separate a program's view of memory from physical memory, so the operating system can then go ahead and decide whether to keep that program's code/data in physical memory (RAM) or on disk.

Let's break this down further! When you run a program, it generates addresses in the following ways:

  • Load instruction
  • Store instruction
  • Fetching an instruction

Absolutely phenomenal article here regarding the first two. In short, the first two create data addresses, and the third creates instruction addresses. It's very important to know that RAM cannot distinguish between the two, and simply sees them as addresses. Addresses generated by programs are considered virtual, therefore they need to be translated to physical addresses. How does this happen? Good question! This all occurs through address translation hardware in the CPU, known as the memory management unit (MMU), which works from page tables that the kernel sets up.

After MMU translates virtual > physical, the operating system can then go ahead and create a virtual address space that allows programs to reference more memory than actually physically available by using disk. This is one of the main benefits regarding virtual memory, aside from memory protection.


Thanks to Mike from BrokenThorn for the above image!

All of the above now finally leads into paging and page faults, which will be discussed below.

--------------------

Paging

Paging is very important in many ways, mainly because it allows the operating system to virtualize memory without worrying about segmentation. Instead of splitting up an address space into three logical segments, it's split up into fixed-size units known as pages.

-- A page is a sequence of N bytes where N is a power of 2.

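Page sizes depend on the architecture: 4 KB is the usual minimum (and the standard page size on x86/x64), while some architectures and large-page modes use 64 KB or more.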
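Because the page size is a power of 2, an address splits cleanly into a page number and an offset within that page. Here is a minimal Python sketch, assuming 4 KB pages. The address 0x12345 is just an arbitrary example and split_address is a name I made up.

```python
PAGE_SIZE = 4096                           # assumed 4 KB page (2 ** 12 bytes)
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 bits of offset

def split_address(virtual_address):
    """Return (page number, offset within the page) for an address."""
    page_number = virtual_address >> OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    return page_number, offset


page_number, offset = split_address(0x12345)
print(hex(page_number), hex(offset))       # 0x12 0x345
```

With 4 KB pages the low three hex digits are always the offset, which is handy when eyeballing addresses in the debugger.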

--------------------

Page Table/Disk Map

Now that we understand pages/paging, every address space on the system has two things associated with it:

1. Page Table - Identifies which pages are in physical memory.

2. Disk Map - Identifies where all the pages are on the disk.

Both of these describe an entire address space.

In an effort to make my own content this time as opposed to using pre-created images (inspired by P.J. Denning and Steve Coile), I have created the image below.


Regarding the above image, the flags are as follows:

P - Presence flag

U - Used flag

M - Modified flag

F - Page frame

A - Disk address

With the above now known, if the P flag is set, this implies that the page is currently in physical memory (RAM). The F flag determines its location in memory, and is the number of the page frame in which the page is located.

If however the P flag is not set (not in physical memory), the address mapper will throw a page fault if the process in question attempts to reference the page. If this occurs, the page fault handler will use the disk map to go ahead and locate the page on the disk, and finally swap it (or page it) in. This is only a very minor explanation of a page fault process, and I will expand on page faults below.
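To tie the P/U/M/F/A fields together, here is a toy Python model of the lookup just described. It is only a sketch: the class names are hypothetical, it assumes there is always a free frame available (no eviction), and it doesn't actually read anything from disk.

```python
class PageTableEntry:
    def __init__(self, disk_address):
        self.present = False              # P - is the page in physical memory?
        self.used = False                 # U - has the page been referenced?
        self.modified = False             # M - has the page been written to?
        self.frame = None                 # F - page frame holding the page
        self.disk_address = disk_address  # A - where the page lives on disk


class AddressSpace:
    def __init__(self, disk_map):
        # disk_map: page number -> disk address
        self.table = {page: PageTableEntry(addr) for page, addr in disk_map.items()}
        self.next_free_frame = 0
        self.page_faults = 0

    def access(self, page, write=False):
        entry = self.table[page]
        if not entry.present:             # P not set -> page fault
            self.page_faults += 1
            self._page_in(entry)
        entry.used = True
        if write:
            entry.modified = True
        return entry.frame

    def _page_in(self, entry):
        # A real handler would read the page from entry.disk_address here.
        entry.frame = self.next_free_frame
        self.next_free_frame += 1
        entry.present = True


space = AddressSpace({0: 100, 1: 101})
first = space.access(0)    # not present yet: faults and gets paged in
second = space.access(0)   # already present: no fault this time
print(space.page_faults)   # 1
```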

--------------------

Page Fault

Finally! We get to the page fault, what we've been waiting for. I described above a very basic page fault process, but it's a lot cooler/interesting than that! In its basic definition of course, a page fault occurs when a program attempts to access pages that are not currently in physical memory (RAM). This is also known as a hard fault. It's absolutely imperative you understand the difference between hard fault and soft fault, which I will discuss below.

Hard Fault - Hard Fault (otherwise known/referred to as a major fault) is the exact same thing as Page Fault, and you'll see its name in Resource Monitor on newer versions of Windows (afaik Vista and later). To expand on why hard faults are defined as they are, and to stress why they're expensive, it's due to the process the page fault handler must follow if one occurs. For example, if the page is not loaded into memory at the time of a program referencing its address, the page fault handler needs to find a 'free location'. This free location is either a free page in memory, or a non-free page in memory that has to be freed up first.
 
If the latter is currently in use by a pre-existing process, the operating system needs to spend time writing out the data in that current page, and mark it as not being loaded into memory. Once this is done, it is now a free location and can be used to read the data for the new page into memory, add an entry to its location within the MMU, and finally of course indicate that it is now successfully loaded into memory.

Below is an image representation of the entire process outlined above.


Soft Fault - Entirely different from a hard fault, a soft fault occurs when the page is already in physical memory, but the MMU (as we discussed above) has not yet marked it as being loaded. This is sometimes/also referred to as a minor fault, as the solution is simple (i.e. have the operating system create an entry for the page, have the MMU point to that page in memory, and finally of course indicate that it is now successfully loaded into memory).

With all of this said, you can imagine why page faults/hard faults are an extremely expensive process. It's also imperative that you understand that having to unnecessarily access the disk is very slow. Anyone who has ever had a system experiencing multiple/frequent page faults for whatever reason can truly attest to how much their system slows to a crawl.

Why is this? Well, since you now understand what actually occurs during a page fault behind the scenes, we can imagine how ridiculously taxing this is on the disk. It doesn't help that the process of actually accessing the disk itself is slow in general, but to have to constantly do it is very bad. This is a good time to discuss the main pros & cons of virtual memory (i.e the disk). It's more like pro & con, really.

Pros - Very easy to get a lot of disk space for a small cost.

Cons - Slow!

A processor's register can be accessed in about a nanosecond, cache in about 5 nanoseconds, and RAM in approximately 100 nanoseconds. A mechanical hard disk, by comparison, takes milliseconds per access, which makes it anywhere from tens of thousands to millions of times slower than those figures. If you are constantly/frequently having to go through the process of a page fault, you can truly imagine now how slow it is.
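To get a feel for how quickly hard faults add up, here is a rough Python back-of-the-envelope calculation of the average memory access time for a given hard fault rate. The 100 ns RAM figure comes from above. The 10 ms disk figure is my own assumption for a mechanical hard disk, so treat the result as an order-of-magnitude illustration only.

```python
RAM_NS = 100              # RAM access time in nanoseconds (figure from this post)
DISK_NS = 10_000_000      # assumed 10 ms per disk access, expressed in nanoseconds

def effective_access_ns(fault_rate):
    """Average access time when a fraction fault_rate of accesses hard fault."""
    return (1 - fault_rate) * RAM_NS + fault_rate * DISK_NS


no_faults = effective_access_ns(0.0)
one_in_thousand = effective_access_ns(0.001)
print(no_faults, one_in_thousand)
print(one_in_thousand / no_faults)   # slowdown factor
```

Under these assumptions, even one hard fault per thousand accesses makes the average access roughly a hundred times slower than RAM alone.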

--------------------

Page Fault - Continued

To go a bit more behind the scenes in regards to page faults, what actually happens when a page fault occurs is the thread that was running is placed into a wait state until the operating system's page fault handler can go ahead and go through the page fault process outlined above. This is done through an interrupt that halts (remember, wait state) the current program.
 
The instruction that attempted to access a page that was either invalid or nonresident (i.e. not in physical memory) fails and throws an exception that generates the interrupt discussed above. Before discussing anything further, we must first discuss why an exception is thrown. Quite simply, an exception is thrown because the CPU has no idea what page files are, etc, and only knows how to work with memory.
 
At this point, one of two things happens:
 
1. An Interrupt Service Routine (ISR) determines that the address is in fact valid, however is not resident (not in physical memory). The operating system then goes ahead and throws an exception (page fault) and goes through the page fault process outlined above. Once the page fault process is successfully completed, the program picks up right where it left off like nothing happened.
 
or
 
2.  An Interrupt Service Routine (ISR) determines that the address is in fact invalid, and then throws an exception known as an access violation. Remember above how we discussed hard (major)/soft (minor) faults? This is specifically known as an invalid fault. In this case, as opposed to following the page fault process outlined above, it is told to not attempt to access the memory as it's a null/bad address, and to simply terminate the executing program in question.
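The two outcomes can be summed up in a few lines of Python. This is a deliberately simplified sketch: handle_access, valid_ranges and resident_pages are names I made up, a 4 KB page size is assumed, and a real memory manager tracks far more state than a list of ranges and a set.

```python
def handle_access(address, valid_ranges, resident_pages, page_size=4096):
    """Classify a memory access the way the two cases above describe."""
    page = address // page_size
    if page in resident_pages:
        return "no fault"
    if any(start <= address < end for start, end in valid_ranges):
        # Case 1: valid but not resident, so page it in and carry on.
        resident_pages.add(page)
        return "page fault: paged in, execution resumes"
    # Case 2: the address is not valid at all.
    return "access violation: program terminated"


valid_ranges = [(0x10000, 0x20000)]   # hypothetical valid region
resident_pages = set()

first = handle_access(0x10010, valid_ranges, resident_pages)
second = handle_access(0x10010, valid_ranges, resident_pages)
bad = handle_access(0x0, valid_ranges, resident_pages)
print(first, second, bad, sep="\n")
```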

With the above said, this may dispel the common misconception of saying 'frequent page faults are okay'. Frequent page faults are not okay, but that is not to say that page faults aren't a normal operation of the operating system. On a fully functioning machine (regarding both hardware and software), you will experience page faults on a very small scale due to some programs simply requiring more memory (an example). If you are experiencing a very large number of page faults occurring (hard faults/sec), you have a problem, and you can most certainly tell because your system is likely slowed to that of a snail.

As far as some of the potential issues go when it comes to frequent page faults on a system:
 
  • Insufficient RAM (physical memory).
  • Faulty RAM.
  • Need to tailor the pagefile to your system's specific needs.
  • etc

--------------------

Page Fault - BSOD

Now that we understand what happens when a page fault occurs in user-mode, it's also imperative we understand what happens when an exception such as an access violation is thrown in kernel-mode. As you may or may not be able to tell with the way I started this, when a page fault exception such as an access violation occurs in kernel-mode, this results in a bug check (Blue Screen of Death, BSOD).
 
Why? Well, remember we discussed that when an access violation for example occurs, the page fault handler goes ahead and terminates the program. What if we're in kernel-mode and the instruction involving the violation is a device driver? "Uh oh" is exactly what happens. Luckily I have a crash dump from just the other day!
 
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M (1000007e)

This indicates that a system thread generated an exception which the error handler did not catch.

BugCheck 1000007E, {ffffffffc0000005, fffff88004a5d62a, fffff880035af908, fffff880035af160}

The 1st argument is the exception code that wasn't handled by the error handler. In this case, ffffffffc0000005 is an NTSTATUS code. Kernel-mode drivers use NTSTATUS types for return values. ffffffffc0000005's NTSTATUS value is 0xc0000005 (otherwise known as an access violation).

Using NTSTATUS Values
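If you're wondering how ffffffffc0000005 turns into 0xc0000005, the upper 32 bits look like plain sign extension of a 32-bit value into a 64-bit argument. Here is a small Python sketch of that reading (sign_extend_32_to_64 is a made-up helper name):

```python
arg1 = 0xFFFFFFFFC0000005          # 1st bug check argument from above

ntstatus = arg1 & 0xFFFFFFFF       # keep the low 32 bits
print(hex(ntstatus))               # 0xc0000005

# The top two bits of an NTSTATUS are the severity; 3 means error.
severity = ntstatus >> 30
print(severity)                    # 3

def sign_extend_32_to_64(value):
    """Widen a 32-bit value to 64 bits, copying bit 31 into the upper half."""
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value

print(hex(sign_extend_32_to_64(ntstatus)))   # 0xffffffffc0000005
```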

The 2nd argument is the memory address at which the exception occurred. In our case, this was fffff88004a5d62a.
 
The 3rd argument is the address of the exception record. In our case, this was fffff880035af908, which we can run .exr on to show exception record information.
1: kd> .exr 0xfffff880035af908
ExceptionAddress: fffff88004a5d62a (igdkmd64+0x000000000003862a)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000000000073070
Attempt to read from address 0000000000073070
Just by looking at the attempted address read, we can assume it's not a null address (because it's not zero), so it must be simply invalid. You can if interested confirm this by running !pte address which will display the page table entry (PTE) and page directory entry (PDE) for the specified address. This is not a kernel-dump, so running it in my case wouldn't yield any beneficial results.

The 4th argument is the context record address. In our case, this was fffff880035af160 which we can run .cxr on to show the context record.
1: kd> .cxr 0xfffff880035af160
rax=0000000000073000 rbx=fffffa8006299040 rcx=fffffa800637d540
rdx=00000000008dfaf0 rsi=fffffa800637d540 rdi=fffff88004a5d3c0
rip=fffff88004a5d62a rsp=fffff880035afb40 rbp=fffffa8006356b20
 r8=0000000000000000  r9=0000000000000000 r10=0000000000000018
r11=fffff880035afb60 r12=fffffa8006356b20 r13=fffffa800661cc30
r14=0000000000000038 r15=fffff88000f15fe0
iopl=0         nv up ei pl nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00210206
igdkmd64+0x3862a:
fffff880`04a5d62a ff5070          call    qword ptr [rax+70h] ds:002b:00000000`00073070=????????????????
This shows us the context that was saved from the exception at the time of the crash. It contains the CPU registers, the instruction we failed on, the bad address, etc. First off, the exception (access violation) was thrown by/occurred because of igdkmd64.sys (Intel Graphics driver) referencing invalid memory. Regarding the instruction we failed on, we were calling through a function pointer stored at rax+70h. The rax register in our case was 0000000000073000 (invalid), so the processor tried to read the pointer from 0000000000073070. The debugger shows ???????????????? for that location because the memory could not be read, therefore the box bug checked.
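As a quick sanity check, the address in the faulting instruction can be worked out by hand from the register dump. In Python, using the rax value and the 70h displacement from the output above:

```python
rax = 0x0000000000073000     # rax from the .cxr output
displacement = 0x70          # from: call qword ptr [rax+70h]

target = rax + displacement
print(hex(target))           # 0x73070
```

This matches the "Attempt to read from address 0000000000073070" line in the .exr output.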

We can see it from another perspective by disassembling the rip register:
1: kd> u @rip
igdkmd64+0x3862a:
fffff880`04a5d62a ff5070          call    qword ptr [rax+70h]
fffff880`04a5d62d 488b442420      mov     rax,qword ptr [rsp+20h]
fffff880`04a5d632 8b8034010000    mov     eax,dword ptr [rax+134h]
fffff880`04a5d638 c1e813          shr     eax,13h
fffff880`04a5d63b 83e001          and     eax,1
fffff880`04a5d63e 85c0            test    eax,eax
fffff880`04a5d640 0f84ab000000    je      igdkmd64+0x386f1 (fffff880`04a5d6f1)
fffff880`04a5d646 488b442478      mov     rax,qword ptr [rsp+78h]
--------------------

And that's that! I really hope you enjoyed reading, and I imagine there will be many edits/additions to be made as time goes by. For now though, at this moment, I am happy with it.

References/Links

- Pavel Yosifovich's Windows Internals 1/2.
- The Basics of Page Faults.
- Windows Memory Management (Written by: Pankaj Garg).
- Pushing the Limits of Windows: Physical/Virtual Memory.
- So What Is A Page Fault?
- HP OpenVMS Systems Documentation.
- Virtual Address Space and Physical Storage.
- Everything You Need To Know To Start Programming 64-Bit Windows Systems.
- Virtual Memory.
- Physical Memory Structures.
- How to virtualize memory without segments.
- Load/Store Instructions.
- Operating Systems Development - Virtual Memory (by Mike, 2008).
- Implementation of swapping in virtual memory.
- Page fault handling (image).

Saturday, July 5, 2014

0x000000D1 Debugging - NotMyFault exploration (x64)

I've discussed some 0xD1 debugging here, but I figured I'd also go into a different 0xD1 scenario here, and just show it from different angles by using NotMyFault to force a bug check.

Download NotMyfault here.

--------------------

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

This indicates that a kernel-mode driver attempted to access pageable memory at a process IRQL that was too high.

We're all familiar with this bug check, so let's move on to what I wanted to talk about.

Let's go ahead and do an !analyze -v

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffff8a0066eb800, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff88002af7385, address which referenced memory
fffff8a0066eb800 was the memory that was referenced. It's either an invalid address, or it's pageable memory that was accessed at an IRQL that was too high.

kd> !pte fffff8a0066eb800
                                           VA fffff8a0066eb800
PXE at FFFFF6FB7DBEDF88    PPE at FFFFF6FB7DBF1400    PDE at FFFFF6FB7E280198    PTE at FFFFF6FC50033758
contains 000000007AC84863  contains 000000000367B863  contains 000000006B4C6863  contains 00003B5000000000
pfn 7ac84     ---DA--KWEV  pfn 367b      ---DA--KWEV  pfn 6b4c6     ---DA--KWEV  not valid
                                                                                  PageFile:  0
                                                                                  Offset: 3b50
                                                                                  Protect: 0
Using our handy !pte command, which shows the page table and directory entries for an address, we can see that it is not a valid address despite appearing to be one at first glance. Why is it not valid? As the PageFile and Offset fields in the output above show, it's because the page behind this address is currently in the pagefile.
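To see where the numbers in the !pte output come from, here is a rough Python sketch. It assumes the fixed self-map base addresses used by x64 Windows before Windows 10 version 1607 (which covers the Windows 7 box here), and an assumed software PTE layout for a page sitting in the pagefile. The function names are my own and the layout ignores prototype and transition PTEs, so treat it as an illustration rather than a replacement for !pte.

```python
# Assumption: fixed self-map bases of x64 Windows before Windows 10 1607.
PXE_BASE = 0xFFFFF6FB7DBED000
PPE_BASE = 0xFFFFF6FB7DA00000
PDE_BASE = 0xFFFFF6FB40000000
PTE_BASE = 0xFFFFF68000000000
VA_MASK = (1 << 48) - 1  # only the low 48 bits take part in translation


def entry_addresses(va: int) -> dict:
    """Virtual addresses of the four paging entries that map `va`."""
    low = va & VA_MASK
    return {
        "PXE": PXE_BASE + (low >> 39) * 8,  # each entry is 8 bytes
        "PPE": PPE_BASE + (low >> 30) * 8,
        "PDE": PDE_BASE + (low >> 21) * 8,
        "PTE": PTE_BASE + (low >> 12) * 8,
    }


def decode_paged_out_pte(value: int) -> dict:
    """Assumed layout: valid bit 0, pagefile number in bits 1-4,
    protection in bits 5-9, pagefile offset in bits 32-63."""
    return {
        "valid": value & 1,
        "pagefile": (value >> 1) & 0xF,
        "protect": (value >> 5) & 0x1F,
        "offset": value >> 32,
    }


for name, addr in entry_addresses(0xFFFFF8A0066EB800).items():
    print(f"{name} at {addr:016X}")
print(decode_paged_out_pte(0x00003B5000000000))
```

Under those assumptions, the four addresses line up with the PXE/PPE/PDE/PTE addresses printed by !pte, and the decoded fields give PageFile 0, Offset 3b50 and Protect 0.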

Why can't we just page it in? As we know, this is not how the Windows memory manager works regarding kernel-mode and its rules. If we're at IRQL 2 (DISPATCH_LEVEL) or higher (which we are, see argument 2), we cannot page anything in, therefore we bug check.
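The rule can be written down as a tiny decision function. This is only a toy model of the behavior described above, not how the memory manager is actually implemented, and the function name and return strings are invented for illustration.

```python
# IRQL values on x64 Windows that matter for this rule.
PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2


def paged_out_access_outcome(irql: int) -> str:
    """What happens when kernel code touches a page that lives in the pagefile."""
    if irql < DISPATCH_LEVEL:
        # The page fault can be serviced, so the page is read back in.
        return "page in from pagefile"
    # At DISPATCH_LEVEL or above the fault cannot be serviced.
    return "bug check 0xD1"


print(paged_out_access_outcome(PASSIVE_LEVEL))  # page in from pagefile
print(paged_out_access_outcome(2))              # bug check 0xD1 (Arg2 in our dump)
```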

Great, so we know why the system crashed. However, what caused it?

--------------------

Let's go ahead and dump the stack:

kd> k
Child-SP          RetAddr           Call Site
fffff880`032f4448 fffff800`02a912a9 nt!KeBugCheckEx
fffff880`032f4450 fffff800`02a8ff20 nt!KiBugCheckDispatch+0x69
fffff880`032f4590 fffff880`02af7385 nt!KiPageFault+0x260
fffff880`032f4720 fffff880`02af7727 myfault+0x1385
fffff880`032f4870 fffff800`02dac127 myfault+0x1727
fffff880`032f48d0 fffff800`02dac986 nt!IopXxxControlFile+0x607
fffff880`032f4a00 fffff800`02a90f93 nt!NtDeviceIoControlFile+0x56
fffff880`032f4a70 00000000`76df138a nt!KiSystemServiceCopyEnd+0x13
00000000`0023edc8 00000000`00000000 0x76df138a
So here we have our call stack. Rather than doing <--- next to the calls, I'll just do this below because I don't want to destroy the formatting of the stack.

We start out with something in user-mode that we don't have the symbols for, and this is why it's 0x76df138a as opposed to a resolved name that we can understand. How did I know we started out with something going on in user-mode? Good question! Look at the leading digits of the address: on x64, a user-mode address sits in the low part of the address space, so its upper digits are zeros (here 00000000`76df138a), whereas a kernel-mode address starts with ffff.

The lack of a resolved name is also due to the fact that this is a kernel-dump, which we can see towards the top of our crash dump within WinDbg:

Kernel Summary Dump File: Only kernel address space is available
With that said, we cannot see what the application was doing outside of when it went down into kernel-mode.

So we know that some application (0x76df138a) did something, and called down into kernel-mode. Everything above 0x76df138a in the stack is kernel-mode. On x64, you can tell because the addresses under Child-SP, such as fffff880`032f4a00, start with fffff, which implies kernel-mode.
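If you want to check an address without eyeballing it, here is a small Python sketch of the user/kernel split. It assumes the architectural split between the low and high canonical halves of the x64 address space; Windows itself may use a tighter user-mode limit, so this is a rule of thumb. The names parse_address and classify are my own.

```python
USER_MAX = 0x00007FFFFFFFFFFF    # top of the low canonical half
KERNEL_MIN = 0xFFFF800000000000  # bottom of the high canonical half


def parse_address(text: str) -> int:
    """Accept WinDbg-style addresses such as fffff880`032f4a00 or 0x76df138a."""
    return int(text.replace("`", ""), 16)


def classify(text: str) -> str:
    addr = parse_address(text)
    if addr <= USER_MAX:
        return "user-mode"
    if addr >= KERNEL_MIN:
        return "kernel-mode"
    return "non-canonical"


print(classify("00000000`76df138a"))  # user-mode
print(classify("fffff880`032f4a00"))  # kernel-mode
```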

We can see it goes through a few functions, and then ends up in myfault. Shortly afterwards, we hit a page fault (trying to page in memory from the pagefile at this IRQL -- big no no).
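When a stack is long, it can help to pull out the first frame that doesn't belong to the kernel itself. Below is a rough Python sketch that does this on the k output above, pasted in without its header line. The helper names are invented, and treating "nt" as the only module to skip is an assumption that suits this dump, not a general rule.

```python
import re

# The `k` output from above, without the "Child-SP RetAddr Call Site" header.
STACK = """\
fffff880`032f4448 fffff800`02a912a9 nt!KeBugCheckEx
fffff880`032f4450 fffff800`02a8ff20 nt!KiBugCheckDispatch+0x69
fffff880`032f4590 fffff880`02af7385 nt!KiPageFault+0x260
fffff880`032f4720 fffff880`02af7727 myfault+0x1385
fffff880`032f4870 fffff800`02dac127 myfault+0x1727
fffff880`032f48d0 fffff800`02dac986 nt!IopXxxControlFile+0x607
fffff880`032f4a00 fffff800`02a90f93 nt!NtDeviceIoControlFile+0x56
fffff880`032f4a70 00000000`76df138a nt!KiSystemServiceCopyEnd+0x13
00000000`0023edc8 00000000`00000000 0x76df138a
"""


def call_sites(text: str) -> list:
    """Third column of each frame: the call site."""
    sites = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            sites.append(parts[2])
    return sites


def module_of(site: str) -> str:
    """'nt!KiPageFault+0x260' -> 'nt', 'myfault+0x1385' -> 'myfault'."""
    return re.split(r"[!+]", site, maxsplit=1)[0]


def first_non_kernel_frame(text: str, skip=("nt",)):
    """First frame, from the top, that is neither in `skip` nor a raw address."""
    for site in call_sites(text):
        module = module_of(site)
        if module not in skip and not module.startswith("0x"):
            return site
    return None


print(first_non_kernel_frame(STACK))  # myfault+0x1385
```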

--------------------

If we take a look at the trap frame:

kd> .trap 0xfffff880032f4590
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000005000000 rbx=0000000000000000 rcx=0000000000002481
rdx=fffffa8001810000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff88002af7385 rsp=fffff880032f4720 rbp=fffff880032f4b60
 r8=0000000000012408  r9=0000000000000810 r10=fffff80002a12000
r11=0000000000000002 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
myfault+0x1385:
fffff880`02af7385 8b03            mov     eax,dword ptr [rbx] ds:00000000`00000000=????????
The first very important thing to note is the note about the trap frame not containing all registers, and how they may be either zeroed out or incorrect. The big question is why? Well, trap frame code generation on x64 versions of Windows does not save the contents of registers that are non-volatile.

With that said, registers such as rbx, rdi, rsi, etc, are either zeroed out or incorrect. This is due to the fact that on x64, any code that runs after the trap frame is generated will properly save and restore the non-volatile registers it uses within its own frame. Saving them in the trap frame as well is seen as an unnecessary step in a hot path within the kernel.

Extremely detailed article with much more info here.
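As a quick sanity check, here is a small Python sketch that parses the .trap register dump and flags the registers you shouldn't trust. The set of unsaved registers is an assumption on my part: it is the non-volatile general-purpose registers minus rbp and rsp, which matches the zeros in the dump above. The names are made up for illustration.

```python
import re

# Register lines from the .trap output above.
TRAP_OUTPUT = """\
rax=0000000005000000 rbx=0000000000000000 rcx=0000000000002481
rdx=fffffa8001810000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff88002af7385 rsp=fffff880032f4720 rbp=fffff880032f4b60
 r8=0000000000012408  r9=0000000000000810 r10=fffff80002a12000
r11=0000000000000002 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
"""

# Assumption: non-volatile registers that the trap frame does not capture.
UNSAVED_IN_TRAP_FRAME = {"rbx", "rsi", "rdi", "r12", "r13", "r14", "r15"}


def parse_registers(text: str) -> dict:
    """Map register name -> value for every name=hex pair in the dump."""
    pairs = re.findall(r"\b(r\w+)=([0-9a-f]+)", text)
    return {name: int(value, 16) for name, value in pairs}


registers = parse_registers(TRAP_OUTPUT)
suspect = sorted(name for name in registers if name in UNSAVED_IN_TRAP_FRAME)
print(suspect)  # the registers whose displayed values should not be trusted
```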

Moving on to the instruction we failed on, we were reading a 32-bit value from the address stored in the rbx register:

mov     eax,dword ptr [rbx]

Uh oh, rbx is zeroed out. With that said, we can't !pte the register address to double check it, etc. We just need to assume that this all occurred because myfault attempted to access memory that was either paged out or invalid (which it did).

--------------------

If you wanted any extra proof or to see if NotMyFault was behind the crash, you could dump all of the processes at the time of the crash to see if there was any correlation. In this case, you'd use !process 0 0. Flags are important in this case, and you can as always check the WinDbg help file for info, or use MSDN.

PROCESS fffffa80040a7060
    SessionId: 1  Cid: 0654    Peb: 7fffffd4000  ParentCid: 0708
    DirBase: 670ea000  ObjectTable: fffff8a00666c330  HandleCount:  68.
    Image: NotMyfault.exe
We can see we did indeed have a NotMyFault process running at the time of the crash, so we can at this point assume that this is very likely the accurate cause of the crash.
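On a busy box the !process 0 0 output can be long, so if you've saved it to a text file, a few lines of Python can pull out the image names for you. This is just a sketch that assumes each process block carries an "Image:" line like the one above. The function name image_names is my own.

```python
import re

# One process block from the !process 0 0 output above.
PROCESS_OUTPUT = """\
PROCESS fffffa80040a7060
    SessionId: 1  Cid: 0654    Peb: 7fffffd4000  ParentCid: 0708
    DirBase: 670ea000  ObjectTable: fffff8a00666c330  HandleCount:  68.
    Image: NotMyfault.exe
"""


def image_names(text: str) -> list:
    """Collect the value of every 'Image:' line."""
    return re.findall(r"^\s*Image:\s*(\S+)", text, flags=re.MULTILINE)


names = image_names(PROCESS_OUTPUT)
print(names)                                         # ['NotMyfault.exe']
print(any("notmyfault" in n.lower() for n in names))  # True
```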

Hope you enjoyed reading!

0x0000119 Debugging - Invalid Fence IDs

Now that my extremely exciting week has come to an end, and I now have a moment to sit and relax, I figured what better way to do that than to go ahead and write a blog post! In this post we'll be discussing the 0x0000119 bug check, otherwise known as VIDEO_SCHEDULER_INTERNAL_ERROR. I worked on one not too long ago, found the thread while I was cleaning out some reference bookmarks, and figured I'd do a write-up!

---------------------------

As usual, let's have a look at the basic description of the bug check:

VIDEO_SCHEDULER_INTERNAL_ERROR (119)

This indicates that the video scheduler has detected a fatal violation.

See this MSDN article for more information on Windows video scheduling, memory management, etc.

With this out of the way, let's jump right in and have some fun!

Using the basic !analyze -v:
-- By the way, ! is known as bang. Interesting tidbit of the day : )

VIDEO_SCHEDULER_INTERNAL_ERROR (119)
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.
Arguments:
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
Arg2: 0000000000000c00
Arg3: 0000000000000c01
Arg4: 0000000000000c01
Great, so right away we actually have some pretty helpful information: the 1st argument tells us that 'The driver has reported an invalid fence ID'. Now that we know this is the reason behind the bug check occurring on the system, we need to understand what driver reported an invalid fence ID, and what a fence ID even is.

Regarding arguments 2, 3, and 4, I believe 2 is the invalid fence ID we're dealing with, and 3 & 4 are what the expected fence ID was.

First off, we need to understand the Windows Display Driver Model (WDDM) - Article here. After reading this, we can understand that a fence ID is essentially a glorified ticket for the GPU to have access to process a Direct Memory Access (DMA) buffer. This is done so the GPU itself doesn't have to bother the CPU or OS, and its life is a lot easier.

---------------------------

Now that we know the above, let's take a look at the call stack:

0: kd> k
Child-SP          RetAddr           Call Site
fffff800`04438528 fffff880`015ed22f nt!KeBugCheckEx
fffff800`04438530 fffff880`07807ec5 watchdog!WdLogEvent5+0x11b
fffff800`04438580 fffff880`07808131 dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
fffff800`044385b0 fffff880`07807f82 dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
fffff800`04438600 fffff880`078f513f dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff800`04438630 fffff880`073d64d8 dxgkrnl!DxgNotifyInterruptCB+0x83 <--- DMA buffer completed.
fffff800`04438660 fffffa80`08d938e8 igdkmd64+0x1744d8
fffff800`04438668 fffff800`031f4e80 0xfffffa80`08d938e8
fffff800`04438670 fffff800`04438840 nt!KiInitialPCR+0x180
fffff800`04438678 fffffa80`0923d7a8 0xfffff800`04438840
fffff800`04438680 fffffa80`0925b000 0xfffffa80`0923d7a8
fffff800`04438688 fffffa80`00000c00 0xfffffa80`0925b000
fffff800`04438690 fffff880`0c52b000 0xfffffa80`00000c00
fffff800`04438698 00000000`00000000 0xfffff880`0c52b000

Essentially what happened here was that after the GPU finished processing the DMA buffer, the Intel Graphics driver (igdkmd64.sys) notified DirectX that the work was finished and provided an ID # for the DMA buffer (known as a fence ID). In our case, this was an invalid fence ID, therefore DirectX said 'woah, this isn't right' and called the bug check to stop the GPU from continuing with illegally accessed memory.

---------------------------

With such an issue you may think that it's always a bad GPU, however, in this specific case it was simply a video driver issue that was solved with an update. Update those video drivers!

Hope you enjoyed reading!

Tuesday, July 1, 2014

Microsoft MVP - Thank you!

Wow.... that's really all I can say!

I'm going to take this time to give an extremely large thank you to everyone. I'd first like to of course start out by thanking everyone who has nominated me, it means an unbelievable amount to me. I cannot express how much it means to be nominated by a Microsoft MVP in general, but how much more it means to be nominated by people you learned from. To know that the people you once learned from now believe you have 'grown enough' to become an MVP yourself is absolutely mind blowing.

The MVP award to me is extremely important because debugging is not just a passion, but rather an extremely huge part of my life. I have gone from being simply interested in debugging ~2 and a half years ago, to wanting it to be involved in my every day working life. I have learned so much in the time I have been debugging, but there is so much I still do not know, and that's the beauty of it all.

In my life, what makes me happy is helping other people. The fact that I was able to combine my love for debugging, and actively helping people across various communities, has truly made an extremely positive impact on my life. Even more that I have been recognized and awarded for it by Microsoft is a dream truly come true. I am beyond honored.

I'd like to extend a huge thank you to my friends as well who I interact with every day across all of the communities. There are far too many names to name, but you certainly know who you are. I love working with you all!

So again, thank you very much, and especially to the community for allowing me to work with you, and to help you. I hope I can wear the MVP badge well, and I hope I can make many new friends when Summit comes around.

....Now, time for more debugging posts : )

- Patrick