I'm a programmer, working for Valve (http://www.valvesoftware.com/), focusing on optimization and reliability. Nothing's more fun than making code run 5x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
Posts by Bruce-Dawson

I’ve written previously about the importance of crashing in order to improve code quality. However, even the seemingly simple task of crashing can be more error-prone than you might expect.

I’ve recently become aware of two different problems that can happen when crashing in 64-bit Windows. There is a Windows bug which can make debuggers forget where a crash happened, and there is a Windows design decision which sometimes causes a crash to be completely ignored!

Both problems are (mostly) avoidable once you know what to do, but the required techniques are far from obvious.


Forgetting where a crash happened

It is a reasonable minimum requirement that a debugger should halt on the exact instruction that triggered a fault and then attempt to show source code, local variables, a call stack etc. There are all sorts of reasons it may be difficult or impossible to show source code (none available), local variables (optimized away), or a call stack (stack trashed), but for user-mode debugging it should always be possible to stop on the faulting instruction.

And indeed, in all the decades that I have used Visual C++ it has managed this task quite well – until recently.

Starting a few months ago I noticed that, when the program that I was debugging crashed, the VC++ debugger would not halt on the faulting instruction. It wouldn’t even halt in the crashing function. Instead it would halt two levels into the OS, with a call stack that made no sense. At first I thought that the project I was working on was doing something weird with a structured exception handler but I was able to reproduce the bug on a fresh project created by the VC++ New Project Wizard. I briefly thought that maybe something was misconfigured on my machine, but then my coworkers started reporting this problem as well. Then I thought maybe it was a newly introduced VC++ bug – but the same problem can be triggered in windbg as well.

I wasn’t sure what was happening but it smelled like a recently introduced Windows bug.

My minimal test program for this bug was to call this Crash() function just before the message pump in a default Win32 program, debug build:

void Crash()
{
    char* p = 0;
    p[0] = 0;
}

If I break on the instruction that will crash then I get the call stack below, and I should get the same call stack after crashing:

[Image: the expected call stack, with Crash() at the top]

That is indeed the call stack that I got in this scenario for years. However, starting a few months ago, on most 64-bit Windows 7 machines that I have tested this on, the actual call stack is this:

[Image: the call stack after the crash, with Crash() missing]

Notice that the function that crashed is not even listed! This makes routine bug investigation an expert-level problem.

Sometimes the crash call stack is even worse, with even the parent of the crashing function missing:

[Image: a crash call stack with the parent of the crashing function also missing]

The actual stack displayed varies. Sometimes it is correct, and sometimes the two ZwRaiseException entries are listed. It seems to depend on subtle details of the code at the crash location, or the stack frames, or the phase of Venus.

Windbg defaults to halting on first-chance exceptions, so it normally avoids this bug. However, if you continue execution after a crash then the exception handlers run and the bug appears.

I’ve created a simple test program with a “Crash normally” menu item so that you can easily test it. Source and the executable are available here. You’ll have to build the project file (with VS 2010 or VS 2012) to get symbols in order to see this properly in a debugger.

Another blogger investigated this issue earlier this year and found the root cause. The issue is a bug in the WOW64 support for AVX. Saving the state of the AVX registers requires additional space, and apparently the WOW64 debug support fails to reserve enough space, so the stack gets corrupted. Oops.

There is a fix (well, a couple of workarounds)

The problem with correctly displaying the location of a crash only occurs if the first-chance exception handlers are allowed to run. First-chance exception handlers give a program a chance to take some action when it crashes (such as saving a minidump, or translating raw exception numbers into something more readable).

Programmatically saving minidumps is unnecessary and inadvisable when you are running under the debugger, so that’s no loss. Translating raw exception numbers is valuable when debugging – I demonstrated it a few posts ago – but it’s not valuable enough to justify the complexity caused by not knowing where you crashed. Other uses of first-chance exceptions – such as ‘fixing’ bugs so that you can continue executing – are morally bankrupt and will not be acknowledged further here.

Clearly what we want to do is to stop any exception handlers from running when our program crashes. We want the debugger to halt when an exception is thrown, instead of after it has complicated things by letting exception handlers run. This is actually the default behavior in windbg but in Visual Studio we have to change a setting. Go to the Debug menu, select Exceptions, and check the box beside Win32 Exceptions.

In an ideal world this would be a global setting and we would be done with the problem, but alas this is a per-solution setting, so you may have to click this check box many times. It’s a minor nuisance, and well worth it for the benefit of actually being able to debug your crashes.

Another workaround with a different set of tradeoffs was suggested by Michaln, author of the os2museum blog. He points out that you can disable AVX support and therefore avoid the problem. The obvious disadvantage is that you lose AVX support, which will eventually become unacceptable. The command below and a reboot will turn off AVX support.

bcdedit /set xsavedisable 1
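To check whether a given machine is in the affected configuration, or whether the bcdedit change took effect after the reboot, one option is to ask the CPU whether the OS has enabled AVX state saving. The sketch below uses the usual CPUID/XGETBV detection sequence. The names OsHasEnabledAvx, Cpuid, and ReadXcr0 are made up for this illustration, and the expectation that it returns false once xsavedisable is set is an assumption rather than something verified here.

```cpp
#include <cstdint>

// Thin wrappers so the same check builds with MSVC and with GCC/Clang.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#include <immintrin.h>
#define HAVE_X86_CPUID 1
static void Cpuid(int leaf, int regs[4]) { __cpuid(regs, leaf); }
static std::uint64_t ReadXcr0() { return _xgetbv(0); }
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
static void Cpuid(int leaf, int regs[4])
{
    unsigned a = 0, b = 0, c = 0, d = 0;
    __get_cpuid(static_cast<unsigned>(leaf), &a, &b, &c, &d);
    regs[0] = static_cast<int>(a);
    regs[1] = static_cast<int>(b);
    regs[2] = static_cast<int>(c);
    regs[3] = static_cast<int>(d);
}
static std::uint64_t ReadXcr0()
{
    unsigned lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#else
#define HAVE_X86_CPUID 0
#endif

// Returns true if the CPU supports AVX and the OS has enabled saving
// and restoring the AVX register state. Returns false on non-x86 builds.
bool OsHasEnabledAvx()
{
#if HAVE_X86_CPUID
    int regs[4] = { 0, 0, 0, 0 };
    Cpuid(1, regs);
    const bool osxsave = (regs[2] & (1 << 27)) != 0; // OS uses XSAVE
    const bool cpuAvx  = (regs[2] & (1 << 28)) != 0; // CPU has AVX
    if (!osxsave || !cpuAvx)
        return false; // XGETBV is only safe to execute when OSXSAVE is set
    // XCR0 bits 1 and 2: the OS saves the SSE and AVX register state.
    return (ReadXcr0() & 0x6) == 0x6;
#else
    return false;
#endif
}
```

If this returns true on a 64-bit Windows 7 SP1 machine then a 32-bit process being debugged there is a candidate for the corrupted call stacks described above.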

I think that there are two changes which Microsoft should make. One is that Visual Studio should default to halting immediately when Win32 exceptions are thrown – that is a safer policy in general, and would have avoided most of the impact of this bug.

The other change that Microsoft should make is to actually fix WOW64.

I have reported this bug to Microsoft through informal channels, but I’ve heard no reply so far.

Failure to stop at all

An equally disturbing problem was introduced some years ago with 64-bit Windows and it causes some crashes to be silently ignored.

Structured exception handling is the Windows system that underpins all exception handling (C++ exceptions are implemented using structured exception handling under the hood). Its full implementation relies on being able to unwind the stack (with or without calling destructors) in order to transfer execution from where an exception occurs to a catch/__except block.

The introduction of 64-bit Windows complicated this. On 64-bit Windows it is impossible to unwind the stack across the kernel boundary. That is, if your process calls into the kernel, and then the kernel calls back into your process, and an exception is thrown in the callback that is supposed to be handled on the other side of the kernel boundary, then Windows cannot handle this.

This may seem a bit esoteric and unlikely – writing kernel callbacks seems like a rare activity – but it’s actually quite common. In particular, a WindowProc is a callback, and it is often called by the kernel, as shown below:

[Image: diagram of user code calling into the kernel, which calls back into user code]

If your code crashes in the user code on the right – called from the kernel – then Windows has a problem. Since Windows can’t invoke your exception handlers in the box on the left, and it doesn’t know what they would do, it has to make an executive decision about this exception. It can either crash the process, or it can silently ignore the exception, unwind the stack back to the kernel boundary, and then continue executing as if nothing happened.

Crashing the process may significantly inconvenience users, especially if there is a bug specific to 64-bit Windows in an unsupported product. But silently swallowing the exception means that many developers may be crashing in their WndProc without realizing it, leaving their process in an indeterminate state that may be causing future pain and suffering. Microsoft tries to err on the side of maximum compatibility and stability, but sometimes this just sweeps problems under the rug.

Triggering this behavior is easy. In a Project Wizard “Win32 Project” just drop a call to the Crash() function in the paint handler. To make this demo particularly dramatic be sure to put the Visual Studio exception settings back to normal. That is, make it so that Visual Studio does not stop when an exception is thrown – only when it is unhandled. Here’s a sample of what the modified code could look like, complete with a new/delete pair that straddles the Crash() call:

case WM_PAINT:
    {
        hdc = BeginPaint(hWnd, &ps);
        char* p = new char[1000000];
        Crash();
        delete [] p;
        EndPaint(hWnd, &ps);
        break;
    }

And here’s what the output window looks like:

[Image: the debugger output window]

The more you resize the window the more frantically the debugger tries to tell you that your program is in trouble. And yet your program continues. Try running it not under the debugger and you will see that it appears to be running normally. But if you look in task manager as you resize the window you will see the memory consumption growing out of control – the delete statement is never reached.

Aside: inevitably somebody will suggest that if I used std::vector then I wouldn’t have this memory leak. And indeed, normally I would use std::vector or some other container class to manage memory – manually calling delete is for chumps. However, there are a couple of points to consider here:

  • The ability of std::vector to magically delete memory when an exception is thrown only works with C++ exceptions, not structured exceptions (crashes). There are ways to translate structured exceptions to C++ exceptions but this is misguided, and it either slows your program or misses some exceptions. It’s just a bad idea. Don’t do it.
  • Additionally, the whole point of this article is that when you are in a callback from the kernel the exception handling mechanism – structured and C++ exceptions – is impaired. You can’t count on it to save you.

There is a fix

The default policy on 64-bit Windows is to silently swallow the crash, but as of Windows 7 SP1 there is a choice. There is a pair of undocumented (not directly listed on MSDN, although they are mentioned) functions that can be used to configure this behavior. You can read the hairy details (written from a pre-SP1 perspective) here or you can start the process of redemption by calling this function:

void EnableCrashingOnCrashes()
{
    typedef BOOL (WINAPI *tGetPolicy)(LPDWORD lpFlags);
    typedef BOOL (WINAPI *tSetPolicy)(DWORD dwFlags);
    const DWORD EXCEPTION_SWALLOWING = 0x1;

    HMODULE kernel32 = LoadLibraryA("kernel32.dll");
    tGetPolicy pGetPolicy = (tGetPolicy)GetProcAddress(kernel32,
                "GetProcessUserModeExceptionPolicy");
    tSetPolicy pSetPolicy = (tSetPolicy)GetProcAddress(kernel32,
                "SetProcessUserModeExceptionPolicy");
    if (pGetPolicy && pSetPolicy)
    {
        DWORD dwFlags;
        if (pGetPolicy(&dwFlags))
        {
            // Turn off the filter
            pSetPolicy(dwFlags & ~EXCEPTION_SWALLOWING);
        }
    }
}

The GetProcAddress dance is necessary because many versions of Windows don’t have these functions.

Calling this function – once at process startup will do – is a great start. It will ensure that crashes you hit during testing will get noticed and, one hopes, fixed. However, there is probably another step that you will want to do. If you normally use an exception handler to record crash dumps then you will probably find that your exception handler is not recording these crashes. That’s because your exception handler is probably on the other side of the kernel boundary. You could try putting in more exception handlers, but then you’re playing callback whack-a-mole. The far simpler solution is to use SetUnhandledExceptionFilter to put in a process-wide exception handler of last resort. This will get called even when the stack-based handlers cannot be, and this should allow your crash dump reporting to catch even the gnarliest of kernel-crossing crashes. Don’t count on unhandled exception filters for too much, but in this case they are better than nothing. Note also that the unhandled exception filter is not called when you are debugging.
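A minimal sketch of installing such a handler of last resort might look like the code below. SaveCrashDump, LastResortFilter, and InstallLastResortFilter are hypothetical names, with SaveCrashDump standing in for whatever crash dump code you already have. The non-Windows branch exists only so that the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>

// Hypothetical stand-in for an existing crash dump writer.
static void SaveCrashDump(EXCEPTION_POINTERS* info)
{
    (void)info; // A real implementation might call MiniDumpWriteDump here.
}

// Called by Windows when no other handler has dealt with the exception
// and the process is not running under a debugger.
static LONG WINAPI LastResortFilter(EXCEPTION_POINTERS* info)
{
    SaveCrashDump(info);
    // Let the process terminate rather than continue in an unknown state.
    return EXCEPTION_EXECUTE_HANDLER;
}

// Installs the process-wide filter. Returns true if it was installed.
bool InstallLastResortFilter()
{
    SetUnhandledExceptionFilter(LastResortFilter);
    return true;
}
#else
// Not Windows: there is nothing to install.
bool InstallLastResortFilter()
{
    return false;
}
#endif
```

Calling InstallLastResortFilter() once at startup, next to EnableCrashingOnCrashes(), would be the natural place for it. Keep the filter itself as simple as possible, since it runs in a process that has already gone wrong.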

The same test program that has the “Crash normally” menu item also has a “Crash in callback” menu item to enable crashing in the WM_PAINT handler. As in the sample code above it leaks memory. It even keeps track of how much memory has leaked. Source and the executable are available here.

The test program also has an “Enable crashing on crashes” menu item. If you select this then the next callback crash – typically the next time that you resize the window – will be a real crash, instead of a silently ignored crash.

Note that CrashTest.exe will only exhibit the crash swallowing behavior on 64-bit Windows, and I’ve only tested it on Windows 7.

Things that annoy me

I don’t like the Program Compatibility Assistant. It tells you that something has gone wrong (in this case a crash in a kernel callback) but because it doesn’t tell you what went wrong there is no practical way for a Windows developer to do anything about it. Thus, the cycle of programs not running correctly continues.

It also doesn’t tell you what compatibility settings it has applied. I know that this information is not of interest to Windows users, but there should be some way to get specific information about what went wrong, so that developers are not left impotently scratching their heads.

Your task list

Download the test program from here to see bad exception handling (if you have an AVX capable machine running 64-bit Windows 7 SP1), and exception swallowing (all 64-bit versions of Windows).

You should enable breaking when an exception is thrown for all Win32 Exceptions, using Visual Studio’s Debug menu, Exceptions dialog. Don’t forget to do this for every solution. It is unfortunate that the default behavior for Visual Studio has been incorrect for a couple of decades now. Maybe VS 2015 will correct this. If you enable breaking when Win32 exceptions are thrown then even kernel-callback crashes will break into the debugger, when you are running under a debugger.

Vote on the connect issue to request that Visual Studio change the defaults so that the debugger halts on first-chance Win32 exceptions.

You should consider disabling AVX to avoid the WOW64 debugging bug: “bcdedit /set xsavedisable 1”

Call EnableCrashingOnCrashes() to ensure that crashes in callbacks are not ignored. Don’t use registry editing or other options to control this behavior.

If you have a crash dump saving system then use SetUnhandledExceptionFilter to ensure that it is called if your code crashes in a callback.

Test on 64-bit Windows.

Watch for more discussion on the compatibility assistant and exception handling in some future post.

Update

The silent swallowing of exceptions documented above is for 32-bit processes on 64-bit Windows. The behavior for 64-bit processes is a bit different. On Windows 7 if a 64-bit process crashes in a kernel callback then it will actually crash. However, if the executable doesn’t have a Windows 7 compatibility manifest (subject of a later post) then the Program Compatibility Assistant will apply a shim that will suppress future crashes. Confused yet?

The summary remains the same, with the possible addition that if you are testing on Windows 7 then you should add a compatibility manifest to say that you are doing so.

Also, I hear rumors that the stack corruption caused by storing AVX state is known by Microsoft and will be fixed. Whether the fix will be for Windows 8 only is not clear at this point.