The case of the orphaned critical section despite being managed by an RAII type

Some time ago, I was enlisted to help debug an elusive deadlock. Studying a sampling of process memory dumps led to the conclusion that a critical section had been orphaned. Sometimes, the thread that owned the critical section had already exited, but sometimes the thread was still running, but it was running code completely unrelated to the critical section. It was as if the code that acquired the critical section had simply forgotten to release it before returning.

The thing is, all attempts to acquire the critical section were managed by an RAII type, so there should be no way that the critical section could have been forgotten. And yet it was.

When would the destructor for an RAII object by bypassed? One possibility is that somebody did an ExitThread or (horrors) TerminateThread. But this doesn’t match the evidence, because as noted above, in some of the crash dumps, the critical section owner is still alive and running, but unaware that it owns the critical section.

On all platforms other than x86, exception unwind information is kept in tables in a rarely-used portion of the image, so that we don’t waste memory on exception unwind information until an exception actually occurs: When an exception occurs, the system pages in the unwind tables and does a table lookup to see which unwind handler should run. But on x86, the exception unwind state is maintained manually in the code. This is a bad thing for x86 performance, but a good thing for getting inside the head of the compiler.

Bonus reading: Unwinding the Stack: Exploring How C++ Exceptions Work on Windows.
— James McNellis, CppCon 2018

The unwind checkpoint is a 32-bit value, usually stored at [ebp-4]. The compiler uses it to keep track of what needs to get unwound if an exception occurs. If the compiler can deduce that no exception can occur between two checkpoints, then it can optimize out the first checkpoint.

There are four functions that enter the critical section in question. The code that does so looks like this:

{
  auto guard = SystemChangeListenerCS.Lock();
  ... some code ...
} // guard destructor releases the lock

Finding the exact point where the guards are created is made easier with the assistance of the # debugger command, which means “Disassemble until you see this string in the disassembly.”

0:000> #SystemChangeListenerCS SystemChangeListenerThreadProc
SystemChangeListenerThreadProc+0x7c:
1003319c mov     ecx,offset SystemChangeListenerCS (100b861c)
0:000>

Okay, so the debugger found a line of assembly that mentions SystemChangeListenerCS. Let’s look to see whether there is an unwind checkpoint after the lock is taken.

0:000> u 1003319c
ChangeMonitorThreadProc+0x7c:
1003319c mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)
100331a1 push    eax
100331a2 call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)
100331a7 mov     byte ptr [ebp-4],5

We see that immediately after acquiring the lock, the code updates [ebp-4] to remember that it needs to destruct the lock guard in case an exception occurs.

Exercise: I said that the unwind state is recorded in a 32-bit value stored at [ebp-4], but the code here updates only a byte. Why only a byte?

The lock is acquired again later in that same function, so we’ll search some more. If you leave off the second parameter to the # command, it continues searching where the previous search left off.

0:000> #SystemChangeListenerCS
SystemChangeListenerThreadProc+0x487:
100335a7 mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)
0:000> u 100335a7
contoso!SystemChangeListenerThreadProc+0x487:
100335a7 mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)
100335ac push    eax
100335ad call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)
100335b2 mov     byte ptr [ebp-4],0Dh

Okay, so this lock guard is also marked for unwinding.

The next function that uses the critical section is ResetWidgets.

0:000> #SystemChangeListenerCS ResetWidgets
ResetWidgets+0x133:
10033fcc mov     ecx,offset SystemChangeListenerCS (100b861c)
0:000> u 10033fcc l4
ResetWidgets+0x133:
10033fcc mov     ecx,offset SystemChangeListenerCS (100b861c)
10033fd1 push    eax
10033fd2 call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)
10033fd7 call    Microsoft::WRL::ComPtr<IStream>::Reset (10039932)
10033fdc call    Microsoft::WRL::ComPtr<Widget>::Reset (10039142)
10033fe1 cmp     dword ptr [ebp-4Ch],0
10033fe5 je      ResetWidgets+0x157 (10033ff0)
10033fe7 push    dword ptr [ebp-4Ch]

Hm, this function doesn’t create an unwind checkpoint after taking the lock. This means that the compiler believes that no exception can occur between the point the guard is created and the next thing that would require updating the unwind checkpoint (in our case, that would be the point the lock is destructed).

We repeat this analysis with the other two functions. One of them creates an unwind checkpoint; the other doesn’t.

Why does the compiler believe that no exceptions can occur in the guarded block? Well, inside the block it calls ComPtr::Reset twice, and it does some other stuff. The Reset method is declared like this:

template<typename T>
class ComPtr {
unsigned long Reset() { return InternalRelease(); }
unsigned long InternalRelease() throw() { ... }
...
};

Observe that the InternalRelease method uses the deprecated throw() specifier, which says that the method never throws an exception. The compiler then inferred that the Reset method also never throws an exception, since it does nothing that could result in an exception.

This code was compiled before the Microsoft C++ compiler added the /std:C++17 switch, so it uses the old rules for the throw() specifier, which for the Microsoft C++ compiler boils down to “I’m trusting you never to throw an exception.”

My theory is that the Reset actually did throw an exception. Since the compiler didn’t create an unwind checkpoint, the lock guard did not get unwound. The exception was caught higher up the stack, so the process didn’t crash.

Digging into the two objects wrapped inside the ComPtr revealed that the first one was a WidgetMonitor object.

Exercise: The first was really an IWidgetMonitor interface, so why did it disassemble as ComPtr<IStream>?

The WidgetMonitor‘s destructor went like this:

WidgetMonitor::~WidgetMonitor()
{
 Uninitialize();
}

void WidgetMonitor::Uninitialize()
{
 blah blah;
 ThrowIfFailed(m_monitor.Deactivate());
 blah blah;
 ThrowIfFailed(m_monitor.Disconnect());
 blah blah;
}

Now you see the problem. If the Uninitialize method throws an exception, the exception will propagate out of the destructor. (This code is so old that it predates C++11’s rule that destructors are noexcept by default where possible.) And then it will propagate out of ComPtr::InternalRelease, and then out of ComPtr::Reset, and then out of ResetWidgets. And unwinding out of ResetWidgets will not run the lock guard’s destructor because the compiler assumed that no exception could be thrown, thanks to the throw() specifier on the ComPtr::InternalRelease method.

As is often the case, it’s usually a lot easier to find something once you know what you’re looking for. The team dug into its telemetry to see that, yes indeed, the systems that encountered the problem had also thrown an exception from WidgetMonitor::Uninitialize, thus confirming the theory.

Now they could work on fixing the problem: Fix the destructor so it doesn’t throw any exceptions. In this specific case, the exception was thrown because they were deactivating an object that hadn’t been fully activated. Since cleanup functions cannot fail, the best you can do is to just soldier on and clean up as much as you can.

The case of the orphaned critical section despite being managed by an RAII type

Author

0 comments

Read next

How can I prevent a WebView control from opening a browser window?

Why am I getting mojibake when I try to create a shell link?