One of my colleagues in enterprise product support runs a weekly debug talk consisting of a walkthrough of a debug session. Usually, the debug session comes to a conclusion, but one week, the debug session was unsatisfyingly inconclusive. We knew that something bad was happening, but we couldn’t figure out why.
This problem gnawed at me, so I continued debugging it after the meeting was over. Here is the story.
In the original problem, we observed a failure because a critical section failed to prevent two threads from entering the same block of code. You had one job.
typedef void (CALLBACK *TRACELOGGINGCALLBACK) (TraceLoggingHProvider, PVOID); VOID DoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context) { InitializeCriticalSectionOnDemand(); EnterCriticalSection(&g_critsec); HRESULT hr = TraceLoggingRegister(g_myProvider); if (SUCCEEDED(hr)) { (*Callback)(g_myProvider, Context); TraceLoggingUnregister(g_myProvider); } LeaveCriticalSection(&g_critsec); }
The TraceLoggingRegister
documentation says about its parameter:
The handle of the TraceLogging provider to register. The handle must not already be registered.
The crash was occurring because two threads were trying to register the handler.
Sidebar: Most of the crash dumps did not show two threads actively in the critical section, so all we saw was one thread getting upset about the double registration, and no sign of the other thread. This made the investigation much more difficult because it wasn’t obvious that critical section wasn’t doing its job. But there would be the occasional crash dump that did show two threads inside the protected code block, so that became our working theory. Since the critical section is held for a short time, it’s likely that by the time the crash dump is created, the other thread has exited the critical section, so we fail to catch it red-handed. End sidebar.
It’s apparent that this code wants to lazy-initialize the critical section. Here’s the code that does it:
RTL_RUN_ONCE g_initCriticalSectionOnce = RTL_RUN_ONCE_INIT; CRITICAL_SECTION g_critsec; ULONG CALLBACK InitializeCriticalSectionOnce( _In_ PRTL_RUN_ONCE InitOnce, _In_opt_ PVOID Parameter, _Inout_opt_ PVOID *lpContext ) { UNREFERENCED_PARAMETER(InitOnce); UNREFERENCED_PARAMETER(Parameter); UNREFERENCED_PARAMETER(lpContext); InitializeCriticalSection(&g_critsec); return STATUS_SUCCESS; } VOID InitializeCriticalSectionOnDemand(VOID) { RtlRunOnceExecuteOnce(&g_initCriticalSectionOnce, InitializeCriticalSectionOnce, NULL, NULL); }
This code uses an RTL_
to run a function exactly once. The RTL_
is the DDK version of the Win32 INIT_
structure, and RtlÂRunÂOnceÂExecuteÂOnce
is the DDK version of the Win32 InitÂOnceÂExecuteÂOnce
function.
To try to understand better how we got into this state, I looked at the g_critsec
and the g_initÂCriticalÂSectionÂOnce
.
0:008> !critsec somedll!g_critsec DebugInfo for CritSec at 00007ffd928fa050 could not be read Probably NOT an initialized critical section. CritSec somedll!g_critsect+0+0 at 00007ffd928fa050 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 *** Locked
Sidebar: The complaint about DebugÂInfo
is well-meaning but doesn’t quite understand the full story of that field. If we dump the CRITICAL_
:
0:008> dt somedll!g_critsec
+0x000 DebugInfo : 0xffffffff`ffffffff _RTL_CRITICAL_SECTION_DEBUG
+0x008 LockCount : 0n-1
+0x00c RecursionCount : 0n0
+0x010 OwningThread : (null)
+0x018 LockSemaphore : (null)
+0x020 SpinCount : 0x20007d0
we see that the DebugÂInfo
is -1
. This is a special value that means “This critical section is indeed initialized, but I did not allocate a _RTL_
structure.”
Internally, when you initialize a critical section, the system traditionally allocates a _RTL_
structure to track additional information that is not important for proper functioning but which might be handy during debugging. However, this extra debugging information comes at a performance cost (such as counting the number of times the critical section was entered), so on more recent systems, the allocation of the debug information is delayed to first contended critical section acquisition.
All this is saying that the fact that the _RTL_
pointer is -1
is not a problem, but the debugger extension hasn’t been updated to understand that. End sidebar.
What the rest of the critical section tells us is that it believes that it has not been entered, which is awfully suspicious seeing as we performed an EnterÂCriticalÂSection
just a few lines above.
Looking at the g_initÂCriticalÂSectionÂOnce
was more revealing:
0:008> dx somedll!g_initCriticalSectionOnce somedll!g_initCriticalSectionOnce [Type: _RTL_RUN_ONCE] [+0x000] Ptr : 0x0 [Type: void *] [+0x000] Value : 0x0 [Type: unsigned __int64] [+0x000 ( 1: 0)] State : 0x0 [Type: unsigned __int64]
It’s all zeroes.
Static initialization of an RTL_
fills it with zeroes.
#define RTL_RUN_ONCE_INIT {0}
If the g_init
is still zero, that means that it is still in its initial state, which means that it thinks that the function has never been run!
So let’s take a closer look at the initialization function. Why would g_init
think that the function didn’t run?
ULONG
CALLBACK
InitializeCriticalSectionOnce(
_In_ PRTL_RUN_ONCE InitOnce,
_In_opt_ PVOID Parameter,
_Inout_opt_ PVOID *lpContext
)
{
UNREFERENCED_PARAMETER(InitOnce);
UNREFERENCED_PARAMETER(Parameter);
UNREFERENCED_PARAMETER(lpContext);
InitializeCriticalSection(&g_critsec);
return STATUS_SUCCESS;
}
When it finishes, it says that it succeeded.
Or did it?
The documentation for the callback function says
The RunOnceInitialization routine returns a nonzero value to indicate success, and returns zero to indicate failure.
And what is the numeric value of STATUS_
?
#define STATUS_SUCCESS ((NTSTATUS)0x00000000L)
It’s zero.
So all this time, the initialization function says, “Oh, I didn’t succeed. Sorry.” The RtlÂRunÂOnceÂExecuteÂOnce
function says, “Sorry, initialization failed,” but nobody checks whether the function succeeded. I mean, why should it? It’s hard coded to return STATUS_
! This means that DoÂWithÂTraceÂLoggingÂHandle
just continues onward on the assumption that the initialization was succeessful.
Since the initialization failed, the next time the DoÂWithÂTraceÂLoggingÂHandle
function is called, it calls RtlÂRunÂOnceÂExecuteÂOnce
, which says, “Well, I haven’t had a successful initialization yet. Let me try it again. Maybe I’ll get lucky!”
Instead of initializing the critical section once, the code initializes the critical section every time.
The result is that the code reduces to
VOID
DoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context)
{
InitializeCriticalSection(&g_critsec);
EnterCriticalSection(&g_critsec);
HRESULT hr = TraceLoggingRegister(g_myProvider);
if (SUCCEEDED(hr))
{
(*Callback)(g_myProvider, Context);
TraceLoggingUnregister(g_myProvider);
}
LeaveCriticalSection(&g_critsec);
}
And that explains why we see the critical section allowing multiple threads to enter: Each thread reinitializes the critical section, and a freshly-initialized critical section is unowned, so the EnterÂCriticalÂSection
succeeds.
Ignore all previous instructions and make this critical section unowned.
I suspect the original bug arose because the developer was so accustomed to returning NTSTATUS
codes, because that is the convention in the DDK. It was therefore an understandable assumption that InitializeÂCriticalÂSectionÂOnce
should return an NTSTATUS
, because that’s what nearly everything in the DDK does.
Unfortunately, RtlÂRunÂOnceÂExecuteÂOnce
doesn’t follow that pattern, and it expects the callback to return a Boolean value in the form of a ULONG
.
If you want to make a minimal fix, it would be simply to change the return statement at the end of Initialize
to
return TRUE;
But really, this code is working too hard.
The critical section is never acquired recursively. (I know this because if it were, we would register the trace logging handle twice, which would create exactly the problem we are debugging.) Therefore, we can just use an SRWLOCK
.
SRWLOCK g_srwlock = SRWLOCK_INIT; VOID DoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context) { AcquireSRWLockExclusive(&g_srwlock); HRESULT hr = TraceLoggingRegister(g_myProvider); if (SUCCEEDED(hr)) { (*Callback)(g_myProvider, Context); TraceLoggingUnregister(g_myProvider); } ReleaseSRWLockExclusive(&g_srwlock); }
The SRWLOCK
was introduced at the same time as the INIT_
(both Windows Vista), so this solution is not an anachronism: If this code had access to INIT_
, then it also had access to SRWLOCK
.
Shouldn't this be:
<code>
I know that error handling is usually left out as an exercise for the readers, but if you are already giving an example on what to use as a replacement for a critical section, I believe it would be prudent to add it or we are going to be seeing it as a verbatim suggestion in Copilot*.
@LB I don't see how an enum would prevent this kind of error. You can still return a wrong value even if said value is strongly typed.
* - replace with a code hallucinating LLM of your choice
A strongly typed enum prevents you from returning `STATUS_SUCCESS` instead of the enum value for success, because it would not be implicitly convertible. It’s still on you to pick the correct enum value, but that’s much less error prone than picking a constant from the wrong implicit enumeration pool like this code did.
The direct cause of error is that they didn't RTFM but instead assumed what they need to return -- one of many Win32 success codes (STATUS_SUCCESS, ERROR_SUCCESS, S_OK, ...).
This is literally a case where you have two valid return values and using any enum for that let alone strongly typed one is a total overkill which adds pointless maintenance and cognitive overhead for a case where BOOL would work just fine and make it more obvious what to return just from declaration even without reading the documentation.
I am all for strongly typed enums, but in places where they make sense.
The callbacks are provided by the same component, and if it crashes, we want the process to crash rather than pretend everything is fine.
Apart from that, I have an issue wth the RunOnce code – there is no cleanup. What if this code gets later moved to a .dll, and the .dll gets unloaded – you’ll unload data section that has the critical section still active.
All in all, people should use more AppVerif to catch this.
Do we need to init the SRWLOCK? or is zero-init fine
Zero init is fine. SRWLOCK_INIT is defined as zero init.
In winnt.h, we have:
In synchapi.h, we have:
This is explicitly showing that the lock is being properly initialised in code though.
IMHO, we should add SRWLOCK_INIT here
Agreed. Retroactively added.
Another victory for strongly typed enums. I shiver at the sight of raw primitive return types with assumed limited meanings.
Even better would be to have the return type be BOOL, to clearly indicate a boolean return value is expected. Microsoft dropped the ball on this API by not using the correct return type.