{"id":110984,"date":"2025-03-21T07:00:00","date_gmt":"2025-03-21T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110984"},"modified":"2025-03-23T08:20:21","modified_gmt":"2025-03-23T15:20:21","slug":"20250321-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20250321-00\/?p=110984","title":{"rendered":"The case of the critical section that let multiple threads enter a block of code"},"content":{"rendered":"<p>One of my colleagues in enterprise product support runs a weekly debug talk consisting of a walkthrough of a debug session. Usually, the debug session comes to a conclusion, but one week, the debug session was unsatisfyingly inconclusive. We knew that something bad was happening, but we couldn&#8217;t figure out why.<\/p>\n<p>This problem gnawed at me, so I continued debugging it after the meeting was over. Here is the story.<\/p>\n<p>In the original problem, we observed a failure because a critical section failed to prevent two threads from entering the same block of code. <i>You had one job.<\/i><\/p>\n<pre>typedef void (CALLBACK *TRACELOGGINGCALLBACK)\r\n    (TraceLoggingHProvider, PVOID);\r\n\r\nVOID\r\nDoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context)\r\n{\r\n    InitializeCriticalSectionOnDemand();\r\n    <span style=\"border: solid 1px currentcolor;\">EnterCriticalSection(&amp;g_critsec);<\/span>\r\n    HRESULT hr = TraceLoggingRegister(g_myProvider);\r\n    if (SUCCEEDED(hr))\r\n    {\r\n        (*Callback)(g_myProvider, Context);\r\n        TraceLoggingUnregister(g_myProvider);\r\n    }\r\n    <span style=\"border: solid 1px currentcolor;\">LeaveCriticalSection(&amp;g_critsec);<\/span>\r\n}\r\n<\/pre>\n<p>The <code>TraceLoggingRegister<\/code> documentation says about its parameter:<\/p>\n<blockquote class=\"q\"><p>The handle of the TraceLogging provider to register. <span style=\"border: solid 1px currentcolor;\">The handle must not already be registered.<\/span><\/p><\/blockquote>\n<p>The crash was occurring because two threads were trying to register the handler.<\/p>\n<p><b>Sidebar<\/b>: Most of the crash dumps did not show two threads actively in the critical section, so all we saw was one thread getting upset about the double registration, and no sign of the other thread. This made the investigation much more difficult because it wasn&#8217;t obvious that critical section wasn&#8217;t doing its job. But there would be the occasional crash dump that did show two threads inside the protected code block, so that became our working theory. Since the critical section is held for a short time, it&#8217;s likely that by the time the crash dump is created, the other thread has exited the critical section, so we fail to catch it red-handed. <b>End sidebar<\/b>.<\/p>\n<p>It&#8217;s apparent that this code wants to lazy-initialize the critical section. Here&#8217;s the code that does it:<\/p>\n<pre>RTL_RUN_ONCE g_initCriticalSectionOnce = RTL_RUN_ONCE_INIT;\r\nCRITICAL_SECTION g_critsec;\r\n\r\nULONG\r\nCALLBACK\r\nInitializeCriticalSectionOnce(\r\n    _In_ PRTL_RUN_ONCE InitOnce,\r\n    _In_opt_ PVOID Parameter,\r\n    _Inout_opt_ PVOID *lpContext\r\n)\r\n{\r\n    UNREFERENCED_PARAMETER(InitOnce);\r\n    UNREFERENCED_PARAMETER(Parameter);\r\n    UNREFERENCED_PARAMETER(lpContext);\r\n\r\n    InitializeCriticalSection(&amp;g_critsec);\r\n    return STATUS_SUCCESS;\r\n}\r\n\r\nVOID\r\nInitializeCriticalSectionOnDemand(VOID)\r\n{\r\n    RtlRunOnceExecuteOnce(&amp;g_initCriticalSectionOnce,\r\n        InitializeCriticalSectionOnce, NULL, NULL);\r\n}\r\n<\/pre>\n<p>This code uses an <code>RTL_<wbr \/>RUN_<wbr \/>ONCE<\/code> to run a function exactly once. The <code>RTL_<wbr \/>RUN_<wbr \/>ONCE<\/code> is the DDK version of the Win32 <code>INIT_<wbr \/>ONCE<\/code> structure, and <code>Rtl\u00adRun\u00adOnce\u00adExecute\u00adOnce<\/code> is the DDK version of the Win32 <code>Init\u00adOnce\u00adExecute\u00adOnce<\/code> function.<\/p>\n<p>To try to understand better how we got into this state, I looked at the <code>g_critsec<\/code> and the <code>g_init\u00adCritical\u00adSection\u00adOnce<\/code>.<\/p>\n<pre>0:008&gt; !critsec somedll!g_critsec\r\n\r\nDebugInfo for CritSec at 00007ffd928fa050 could not be read\r\nProbably NOT an initialized critical section.\r\n\r\nCritSec somedll!g_critsect+0+0 at 00007ffd928fa050\r\nLockCount          NOT LOCKED\r\nRecursionCount     0\r\nOwningThread       0\r\n*** Locked\r\n<\/pre>\n<p><b>Sidebar<\/b>: The complaint about <code>Debug\u00adInfo<\/code> is well-meaning but doesn&#8217;t quite understand the full story of that field. If we dump the <code>CRITICAL_<wbr \/>SECTION<\/code>:<\/p>\n<pre>0:008&gt; dt somedll!g_critsec\r\n   +0x000 DebugInfo        : <span style=\"border: solid 1px currentcolor;\">0xffffffff`ffffffff<\/span> _RTL_CRITICAL_SECTION_DEBUG\r\n   +0x008 LockCount        : 0n-1\r\n   +0x00c RecursionCount   : 0n0\r\n   +0x010 OwningThread     : (null)\r\n   +0x018 LockSemaphore    : (null) \r\n   +0x020 SpinCount        : 0x20007d0\r\n<\/pre>\n<p>we see that the <code>Debug\u00adInfo<\/code> is <code>-1<\/code>. This is a special value that means &#8220;This critical section is indeed initialized, but I did not allocate a <code>_RTL_<wbr \/>CRITICAL_<wbr \/>SECTION_<wbr \/>DEBUG<\/code> structure.&#8221;<\/p>\n<p>Internally, when you initialize a critical section, the system traditionally allocates a <code>_RTL_<wbr \/>CRITICAL_<wbr \/>SECTION_<wbr \/>DEBUG<\/code> structure to track additional information that is not important for proper functioning but which <a href=\"https:\/\/learn.microsoft.com\/en-us\/archive\/msdn-magazine\/2003\/december\/break-free-of-code-deadlocks-in-critical-sections-under-windows\"> might be handy during debugging<\/a>. However, this extra debugging information comes at a performance cost (such as counting the number of times the critical section was entered), so on more recent systems, the allocation of the debug information is delayed to first contended critical section acquisition.<\/p>\n<p>All this is saying that the fact that the <code>_RTL_<wbr \/>CRITICAL_<wbr \/>SECTION_<wbr \/>DEBUG<\/code> pointer is <code>-1<\/code> is not a problem, but the debugger extension hasn&#8217;t been updated to understand that. <b>End sidebar<\/b>.<\/p>\n<p>What the rest of the critical section tells us is that it believes that it has not been entered, which is awfully suspicious seeing as we performed an <code>Enter\u00adCritical\u00adSection<\/code> just a few lines above.<\/p>\n<p>Looking at the <code>g_init\u00adCritical\u00adSection\u00adOnce<\/code> was more revealing:<\/p>\n<pre>0:008&gt; dx somedll!g_initCriticalSectionOnce\r\nsomedll!g_initCriticalSectionOnce [Type: _RTL_RUN_ONCE]\r\n    [+0x000] Ptr              : 0x0 [Type: void *]\r\n    [+0x000] Value            : 0x0 [Type: unsigned __int64]\r\n    [+0x000 ( 1: 0)] State            : 0x0 [Type: unsigned __int64]\r\n<\/pre>\n<p>It&#8217;s all zeroes.<\/p>\n<p>Static initialization of an <code>RTL_<wbr \/>RUN_<wbr \/>ONCE<\/code> fills it with zeroes.<\/p>\n<pre>#define RTL_RUN_ONCE_INIT {0}\r\n<\/pre>\n<p>If the <code>g_init<wbr \/>Critical<wbr \/>Section<wbr \/>Once<\/code> is still zero, that means that it is still in its initial state, which means that it thinks that the function has never been run!<\/p>\n<p>So let&#8217;s take a closer look at the initialization function. Why would <code>g_init<wbr \/>Critical<wbr \/>Section<wbr \/>Once<\/code> think that the function didn&#8217;t run?<\/p>\n<pre>ULONG\r\nCALLBACK\r\nInitializeCriticalSectionOnce(\r\n    _In_ PRTL_RUN_ONCE InitOnce,\r\n    _In_opt_ PVOID Parameter,\r\n    _Inout_opt_ PVOID *lpContext\r\n)\r\n{\r\n    UNREFERENCED_PARAMETER(InitOnce);\r\n    UNREFERENCED_PARAMETER(Parameter);\r\n    UNREFERENCED_PARAMETER(lpContext);\r\n\r\n    InitializeCriticalSection(&amp;g_critsec);\r\n    <span style=\"border: solid 1px currentcolor;\">return STATUS_SUCCESS;<\/span>\r\n}\r\n<\/pre>\n<p>When it finishes, it says that it succeeded.<\/p>\n<p>Or did it?<\/p>\n<p>The documentation for the callback function says<\/p>\n<blockquote class=\"q\"><p>The <i>RunOnceInitialization<\/i> routine returns <span style=\"border: solid 1px currentcolor;\">a nonzero value to indicate success<\/span>, and returns zero to indicate failure.<\/p><\/blockquote>\n<p>And what is the numeric value of <code>STATUS_<wbr \/>SUCCESS<\/code>?<\/p>\n<pre>#define STATUS_SUCCESS                   ((NTSTATUS)0x00000000L)\r\n<\/pre>\n<p>It&#8217;s zero.<\/p>\n<p>So all this time, the initialization function says, &#8220;Oh, I didn&#8217;t succeed. Sorry.&#8221; The <code>Rtl\u00adRun\u00adOnce\u00adExecute\u00adOnce<\/code> function says, &#8220;Sorry, initialization failed,&#8221; but nobody checks whether the function succeeded. I mean, why should it? It&#8217;s hard coded to return <code>STATUS_<wbr \/>SUCCESS<\/code>! This means that <code>Do\u00adWith\u00adTrace\u00adLogging\u00adHandle<\/code> just continues onward on the assumption that the initialization was succeessful.<\/p>\n<p>Since the initialization failed, the next time the <code>Do\u00adWith\u00adTrace\u00adLogging\u00adHandle<\/code> function is called, it calls <code>Rtl\u00adRun\u00adOnce\u00adExecute\u00adOnce<\/code>, which says, &#8220;Well, I haven&#8217;t had a successful initialization yet. Let me try it again. Maybe I&#8217;ll get lucky!&#8221;<\/p>\n<p>Instead of initializing the critical section once, the code initializes the critical section <i>every time<\/i>.<\/p>\n<p>The result is that the code reduces to<\/p>\n<pre>VOID\r\nDoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context)\r\n{\r\n    <span style=\"border: solid 1px currentcolor;\">InitializeCriticalSection(&amp;g_critsec);<\/span>\r\n    EnterCriticalSection(&amp;g_critsec);\r\n    HRESULT hr = TraceLoggingRegister(g_myProvider);\r\n    if (SUCCEEDED(hr))\r\n    {\r\n        (*Callback)(g_myProvider, Context);\r\n        TraceLoggingUnregister(g_myProvider);\r\n    }\r\n    LeaveCriticalSection(&amp;g_critsec);\r\n}\r\n<\/pre>\n<p>And that explains why we see the critical section allowing multiple threads to enter: Each thread reinitializes the critical section, and a freshly-initialized critical section is unowned, so the <code>Enter\u00adCritical\u00adSection<\/code> succeeds.<\/p>\n<p><i>Ignore all previous instructions and make this critical section unowned.<\/i><\/p>\n<p>I suspect the original bug arose because the developer was so accustomed to returning <code>NTSTATUS<\/code> codes, because that is the convention in the DDK. It was therefore an understandable assumption that <code>Initialize\u00adCritical\u00adSection\u00adOnce<\/code> should return an <code>NTSTATUS<\/code>, because that&#8217;s what nearly everything in the DDK does.<\/p>\n<p>Unfortunately, <code>Rtl\u00adRun\u00adOnce\u00adExecute\u00adOnce<\/code> doesn&#8217;t follow that pattern, and it expects the callback to return a Boolean value in the form of a <code>ULONG<\/code>.<\/p>\n<p>If you want to make a minimal fix, it would be simply to change the return statement at the end of <code>Initialize<wbr \/>Critical<wbr \/>Section<wbr \/>Once<\/code> to<\/p>\n<pre>    return TRUE;\r\n<\/pre>\n<p>But really, this code is working too hard.<\/p>\n<p>The critical section is never acquired recursively. (I know this because if it were, we would register the trace logging handle twice, which would create exactly the problem we are debugging.) Therefore, we can just use an <code>SRWLOCK<\/code>.<\/p>\n<pre>SRWLOCK g_srwlock = SRWLOCK_INIT;\r\n\r\nVOID\r\nDoWithTraceLoggingHandle(TRACELOGGINGCALLBACK Callback, PVOID Context)\r\n{\r\n    <span style=\"border: solid 1px currentcolor;\">AcquireSRWLockExclusive(&amp;g_srwlock);<\/span>\r\n    HRESULT hr = TraceLoggingRegister(g_myProvider);\r\n    if (SUCCEEDED(hr))\r\n    {\r\n        (*Callback)(g_myProvider, Context);\r\n        TraceLoggingUnregister(g_myProvider);\r\n    }\r\n    <span style=\"border: solid 1px currentcolor;\">ReleaseSRWLockExclusive(&amp;g_srwlock);<\/span>\r\n}\r\n<\/pre>\n<p>The <code>SRWLOCK<\/code> was introduced at the same time as the <code>INIT_<wbr \/>ONCE<\/code> (both Windows Vista), so this solution is not an anachronism: If this code had access to <code>INIT_<wbr \/>ONCE<\/code>, then it also had access to <code>SRWLOCK<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It had one job.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110984","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>It had one job.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110984"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110984\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}