{"id":100585,"date":"2018-12-28T07:00:00","date_gmt":"2018-12-28T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=100585"},"modified":"2019-03-13T00:11:39","modified_gmt":"2019-03-13T07:11:39","slug":"20181228-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20181228-00\/?p=100585","title":{"rendered":"The case of the orphaned critical section despite being managed by an RAII type"},"content":{"rendered":"<p>Some time ago, I was enlisted to help debug an elusive deadlock. Studying a sampling of process memory dumps led to the conclusion that a critical section had been orphaned. Sometimes, the thread that owned the critical section had already exited, but sometimes the thread was still running, but it was running code completely unrelated to the critical section. It was as if the code that acquired the critical section had simply forgotten to release it before returning. <\/p>\n<p>The thing is, all attempts to acquire the critical section were managed by an RAII type, so there should be no way that the critical section could have been forgotten. And yet it was. <\/p>\n<p>When would the destructor for an RAII object by bypassed? One possibility is that somebody did an <code>Exit&shy;Thread<\/code> or (horrors) <code>Terminate&shy;Thread<\/code>. But this doesn&#8217;t match the evidence, because as noted above, in some of the crash dumps, the critical section owner is still alive and running, but unaware that it owns the critical section. <\/p>\n<p>On all platforms other than x86, exception unwind information is kept in tables in a rarely-used portion of the image, so that we don&#8217;t waste memory on exception unwind information until an exception actually occurs: When an exception occurs, the system pages in the unwind tables and does a table lookup to see which unwind handler should run. But on x86, the exception unwind state is maintained manually in the code. This is a bad thing for x86 performance, but a good thing for getting inside the head of the compiler. <\/p>\n<blockquote CLASS=\"m\"><p>Bonus reading: <a HREF=\"https:\/\/github.com\/CppCon\/CppCon2018\/blob\/master\/Presentations\/unwinding_the_stack_exploring_how_cpp_exceptions_work_on_windows\/unwinding_the_stack_exploring_how_cpp_exceptions_work_on_windows__james_mcnellis__cppcon_2018.pdf\">Unwinding the Stack: Exploring How C++ Exceptions Work on Windows<\/a>.<br>&mdash; <a HREF=\"https:\/\/cppcon2018.sched.com\/event\/FnLD\/unwinding-the-stack-exploring-how-c-exceptions-work-on-windows\">James McNellis<\/a>, CppCon 2018 <\/p><\/blockquote>\n<p>The unwind checkpoint is a 32-bit value, usually stored at <code>[ebp-4]<\/code>. The compiler uses it to keep track of what needs to get unwound if an exception occurs. If the compiler can deduce that no exception can occur between two checkpoints, then it can optimize out the first checkpoint. <\/p>\n<p>There are four functions that enter the critical section in question. The code that does so looks like this:<\/p>\n<pre>\n{\n  auto guard = SystemChangeListenerCS.Lock();\n  ... some code ...\n} \/\/ guard destructor releases the lock\n<\/pre>\n<p>Finding the exact point where the guards are created is made easier with the assistance of the <code>#<\/code> debugger command, which means &#8220;Disassemble until you see this string in the disassembly.&#8221; <\/p>\n<p><pre>\n0:000&gt; #SystemChangeListenerCS SystemChangeListenerThreadProc\nSystemChangeListenerThreadProc+0x7c:\n1003319c mov     ecx,offset SystemChangeListenerCS (100b861c)\n0:000&gt;\n<\/pre>\n<p>Okay, so the debugger found a line of assembly that mentions <code>System&shy;Change&shy;Listener&shy;CS<\/code>. Let&#8217;s look to see whether there is an unwind checkpoint after the lock is taken. <\/p>\n<pre>\n0:000&gt; u 1003319c\nChangeMonitorThreadProc+0x7c:\n1003319c mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)\n100331a1 push    eax\n100331a2 call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)\n100331a7 mov     byte ptr [ebp-4],5\n<\/pre>\n<p>We see that immediately after acquiring the lock, the code updates <code>[ebp-4]<\/code> to remember that it needs to destruct the lock guard in case an exception occurs. <\/p>\n<p><b>Exercise<\/b>: I said that the unwind state is recorded in a 32-bit value stored at <code>[ebp-4]<\/code>, but the code here updates only a byte. Why only a byte? <\/p>\n<p>The lock is acquired again later in that same function, so we&#8217;ll search some more. If you leave off the second parameter to the <code>#<\/code> command, it continues searching where the previous search left off. <\/p>\n<pre>\n0:000&gt; #SystemChangeListenerCS\nSystemChangeListenerThreadProc+0x487:\n100335a7 mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)\n0:000&gt; u 100335a7\ncontoso!SystemChangeListenerThreadProc+0x487:\n100335a7 mov     ecx,offset contoso!SystemChangeListenerCS (100b861c)\n100335ac push    eax\n100335ad call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)\n100335b2 mov     byte ptr [ebp-4],0Dh\n<\/pre>\n<p>Okay, so this lock guard is also marked for unwinding. <\/p>\n<p>The next function that uses the critical section is <code>Reset&shy;Widgets<\/code>. <\/p>\n<pre>\n0:000&gt; #SystemChangeListenerCS ResetWidgets\nResetWidgets+0x133:\n10033fcc mov     ecx,offset SystemChangeListenerCS (100b861c)\n0:000&gt; u 10033fcc l4\nResetWidgets+0x133:\n10033fcc mov     ecx,offset SystemChangeListenerCS (100b861c)\n10033fd1 push    eax\n10033fd2 call    Microsoft::WRL::Wrappers::CriticalSection::Lock (1002a863)\n10033fd7 call    Microsoft::WRL::ComPtr&lt;IStream&gt;::Reset (10039932)\n10033fdc call    Microsoft::WRL::ComPtr&lt;Widget&gt;::Reset (10039142)\n10033fe1 cmp     dword ptr [ebp-4Ch],0\n10033fe5 je      ResetWidgets+0x157 (10033ff0)\n10033fe7 push    dword ptr [ebp-4Ch]\n<\/pre>\n<p>Hm, this function doesn&#8217;t create an unwind checkpoint after taking the lock. This means that the compiler believes that no exception can occur between the point the guard is created and the next thing that would require updating the unwind checkpoint (in our case, that would be the point the lock is destructed). <\/p>\n<p>We repeat this analysis with the other two functions. One of them creates an unwind checkpoint; the other doesn&#8217;t. <\/p>\n<p>Why does the compiler believe that no exceptions can occur in the guarded block? Well, inside the block it calls <a HREF=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/windows\/comptr-class?view=vs-2017#reset\"><code>ComPtr::<\/code><code>Reset<\/code><\/a> twice, and it does some other stuff. The <code>Reset<\/code> method is declared like this: <\/p>\n<pre>\ntemplate&lt;typename T&gt;\nclass ComPtr {\nunsigned long Reset() { return InternalRelease(); }\nunsigned long InternalRelease() throw() { ... }\n...\n};\n<\/pre>\n<p>Observe that <a HREF=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/windows\/comptr-class?view=vs-2017#internalrelease\">the <code>Internal&shy;Release<\/code> method<\/a> uses the deprecated <code>throw()<\/code> specifier, which says that the method never throws an exception. The compiler then inferred that the <code>Reset<\/code> method also never throws an exception, since it does nothing that could result in an exception. <\/p>\n<p>This code was compiled before the Microsoft C++ compiler added the <code>\/std:C++17<\/code> switch, so it uses <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20180928-00\/?p=99855\">the old rules for the <code>throw()<\/code> specifier<\/a>, which for the Microsoft C++ compiler boils down to &#8220;I&#8217;m trusting you never to throw an exception.&#8221; <\/p>\n<p>My theory is that the <code>Reset<\/code> actually did throw an exception. Since the compiler didn&#8217;t create an unwind checkpoint, the lock guard did not get unwound. The exception was caught higher up the stack, so the process didn&#8217;t crash. <\/p>\n<p>Digging into the two objects wrapped inside the <code>ComPtr<\/code> revealed that the first one was a <code>Widget&shy;Monitor<\/code> object. <\/p>\n<p><b>Exercise<\/b>: The first was really an <code>IWidget&shy;Monitor<\/code> interface, so why did it disassemble as <code>ComPtr&lt;IStream&gt;<\/code>? <\/p>\n<p>The <code>Widget&shy;Monitor<\/code>&#8216;s destructor went like this: <\/p>\n<pre>\nWidgetMonitor::~WidgetMonitor()\n{\n Uninitialize();\n}\n\nvoid WidgetMonitor::Uninitialize()\n{\n blah blah;\n ThrowIfFailed(m_monitor.Deactivate());\n blah blah;\n ThrowIfFailed(m_monitor.Disconnect());\n blah blah;\n}\n<\/pre>\n<p>Now you see the problem. If the <code>Uninitialize<\/code> method throws an exception, the exception will propagate out of the destructor. (This code is so old that it predates C++11&#8217;s rule that destructors are <code>noexcept<\/code> by default where possible.) And then it will propagate out of <code>ComPtr::<\/code><code>Internal&shy;Release<\/code>, and then out of <code>ComPtr::<\/code><code>Reset<\/code>, and then out of <code>Reset&shy;Widgets<\/code>. And unwinding out of <code>Reset&shy;Widgets<\/code> will not run the lock guard&#8217;s destructor because the compiler assumed that no exception could be thrown, thanks to the <code>throw()<\/code> specifier on the <code>ComPtr::<\/code><code>Internal&shy;Release<\/code> method. <\/p>\n<p>As is often the case, it&#8217;s usually a lot easier to find something once you know what you&#8217;re looking for. The team dug into its telemetry to see that, yes indeed, the systems that encountered the problem had also thrown an exception from <code>Widget&shy;Monitor::<\/code><code>Uninitialize<\/code>, thus confirming the theory. <\/p>\n<p>Now they could work on fixing the problem: Fix the destructor so it doesn&#8217;t throw any exceptions. In this specific case, the exception was thrown because they were deactivating an object that hadn&#8217;t been fully activated. Since <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20080107-00\/?p=23913\">cleanup functions cannot fail<\/a>, the best you can do is to <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20140807-00\/?p=313\">just soldier on and clean up as much as you can<\/a>. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Rolling up the debugging sleeves.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-100585","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Rolling up the debugging sleeves.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/100585","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=100585"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/100585\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=100585"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=100585"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=100585"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}