{"id":106700,"date":"2022-06-01T07:00:00","date_gmt":"2022-06-01T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106700"},"modified":"2022-06-01T10:01:33","modified_gmt":"2022-06-01T17:01:33","slug":"20220601-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220601-00\/?p=106700","title":{"rendered":"The case of the COM reference that suddenly went bad in the middle of a coroutine"},"content":{"rendered":"<p>A customer reported a crash in a coroutine and needed help understanding it.<\/p>\n<p>Here&#8217;s the crash:<\/p>\n<pre style=\"white-space: pre-wrap;\">contoso!winrt::impl::consume_<wbr \/>Contoso_<wbr \/>IWidgetOptions&lt;<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IWidgetOptions&gt;::<wbr \/>Name+0x15:\r\n00007ff8`fa6c77e1 mov     rax,qword ptr [rcx] ds:00000000`00000000=????????????????\r\n<\/pre>\n<p>And the stack trace:<\/p>\n<pre style=\"white-space: pre-wrap;\">contoso!winrt::<wbr \/>impl::<wbr \/>consume_<wbr \/>Contoso_<wbr \/>IWidgetOptions&lt;<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IWidgetOptions&gt;::<wbr \/>Name+0x15\r\ncontoso!winrt::Contoso::<wbr \/>implementation::<wbr \/>Widget::<wbr \/>CreateAsync$_ResumeCoro$1+0xfe\r\ncontoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Widget::<wbr \/>CreateAsync$_InitCoro$2+0x95\r\ncontoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Widget::<wbr \/>CreateAsync+0x63\r\ncontoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget::<wbr \/>CreateWidgetAsync+0x7e\r\ncontoso!winrt::<wbr \/>impl::<wbr \/>produce&lt;winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget,<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IGadget&gt;::<wbr \/>CreateWidgetAsync+0x35\r\nlitware!winrt::<wbr \/>impl::<wbr \/>consume_Contoso_Gadget&lt;winrt::<wbr \/>Contoso::<wbr \/>Gadget&gt;::<wbr \/>CreateWidgetAsync+0x47\r\nlitware!winrt::<wbr \/>LitWare::<wbr \/>implementation::<wbr \/>GadgetViewer::<wbr \/>CreateWidgetsAsync$_ResumeCoro$2+0x343\r\nlitware!std::<wbr \/>experimental::<wbr \/>coroutine_handle&lt;void&gt;::<wbr \/>resume+0xc\r\nlitware!std::<wbr \/>experimental::<wbr \/>coroutine_handle&lt;void&gt;::<wbr \/>operator()+0xc\r\nlitware!winrt::<wbr \/>impl::<wbr \/>resume_background_callback+0x10\r\nntdll!TppSimplepExecuteCallback+0xa3\r\nntdll!TppWorkerThread+0x686\r\nkernel32!BaseThreadInitThunk+0x10\r\nntdll!RtlUserThreadStart+0x2b\r\n<\/pre>\n<p>This is crashing inside the C++\/WinRT projection, which you can infer because we crashed inside a <code>consume_<\/code> function. The <code>consume_<\/code> functions are provided by the C++\/WinRT library to convert your C++\/WinRT calls into low-level ABI calls. So something happened inside that projection.<\/p>\n<p>Here&#8217;s the consume function up to the point of the crash:<\/p>\n<pre style=\"white-space: pre-wrap;\">    push    rbx\r\n    sub     rsp,20h\r\n    mov     rcx,qword ptr [rcx]   ; get the raw pointer from the IWidgetOptions\r\n    and     qword ptr [rsp+30h],0 ; pre-null the output parameter\r\n    mov     rbx,rdx\r\n    mov     rax,qword ptr [rcx]   ; Load the vtable\r\n<\/pre>\n<p>The problem it seems is that the <code>this<\/code> parameter to <code>IWidgetOptions::<wbr \/>Name()<\/code> is a COM wrapper around a null pointer.<\/p>\n<p>We are called from <code>_InitCoro<\/code>, which runs the initial synchronous portion, so we are still in the synchronous portion of the coroutine. That&#8217;s nice, because it means that the caller is still on the stack:<\/p>\n<pre>IAsyncOperation&lt;Widget&gt; Widget::CreateAsync(const WidgetOptions&amp; options)\r\n{\r\n    auto name = options.Name();\r\n<\/pre>\n<p>Let&#8217;s see what we got as the <code>options<\/code>:<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; .frame 3\r\n03 000000bb`dccff630 00007ff8`fa6c7b32     contoso!Widget::CreateAsync+0x63\r\n0:018&gt; dv\r\n     options = 0x000000bb`dccff978\r\n0:018&gt; ?? options\r\nstruct winrt::Contoso::IWidgetOptions * 0x000000bb`dccff978\r\n   +0x000 m_ptr            : (null)\r\n<\/pre>\n<p>Yes indeed, the <code>options<\/code> is null. That&#8217;s why we crash trying to call a method on it.<\/p>\n<p>The caller of <code>Widget::<wbr \/>Create\u00adAsync<\/code> is <code>Gadget::<wbr \/>Create\u00adWidget\u00adAsync<\/code>:<\/p>\n<pre>IAsyncOperation&lt;Widget&gt; Gadget::CreateWidgetAsync()\r\n{\r\n    return Widget::CreateAsync(m_options);\r\n}\r\n<\/pre>\n<p>The customer suspected that this function was incorrectly implemented. Shouldn&#8217;t it be this?<\/p>\n<pre>IAsyncOperation&lt;Widget&gt; Gadget::CreateWidgetAsync()\r\n{\r\n    co_return co_await Widget::CreateAsync(m_options);\r\n}\r\n<\/pre>\n<p>&#8220;The immediate return might be losing some necessary coroutine frame lifetime, so that when <code>GadgetViewer::<wbr \/>Create\u00adWidgets\u00adAsync<\/code> completes, it&#8217;s operating on already-freed memory. However, I&#8217;m not confident in this analysis.&#8221;<\/p>\n<p>The function is fine. While it&#8217;s true that the common way of producing a <code>IAsync\u00adOperation&lt;Widget&gt;<\/code> is to autogenerate one from a coroutine, it is also perfectly legal to just create one by other means and just return it.<\/p>\n<p>Let&#8217;s try to walk back up the stack to find out what the <code>m_options<\/code> were that we <i>thought<\/i> we were passing in.<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; .frame 4\r\n04 000000bb`dccff930 00007ff8`fa6c7a55 contoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget::<wbr \/>CreateWidgetAsync+0x7e\r\n0:018&gt; dv\r\n           this = &lt;value unavailable&gt;\r\n<\/pre>\n<p>Rats, the <code>this<\/code> pointer got optimized out. Keep going.<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; .frame 5\r\n05 000000bb`dccff970 00007ff8`be19b2ff contoso!winrt::<wbr \/>impl::<wbr \/>produce&lt;winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget,<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IGadget&gt;::<wbr \/>CreateWidgetAsync+0x35\r\n0:018&gt; dv\r\n           this = &lt;value unavailable&gt;\r\n      operation = 0x000000bb`dccff9d0\r\n\r\n0:018&gt; .frame 6\r\n06 000000bb`dccff9a0 00007ff8`be109963     litware!winrt::<wbr \/>impl::<wbr \/>consume_Contoso_Gadget&lt;winrt::<wbr \/>Contoso::<wbr \/>Gadget&gt;::<wbr \/>CreateWidgetAsync+0x47\r\n0:018&gt; dv\r\n           this = &lt;value unavailable&gt;\r\n      operation = 0x00000000`00000000\r\n\r\n0:018&gt; .frame 7\r\n07 000000bb`dccff9f0 00007ff8`be0d4b40     litware!winrt::<wbr \/>LitWare::<wbr \/>implementation::<wbr \/>GadgetViewer::<wbr \/>CreateWidgetsAsync$_ResumeCoro$2+0x343\r\n0:018&gt; dv\r\n      &lt;coro_frame_ptr&gt; = 0x00000276`15223b00\r\n            strongThis = struct winrt::<wbr \/>com_ptr&lt;winrt::<wbr \/>LitWare::<wbr \/>implementation::<wbr \/>GadgetViewer&gt;\r\n<\/pre>\n<p>We had to chase all the way back to <code>GadgetViewer::<wbr \/>Create\u00adWidgets\u00adAsync<\/code> before we got a foothold into the thing that will lead us to the <code>options<\/code>. Now we can start working our way back in.<\/p>\n<p>The <code>GadgetViewer::<wbr \/>Create\u00adWidgets\u00adAsync<\/code> coroutine looks like this:<\/p>\n<pre>IAsyncOperation&lt;IVectorView&lt;MenuItem&gt;&gt; GadgetViewer::CreateWidgetsAsync()\r\n{\r\n    ...\r\n\r\n    const auto strongThis{ get_strong() };\r\n\r\n    co_await winrt::resume_background();\r\n\r\n    std::vector&lt;winrt::Contoso::Widget&gt; widgets;\r\n\r\n    if (const auto widget1 = co_await m_gadget.CreateWidgetAsync())\r\n    {\r\n        ...\r\n<\/pre>\n<p>Aha, so the outbound call to <code>Create\u00adWidget\u00adAsync<\/code> is made on the <code>GadgetViewer<\/code>&#8216;s <code>m_gadget<\/code>. Follow the call:<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; ?? strongThis\r\nstruct winrt::<wbr \/>com_ptr&lt;winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>GadgetViewer&gt;\r\n   +0x000 m_ptr            : 0x00000276`184121f0 winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>GadgetViewer\r\n0:018&gt; ??((winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>GadgetViewer*) 0x00000276`184121f0)-&gt;m_gadget\r\nstruct winrt::<wbr \/>Contoso::<wbr \/>Gadget\r\n   +0x000 m_ptr            : 0x00000276`0f68d4e0 winrt::<wbr \/>impl::<wbr \/>abi&lt;winrt::<wbr \/>Windows::<wbr \/>Foundation::<wbr \/>IUnknown,<wbr \/>void&gt;::<wbr \/>type\r\n<\/pre>\n<p>Okay, we&#8217;ve found the <code>m_gadget<\/code>. Now to the options.<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; dt contoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget\r\n   +0x010 vtable           : winrt::<wbr \/>impl::<wbr \/>produce&lt;winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget,<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IGadget&gt;\r\n   +0x000 __VFN_table : Ptr64\r\n   +0x008 m_references     : std::<wbr \/>atomic&lt;unsigned __int64&gt;\r\n   +0x018 m_options        : winrt::<wbr \/>Contoso::<wbr \/>WidgetOptions\r\n   +0x020 m_source         : winrt::<wbr \/>Contoso::<wbr \/>GadgetSource\r\n   +0x028 m_home           : std::<wbr \/>basic_string&lt;wchar_t,<wbr \/>std::<wbr \/>char_traits&lt;wchar_t&gt;,<wbr \/>std::<wbr \/>allocator&lt;wchar_t&gt; &gt;\r\n<\/pre>\n<p>We see that the vtable for the producer of <code>IGadget<\/code> is at offset <code>0x010<\/code>, so we subtract that amount from our <code>winrt::<wbr \/>Contoso::<wbr \/>Gadget<\/code> pointer to get a pointer to the implementation.<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; ?? ((contoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget*)(0x00000276`0f68d4e0-0x10))\r\nstruct winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget * 0x00000276`0f68d4d0\r\n   +0x010 vtable           : winrt::<wbr \/>impl::<wbr \/>produce&lt;winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget,<wbr \/>winrt::<wbr \/>Contoso::<wbr \/>IGadget&gt;\r\n   +0x000 __VFN_table : 0x00007ff8`fa6df2e8\r\n   +0x008 m_references     : std::<wbr \/>atomic&lt;unsigned __int64&gt;\r\n   +0x018 m_options        : winrt::<wbr \/>Contoso::<wbr \/>WidgetOptions\r\n   +0x020 m_source         : winrt::<wbr \/>Contoso::<wbr \/>GadgetSource\r\n   +0x028 m_home           : std::<wbr \/>basic_string&lt;wchar_t,<wbr \/>std::<wbr \/>char_traits&lt;wchar_t&gt;,<wbr \/>std::<wbr \/>allocator&lt;wchar_t&gt; &gt;\r\n\r\n0:018&gt; ?? &amp;((contoso!winrt::<wbr \/>Contoso::<wbr \/>implementation::<wbr \/>Gadget*)(0x00000276`0f68d4e0-0x10))-&gt;m_options\r\nstruct winrt::<wbr \/>Contoso::<wbr \/>WidgetOptions * 0x00000276`0f68d4e8\r\n   +0x000 m_ptr            : 0x00000276`152f6270 winrt::<wbr \/>impl::<wbr \/>abi&lt;winrt::<wbr \/>Windows::<wbr \/>Foundation::<wbr \/>IUnknown,<wbr \/>void&gt;::<wbr \/>type\r\n<\/pre>\n<p>The address of this <code>m_options<\/code> is <code>0x00000276`0f68d4e8<\/code>, which is not the same as the address that was passed to <code>Widget::<wbr \/>Create\u00adAsync<\/code>:<\/p>\n<pre style=\"white-space: pre-wrap;\">0:018&gt; .frame 3\r\n03 000000bb`dccff630 00007ff8`fa6c7b32     contoso!Widget::<wbr \/>CreateAsync+0x63\r\n0:018&gt; dv\r\n     options = 0x000000bb`dccff978\r\n<\/pre>\n<p>What happened? How did the address of a variable change?<\/p>\n<p>Studying the code in the <code>Gadget\u00adViewer<\/code> that uses the <code>m_gadget<\/code> member variable, we see that the member variable is used from both the foreground thread as well as from a background thread:<\/p>\n<pre>\/\/ Property setter: Set the \"Gadget\" property of the GadgetViewer\r\nvoid GadgetViewer::Gadget(winrt::Contoso::Gadget const&amp; value)\r\n{\r\n    VerifyUIThread();\r\n\r\n    if (m_gadget != value)\r\n    {\r\n        m_gadget = value;\r\n        ...\r\n    }\r\n}\r\n<\/pre>\n<p>Recall that one of the principles of debugging somebody else&#8217;s code is to <a title=\"Psychic debugging: When reading unfamiliar code, assume it's mostly correct\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20070423-00\/?p=27163\"> assume that the code is mostly correct<\/a>. The problem is likely in a small detail, an edge case, or a peculiar combination of factors. After all, if the problem was in a common case, they probably wouldn&#8217;t have had to ask an outsider for help.<\/p>\n<p>The <code>Verify\u00adUI\u00adThread<\/code> call tells us that the expectation is that the gadget is changed only from the UI thread. But there is no synchronization to protect access to this variable from multiple threads, even though we are accessing it from a background thread:<\/p>\n<pre>IAsyncOperation&lt;IVectorView&lt;MenuItem&gt;&gt; GadgetViewer::CreateWidgetsAsync()\r\n{\r\n    ...\r\n\r\n    const auto strongThis{ get_strong() };\r\n\r\n    <span style=\"color: blue;\">co_await winrt::resume_background(); \/\/ \u2190 hop to background thread<\/span>\r\n\r\n    std::vector&lt;winrt::Contoso::Widget&gt; widgets;\r\n\r\n    if (const auto widget1 = co_await m_gadget.CreateWidgetAsync())\r\n    {\r\n        ...\r\n<\/pre>\n<p>What may have happened is that while <code>Create\u00adWidgets\u00adAsync<\/code> was using <code>m_gadget<\/code> on the background thread, the foreground thread changed the <code>m_gadget<\/code>, causing the old gadget (and its <code>m_options<\/code>) to be destructed while the background thread was still using it.<\/p>\n<p>The customer provided some history: As originally written, the <code>Gadget\u00adViewer<\/code> accessed the <code>m_gadget<\/code> only from the foreground thread, but a subsequent change moved some of the work to a background thread, and that introduced concurrency into code that was written on the assumption that there was no concurrency.<\/p>\n<p>One possible solution is for <code>Gadget\u00adViewer::<wbr \/>Create\u00adWidgets\u00adAsync<\/code> to capture the member variables it intends to use (the <code>m_gadget<\/code>, in this case, but possibly other member variables not seen here) before going to the background thread, and operating entirely on the captured variables. It means that when you call <code>Create\u00adWidgets\u00adAsync<\/code>, you get the widgets associated with the gadget that you were viewing at the point you called <code>Create\u00adWidgets\u00adAsync<\/code>, even if you changed the gadget while the <code>Create\u00adWidgets\u00adAsync<\/code> was still working.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Is the coroutine really to blame?<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-106700","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Is the coroutine really to blame?<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106700","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106700"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106700\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106700"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106700"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106700"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}