May 5th, 2023

The case of the crash in a C++/WinRT coroutine: Unpeeling the onion

A customer had a mysterious crash in a C++/WinRT coroutine, and they were having trouble identifying the cause. The stack of the crash looked like this:

Contoso!winrt::impl::consume_Contoso_IWidgetWatcher<winrt::Contoso::IWidgetWatcher>::NotifyChange+0x3
Contoso!winrt::Contoso::implementation::WidgetController::SaveWidget+0x16
Contoso!winrt::impl::produce<winrt::Contoso::implementation::WidgetController,winrt::Contoso::IWidgetController>::SaveWidget+0x1d
Contoso!winrt::impl::consume_Contoso_IWidgetController<winrt::Contoso::IWidgetController>::SaveWidget+0xd
Contoso!winrt::Contoso::implementation::Widget::OnSettingsChanged$_ResumeCoro$1+0x31f
Contoso!std::experimental::coroutine_handle<void>::resume+0xc
Contoso!winrt::impl::await_adapter<winrt::Windows::Foundation::IAsyncAction>::await_suspend+0x15e
Contoso!winrt::Contoso::implementation::Widget::OnSettingsChanged$_ResumeCoro$1+0x11a
Contoso!winrt::Contoso::implementation::Widget::OnSettingsChanged$_InitCoro$2+0x6a
Contoso!winrt::Contoso::implementation::Widget::OnSettingsChanged+0x5c
Contoso!winrt::Windows::Foundation::TypedEventHandler<winrt::Contoso::Widget,winrt::Windows::Foundation::IInspectable>::<lambda_a7902c7784a1de3d47473a11e43d997c>::operator()+0x5b
Contoso!winrt::impl::delegate<winrt::Windows::Foundation::TypedEventHandler<winrt::Contoso::Widget,winrt::Windows::Foundation::IInspectable>,<lambda_a7902c7784a1de3d47473a11e43d997c> >::Invoke+0x6d
rpcrt4!Invoke+0x73
rpcrt4!Ndr64StubWorker+0xb8a
rpcrt4!NdrStubCall3+0xd3
combase!CStdStubBuffer_Invoke+0x6f
combase!InvokeStubWithExceptionPolicyAndTracing::__l6::<lambda_c9f3956a20c9da92a64affc24fdd69ec>::operator()+0x22
combase!ObjectMethodExceptionHandlingAction<<lambda_c9f3956a20c9da92a64affc24fdd69ec> >+0x4d
combase!InvokeStubWithExceptionPolicyAndTracing+0xe1
combase!DefaultStubInvoke+0x268
combase!SyncServerCall::StubInvoke+0x41
combase!StubInvoke+0xf6
combase!ServerCall::ContextInvoke+0x366
combase!ComInvokeWithLockAndIPID+0x9aa
combase!ThreadInvokeReturnHresult+0x17b
combase!ThreadInvoke+0x193
rpcrt4!DispatchToStubInCNoAvrf+0x22
rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x1b4
rpcrt4!RPC_INTERFACE::DispatchToStub+0xb3
rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0x188
rpcrt4!LRPC_SBINDING::DispatchToStubWithObject+0x23
rpcrt4!LRPC_SCALL::QueueOrDispatchCall+0x253
rpcrt4!LRPC_SCALL::HandleRequest+0x996
rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x2c3
rpcrt4!LRPC_ADDRESS::HandleRequest+0x17c
rpcrt4!LRPC_ADDRESS::ProcessIO+0x939
rpcrt4!LrpcIoComplete+0x109
ntdll!TppAlpcpExecuteCallback+0x157
ntdll!TppWorkerThread+0x72c
kernel32!BaseThreadInitThunk+0x1d
ntdll!RtlUserThreadStart+0x28

Contoso!winrt::impl::consume_Contoso_IWidgetWatcher<winrt::Contoso::IWidgetWatcher>::NotifyChange+0x3:
00007ffd`ab49749d mov     rax,qword ptr [rcx] ds:00000000`00000000=????????????????

This is a big, ugly stack, but we can simplify it:

Contoso!winrt::impl::consume_IWidgetWatcher::NotifyChange+0x3
Contoso!winrt::WidgetController::SaveWidget+0x16
Contoso!winrt::impl::produce<IWidgetController>::SaveWidget+0x1d
Contoso!winrt::impl::consume_IWidgetController::SaveWidget+0xd
Contoso!winrt::Widget::OnSettingsChanged$_ResumeCoro$1+0x31f
Contoso!std::experimental::coroutine_handle<void>::resume+0xc
Contoso!winrt::impl::await_adapter::await_suspend+0x15e
Contoso!winrt::Widget::OnSettingsChanged$_ResumeCoro$1+0x11a
Contoso!winrt::Widget::OnSettingsChanged$_InitCoro$2+0x6a
Contoso!winrt::Widget::OnSettingsChanged+0x5c
Contoso!winrt::TypedEventHandler::<lambda_...>::operator()+0x5b
Contoso!winrt::impl::delegate::Invoke+0x6d
(external event machinery)

Reading from the bottom up, we start with a bunch of external event machinery, which led to the invocation of an event handler, evidently the On­Settings­Changed handler.

This handler is a coroutine, and it awaited something, and then upon resumption it called the WidgetController::SaveWidget method, which called WidgetWatcher::NotifyChange, but it crashed trying to make that call because the widget watcher is null. (We know this because rcx is null, and we’re trying to read its vtable.)

The customer’s On­Settings­Changed handler looked like this:

fire_and_forget Widget::OnSettingsChanged(
    const SettingsManager&, const IInspectable&)
{
    co_await ApplySettingsAsync();
    m_widgetController.SaveWidget(*this);
}

The customer saw this code and smacked their forehead. “Of course, the problem is that I forgot to hold a strong reference to the widget across the co_await.”

winrt::fire_and_forget Widget::OnSettingsChanged(
    const winrt::SettingsManager&, const winrt::IInspectable&)
{
    auto lifetime = get_strong();
    co_await ApplySettingsAsync();
    m_widgetController.SaveWidget(*this);
}

But their fix didn’t help. The crashes kept coming.

Let’s look at that crash stack again:

Contoso!winrt::Widget::OnSettingChanged$_ResumeCoro$1+0x31f
Contoso!std::experimental::coroutine_handle<void>::resume+0xc
Contoso!winrt::impl::await_adapter::await_suspend+0x15e
Contoso!winrt::Widget::OnSettingChanged$_ResumeCoro$1+0x11a

Reading upward, the coroutine was executing, and then decided to co_await something. The coroutine machinery called await_suspend, and the await_suspend immediately resumed the coroutine, causing OnSettingChanged$_ResumeCoro$1 to be re-entered.

So the coroutine never really suspended. Adding a strong reference wouldn’t have helped here, since the strong reference’s job is to keep the object alive across a suspension, which hasn’t happened. (I mean, the strong reference is still required in the case where the co_await does suspend. I’m just saying that this particular crash didn’t involve a suspension.)

Another way to observe that there was no suspension is that the stack trace leads all the way back to the external event machinery. We are still in the synchronous portion of the event handler. Nothing has suspended yet.

I looked at how the event was registered:

struct Widget : WidgetT<Widget>
{
    // other constructor parameters elided for expository purposes
    Widget::Widget(const WidgetController& controller)
        : m_weakController{ controller }
    {
        m_SettingChangedRevoker =
             m_settingsWatcher.SettingChanged(winrt::auto_revoke,
                { this, &Widget::OnSettingChanged });
    }

private:
    fire_and_forget OnSettingChanged(
        const SettingsManager&, const IInspectable&);

    const weak_ref<WidgetController> m_weakWidgetController;
    const SettingsWatcher m_settingsWatcher;
    SettingsWatcher::SettingChanged_revoker m_SettingChangedRevoker;

    // other members elided for expository purposes
};

Notice that the event handler is registered with a raw this pointer.

Registering with a raw this pointer means that you are taking full responsibility for ensuring that the object is alive at the time the event is received. This is manageable for events that are raised synchronously (such as UI events), since you can make sure to unregister the event handler from the thread that will raise the event, avoid the race condition where you unregister the event handler while the handler is running (or is irrevocably committed to running).

Thread 1 Thread 2
event triggered  
handler()  
  unregister handler
handler still running!  

We can see from the stack that the Settings­Watcher object raises events from a background thread, so a raw this capture is not safe. If the event is unregistered while a callback is in progress (or is irrevocably committed to running), the handler will have the object destructed out from under it (or possibly even have it destructed before it can execute a single instruction).

In this case, the registration of the handler should be done with a weak reference.

        m_SettingChangedRevoker =
             m_settingsWatcher.SettingChanged(winrt::auto_revoke,
                { get_weak(), &Widget::OnSettingChanged });

The C++/WinRT weak reference event handler pattern goes like this:

auto strong = weak.get():
if (strong) { (strong->*handler)(); }

C++/WinRT tries to upgrade the weak reference to a strong reference, and if successful, it calls the event handler with the strong reference active. When the handler returns, the strong reference is released.¹

This particular race condition is between the event handler and the revocation of the event. This particular object doesn’t revoke the event until it destructs, and the way the program uses the Widget, we don’t expect it to be destroyed until the program shuts down, and there’s no evidence that the program is shutting down.

The “object got destructed before your handler run” scenario also doesn’t seem to match the crash dump:

0:019> ?? ((winrt::Contoso::implementation::WidgetController*) 0x00000262`1b9ce810)
struct winrt::Contoso::implementation::WidgetController * 0x00000262`1b9ce810
   +0x010 vtable           : ...
   +0x018 vtable           : ...
   +0x000 __VFN_table : 0x00007ffd`ab59d940
   +0x008 m_references     : std::atomic<unsigned __int64>
    ...
   +0x0a0 m_widgetWatcher : winrt::Contoso::WidgetWatcher

0:019> ?? ((winrt::Contoso::implementation::WidgetController*) 0x00000262`1b9ce810)->m_references._Storage
struct std::_Atomic_padded<unsigned __int64>
   +0x000 _Value           : 0x80000131`0e28d450

0:019> dps 0x80000131`0e28d450*2
00000262`1c51a8a0  00007ffd`ab5998d8 Contoso!winrt::impl::weak_ref<1,1>::`vftable'
00000262`1c51a8a8  00007ffd`ab5998b8 Contoso!winrt::impl::weak_source<1,1>::`vftable'
00000262`1c51a8b0  00000262`1b9ce820 // m_object - pointer to original object (matches)
00000262`1c51a8b8  0000002e`00000006 // weak + strong references

The weak reference control block seems to be valid, and it shows a reasonable non-zero number of strong references, so it doesn’t seem that we’re in the “Object destructed while handler is running” scenario.

So we found another bug, but not the bug that caused this crash.

The crash is at the call to m_widgetWatcher.SaveWidget(), so let’s look at that m_widgetWatcher.

0:019> ?? ((winrt::Contoso::implementation::WidgetController*) 0x00000262`1b9ce810)->m_widgetWatcher
struct winrt::Contoso::WidgetWatcher
   +0x000 m_ptr            : 0x00000262`1b86ec60
0:019> dps 0x00000262`1b86ec60 L2
00000262`1b86ec60  00007ffd`ab5a1728 Contoso!winrt::impl::produce<WidgetWatcher,IWidgetWatcher>::`vftable'
00000262`1b86ec68  00000000`00000000

Wait a second, the crash dump says that we crashed because the m_widgetWatcher is null, but in the dump, the m_widgetWatcher is non-null. This suggests to me that we encountered a race condition, where the m_widgetWatcher was null at the time of the crash, but in the time it took to create the crash dump, the value was updated.

This helped focus the next step of the investigation: Let’s look at the WidgetController and how it initializes the WidgetWatcher.

struct WidgetController : WidgetControllerT<WidgetController>
{
    // constructor parameters elided for expository purposes
    WidgetController(const WidgetWatcher& watcher) :
        m_widget{ make<Widget>(*this) },
        m_widgetWatcher{ watcher }
    {
    }

    void SaveWidget(const Widget&);

private:
    const Widget m_widget;
    const WidgetWatcher m_widgetWatcher;

    // other members elided for expository purposes
};

Now the story comes into focus.

The non-static data members of C++ objects are constructed in order of declaration in the class.² In this case, we construct the Widget before we copy the WidgetWatcher.

If a setting change occurs immediately after the Widget is constructed, then we have a race between the event callback thread (which will call back into the Widget­Controller) and the construction of the Widget­Controller‘s m_widgetWatcher. If the event callback thread wins the race, then it will see a not-yet-constructed m_widgetWatcher and crash.

That is the race condition that we hit. The event callback thread tried to use a not-yet-constructed object.

Therefore, the final step of the fix is to force the m_widget­Watcher to construct first:

struct WidgetController : WidgetControllerT<WidgetController>
{
    // constructor parameters elided for expository purposes
    WidgetController(const WidgetWatcher& watcher) :
        m_widget{ make<Widget>(*this) },
        m_widgetWatcher{ watcher }
    {
    }

    void SaveWidget(const Widget&);

private:
    // Order of declaration is important:
    // WidgetWatcher must construct before the Widget,
    // because the Widget may call into the WidgetWatcher.
    const WidgetWatcher m_widgetWatcher;
    const Widget m_widget;

    // other members elided for expository purposes
};

This was a rather long investigation, with two red herrings! I don’t know whether you enjoyed it, but you can’t get that time back now.

¹ Note that if the handler is a coroutine, the handler “returns” at the point the coroutine first suspends, not when the coroutine runs to completion. That’s why we need the get_strong() inside the coroutine body: To keep the object alive through to completion.

² Note that the order in which the initializers are given in the constructor are irrelevant. It is the order of declaration in the class that controls the order of construction.

struct Example
{
    Example() : b{ 1 }, a{ 2 } {}

    Thing1 a;
    Thing2 b;
};

In this example class, the a member is constructed first, and then the b member is constructed second. You might think that b is constructed first because it is listed first in the constructor, but that’s not the case.

If construction occurred in order of initialization, and two constructors initialized the members in different orders, then the destructor would be unable to follow the rule that objects are destructed in the same order of construction, because the destructor doesn’t know which constructor was used.³

³ I guess you could solve this problem by adding a hidden member variable that keeps track of which order the members were constructed, but this adds runtime costs and would likely be surprising to programmers that it’s possible that adding a constructor to a class, even one that is never called, can change its size.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

3 comments

Discussion is closed. Login to edit/delete existing comments.

  • Alan Thomas

    It shouldn’t be this easy to screw up what looks like bog-standard event handler logic. The fact that Raymond’s investigation uncovered 3 bugs makes me think that the blame doesn’t lie with the developer, but instead with the excessive complexity of the system. Whether it’s WinRT or the C++ language itself which is at fault… I’m not sure.

    • a b

      I agree.

      One of the problems here is requirement for explicit lifetime reference object that should always be there but is not enforced effectively making it a footgun that automatically shoots at you whenever you slip up.

      Personally I prefer the solution that Qt framework uses. It has asynchronous signals&slots callback mechanism instead of coroutines but I think it is still applicable because coroutines are just a syntactic sugar. When you register an event handler (a slot)...

      Read more
      • Raymond ChenMicrosoft employee Author

        The Qt approach does require additional planning, because you have to worry about the case where your callback never runs. If that callback is expected to clean up resources or unwind state, you'll have to do that cleanup/unwinding when you destroy the context. You could do the same thing in a coroutine, but nobody wants to because it's a big hassle. (Basically, drop your strong reference to a weak reference before awaiting, then try to...

        Read more