The case of the feature flag that didn’t stay on long enough, part 1

Raymond Chen

There was a crash in a unit test run under the Test Authoring and Execution Framework (TAEF, pronounced like tafe). The crashing stack looked like this:

kernelbase!RaiseFailFastException   
unittests!wil::details::WilDynamicLoadRaiseFailFastException
unittests!wil::details::WilRaiseFailFastException
unittests!wil::details::WilFailFast
unittests!wil::details::ReportFailure_NoReturn<3>
unittests!wil::details::ReportFailure_Base<3,0>
unittests!wil::details::ReportFailure_Hr<3>
unittests!wil::details::in1diag3::_FailFast_Unexpected
unittests!WidgetRouter::GetInstance  
unittests!winrt::Component::implementation::Widget::Close    
unittests!winrt::Component::implementation::Widget::~Widget  
unittests!winrt::impl::heap_implements<⟦...⟧>::`scalar deleting destructor'   
unittests!winrt::implements<⟦...⟧>::Release   
unittests!std::vector<winrt::Windows::Foundation::IClosable>::~vector
unittests!std::vector<winrt::Windows::Foundation::IClosable>::clear
unittests!WidgetRouter::~WidgetRouter  
unittests!WidgetRouter::GetInstance'::`2'::`dynamic atexit destructor for 'singleton''
ucrtbase!<lambda_⟦...⟧>::operator    
ucrtbase!__crt_seh_guarded_call<int>::operator()<<lambda_⟦...⟧>>
ucrtbase!_execute_onexit_table
ucrtbase!__crt_state_management::wrapped_invoke
unittests!dllmain_crt_process_detach  
unittests!dllmain_dispatch    
ntdll!LdrpCallInitRoutine   
ntdll!LdrpProcessDetachNode 
ntdll!LdrpUnloadNode    
ntdll!LdrpDecrementModuleLoadCountEx    
ntdll!LdrUnloadDll  
kernelbase!FreeLibrary  
wex_common!TAEF::Common::Private::ExecutionPeFileData::`scalar deleting destructor' 
wex_common!TAEF::Common::PeFile::~PeFile    
te_loaders!WEX::TestExecution::NativeTestFileInstance::`scalar deleting destructor' 
te_host!<lambda_⟦...⟧>::Execute   
te_common!WEX::TestExecution::CommandThread::ExecuteCommandThread   
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

The team concluded that the destructor of the Widget was running at the conclusion of this function:

void WidgetTests::BasicTests()
{
  // Force the feature flag on for the duration of this test
  auto override = std::make_unique<FeatureOverride>(NewFeatureId, true);

  auto widget = winrt::Component::Widget();
  ⟦ test the widget in various ways ⟧

  // widget naturally destructs here
}

Their conclusion was that the override was being released, and then the widget was destructing, resulting in an assertion failure in WidgetRouter::GetInstance:

/* static method */
std::shared_ptr<WidgetRouter>
    WidgetRouter::GetInstance()
{
    assert(FeatureFlags::IsEnabled(NewFeatureId));

    static std::shared_ptr<WidgetRouter> singleton =
        std::make_shared<WidgetRouter>();

    return singleton;
}

The team theorized that perhaps there was a race condition between the release of the widget and the lifting of the override, and their proposed fix was to release the widget explicitly while the override is still in scope.

void WidgetTests::BasicTests()
{
  // Force the feature flag on for the duration of this test
  auto override = std::make_unique<FeatureOverride>(NewFeatureId, true);

  auto widget = winrt::Component::Widget();
  ⟦ test the widget in various ways ⟧

  widget = nullptr;
}

I commented on the pull request (which had already been completed) that this change has no effect. The rules for C++ say that local variables are destructed in reverse order of construction, so the widget will naturally be released as part of its destruction, and only after that is finished will the override be released when it destructs.

The team replied that they did observe that the problem disappeared after they made their fix, but then it came back.

Exercise: Explain why the problem went away and then came back.

Answer to exercise: I suspected that they tested their fix in their team’s test environment, where the feature is already enabled. The fix works not because it fixed anything but because it was never crashing in their test environment in the first place. The defect tracker, however, doesn’t know that. It correctly reported that the bug was not being observed in any branches that had received the fix, which was initially only their own test branch. As the fixed merged into other branches, the bug was still not observed, until it finally merged into a branch where their feature was disabled. At that point, the override was actually doing something (changing a feature from disabled to enabled), and that’s when the crashes started coming in.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

3 comments

Discussion is closed. Login to edit/delete existing comments.

GL April 17, 2025 · Edited

From the stack trace, the crash happens during the destruction of instance of , which destructs a of s, one of which is (I assume all of them are), whose destruction enters . My theory: a tries to remove itself from the singleton upon closing, so it has to it (maybe it remembered whether it added itself to the router upon construction, dependent on the feature flag at that time, and didn't think the feature could be disabled during its lifetime). This of course happens after has returned, because singletons are destructed on...
Read more
From the stack trace, the crash happens during the destruction of singleton instance of WidgetRouter, which destructs a vector of ICloseables, one of which is Widget (I assume all of them are), whose destruction enters GetInstance. My theory: a Widget tries to remove itself from the singleton WidgetRouter upon closing, so it has to GetInstance it (maybe it remembered whether it added itself to the router upon construction, dependent on the feature flag at that time, and didn’t think the feature could be disabled during its lifetime). This of course happens after BasicTests has returned, because singletons are destructed on DLL unload. At this point, the feature is disabled (since override in BasicTests is dead) so it’s failing the assertion.

The theory is quite scary — the vector destructor is releasing Widget then Widget destructor tries to edit that vector under destruction. I hope this is not the case. (But FWIW, the act of editing the vector appears to work since the code does not crash if the feature is always enabled.) Also, since WidgetRouter references the widgets, the widgets never naturally close. This is already a huge red flag, but suppose the widget users will close them manually, then the design is somewhat similar to a typical flaw in .NET finalizers and dispose methods. When Close is called from ~Widget, it means the last outstanding reference is being released, either by the user (if the feature has always been disabled), or by the WidgetRouter destructor. In either case, it’s incorrect to try removing itself from WidgetRouter (former: it’s not there; latter: it’s not there, because the destructor virtually removes everything from the vector). One possible solution is the Dispose(bool disposing) pattern — clean-up should know whether it’s called from ICloseable::Close or the destructor.

Read less
- Kevin Norris April 18, 2025
  
  The whole thing stinks of overengineering to me.
  
  Think about it. That vector is statically-allocated mutable state. Why would you reach for such a thing? There are a bunch of bad answers I can think of, but when we further add the constraint that ~Widget exists and does anything custom at all, the rule of zero tends to constrain the solution space down to (roughly) the following: Widget is responsible for managing some process-wide state or resource, that state or resource is inherently per-process (or global to the system) and cannot be sensibly managed without some variety of global mutable state,...
  Read more
  The whole thing stinks of overengineering to me.
  
  Think about it. That vector is statically-allocated mutable state. Why would you reach for such a thing? There are a bunch of bad answers I can think of, but when we further add the constraint that ~Widget exists and does anything custom at all, the rule of zero tends to constrain the solution space down to (roughly) the following: Widget is responsible for managing some process-wide state or resource, that state or resource is inherently per-process (or global to the system) and cannot be sensibly managed without some variety of global mutable state, and this design is presumably dictated by some OS-level restriction that we can’t work around or ignore.
  
  But that’s still a bad excuse, because even in that hypothetical, it is obvious that you cannot simultaneously have WidgetRouter be responsible for cleaning up Widget and vice-versa. One of them has to be the owner, and your smart pointers have to agree with that ownership model.
  
  Read less
  - GL April 18, 2025
    
    Rereading the stack trace makes me even more curious. Why is vector::clear calling ~vector? Are they operating on the same vector (I read the code of MSVC STL but its clear doesn’t try to destroy itself)?
    
    Another suspicious point is WidgetRouter is not bound to COM lifetime but manages COM objects. (But maybe their scenario doesn’t involve uninitializing COM process-wide, so…)

The case of the feature flag that didn’t stay on long enough, part 1

Author

3 comments

Read next

The case of the feature flag that didn’t stay on long enough, part 2

What resource ID should I give my application’s main icon?

Author

3 comments

Read next

The case of the feature flag that didn’t stay on long enough, part 2

What resource ID should I give my application’s main icon?

Stay informed