There was a crash in a unit test run under the Test Authoring and Execution Framework (TAEF, pronounced like tafe). The crashing stack looked like this:
kernelbase!RaiseFailFastException unittests!wil::details::WilDynamicLoadRaiseFailFastException unittests!wil::details::WilRaiseFailFastException unittests!wil::details::WilFailFast unittests!wil::details::ReportFailure_NoReturn<3> unittests!wil::details::ReportFailure_Base<3,0> unittests!wil::details::ReportFailure_Hr<3> unittests!wil::details::in1diag3::_FailFast_Unexpected unittests!WidgetRouter::GetInstance unittests!winrt::Component::implementation::Widget::Close unittests!winrt::Component::implementation::Widget::~Widget unittests!winrt::impl::heap_implements<⟦...⟧>::`scalar deleting destructor' unittests!winrt::implements<⟦...⟧>::Release unittests!std::vector<winrt::Windows::Foundation::IClosable>::~vector unittests!std::vector<winrt::Windows::Foundation::IClosable>::clear unittests!WidgetRouter::~WidgetRouter unittests!WidgetRouter::GetInstance'::`2'::`dynamic atexit destructor for 'singleton'' ucrtbase!<lambda_⟦...⟧>::operator ucrtbase!__crt_seh_guarded_call<int>::operator()<<lambda_⟦...⟧>> ucrtbase!_execute_onexit_table ucrtbase!__crt_state_management::wrapped_invoke unittests!dllmain_crt_process_detach unittests!dllmain_dispatch ntdll!LdrpCallInitRoutine ntdll!LdrpProcessDetachNode ntdll!LdrpUnloadNode ntdll!LdrpDecrementModuleLoadCountEx ntdll!LdrUnloadDll kernelbase!FreeLibrary wex_common!TAEF::Common::Private::ExecutionPeFileData::`scalar deleting destructor' wex_common!TAEF::Common::PeFile::~PeFile te_loaders!WEX::TestExecution::NativeTestFileInstance::`scalar deleting destructor' te_host!<lambda_⟦...⟧>::Execute te_common!WEX::TestExecution::CommandThread::ExecuteCommandThread kernel32!BaseThreadInitThunk ntdll!RtlUserThreadStart
The team concluded that the destructor of the Widget
was running at the conclusion of this function:
void WidgetTests::BasicTests() { // Force the feature flag on for the duration of this test auto override = std::make_unique<FeatureOverride>(NewFeatureId, true); auto widget = winrt::Component::Widget(); ⟦ test the widget in various ways ⟧ // widget naturally destructs here }
Their conclusion was that the override was being released, and then the widget was destructing, resulting in an assertion failure in WidgetRouter::
:
/* static method */ std::shared_ptr<WidgetRouter> WidgetRouter::GetInstance() { assert(FeatureFlags::IsEnabled(NewFeatureId)); static std::shared_ptr<WidgetRouter> singleton = std::make_shared<WidgetRouter>(); return singleton; }
The team theorized that perhaps there was a race condition between the release of the widget and the lifting of the override, and their proposed fix was to release the widget
explicitly while the override is still in scope.
void WidgetTests::BasicTests()
{
// Force the feature flag on for the duration of this test
auto override = std::make_unique<FeatureOverride>(NewFeatureId, true);
auto widget = winrt::Component::Widget();
⟦ test the widget in various ways ⟧
widget = nullptr;
}
I commented on the pull request (which had already been completed) that this change has no effect. The rules for C++ say that local variables are destructed in reverse order of construction, so the widget will naturally be released as part of its destruction, and only after that is finished will the override be released when it destructs.
The team replied that they did observe that the problem disappeared after they made their fix, but then it came back.
Exercise: Explain why the problem went away and then came back.
Answer to exercise: I suspected that they tested their fix in their team’s test environment, where the feature is already enabled. The fix works not because it fixed anything but because it was never crashing in their test environment in the first place. The defect tracker, however, doesn’t know that. It correctly reported that the bug was not being observed in any branches that had received the fix, which was initially only their own test branch. As the fixed merged into other branches, the bug was still not observed, until it finally merged into a branch where their feature was disabled. At that point, the override was actually doing something (changing a feature from disabled to enabled), and that’s when the crashes started coming in.
From the stack trace, the crash happens during the destruction of instance of , which destructs a of s, one of which is (I assume all of them are), whose destruction enters . My theory: a tries to remove itself from the singleton upon closing, so it has to it (maybe it remembered whether it added itself to the router upon construction, dependent on the feature flag at that time, and didn't think the feature could be disabled during its lifetime). This of course happens after has returned, because singletons are destructed on...
The whole thing stinks of overengineering to me.
Think about it. That vector is statically-allocated mutable state. Why would you reach for such a thing? There are a bunch of bad answers I can think of, but when we further add the constraint that ~Widget exists and does anything custom at all, the rule of zero tends to constrain the solution space down to (roughly) the following: Widget is responsible for managing some process-wide state or resource, that state or resource is inherently per-process (or global to the system) and cannot be sensibly managed without some variety of global mutable state,...
Rereading the stack trace makes me even more curious. Why is
vector::clear
calling~vector
? Are they operating on the same vector (I read the code of MSVC STL but itsclear
doesn’t try to destroy itself)?Another suspicious point is WidgetRouter is not bound to COM lifetime but manages COM objects. (But maybe their scenario doesn’t involve uninitializing COM process-wide, so…)