The case of Explorer calling into an unloaded DLL while trying to run down a reference to it
There was a large number of crashes in Explorer that were tracked back to attempting to release a COM object that belonged to a DLL that was no longer in memory.
A typical call stack at the crash looked like this:
combase!<lambda_...>::operator()+0x9e
combase!ObjectMethodExceptionHandlingAction<lambda_...>+0x1b
combase!CStdIdentity::ReleaseCtrlUnk+0x68
combase!CStdMarshal::DisconnectWorker_ReleasesLock+0x385
combase!CStdMarshal::DisconnectSwitch_ReleasesLock+0x28
combase!CStdMarshal::DisconnectAndReleaseWorker_ReleasesLock+0x3c
combase!CStdMarshal::DisconnectAndRelease+0x35
combase!COIDTable::ThreadCleanup+0xd5
combase!FinishShutdown::<lambda_...>::operator()+0x5
combase!ObjectMethodExceptionHandlingAction<lambda_...>+0x15
combase!FinishShutdown+0x45
combase!ApartmentUninitialize+0x67
combase!wCoUninitialize+0x11a
combase!CoUninitialize+0xb6
imm32!CtfImmCoUninitialize+0x48
msctf!CicFlsCallback+0x50
ntdll!RtlProcessFlsData+0xf6
ntdll!LdrShutdownThread+0x32
ntdll!RtlExitUserThread+0x4c
KERNELBASE!FreeLibraryAndExitThread+0x34
ucrtbase!common_end_thread+0x84
ucrtbase!_endthreadex+0x7
ucrtbase!thread_start+0x46
kernel32!BaseThreadInitThunk+0x24
ntdll!__RtlUserThreadStart+0x2f
ntdll!_RtlUserThreadStart+0x1b
I took a sample of ten crashes with this stack to see if I could find a pattern. The object being released is still alive (the data for it is still present in memory, and it still has a vtable), but the code that the vtable points to has already been unloaded. Fortunately, the system remembers the DLLs that were most recently unloaded, so we can use that to look up the DLLs that hosted the objects that are being run down.
The ten crashes break down like this:
The vast majority of the issues are with Contoso, so we’ll focus on that one.
An interesting detail is that in four of the Contoso crashes, some version of the Contoso setup program is running.
I got lucky and discovered that Contoso is an open source project, so I was able to make further progress by reading the code and seeing what they were trying to do.
Contoso injects its DLL into Explorer and takes over a bunch of stuff. When Contoso wants to unload from Explorer, it unhooks all the hooks that it installed and unloads itself. It wasn’t loaded by COM, so COM is not going to call DllCanUnloadNow to see whether the DLL has any active COM objects that would require it to remain loaded in memory.
However, it does produce COM objects, in particular implementations of IAccessible, so that its UI objects are available to screen readers and other UI automation clients.
Once I had this foothold, it was relatively easy to reproduce the problem:
- Start Narrator.
- Launch the Contoso UI.
- Uninstall Contoso.
- Perform a developer shutdown of Explorer; Ctrl+Shift+RightClick, Exit Explorer.
Here’s what’s going on.
Launching the Contoso UI causes Narrator to ask for the IAccessible interface so it can navigate the user interface elements.
Uninstalling Contoso causes it to remove its injected DLL, even though there are IAccessible objects still outstanding. These are ticking time bombs waiting to be triggered.¹
Shutting down Explorer causes COM to be shut down for the process, at which time it runs down all the outstanding objects. And that’s when it trips over these IAccessible objects that are backed by code that is no longer present in the process.
The fix is to create a custom COM context to hold your objects so that you can disconnect them prior to unloading. And the project owners agreed to make a fix to do exactly that.
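To make the shape of that fix concrete, here is a minimal sketch in portable C++. It is a model of the idea, not COM code and not what the Contoso project actually wrote: Module, Stub, ObjectTracker, and UnloadThenRunDown are all hypothetical names invented for illustration. The tracker stands in for the custom context. It remembers every object handed out, so that all of them can be disconnected in one step before the DLL goes away. In real COM, the disconnecting would be done with functions such as CoDisconnectObject or CoDisconnectContext.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the injected DLL's code. When `loaded` is false,
// calling into it models a jump into memory that has been unloaded.
struct Module {
    bool loaded = true;
};

// Hypothetical stand-in for the reference that the COM runtime keeps to an
// object that was handed to a client such as a screen reader.
class Stub {
public:
    explicit Stub(Module* owner) : owner_(owner) {}

    // Sever the link to the object, so rundown no longer calls into it.
    void Disconnect() { owner_ = nullptr; }

    // Models rundown at apartment shutdown. Returns true if it was safe,
    // false if it would have called into code that is no longer loaded.
    bool RunDown() {
        if (owner_ == nullptr) return true;  // already disconnected
        bool safe = owner_->loaded;
        owner_ = nullptr;
        return safe;
    }

private:
    Module* owner_;
};

// Hypothetical tracker playing the role of the custom context: it records
// every exported object so they can all be disconnected before unloading.
class ObjectTracker {
public:
    std::shared_ptr<Stub> Export(Module* owner) {
        auto stub = std::make_shared<Stub>(owner);
        exported_.push_back(stub);
        return stub;
    }

    void DisconnectAll() {
        for (auto& stub : exported_) stub->Disconnect();
        exported_.clear();
    }

private:
    std::vector<std::shared_ptr<Stub>> exported_;
};

// Replays the scenario: export an object, optionally disconnect everything,
// unload the DLL, then let the runtime run down what it still holds.
bool UnloadThenRunDown(bool disconnect_first) {
    Module dll;
    ObjectTracker tracker;
    std::shared_ptr<Stub> held_by_runtime = tracker.Export(&dll);
    if (disconnect_first) tracker.DisconnectAll();
    dll.loaded = false;                 // the DLL unloads itself
    return held_by_runtime->RunDown();  // Explorer shuts down COM later
}
```

In this model, UnloadThenRunDown(false) reproduces the crash condition, because rundown reaches an object whose code is gone, and UnloadThenRunDown(true) is the fixed ordering: disconnect first, unload second.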
One of the burdens of Explorer is that it is an attractive target for third-party code to inject itself, despite it being totally unsupported. And when that third-party code crashes, it’s Explorer that takes the blame.
One crash caused by a third party code-injector down. A few million more to go.
¹ That explains why the Contoso installer is often running at the time of these crashes. One of the things that the Contoso installer does is uninstall the previous version.
It’s quite possible to blame this one on MS anyway, but explorer isn’t the problem at all. Here we have yet another case of an application jumping through hoops because it can’t delete a file that contains running code.
I know MS doesn’t want to use such functionality, but it makes so much extra work because it’s not there for those that do. And now you’ve had to pay a piece of it.
Do you know that at the very least you can move files that contain running code? Sure, this moves the problem from how to replace the file that is in use with how to delete the files after they unloaded, but at the very least this situation can be easily prevented anyway.
> how to delete the files after they unloaded
Got a solution that doesn’t require admin rights and doesn’t require a process to be left running waiting for the file to become unlocked?
In this case? Let the setup program deal with it after explorer was restarted. But I would also say that since this is dealing with explorer, you are going to be using a setup program with admin rights anyway so using MoveFile with delay until reboot is an option.
In general? The Windows task scheduler. The only requirements are whatever is needed to execute the command and access what you want to clean up.
If you don’t want to be admin or leave a watcher behind, just force the user to log out. Since it’s not a reboot it’s *clearly* not any kind of downtime. /s
The actual problem here is that the file *is* being “successfully” unloaded (and probably getting replaced on disk just fine). It’s just leaving some dangling pointers behind. If you only replaced the file on disk you’d still have to either unload the original somehow or make explorer restart without unloading the DLL.
It’s slightly odd to be using the Contoso name when you did, in fact, identify the product in question.
> Perform a developer shutdown of Explorer; Ctrl+Shift+RightClick, Exit Explorer.
That is a nice trick! You should do a blog post with a few of these 😀
Ctrl+Shift+Right click where?
I found an article https://docs.microsoft.com/en-gb/windows/win32/shell/debugging-with-the-shell that claims it should be on the right side of the start menu, but that doesn’t make a menu pop up, so I don’t know where you and Raymond are right-clicking.
On the taskbar, apparently. I just tested it on Windows 10 1909.
I would expect a shell extension crash to cause a bluescreen. I think I read somewhere (old MSDN?) that they run on the kernel, but it was looong ago, so probably a mistake.
It would be interesting to see how you tracked that stack trace to Contoso. Did you have to instruct WER to gather more information? What kind of memory dump was it?
>I would expect a shell extension crash to cause a bluescreen.
Sounds like a myth being spread in the Linux community.
It was real, in Windows 98. Certain failure modes would trash the 16 bit heap, and that’s all she wrote.
Why would you expect this? Explorer.exe is a user-mode process. Besides, shell extensions will load in any application with an open/save dialog, etc.
I think it was specifically about taskbar extensions and why they removed them after vista (if I remember correctly). Like the little mini windows media player control bar – I miss that one. The problem might have been related to the fact that people started loading multiple versions of .net into the shell (I vaguely recall discussions on how to use silverlight on the taskbar).
That’s not the developer shutdown I know.
It is more or less the Vista one, but they probably had to move it due to the start changes in Windows 8+. Or it always worked on the task bar but nobody mentioned that.
Of course this is one of the fantastic things about open source – smart people in the wider community can contribute improvements one way or the other.