Last time, we looked at crashes caused by a DLL being removed from memory behind everybody’s back: Somebody would try to call into a DLL that everybody thought was still there, but which was no longer there.
A colleague of mine who was looking at other crashes coming from this process found that most of those other crashes were also of the form “a data structure was corrupted because somebody wrote the single byte 01 into it.” That piece of information made everything fall into place for my side of the investigation.
We saw earlier that the bottom bit of the HMODULE is set for datafile module handles. Therefore, if one of these stray 01 bytes happens to overwrite the bottom byte of an existing HMODULE handle, that turns it into a (fake) datafile module handle. Then, during process destruction, when a component dutifully cleans up the DLLs it loaded by freeing them (say, because they were stored in an RAII type like wil::), the code passes this (fake) datafile module handle to FreeLibrary. The FreeLibrary function sees the bottom bit set and says, “Oh, this must be the handle to a module that was loaded via LOAD_,” so it frees it as a datafile.
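To make the bit arithmetic concrete, here is a minimal sketch of what a stray 01 byte does to a handle value. It is a portable simulation on plain integers, not Windows code: the helper names (`LooksLikeDatafileHandle`, `OverwriteBottomByteWith01`) and the sample value `0x10000000` are made up for illustration, and the sketch assumes the handle is an address-like value whose bottom byte is normally zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the bottom bit being set is what marks a handle
// as a datafile module handle, per the discussion above.
inline bool LooksLikeDatafileHandle(std::uintptr_t handle)
{
    return (handle & 1) != 0;
}

// Hypothetical helper: model a stray single-byte write of 01 that lands
// on the least significant byte of a stored handle value.
inline std::uintptr_t OverwriteBottomByteWith01(std::uintptr_t handle)
{
    return (handle & ~static_cast<std::uintptr_t>(0xFF)) | 0x01;
}

// Walk through the scenario with a made-up handle value.
inline void DemoStrayWrite()
{
    const std::uintptr_t original = 0x10000000; // illustrative value only
    assert(!LooksLikeDatafileHandle(original)); // an ordinary module handle

    const std::uintptr_t corrupted = OverwriteBottomByteWith01(original);
    assert(corrupted == 0x10000001);            // same module, bottom bit now set
    assert(LooksLikeDatafileHandle(corrupted)); // now passes for a datafile handle
}
```

Calling `DemoStrayWrite()` runs the checks. The point is that the corrupted value still identifies the same module once the bottom bit is masked off, so nothing about it looks obviously invalid to the code that eventually frees it.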
Freeing a datafile module means undoing the steps that were taken when the module was loaded as a datafile: Unmapping the DLL from memory. In particular, loading a module as a datafile does not add the DLL to the list of DLLs that were loaded as code; therefore, unloading a datafile module doesn’t remove it from that list. As far as the DLL list is concerned, the DLL is still in memory.
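The bookkeeping mismatch can be sketched with a toy model. This is emphatically not how the real loader is implemented: `ToyLoader` and all of its members are hypothetical, and the model ignores reference counting and everything else the real loader does. It tracks only two facts per DLL: whether the memory is mapped, and whether the DLL is on the list of DLLs loaded as code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy model only; all names here are hypothetical.
struct ToyLoader
{
    std::set<std::uintptr_t> mapped;     // base addresses that are really mapped
    std::set<std::uintptr_t> moduleList; // DLLs believed to be loaded as code

    std::uintptr_t LoadAsCode(std::uintptr_t base)
    {
        mapped.insert(base);
        moduleList.insert(base);
        return base;                     // code handle: bottom bit clear
    }

    void Free(std::uintptr_t handle)
    {
        if (handle & 1) {
            // Datafile path: undo the mapping, leave the module list alone.
            mapped.erase(handle & ~static_cast<std::uintptr_t>(1));
        } else {
            // Code path: unmap and remove from the module list.
            mapped.erase(handle);
            moduleList.erase(handle);
        }
    }

    // True if the list still claims the DLL is loaded but the memory is gone.
    bool IsDangling(std::uintptr_t base) const
    {
        return moduleList.count(base) != 0 && mapped.count(base) == 0;
    }
};

inline void DemoFakeDatafileFree()
{
    ToyLoader loader;
    const std::uintptr_t h = loader.LoadAsCode(0x10000000); // illustrative value
    assert(!loader.IsDangling(0x10000000));

    loader.Free(h | 1); // freed through a handle whose bottom bit was corrupted
    assert(loader.IsDangling(0x10000000)); // still listed, no longer mapped
}
```

In this model, freeing through the intact handle leaves the two sets consistent, whereas freeing through the corrupted handle leaves an entry on the module list that points at unmapped memory, which is the state the crashes were tripping over.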
A one-bit error caused the code to lie and attempt to free a module handle that did not correspond to a LoadLibrary call, resulting in mass havoc.
The “DLL unmapped from memory” crash is just an alternate manifestation of the “somebody is writing 01 bytes to places they shouldn’t” bug. The original bug had a larger bucket spray than we initially thought.
The good news is that all of the crashes have funneled down to a single bug. The bad news is that you now have to debug this one memory corruption bug.
Unfortunately, at the time of this writing, the root memory corruption bug in the third-party program has yet to be identified. We don’t know whether it’s coming from an operating system component or from the program itself. However, the fact that it appears to occur only in one process, where it sprays across multiple modules, suggests that it’s a problem with that program, or that there’s something peculiar about how this specific process uses the system.
If you look at the original stack trace, you can see that the problem is occurring at process termination. That’s probably why the problem has lurked for so long: Crashes at exit often go unnoticed because there is no end-user loss of functionality. The user was finished with the program anyway. Whether it exits cleanly or with a crash doesn’t affect the user much.
Sorry. Not all stories have a happy ending.
Could you please ask someone to fix the memory leak in the WinRT allocation that is not being released, used in the MS phone app?
Looks like a good candidate for reproduction under the Time Travel Debugger. If the problem is still reproducible (which is not guaranteed, as the process will run much slower and some race conditions may disappear), it will be very easy to find the instruction that writes that byte.
So the root cause sounds like a major cyber security problem that is easily exploitable. The fix is to verify that the handle doesn’t point to code. That check belongs in the OS.
Forget about “how it happened” and fix the root cause!
I’m not seeing how this bug crosses a security boundary.
@Ken Settle On the next episode of the airtight hatchway…
I was chasing something (I forget what) and set up a SchTsk to trigger on Eventlog
Application – Application Error
Application – Application Hang
with action
msg.exe * /time:864000 Application Hang or Crash, see EvtLog.
Later found one of my own .NET Framework applications tripped on something on the way out, but no one had noticed.