Adventures in application compatibility: The case of the RAII type that failed to run its destructor

Raymond Chen

A customer reported that their system crashed with bugcheck code 0xEF: CRITICAL_PROCESS_DIED because the RPCSS service died. The investigation of this failure was difficult because by the time the crash occurred, everything was badly corrupted with no evidence as to who did the corrupting. The data that was being corrupted is per-thread data that is normally managed by an RAII type, so there should be no way it could be getting corrupted, assuming that the RAII type is being used properly: Always created and destructed in LIFO order, which is typically accomplished by making it a local variable in a non-coroutine function.
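
To make the "proper use" rule concrete, here is a minimal sketch of an RAII type that manages per-thread data in LIFO order. The article does not show the real type, so the names `CallContext`, `ScopedCallContext`, `t_current`, `current_call_id`, and `nested_call_demo` are all hypothetical. The guard saves the previous per-thread pointer on construction and restores it on destruction, which is only correct if guards are destroyed in the reverse order of their creation.

```cpp
#include <cassert>

// Hypothetical per-thread state; the article does not name the real data.
struct CallContext {
    int call_id;
};

// The per-thread "current context" pointer that the guard manages.
thread_local CallContext* t_current = nullptr;

// RAII guard: installs a context on construction and restores the
// previous one on destruction.
class ScopedCallContext {
public:
    explicit ScopedCallContext(int call_id)
        : context_{call_id}, previous_(t_current) {
        t_current = &context_;
    }

    ~ScopedCallContext() {
        // LIFO check: this guard must be the innermost live guard
        // on the current thread when it is destroyed.
        assert(t_current == &context_);
        t_current = previous_;
    }

    // Copying would create two owners of the same restore step.
    ScopedCallContext(const ScopedCallContext&) = delete;
    ScopedCallContext& operator=(const ScopedCallContext&) = delete;

private:
    CallContext context_;
    CallContext* previous_;
};

// Returns the id of the innermost active call, or -1 if there is none.
int current_call_id() {
    return t_current ? t_current->call_id : -1;
}

// Declaring guards as local variables gives LIFO order for free.
int nested_call_demo() {
    ScopedCallContext outer(1);
    {
        ScopedCallContext inner(2);
        if (current_call_id() != 2) return 0;
    }
    // The inner guard has been destroyed, so the outer context is current.
    return current_call_id();
}
```

As long as every guard's destructor runs, the per-thread pointer is back to its original value once the outermost guard goes out of scope.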

The team scoured their code for all uses of that RAII type and couldn’t find any place where they could possibly be using it incorrectly.

Eventually, the team discovered that a third party anti-malware application had injected itself into the RPCSS service and was detouring some functions.¹ And for some reason, the detour was unhappy, but instead of failing the call, it raised a structured exception.

This structured exception was caught by the RPC equivalent of the giant try/catch that COM wraps around every server method. For RPC, this wrapper goes by the name RpcExceptionFilter. It so happens that the exception raised by the detour is not one that the RpcExceptionFilter function deems to be noteworthy, so the function swallows the exception and returns an RPC failure to the client.

As I noted, this structured exception bypasses C++ destructors.
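
The effect of a skipped destructor on per-thread bookkeeping can be sketched in portable C++. This is a stand-in, not real structured exception handling: the abandoned frame is simulated by constructing the guard with placement new and never calling its destructor. The names `ActiveCallGuard`, `t_active_calls`, `call_that_returns_normally`, `simulate_abandoned_call`, and `active_calls` are hypothetical and chosen only for illustration.

```cpp
#include <new>

// Hypothetical per-thread bookkeeping that the guard maintains.
thread_local int t_active_calls = 0;

// RAII guard: counts a call as active for as long as the guard lives.
class ActiveCallGuard {
public:
    ActiveCallGuard() { ++t_active_calls; }
    ~ActiveCallGuard() { --t_active_calls; }

    ActiveCallGuard(const ActiveCallGuard&) = delete;
    ActiveCallGuard& operator=(const ActiveCallGuard&) = delete;
};

// Normal path: the guard is a local variable, so its destructor
// runs when the function returns.
int call_that_returns_normally() {
    ActiveCallGuard guard;
    return t_active_calls;  // the count observed while the call is in progress
}

// Stand-in for a call abandoned mid-stream: the constructor runs,
// but the matching destructor never does. Placement new into a
// static buffer keeps this well-defined in standard C++.
void simulate_abandoned_call() {
    alignas(ActiveCallGuard) static unsigned char storage[sizeof(ActiveCallGuard)];
    new (storage) ActiveCallGuard();
    // No destructor call here, as if the frame had simply been discarded.
}

// Reports the per-thread count as seen by later code on this thread.
int active_calls() {
    return t_active_calls;
}
```

After `call_that_returns_normally()`, the count returns to its starting value. After `simulate_abandoned_call()`, the count stays one higher than it should be, and nothing in the guard's own code is at fault. Any later code that trusts the count is now working from corrupted state.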

And that’s why the RAII class was not working. The code wasn’t holding it wrong. Rather, the third party code broke the rules.

Now, the third party code had this problem all along: The rogue exception caused an entire RPC call to become abandoned mid-stream, resulting in lots of leaks and missing cleanup. What changed is that there’s new code in the RPCSS service that was counting on the cleanup to avoid a crash.

This particular anti-malware program must be somewhat popular because the problem recurs about once or twice a year with a different customer each time. This is the bitter spot² in bug frequency, because it’s not frequent enough that the problem remains on your mind whenever you get a mystery crash so you can open the discussion with the customer by saying, “Okay, before I spend too much time on this, can you tell me if you recently installed or reconfigured Contoso anti-malware?” On the other hand, the issue is frequent enough that it consumes a lot of time and effort to gather crash dumps and eventually realize that it’s that Contoso anti-malware problem again.

¹ Windows does not support detouring the operating system. This anti-malware software is doing unsupported things, but good luck explaining that to their customers.

² The bitter spot is the opposite of the sweet spot. Or is the opposite of the sweet spot the salty spot? The sour spot? Maybe it’s all of them. When you find yourself in this spot, you are probably going to be sour, bitter and probably a little salty.

10 comments


Richard Russell April 6, 2022

You left out the umami spot.
Henke37 April 6, 2022

Fighting games seem to have settled on “sour spot”.
Alois Kraus April 5, 2022

It could help to get a short wpr recording of the machine, which collects all loaded modules, in addition to a memory dump as the default action. I have had my own share of AV-induced issues and wrote https://github.com/Siemens-Healthineers/ETWAnalyzer to be able to quickly query all known AV vendors in the stacks. With proper stacktags, one can quickly find out whether the slowness of an application comes from SmartScreen, McAfee, or any other AV vendor. I plan to blog more about known patterns of performance degradation caused by AV engines, which are notoriously hard to find with other tools.
I assume that you have an automatic analyzer for memory dumps which scans for known problematic modules that should not be loaded in central system processes.

pete.d April 5, 2022

Yeah, reminds me of the full week I lost, around the 1996 time frame, when Microsoft IT mandated a particular anti-virus software (InocuLAN? I don’t recall for sure which third-party software it was) that wound up corrupting input files during compilation, generating compile-time errors that didn’t seem to correspond to any actual problem with the code. I wasted that whole week doing all kinds of different things to try to track down the problem in the code, until the mental light bulb lit up and I disabled the anti-virus software.

Decades later, I’m back to (barely) trusting Windows Defender to keep an eye on things. But I remain skeptical of any third-party component that considers it its job to insert itself into the middle of anything the OS is doing, and *especially* anything that purports to be anti-virus/malware. Seems like historically, the main thing software like that has managed to do is create really hard-to-understand-and-fix problems, and increase the attack surface of the computer in question.

David Wolff April 5, 2022

When I worked at Prime Computer, we had an automated dump-analysis tool that had rules about what things in a crash dump were clues to known crashes. For example, if a particular routine crashed, and the third parameter was invalid, it was a known crashing bug (fixed in a later release). Doesn’t Microsoft have something similar?
- Tim Weis April 12, 2022
  
  Windows Error Reporting, as I recall, (used to) categorize defects into buckets, where ideally each bucket would refer to the same core issue. “Microspeak: Bucket bugs, bucket spray, bug spray, and failure shift” goes over some of the challenges and illustrates the outcomes when things are less than ideal.
Jacob Manaker April 5, 2022 · Edited

Eventually, the team discovered that a third party anti-malware application had injected itself into the RPCSS service and was detouring some functions.

The first time I read this, I missed the “anti-” in “anti-malware application.” It made a lot more sense that way.
- Nathan Williams April 6, 2022
  
  Anti-malware software is the most common type of malware after all.
紅樓鍮 April 5, 2022 · Edited

Bonus chatter: Rust considers RAII leaks (which Rust supports officially through e.g. std::mem::forget) to be safe. This is normally OK, until someone tries to rely on destructors being run to guarantee safety, like std::thread::scoped did (crates.io thread-scoped).

Rust reconciled them with a new continuation-passing style API (std::thread::scope), which by design prevents forgetting the join guard.
(Not that it would help the case of an unhandled structured exception.)

Bonus bonus chatter: Doing anything that involves LoadLibrary’ing is considered inherently unsafe in Rust, since you can’t guarantee the contents of the DLL. And what’s more fun, in many scenarios there are just no Rust-safe alternatives to calling LoadLibrary. Meaning there exist applications for which it’s impossible to be Rust-safe, for legitimate reasons.

I think this makes me rethink the whole thing about “program correctness”: the concept is just difficult to apply once you go out of the language abstract machine level. If you have the privileges, you can do everything to drive a proven-correct Lean program nuts, at every level from patching the OS loader to injecting rogue threads. Hey, even Rust is using LLVM for codegen, which is not proven-correct (and is known to have bugs).

Henry Skoglund April 5, 2022

When helping someone with a weirdly behaving program, in the old days #1 advice used to be “Upgrade your video driver” but today I always start with “Do you have any 3rd-party antivirus program installed” because sadly they are often the culprit.
Also thank you for making me laugh: “… The code wasn’t holding it wrong. …” 🙂