December 3rd, 2024

Tricks from product support: We’re not smart enough to debug the problem, can you help us?

Some time ago, I shared the trick of asking customers to blow the dust out of the connector. Today I’m sharing a trick I learned from the enterprise product support team.

It can happen that investigating a problem reveals that the failure occurred when calling a function that has been patched or hooked. (In the case of enterprise customers, the offender is typically some “advanced anti-malware software” that they paid a lot of money for.) The code running in the hook ends up doing something sketchy, the most common example of which is hooking a low-level function and then having the hook call a higher-level function, resulting in a deadlock. A ridiculous example would be hooking Heap­Alloc (a low-level memory allocation function) and calling Message­Box (a high-level user interface function). Another example would be hooking a function in a way that changes unspecified but observable state, such as changing the value returned by Get­Last­Error when the function succeeds.¹
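To make the deadlock shape concrete, here is a minimal sketch in Python rather than real Windows code. Everything in it is a hypothetical stand-in: `low_level_alloc` plays the role of the low-level function, `high_level_message` plays the role of the high-level one, `ui_lock` is an invented lock that the high-level function holds while it works, and `function_table` is the indirection that the hook patches. The lock acquisition uses a timeout only so that the demonstration reports the problem instead of hanging.

```python
import threading

# Hypothetical lock held by the high-level function while it works.
# threading.Lock is non-reentrant: the owning thread cannot take it twice.
ui_lock = threading.Lock()


def low_level_alloc(size):
    """Stand-in for a low-level allocation function."""
    return bytearray(size)


# The indirection that a hooking tool would patch.
function_table = {"alloc": low_level_alloc}

# What the hook saw each time it called up into high-level code.
hook_results = []


def high_level_message(text):
    """Stand-in for a high-level function that allocates while holding a lock.

    Returns False if the lock could not be acquired within the timeout.
    Real code would have no timeout and would simply hang.
    """
    if not ui_lock.acquire(timeout=0.2):
        return False
    try:
        function_table["alloc"](len(text))
        return True
    finally:
        ui_lock.release()


def sketchy_hook(size):
    """Hook on the low-level function that calls a higher-level function."""
    hook_results.append(high_level_message("allocating %d bytes" % size))
    return low_level_alloc(size)


def demo(hooked):
    """Run one high-level call, with or without the hook installed."""
    function_table["alloc"] = sketchy_hook if hooked else low_level_alloc
    hook_results.clear()
    outer_ok = high_level_message("hello")
    return outer_ok, list(hook_results)
```

With these definitions, `demo(False)` returns `(True, [])`, and `demo(True)` returns `(True, [False])`: the call made from inside the hook could not get the lock, because the thread that owns the lock is the one now stuck inside the hook. Take away the timeout and that is your hang.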

The trick here is not to tell the customer, “We think the problem is being caused by your anti-malware software.” That is something they don’t want to hear. After all, they paid a lot of money for that anti-malware software, and a recommendation of the form “throw away a lot of money you already spent” is not going to land well. (See also: sunk cost fallacy.)

Instead, tell the customer, “It looks like the anti-malware software is interfering with our ability to debug the problem. Can you temporarily turn it off, then reproduce the problem following the same instructions, with the same tracing and crash dump collection steps? Once you’ve done that, you can turn the software back on.”

In other words, “It’s not you. It’s me.” We are trying to debug the problem in our software, and we fully acknowledge that it’s a problem in our software, but we’re not smart enough to do it while that other software is running, so can you just help us out and remove some of the distractions?

I’m told that what usually² happens is that the customer, for some mysterious reason, is unable to get the problem to occur when the anti-malware software is disabled. “Wow, that’s weird.”

Sometimes the customer gets the hint and opens a support ticket with the anti-malware vendor. Sometimes we have to suggest to them, “Why don’t you check if there’s an update available for your anti-malware software?”

¹ A common example of this is calling Tls­Get­Value from inside the hook, which has a documented side effect of clearing the last error code.

² Usually, but not always. Sometimes, the anti-malware software is not actually the source of the problem. But we’re not lying! Removing the anti-malware software from the equation does simplify the debugging: Since we don’t have the symbols for the anti-malware software, the stack traces are cluttered with mystery frames, and sometimes the frames are so badly messed up that the debugger can’t find the other end. Removing the anti-malware software produces cleaner and more complete stack frames, which definitely makes the analysis easier.
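The scenario in footnote 1 can also be sketched as a small Python simulation. None of this is real Windows code: `set_last_error` and `get_last_error` model a per-thread last error code, `helper_with_side_effect` is a hypothetical stand-in for a call that resets the last error to zero when it succeeds, and the assumption that the unhooked function leaves the last error alone on success is exactly the kind of unspecified behavior that callers end up relying on.

```python
import threading

# Per-thread storage that models the thread's last error code.
_state = threading.local()


def set_last_error(code):
    _state.last_error = code


def get_last_error():
    return getattr(_state, "last_error", 0)


def low_level_alloc(size):
    """Stand-in for the hooked function.

    Assumption for this sketch: on success it leaves the last error alone.
    """
    return bytearray(size)


def helper_with_side_effect():
    """Stand-in for a helper that clears the last error when it succeeds."""
    set_last_error(0)
    return "per-thread hook data"


def careless_hook(original):
    """Wrap a function with a hook that lets the helper's side effect leak."""
    def wrapper(*args):
        helper_with_side_effect()
        return original(*args)
    return wrapper


def careful_hook(original):
    """Wrap a function with a hook that restores the last error it found."""
    def wrapper(*args):
        saved = get_last_error()
        try:
            helper_with_side_effect()
        finally:
            set_last_error(saved)
        # The original runs after the restore, so it sees the caller's state.
        return original(*args)
    return wrapper


def observe(alloc):
    """Set a last error, call the allocator, and report what the caller sees."""
    set_last_error(5)
    alloc(16)
    return get_last_error()
```

Here `observe(low_level_alloc)` and `observe(careful_hook(low_level_alloc))` both return `5`, while `observe(careless_hook(low_level_alloc))` returns `0`. The function still "works" under the careless hook, but a caller that peeks at the last error afterward now sees something different.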

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

2 comments

  • Marcel Kilgus

    A few years ago I was consulted at work when a long running DAQ task (data acquisition, essentially sampling a voltage at high frequency) resulted in suspicious (like repeated) samples right around the 10 minutes mark. Turns out this was the time the "cloud based threat detection" determined that the executable was harmless after all and tried to unhook from it. No crash or anything, just a few garbled measurements. Solution was to sign the...

  • Mike Morrison 5 days ago

    I've used this trick myself, and quite recently too, in supporting enterprise software. In my most recent case, an anti-malware package - hmm, let's call it "GroupAssault" - hooked the LoadLibraryW function to run some of its own code whenever a program loaded a DLL. Some of my software wouldn't install because this detour crashed the program while loading the installer's DLLs. I saw the anti-malware's DLLs in the stack traces of...
