July 18th, 2025
like1 reaction

The case of the invalid instruction exception on an instruction that should never have executed

The image processing folks added specialized AVX2 versions of their code, but found that it was crashing with an illegal instruction exception. The code went something like this:

void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using AVX-only instructions ⟧
    ⟦ such as _mm256_cvtepu8_epi16 ⟧
}

void SwizzleSSE4(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using SSE4 instructions ⟧
    ⟦ such as _mm_cvtepu8_epi16 ⟧
}

bool hasAVX2; // initialized elsewhere

void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
{
    if (hasAVX2) {
        SwizzleAVX2(source, destination, count);
    } else {
        SwizzleSSE4(source, destination, count);
    }
}

This looks good, doesn’t it? We check whether AVX2 instructions are available, and if so, we use the AVX2 version; otherwise we use the SSE4 version.

But in fact, this code crashes with an invalid instruction exception on systems that do not have AVX2. How can that be?

Compiler optimization.

According to the “as-if” rule, the compiler is permitted to perform any optimization that a program cannot legitimately detect, where “legitimately” means “within the rules of the language”.

What happened is that the compiler first inlined the SwizzleAVX2 and SwizzleSSE4 functions into the Swizzle function, and then it reordered the instructions so that some of the AVX2 instructions from SwizzleAVX2 were moved in front of the test of the hasAVX2 variable. For example, maybe SwizzleAVX2 started by setting some registers to zero. The compiler might have decided to do this because profiling revealed that hasAVX2 is usually true, so it wants to get the registers ready in anticipation of using them for the rest of the SwizzleAVX2 function.

Unfortunately, the compiler doesn’t realize that our test of hasAVX2 was specifically intended to prevent any AVX2 instructions from running. The concept of “instructions that might not be available” does not arise in the C or C++ language specifications, so there is nothing in the language itself that addresses the matter.

There are some directives you can use to tell the compiler that certain memory operations must occur in a specific order. For example, you can use interlocked operations with acquire or release semantics, or you can use std::atomic_thread_fence, or you can use explicit memory barriers.

However, none of them are of use here because the offending instruction isn’t a memory instruction, so memory ordering directives have no effect.

The (somewhat unsatisfying) solution was to mark the AVX version as noinline so that the compiler cannot reorder instructions out of it.

__declspec(noinline)
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using AVX-only instructions ⟧
    ⟦ such as _mm256_cvtepu8_epi16 ⟧
}
Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

  • Baltasar García 19 hours ago

    Does this happen with -O2? And it happens only in C++ or in C as well?

  • Henry Skoglund 2 days ago

    Couldn’t you use the short-circuiting rule of if statements to guarantee non-execution of the 2nd condition:

    change SwizzleAVX2 to return a boolean (dummy) true value

    and then

    void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
    {
        if ((hasAVX2) && (SwizzleAVX2(source, destination, count)))
            return;
      
        SwizzleSSE4(source, destination, count);
    }
    • Raymond ChenMicrosoft employee Author

      That doesn’t help because a separate “if (hasAVX2)” is already short-circuiting: It’s a separate statement altogether!

      The standard permits compilers to reorder code as long as observable behavior is not affected. The problem is that the AVX2 instructions inside SwizzleAVX2 have side effects not covered by the standard’s definition of “observable behavior” (namely, crashing on certain hardware).

      • Erik Fjeldstrom

        You would probably have the same problem, but further down: what gives you the HasAVX2() that is used to choose which function to assign?

        In theory, if “register” was still allowed (and its semantics were honoured, which has never really happened) that would probably work.

      • Shawn Van Ness · Edited

        @Robin Hoffmann Function-pointer is how I’ve seen this done, in some codebases. I find it similar in spirit to using GetProcAddress to avoid static-linking an API that’s not available on all systems.

        Is it a “branch” or just an indirect call? I haven’t tested but I would expect modern CPUs to bench about the same.. (a) fetch a bool and do a conditional jmp then a direct call, vs (b) fetch a function ptr and do a indirect call.

      • Raymond ChenMicrosoft employee Author 19 hours ago

        @Robert Hoffmann: You could replace it with a function pointer, but that would make the branch unpredictable if there is no entry in the branch predictor history (0% success instead of 90% success if profiling hints the test as “AVX2 likely”, which is what happened here), and it costs you a CFG test (to protect against a security vulnerability if somebody could overwrite the function pointer.) In practice, there are over a dozen of these functions.

      • Robin Hoffmann · Edited

        What if the Swizzle function is replaced by a function pointer?
        This would also remove the need for the hasAVX2 variable, unless it is needed somewhere else.

  • Joshua Hudson 2 days ago

    I’m used to there being one more rule, don’t reorder asm blocks. Too bad for the compiler intrinsics 🙁

    Sometimes I wonder if the old-school solutions are better. In this case the old-school solution is distribution media has several builds with different CPU options and installs the right one. Works great when shipping on CD. How many am I used to seeing? About four.

    You typically don’t bother with whole program optimization like this; just the hotspots, which are broken down into their own dlls.

  • Robin G 2 days ago

    This is a horrible minefield, and then people who don't understand it get upset when a piece of software (like Windows maybe) suddenly has a new requirement that the CPU supports a certain instruction set. Coding for multiple instruction sets is hard and ignoring new instruction sets leaves performance gains unrealized...

    I haven't met this particular issue with the optimization, but ran in to another nasty one. If you have one or more .cpp files that have AVX (or some other optional instruction set enabled), and they include/use stuff from the standard library like std::string, the compiler will busily compile implementations...

    Read more