The case of the invalid instruction exception on an instruction that should never have executed

Raymond Chen

The image processing folks added specialized AVX2 versions of their code, but found that it was crashing with an illegal instruction exception. The code went something like this:

void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using AVX2-only instructions ⟧
    ⟦ such as _mm256_cvtepu8_epi16 ⟧
}

void SwizzleSSE4(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using SSE4 instructions ⟧
    ⟦ such as _mm_cvtepu8_epi16 ⟧
}

bool hasAVX2; // initialized elsewhere

void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
{
    if (hasAVX2) {
        SwizzleAVX2(source, destination, count);
    } else {
        SwizzleSSE4(source, destination, count);
    }
}

This looks good, doesn’t it? We check whether AVX2 instructions are available, and if so, we use the AVX2 version; otherwise we use the SSE4 version.
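The article says only that hasAVX2 is "initialized elsewhere". As a rough sketch of what that initialization could look like, the hypothetical helper below (DetectAVX2 is an invented name, not from the original code) queries the CPU through the MSVC `__cpuid`/`__cpuidex`/`_xgetbv` intrinsics, or through `__builtin_cpu_supports` on GCC and Clang. On any other compiler or architecture it conservatively reports false.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     // __cpuid, __cpuidex
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical helper: the article does not show how hasAVX2 is set.
bool DetectAVX2()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4]; // EAX, EBX, ECX, EDX

    __cpuid(regs, 0);
    if (regs[0] < 7) return false;              // CPUID leaf 7 not available

    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0; // ECX bit 27
    const bool avx     = (regs[2] & (1 << 28)) != 0; // ECX bit 28
    if (!osxsave || !avx) return false;

    // XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
    if ((_xgetbv(0) & 0x6) != 0x6) return false;

    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;           // EBX bit 5 = AVX2
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false; // unknown compiler or non-x86 target: assume no AVX2
#endif
}
```

Under these assumptions, the flag from the listing above could be defined as `bool hasAVX2 = DetectAVX2();`.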

But in fact, this code crashes with an invalid instruction exception on systems that do not have AVX2. How can that be?

Compiler optimization.

According to the “as-if” rule, the compiler is permitted to perform any optimization that a program cannot legitimately detect, where “legitimately” means “within the rules of the language”.

What happened is that the compiler first inlined the SwizzleAVX2 and SwizzleSSE4 functions into the Swizzle function, and then it reordered the instructions so that some of the AVX2 instructions from SwizzleAVX2 were moved in front of the test of the hasAVX2 variable. For example, maybe SwizzleAVX2 started by setting some registers to zero. The compiler might have decided to do this because profiling revealed that hasAVX2 is usually true, so it wants to get the registers ready in anticipation of using them for the rest of the SwizzleAVX2 function.

Unfortunately, the compiler doesn’t realize that our test of hasAVX2 was specifically intended to prevent any AVX2 instructions from running. The concept of “instructions that might not be available” does not arise in the C or C++ language specifications, so there is nothing in the language itself that addresses the matter.

There are some directives you can use to tell the compiler that certain memory operations must occur in a specific order. For example, you can use interlocked operations with acquire or release semantics, or you can use std::atomic_thread_fence, or you can use explicit memory barriers.
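For reference, here is a minimal sketch of such a directive in use. The names payload, ready, Producer, and Consumer are invented for illustration. The release fence keeps the write to payload ordered before the flag store, and the acquire fence keeps the read of payload ordered after the flag load. Everything being ordered here is a memory access.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain (non-atomic) data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Intended to run on one thread.
void Producer()
{
    payload = 42;
    // The write to payload may not be reordered after the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Intended to run on another thread.
int Consumer()
{
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // The read of payload may not be reordered before the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}
```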

However, none of them are of use here because the offending instruction isn’t a memory instruction, so memory ordering directives have no effect.

The (somewhat unsatisfying) solution was to mark the AVX2 version as noinline. Once the function can no longer be inlined into Swizzle, the compiler cannot hoist its instructions above the test of hasAVX2.

__declspec(noinline)
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    ⟦ do stuff using AVX2-only instructions ⟧
    ⟦ such as _mm256_cvtepu8_epi16 ⟧
}
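`__declspec(noinline)` is the Visual C++ spelling. The sketch below assumes a hypothetical SWIZZLE_NOINLINE macro that expands to it, or to the GCC/Clang `__attribute__((noinline))`. The function bodies are scalar stand-ins (a red/blue channel swap invented for this example) so that the sketch runs on any machine; the real functions would use the AVX2 and SSE4 intrinsics.

```cpp
#include <cstdint>

// Hypothetical macro name; each branch is that compiler's own spelling.
#if defined(_MSC_VER)
#define SWIZZLE_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define SWIZZLE_NOINLINE __attribute__((noinline))
#else
#define SWIZZLE_NOINLINE
#endif

// Placeholder operation: swap the low byte with the third byte of a pixel.
static uint32_t SwapRedBlue(uint32_t p)
{
    return (p & 0xFF00FF00u) | ((p & 0x00FF0000u) >> 16) | ((p & 0x000000FFu) << 16);
}

// Stand-in body; the real version would use AVX2 intrinsics.
SWIZZLE_NOINLINE
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i) destination[i] = SwapRedBlue(source[i]);
}

// Stand-in body; the real version would use SSE4 intrinsics.
void SwizzleSSE4(uint32_t* source, uint32_t* destination, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i) destination[i] = SwapRedBlue(source[i]);
}

bool hasAVX2 = false; // assumed to be set by CPU detection at startup

void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
{
    if (hasAVX2) {
        SwizzleAVX2(source, destination, count); // stays an out-of-line call
    } else {
        SwizzleSSE4(source, destination, count);
    }
}
```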

20 comments

Ivan Kljajic July 25, 2025

Yeah, cool. I was thinking that maybe the extern declaration would keep that function away from the optimiser and stop it from being inlined.
- Raymond Chen Author July 25, 2025
  
  In my experience, all C/C++ compilers inline variable accesses. (The Applesoft Compiler did generate function calls to access variables, but that was a BASIC compiler, not a C/C++ compiler, and it did so to minimize code size, not due to any functional requirement.)
Ivan Kljajic July 24, 2025

What if the body of that function declared the variable as extern, as if it lived in a different file?
- Raymond Chen Author July 24, 2025
  
  Unclear how extern-ness of the hasAVX2 variable could affect the code generation. (In the original code that ran into the problem, the variable was indeed extern.)
George Tokmaji July 22, 2025 · Edited
Would `std::atomic_signal_fence` help?
- Raymond Chen Author July 22, 2025
  
  No. atomic_signal_fence is about memory ordering, not execution ordering.
Jonathan Wilson July 21, 2025

One way I have seen this done in the past is that the main function calls through a function pointer and the code that detects which CPU is being used sets the function pointer to the right version.
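A minimal sketch of that function-pointer approach, under stated assumptions: SwizzleFn, SelectSwizzle, and g_swizzle are invented names, DetectAVX2 is a placeholder that always reports false, and the two implementations are plain copy loops standing in for the real AVX2 and SSE4 code.

```cpp
#include <cstdint>

using SwizzleFn = void (*)(uint32_t*, uint32_t*, uint32_t);

// Stand-in body; the real version would use AVX2 intrinsics.
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i) destination[i] = source[i];
}

// Stand-in body; the real version would use SSE4 intrinsics.
void SwizzleSSE4(uint32_t* source, uint32_t* destination, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i) destination[i] = source[i];
}

// Placeholder: a real build would query the CPU here.
bool DetectAVX2() { return false; }

// Picks the implementation that matches the detected CPU.
SwizzleFn SelectSwizzle(bool hasAVX2)
{
    return hasAVX2 ? SwizzleAVX2 : SwizzleSSE4;
}

// Chosen once, during startup.
SwizzleFn g_swizzle = SelectSwizzle(DetectAVX2());

void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
{
    // Indirect call: callers never name the AVX2 function directly.
    g_swizzle(source, destination, count);
}
```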
- Simon Farnsworth July 21, 2025
  
  This is what GCC calls Function Multiversioning when it's done for you by the compiler; note that there's a bunch of complexity hiding here, since you still want inlining to take place when practical (e.g. if a function has baseline SSE2, AVX, and AVX2 versions, and it calls a function that has SSE2 and AVX versions, you want the AVX and AVX2 callers to be able to inline the AVX function they call, and you want the SSE2 version to inline the SSE2 version of the function it calls).
  
  - Matt D. July 21, 2025 · Edited
    
    FWIW, that’s applicable to the `__attribute__((target("OPTIONS")))` attribute in Clang and GCC:
    - https://clang.llvm.org/docs/AttributeReference.html#target
    - https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target-function-attribute
    
    In particular:
    
    “On the x86, the inliner does not inline a function that has different target options than the caller, unless the callee has a subset of the target options of the caller. For example a function declared with target("sse3") can inline a function with target("sse2"), since -msse3 implies -msse2.
    
    Besides the basic rule, when a function specifies target("arch=ARCH") or target("tune=TUNE") attribute, the inlining rule will be different. It allows inlining of a function with default -march=x86-64 and -mtune=generic specified, or a function that has a subset of ISA features and marked with always_inline.”
    
    https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#Inlining-rules-1
    
Solomon Ucko July 20, 2025

Could putting an empty inline assembly block, marked volatile if possible, at the start of SwizzleAVX2 work?
- Raymond Chen Author July 20, 2025
  
  Visual C++ does not support inline assembly in 64-bit code mode.
Baltasar García July 19, 2025

Does this happen with -O2? And does it happen only in C++, or in C as well?
Henry Skoglund July 18, 2025
Couldn’t you use the short-circuiting rule of if statements to guarantee non-execution of the 2nd condition:

change SwizzleAVX2 to return a boolean (dummy) true value

and then
```
void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count)
{
    if ((hasAVX2) && (SwizzleAVX2(source, destination, count)))
        return;
  
    SwizzleSSE4(source, destination, count);
}
```
- Raymond Chen Author July 18, 2025
  
  That doesn’t help because a separate “if (hasAVX2)” is already short-circuiting: It’s a separate statement altogether!
  
  The standard permits compilers to reorder code as long as observable behavior is not affected. The problem is that the AVX2 instructions inside SwizzleAVX2 have side effects not covered by the standard’s definition of “observable behavior” (namely, crashing on certain hardware).
  - Erik Fjeldstrom July 19, 2025
    
    You would probably have the same problem, but further down: what gives you the HasAVX2() that is used to choose which function to assign?
    
    In theory, if “register” was still allowed (and its semantics were honoured, which has never really happened) that would probably work.
  - Shawn Van Ness July 19, 2025 · Edited
    
    @Robin Hoffmann Function-pointer is how I’ve seen this done, in some codebases. I find it similar in spirit to using GetProcAddress to avoid static-linking an API that’s not available on all systems.
    
    Is it a “branch” or just an indirect call? I haven’t tested, but I would expect modern CPUs to bench about the same: (a) fetch a bool and do a conditional jmp, then a direct call, vs. (b) fetch a function pointer and do an indirect call.
  - Raymond Chen Author July 19, 2025
    
    @Robin Hoffmann: You could replace it with a function pointer, but that would make the branch unpredictable if there is no entry in the branch predictor history (0% success instead of 90% success if profiling hints the test as “AVX2 likely”, which is what happened here), and it costs you a CFG test (to protect against a security vulnerability if somebody could overwrite the function pointer). In practice, there are over a dozen of these functions.
  - Robin Hoffmann July 18, 2025 · Edited
    
    What if the Swizzle function is replaced by a function pointer?
    This would also remove the need for the hasAVX2 variable, unless it is needed somewhere else.
Joshua Hudson July 18, 2025

I’m used to there being one more rule: don’t reorder asm blocks. Too bad for the compiler intrinsics 🙁

Sometimes I wonder if the old-school solutions are better. In this case, the old-school solution is for the distribution media to carry several builds with different CPU options, and for the installer to install the right one. That works great when shipping on CD. How many builds am I used to seeing? About four.

You typically don’t bother with whole program optimization like this; just the hotspots, which are broken down into their own dlls.
Robin G July 18, 2025

This is a horrible minefield, and then people who don’t understand it get upset when a piece of software (like Windows maybe) suddenly has a new requirement that the CPU supports a certain instruction set. Coding for multiple instruction sets is hard and ignoring new instruction sets leaves performance gains unrealized…

I haven’t met this particular issue with the optimization, but I ran into another nasty one. If you have one or more .cpp files that have AVX (or some other optional instruction set) enabled, and they include or use things from the standard library like std::string, the compiler will busily compile implementations of, say, the std::string constructor with AVX optimization. Then the linker comes along and picks one implementation of that constructor from among all the compiled versions, and it may pick the one using AVX. After all, all the implementations are supposed to be the same. Now your code crashes on a non-AVX CPU the first time it tries to construct a string.
