The image processing folks added specialized AVX2 versions of their code, but found that it was crashing with an illegal instruction exception. The code went something like this:
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count) { ⟦ do stuff using AVX-only instructions ⟧ ⟦ such as _mm256_cvtepu8_epi16 ⟧ } void SwizzleSSE4(uint32_t* source, uint32_t* destination, uint32_t count) { ⟦ do stuff using SSE4 instructions ⟧ ⟦ such as _mm_cvtepu8_epi16 ⟧ } bool hasAVX2; // initialized elsewhere void Swizzle(uint32_t* source, uint32_t* destination, uint32_t count) { if (hasAVX2) { SwizzleAVX2(source, destination, count); } else { SwizzleSSE4(source, destination, count); } }
This looks good, doesn’t it? We check whether AVX2 instructions are available, and if so, we use the AVX2 version; otherwise we use the SSE4 version.
But in fact, this code crashes with an invalid instruction exception on systems that do not have AVX2. How can that be?
Compiler optimization.
According to the “as-if” rule, the compiler is permitted to perform any optimization that a program cannot legitimately detect, where “legitimately” means “within the rules of the language”.
What happened is that the compiler first inlined the SwizzleAVX2
and SwizzleSSE4
functions into the Swizzle
function, and then it reordered the instructions so that some of the AVX2 instructions from SwizzleAVX2
were moved in front of the test of the hasAVX2
variable. For example, maybe SwizzleAVX2
started by setting some registers to zero. The compiler might have decided to do this because profiling revealed that hasAVX2
is usually true, so it wants to get the registers ready in anticipation of using them for the rest of the SwizzleAVX2
function.
Unfortunately, the compiler doesn’t realize that our test of hasAVX2
was specifically intended to prevent any AVX2 instructions from running. The concept of “instructions that might not be available” does not arise in the C or C++ language specifications, so there is nothing in the language itself that addresses the matter.
There are some directives you can use to tell the compiler that certain memory operations must occur in a specific order. For example, you can use interlocked operations with acquire or release semantics, or you can use std::
, or you can use explicit memory barriers.
However, none of them are of use here because the offending instruction isn’t a memory instruction, so memory ordering directives have no effect.
The (somewhat unsatisfying) solution was to mark the AVX version as noinline so that the compiler cannot reorder instructions out of it.
__declspec(noinline)
void SwizzleAVX2(uint32_t* source, uint32_t* destination, uint32_t count)
{
⟦ do stuff using AVX-only instructions ⟧
⟦ such as _mm256_cvtepu8_epi16 ⟧
}
Does this happen with -O2? And it happens only in C++ or in C as well?
Couldn’t you use the short-circuiting rule of if statements to guarantee non-execution of the 2nd condition:
change SwizzleAVX2 to return a boolean (dummy) true value
and then
That doesn’t help because a separate “if (hasAVX2)” is already short-circuiting: It’s a separate statement altogether!
The standard permits compilers to reorder code as long as observable behavior is not affected. The problem is that the AVX2 instructions inside SwizzleAVX2 have side effects not covered by the standard’s definition of “observable behavior” (namely, crashing on certain hardware).
You would probably have the same problem, but further down: what gives you the HasAVX2() that is used to choose which function to assign?
In theory, if “register” was still allowed (and its semantics were honoured, which has never really happened) that would probably work.
@Robin Hoffmann Function-pointer is how I’ve seen this done, in some codebases. I find it similar in spirit to using GetProcAddress to avoid static-linking an API that’s not available on all systems.
Is it a “branch” or just an indirect call? I haven’t tested but I would expect modern CPUs to bench about the same.. (a) fetch a bool and do a conditional jmp then a direct call, vs (b) fetch a function ptr and do a indirect call.
@Robert Hoffmann: You could replace it with a function pointer, but that would make the branch unpredictable if there is no entry in the branch predictor history (0% success instead of 90% success if profiling hints the test as “AVX2 likely”, which is what happened here), and it costs you a CFG test (to protect against a security vulnerability if somebody could overwrite the function pointer.) In practice, there are over a dozen of these functions.
What if the Swizzle function is replaced by a function pointer?
This would also remove the need for the hasAVX2 variable, unless it is needed somewhere else.
I’m used to there being one more rule, don’t reorder asm blocks. Too bad for the compiler intrinsics 🙁
Sometimes I wonder if the old-school solutions are better. In this case the old-school solution is distribution media has several builds with different CPU options and installs the right one. Works great when shipping on CD. How many am I used to seeing? About four.
You typically don’t bother with whole program optimization like this; just the hotspots, which are broken down into their own dlls.
This is a horrible minefield, and then people who don't understand it get upset when a piece of software (like Windows maybe) suddenly has a new requirement that the CPU supports a certain instruction set. Coding for multiple instruction sets is hard and ignoring new instruction sets leaves performance gains unrealized...
I haven't met this particular issue with the optimization, but ran in to another nasty one. If you have one or more .cpp files that have AVX (or some other optional instruction set enabled), and they include/use stuff from the standard library like std::string, the compiler will busily compile implementations...