I told the Microsoft Visual C++ compiler not to generate AVX instructions, but it did it anyway!


A customer passed the /arch:SSE2 flag to the Microsoft Visual C++ compiler, which means “Enable use of instructions available with SSE2-enabled CPUs.” In particular, the customer did not pass the /arch:SSE4 flag,¹ so they did not enable the use of SSE4 instructions.

And then they did this:

#include <smmintrin.h> // SSE4.1 intrinsics; also pulls in the SSE2 ones

void something()
{
    __m128i v = _mm_load_si128(&mem);
    // ... more SSE2 stuff ...
    v = _mm_insert_epi32(v, alpha, 3); // PINSRD: SSE4.1, not SSE2
    // ... more SSE2 stuff ...
}

The _mm_insert_epi32() intrinsic maps to the PINSRD instruction, which was introduced in SSE4.1, not SSE2.

To the customer’s surprise, this code not only compiled, it even ran! The customer wanted to know what is happening. Did the compiler convert the _mm_insert_epi32() into an equivalent series of SSE2 instructions?

No, the compiler didn’t do that. You explicitly requested an SSE4 instruction, so the compiler honored your request. The /arch:SSE2 flag tells the compiler not to use any instructions beyond SSE2 in the code it generates on its own, such as when it autovectorizes a loop or expands a memcpy inline. But if you call an intrinsic explicitly, you get the instruction you asked for.

I guess the option could be more accurately (and verbosely) named “Enable automatic use of instructions available with SSE2-enabled CPUs,” because what it controls is whether the compiler will use those instructions of its own volition.

The customer happened to test their program on a CPU that supported SSE4, so the instruction worked. If they had run it on a CPU that supported SSE2 but not SSE4, it would have crashed with an illegal instruction exception.

The reason SSE4 intrinsics are still allowed even in SSE2 mode is that you might have identified some performance-sensitive operations and written two versions of the code, one that uses SSE2 intrinsics, and another that uses SSE4 intrinsics, choosing between the two at runtime based on a processor capability check.

The compiler won’t generate any SSE4 instructions on its own, so your code is safe on SSE2 systems. When you detect an SSE4 system, you can explicitly call the SSE4 code paths.
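The dispatch pattern from the last two paragraphs can be sketched in C++ as follows. This is a minimal illustration, not the customer’s code: it assumes an x86/x64 target, and the helpers set_lane3_sse2, set_lane3_sse41, set_lane3, and cpu_has_sse41 are hypothetical names. Both paths replace 32-bit lane 3 of a vector, the same job as the _mm_insert_epi32(v, alpha, 3) call above. The SSE2 path writes the two 16-bit halves with _mm_insert_epi16 (PINSRW, which is available in SSE2). The capability check reads CPUID leaf 1, where ECX bit 19 reports SSE4.1, through MSVC’s __cpuid or GCC/Clang’s __get_cpuid. GCC and Clang also require the SSE4.1 function to carry a target("sse4.1") attribute; MSVC, as the article explains, lets you use the intrinsic without one.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (_mm_insert_epi16)
#include <smmintrin.h>  // SSE4.1 intrinsics (_mm_insert_epi32)

#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#define SSE41_TARGET    // MSVC accepts SSE4.1 intrinsics without annotation
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#endif

// Returns true if CPUID leaf 1 reports SSE4.1 support (ECX bit 19).
static bool cpu_has_sse41()
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    return (regs[2] & (1 << 19)) != 0;
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & (1u << 19)) != 0;
#endif
}

// SSE2-only path: 32-bit lane 3 is 16-bit lanes 6 (low half) and 7 (high half).
static __m128i set_lane3_sse2(__m128i v, int alpha)
{
    unsigned u = static_cast<unsigned>(alpha);
    v = _mm_insert_epi16(v, static_cast<int>(u & 0xFFFFu), 6);
    v = _mm_insert_epi16(v, static_cast<int>(u >> 16), 7);
    return v;
}

// SSE4.1 path: a single PINSRD. Call only after cpu_has_sse41() returns true.
SSE41_TARGET
static __m128i set_lane3_sse41(__m128i v, int alpha)
{
    return _mm_insert_epi32(v, alpha, 3);
}

// Picks an implementation based on a one-time runtime CPU check.
static __m128i set_lane3(__m128i v, int alpha)
{
    static const bool has_sse41 = cpu_has_sse41();
    return has_sse41 ? set_lane3_sse41(v, alpha) : set_lane3_sse2(v, alpha);
}
```

The CPUID result is cached in a function-local static, so each call costs only a predictable branch. Because the SSE4.1 instruction is reached only after the check succeeds, a program built this way with /arch:SSE2 still runs correctly on SSE2-only CPUs.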

¹ As commenter Danielix Klimax noted, there is no actual /arch:SSE4 option. Please interpret the remark in the spirit it was intended: “The customer did not pass any flags that would enable SSE4 instructions.”



    • Vk TestA

      People need this explained because it’s surprising. Consider a saner compiler, e.g., clang: https://godbolt.org/z/6K57Tj

      [[gnu::target("default")]] void something(int alpha) {
        __m128i v = _mm_load_si128(&mem);
        // … SSE2 version …
      }

      [[gnu::target("sse4.1")]] void something(int alpha) {
        __m128i v = _mm_load_si128(&mem);
        // using SSE4.1 is OK here.
        mem = _mm_insert_epi32(v, alpha, 3);
      }

      void test_something() {
        // The right function would be picked automatically.
      }

      There you can easily write different versions of a function, and the compiler will do all the right things: stop compilation if you use an improper intrinsic, pick the right version if you support more than one CPU, etc.

      As you can see, the behavior MSVC exhibits is not only unnecessary, it’s obviously harmful: if you are planning to write two versions of the code, you want to be sure that the one that is not supposed to use SSE4… doesn’t use it.

      MSVC fails even at the simple detection phase… This is not surprising, though: MSVC was never all that good as a pure C++ compiler; its strength lies in tight integration with other Visual Studio tools… some of which are superb and leave anything you may find on other platforms in the dust.

  • Danielix Klimax

    One thing: there is no /arch:SSE4. The compiler supports only IA32, SSE, SSE2, AVX, AVX2, and AVX512 (the first three are valid only for x86 compilation). That means the customer wouldn’t be able to use SSE3, SSSE3, or SSE4.x at all if the arch flag worked the way they thought…

  • ‪ ‪

    GCC and Clang behave as customers expect.
    Well, I prefer the behavior of MSVC.

    • Vk TestA

      Why is it preferable? I think in this particular case “explicit is better than implicit.” And it’s not hard to mark the functions where you really need something not supported in the mode selected by the command-line switch with [[gnu::target("sse4.1")]].

      • Danielix Klimax

        Hm, how does it handle non-SSE instructions (3DNow!, XOP/FMA4, (V)AES, SHA, and other special cases)?

  • Michael Entin

    Is there some macro to define before including the intrinsics header so that it declares only the intrinsics available for a specific technology?
    The customer’s desire to have this checked at compile time is very understandable, and it should be simple to define such a macro and sprinkle the headers with conditional preprocessor directives so that only the appropriate intrinsics are enabled.