SIMD Extension to C++ OpenMP in Visual Studio



In the era of ubiquitous AI applications, there is growing demand for compilers to accelerate computation-intensive machine-learning code on existing hardware. Such code typically performs mathematical computation such as matrix transformation and manipulation, usually in the form of loops. The SIMD extension of OpenMP gives users an effortless way to speed up loops by explicitly leveraging the vector units of modern processors. We are proud to start offering C/C++ OpenMP SIMD vectorization in Visual Studio 2019.

The OpenMP C/C++ application program interface was originally designed in the 1990s to improve application performance by enabling code to be executed effectively in parallel on multiple processors. Over the years the OpenMP standard has been expanded to support additional concepts such as task-based parallelization, SIMD vectorization, and processor offloading. Since 2005, Visual Studio has supported the OpenMP 2.0 standard, which focuses on multithreaded parallelization. As the world moves into an AI era, we see a growing opportunity to improve code quality by expanding support for the OpenMP standard in Visual Studio. We continue our journey in Visual Studio 2019 by adding support for OpenMP SIMD.

OpenMP SIMD, first introduced in the OpenMP 4.0 standard, mainly targets loop vectorization. It is so far the most widely used OpenMP feature in machine learning, according to our research. When a loop is annotated with an OpenMP SIMD directive, the compiler may ignore assumed vector dependencies and vectorize the loop as much as possible. The compiler respects the user's intention to have multiple loop iterations executed simultaneously.

As you may know, C++ in Visual Studio already provides similar non-OpenMP loop pragmas like #pragma vector and #pragma ivdep. However, the compiler can do more with OpenMP SIMD. For example:

  1. The compiler is always allowed to ignore any vector dependencies that are present.
  2. /fp:fast is enabled within the loop.
  3. Loops with function calls are vectorizable.
  4. Outer loops are vectorizable.
  5. Nested loops can be coalesced into one loop and vectorized.
  6. Hybrid acceleration is achievable with #pragma omp for simd to enable coarse-grained multithreading and fine-grained vectorization.
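As an illustrative sketch of items 3 and 6 above, the loop below contains a function call and combines coarse-grained threading with fine-grained vectorization. The kernel, the helper `half`, and all names are assumptions for the example, not code from this post:

```cpp
#include <vector>

// Illustrative helper: item 3 says loops containing function calls
// can still be vectorized under OpenMP SIMD.
static float half(float x) { return x * 0.5f; }

// Hybrid acceleration (item 6): iterations are divided among threads,
// and each thread's chunk of iterations is vectorized.
std::vector<float> transform(const std::vector<float>& a) {
    std::vector<float> b(a.size());
    #pragma omp parallel for simd
    for (int i = 0; i < static_cast<int>(a.size()); ++i)
        b[i] = half(a[i]) + 1.0f;
    return b;
}
```

Without an OpenMP switch the pragma is simply ignored and the loop runs serially, so the code behaves identically either way.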

In addition, the OpenMP SIMD directive can take the following clauses to further enhance the vectorization:

  • simdlen(length) : specifies the preferred number of vector lanes
  • safelen(length) : specifies the vector dependency distance
  • linear(list[ : linear-step]) : declares a linear mapping from the loop induction variable to array subscripts
  • aligned(list[ : alignment]) : declares the alignment of data
  • private(list) : specifies data privatization
  • lastprivate(list) : specifies data privatization with the final value taken from the last iteration
  • reduction(reduction-identifier : list) : specifies customized reduction operations
  • collapse(n) : coalesces a loop nest into a single loop
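For instance, a reduction loop might combine two of these clauses as in the sketch below. The function and its names are illustrative, not from this post (and note that in Visual Studio 2019 these clauses are parsed but not yet effective, as described later):

```cpp
#include <cstddef>

// reduction(+:sum) names the combining operation for per-lane partial
// sums; simdlen(8) expresses a preferred number of vector lanes.
double dot(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum) simdlen(8)
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```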

New -openmp:experimental switch

An OpenMP-SIMD-annotated program can be compiled with the new CL switch -openmp:experimental. This switch enables additional OpenMP features not available under -openmp. While the switch is named “experimental”, the switch itself, and the functionality it enables, is fully supported and production-ready. The name reflects that it doesn’t enable any complete subset or version of an OpenMP standard. Future iterations of the compiler may use this switch to enable additional OpenMP features, and new OpenMP-related switches may be added. The -openmp:experimental switch subsumes the -openmp switch, which means it is compatible with all OpenMP 2.0 features. Note that the SIMD directive and its clauses cannot be compiled with the -openmp switch.

For loops that are not vectorized, the compiler issues an informational message for each of them, like the following:

cl -O2 -openmp:experimental mycode.cpp

mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

For loops that are vectorized, the compiler keeps silent unless a vectorization logging switch is provided:

cl -O2 -openmp:experimental -Qvec-report:2 mycode.cpp

mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(96) : info C5001: Omp simd loop vectorized

As the first step of supporting OpenMP SIMD, we have hooked up the SIMD pragma with the backend vectorizer under the new switch. We focused on vectorizing innermost loops by improving the vectorizer and alias analysis. None of the SIMD clauses are effective in Visual Studio 2019 at the time of this writing. They will be parsed but ignored by the compiler, with a warning issued to make the user aware. For example, the compiler will issue

warning C4849: OpenMP ‘simdlen’ clause ignored in ‘simd’ directive

for the following code:
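A minimal loop that uses the simdlen clause, and so triggers this warning, might look like the following (the function and array names are illustrative):

```cpp
#include <cstddef>

// simdlen(8) is parsed but currently ignored by the compiler, so
// compiling this under -openmp:experimental emits warning C4849.
void saxpy(float* a, const float* b, std::size_t n) {
    #pragma omp simd simdlen(8)
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```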

More about the semantics of the OpenMP SIMD directive

The OpenMP SIMD directive provides a way for users to direct the compiler to vectorize a loop. The compiler is allowed to bypass its usual legality analysis for such vectorization, accepting the user's promise of correctness. It is the user's responsibility if unexpected behavior results from the vectorization. By annotating a loop with the OpenMP SIMD directive, the user intends to have multiple loop iterations executed simultaneously. This gives the compiler a lot of freedom to generate machine code that takes advantage of SIMD or vector resources on the target processor. While the compiler is not responsible for verifying the correctness or profitability of such user-specified parallelism, it must still ensure the sequential behavior of each single loop iteration.

For example, the following loop is annotated with the OpenMP SIMD directive. There is no perfect parallelism among loop iterations since there is a backward dependency from a[i] to a[i-1]. But because of the SIMD directive the compiler is still allowed to pack consecutive iterations of the first statement into one vector instruction and run them in parallel.
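A loop of the shape described might look like the following sketch. The names a, b, c, and bar come from the surrounding discussion; the element type, bounds, and the body of bar are assumptions made so the example is self-contained:

```cpp
enum { N = 16 };
static int a[N], b[N];
static int* c = &a[0];   // *c may alias a[], which matters below
static int bar_calls = 0;
static void bar(int) { ++bar_calls; }  // opaque call inside the loop

void kernel() {
    #pragma omp simd
    for (int i = 1; i < N; ++i) {
        a[i] = a[i - 1] + 1;  // backward dependency from a[i] to a[i-1]
        b[i] = *c + 1;
        bar(i);
    }
}
```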

Therefore, the following transformed vector form of the loop is legal, because the compiler keeps the sequential behavior of each original loop iteration. In other words, a[i] is executed after a[i-1], b[i] after a[i], and the call to bar happens last.
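In pseudocode, using array-section notation x[i:i+1] for a two-element vector, that legal form might look like this (a two-lane vectorization is assumed for illustration):

```
for (i = 1; i < N; i += 2)
{
    a[i:i+1] = a[i-1:i] + 1;   // first statement packed into one vector op
    b[i:i+1] = *c + 1;         // still runs after the store to a[]
    bar(i);
    bar(i+1);                  // calls still happen last, in iteration order
}
```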

It is illegal to move the memory reference *c out of the loop if it may alias with a[i] or b[i]. It’s also illegal to reorder the statements inside one original iteration if it breaks the sequential dependency. As an example, the following transformed loop is not legal.
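A pseudocode sketch of such an illegal transformation, again assuming two lanes, might be:

```
t = *c + 1;                    // illegal hoist if *c aliases a[] or b[]
for (i = 1; i < N; i += 2)
{
    b[i:i+1] = t;              // illegal reorder: b[i] must execute after a[i]
    a[i:i+1] = a[i-1:i] + 1;
    bar(i);
    bar(i+1);
}
```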


Future Plans and Feedback

We encourage you to try out this new feature. As always, we welcome your feedback. If you see an OpenMP SIMD loop that you expect to be vectorized but isn't, or the generated code is not optimal, please let us know. We can be reached via the comments below, via email, on Twitter (@visualc), or via Developer Community.

Moving forward, we’d love to hear which OpenMP functionality missing in Visual Studio you need. As there have been several major evolutions of OpenMP since the 2.0 standard, OpenMP now offers a tremendous set of features to ease the effort of building high-performance programs. For instance, task-based concurrent programming is available starting from OpenMP 3.0. Heterogeneous computing (CPU + accelerators) is supported in OpenMP 4.0. Advanced SIMD vectorization and DOACROSS loop parallelization support are also available in the latest OpenMP standard. Please check out the complete standard revisions and feature sets on the official OpenMP website. We sincerely ask for your thoughts on the specific OpenMP features you would like to see. We’re also interested in hearing how you’re using OpenMP to accelerate your code. Your feedback is critical and will help drive the direction of OpenMP support in Visual Studio.


Hongtao Yu


Jean-Marc Volle 2019-05-02 09:57:24
Hello, I would be interested in support for OpenMP loop optimization when iteration is done over std containers. It is not supported in OpenMP 2.0 or in OpenMP 3.0 according to this post:
Hongtao Yu 2019-04-01 22:17:39

Thanks for your suggestion! We also saw your request in Developer Community. We will take it into consideration.

Owens, Bowie (Data61, Clayton) 2019-04-01 21:47:46
I was hoping to use OpenMP to unify the approach to threading to simplify control and improve performance of several of our libraries that work together. I was hoping to utilise the task features of OpenMP 3.0. Is there some more formal way I should request that as a feature?
simmse 2019-03-29 19:05:06
Hello, Is it allowed to discuss target release dates for the other aspects?  For example, user defined reduction uses the #pragma omp declare reduction(theReductioName : UserDefinedType : omp_out = "function body" to determine using omp_in or omp_out) initializer and other variations in the OpenMP notation.  Then the named reduction above is used in the simd directive(s) later.  Is the above on the supported list of OpenMP simd features? Sincerely, simmse
Hongtao Yu 2019-03-28 17:41:46

Hi Yupeng,

 Thanks very much for your feedback.

1. The incompatibility with /permissive- /Zc:twoPhase is a known issue and we are working on a fix. It is due to the recent upgrade of our front-end parser.

2. Regarding the usage of simdlen, it's a good question. The OpenMP standard defines the simdlen number as a preferred number of loop iterations to be executed simultaneously. The standard also gives the compiler freedom in how to interpret that number. The number will mostly be overridden by the target switch like /arch:AVX, which makes it sound useless, though this is not yet implemented in Visual Studio. However, there are a couple of situations where the number may stand out. For example,

   a. The simdlen number is a hint to the compiler about how many iterations can be executed in parallel. When the number is above the number of actual SIMD lanes specified by the target switch, the compiler still has the freedom to unroll the loop and shuffle the instructions around to improve ILP.

   b. With the use of #pragma omp declare simd on a function declaration, the simdlen number can be used to call the exact SIMD implementation of that function when it is called in a loop.

3. The private and lastprivate clauses also serve as hints to the compiler to expand scalars and avoid WAW/WAR dependencies. For example, with the variable "b" declared private, the compiler can promote "b" to an array to reduce the vector dependencies.

#pragma omp simd private(b)

for (i = 0; i < N; i++)

{

           b = a[i];

           c[i] = b;

}
Yupeng Zhang 2019-03-27 21:25:11
Well, more SIMD is certainly welcome, however, I got a few questions here: 1. /openmp is not compatible with /permissive- or /Zc:twoPhase, and in the case of my team, /permissive- got the priority. Does /openmp:experimental still block two-phase? 2. Why can't we get these benefits by default? The 'simdlen(8)' directive is helpless if I am trying to vectorize 32 floats before AVX2. And I failed to see any reason the compiler needs this 'simdlen(8)' information when `/arch:sse4` or `/arch:avx2` is already specified. 3. Regarding #pragma omp private / lastprivate, I failed to see any reason I need to care about data ownership or access like in the multi-thread world in a single-thread-vectorization function...
Well, more SIMD is certainly welcome, however, I got a few questions here: 1. /openmp is not compatible with /permissive- or /Zc:twoPhase, and in the case of my team, /permissive- got the priority. Does /openmp:experimental still blocks two-phase? 2. Why can't we get these benefits by default? The 'simdlen(8)' directive is helpless if I am trying to vectorize 32-floats before AVX2. And I failed to see any reason the compiler needs this 'simdlen(8)' information when `/arch:see4` or `arch:avx2` is already specified. 3. Regarding #pragma omp private / lastprivate, I failed to see any reason I need to care data ownership or access like the multi-thread world in a single-thread-vectorization function...