SIMD Extension to C++ OpenMP in Visual Studio

Hongtao Yu

In the era of ubiquitous AI applications, there is growing demand for compilers to accelerate computation-intensive machine-learning code on existing hardware. Such code usually performs mathematical computations such as matrix transformation and manipulation, typically in the form of loops. The SIMD extension of OpenMP gives users an effortless way to speed up loops by explicitly leveraging the vector unit of modern processors. We are proud to start offering C/C++ OpenMP SIMD vectorization in Visual Studio 2019.

The OpenMP C/C++ application program interface was originally designed in the 1990s to improve application performance by enabling code to execute effectively in parallel on multiple processors. Over the years the OpenMP standard has been expanded to support additional concepts such as task-based parallelization, SIMD vectorization, and processor offloading. Since 2005, Visual Studio has supported the OpenMP 2.0 standard, which focuses on multithreaded parallelization. As the world moves into an AI era, we see a growing opportunity to improve code quality by expanding support of the OpenMP standard in Visual Studio. We continue our journey in Visual Studio 2019 by adding support for OpenMP SIMD.

OpenMP SIMD, first introduced in the OpenMP 4.0 standard, mainly targets loop vectorization. According to our research, it is so far the most widely used OpenMP feature in machine learning. When a loop is annotated with an OpenMP SIMD directive, the compiler can ignore vector dependencies and vectorize the loop as much as possible. The compiler respects the user’s intention to have multiple loop iterations executed simultaneously.

#pragma omp simd 
for (i = 0; i < count; i++) 
{ 
    a[i] = b[i] + 1; 
}
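For readers who want to try the fragment above, here is a minimal self-contained sketch. The function name add_one and the use of std::vector are assumptions made for illustration and are not part of the original post. When OpenMP SIMD support is not enabled, the pragma is ignored and the loop runs as ordinary scalar code with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: stores b[i] + 1 into a[i] for every element.
// The two vectors are distinct objects, so the iterations are independent.
void add_one(std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const int count = static_cast<int>(b.size());
    float* dst = a.data();
    const float* src = b.data();

    // Ask the compiler to execute several iterations at once.
    #pragma omp simd
    for (int i = 0; i < count; i++)
    {
        dst[i] = src[i] + 1;
    }
}
```

A caller would size a to match b and then call add_one(a, b).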

As you may know, C++ in Visual Studio already provides similar non-OpenMP loop pragmas like #pragma vector and #pragma ivdep. However, the compiler can do more with OpenMP SIMD. For example:

The compiler is always allowed to ignore any vector dependencies that are present.
/fp:fast is enabled within the loop.
Loops with function calls are vectorizable.
Outer loops are vectorizable.
Nested loops can be coalesced into one loop and vectorized.
Hybrid acceleration is achievable with #pragma omp for simd to enable coarse-grained multithreading and fine-grained vectorization.
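The last item in the list, hybrid acceleration, could look roughly like the sketch below. It is an illustration under assumptions: the helper name scale_in_place is hypothetical, and the parallel region is written as a separate pragma rather than as a combined directive. Threads split the iteration space, and each thread's share is a candidate for vectorization.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiplies every element of data by factor.
void scale_in_place(std::vector<float>& data, float factor)
{
    // The loop index is a signed int, so guard the cast below.
    assert(data.size() <= static_cast<std::size_t>(INT_MAX));
    const int count = static_cast<int>(data.size());
    float* p = data.data();

    #pragma omp parallel
    {
        // Coarse-grained: iterations are divided among the threads.
        // Fine-grained: each thread's chunk may be vectorized.
        #pragma omp for simd
        for (int i = 0; i < count; i++)
        {
            p[i] = p[i] * factor;
        }
    }
}
```

If OpenMP is disabled, both pragmas are ignored and the function behaves as a plain sequential loop.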

In addition, the OpenMP SIMD directive can take the following clauses to further enhance the vectorization:

simdlen(length) : specify the number of vector lanes
safelen(length) : specify the vector dependency distance
linear(list[ : linear-step]) : the linear mapping from loop induction variable to array subscription
aligned(list[ : alignment]): the alignment of data
private(list) : specify data privatization
lastprivate(list) : specify data privatization with final value from the last iteration
reduction(reduction-identifier : list) : specify customized reduction operations
collapse(n) : coalescing loop nest
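As an illustration of one clause from the list above, the following sketch uses reduction to sum an array. The function name sum_of is hypothetical. Whether a particular compiler version honors the clause is a separate question, and a vectorized floating-point sum may add the elements in a different order than the scalar loop, so results can differ in the last bits for general inputs.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the sum of all elements.
float sum_of(const std::vector<float>& values)
{
    // The loop index is a signed int, so guard the cast below.
    assert(values.size() <= static_cast<std::size_t>(INT_MAX));
    const int count = static_cast<int>(values.size());
    const float* p = values.data();
    float sum = 0.0f;

    // Each vector lane keeps a partial sum; the partial sums are
    // combined into sum after the loop.
    #pragma omp simd reduction(+ : sum)
    for (int i = 0; i < count; i++)
    {
        sum += p[i];
    }
    return sum;
}
```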

New -openmp:experimental switch

An OpenMP-SIMD-annotated program can be compiled with a new CL switch, -openmp:experimental. This new switch enables additional OpenMP features not available under -openmp. Although the switch is named “experimental”, the switch itself and the functionality it enables are fully supported and production-ready. The name reflects that it doesn’t enable any complete subset or version of an OpenMP standard. Future iterations of the compiler may use this switch to enable additional OpenMP features, and new OpenMP-related switches may be added. The -openmp:experimental switch subsumes the -openmp switch, which means it is compatible with all OpenMP 2.0 features. Note that the SIMD directive and its clauses cannot be compiled with the -openmp switch.

For each loop that is not vectorized, the compiler issues a message like the ones below:

cl -O2 -openmp:experimental mycode.cpp

mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

For loops that are vectorized, the compiler keeps silent unless a vectorization logging switch is provided:

cl -O2 -openmp:experimental -Qvec-report:2 mycode.cpp

mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason ‘1200’

mycode.cpp(96) : info C5001: Omp simd loop vectorized

As the first step in supporting OpenMP SIMD, we have hooked up the SIMD pragma with the backend vectorizer under the new switch. We focused on vectorizing innermost loops by improving the vectorizer and alias analysis. None of the SIMD clauses are effective in Visual Studio 2019 at the time of this writing. They are parsed but ignored by the compiler, which issues a warning to make the user aware. For example, the compiler will issue

warning C4849: OpenMP ‘simdlen’ clause ignored in ‘simd’ directive

for the following code:

#pragma omp simd simdlen(8)
for (i = 1; i < count; i++)
{
    a[i] = a[i-1] + 1;
    b[i] = *c + 1;
    bar(i);
}

More about the semantics of OpenMP SIMD directive

The OpenMP SIMD directive gives users a way to direct the compiler to vectorize a loop. The compiler is allowed to ignore the apparent legality of such vectorization by accepting the user’s promise of correctness. If the vectorization leads to unexpected behavior, that is the user’s responsibility. By annotating a loop with the OpenMP SIMD directive, users state their intent to have multiple loop iterations executed simultaneously. This gives the compiler a lot of freedom to generate machine code that takes advantage of SIMD or vector resources on the target processor. While the compiler is not responsible for checking the correctness and profitability of such user-specified parallelism, it must still preserve the sequential behavior of a single loop iteration.

For example, the following loop is annotated with the OpenMP SIMD directive. There is no perfect parallelism among loop iterations since there is a backward dependency from a[i] to a[i-1]. But because of the SIMD directive the compiler is still allowed to pack consecutive iterations of the first statement into one vector instruction and run them in parallel.

#pragma omp simd
for (i = 1; i < count; i++)
{
    a[i] = a[i-1] + 1;
    b[i] = *c + 1;
    bar(i);
}

Therefore, the following transformed vector form of the loop is legal because the compiler keeps the sequential behavior of each original loop iteration. In other words, a[i] is executed after a[i-1], b[i] after a[i], and the call to bar happens last.

#pragma omp simd
for (i = 1; i < count; i+=4)
{
    a[i:i+3] = a[i-1:i+2] + 1;
    b[i:i+3] = *c + 1;
    bar(i);
    bar(i+1);
    bar(i+2);
    bar(i+3);
}
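To see why the promise of correctness matters, the vector form above can be emulated in plain C++. The sketch below is an illustration only: the function names and the fixed width of 4 are assumptions, and it models just the first statement of the loop. Each group of four iterations loads a[i-1] through a[i+2] before storing a[i] through a[i+3], so the result differs from the sequential loop whenever the backward dependency is real.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Sequential semantics: a[i] = a[i-1] + 1 for i in [1, count).
std::vector<int> run_sequential(std::vector<int> a)
{
    assert(a.size() <= static_cast<std::size_t>(INT_MAX));
    const int count = static_cast<int>(a.size());
    for (int i = 1; i < count; i++)
    {
        a[i] = a[i - 1] + 1;
    }
    return a;
}

// Emulation of the 4-wide vector form a[i:i+3] = a[i-1:i+2] + 1.
std::vector<int> run_vector_form(std::vector<int> a)
{
    assert(a.size() <= static_cast<std::size_t>(INT_MAX));
    const int width = 4;
    const int count = static_cast<int>(a.size());
    int i = 1;
    for (; i + width <= count; i += width)
    {
        int loaded[width];
        for (int lane = 0; lane < width; lane++)
        {
            loaded[lane] = a[i - 1 + lane];   // all four loads happen first
        }
        for (int lane = 0; lane < width; lane++)
        {
            a[i + lane] = loaded[lane] + 1;   // then all four stores
        }
    }
    for (; i < count; i++)                    // scalar remainder
    {
        a[i] = a[i - 1] + 1;
    }
    return a;
}
```

Starting from nine zeros, the sequential version yields 0, 1, 2, ..., 8, while the emulated vector form yields 0, 1, 1, 1, 1, 2, 1, 1, 1. Both outcomes are permitted once the loop carries the SIMD directive, which is why the directive should only be applied when the programmer accepts the vector result.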

It is illegal to move the memory reference *c out of the loop if it may alias with a[i] or b[i]. It’s also illegal to reorder the statements inside one original iteration if it breaks the sequential dependency. As an example, the following transformed loop is not legal.

c = b;
t = *c;
#pragma omp simd
for (i = 1; i < count; i+=4)
{
    a[i:i+3] = a[i-1:i+2] + 1;
    bar(i);            // illegal to reorder if bar(i) depends on b[i]
    b[i:i+3] = t + 1;  // illegal to move *c out of the loop
    bar(i+1);
    bar(i+2);
    bar(i+3);
}
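The aliasing hazard can be reproduced with a small scalar experiment. This is a sketch with hypothetical function names and no SIMD pragma at all. It only contrasts re-reading *c in every iteration with reading it once before the loop, when c points into b.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original order: *c is re-read in every iteration.
std::vector<int> reload_each_iteration(std::vector<int> b, std::size_t alias_index)
{
    assert(alias_index < b.size());
    const int* c = &b[alias_index];   // c aliases an element of b
    for (std::size_t i = 1; i < b.size(); i++)
    {
        b[i] = *c + 1;
    }
    return b;
}

// Transformed order: *c is read once, before the loop.
std::vector<int> hoisted_load(std::vector<int> b, std::size_t alias_index)
{
    assert(alias_index < b.size());
    const int t = b[alias_index];     // value captured before any store
    for (std::size_t i = 1; i < b.size(); i++)
    {
        b[i] = t + 1;
    }
    return b;
}
```

With five zeros and c pointing at b[2], the first function produces 0, 1, 1, 2, 2 and the second produces 0, 1, 1, 1, 1. The store to b[2] changes what later iterations read through c, and hoisting the load hides that change.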

Future Plans and Feedback

We encourage you to try out this new feature. As always, we welcome your feedback. If you see an OpenMP SIMD loop that you expect to be vectorized but isn’t, or whose generated code is not optimal, please let us know. We can be reached via the comments below, via email (visualcpp@microsoft.com), Twitter (@visualc), or via Developer Community.

Moving forward, we’d love to hear which OpenMP functionality you find missing in Visual Studio. OpenMP has gone through several major revisions since the 2.0 standard and now offers many features that ease the effort of building high-performance programs. For instance, task-based concurrent programming is available starting from OpenMP 3.0. Heterogeneous computing (CPU + accelerators) is supported in OpenMP 4.0. Advanced SIMD vectorization and DOACROSS loop parallelization support are also available in the latest OpenMP standard. Please check out the complete standard revisions and feature sets on the official OpenMP website: https://www.openmp.org. We sincerely ask for your thoughts on the specific OpenMP features you would like to see. We’re also interested in hearing about how you’re using OpenMP to accelerate your code. Your feedback is critical because it will help drive the direction of OpenMP support in Visual Studio.

13 comments

Marv G September 21, 2019

One thing that seems like a simple thing to do is allow more than 64 open mp threads to operate at once on windows. Likely you haven’t had people come up with this problem because it’s hard to get that many cores on a computer. I have 88 hyper threads, however, and can only ever utilize 64 of them. And even so, I had to go through a work around initialization phase to get more than 44 threads. Default openmp will only run on one cpu socket. Please see this for specifics: https://stackoverflow.com/questions/55766303/openmp-doesnt-utilize-all-cpusdual-socket-windows-and-microsoft-visual-studio/55842914?noredirect=1#comment98361390_55842914
- Hongtao Yu Author October 23, 2019
  
  Thanks for the suggestion. The OpenMP library shipped with Visual Studio 2019 uses the newest Windows 10 APIs to allow for more than 64 worker threads in a pool. Please give it a shot and let us know if it doesn’t work.
Jean-Marc Volle May 2, 2019

Hello, I would be interested in support for OpenMP loop optimization when iteration is done over std containers. It is not supported in OpenMP 2.0 and in OpenMP 3.0 according to this post:
https://stackoverflow.com/questions/2513988/iteration-through-std-containers-in-openmp
- Hongtao Yu Author May 16, 2019
  
  Thanks for the suggestion. I would appreciate it if you could go to Developer Community and file a feature request there.
  
Hongtao Yu Author April 1, 2019

Thanks for your suggestion! We also see your request in Developer Community. We will take it into consideration.
Owens, Bowie (Data61, Clayton) April 1, 2019

I was hoping to use OpenMP to unify the approach to threading to simplify control and improve performance of several of our libraries that work together. I was hoping to utilise the task features of OpenMP 3.0. Is there some more formal way I should request that as a feature?
simmse March 29, 2019

Hello,
Is it allowed to discuss target release dates for the other aspects? For example, user defined reduction uses the
#pragma omp declare reduction(theReductionName : UserDefinedType : omp_out = “function body” to determine using omp_in or omp_out) initializer

and other variations in the OpenMP notation. Then the named reduction above is used in the simd directive(s) later. Is the above on the supported list of OpenMP simd features?
Sincerely,
simmse
- Hongtao Yu Author March 30, 2019
  
  Hello,
  
  Thanks for your valuable feedback. We don’t have a fixed release schedule for future OpenMP features yet. We are currently in the process of collecting feature requests from users. Once that is done, we’ll work on a ship schedule, so your feedback is really important to us. Can I ask a bit more about your OpenMP scenario? Were you thinking about using the user-defined reduction with OpenMP SIMD or OpenMP for?
  - simmse April 3, 2019
    
    Hello,
    I have two topics to understand customer requirements. The easy statement describing OpenMP user requirements, and pleasing us 100%, is to fully implement support of OpenMP 4.5 and/or newer.
    If that functionality is too large to release in the next 30 to 60 days, then all of the clauses listed above are a good start:
    simdlen(length) : specify the number of vector lanes
    safelen(length) : specify the vector dependency distance
    linear(list[ : linear-step]) : the linear mapping from loop induction variable to array subscription
    aligned(list[ : alignment]): the alignment of data
    private(list) : specify data privatization
    lastprivate(list) : specify data privatization with final value from the last iteration
    reduction(reduction-identifier : list) : specify customized reduction operations
    collapse(n) : coalescing loop nest
    The company I work for will then verify functionality and use accordingly.
    Regarding the user-defined reduction details, the best performance I have so far with a single dimension array size range from as few as ten elements, to as large as one billion elements, is the combination of the parallel and simd directives in a single pragma:
    #pragma omp parallel for simd default(none) firstprivate(theMaxValue) shared(pArrayOfDoubles) reduction(theReductionName : instanceNameOfUserStructure)
    followed by a normal looking for loop with a size_t loop control variable and condition checking by
    i < theMaxValue
    When I separate the parallel and simd into separate pragma directives on separate source code lines, and use scoping braces accordingly, the total time increases. If I use only parallel or only simd, then the total time increases more than the separated parallel and simd combination.
    In other words, supporting only one aspect of the pragma directive, for example simd, will likely not be as fast as possible. Supporting the combination of the parallel and simd directives in the single pragma line above is preferred. Users need multiple OpenMP clauses to correctly determine which aspect(s) will have the best performance. It is understood that the performance test results will vary between array sizes and algorithm implementations. Plus, supporting all of the various ways to use simd, parallel, and for allows users to test and select the directives and clauses that perform best.
    Within the instanceNameOfUserStructure above are two data members. One of them happens to be a size_t, tracking the index position of the reduction search. The other data element is the value at that index position. With user-defined reduction implemented this way, functions from the deprecated Intel Cilk library (e.g. https://software.intel.com/en-us/forums/intel-cilk-plus/topic/745556) can be reproduced with OpenMP user-defined reduction.
    Additional details on the user-defined reduction functionality can be found in the book titled Using OpenMP – The Next Step by Ruud van der Pas, Eric Stotzer, and Christian Terboven, section 2.4.3., MIT Press, 2017, 978-0-262-53478-9 (e.g. use no dashes when searching the Internet).
    
    Sincerely,
    simmse
    
  - Hongtao Yu Author April 18, 2019
    
    Hello,
    
    Your request for #pragma omp parallel for simd support is now officially queued up. You can track it in Developer Community:
    
    https://developercommunity.visualstudio.com/idea/539016/openmp-support-pragma-omp-parallel-for-simd.html
    
    Feature suggestions are prioritized based on the value to our broader developer community and the product roadmap, so please go there and vote for it. In the meantime, please open a task there for your other requests.
  - Hongtao Yu Author April 5, 2019
    
    Thanks very much for the detailed description! Your OpenMP feature request is queued up.
Hongtao Yu Author March 28, 2019

Hi Yupeng,

Thanks very much for your feedback.

1. The incompatibility with /permissive- /Zc:twoPhase is a known issue and we are working on a fix. It is due to the recent upgrade of our front-end parser.

2. Regarding the usage of simdlen, it’s a good question. The OpenMP standard defines the simdlen number as a preferred number of loop iterations to be executed simultaneously. The standard also gives the compiler freedom in how to interpret that number. The number will mostly be overridden by a target switch like /arch:AVX, which makes it sound useless, though this is not yet implemented in Visual Studio. However, there are a couple of situations in which the number matters. For example,

   a. The simdlen number is a hint to the compiler about how many iterations can be executed in parallel. When the number is above the number of actual SIMD lanes specified by the target switch, the compiler still has the freedom to unroll the loop and shuffle the instructions around to improve ILP.

   b. With the use of #pragma omp declare simd for a function declaration, the simdlen number can be used to call the exact SIMD implementation of that function when it is called in a loop.

3. The private and lastprivate clauses also serve as hints to the compiler to expand scalars to avoid WAW/WAR dependencies. For example, with the declaration of variable “b” as private, the compiler can promote “b” to an array to reduce the vector dependencies.

#pragma omp simd private(b)
for (i = 0; i < N; i++)
{
    b = a[i];
    foo(b);
    c[i] = b;
}

Yupeng Zhang March 27, 2019

Well, more SIMD is certainly welcome; however, I have a few questions here:
1. /openmp is not compatible with /permissive- or /Zc:twoPhase, and in the case of my team, /permissive- takes priority. Does /openmp:experimental still block two-phase?
2. Why can’t we get these benefits by default? The ‘simdlen(8)’ directive is of no help if I am trying to vectorize 32-floats before AVX2. And I fail to see any reason the compiler needs this ‘simdlen(8)’ information when `/arch:see4` or `arch:avx2` is already specified.
3. Regarding #pragma omp private / lastprivate, I fail to see any reason I need to care about data ownership or access, as in the multi-threaded world, in a single-thread-vectorization function…
