In the era of ubiquitous AI applications, there is an emerging demand for compilers to accelerate computation-intensive machine-learning code on existing hardware. Such code usually performs mathematical computations such as matrix transformation and manipulation, and it is usually structured as loops. The SIMD extension of OpenMP provides users an effortless way to speed up these loops by explicitly leveraging the vector units of modern processors. We are proud to start offering C/C++ OpenMP SIMD vectorization in Visual Studio 2019.
The OpenMP C/C++ application program interface was originally designed in the 1990s to improve application performance by enabling code to execute effectively in parallel on multiple processors. Over the years, the OpenMP standard has been expanded to support additional concepts such as task-based parallelization, SIMD vectorization, and processor offloading. Since 2005, Visual Studio has supported the OpenMP 2.0 standard, which focuses on multithreaded parallelization. As the world moves into an AI era, we see a growing opportunity to improve code quality by expanding OpenMP support in Visual Studio. We continue this journey in Visual Studio 2019 by adding support for OpenMP SIMD.
OpenMP SIMD, first introduced in the OpenMP 4.0 standard, mainly targets loop vectorization. It is so far the most widely used OpenMP feature in machine learning, according to our research. By annotating a loop with an OpenMP SIMD directive, the user allows the compiler to ignore vector dependencies and vectorize the loop as much as possible. The compiler respects the user's intention to have multiple loop iterations executed simultaneously.
#pragma omp simd
for (i = 0; i < count; i++)
{
    a[i] = b[i] + 1;
}
As you may know, C++ in Visual Studio already provides similar non-OpenMP loop pragmas like #pragma vector and #pragma ivdep. However, the compiler can do more with OpenMP SIMD. For example:
- The compiler is always allowed to ignore any vector dependencies that are present.
- /fp:fast is enabled within the loop.
- Loops with function calls are vectorizable.
- Outer loops are vectorizable.
- Nested loops can be coalesced into one loop and vectorized.
- Hybrid acceleration is achievable with #pragma omp for simd to enable coarse-grained multithreading and fine-grained vectorization (a minimal sketch follows this list).
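As a concrete illustration of the last point, here is a minimal sketch of the combined construct as defined by the OpenMP standard (the function and variable names are made up for illustration): the enclosing parallel region spreads chunks of the loop across threads, and each thread vectorizes its own chunk.

void saxpy(float *a, const float *b, float s, int n)
{
    #pragma omp parallel
    {
        // Coarse-grained: iterations are divided among the threads of the
        // parallel region. Fine-grained: each thread's chunk is vectorized.
        #pragma omp for simd
        for (int i = 0; i < n; i++)
        {
            a[i] = a[i] + s * b[i];
        }
    }
}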
In addition, the OpenMP SIMD directive can take the following clauses to further refine the vectorization (a combined usage sketch follows the list):
- simdlen(length): specify the number of vector lanes
- safelen(length): specify the vector dependency distance
- linear(list[ : linear-step]): specify the linear mapping from the loop induction variable to array subscripts
- aligned(list[ : alignment]): specify the alignment of data
- private(list): specify data privatization
- lastprivate(list): specify data privatization with the final value taken from the last iteration
- reduction(reduction-identifier : list): specify customized reduction operations
- collapse(n): coalesce the loop nest
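As a rough sketch of how several of these clauses can be combined on one directive (the function and variable names are made up for illustration, and keep in mind that in Visual Studio 2019 the clauses are parsed but not yet acted on, as described later in this post):

float dot_product(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    // simdlen(8): prefer 8 vector lanes. aligned(a, b : 32): promise, for the
    // sake of the example, that the caller passes 32-byte-aligned buffers.
    // reduction(+ : sum): give each lane a private partial sum and combine them.
    #pragma omp simd simdlen(8) aligned(a, b : 32) reduction(+ : sum)
    for (int i = 0; i < n; i++)
    {
        sum += a[i] * b[i];
    }
    return sum;
}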
New -openmp:experimental switch
An OpenMP-SIMD-annotated program can be compiled with the new CL switch -openmp:experimental. This new switch enables additional OpenMP features that are not available under -openmp. While the name of this switch is “experimental”, the switch itself and the functionality it enables are fully supported and production-ready. The name reflects that it doesn’t enable any complete subset or version of an OpenMP standard. Future iterations of the compiler may use this switch to enable additional OpenMP features, and new OpenMP-related switches may be added. The -openmp:experimental switch subsumes the -openmp switch, which means it is compatible with all OpenMP 2.0 features. Note that the SIMD directive and its clauses cannot be compiled with the -openmp switch.
For loops that are not vectorized, the compiler will issue a message for each of them, as shown below. For example:
cl -O2 -openmp:experimental mycode.cpp
mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason '1200'
For loops that are vectorized, the compiler keeps silent unless a vectorization logging switch is provided:
cl -O2 -openmp:experimental -Qvec-report:2 mycode.cpp
mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(96) : info C5001: Omp simd loop vectorized
As the first step in supporting OpenMP SIMD, we have essentially hooked up the SIMD pragma with the back-end vectorizer under the new switch. We focused on vectorizing innermost loops by improving the vectorizer and alias analysis. None of the SIMD clauses are effective in Visual Studio 2019 at the time of this writing. They will be parsed but ignored by the compiler, with a warning issued for the user's awareness. For example, the compiler will issue
warning C4849: OpenMP 'simdlen' clause ignored in 'simd' directive
for the following code:
#pragma omp simd simdlen(8)
for (i = 1; i < count; i++)
{
    a[i] = a[i-1] + 1;
    b[i] = *c + 1;
    bar(i);
}
More about the semantics of the OpenMP SIMD directive
The OpenMP SIMD directive provides users a way to direct the compiler to vectorize a loop. The compiler is allowed to ignore the apparent legality of such vectorization, accepting the user's promise of correctness. It is the user's responsibility if unexpected behavior occurs as a result of the vectorization. By annotating a loop with the OpenMP SIMD directive, the user intends to have multiple loop iterations executed simultaneously. This gives the compiler a lot of freedom to generate machine code that takes advantage of SIMD or vector resources on the target processor. While the compiler is not responsible for verifying the correctness and profitability of such user-specified parallelism, it must still ensure the sequential behavior of a single loop iteration.
For example, the following loop is annotated with the OpenMP SIMD directive. There is no perfect parallelism among loop iterations since there is a backward dependency from a[i] to a[i-1]. But because of the SIMD directive the compiler is still allowed to pack consecutive iterations of the first statement into one vector instruction and run them in parallel.
#pragma omp simd
for (i = 1; i < count; i++)
{
    a[i] = a[i-1] + 1;
    b[i] = *c + 1;
    bar(i);
}
Therefore, the following transformed vector form of the loop is legal because the compiler preserves the sequential behavior of each original loop iteration. In other words, a[i] is computed after a[i-1], b[i] after a[i], and the call to bar happens last.
#pragma omp simd
for (i = 1; i < count; i+=4)
{
    a[i:i+3] = a[i-1:i+2] + 1;
    b[i:i+3] = *c + 1;
    bar(i);
    bar(i+1);
    bar(i+2);
    bar(i+3);
}
It is illegal to move the memory reference *c out of the loop if it may alias with a[i] or b[i]. It’s also illegal to reorder the statements inside one original iteration if it breaks the sequential dependency. As an example, the following transformed loop is not legal.
c = b;
t = *c;
#pragma omp simd
for (i = 1; i < count; i+=4)
{
    a[i:i+3] = a[i-1:i+2] + 1;
    bar(i);            // illegal to reorder if bar(i) depends on b[i]
    b[i:i+3] = t + 1;  // illegal to move *c out of the loop
    bar(i+1);
    bar(i+2);
    bar(i+3);
}
Future Plans and Feedback
We encourage you to try out this new feature. As always, we welcome your feedback. If you see an OpenMP SIMD loop that you expect to be vectorized but isn't, or for which the generated code is not optimal, please let us know. We can be reached via the comments below, via email (visualcpp@microsoft.com), on Twitter (@visualc), or via Developer Community.
Moving forward, we'd love to hear about the OpenMP functionality you need that is missing from Visual Studio. There have been several major evolutions of OpenMP since the 2.0 standard, and OpenMP now offers a tremendous set of features to ease the effort of building high-performance programs. For instance, task-based concurrency programming is available starting from OpenMP 3.0. Heterogeneous computing (CPU + accelerators) is supported in OpenMP 4.0. Advanced SIMD vectorization and DOACROSS loop parallelization support are also available in the latest OpenMP standard. Please check out the complete standard revisions and feature sets on the official OpenMP website: https://www.openmp.org. We sincerely ask for your thoughts on the specific OpenMP features you would like to see. We're also interested in hearing how you're using OpenMP to accelerate your code. Your feedback is critical, as it will help drive the direction of OpenMP support in Visual Studio.
One thing that seems like a simple thing to do is to allow more than 64 OpenMP threads to operate at once on Windows. Likely you haven't had people come up with this problem because it's hard to get that many cores on a computer. I have 88 hyper-threads, however, and can only ever utilize 64 of them. And even so, I had to go through a workaround initialization phase to get more...
Thanks for the suggestion. The OpenMP library shipped with Visual Studio 2019 utilizes the newest Windows 10 APIs to allow for more than 64 worker threads in a pool. Please give it a shot and let us know if it doesn't work.
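A minimal way to check the pool size on such a machine might look like the following sketch (compiled with -openmp; the 88 simply mirrors the hyper-thread count mentioned above):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(88);    // ask for more threads than the old 64 limit
    #pragma omp parallel
    {
        #pragma omp master
        printf("threads in pool: %d\n", omp_get_num_threads());
    }
    return 0;
}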
Hello, I would be interested in support for OpenMP loop optimization when iteration is done over std containers. It is not supported in OpenMP 2.0 but is in OpenMP 3.0, according to this post:
https://stackoverflow.com/questions/2513988/iteration-through-std-containers-in-openmp
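For reference, the kind of loop being asked about might look like the sketch below: OpenMP 3.0 and later accept a random-access iterator in the canonical loop form, whereas the OpenMP 2.0 implementation in Visual Studio requires a signed integer induction variable instead.

#include <vector>

void scale(std::vector<double>& v)
{
    // Accepted from OpenMP 3.0 onward; under OpenMP 2.0 this loop has to be
    // rewritten with an int index, e.g. for (int i = 0; i < (int)v.size(); i++).
    #pragma omp parallel for
    for (std::vector<double>::iterator it = v.begin(); it < v.end(); ++it)
    {
        *it *= 2.0;
    }
}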
Thanks for the suggestion. I would appreciate it if you could go to Developer Community and file a feature request there.
Thanks for your suggestion! We also see your request in Developer Community. We will take it into consideration.
I was hoping to use OpenMP to unify the approach to threading to simplify control and improve performance of several of our libraries that work together. I was hoping to utilise the task features of OpenMP 3.0. Is there some more formal way I should request that as a feature?
Hello,
Is it allowed to discuss target release dates for the other aspects? For example, a user-defined reduction uses the
#pragma omp declare reduction(theReductionName : UserDefinedType : omp_out = /* combiner expression written in terms of omp_in and omp_out */) initializer
and other variations of the OpenMP notation. The named reduction above is then used in the simd directive(s) later. Is the above on the supported list of OpenMP SIMD features?
Sincerely,
simmse
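For context, a concrete instance of the pattern being asked about might look like the following sketch (standard OpenMP 4.0 syntax; the names are made up for illustration, and as noted in the post the SIMD clauses, including reduction, are not yet acted on by the Visual Studio compiler):

#include <algorithm>
#include <cmath>

// A user-defined reduction that keeps the largest absolute value seen so far.
#pragma omp declare reduction(absmax : double : \
        omp_out = std::max(omp_out, omp_in)) \
    initializer(omp_priv = 0.0)

double max_abs(const double *a, int n)
{
    double m = 0.0;
    // The named reduction is then referenced from the simd directive.
    #pragma omp simd reduction(absmax : m)
    for (int i = 0; i < n; i++)
    {
        m = std::max(m, std::fabs(a[i]));
    }
    return m;
}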
Hello,
Thanks for your valuable feedback. We don't have a fixed release schedule for future OpenMP features yet. We are currently in the process of collecting feature requirements from users. Once that is done, we'll work on a ship schedule. Therefore, your feedback is really important to us. Can I ask a bit more about your OpenMP scenario? Were you thinking about using the user-defined reduction with OpenMP SIMD or OpenMP for?
Hello,
I have two topics to help you understand customer requirements. The easy statement describing OpenMP user requirements, and one that would please us 100%, is to fully implement support for OpenMP 4.5 and/or newer.
If that functionality is too large to release in the next 30 to 60 days, then all of the clauses listed above are a good start:
simdlen(length) : specify the number of vector lanes
safelen(length) : specify the vector dependency distance
linear(list[ : linear-step]) :...
Hello,
Your request for #pragma omp parallel for simd support is now officially queued up. You can track it in Developer Community:
https://developercommunity.visualstudio.com/idea/539016/openmp-support-pragma-omp-parallel-for-simd.html
Feature suggestions are prioritized based on their value to our broader developer community and the product roadmap, so please go there and vote for it. In the meantime, please open a task there for your other requests.
Thanks very much for the detailed description! Your OpenMP feature request is queued up.
Hi Yupeng,
Thanks very much for your feedback.
1. The incompatibility with /permissive- /Zc:twoPhase is a known issue and we are working on a fix. It is due to the recent upgrade of our front-end parser.
2. Regarding the usage of simdlen, it's a good question. The OpenMP standard defines the simdlen number as the preferred number of loop iterations to be executed simultaneously. The standard also gives the compiler freedom in how to interpret that number....
Well, more SIMD is certainly welcome; however, I have a few questions here:
1. /openmp is not compatible with /permissive- or /Zc:twoPhase, and in the case of my team, /permissive- got the priority. Does /openmp:experimental still block two-phase?
2. Why can't we get these benefits by default? The 'simdlen(8)' directive is helpless if I am trying to vectorize 32 floats before AVX2. And I fail to see any reason the compiler needs this 'simdlen(8)' information when `/arch:sse4` or `/arch:avx2` is already...