SIMD Extension to C++ OpenMP in Visual Studio


Hongtao

In an era of ubiquitous AI applications, there is growing demand for compilers to accelerate computation-intensive machine-learning code on existing hardware. Such code usually performs mathematical computation such as matrix transformation and manipulation, typically in the form of loops. The SIMD extension of OpenMP gives users a low-effort way to speed up loops by explicitly leveraging the vector units of modern processors. We are proud to start offering C/C++ OpenMP SIMD vectorization in Visual Studio 2019.

The OpenMP C/C++ application program interface was originally designed in the 1990s to improve application performance by enabling code to run effectively in parallel on multiple processors. Over the years the OpenMP standard has been expanded to support additional concepts such as task-based parallelization, SIMD vectorization, and processor offloading. Since 2005, Visual Studio has supported the OpenMP 2.0 standard, which focuses on multithreaded parallelization. As the world moves into an AI era, we see a growing opportunity to improve code quality by expanding support for the OpenMP standard in Visual Studio. We continue our journey in Visual Studio 2019 by adding support for OpenMP SIMD.

OpenMP SIMD, first introduced in the OpenMP 4.0 standard, mainly targets loop vectorization. According to our research, it is so far the most widely used OpenMP feature in machine learning. When a loop is annotated with an OpenMP SIMD directive, the compiler can ignore vector dependencies and vectorize the loop as much as possible. The compiler respects the user’s intention to have multiple loop iterations executed simultaneously.
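A minimal sketch of such an annotation, assuming a simple element-wise loop whose iterations are independent. The function name scale_add and its parameters are hypothetical and do not come from the original post. When the file is built without an OpenMP switch, the pragma is simply ignored and the loop runs as ordinary scalar code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: y[i] = alpha * x[i] + y[i] for every element.
void scale_add(float alpha, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const int n = static_cast<int>(x.size());

    // The directive asks the compiler to execute several iterations at once.
    // No iteration reads a value written by another, so that is safe here.
#pragma omp simd
    for (int i = 0; i < n; i++)
    {
        yp[i] = alpha * xp[i] + yp[i];
    }
}
```

The loop body is unchanged from what one would write without OpenMP; only the directive above the loop is new.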

As you may know, C++ in Visual Studio already provides similar non-OpenMP loop pragmas like #pragma vector and #pragma ivdep. However, the compiler can do more with OpenMP SIMD. For example:

  1. The compiler is always allowed to ignore any vector dependencies that are present.
  2. /fp:fast is enabled within the loop.
  3. Loops with function calls are vectorizable.
  4. Outer loops are vectorizable.
  5. Nested loops can be coalesced into one loop and vectorized.
  6. Hybrid acceleration is achievable with #pragma omp for simd to enable coarse-grained multithreading and fine-grained vectorization.

In addition, the OpenMP SIMD directive can take the following clauses to further enhance the vectorization:

  • simdlen(length) : specify the number of vector lanes
  • safelen(length) : specify the vector dependency distance
  • linear(list[ : linear-step]) : the linear mapping from the loop induction variable to the array subscript
  • aligned(list[ : alignment]): the alignment of data
  • private(list) : specify data privatization
  • lastprivate(list) : specify data privatization with final value from the last iteration
  • reduction(reduction-identifier : list) : specify customized reduction operations
  • collapse(n) : coalescing loop nest
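As a rough illustration of how clauses attach to the directive, here is a sketch of a dot product that uses reduction and simdlen. The function name dot is hypothetical, and the choice of 8 for simdlen is an arbitrary assumption. How much a given compiler honors each clause varies, so treat this as a syntax example rather than a performance recipe.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: dot product of two equally sized arrays.
float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const float* ap = a.data();
    const float* bp = b.data();
    const int n = static_cast<int>(a.size());
    float sum = 0.0f;

    // reduction(+ : sum): lanes may keep partial sums that are combined
    // after the loop. simdlen(8): a preference for eight iterations per
    // vector step (an assumed value for this sketch).
#pragma omp simd reduction(+ : sum) simdlen(8)
    for (int i = 0; i < n; i++)
    {
        sum += ap[i] * bp[i];
    }
    return sum;
}
```

Without the reduction clause, the update to sum would be a dependency carried from one iteration to the next; the clause states explicitly that the partial sums may be accumulated independently.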

New -openmp:experimental switch

An OpenMP-SIMD-annotated program can be compiled with a new CL switch, -openmp:experimental. This new switch enables additional OpenMP features not available under -openmp. While the name of this switch is “experimental”, the switch itself and the functionality it enables are fully supported and production-ready. The name reflects that it doesn’t enable any complete subset or version of an OpenMP standard. Future iterations of the compiler may use this switch to enable additional OpenMP features, and new OpenMP-related switches may be added. The -openmp:experimental switch subsumes the -openmp switch, which means it is compatible with all OpenMP 2.0 features. Note that the SIMD directive and its clauses cannot be compiled with the -openmp switch.

For loops that are not vectorized, the compiler issues a message for each one. For example:

cl -O2 -openmp:experimental mycode.cpp
mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason '1200'

For loops that are vectorized, the compiler keeps silent unless a vectorization logging switch is provided:

cl -O2 -openmp:experimental -Qvec-report:2 mycode.cpp
mycode.cpp(84) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(90) : info C5002: Omp simd loop not vectorized due to reason '1200'
mycode.cpp(96) : info C5001: Omp simd loop vectorized

As the first step in supporting OpenMP SIMD, we have hooked up the SIMD pragma with the backend vectorizer under the new switch. We focused on vectorizing innermost loops by improving the vectorizer and alias analysis. None of the SIMD clauses are effective in Visual Studio 2019 at the time of this writing. They are parsed but ignored by the compiler, which issues a warning to make the user aware. For example, the compiler will issue

warning C4849: OpenMP ‘simdlen’ clause ignored in ‘simd’ directive

for the following code:
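The original listing is not available in this copy of the post. As a stand-in, here is a minimal sketch of a loop carrying a simdlen clause, which per the description above would be parsed but ignored with a warning. The function name increment_all and the loop body are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical loop used only to show where the simdlen clause sits.
void increment_all(int* data, int count)
{
    assert(count >= 0);

    // simdlen(8) states a preference for eight iterations per vector step.
#pragma omp simd simdlen(8)
    for (int i = 0; i < count; i++)
    {
        data[i] = data[i] + 1;
    }
}
```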

More about the semantics of OpenMP SIMD directive

The OpenMP SIMD directive gives users a way to direct the compiler to vectorize a loop. The compiler is allowed to ignore the apparent legality of such vectorization by accepting the user’s promise of correctness. If the vectorization leads to unexpected behavior, that is the user’s responsibility. By annotating a loop with the OpenMP SIMD directive, users signal that they intend multiple loop iterations to execute simultaneously. This gives the compiler a lot of freedom to generate machine code that takes advantage of SIMD or vector resources on the target processor. While the compiler is not responsible for checking the correctness or profitability of such user-specified parallelism, it must still preserve the sequential behavior of a single loop iteration.

For example, the following loop is annotated with the OpenMP SIMD directive. There is no perfect parallelism among loop iterations since there is a backward dependency from a[i] to a[i-1]. But because of the SIMD directive the compiler is still allowed to pack consecutive iterations of the first statement into one vector instruction and run them in parallel.
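The listing itself is missing from this copy, so the following is a hypothetical reconstruction based only on the names the surrounding text uses (a, b, c, and bar). The bar defined here is a stand-in that counts its calls; the real function is not shown in the post.

```cpp
#include <cassert>

// Hypothetical stand-in for bar(): counts how often it is called.
static int bar_count = 0;
void bar(int /*i*/) { ++bar_count; }

// Sketch of the annotated loop described in the text.
void annotated_loop(int* a, int* b, const int* c, int count)
{
    assert(count >= 1);
#pragma omp simd
    for (int i = 1; i < count; i++)
    {
        a[i] = a[i - 1] + 1; // reads the element written by the previous iteration
        b[i] = *c + 1;       // *c might alias an element of a or b
        bar(i);              // a function call inside the loop
    }
}
```

The values that end up in a depend on whether the compiler actually packs iterations together, which is exactly the risk the user accepts by adding the directive.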

Therefore, the following transformed vector form of the loop is legal because the compiler keeps the sequential behavior of each original loop iteration. In other words, a[i] is written after a[i-1] is read, b[i] comes after a[i], and the call to bar happens last.
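The transformed listing is also missing here. The sketch below emulates a 4-wide vector form in plain C++ so that it can be run. The width of 4, the stand-in bar, and the assumption that count - 1 is a multiple of 4 (so no remainder loop is needed) are all choices made for this illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for bar(): counts how often it is called.
static int bar_count = 0;
void bar(int /*i*/) { ++bar_count; }

// Emulates four consecutive iterations packed into one vector step.
void vector_form(int* a, int* b, const int* c, int count)
{
    const int W = 4; // assumed vector width
    assert(count >= 1 && (count - 1) % W == 0);

    for (int i = 1; i < count; i += W)
    {
        // a[i..i+3] = a[i-1..i+2] + 1: all four lanes load before any store.
        int t[W];
        for (int l = 0; l < W; l++) t[l] = a[i - 1 + l] + 1;
        for (int l = 0; l < W; l++) a[i + l] = t[l];

        // b[i..i+3] = *c + 1: *c is still read inside the loop.
        const int s = *c + 1;
        for (int l = 0; l < W; l++) b[i + l] = s;

        // The calls stay scalar and come after the stores to a and b.
        for (int l = 0; l < W; l++) bar(i + l);
    }
}
```

With a zero-filled a and count equal to 9, a serial run would produce a[i] = i, while this packed version produces 1, 1, 1, 1, 2, 1, 1, 1 for a[1] through a[8]. Within each original iteration the order of the three statements is preserved, but values are no longer carried between iterations that share a vector step.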

It is illegal to move the memory reference *c out of the loop if it may alias with a[i] or b[i]. It’s also illegal to reorder the statements inside one original iteration if it breaks the sequential dependency. As an example, the following transformed loop is not legal.
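To make the aliasing point concrete, here is a runnable sketch that keeps only the statement writing b and compares two 4-wide emulations: one that reads *c inside the loop and one that hoists the read out of it. The function names, the width of 4, and the assumption that n is a multiple of 4 are hypothetical choices for this illustration, not the post's original listing.

```cpp
#include <cassert>

// Reads *c in every vector step, as the legal vector form does.
void read_inside_loop(int* b, const int* c, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4)
    {
        const int s = *c + 1;
        for (int l = 0; l < 4; l++) b[i + l] = s;
    }
}

// Hoists the read of *c out of the loop. This is only equivalent
// when c never points into b.
void read_hoisted(int* b, const int* c, int n)
{
    assert(n % 4 == 0);
    const int t = *c;
    for (int i = 0; i < n; i += 4)
    {
        for (int l = 0; l < 4; l++) b[i + l] = t + 1;
    }
}
```

If c points at b[3] and b starts out zero-filled with n equal to 8, the first version stores 1 into b[0] through b[3] and then 2 into b[4] through b[7], because the second step sees the updated b[3]. The hoisted version stores 1 everywhere, so the two disagree as soon as the alias exists.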

 

Future Plans and Feedback

We encourage you to try out this new feature. As always, we welcome your feedback. If you see an OpenMP SIMD loop that you expect to be vectorized but isn’t, or the generated code is not optimal, please let us know. We can be reached via the comments below, via email (visualcpp@microsoft.com), Twitter (@visualc), or via Developer Community.

Moving forward, we’d love to hear which OpenMP features you find missing in Visual Studio. There have been several major revisions of OpenMP since the 2.0 standard, and OpenMP now has many features that make it easier to build high-performance programs. For instance, task-based concurrent programming is available starting from OpenMP 3.0. Heterogeneous computing (CPU + accelerators) is supported in OpenMP 4.0. Advanced SIMD vectorization and DOACROSS loop parallelization support are also available in the latest OpenMP standard. Please check out the complete standard revisions and feature sets on the official OpenMP website: https://www.openmp.org. We sincerely ask for your thoughts on the specific OpenMP features you would like to see. We’re also interested in hearing about how you’re using OpenMP to accelerate your code. Your feedback is critical: it will help drive the direction of OpenMP support in Visual Studio.

 

Hongtao Yu


11 comments

  • Yupeng Zhang

    Well, more SIMD is certainly welcome; however, I have a few questions here:
    1. /openmp is not compatible with /permissive- or /Zc:twoPhase, and in the case of my team, /permissive- got the priority. Does /openmp:experimental still block two-phase?
    2. Why can’t we get these benefits by default? The ‘simdlen(8)’ directive is no help if I am trying to vectorize 32 floats before AVX2. And I fail to see any reason the compiler needs this ‘simdlen(8)’ information when /arch:sse4 or /arch:avx2 is already specified.
    3. Regarding #pragma omp private / lastprivate, I fail to see any reason I need to care about data ownership or access, as in the multi-threaded world, in a single-threaded vectorized function…

  • Hongtao Yu

    Hi Yupeng,

     Thanks very much for your feedback.

    1. The incompatibility with /permissive- /Zc:twoPhase is a known issue and we are working on a fix. It is due to the recent upgrade of our front-end parser.

    2. Regarding the usage of simdlen, it’s a good question. The OpenMP standard defines the simdlen number as a preferred number of loop iterations to be executed simultaneously. The standard also gives the compiler freedom in how to interpret that number. The number will mostly be overridden by a target switch like /arch:AVX, which makes it sound useless, though this is not yet implemented in Visual Studio. However, there are a couple of situations where the number may matter. For example,

       a. The simdlen number is a hint to the compiler about how many iterations can be executed in parallel. When the number is above the number of actual SIMD lanes specified by the target switch, the compiler still has the freedom to unroll the loop and shuffle the instructions around to improve ILP.

       b. With the use of #pragma omp declare simd for a function declaration, the simdlen number can be used to call the exact SIMD implementation of that function when called in a loop.

    3. The private and lastprivate clauses also serve as hints to the compiler to expand scalars to avoid WAW/WAR dependencies. For example, with the declaration of variable “b” as private, the compiler can promote “b” to an array to reduce the vector dependencies.

    #pragma omp simd private(b)
    for (i = 0; i < N; i++)
    {
        b = a[i];
        foo(b);
        c[i] = b;
    }

  • simmse

    Hello,
    Is it allowed to discuss target release dates for the other aspects?  For example, user defined reduction uses the
    #pragma omp declare reduction(theReductionName : UserDefinedType : omp_out = “function body” to determine using omp_in or omp_out) initializer

    and other variations in the OpenMP notation.  Then the named reduction above is used in the simd directive(s) later.  Is the above on the supported list of OpenMP simd features?
    Sincerely,
    simmse

    • Hongtao Yu

      Hello,  

      Thanks for your valuable feedback. We don’t have a fixed release schedule for future OpenMP features yet. We are currently collecting feature requests based on users’ requirements. Once that is done, we’ll work on a ship schedule. Therefore your feedback is really important to us. Can I ask a bit more about your OpenMP scenario? Were you thinking about using the user-defined reduction with OpenMP SIMD or OpenMP for?

      • simmse

        Hello,
        I have two topics to understand customer requirements.  The easy statement describing OpenMP user requirements, and pleasing us 100%, is to fully implement support of OpenMP 4.5 and/or newer. 
        If that functionality is too large to release in the next 30 to 60 days, then all of the clauses listed above are a good start:
        simdlen(length) : specify the number of vector lanes
        safelen(length) : specify the vector dependency distance
        linear(list[ : linear-step]) : the linear mapping from loop induction variable to array subscription
        aligned(list[ : alignment]): the alignment of data
        private(list) : specify data privatization
        lastprivate(list) : specify data privatization with final value from the last iteration
        reduction(reduction-identifier : list) : specify customized reduction operations
        collapse(n) : coalescing loop nest
        The company I work for will then verify functionality and use accordingly.
        Regarding the user-defined reduction details, the best performance I have so far with a single dimension array size range from as few as ten elements, to as large as one billion elements, is the combination of the parallel and simd directives in a single pragma:
        #pragma omp parallel for simd default(none) firstprivate(theMaxValue) shared(pArrayOfDoubles) reduction(theReductionName : instanceNameOfUserStructure)
        followed by a normal looking for loop with a size_t loop control variable and condition checking by
        i < theMaxValue
        When I separate the parallel and simd into separate pragma directives on separate source code lines, and use scoping braces accordingly, the total time increases.  If I use only parallel or only simd, then the total time increases more than the separated parallel and simd combination. 
        In other words, supporting only one aspect of the pragma directive, for example simd, will likely not be as fast as possible.  Supporting the combination of the parallel and simd directives in the single pragma line above is preferred.  Users need multiple OpenMP clauses to correctly determine which aspect(s) will have the best performance.  It is understood that the performance test results will vary between array sizes and algorithm implementations.  Plus, supporting all of the various ways to use simd, parallel, and for allows users to test and select the directives and clauses that perform best.
        Within the instanceNameOfUserStructure above are two data members.  One of them happens to be a size_t, tracking the index position of the reduction search.  The other data element is the value at that index position.  With user-defined reduction implemented this way, the functionality of the deprecated Intel Cilk library (e.g. https://software.intel.com/en-us/forums/intel-cilk-plus/topic/745556) is available through OpenMP user-defined reduction.
        Additional details on user-defined reduction functionality can be found in the book titled Using OpenMP – The Next Step by Ruud van der Pas, Eric Stotzer, and Christian Terboven, section 2.4.3., MIT Press, 2017, 978-0-262-53478-9 (e.g. use no dashes when searching the Internet).

        Sincerely,
        simmse

  • Owens, Bowie (Data61, Clayton)

    I was hoping to use OpenMP to unify the approach to threading to simplify control and improve performance of several of our libraries that work together. I was hoping to utilise the task features of OpenMP 3.0. Is there some more formal way I should request that as a feature?
