MSVC OpenMP Update

Tanveer Gani

In our previous blog post, we announced support for OpenMP tasks starting with Visual Studio 17.2. Now we are pleased to announce that we have added further OpenMP features in Visual Studio 17.4, bringing us closer to conformance with OpenMP 3.1.

#pragma omp atomic with OpenMP 3.1 semantics

We added support for #pragma omp atomic a while ago, and we now also support the full OpenMP 3.1 syntax and semantics for atomic operations. Specifically, the pragma now accepts a read, write, update, or capture clause, and it can apply either to an expression statement (as before) or to a structured block, subject to restrictions that the compiler checks.
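As a quick illustration (a minimal sketch, not taken from the original post), the four clauses cover atomic loads, stores, updates, and combined update-and-capture operations:

```c
#include <assert.h>

/* Demonstrates the four atomic clauses; returns the final value of x.
   Without OpenMP enabled the pragmas are ignored and the code runs
   serially with the same results. */
int atomic_clauses_demo(void)
{
    int x = 0, v;

    #pragma omp atomic write
    x = 5;               /* atomic store */

    #pragma omp atomic read
    v = x;               /* atomic load */

    #pragma omp atomic update
    x += 2;              /* atomic read-modify-write */

    #pragma omp atomic capture
    v = x++;             /* update x, capturing its prior value in v */

    assert(v == 7);      /* captured the value before the increment */
    return x;
}
```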

When the compiler encounters the new OpenMP atomic clauses, it will make sure that the LLVM OpenMP runtime (libomp) is being used:

example.cpp(14): error C7660: '#pragma omp atomic update': requires '-openmp:llvm' command line option(s)

This is because we support the newer semantics only on the new LLVM-based OpenMP runtime.

omp atomic may seem like a duplication of omp critical, but the two differ: omp critical is a generalized mutual exclusion mechanism that can wrap any kind of code, while omp atomic limits the kinds of operations it supports. Thanks to these restrictions, the compiler can, in principle, generate more optimized code. For example, a critical section always requires acquiring a lock from the underlying operating system, but an atomic operation can use hardware guarantees to avoid such locking for, say, loads or stores of variables no larger than a register.
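To make the contrast concrete, here is a hypothetical sketch (not from the original post): both functions below compute the same sum, but the critical version may protect arbitrary code with a lock, while the atomic version protects only a single update expression that the compiler can map to a hardware atomic:

```c
#include <assert.h>

/* General mutual exclusion: any code is allowed inside the critical section. */
int sum_with_critical(const int* a, int n)
{
    int sum = 0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        { sum += a[i]; }
    }
    return sum;
}

/* Restricted form: only a single update expression, eligible for
   lock-free hardware atomics. */
int sum_with_atomic(const int* a, int n)
{
    int sum = 0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic update
        sum += a[i];
    }
    return sum;
}
```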

Consider this example from the OpenMP 3.1 Specification:

int work1(int i);
int work2(int i);

void atomic_example(int* x, int* y, int* index, int n)
{
    int i;
    #pragma omp parallel for shared(x, y, index, n)
    for (i = 0; i < n; i++) {
        #pragma omp atomic update
        x[index[i]] += work1(i);

        y[i] += work2(i);
    }
}

Compiling the above for x86 with full optimizations, this is what gets generated:

    push    esi
    call    ?work1@@YAHH@Z              ; work1

; 9    : #pragma omp atomic update
; 10   : x[index[i]] += work1(i);

    mov ecx, DWORD PTR _x$[esp+20]
    push    eax
    mov eax, DWORD PTR _index$[esp+24]
    mov eax, DWORD PTR [eax+edi]
    lea eax, DWORD PTR [ecx+eax*4]
    push    eax
    push    ebx
    push    0
    call    ___kmpc_atomic_fixed4_add

; 11   :         y[i] += work2(i);

    push    esi
    call    ?work2@@YAHH@Z              ; work2
    add DWORD PTR [edi], eax

Note that to update x[index[i]], the code first calculates the address of that array location and then calls the libomp API __kmpc_atomic_fixed4_add to do the actual update atomically, while for the subsequent update of y[i], the code is just an add instruction.

Given that the OpenMP atomic operations are meant to be an especially efficient form of critical section, it’s possible to optimize the above code further by generating the body of the __kmpc_atomic_fixed4_add library call inline, avoiding the function call. We don’t currently do this, but the work is planned for a future version of MSVC.

We now also support capture as a clause for omp atomic, with both the expression-statement and structured-block syntax. Using the capture clause allows an atomic update of an l-value while capturing its initial or final value at the same time. E.g., consider a team of threads to which we have to allocate work. Assume the work is allocated based on “slots” identified by a variable slot, the idea being that each thread gets assigned a different value of this variable. This could be implemented using atomic capture in this way:

void do_work(int my_slot);

void assign_work()
{
    int slot = 0;
    int my_slot;
    const int max_slot = 1'000'000;

    #pragma omp parallel private(my_slot)
    while (slot < max_slot)
    {
        // Get the current value of slot and update it.
        // Note that all threads are going through the
        // slots in parallel
        #pragma omp atomic capture
        { my_slot = slot; ++slot; }
        do_work(my_slot);
    }
}

Each parallel thread running the loop body will atomically save the current value of slot into its private variable my_slot and then increment slot, the whole operation being executed atomically with respect to other threads. Consequently, no two threads will get the same value of slot passed to do_work, and eventually all values up to max_slot will be allocated.

We could also write the above atomic operation more compactly using the expression-statement version of capture:

#pragma omp atomic capture
my_slot = slot++;
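As an aside (a minimal sketch with plain int variables, not taken from the original example), the expression-statement form can capture either the value before or the value after the update, depending on whether the postfix or prefix operator is used:

```c
#include <assert.h>

int capture_forms(void)
{
    int slot = 10, before, after;

    #pragma omp atomic capture
    before = slot++;   /* capture the value before the update */
    assert(before == 10 && slot == 11);

    #pragma omp atomic capture
    after = ++slot;    /* capture the value after the update */
    assert(after == 12 && slot == 12);

    return after;
}
```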

We have also added diagnostics for the expression forms that omp atomic requires. E.g.,

#pragma omp atomic capture
{ v = x; +x; }

produces:

.\atomic-capture-block.c(14,24): error C3048: '#pragma omp atomic capture': expression or block-statement following pragma does not conform to the OpenMP specification
            v = x; +x;
                   ^

Attempting to use an overloaded operator in a capture block or expression gives:

.\atomic_capture_neg.cpp(18,11): error C3943: '#pragma omp atomic': operator '+=' is overloaded; only built-in operators are allowed
    x += s;
      ^

We’ve added diagnostics to help validate the semantics and syntax of #pragma omp atomic, but one thing should be borne in mind: because MSVC doesn’t print expressions in diagnostics, using /diagnostics:caret is helpful in getting the most from the new diagnostics. E.g.,

int test(int initial)
{
    int v, x;
    #pragma omp atomic capture
    {
        v = x; v = v + 1;
    }
    return v;
}

produces

.\atomic-capture-block.cpp(6,20): error C5300: '#pragma omp atomic capture': expression mismatch for lvalue being updated
            v = x; v = v + 1;
                   ^
.\atomic-capture-block.cpp(6,17): note: see the lvalue expression here
            v = x; v = v + 1;
                ^

Without /diagnostics:caret we would have just the line numbers which don’t help in understanding the diagnostic.

min and max reduction operators

MSVC has supported reduction operators since it implemented OpenMP 2.0; we have now added support for the min and max operations as well. Consider the simple case of determining the maximum of an array of values. Serial code to do this is given below:

double serial_max(double* A, int size)
{
    double max = -DBL_MAX;  /* from <float.h> */
    for (int i = 0; i < size; ++i)
        if (A[i] > max)
            max = A[i];
    return max;
}

Parallelizing this to run on multiple threads in a naive way requires a critical section to update the shared maximum:

double parallel_max(double* A, int size)
{
    double maxval = -DBL_MAX;
    #pragma omp parallel for shared(maxval)
    for (int i = 0; i < size; ++i)
        #pragma omp critical
        if (A[i] > maxval)
            maxval = A[i];
    return maxval;
}

It’s obvious that the above version has a performance problem: every comparison of maxval to an array element is done inside a critical section! To improve this, we can have each parallel thread maintain its own maximum and merge them at the end into one maximum for all of them. That would mean maintaining an auxiliary vector of per-thread maximum values, updating the right one in each thread, and finally merging them into a single value: quite a chore if written out by hand. Instead, we can take advantage of the new max reduction operator and write a simple loop:

double parallel_max(double* A, int size)
{
    double maxval = -DBL_MAX;
    #pragma omp parallel for reduction(max : maxval)
    for (int i = 0; i < size; ++i)
        if (A[i] > maxval) maxval = A[i];
    return maxval;
}

The above version creates a private maxval for each thread, which avoids the need for a critical section in the loop, and merges the per-thread values into a single maximum at the end. Computing the minimum would use the min reduction operator in an analogous fashion.
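For completeness, the analogous minimum computation might look like this (a sketch along the same lines, using DBL_MAX from <float.h> as the starting value):

```c
#include <float.h>

double parallel_min(double* A, int size)
{
    double minval = DBL_MAX;   /* identity value for a min reduction */
    #pragma omp parallel for reduction(min : minval)
    for (int i = 0; i < size; ++i)
        if (A[i] < minval)
            minval = A[i];
    return minval;
}
```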

Pointers as loop-index variables for #pragma omp for

MSVC has hitherto restricted loop index variables to integral types, while the OpenMP specification also allows pointer types. We have now implemented this feature, and it is now possible to loop over arrays in parallel using either pointers or integer indices. E.g.,

void test()
{
    int a[100];
    int *p, *begin = &a[0], *end = &a[100];
    int k = 0;

    #pragma omp parallel for
    for (p = begin; p < end; ++p) 
        *p = k;
}

As part of a general policy of supporting newer OpenMP features only on the LLVM libomp runtime, pointer loop variables require the compiler option /openmp:llvm to be used.

Note: C++ iterator support is not yet implemented but is planned for the future.

Miscellaneous improvements: diagnostic messages and bug fixes

We’ve improved the accuracy and user friendliness of OpenMP diagnostic messages in several places. For loops, e.g., we’ve added checks that the comparison operator is consistent with the direction of the loop increment:

    #pragma omp parallel for
    for (p = begin; p > end; ++p) *p = k;  // C5301

produces:

.\loop_warnings_ptr.cpp(10,23): warning C5301: '#pragma omp for': 'p' increases while loop condition uses '>'; non-terminating loop?
    for (p = begin; p > end; ++p) *p = k;  // C5301
                      ^

We’ve also fixed several bugs reported by users or discovered during our testing.

A note about the LLVM runtime

Currently, the LLVM OpenMP runtime that ships with the compiler is based on LLVM version 11. We plan to upgrade the runtime to a more recent version in a future release; meanwhile, we’ve ported a couple of critical bug fixes: rGb7b498657685 (llvm.org) and rG1b968467c057 (llvm.org). Many thanks to Jonathan Peyton, who provided these fixes!

With the help of our colleague Vadim Paretsky from Intel, we’ve upstreamed changes (1, 2) to main that we’ve made so far to the libomp runtime. The only missing change is for atomics for ARM64.

We’re interested in hearing from you if you want to build your own libomp for Windows; if so, please let us know in the blog comments.

A word of caution: we had to accept a breaking change in 17.4 that changed the ordinals of the exported symbols. As a result, older libomp140 runtime binaries won’t work with code built against the newer libomp.lib, and vice versa. The best course is to rebuild all code using /openmp:llvm.

Summary

MSVC continues to improve its OpenMP support and a full, optimized implementation of 3.1 is planned for the future. Based on user feedback, we may consider support for further versions or selected features from newer versions of OpenMP. Please use Developer Community to add your voice to feature requests or report bugs.

2 comments

Comments are closed.

  • Bran Hagger (Microsoft employee)

    The arm64 atomic changes have now been upstreamed to LLVM main as well.

  • zdenko-podobny

    Will the OpenMP update also be available for VS 2019?
