In our previous blog post, we announced support for OpenMP tasks starting with Visual Studio 17.2. Now, we are pleased to announce we have added further OpenMP features to Visual Studio 17.4, which brings us closer to conformance with OpenMP 3.1.
#pragma omp atomic with OpenMP 3.1 semantics
We added support for #pragma omp atomic a while ago, but we now also support the full OpenMP 3.1 syntax and semantics for atomic operations. Specifically, the pragma now accepts a read, write, update, or capture clause, and it can apply either to an expression-statement (as before) or to a structured block, which is subject to particular restrictions that the compiler checks.
When the compiler encounters the new OpenMP atomic clauses, it will make sure that the LLVM OpenMP runtime (libomp) is being used:
example.cpp(14): error C7660: '#pragma omp atomic update': requires '-openmp:llvm' command line option(s)
This is because we support the newer semantics only on the new LLVM-based OpenMP runtime.
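As a rough illustration of the four clauses, the sketch below exercises each form once on a local variable. The function name and its out-parameters are hypothetical and exist only for this example. It runs single-threaded for clarity, and with MSVC it would need /openmp:llvm as described above.

```cpp
// Illustrative only: atomic_forms_demo and its parameters are made up for
// this sketch. Each pragma applies to the single statement or block after it.
int atomic_forms_demo(int* seen_after_write, int* seen_before_increment)
{
    int x = 0;
    int v = 0;

    #pragma omp atomic write
    x = 10;                      // atomic store; x is now 10

    #pragma omp atomic read
    v = x;                       // atomic load; v is now 10
    *seen_after_write = v;

    #pragma omp atomic update
    x += 5;                      // atomic read-modify-write; x is now 15

    #pragma omp atomic capture
    v = x++;                     // expression-statement form; v is 15, x is 16

    #pragma omp atomic capture
    { v = x; x += 1; }           // structured-block form; v is 16, x is 17
    *seen_before_increment = v;

    return x;
}
```

In real code the variable x would be shared among the threads of a parallel region; the forms themselves stay the same.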
omp atomic may look like a duplicate of omp critical, but the two differ: omp critical is a general mutual exclusion mechanism that can wrap any kind of code, while omp atomic limits the kinds of operations that it supports. Based on these restrictions, the compiler can, in principle, generate more optimized code. For example, a critical section always requires acquiring a lock from the underlying operating system, but an atomic operation can use the underlying hardware guarantees to avoid such locking for, say, loads or stores of variables smaller than a register.
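To make the contrast concrete, here is a minimal sketch that protects the same shared counter in both ways. The function names are hypothetical, and the example uses the clause-less form of omp atomic so that it does not depend on the newer clauses.

```cpp
// Hypothetical helpers: both count the even numbers in [0, n).
int count_even_critical(int n)
{
    int count = 0;
    #pragma omp parallel for shared(count)
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            // critical may wrap arbitrary code, here just one increment
            #pragma omp critical
            { ++count; }
        }
    }
    return count;
}

int count_even_atomic(int n)
{
    int count = 0;
    #pragma omp parallel for shared(count)
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            // atomic is restricted to a single scalar update
            #pragma omp atomic
            ++count;
        }
    }
    return count;
}
```

Both versions return the same result; the difference is only in how much freedom the compiler and runtime have in implementing the protected update.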
Consider this example from the OpenMP 3.1 Specification:
int work1(int i);
int work2(int i);
void atomic_example(int* x, int* y, int* index, int n)
{
    int i;
    #pragma omp parallel for shared(x, y, index, n)
    for (i = 0; i < n; i++) {
        #pragma omp atomic update
        x[index[i]] += work1(i);
        y[i] += work2(i);
    }
}
Compiling the above for x86 with full optimizations, this is what gets generated:
push esi
call ?work1@@YAHH@Z ; work1
; 9 : #pragma omp atomic update
; 10 : x[index[i]] += work1(i);
mov ecx, DWORD PTR _x$[esp+20]
push eax
mov eax, DWORD PTR _index$[esp+24]
mov eax, DWORD PTR [eax+edi]
lea eax, DWORD PTR [ecx+eax*4]
push eax
push ebx
push 0
call ___kmpc_atomic_fixed4_add
; 11 : y[i] += work2(i);
push esi
call ?work2@@YAHH@Z ; work2
add DWORD PTR [edi], eax
Note that to update x[index[i]], the code first calculates the address of that array location and then calls the libomp API __kmpc_atomic_fixed4_add to do the actual update atomically, while for the subsequent update of y[i], the code is just an add instruction.
Given that the OpenMP atomic operations are meant to be an especially efficient form of critical section, it’s possible to optimize the above code by generating the code for the __kmpc_atomic_fixed4_add library call inline and avoiding a function call. We don’t currently do this, but this work is planned for future versions of MSVC.
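As an analogy only, standard C++ atomics give a feel for what an atomic add without a runtime call looks like at the source level. The helper below is hypothetical and is not what MSVC generates for #pragma omp atomic; on x86, compilers typically lower fetch_add on an int to a single lock-prefixed instruction.

```cpp
#include <atomic>

// Analogy only: atomically adds delta to target and returns the value that
// target held before the addition. The helper name is made up for this sketch.
int inline_atomic_add(std::atomic<int>& target, int delta)
{
    // Relaxed ordering requests atomicity of the add without extra fences.
    return target.fetch_add(delta, std::memory_order_relaxed);
}
```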
We now also support capture as a clause for omp atomic, with both the expression-statement and structured-block syntax. Using the capture clause allows atomic update of an l-value while capturing its initial or final value at the same time. For example, consider a team of threads to which we have to allocate work. Assume the work is allocated based on “slots” identified by a variable slot, with the idea being that each thread gets assigned a different value of this variable. This could be implemented using atomic capture in this way:
void assign_work()
{
    int slot = 0;
    int my_slot;
    const int max_slot = 1'000'000;
    #pragma omp parallel private(my_slot)
    while (slot < max_slot)
    {
        // Get the current value of slot and update it.
        // Note that all threads are going through the
        // slots in parallel
        #pragma omp atomic capture
        { my_slot = slot; ++slot; }
        do_work(my_slot);
    }
}
Each parallel thread running the loop body saves the current value of slot into its private variable my_slot and then increments slot, the whole operation being executed atomically with respect to other threads. Consequently, no two threads will get the same value of slot passed to do_work, and eventually all values up to max_slot will be allocated.
We could also write the above atomic operation more compactly using the expression-statement version of capture:
#pragma omp atomic capture
my_slot = slot++;
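One subtlety is worth spelling out: a loop condition such as slot < max_slot reads the shared variable outside the atomic construct, so a thread can pass the check and still capture a value at or beyond the limit. A possible variant, sketched below, tests the captured value instead. The function assign_work_checked is hypothetical, and a per-slot visit counter stands in for do_work so that the example is self-contained.

```cpp
#include <vector>

// Hypothetical variant: hands out slots [0, max_slot) to a team of threads
// and returns how many times each slot was handed out.
std::vector<int> assign_work_checked(int max_slot)
{
    std::vector<int> visits(max_slot, 0);
    int* visit_data = visits.data();
    int slot = 0;

    #pragma omp parallel shared(slot, visit_data)
    {
        for (;;) {
            int my_slot;
            #pragma omp atomic capture
            my_slot = slot++;

            // Decide based on the captured value, never on a plain read of
            // the shared variable.
            if (my_slot >= max_slot)
                break;

            visit_data[my_slot] += 1;   // stands in for do_work(my_slot)
        }
    }
    return visits;
}
```

Because every captured value is unique, each element of the returned vector is written by exactly one thread, so the counter update itself needs no further protection.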
The compiler has added diagnostics for required expression forms for omp atomic. For example:
#pragma omp atomic capture
{ v = x; +x; }
produces:
.\atomic-capture-block.c(14,24): error C3048: '#pragma omp atomic capture': expression or block-statement following pragma does not conform to the OpenMP specification
v = x; +x;
^
Attempting to use an overloaded operator in a capture block or expression gives:
.\atomic_capture_neg.cpp(18,11): error C3943: '#pragma omp atomic': operator '+=' is overloaded; only built-in operators are allowed
x += s;
^
We’ve added diagnostics to help validate the semantics and syntax of #pragma omp atomic, but one thing should be borne in mind: because MSVC doesn’t print expressions in diagnostics, using /diagnostics:caret is helpful in getting the most from the new diagnostics. E.g.,
int test(int initial)
{
int v, x;
#pragma omp atomic capture
{
v = x; v = v + 1;
}
return v;
}
produces
.\atomic-capture-block.cpp(6,20): error C5300: '#pragma omp atomic capture': expression mismatch for lvalue being updated
v = x; v = v + 1;
^
.\atomic-capture-block.cpp(6,17): note: see the lvalue expression here
v = x; v = v + 1;
^
Without /diagnostics:caret we would have just the line numbers, which don’t help in understanding the diagnostic.
min and max reduction operators
MSVC has supported reduction operators since implementing OpenMP 2.0, and we have now added support for the min and max operations as well. Consider the simple case of determining the maximum of an array of values. Serial code to do this is given below:
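double serial_max(double* A, int size)
{
    // Start from the lowest finite double (DBL_MAX is in <cfloat>);
    // an int accumulator would truncate the array values.
    double max = -DBL_MAX;
    for (int i = 0; i < size; ++i)
        if (A[i] > max)
            max = A[i];
    return max;
}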
Parallelizing this to run on multiple threads in a naive way requires a critical section to update maxval:
double parallel_max(double* A, int size)
{
    // Lowest finite double (DBL_MAX is in <cfloat>); an int would truncate.
    double maxval = -DBL_MAX;
    #pragma omp parallel for shared(maxval)
    for (int i = 0; i < size; ++i)
        #pragma omp critical
        if (A[i] > maxval)
            maxval = A[i];
    return maxval;
}
It’s obvious that the above version has a performance problem: every comparison of maxval to an array element is being done in a critical section! To improve this, we can have each parallel thread maintain its own maximum and merge them at the end into one maximum for all of them. This would require maintaining an auxiliary vector of maximum values, one per thread, updating the right one in each thread and finally merging the maximum values into a single one, which is quite a chore if written out by hand. Instead, we can take advantage of the new max reduction operator and write a simple loop:
double parallel_max(double* A, int size)
{
    // Lowest finite double (DBL_MAX is in <cfloat>); an int would truncate.
    double maxval = -DBL_MAX;
    #pragma omp parallel for reduction(max : maxval)
    for (int i = 0; i < size; ++i)
        if (A[i] > maxval) maxval = A[i];
    return maxval;
}
The above version creates a private maxval for each thread, which avoids the need for a critical section in the loop, and at the end merges them all into a single maximum. Computing the minimum would use the min reduction operator in an analogous fashion.
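For completeness, a minimal sketch of the minimum case might look like the following. The function name parallel_min is an assumption made for this example; the accumulator is a double seeded with DBL_MAX so that any array element compares as smaller or equal.

```cpp
#include <cfloat>

// Hypothetical counterpart to parallel_max: each thread keeps a private
// minval, and the runtime merges the private copies with min at the end.
double parallel_min(const double* A, int size)
{
    double minval = DBL_MAX;
    #pragma omp parallel for reduction(min : minval)
    for (int i = 0; i < size; ++i)
        if (A[i] < minval) minval = A[i];
    return minval;
}
```

For an empty array the function returns DBL_MAX, so callers that can pass size 0 would need to handle that case separately.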
Pointers as loop-index variables for #pragma omp for
MSVC has hitherto restricted loop variables to integral types while the OpenMP specification allowed pointer types as well. We have now implemented this feature and it is now possible to loop over arrays in parallel using either pointers or integer indices. E.g.,
void test()
{
    int a[100];
    int *p, *begin = &a[0], *end = &a[100];
    int k = 0;
    #pragma omp parallel for
    for (p = begin; p < end; ++p)
        *p = k;
}
As part of a general policy of supporting newer OpenMP features only on the LLVM libomp runtime, pointer loop variables require the compiler option /openmp:llvm to be used.
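Pointer loop variables also combine naturally with other clauses. The sketch below, with a hypothetical function name, sums a half-open range using a pointer as the loop variable together with a + reduction; with MSVC it would need /openmp:llvm for the reason given above.

```cpp
// Hypothetical helper: sums the elements in [begin, end) in parallel.
long sum_range(const int* begin, const int* end)
{
    long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (const int* p = begin; p < end; ++p)
        total += *p;          // each thread adds into its private total
    return total;
}
```

A call such as sum_range(a, a + 100) covers the whole of a 100-element array a, mirroring the begin and end pointers used in the example above.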
Note: C++ iterator support is not yet implemented but is planned for the future.
Miscellaneous improvements: diagnostic messages and bug fixes
We’ve improved the accuracy or user friendliness of OpenMP diagnostic messages in several places. E.g., for loops, we’ve added checks for loop comparison operators:
#pragma omp parallel for
for (p = begin; p > end; ++p) *p = k; // C5301
produces:
.\loop_warnings_ptr.cpp(10,23): warning C5301: '#pragma omp for': 'p' increases while loop condition uses '>'; non-terminating loop?
for (p = begin; p > end; ++p) *p = k; // C5301
^
We’ve also fixed several bugs reported by users or discovered during our testing.
A note about the LLVM runtime
Currently, the LLVM runtime that matches the compiler is based on version 11. We plan to upgrade the runtime to a more recent version in a future release, but meanwhile we’ve ported a couple of critical bug fixes: rGb7b498657685 (llvm.org) and rG1b968467c057 (llvm.org). Many thanks to Jonathan Peyton, who provided these fixes!
With the help of our colleague Vadim Paretsky from Intel, we’ve upstreamed to main the changes (1, 2) that we’ve made so far to the libomp runtime. The only missing change is for atomics for ARM64.
We’re interested in hearing from you if you want to build your own libomp for Windows. If so, please reply in the blog comments.
A word of caution: we had to accept a breaking change where ordinals for the exported symbols were changed in 17.4. Because of this, older libomp140 runtime binaries won’t work with code built with the newer libomp.lib, or vice versa. The best thing to do is to rebuild all code using /openmp:llvm.
Summary
MSVC continues to improve its OpenMP support and a full, optimized implementation of 3.1 is planned for the future. Based on user feedback, we may consider support for further versions or selected features from newer versions of OpenMP. Please use Developer Community to add your voice to feature requests or report bugs.
Will the OpenMP update also be available for VS 2019?
The arm64 atomic changes have now been upstreamed to LLVM main as well.