In our previous blog post about OpenMP support in Visual Studio 2019 version 16.10, we announced support for the -openmp:llvm switch to enable the compiler to target LLVM’s OpenMP runtime library on x86, x64, and arm64 platforms. In that release, -openmp:llvm included some correctness fixes and support for unsigned indices in parallel for loops, but otherwise only supported the same OpenMP 2.0 features as -openmp.
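As a quick refresher, the switch is passed to the compiler like any other option; for example, assuming a source file named example.cpp (the file name here is just a placeholder):

cl -openmp:llvm -O2 example.cpp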
-openmp:llvm support for tasks
Starting with Visual Studio 2022 version 17.2, we have added support for the first major new OpenMP feature specific to the -openmp:llvm flag: the task directive as defined by the OpenMP 3.1 standard, including the accompanying if, private, firstprivate, shared, default, untied, mergeable, and final clauses, and the associated taskwait and taskyield directives. The compiler does not yet support task clauses added in later versions of the OpenMP standard.
The OpenMP task directive is used to specify a unit of work that can be done in parallel by any thread. A task executes once when a thread becomes available, instead of executing once per thread as a parallel region would. Creating tasks is a useful alternative to the OpenMP parallel for directive when the number of iterations is not known at the start of the calculation, such as when operating on a linked list or implementing a recursive algorithm, as the sketch below illustrates.
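For instance, a linked-list traversal might hand each node to a task. This is a minimal sketch; the ListNode type and the process function are hypothetical placeholders:

struct ListNode { int value; ListNode* next; };
void process(ListNode* node); // hypothetical per-node work, defined elsewhere

void processList(ListNode* head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // One thread walks the list and creates a task per node;
            // any available thread in the team may execute each task.
            for (ListNode* node = head; node != nullptr; node = node->next) {
                #pragma omp task firstprivate(node)
                process(node);
            }
        }
    }
}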
The following is a simple example of using the task directive to sum the elements of an array by splitting the array into pieces and having each task sum one piece.
#include <stdio.h>

int sumArrayImpl(int* arr, int arrayLength) {
    if (arrayLength == 1)
        return arr[0];
    if (arrayLength == 0)
        return 0;

    int left, right;
    int halfLength = arrayLength / 2;

    // Sum each half of the array in its own task.
    #pragma omp task default(none) firstprivate(arr, halfLength) shared(left) final(halfLength >= 2)
    {
        left = sumArrayImpl(arr, halfLength);
    }
    #pragma omp task default(none) firstprivate(arr, halfLength, arrayLength) shared(right) final(halfLength >= 2)
    {
        right = sumArrayImpl(arr + halfLength, halfLength + (arrayLength % 2));
    }
    // Wait for both child tasks before combining their results.
    #pragma omp taskwait

    return left + right;
}

void sumArray(int* array, int arrayLength) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("Sum = %d\n", sumArrayImpl(array, arrayLength));
        }
    }
}
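A minimal driver for this example might look like the following (the array contents are arbitrary):

int main() {
    int data[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    sumArray(data, 8); // prints "Sum = 36"
    return 0;
}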
Although this example does not do enough work for each element to see a speedup over a sequential for loop, and is consistent enough that a parallel for directive could accomplish the same degree of parallelism, it has a similar form to more complicated examples that would see a speedup, and it illustrates many of the task directive clauses.
The private, firstprivate, shared, and default clauses specify the scope of variables used in a task, following a similar pattern to the same clauses on the parallel directive. Note that marking a pointer as firstprivate means that each task will have its own copy of the pointer; the underlying array is still shared across all tasks.
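Here is a minimal sketch of that distinction: each task receives its own copy of the pointer, but both copies refer to the same array, so the writes below are visible to the parent after the taskwait.

#include <stdio.h>

int main() {
    int data[2] = { 0, 0 };
    int* results = data;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task firstprivate(results)
        results[0] = 1; // writes through this task's private copy of the pointer
        #pragma omp task firstprivate(results)
        results[1] = 2;
        #pragma omp taskwait
        printf("%d %d\n", data[0], data[1]); // prints "1 2"
    }
    return 0;
}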
The final clause prevents creating an excessive number of tasks by causing any further child tasks to be executed sequentially when the clause’s condition is true. An if clause, conversely, causes the current task region to execute sequentially, but it can still create child tasks that execute in parallel. The taskwait directive allows for synchronization between tasks by waiting until a task’s children complete before continuing.
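As a sketch of that difference, the hypothetical splitter below stops deferring tasks once the pieces drop below an arbitrary cutoff; the undeferred task runs its body immediately on the encountering thread, but the tasks it creates may still run in parallel:

void splitWork(int* arr, int n, int cutoff) {
    if (n <= 1)
        return;
    // When n < cutoff the if clause is false, so this task is undeferred
    // and executes immediately; its child tasks may still be deferred.
    #pragma omp task if(n >= cutoff) firstprivate(arr, n, cutoff)
    {
        splitWork(arr, n / 2, cutoff);
        splitWork(arr + n / 2, n - n / 2, cutoff);
    }
    #pragma omp taskwait
}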
A few task clauses and an additional task-related directive are missing from this example. The taskyield directive allows the runtime to suspend a task’s execution in order to execute other tasks, and is useful when a task may need to wait for some other work to complete. The mergeable and untied clauses on the task directive are optimization hints. An untied task that yields may resume on any thread, instead of only resuming on the thread that spawned the task. A mergeable task allows the runtime to reuse the data environment of its parent for the child task.
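For instance, a task that polls for an externally produced result might yield between polls so other tasks can make progress, and marking it untied lets it resume on whichever thread is free. In this sketch, tryGetResult is a hypothetical helper:

bool tryGetResult(int* out); // hypothetical: returns true once a result is ready

void waitForResult(int* out) {
    #pragma omp task untied firstprivate(out)
    {
        // Each taskyield is a scheduling point where the runtime may run
        // other tasks; an untied task may resume on a different thread.
        while (!tryGetResult(out)) {
            #pragma omp taskyield
        }
    }
}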
Now, let’s take a look at an example that shows the usefulness of task in a scenario where parallel for is insufficient. For this example, we’ll be using a data structure designed for storing words called a Trie tree. In a Trie tree, each word is stored as a path through the tree, terminating in a node marked with sequenceEnd == true. A node with sequenceEnd == true may still have children if it is a substring of another word. For example, a Trie tree containing the words “car”, “care”, and “cars” would start with a c node, which has an a as a child, which in turn has an r as a child. The r node would be marked as an end node and also have two children, an e leaf and an s leaf, both also marked as terminating nodes, like so:
   c
   |
   a
   |
   r*
  / \
 e*   s*
A parallel for directive could not traverse a tree like this because there is no random access iterator, but a tree traversal can still take advantage of parallelism by creating a task for each node visited. Consider the following code for counting and printing out all of the words in a Trie tree:
#include <iostream>
#include <unordered_map>

struct TrieNode {
    TrieNode* parent;
    std::unordered_map<char, TrieNode*> children;
    bool sequenceEnd; // true if a word ends at this node
    char currentChar;

    void Print();
    void PrintAllWordsWithSubstring();

    ~TrieNode();
};

// Print the word ending at this node by walking back up to the root.
void TrieNode::Print() {
    if (parent) {
        parent->Print();
        std::cout << currentChar;
    }
}

// Free the child nodes recursively.
TrieNode::~TrieNode() {
    for (auto child : children)
        delete child.second;
}

int totalWords;

void TrieNode::PrintAllWordsWithSubstring() {
    // Each visited node spawns a task; any available thread may execute it.
    #pragma omp task
    {
        for (auto child : this->children) {
            child.second->PrintAllWordsWithSubstring();
        }
        if (this->sequenceEnd) {
            #pragma omp critical
            {
                this->Print();
                std::cout << std::endl;
            }
            #pragma omp atomic
            ++totalWords;
        }
    }
}

void PrintAllWords(TrieNode* root) {
    totalWords = 0;
    #pragma omp parallel
    {
        #pragma omp single
        root->PrintAllWordsWithSubstring();
    }
}
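To try this out, a small driver might build the “car”/“care”/“cars” tree from earlier and print its contents. The insertWord helper below is just for illustration:

void insertWord(TrieNode* root, const char* word) {
    TrieNode* node = root;
    for (; *word != '\0'; ++word) {
        TrieNode*& child = node->children[*word];
        if (!child)
            child = new TrieNode{ node, {}, false, *word };
        node = child;
    }
    node->sequenceEnd = true;
}

int main() {
    TrieNode root{ nullptr, {}, false, '\0' };
    insertWord(&root, "car");
    insertWord(&root, "care");
    insertWord(&root, "cars");
    PrintAllWords(&root); // prints the three words, in nondeterministic order
    return 0;
}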
In this example, the PrintAllWordsWithSubstring member function recursively creates a new task for each node. The this pointer is implicitly firstprivate inside the task region. Choosing the correct data-sharing attributes for variables is especially important for task because a task’s execution is deferred instead of happening immediately, as a parallel region’s does. As a result, the function that creates the task may return before the task region is executed, and any shared variables on the stack may fall out of scope. If the this pointer were shared, a task in a member function couldn’t reliably access member variables. To see the difference more clearly, consider the following broken alternative implementation of TrieNode::PrintAllWordsWithSubstring:
void TrieNode::PrintAllWordsWithSubstring() {
    for (auto child : this->children) {
        #pragma omp task shared(child)
        {
            // This line will cause an Access Violation.
            child.second->PrintAllWordsWithSubstring();
        }
    }
    if (this->sequenceEnd) {
        this->Print();
        std::cout << std::endl;
        #pragma omp atomic
        ++totalWords;
    }
}
If a variable is only read and never written to inside a parallel region, marking it as shared doesn’t change the end result. No writes will occur during the execution of the parallel region, so all threads will see the same value. However, code outside of a task region may execute concurrently with the execution of the task. In the flawed implementation above, by the time the recursive call to child.second->PrintAllWordsWithSubstring() is made, the iterator will likely already have reached the end of children, and child.second will no longer hold a valid value.
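The straightforward fix is to give each task its own copy of child, for example by marking it firstprivate; only the pragma changes:

for (auto child : this->children) {
    #pragma omp task firstprivate(child)
    {
        // Safe: this task owns a copy of child taken at creation time.
        child.second->PrintAllWordsWithSubstring();
    }
}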
Our OpenMP Plans
As of 17.2, all of the OpenMP 2.5 standard is supported, as well as tasks and parallel for loops with unsigned indices from the OpenMP 3.1 standard. We have started the long process to support newer versions of the OpenMP standard. Our ultimate goal is to support the most recent OpenMP standard by leveraging LLVM’s OpenMP runtime, but this will take time. Our next step for OpenMP will be to support the remaining features added in the OpenMP 3.1 standard. Which features are added first will depend on your feedback. We would love to hear from you which specific OpenMP 3.1 features you would like to see first.
Feedback
We encourage you to try out this update in the latest Visual Studio 2022 version 17.2 Preview. If you encounter a correctness issue in code generated with the -openmp:llvm switch or bugs in the libomp140 DLLs shipped with Visual Studio, please let us know. We can be reached via the comments below, via Twitter (@visualc), or via Developer Community.
Any support for it in Visual Studio Code for Linux?
What kind of support are you looking for on Visual Studio Code for Linux?
The work here is for the Microsoft C++ compiler. Visual Studio Code is primarily a text editor, and can be used with any C++ compiler. Most Linux machines will come with the GCC C++ compiler installed. Both the Clang and GCC compilers already support compiling for at least OpenMP 4.0 with the -fopenmp flag.
For GROMACS (/Folding@Home) the max reduction operator in OpenMP 3.1 is most important.
We’re going to add `min` and `max`. I’m curious though, does the code use any C++ overloaded operators in reduction code? We have those on the list but not sure if there is any demand for them.
GROMACS only needs such operators for built-in types. My guess is that there is indeed little demand for them elsewhere, too.
It looks like libomp140.x86_64.dll is not included in the Visual C++ redistributable package?
In Visual Studio, the files are in the folder “debug_nonredist”.
Could we develop the application with LLVM OpenMP runtime and release it to users now?
(just copy the dll?)
Currently the /openmp:llvm switch is still experimental, which is why the DLLs are in debug_nonredist. You can redistribute these DLLs for testing purposes as described at https://docs.microsoft.com/en-us/cpp/windows/preparing-a-test-machine-to-run-a-debug-executable?view=msvc-170. As they are still experimental, we won't service these versions of libomp140 if a bug is found.
Once /openmp:llvm is moved out of experimental, we'll move the libomp DLLs to the redist so that they can be redistributed.
If you have any questions about a more specific use case, you can also contact me directly. My alias is brhagger and you can contact me at alias@microsoft.com . We’re always happy to hear about how people are using OpenMP.
@Heresy K: According to the documentation of the /openmp switch, /openmp:llvm is currently experimental because “support currently isn’t available for production code, since the required libomp DLLs aren’t redistributable.”, so I presume the answer to your question is no.
What’s the plan for /openmp with vcomp.lib? Will it stay on OpenMP 2.0 until it’s eventually deprecated?
Yes, the current plan is to keep vcomp.lib on OpenMP 2.0 until it is eventually deprecated and libomp becomes the default.