OpenMP Task Support for C++ in Visual Studio

Bran Hagger

In our previous blog post about OpenMP support in Visual Studio 2019 version 16.10, we announced support for the -openmp:llvm switch to enable the compiler to target LLVM’s OpenMP runtime library on x86, x64, and arm64 platforms. In that release, -openmp:llvm included some correctness fixes and support for unsigned indices in parallel for loops, but otherwise only supported the same OpenMP 2.0 features as -openmp.

-openmp:llvm support for tasks

Starting with Visual Studio 2022 version 17.2, we have added support for the first major new OpenMP feature specific to the -openmp:llvm flag: the task directive as defined by the OpenMP 3.1 standard, including the accompanying if, private, firstprivate, shared, default, untied, mergeable, and final clauses, and the associated taskwait and taskyield directives. The compiler does not yet support task clauses added in later versions of the OpenMP standard.

The OpenMP task directive is used to specify a unit of work that can be done in parallel by any thread. A task executes once when a thread becomes available, instead of executing once per thread as a parallel region would. Creating tasks is a useful alternative to the OpenMP parallel for directive when the number of iterations is not known at the start of the calculation, such as when operating on a linked list or implementing a recursive algorithm.

The following is a simple example of using the task directive to sum the elements of an array by splitting the array into pieces and having each task sum one piece.

#include <stdio.h>

int sumArrayImpl(int* arr, int arrayLength) {
  if (arrayLength == 1)
     return arr[0];
  if (arrayLength == 0)
     return 0;

  int left, right;
  int halfLength = arrayLength / 2;
  #pragma omp task default(none) firstprivate(arr, halfLength), shared(left) final(halfLength >= 2)
  {
     left = sumArray(arr, halfLength);
  }
  #pragma omp task default(none) firstprivate(arr, halfLength, arrayLength) shared(right) final(halfLength >= 2)
  {
     right = sumArray(arr + halfLength, halfLength + (arrayLength % 2));
  }
  #pragma omp taskwait
     return left + right;
}

int sumArray(int* array, int arrayLength)
   #pragma omp parallel
   {
      #pragma omp single
      { 
         printf("Sum = %dn", sumArrayImpl(array, arrayLength));
      }
   }
}

Although this example does not do enough work for each element to see a speedup over a sequential for loop, and is consistent enough that a parallel for directive could accomplish the same degree of paralellism, it has a similar form to more complicated examples that would see a speedup and illustrates many of the task directive clauses.

The private, firstprivate, shared, and default clauses specify the scope of variables used in a task, following a similar pattern to the same clauses on the parallel directive. Note that marking a pointer as firstprivate means that each task will have its own copy of the pinter. The underlying array is still shared across all tasks.

The final clause prevents creating an excessive number of tasks by causing any further child tasks to be executed sequentially when the clause’s condition is true. An if clause, conversely, causes the current task region to execute sequentially, but it can still create child tasks that execute in parallel. The taskwait directive allows for synchronization between tasks by waiting until a task’s children complete before continuing.

A few task clauses and an additional task-related directive are missing from this example. The taskyield directive allows the runtime to suspend a task’s execution in order to execute other tasks, and is useful when a task may need to wait for some other work to complete. The mergeable and untied clauses on the task directive are optimization hints. An untied task that yields may resume on any thread, instead of only resuming on the thread that spawned the task. A mergeable task allows the runtime to reuse the data environment of its parent for the child task.

Now, let’s take a look at an example that shows the usefulness of task in a scenario where parallel for is insufficient. For this example, we’ll be using a data structure designed for storing words called a Trie tree. In a Trie tree, each word is stored as a path through the tree, terminating in a node marked with sequenceEnd == true. A node with sequenceEnd == true may still have children if it is a substring of another word. For example, a Trie tree containing the words “car”, “care”, and “cars” would start with a c node, which has an a as a child, which in turn has an r as a child. The r node would be marked as an end node and also have two children, an e leaf and an s leaf, both also marked as terminating nodes, like so:

c
 
  a
   
    r*
   / 
  e*  s*   

A parallel for could not traverse a tree like this because there is no random access iterator, but a tree traversal can still take advantage of parallelism by creating a task for each node visited. Consider the following code for counting and printing out all of the words in a trie tree:

struct TrieNode {
   TrieNode* parent;
   std::unordered_map<char, TrieNode*> children;

   bool sequenceEnd;
   char currentChar;

   void Print();
   void PrintAllWordsWithSubstring();

   ~TrieNode();
};

void TrieNode::Print() {
  if (parent) {
     parent->Print();
     std::cout << currentChar;
  }
}

int totalWords;

void TrieNode::PrintAllWordsWithSubstring() {
   #pragma omp task
   {
      for (auto child : this->children) {
         child.second->PrintAllWordsWithSubstring();
      }

      if (this->sequenceEnd) {
          #pragma omp critical
          {
             this->Print();
             std::cout << std::endl;
          }

          #pragma omp atomic
             ++totalWords;
      }
   }
}

void PrintAllWords(TrieNode* root) {
   totalWords = 0;

   #pragma omp parallel
   {
      #pragma omp single
      root->PrintAllWordsWithSubstring();
   }
}

In this example, the PrintAllWordsWithSubstring member function recursively creates a new task for each node. The this pointer is implicitly firstprivate inside the task region. Choosing the correct data-sharing attributes for variables is especially important for task because a task’s execution is deferred instead of executed immediately, as a parallel region is. As a result, the function that creates the task may return before the task region is executed and any shared variables on the stack may fall out of scope. If the this pointer was shared, a task in a member function couldn’t reliably access member variables. To see the difference more clearly, consider the following broken alternative implementation of TrieNode::PrintAllWordsWithSubstring:

void TrieNode::PrintAllWordsWithSubstring() {
   for (auto child : this->children) {
      #pragma omp task shared(child)
      {
         // This line will cause an Access Violation.
         child.second->PrintAllWordsWithSubstring();
      }
   }

   if (this->sequenceEnd) {
      this->Print();
      std::cout << std::endl;

      #pragma omp atomic
         ++totalWords;
   }
}

If a variable is only read and never written to inside a parallel region, marking it as shared doesn’t change the end result. No writes will occur during the execution of the parallel region, so all threads will see the same value. However, code outside of a task region may execute concurrently with the execution of a task. In the flawed implementation above, by the time the recursive call to PrintAllWordsWithSubstring(child.second) is made, the iterator will likely already have reached the end of children and child.second will no longer have a valid value.

Our OpenMP Plans

As of 17.2, all of the OpenMP 2.5 standard is supported, as well as tasks and parallel for loops with unsigned indices from the OpenMP 3.1 standard. We have started the long process to support newer versions of the OpenMP standard. Our ultimate goal is to support the most recent OpenMP standard by leveraging LLVM’s OpenMP runtime, but this will take time. Our next step for OpenMP will be to support the remaining features added in the OpenMP 3.1 standard. Which features are added first will depend on your feedback. We would love to hear from you which specific OpenMP 3.1 features you would like to see first.

Feedback

We encourage you to try out this update in the latest Visual Studio 2022 version 17.2 Preview. If you encounter a correctness issue in code generated with the -openmp:llvm switch or bugs in the libomp140 DLLs shipped with Visual Studio, please let us know. We can be reached via the comments below, via twitter (@visualc) or via Developer Community.

11 comments

Leave a comment

  • Heresy K

    I looks that libomp140.x86_64.dll is not included in Visual C++ redistributable package?
    In Visual Studio, the files are in the folder “debug_nonredist”.

    Could we develop the application with LLVM OpenMP runtime and release it to users now?
    (just copy the dll?)

    • Tanveer GaniMicrosoft employee

      We’re are going to add `min` and `max`. I’m curious though, does the code use any C++ overloaded operators in reduction code? We have those on the list but not sure if there is any demand for them.

      • Mark Abraham

        GROMACS only needs such operators for built-in types. My guess is that there is indeed little demand for them elsewhere, too.

    • Bran HaggerMicrosoft employee

      What kind of support are you looking for on Visual Studio Code for Linux?

      The work here is for the Microsoft C++ compiler. Visual Studio Code is primarily a text editor, and can be used with any C++ compiler. Most Linux machines will come with the GCC C++ compiler installed. Both the Clang and GCC compilers already support compiling for at least OpenMP 4.0 with the -fopenmp flag.