{"id":31286,"date":"2022-11-29T16:00:19","date_gmt":"2022-11-29T16:00:19","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=31286"},"modified":"2022-11-30T08:04:00","modified_gmt":"2022-11-30T08:04:00","slug":"msvc-openmp-update","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-openmp-update\/","title":{"rendered":"MSVC OpenMP Update"},"content":{"rendered":"<p>In our previous <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/openmp-task-support-for-c-in-visual-studio\/\">blog post<\/a>, we announced support for OpenMP tasks starting with Visual Studio 17.2. Now, we are pleased to announce we have added further OpenMP features to Visual Studio 17.4, which brings us closer to conformance with OpenMP 3.1.<\/p>\n<h2><code>#pragma atomic<\/code> with OpenMP 3.1 semantics<\/h2>\n<p>We added support for <code>#pragma omp atomic<\/code> a while ago but we now also support the full OpenMP 3.1 syntax and semantics for atomic operations. Specifically, we now support a <code>read<\/code>, <code>write<\/code>, <code>update<\/code> or <code>capture<\/code> clause in the pragma while the pragma can now apply either to an expression-statement (as before) or a structured block, which has particular restrictions that the compiler will check.<\/p>\n<p>When the compiler encounters the new OpenMP atomic clauses, it will make sure that the LLVM OpenMP runtime (<code>libomp<\/code>) is being used:<\/p>\n<pre><code class=\"language-text\">example.cpp(14): error C7660: '#pragma omp atomic update': requires '-openmp:llvm' command line option(s)<\/code><\/pre>\n<p>This is because we support the newer semantics only on the new LLVM-based OpenMP runtime.<\/p>\n<p><code>omp atomic<\/code> may seem like a duplication of <code>omp critical<\/code> but it is different in that <code>omp critical<\/code> is a generalized mutual exclusion mechanism and can wrap any kind of code, while <code>omp atomic<\/code> limits the kinds of operations 
that it supports. Based on these restrictions, the compiler can, in principle, generate more optimized code. For example, a critical section always requires acquiring a lock from the underlying operating system, but an atomic operation can use the underlying hardware guarantees to avoid such locking for, say, loads or stores of variables smaller than a register.<\/p>\n<p>Consider this example from the OpenMP 3.1 Specification:<\/p>\n<pre><code class=\"language-c++\">int work1(int i);\r\nint work2(int i);\r\n\r\nvoid atomic_example(int* x, int* y, int* index, int n)\r\n{\r\n    int i;\r\n    #pragma omp parallel for shared(x, y, index, n)\r\n    for (i = 0; i &lt; n; i++) {\r\n        #pragma omp atomic update\r\n        x[index[i]] += work1(i);\r\n\r\n        y[i] += work2(i);\r\n    }\r\n}<\/code><\/pre>\n<p>Compiling the above for x86 with full optimizations, this is what gets generated:<\/p>\n<pre><code class=\"language-nasm\">    push    esi\r\n    call    ?work1@@YAHH@Z              ; work1\r\n\r\n; 9    : #pragma omp atomic update\r\n; 10   : x[index[i]] += work1(i);\r\n\r\n    mov ecx, DWORD PTR _x$[esp+20]\r\n    push    eax\r\n    mov eax, DWORD PTR _index$[esp+24]\r\n    mov eax, DWORD PTR [eax+edi]\r\n    lea eax, DWORD PTR [ecx+eax*4]\r\n    push    eax\r\n    push    ebx\r\n    push    0\r\n    call    ___kmpc_atomic_fixed4_add\r\n\r\n; 11   :         y[i] += work2(i);\r\n\r\n    push    esi\r\n    call    ?work2@@YAHH@Z              ; work2\r\n    add DWORD PTR [edi], eax<\/code><\/pre>\n<p>Note that to update <code>x[index[i]]<\/code>, the code first calculates the address of that array location, and then calls the <code>libomp<\/code> API <code>__kmpc_atomic_fixed4_add<\/code> to do the actual update atomically, while for the subsequent update of <code>y[i]<\/code>, the code is just an <code>add<\/code> instruction.<\/p>\n<p>Given that the OpenMP atomic operations are meant to be an especially efficient form of critical section, it&#8217;s possible 
to optimize the above code by generating the code for the <code>__kmpc_atomic_fixed4_add<\/code> library call inline, avoiding a function call. We don&#8217;t currently do this, but this work is planned for future versions of MSVC.<\/p>\n<p>We now also support <code>capture<\/code> as a clause for <code>omp atomic<\/code>, with both the expression-statement and structured-block syntax. Using the <code>capture<\/code> clause allows atomic update of an l-value while capturing its initial or final value at the same time. E.g., consider a team of threads to which we have to allocate work. Assume the work is allocated based on &#8220;slots&#8221; identified by a variable <code>slot<\/code>, the idea being that each thread gets assigned a different value of this variable. This could be implemented using atomic capture in this way:<\/p>\n<pre><code class=\"language-c++\">void assign_work()\r\n{\r\n    int slot = 0;\r\n    int my_slot;\r\n    const int max_slot = 1'000'000;\r\n\r\n    #pragma omp parallel private(my_slot)\r\n    while (slot &lt; max_slot)\r\n    {\r\n        \/\/ Get the current value of slot and update it.\r\n        \/\/ Note that all threads are going through the\r\n        \/\/ slots in parallel\r\n        #pragma omp atomic capture\r\n        { my_slot = slot; ++slot; }\r\n        do_work(my_slot);\r\n    }\r\n}<\/code><\/pre>\n<p>Each parallel thread running the loop body will atomically save the current value of <code>slot<\/code> into its private variable <code>my_slot<\/code> and then increment <code>slot<\/code>, the whole operation being executed atomically with respect to other threads. 
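<\/p>\n<p>Conceptually, the capture block performs a fetch-and-increment: it reads the old value and bumps the counter as one indivisible step. As an illustration only (this models the semantics in terms of <code>std::atomic<\/code> and is not what the compiler generates; the name <code>claim_slot<\/code> is ours):<\/p>

```cpp
#include <atomic>

// Illustrative model of '#pragma omp atomic capture' applied to
// { my_slot = slot; ++slot; }: read the old value and increment it
// as one indivisible fetch-and-add.
std::atomic<int> slot{0};

int claim_slot()
{
    return slot.fetch_add(1);  // the value slot held before the increment
}
```

<p>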
Consequently, no two threads will get the same value of <code>slot<\/code> passed to <code>do_work<\/code>, and eventually all values up to <code>max_slot<\/code> will be allocated.<\/p>\n<p>We could also write the above atomic operation more compactly using the expression-statement version of <code>capture<\/code>:<\/p>\n<pre><code class=\"language-c++\">#pragma omp atomic capture\r\nmy_slot = slot++;<\/code><\/pre>\n<p>We have also added diagnostics for the expression forms required by <code>omp atomic<\/code>. E.g.:<\/p>\n<pre><code class=\"language-c++\">#pragma omp atomic\r\n{ v = x; +x; }<\/code><\/pre>\n<p>produces:<\/p>\n<pre><code class=\"language-text\">.\\atomic-capture-block.c(14,24): error C3048: '#pragma omp atomic capture': expression or block-statement following pragma does not conform to the OpenMP specification\r\n            v = x; +x;\r\n                   ^<\/code><\/pre>\n<p>Attempting to use an overloaded operator in a capture block or expression gives:<\/p>\n<pre><code class=\"language-text\">.\\atomic_capture_neg.cpp(18,11): error C3943: '#pragma omp atomic': operator '+=' is overloaded; only built-in operators are allowed\r\n    x += s;\r\n      ^<\/code><\/pre>\n<p>We&#8217;ve added diagnostics to help with validating the semantics and syntax of <code>#pragma omp atomic<\/code>, but one thing should be borne in mind: because MSVC doesn&#8217;t print expressions in diagnostics, using <code>\/diagnostics:caret<\/code> is helpful in getting the most from the new diagnostics. 
E.g.,<\/p>\n<pre><code class=\"language-c++\">int test(int initial)\r\n{\r\n    int v, x;\r\n    #pragma omp atomic capture\r\n    {\r\n        v = x; v = v + 1;\r\n    }\r\n    return v;\r\n}<\/code><\/pre>\n<p>produces<\/p>\n<pre><code class=\"language-text\">.\\atomic-capture-block.cpp(6,20): error C5300: '#pragma omp atomic capture': expression mismatch for lvalue being updated\r\n            v = x; v = v + 1;\r\n                   ^\r\n.\\atomic-capture-block.cpp(6,17): note: see the lvalue expression here\r\n            v = x; v = v + 1;\r\n                ^<\/code><\/pre>\n<p>Without <code>\/diagnostics:caret<\/code> we would have just the line numbers, which don&#8217;t help in understanding the diagnostic.<\/p>\n<h2><code>min<\/code> and <code>max<\/code> reduction operators<\/h2>\n<p>MSVC has supported reduction operators since it implemented OpenMP 2.0; we have now added support for the <code>min<\/code> and <code>max<\/code> operators as well. Consider the simple case of determining the maximum of an array of values. Serial code to do this is given below:<\/p>\n<pre><code class=\"language-c++\">double serial_max(double* A, int size)\r\n{\r\n    double max = -DBL_MAX;\r\n    for (int i = 0; i &lt; size; ++i)\r\n        if (A[i] &gt; max)\r\n            max = A[i];\r\n    return max;\r\n}<\/code><\/pre>\n<p>Parallelizing this to run on multiple threads in a naive way requires a critical section to update <code>maxval<\/code>:<\/p>\n<pre><code class=\"language-c++\">double parallel_max(double* A, int size)\r\n{\r\n    double maxval = -DBL_MAX;\r\n    #pragma omp parallel for shared(maxval)\r\n    for (int i = 0; i &lt; size; ++i)\r\n        #pragma omp critical\r\n        if (A[i] &gt; maxval)\r\n            maxval = A[i];\r\n    return maxval;\r\n}<\/code><\/pre>\n<p>It&#8217;s obvious that the above version has a performance problem: every comparison of <code>maxval<\/code> to an array element is being done in a critical section! 
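<\/p>\n<p>For comparison, here is the per-thread-maximum approach written out by hand; a sketch only (the name <code>parallel_max_by_hand<\/code> and the serial fallbacks are ours, and it assumes the parallel region uses at most <code>omp_get_max_threads()<\/code> threads), showing the bookkeeping involved:<\/p>

```cpp
#include <cfloat>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds with OpenMP disabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Each thread records its running maximum in its own slot of 'local';
// the slots are then merged serially into a single maximum.
double parallel_max_by_hand(const double* A, int size)
{
    std::vector<double> local(omp_get_max_threads(), -DBL_MAX);

    #pragma omp parallel for
    for (int i = 0; i < size; ++i) {
        double& m = local[omp_get_thread_num()];
        if (A[i] > m)
            m = A[i];
    }

    double maxval = -DBL_MAX;  // merge the per-thread maxima
    for (double m : local)
        if (m > maxval)
            maxval = m;
    return maxval;
}
```

<p>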
To improve this, we can have each parallel thread maintain its own maximum and merge them at the end into one maximum for all of them. This would require maintaining an auxiliary vector of maximum values, one per thread, updating the right one in each thread, and finally merging the maximum values into a single one: quite a chore if written out by hand. Instead, we can take advantage of the new <code>max<\/code> reduction operator and write a simple loop:<\/p>\n<pre><code class=\"language-c++\">double parallel_max(double* A, int size)\r\n{\r\n    double maxval = -DBL_MAX;\r\n    #pragma omp parallel for reduction(max : maxval)\r\n    for (int i = 0; i &lt; size; ++i)\r\n        if (A[i] &gt; maxval) maxval = A[i];\r\n    return maxval;\r\n}<\/code><\/pre>\n<p>The above version creates a private <code>maxval<\/code> for each thread, which avoids the need for a critical section in the loop, and at the end merges them all into a single maximum. Computing the minimum would use the <code>min<\/code> reduction operator in an analogous fashion.<\/p>\n<h2>Pointers as loop-index variables for <code>#pragma omp for<\/code><\/h2>\n<p>MSVC has hitherto restricted loop-index variables to integral types, while the OpenMP specification also allows pointer types. We have now implemented this feature, so it is possible to loop over arrays in parallel using either pointers or integer indices. 
E.g.,<\/p>\n<pre><code class=\"language-c++\">void test()\r\n{\r\n    int a[100];\r\n    int *p, *begin = &amp;a[0], *end = &amp;a[100];\r\n    int k = 0;\r\n\r\n    #pragma omp parallel for\r\n    for (p = begin; p &lt; end; ++p)\r\n        *p = k;\r\n}<\/code><\/pre>\n<p>As part of a general policy of supporting newer OpenMP features only on the LLVM <code>libomp<\/code> runtime, pointer loop variables require the compiler option <code>\/openmp:llvm<\/code> to be used.<\/p>\n<p>Note: C++ iterator support is not yet implemented but is planned for the future.<\/p>\n<h2>Miscellaneous improvements: diagnostic messages and bug fixes<\/h2>\n<p>We&#8217;ve improved the accuracy and user-friendliness of OpenMP diagnostic messages in several places. E.g., for loops, we&#8217;ve added checks for loop comparison operators:<\/p>\n<pre><code class=\"language-c++\">    #pragma omp parallel for\r\n    for (p = begin; p &gt; end; ++p) *p = k;  \/\/ C5301<\/code><\/pre>\n<p>produces:<\/p>\n<pre><code class=\"language-text\">.\\loop_warnings_ptr.cpp(10,23): warning C5301: '#pragma omp for': 'p' increases while loop condition uses '&gt;'; non-terminating loop?\r\n    for (p = begin; p &gt; end; ++p) *p = k;  \/\/ C5301\r\n                      ^<\/code><\/pre>\n<p>We&#8217;ve also fixed several bugs reported by users or discovered during our testing.<\/p>\n<h2>A note about the LLVM runtime<\/h2>\n<p>Currently, the LLVM runtime that matches the compiler is based on LLVM version 11. We plan to upgrade the runtime to a more recent version in a future release, but meanwhile we&#8217;ve ported a couple of critical bug fixes: <a href=\"https:\/\/reviews.llvm.org\/rGb7b498657685d7a305987b9140253523e77fd4e1\">rGb7b498657685 (llvm.org)<\/a> and <a href=\"https:\/\/reviews.llvm.org\/rG1b968467c057df980df214a88cddac74dccff15e\">rG1b968467c057 (llvm.org)<\/a>. 
Many thanks to <a href=\"https:\/\/reviews.llvm.org\/p\/jlpeyton\/\">Jonathan Peyton<\/a>, who provided these fixes!<\/p>\n<p>With the help of our colleague <a href=\"https:\/\/reviews.llvm.org\/p\/vadikp-intel\/\">Vadim Paretsky<\/a> from Intel, we&#8217;ve upstreamed to <code>main<\/code> the changes (<a href=\"https:\/\/github.com\/llvm\/llvm-project\/commit\/f58fe2e1865d631b228d0bc78ebd4d95f752c51b\">1<\/a>, <a href=\"https:\/\/github.com\/llvm\/llvm-project\/commit\/43d5c4d5394e522be87a9a1dfda24f5ce0e3a855\">2<\/a>) that we&#8217;ve made so far to the <code>libomp<\/code> runtime. The only change not yet upstreamed is the one for atomics on ARM64.<\/p>\n<p>We&#8217;re interested in hearing from you if you want to build your own <code>libomp<\/code> for Windows; if so, please let us know in the blog comments.<\/p>\n<p>A word of caution: we had to accept a breaking change in 17.4, where the ordinals of the exported symbols were changed. Because of this, older <code>libomp140<\/code> runtime binaries won&#8217;t work with code built with the newer <code>libomp.lib<\/code>, and vice versa. The best thing to do is to re-build all code using <code>\/openmp:llvm<\/code>.<\/p>\n<h2>Summary<\/h2>\n<p>MSVC continues to improve its OpenMP support, and a full, optimized implementation of OpenMP 3.1 is planned for the future. Based on user feedback, we may also consider support for further versions or selected features from newer versions of OpenMP. 
Please use <a href=\"https:\/\/developercommunity.visualstudio.com\/cpp\">Developer Community<\/a> to add your voice to feature requests or report bugs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary of your post, shown on the home page next to the featured image<\/p>\n","protected":false},"author":67691,"featured_media":31287,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3921],"tags":[],"class_list":["post-31286","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","category-openmp"],"acf":[],"blog_post_summary":"<p>Summary of your post, shown on the home page next to the featured image<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/31286","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/67691"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=31286"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/31286\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/31287"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=31286"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=31286"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=31286"}],"curies":[{"name":"wp","href":"https:\
/\/api.w.org\/{rel}","templated":true}]}}