Our previous blog post about OpenMP support in Visual Studio announced support for the loop `collapse`

feature in version 17.8.

In the meantime, we have continued working on improving the feature and, in Visual Studio 2022 17.10, we have added a new, more efficient algorithm for partitioning work of certain types of collapsed triangular loops supported by the OpenMP Standard 5.2. All this work continues to be accessible if you are using the `-openmp:llvm`

switch (see the Improved OpenMP Support for C++ in Visual Studio blogpost for details about this switch). In this blog we will describe this algorithm in some detail:

## Triangular loops in OpenMP 5.2

In the most recent OpenMP Standard 5.2, inner loops’ lower and upper bounds can depend on outer loops’ induction variables, making the loop space non-rectangular. For example, a double nested loop case like this:

```
void bar(float *a, int i, int j);
void foo(float *a)
{
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++)
for (int j = i; j < M; j++)
bar(a, i, j);
}
```

Since a triangular loop space makes only one half of the corresponding rectangular loop space, using the rectangular space partitioning algorithm for distributing the work would be functionally correct, but only 50% efficient. To approach 100% efficiency, we have added a separate algorithm to handle collapsed triangular loops which considers only the applicable half of the rectangular space. Our current implementation covers only certain types of triangular loops, specifically double nested lower (both excluding and including the diagonal) and upper equilateral triangular one-increment loops described below. However, this could be extended to cover a wider range of triangular loop types.

We begin by identifying a supported nested loop case by calling a new function `kmp_identify_nested_loop_structure`

from `__kmpc_for_collapsed_init`

.

```
/**************************************************************************
* Identify nested loop structure - loops come in the canonical form
* Lower triangle matrix: i = 0; i <= M; i++ {0,0}:{M,0}
* j = 0; j <= 0/-1+1*i; j++ {0,0}:{0/-1,1}
* Upper Triangle matrix
* i = 0; i <= M; i++ {0,0}:{M,0}
* j = 0+1*i; j <= M; j++ {0,1}:{M,0}
* ************************************************************************/
nested_loop_type_t
kmp_identify_nested_loop_structure(/*in*/ bounds_info_t *original_bounds_nest,
/*in*/ kmp_index_t n)
```

If the loop in question is of a supported type, we call the corresponding function, i.e. `kmp_handle_lower_triangle_matrix`

or `kmp_handle_upper_triangle_matrix`

, to handle work partitioning, with each of these functions returning the chunk assigned to the current thread for execution.

We first describe the algorithm that handles the case of a lower triangular loop that excludes the diagonal, i.e.:

```
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++)
for (int j = 0; j < i; j++) {
//...
}
```

For this case, the total number of iterations in the combined loop space will be `T=M*(M-1)/2`

, and the most efficient work distribution between `N`

threads will be for each of them to receive a chunk of `T/N`

iterations, with the first `T%N`

threads getting one iteration extra. We assign chunks of combined space iterations to threads sequentially. The chunk for thread `n`

begins at the (0-based) iteration number equal to the number of iterations executed by all the previous threads `A=n*T/N`

and ends at `A+T/N`

.

We then translate `A`

into the `{Is,Js}`

tuple describing the chunk’s start iteration. `Is`

is the biggest such outer loop iteration that satisfies the condition `(Is-1)*Is/2 <= A => Is^2-Is-2A <= 0`

. We then solve `Is^2-Is-2A = 0`

for `Is`

, i.e., `Is = (sqrt(1+8*A)-1)/2`

. `Js`

is then the remaining number of inner loop iterations required to reach `A`

, i.e. `Js =A-Is*(Is-1)`

. We then apply the same approach to translate `A+T/N`

into the `{Ie,Je}`

tuple describing the chunk’s end iteration. This tuple pair represents the work assigned to thread `n`

and is given back to it for execution.

To avoid the use of floating-point arithmetic, we use an additional function `sqrt_newton_approx`

to calculate `sqrt`

using the Newton method. The method is iterative, and since the chunk calculation rounds the result down to the nearest integer, we can limit its precision to 0.1 to reduce the number of iterations. The method converges quickly, with less than 10 iterations to achieve the result for M values below 10^3, and less than 20 iterations for M values below 10^9.

The use of rounding in calculating chunk boundaries may introduce the risk of missing or duplicate iterations when dividing thread work. However, our implementation does not carry such risk since the same algorithm with the same inputs is used to calculate both the end iteration of thread `n`

chunk and the start iteration of thread `n+1`

chunk. This ensures the absence of gaps and overlaps between threads.

We now consider a lower triangle loop that includes the diagonal, i.e.

```
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++)
for (int j = 0; j <= i; j++) {
//...
}
```

For this case we can employ the same approach as above, with the main difference that `T=M*(M+1)/2`

instead of `T=M*(M-1)/2`

and the subsequent `Is`

conditions are similarly `(Is+1)*Is`

rather than `(Is-1)*Is`

. `kmp_handle_lower_triangle_matrix`

handles this simple difference by parameterizing calculations based on the loop description.

We now consider an upper triangular loop, i.e.

```
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++)
for (int j = i; j < M; j++) {
//...
}
```

This loop is a flipped case of the lower triangular loop including the diagonal. `kmp_handle_upper_triangle_matrix`

thus follows the same algorithm as `kmp_handle_lower_triangle_matrix`

for the lower triangular loop that includes the diagonal and then simply flips the result back to the upper triangular case. With some further parameterization, these two functions could be combined into one, but we decided to keep them separate for code clarity.

While the performance benefits of the improved handling for these triangular loop types will always depend on the nature and size of the underlying workload, including its micro-architectural behavior, our tests show that the new code achieves the perfect distribution of the supported triangular loops workload between threads, giving the expected up to 2x performance benefit on simulated perfectly scaling workloads.

## Feedback

We encourage you to try this new code in Visual Studio 2022 version 17.10 or newer. It has also been up-streamed into LLVM (https://github.com/llvm/llvm-project/pull/83939). As always, we welcome your feedback which we could use to continue improving the loop collapse feature. Please share your thoughts, comments and questions with us through Developer Community. You can also reach us on X @VisualC, or via email at visualcpp@microsoft.com.

## 0 comments