OpenMP improvements in Visual Studio C/C++ compiler: loop collapse

Natalia Glagoleva

Our previous blog post about OpenMP support in Visual Studio was published a while ago. Meanwhile, in Visual Studio 2022 version 17.8, we’ve added support for the collapse feature, fixed several bugs, and made some code quality improvements. All this work is accessible if you are using the -opemp:llvm switch (see the Improved OpenMP Support for C++ in Visual Studio blogpost for details about this switch). In this blog we’ll talk about loop collapsing, which we’ve implemented on for loops according to the OpenMP Standard 5.2.

collapse clause in OpenMP Standard 3.1

The collapse clause applies to a nest of loops. The whole nest is then considered to be one giant loop over the combined loop space and can be split to run on several threads, similar to a single for loop. For example:

void bar(float *a, int i, int j);
int jl, ju, il, iu;
void foo(float *a)
{
  int i, j;
  #pragma omp for collapse(2) private(i, j)
  for (i=il; i<=iu; ++i){
    for (j=jl; j<=ju; ++j)
      bar(a,i,j);
    }
}

Here the overall loop space is all iterations from i=il, j=jl up to iu, ju (inclusive). All iterations will be split between threads more or less evenly.

In the OpenMP Standard 3.1 the loop space has to be rectangular. This means that the bounds and the increment of the inner loops can’t depend on the variables which are non-invariant in the outer loops. The example above complies with this definition: jl and ju do not depend on i and do not change between iterations of the outer loop.

With this restriction, the collapsed loop nest can have any of the schedules applicable to an ordinary for loop.

Implementation of collapse clause for rectangular loop nests: code generation

For a rectangular loop nest like in example above the compiler generates code similar to this:

void bar(float *a, int i, int j);
int jl, ju, il, iu;
void foo(float *a)
{
  int i, j;
  kmp_uint64 ij, iju;

  bounds_info_t boundsNest[2] = { { loop_type_t::loop_type_int32, 
                                    loop_type_t::loop_type_int32, 
                                    comparison_t::comp_less_or_eq, 0, 
                                    static_cast<kmp_uint64>(il), 0,
                                    static_cast<kmp_uint64>(iu), 0, 
                                    static_cast<kmp_int64>(1), 0 },
                                  { loop_type_t::loop_type_int32, 
                                    loop_type_t::loop_type_int32, 
                                    comparison_t::comp_less_or_eq, 0, 
                                    static_cast<kmp_uint64>(jl), 0,
                                    static_cast<kmp_uint64>(ju), 0, 
                                    static_cast<kmp_int64>(1), 0 } };
  kmp_uint64 originalIVs[2];

  iju = __kmpc_process_loop_nest_rectang(
        nullptr, omp_get_thread_num(), /*in/out*/ boundsNest, 2);

  #pragma omp for private(i, j, ij)
  for (ij=0; ij<=iju; ++ij){
      __kmpc_calc_original_ivs_rectang(nullptr, ij,
                    boundsNest, /*out*/ originalIVs, 2);
      i = originalIVs[0];
      j = originalIVs[1];
      bar(a,i,j);
    }
}

We generate an additional call to the runtime to calculate the overall number of iterations (__kmpc_process_loop_nest_rectang), and in the loop we call another runtime function (__kmpc_calc_original_ivs_rectang) to get the values of the original induction variables (IVs).

Implementation of collapse clause for rectangular loop nests: runtime

The code generation above is supported by two additional functions in the runtime:

__kmpc_process_loop_nest_rectang – to calculate bounds for new collapsed loop;

__kmpc_calc_original_ivs_rectang – to get the original loop induction variables’ values from the value of the new induction variable.

Information is passed from the generated code to the runtime and back using arrays of structs, which define the type of bounds and increment, type of the loop IV, what operation the loop is using and the loop bounds. This information is filled in for all loops in the nest.

Both functions and all the needed definitions are included in the Visual Studio version of LLVM OpenMP runtime (libomp140[d] dll, active with the -openmp:llvm flag), and also upstreamed into LLVM libomp. It’s included in LLVM 17.x; live has an additional performance improvement (same as in Visual Studio version).

collapse clause in OpenMP Standard 5.2

In the most recent OpenMP Standard 5.2 inner loops’ lower and upper bounds can depend on outer loops’ induction variables, e.g.:

void bar(float *a, int i, int j);

void foo(float *a)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < M; i++)
    for (int j = i; j < M; j++){
      bar(a, i, j);
    }
}

For a complete definition of what is allowed please refer to the standard.

Implementation of collapse clause for non-rectangular loop nests: code generation

Non-rectangular loop nests are generated differently from rectangular ones. The runtime gets a description of the loop nest and returns a chunk to execute, with chunk bounds expressed in the terms of the original loop IVs.

For the non-rectangular nest example above the compiler keeps all loops and generates additional checks for bounds:

void bar(float *a, int i, int j);
int jl, ju, il, iu;
void foo(float *a)
{
  int i, j;
  kmp_uint64 ij, iju;

  bounds_info_t boundsNest[2] = { { loop_type_t::loop_type_int32, 
                                    loop_type_t::loop_type_int32, 
                                    comparison_t::comp_less_or_eq, 0, 0, 0,
                                    static_cast<kmp_uint64>(M), 0, 
                                    static_cast<kmp_int64>(1), 0 },
                                  { loop_type_t::loop_type_int32, 
                                    loop_type_t::loop_type_int32, 
                                    comparison_t::comp_less_or_eq, 0, 0, 
                                    static_cast<kmp_uint64>(1),
                                    static_cast<kmp_uint64>(M), 0, 
                                    static_cast<kmp_int64>(1), 0 } };
  bounds_info_t chunkBoundsNest[2];
  kmp_int32 lastiter = false;

  #pragma omp for private(i, j, ij)
  auto haveWork = __kmpc_for_collapsed_init(nullptr, omp_get_thread_num(),
                  /*in/out*/ boundsNest, /*out*/ chunkBoundsNest, 2, /*out*/ &lastiter);

  if (haveWork) {
            int ilnew = chunkBoundsNest[0].lb0_u64;

            int jA0new = chunkBoundsNest[1].lb0_u64;
            int jA1new = 0;

            int iunew = chunkBoundsNest[0].ub0_u64;
            int junew = chunkBoundsNest[1].ub0_u64;

    for (i = ilnew; i <= iunew; ++i) {
        for (j = i * jA1new + jA0new; j <= i * jB1 + jB0; ++j) {
            if ((i >= iunew) && (j > junew)) goto done2;
            bar(a, i, j);
        }
        jA0new = jA0;
        jA1new = jA1;
    }
done2:
    ;

  }
}

This way the runtime can have different strategies for how to split the overall space, and we don’t need to change the code generation.

Implementation of collapse clause for non-rectangular loop nests: runtime

For the non-rectangular case we need only one additional exposed function in the runtime, __kmpc_for_collapsed_init, to calculate the lower and upper bounds of the chunks to execute.

The overall loop nest space is enlarged to a parallelogram (parallelepiped), taking care to not produce an overflow, then this enlarged space is split into more or less equal chunks, then we figure out how it translates back to the original space (chunks will become less evenly split during this translation).

Current implementation is generic, and should be functional with all combinations of loop bounds and increments allowed by the standard, for any number of nested loops. Performance is on par with LLVM (which is also doing a similar enlarging of the loop space).

We think that in practice collapse should be most useful for 2- or 3- level nested loops of specific shapes, e.g. triangular with the upper or lower bound being loop nest invariant. It would be great to recognize such loops and use specialized algorithms in these cases, to split overall space more evenly (not necessarily perfectly). These specializations won’t need different code generation. A recognition of the special case and branching to a different algorithm can be done in the runtime. Adding specialized cases could greatly improve performance. We believe that implementing such algorithms should be much easier in the runtime (using C and some features of C++) than doing experiments with new code generation.

Same as for the rectangular case, all runtime support is included in the Visual Studio version of the LLVM OpenMP runtime and also upstreamed into the LLVM libomp runtime.

Feedback

We encourage you to try out the collapse feature in Visual Studio 2022 version 17.8 or newer. As always, we welcome your feedback. Please share your thoughts and comments with us through Developer Community. You can also reach us on X (@VisualC), or via email at visualcpp@microsoft.com.

0 comments

Leave a comment

Feedback usabilla icon