Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result using intrinsics (part 3)

Last time, we converted our original assembly language code for creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result to C++ code with intrinsics working entirely in registers, thereby making the function eligible for leaf function optimizations.

Our last step is adding support for signed multiplication. This is a straightforward translation of the original assembly language into intrinsics.
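
Why a simple after-the-fact adjustment is enough can be sketched with a little arithmetic. The symbols here are notation introduced for this sketch, not taken from the earlier parts: x_u and y_u are the 64-bit patterns of x and y reinterpreted as unsigned, and s_x and s_y are 1 when the corresponding value is negative and 0 otherwise.

```latex
\begin{aligned}
x  &= x_u - 2^{64} s_x, \qquad y = y_u - 2^{64} s_y \\
xy &= x_u y_u - 2^{64}\,(s_x y_u + s_y x_u) + 2^{128} s_x s_y \\
   &\equiv x_u y_u - 2^{64}\,(s_x y_u + s_y x_u) \pmod{2^{128}}
\end{aligned}
```

In other words, starting from the unsigned product, subtract y from the upper 64 bits if x is negative, and subtract x from the upper 64 bits if y is negative. The 2^128 term falls off the top of a 128-bit result.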

#include <stdint.h>    // int64_t
#include <immintrin.h> // SSE2 intrinsics and _addcarry_u32

__m128i Multiply64x64To128(int64_t x, int64_t y)
{
    auto x128 = _mm_loadl_epi64((__m128i*) &x);
    auto term1 = _mm_unpacklo_epi32(x128, x128);

    auto y128 = _mm_loadl_epi64((__m128i*) &y);
    auto term2 = _mm_unpacklo_epi32(y128, y128);

    auto flip2 = _mm_shuffle_epi32(term2, _MM_SHUFFLE(1, 0, 3, 2));

    auto result = _mm_mul_epu32(term1, term2);
    auto crossterms = _mm_mul_epu32(term1, flip2);

    // Now apply the cross-terms to the provisional result
    unsigned temp;

    auto result1 = _mm_srli_si128(result, 4);
    auto carry = _addcarry_u32(0,
                               _mm_cvtsi128_si32(result1),
                               _mm_cvtsi128_si32(crossterms),
                               &temp);
    result1 = _mm_cvtsi32_si128(temp);

    auto result2 = _mm_srli_si128(result, 8);
    crossterms = _mm_srli_si128(crossterms, 4);
    carry = _addcarry_u32(carry,
                          _mm_cvtsi128_si32(result2),
                          _mm_cvtsi128_si32(crossterms),
                          &temp);
    result2 = _mm_cvtsi32_si128(temp);

    auto result3 = _mm_srli_si128(result, 12);
    _addcarry_u32(carry,
                  _mm_cvtsi128_si32(result3),
                  0,
                  &temp);
    result3 = _mm_cvtsi32_si128(temp);

    crossterms = _mm_srli_si128(crossterms, 4);
    carry = _addcarry_u32(0,
                          _mm_cvtsi128_si32(result1),
                          _mm_cvtsi128_si32(crossterms),
                          &temp);
    result1 = _mm_cvtsi32_si128(temp);

    crossterms = _mm_srli_si128(crossterms, 4);
    carry = _addcarry_u32(carry,
                          _mm_cvtsi128_si32(result2),
                          _mm_cvtsi128_si32(crossterms),
                          &temp);
    result2 = _mm_cvtsi32_si128(temp);

    _addcarry_u32(carry,
                  _mm_cvtsi128_si32(result3),
                  0,
                  &temp);
    result3 = _mm_cvtsi32_si128(temp);

    result = _mm_unpacklo_epi64(
       _mm_unpacklo_epi32(result, result1),
       _mm_unpacklo_epi32(result2, result3));

    // Apply sign adjustment: subtract (y << 64) if x is negative,
    // and subtract (x << 64) if y is negative.
    // xsign and ysign are all-ones in the upper 64 bits if the value is negative.
    __m128i xsign = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 1, 3, 2));
    xsign = _mm_srai_epi32(xsign, 31);
    __m128i ysign = _mm_shuffle_epi32(y128, _MM_SHUFFLE(1, 1, 3, 2));
    ysign = _mm_srai_epi32(ysign, 31);
    // Move each value into the upper 64 bits.
    __m128i xshift64 = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i yshift64 = _mm_shuffle_epi32(y128, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i xadjust = _mm_and_si128(xsign, yshift64);
    __m128i yadjust = _mm_and_si128(ysign, xshift64);
    result = _mm_sub_epi64(result, xadjust);
    result = _mm_sub_epi64(result, yadjust);

    return result;
}

Each of the new statements translates into a single instruction. There are still enough XMM registers available that we can do all the work in registers. (And if you look at the resulting assembly, you’ll see that the compiler reordered the operations, presumably for optimization purposes.)
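
If you want something to compare the intrinsic version against, a scalar model of the same arithmetic can be handy. The following is only a sketch and is not part of the original series: the names U128, MultiplyUnsignedSketch, and MultiplySignedSketch are made up for illustration, and the code assumes nothing beyond standard 64-bit integer arithmetic. It builds the unsigned product from four 32×32→64 partial products, then applies the sign adjustment to the upper 64 bits.

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch: a 128-bit value as two halves.
struct U128 { uint64_t lo; uint64_t hi; };

// Unsigned 64x64 -> 128 multiply built from 32x32 -> 64 partial products.
inline U128 MultiplyUnsignedSketch(uint64_t x, uint64_t y)
{
    uint64_t xlo = static_cast<uint32_t>(x), xhi = x >> 32;
    uint64_t ylo = static_cast<uint32_t>(y), yhi = y >> 32;

    uint64_t lolo = xlo * ylo;   // contributes to bits 0..63
    uint64_t hihi = xhi * yhi;   // contributes to bits 64..127
    uint64_t cross1 = xlo * yhi; // contributes to bits 32..95
    uint64_t cross2 = xhi * ylo; // contributes to bits 32..95

    // Everything that lands in bits 32..63. Three 32-bit values
    // cannot overflow a 64-bit accumulator.
    uint64_t mid = (lolo >> 32)
                 + static_cast<uint32_t>(cross1)
                 + static_cast<uint32_t>(cross2);

    U128 r;
    r.lo = (mid << 32) | static_cast<uint32_t>(lolo);
    r.hi = hihi + (cross1 >> 32) + (cross2 >> 32) + (mid >> 32);
    return r;
}

// Signed version: start with the unsigned product of the bit patterns,
// then subtract y from the upper half if x is negative, and x from the
// upper half if y is negative (all arithmetic modulo 2^64).
inline U128 MultiplySignedSketch(int64_t x, int64_t y)
{
    uint64_t ux = static_cast<uint64_t>(x);
    uint64_t uy = static_cast<uint64_t>(y);
    U128 r = MultiplyUnsignedSketch(ux, uy);
    if (x < 0) r.hi -= uy;
    if (y < 0) r.hi -= ux;
    return r;
}
```

Feeding the same inputs to both this model and the intrinsic version, and comparing the two 64-bit halves, is one way to gain confidence in the sign adjustment. Values such as -1 and INT64_MIN are good candidates because they exercise both corrections at once.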

So there you have it: 64-bit by 64-bit multiplication with a 128-bit result (either signed or unsigned) from 32-bit code, without any inline assembly. Intrinsics let you express what you want in C++ and give the variables meaningful names, and you can let the compiler do the tedious work of register assignment, something compilers are generally better at than humans anyway.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.
