{"id":97586,"date":"2017-12-15T07:00:00","date_gmt":"2017-12-15T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=97586"},"modified":"2019-03-13T01:36:32","modified_gmt":"2019-03-13T08:36:32","slug":"20171215-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20171215-00\/?p=97586","title":{"rendered":"Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result using intrinsics (part 3)"},"content":{"rendered":"<p><a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20171214-00\/?p=97577\">Last time<\/a>, we converted our original assembly language code for <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20141208-00\/?p=43453\">creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result<\/a> to C++ code with intrinsics working entirely in registers, thereby making the function eligible for leaf function optimizations. <\/p>\n<p>Our last step is adding support for signed multiplication. This is a straightforward translation of the original assembly language into intrinsics. 
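<\/p>\n<p>As a sketch of why a sign adjustment on top of the unsigned product is enough (the names <code>X<\/code>, <code>Y<\/code>, <code>s_x<\/code> and <code>s_y<\/code> are introduced here for illustration and do not appear in the code): let <code>X<\/code> and <code>Y<\/code> be the values of <code>x<\/code> and <code>y<\/code> when their 64 bits are read as unsigned, and let <code>s_x<\/code> and <code>s_y<\/code> be 1 if the corresponding signed value is negative and 0 otherwise. Two&#8217;s complement gives <\/p>\n<pre>\nx = X - 2^64 * s_x\ny = Y - 2^64 * s_y\n\nx * y = X * Y - 2^64 * (s_x * Y + s_y * X) + 2^128 * s_x * s_y\n<\/pre>\n<p>The last term vanishes modulo 2^128, so the signed product is the unsigned product with <code>Y<\/code> subtracted from the upper 64 bits when <code>x<\/code> is negative, and <code>X<\/code> subtracted from the upper 64 bits when <code>y<\/code> is negative. 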
<\/p>\n<pre>\n__m128i Multiply64x64To128(<font COLOR=\"blue\">int64_t<\/font> x, <font COLOR=\"blue\">int64_t<\/font> y)\n{\n    auto x128 = _mm_loadl_epi64((__m128i*) &amp;x);\n    auto term1 = _mm_unpacklo_epi32(x128, x128);\n\n    auto y128 = _mm_loadl_epi64((__m128i*) &amp;y);\n    auto term2 = _mm_unpacklo_epi32(y128, y128);\n\n    auto flip2 = _mm_shuffle_epi32(term2, _MM_SHUFFLE(1, 0, 3, 2));\n\n    auto result = _mm_mul_epu32(term1, term2);\n    auto crossterms = _mm_mul_epu32(term1, flip2);\n\n    \/\/ Now apply the cross-terms to the provisional result\n    unsigned temp;\n\n    auto result1 = _mm_srli_si128(result, 4);\n    auto carry = _addcarry_u32(0,\n                               _mm_cvtsi128_si32(result1),\n                               _mm_cvtsi128_si32(crossterms),\n                               &amp;temp);\n    result1 = _mm_cvtsi32_si128(temp);\n\n    auto result2 = _mm_srli_si128(result, 8);\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(carry,\n                          _mm_cvtsi128_si32(result2),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result2 = _mm_cvtsi32_si128(temp);\n\n    auto result3 = _mm_srli_si128(result, 12);\n    _addcarry_u32(carry,\n                  _mm_cvtsi128_si32(result3),\n                  0,\n                  &amp;temp);\n    result3 = _mm_cvtsi32_si128(temp);\n\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(0,\n                          _mm_cvtsi128_si32(result1),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result1 = _mm_cvtsi32_si128(temp);\n\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(carry,\n                          _mm_cvtsi128_si32(result2),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result2 = _mm_cvtsi32_si128(temp);\n\n    _addcarry_u32(carry,\n 
                 _mm_cvtsi128_si32(result3),\n                  0,\n                  &amp;temp);\n    result3 = _mm_cvtsi32_si128(temp);\n\n    result = _mm_unpacklo_epi64(\n       _mm_unpacklo_epi32(result, result1),\n       _mm_unpacklo_epi32(result2, result3));\n\n    <font COLOR=\"blue\">\/\/ Apply sign adjustment. Reading a negative x as unsigned adds 2^64\n    \/\/ to it, which adds (y &lt;&lt; 64) to the product, and likewise for y.\n    \/\/ Subtract those excess terms from the upper 64 bits.\n    __m128i xsign = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 1, 3, 2));\n    xsign = _mm_srai_epi32(xsign, 31);\n    __m128i ysign = _mm_shuffle_epi32(y128, _MM_SHUFFLE(1, 1, 3, 2));\n    ysign = _mm_srai_epi32(ysign, 31);\n    __m128i xshift64 = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 0, 3, 2));\n    __m128i yshift64 = _mm_shuffle_epi32(y128, _MM_SHUFFLE(1, 0, 3, 2));\n    \/\/ If x is negative, subtract y &lt;&lt; 64; if y is negative, subtract x &lt;&lt; 64.\n    __m128i xadjust = _mm_and_si128(xsign, yshift64);\n    __m128i yadjust = _mm_and_si128(ysign, xshift64);\n    result = _mm_sub_epi64(result, xadjust);\n    result = _mm_sub_epi64(result, yadjust);<\/font>\n\n    return result;\n}\n<\/pre>\n<p>Each of the new statements translates into a single instruction. There are still enough XMM registers available that we can do all the work in registers. (And if you look at the resulting assembly, you&#8217;ll see that the compiler reordered the operations, presumably for optimization purposes.) <\/p>\n<p>So there you have it: 64-bit by 64-bit multiplication with a 128-bit result (either signed or unsigned) from 32-bit code without any inline assembly. Intrinsics let you express what you want in C++ and give the variables meaningful names, and you can let the compiler do the tedious work of register assignment, something compilers are generally better at than humans anyway. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Applying the sign adjustment.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-97586","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Applying the sign adjustment.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97586","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=97586"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97586\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=97586"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=97586"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=97586"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}