{"id":97577,"date":"2017-12-14T07:00:00","date_gmt":"2017-12-14T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=97577"},"modified":"2019-03-13T01:22:54","modified_gmt":"2019-03-13T08:22:54","slug":"20171214-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20171214-00\/?p=97577","title":{"rendered":"Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result using intrinsics (part 2)"},"content":{"rendered":"<p><a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20171213-00\/?p=97575\">Last time<\/a>, we converted our original assembly language code for <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20141208-00\/?p=43453\">creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result<\/a> to C++ code with intrinsics. We observed that the compiler was able to optimize out some memory accesses by extracting the values using shifts. <\/p>\n<p>Let&#8217;s see if we can tweak the code to remove the last of the memory accesses. Although the Windows calling convention for x86 does not have many general purpose registers available (only <code>eax<\/code> <code>ecx<\/code>, and <code>edx<\/code>), it does have eight <code>xmm<\/code> registers available, so we can use those as temporary holding places. <\/p>\n<pre>\n__m128i Multiply64x64To128(uint64_t x, uint64_t y)\n{\n    auto x128 = _mm_loadl_epi64((__m128i*) &amp;x);\n    auto term1 = _mm_unpacklo_epi32(x128, x128);\n\n    auto y128 = _mm_loadl_epi64((__m128i*) &amp;y);\n    auto term2 = _mm_unpacklo_epi32(y128, y128);\n\n    auto flip2 = _mm_shuffle_epi32(term2, _MM_SHUFFLE(1, 0, 3, 2));\n\n    auto result = _mm_mul_epu32(term1, term2);\n    auto crossterms = _mm_mul_epu32(term1, flip2);\n\n    \/\/ Now apply the cross-terms to the provisional result\n    unsigned temp;\n\n    auto result1 = _mm_srli_si128(result, 4);\n    auto carry = _addcarry_u32(0,\n                               _mm_cvtsi128_si32(result1),\n                               _mm_cvtsi128_si32(crossterms),\n                               &amp;temp);\n    result1 = _mm_cvtsi32_si128(temp);\n\n    auto result2 = _mm_srli_si128(result, 8);\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(carry,\n                          _mm_cvtsi128_si32(result2),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result2 = _mm_cvtsi32_si128(temp);\n\n    auto result3 = _mm_srli_si128(result, 12);\n    _addcarry_u32(carry,\n                  _mm_cvtsi128_si32(result3),\n                  0,\n                  &amp;temp);\n    result3 = _mm_cvtsi32_si128(temp);\n\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(0,\n                          _mm_cvtsi128_si32(result1),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result1 = _mm_cvtsi32_si128(temp);\n\n    crossterms = _mm_srli_si128(crossterms, 4);\n    carry = _addcarry_u32(carry,\n                          _mm_cvtsi128_si32(result2),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    result2 = _mm_cvtsi32_si128(temp);\n\n    _addcarry_u32(carry,\n                  _mm_cvtsi128_si32(result3),\n                  0,\n                  &amp;temp);\n    result3 = _mm_cvtsi32_si128(temp);\n\n    result = _mm_unpacklo_epi64(\n       _mm_unpacklo_epi32(result, result1),\n       _mm_unpacklo_epi32(result2, result3));\n\n    return result;\n}\n<\/pre>\n<p>We keep each of the four pieces of the result in a separate MMX register and convert it to an integer for the purpose of the <code>_addcarry_u32<\/code>, then convert it back to an MMX register once the arithmetic is complete. At the end, recombine the four pieces into a single value. <\/p>\n<p>The convert-on-demand-and-then-back pattern is <\/p>\n<pre>\n    carry = _addcarry_u32(carry,\n                          _mm_cvtsi128_si32(blah),\n                          _mm_cvtsi128_si32(crossterms),\n                          &amp;temp);\n    blah = _mm_cvtsi32_si128(temp);\n<\/pre>\n<p>where we take the low-order 32-bit value in <code>blah<\/code>, perform an add-with-carry with the low-order 32-bit vlaue in <code>crossterms<\/code>, then save the result back into <code>blah<\/code> while retaining the carry. <\/p>\n<p>The other trick is that the lanes of the cross-terms are consumed only once each, and in order, so we can shift them into position and use <code>_mm_cvtsi128_si32<\/code> to pull them out one at a time. <\/p>\n<p>The resulting compiler-generated assembly goes like this: <\/p>\n<pre>\n    ; xmm0 = y = { 0, 0, C, D }\n        movq     xmm0, QWORD PTR _y$[esp-4]\n    ; xmm1 = x = { 0, 0, A, B }\n        movq     xmm1, QWORD PTR _x$[esp-4]\n\n    ; xmm0 = { C, C, D, D }\n        punpckldq xmm0, xmm0\n\n    ; xmm4 = { C, C, D, D }\n        movaps   xmm4, xmm0\n\n    ; xmm1 = { A, A, B, B }\n        punpckldq xmm1, xmm1\n\n    ; xmm4 = { A * C, B * D } ; \"result\"\n        pmuludq xmm4, xmm1\n\n    ; xmm3 = { D, D, C, C }\n        pshufd   xmm3, xmm0, 78\n\n    ; xmm3 = { A * D, B * C } ; \"crossterms\"\n        pmuludq xmm3, xmm1\n\n    ; ecx = result[1]\n        movaps   xmm0, xmm4\n        psrldq   xmm0, 4\n        movd     ecx, xmm0\n\n    ; prepare to load result[2] from xmm0\n        movaps   xmm0, xmm4\n\n    ; eax = crossterms[0]\n        movd     eax, xmm3\n\n    ; prepare to load result[2] from xmm0\n        psrldq   xmm0, 8\n\n    ; shift crossterms[1] into position\n        psrldq   xmm3, 4\n\n    ; result[1] += crossterms[0], carry set appropriately\n        add      ecx, eax\n\n    ; eax = crossterms[1]\n        movd     eax, xmm3\n\n    ; shift crossterms[2] into position\n        psrldq   xmm3, 4\n\n    ; xmm1 = result[1]\n        movd     xmm1, ecx\n\n    ; ecx = result[2]\n        movd     ecx, xmm0\n\n    ; prepare to load result[3] from xmm0\n        movaps   xmm0, xmm4\n        psrldq   xmm0, 12\n\n    ; result[2] += crossterms[1] + carry, carry set appropriate\n        adc      ecx, eax\n\n    ; eax = result[3]\n        movd     eax, xmm0 \n\n    ; result[3] += carry\n        adc      eax, 0\n\n    ; xmm2 = result[2]\n        movd     xmm2, ecx\n\n    ; ecx = result[1]\n        movd     ecx, xmm1\n\n    ; xmm0 = result[3]\n        movd     xmm0, eax\n\n    ; eax = crossterms[2]\n        movd     eax, xmm3\n\n    ; shift crossterms[3] into position\n        psrldq   xmm3, 4\n\n    ; result[1] += crossterms[2], carry set appropriately\n        add      ecx, eax\n\n    ; eax = crossterms[3]\n        movd     eax, xmm3\n\n    ; xmm1 = result[1]\n        movd     xmm1, ecx\n\n    ; ecx = result[2]\n        movd     ecx, xmm2\n\n    ; xmm4 = { *, *, result[1], result[0] }\n        punpckldq xmm4, xmm1\n\n    ; result[2] += crossterms[3]\n        adc      ecx, eax\n\n    ; eax = result[3]\n        movd     eax, xmm0\n\n    ; result[3] += carry\n        adc      eax, 0\n\n    ; xmm2 = result[2]\n        movd     xmm2, ecx\n\n    ; xmm1 = result[3]\n        movd     xmm1, eax\n\n    ; xmm2 = { *, *, result[3], result[2] }\n        punpckldq xmm2, xmm1\n\n    ; xmm4 = { result[3], result[2], result[1], result[0] }\n        punpcklqdq xmm4, xmm2\n\n    ; set as return value\n        movaps   xmm0, xmm4\n        ret\n<\/pre>\n<p>I could go even further and realize that one of the <code>result#<\/code> variables could be left in a general-purpose register, since we need only two registers to perform the integer add. I also could have shifted the result a little bit at a time the same way I shifted the cross-terms a little bit at a time. <\/p>\n<p>This version can perform all its work in registers, which means that there&#8217;s no need for stack variables, which means that it becomes a lightweight leaf function. That means it doesn&#8217;t need to create a stack frame. <\/p>\n<p>Next time, we&#8217;ll move on to signed multiplication. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Put it all in registers.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-97577","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Put it all in registers.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=97577"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97577\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=97577"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=97577"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=97577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}