{"id":97575,"date":"2017-12-13T07:00:00","date_gmt":"2017-12-13T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=97575"},"modified":"2019-03-13T01:21:04","modified_gmt":"2019-03-13T08:21:04","slug":"20171213-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20171213-00\/?p=97575","title":{"rendered":"Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result using intrinsics (part 1)"},"content":{"rendered":"<p>Some time ago, I derived how to <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20141208-00\/?p=43453\">create double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result<\/a>, but I expressed it in assembly language. This time, I&#8217;ll express it in C using intrinsics. <\/p>\n<p>Using intrinsics rather than inline assembly language is slightly more portable, since all the compiler toolchains that implement the intrinsics agree on what the intrinsics mean. They <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20171102-00\/?p=97335\">disagree on the members of a <code>__m128i<\/code><\/a>, but at least they agree on the intrinsics. <\/p>\n<p>First, a straightforward translation of the assembly language code to intrinsics: <\/p>\n<pre>\n__m128i Multiply64x64To128(uint64_t x, uint64_t y)\n{\n    \/\/ movq xmm0, x         \/\/ xmm0 = { 0, 0, A, B } = { *, *, A, B }\n    auto x00AB = _mm_loadl_epi64((__m128i*) &amp;x);\n\n    \/\/ movq xmm1, y         \/\/ xmm1 = { 0, 0, C, D } = { *, *, C, D }\n    auto x00CD = _mm_loadl_epi64((__m128i*) &amp;y);\n\n    \/\/ punpckldq xmm0, xmm0 \/\/ xmm0 = { A, A, B, B } = { *, A, *, B }\n    auto xAABB = _mm_unpacklo_epi32(x00AB, x00AB);\n\n    \/\/ punpckldq xmm1, xmm1 \/\/ xmm1 = { C, C, D, D } = { *, C, *, D }\n    auto xCCDD = _mm_unpacklo_epi32(x00CD, x00CD);\n\n    \/\/ pshufd xmm2, xmm1, _MM_SHUFFLE(1, 0, 3, 2)\n    \/\/                      \/\/ xmm2 = { D, D, C, C } = { *, D, *, C }\n    auto xDDCC = _mm_shuffle_epi32(xCCDD, _MM_SHUFFLE(1, 0, 3, 2));\n\n    \/\/ pmuludq xmm1, xmm0 \/\/ xmm1 = { AC, BD } \/\/ provisional result\n    auto result = _mm_mul_epu32(xAABB, xCCDD);\n\n    \/\/ pmuludq xmm2, xmm0 \/\/ xmm2 = { AD, BC } \/\/ cross-terms\n    auto crossterms = _mm_mul_epu32(xAABB, xDDCC);\n\n    \/\/ mov    eax, crossterms[0]\n    \/\/ add    result[4], eax\n    auto carry = _addcarry_u32(0,\n                               result.m128i_u32[1],\n                               crossterms.m128i_u32[0],\n                               &amp;result.m128i_u32[1]);\n\n    \/\/ mov    edx, crossterms[4] \/\/ edx:eax = BC\n    \/\/ adc    result[8], edx\n    carry = _addcarry_u32(carry,\n                          result.m128i_u32[2],\n                          crossterms.m128i_u32[1],\n                          &amp;result.m128i_u32[2]);\n\n    \/\/ adc    result[12], 0      \/\/ add the first cross-term\n    _addcarry_u32(carry,\n                  result.m128i_u32[3],\n                  0,\n                  &amp;result.m128i_u32[3]);\n\n    \/\/ mov    eax, crossterms[8]\n    \/\/ add    result[4], eax\n    carry = _addcarry_u32(0,\n                          result.m128i_u32[1],\n                          crossterms.m128i_u32[2],\n                          &amp;result.m128i_u32[1]);\n\n    \/\/ mov    edx, crossterms[12] \/\/ edx:eax = AD\n    \/\/ adc    result[8], edx\n    carry = _addcarry_u32(carry,\n                          result.m128i_u32[2],\n                          crossterms.m128i_u32[3],\n                          &amp;result.m128i_u32[2]);\n\n    \/\/ adc    result[12], 0      \/\/ add the second cross-term\n    _addcarry_u32(carry,\n                  result.m128i_u32[3],\n                  0,\n                  &amp;result.m128i_u32[3]);\n\n    return result;\n}\n<\/pre>\n<p>The Microsoft Visual Studio compiler produces the following: <\/p>\n<pre>\n    ; standard function prologue\n        push     ebp\n        mov      ebp, esp\n        and      esp, -16   ; room for _result on the stack\n        sub      esp, 24\n\n    ; load up x and y\n        movq     xmm1, QWORD PTR _y$[ebp]\n        movq     xmm2, QWORD PTR _x$[ebp]\n\n    ; duplicate x\n        punpckldq xmm1, xmm1\n\n    ; make another copy of the duplicated x\n        movaps   xmm0, xmm1\n\n    ; duplicate y\n        punpckldq xmm2, xmm2\n\n    ; multiply main terms, result in xmm0\n        pmuludq xmm0, xmm2\n\n    ; shuffle and multiply cross terms, cross-terms in xmm1\n        pshufd   xmm1, xmm1, 141\n        pmuludq xmm1, xmm2\n\n    ; Now the adjustments for cross-terms\n\n    ; save a register\n        push     esi\n\n    ; save result to memory\n        movaps   XMMWORD PTR _result$[esp+32], xmm0\n\n    ; load up result[2] into esi and result[3] into ecx\n        mov      esi, DWORD PTR _result$[esp+40]\n        mov      ecx, DWORD PTR _result$[esp+44]\n\n    ; load result[1] into edx by shifting\n        psrldq   xmm0, 4\n        movd     edx, xmm0\n\n    ; xmm0 holds cross-terms\n        movaps   xmm0, xmm1\n\n    ; load crossterms[0] into eax\n        movd     eax, xmm1\n\n    ; prepare to load crossterms[1] into eax by shifting\n        psrldq   xmm0, 4\n\n    ; add crossterms[0] to result[1]\n        add      edx, eax\n\n    ; load crossterms[1] into eax\n        movd     eax, xmm0\n\n    ; xmm0 holds cross-terms (again)\n        movaps   xmm0, xmm1\n\n    ; prepare to load crossterms[3] into eax by shifting\n        psrldq   xmm1, 12\n\n    ; prepare to load crossterms[2] into eax by shifting\n        psrldq   xmm0, 8\n\n    ; add crossterms[1] to result[2], with carry\n        adc      esi, eax\n\n    ; load crossterms[2] into eax\n        movd     eax, xmm0\n\n    ; propagate carry into result[3]\n        adc      ecx, 0\n\n    ; add crossterms[2] to result[1]\n        add      edx, eax\n\n    ; load crossterms[3] into eax\n        movd     eax, xmm1\n\n    ; save final result[1]\n        mov      DWORD PTR _result$[esp+36], edx\n\n    ; add crossterms[3] to result[2]\n        adc      esi, eax\n\n    ; save final result[2]\n        mov      DWORD PTR _result$[esp+40], esi\n\n    ; propagate carry into result[3]\n        adc      ecx, 0\n\n    ; save final result[3]\n        mov      DWORD PTR _result$[esp+44], ecx\n\n    ; load final result\n        movaps   xmm0, XMMWORD PTR _result$[esp+32]\n\n    ; clean up stack and return\n        pop      esi\n        mov      esp, ebp\n        pop      ebp\n        ret\n<\/pre>\n<p>I was impressed that the compiler was able to convert our direct accesses to the internal members of the <code>__m128i<\/code> into corresponding shifts and extractions. (Since this code was compiled with only SSE2 support, the compiler could not use the <code>pextrd<\/code> instruction to do the extraction.) <\/p>\n<p>Even with this very lame conversion, the compiler does quite a good job of optimiznig the code. The opening instructions match our handwritten assembly almost exactly; the second half doesn&#8217;t match as well, but that&#8217;s because the compiler was able to replace many of our memory accesses with register accesses. <\/p>\n<p>The compiler was able to optimize our inline assembly! <\/p>\n<p>We&#8217;ll take this as inspiration for trying to get rid of all the memory accesses. The story continues next time. <\/p>\n<p><b>Bonus chatter<\/b>: In the three years since I wrote the original article, nobody picked up on the fact that I got the parameters to <code>_MM_SHUFFLE<\/code> wrong. <\/p>\n<p><b>Bonus bonus chatter<\/b>: I could have reduced the dependency chain a bit by tweaking the calculation of <code>xDDCC<\/code>: <\/p>\n<pre>\n    \/\/ pshufd xmm2, xmm1, _MM_SHUFFLE(0, 0, 1, 1);\n    \/\/                      \/\/ xmm2 = { D, D, C, C } = { *, D, *, C }\n    auto xDDCC = _mm_shuffle_epi32(x00CD, _MM_SHUFFLE(0, 0, 1, 1));\n\n    \/\/ punpckldq xmm1, xmm1 \/\/ xmm1 = { C, C, D, D } = { *, C, *, D }\n    auto xCCDD = _mm_unpacklo_epi32(x00CD, x00CD);\n<\/pre>\n<p>Basing the calculation of <code>xDDCC<\/code> on <code>x00CD<\/code> rather than <code>0xCCDD<\/code> removes one instruction from the dependency chain. <\/p>\n<p><b>Bonus bonus bonus chatter<\/b>: I chose to use <code>punpckldq<\/code> instead of <code>pshufd<\/code> to calculate <code>xCCDD<\/code> because <code>punpckldq<\/code> encodes one byte shorter. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Once more with intrinsics.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-97575","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Once more with intrinsics.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=97575"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97575\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=97575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=97575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=97575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}