{"id":106223,"date":"2022-02-07T07:00:00","date_gmt":"2022-02-07T15:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106223"},"modified":"2022-03-29T13:55:35","modified_gmt":"2022-03-29T20:55:35","slug":"20220207-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220207-00\/?p=106223","title":{"rendered":"On finding the average of two unsigned integers without overflow"},"content":{"rendered":"<p>Finding the average of two unsigned integers, rounding toward zero, sounds easy:<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n    return (a + b) \/ 2;\r\n}\r\n<\/pre>\n<p>However, this <a title=\"Answers to exercise from Scrollbars Part 11\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20030917-00\/?p=42453\"> gives the wrong answer in the face of integer overflow<\/a>: For example, if unsigned integers are 32 bits wide, then it says that <code>average(0x80000000U, 0x80000000U)<\/code> is zero.<\/p>\n<p>If you know which number is the larger number (which is often the case), then you can <a title=\"Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken\" href=\"https:\/\/ai.googleblog.com\/2006\/06\/extra-extra-read-all-about-it-nearly.html\"> calculate the width and halve it<\/a>:<\/p>\n<pre>unsigned average(unsigned low, unsigned high)\r\n{\r\n    return low + (high - low) \/ 2;\r\n}\r\n<\/pre>\n<p>There&#8217;s another algorithm that doesn&#8217;t depend on knowing which value is larger, the U.S. patent for which <a href=\"https:\/\/patents.google.com\/patent\/US6007232A\/en\"> expired in 2016<\/a>:<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n    return (a \/ 2) + (b \/ 2) + (a &amp; b &amp; 1);\r\n}\r\n<\/pre>\n<p>The trick here is to pre-divide the values before adding. 
This will be too low if the original addition contained a carry from bit 0 to bit 1, which happens if bit 0 is set in both of the terms, so we detect that case and make the necessary adjustment.<\/p>\n<p>And then there&#8217;s the technique in the style known as <a href=\"https:\/\/en.wikipedia.org\/wiki\/SWAR\">SWAR<\/a>, which stands for &#8220;SIMD within a register&#8221;.<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n    return (a &amp; b) + (a ^ b) \/ 2;\r\n}\r\n<\/pre>\n<p>This works because <code>a + b = 2(a &amp; b) + (a ^ b)<\/code>: the bits set in both values contribute twice to the sum, and the bits where the values differ contribute once, so the XOR term can be halved without losing a carry.<\/p>\n<p>If your compiler supports integers larger than the size of an <code>unsigned<\/code>, say because <code>unsigned<\/code> is a 32-bit value but the native register size is 64-bit, or because the compiler supports multiword arithmetic, then you can cast to the larger data type:<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n    \/\/ Suppose \"unsigned\" is a 32-bit type and\r\n    \/\/ \"unsigned long long\" is a 64-bit type.\r\n    return ((unsigned long long)a + b) \/ 2;\r\n}\r\n<\/pre>\n<p>The results would look something like this for processors with native 64-bit registers. 
(I follow the processor&#8217;s natural calling convention for what is in the upper 32 bits of 64-bit registers.)<\/p>\n<pre>\/\/ x86-64: Assume ecx = a, edx = b, upper 32 bits unknown\r\n    mov     eax, ecx        ; rax = ecx zero-extended to 64-bit value\r\n    mov     edx, edx        ; rdx = edx zero-extended to 64-bit value\r\n    add     rax, rdx        ; 64-bit addition: rax = rax + rdx\r\n    shr     rax, 1          ; 64-bit shift:    rax = rax &gt;&gt; 1\r\n                            ;                  result is zero-extended\r\n                            ; Answer in eax\r\n\r\n\/\/ AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown\r\n    uxtw    x0, w0          ; x0 = w0 zero-extended to 64-bit value\r\n    add     x0, x0, w1, uxtw ; 64-bit addition: x0 = x0 + (uint32_t)w1\r\n    ubfx    x0, x0, 1, 32   ; Extract bits 1 through 32 from result\r\n                            ; (shift + zero-extend in one instruction)\r\n                            ; Answer in x0\r\n\r\n\/\/ Alpha AXP: Assume a0 = a, a1 = b, both in canonical form\r\n    insll   a0, #0, a0      ; a0 = a0 zero-extended to 64-bit value\r\n    insll   a1, #0, a1      ; a1 = a1 zero-extended to 64-bit value\r\n    addq    a0, a1, v0      ; 64-bit addition: v0 = a0 + a1\r\n    srl     v0, #1, v0      ; 64-bit shift:    v0 = v0 &gt;&gt; 1\r\n    addl    zero, v0, v0    ; Force canonical form\r\n                            ; Answer in v0\r\n\r\n\/\/ MIPS64: Assume a0 = a, a1 = b, sign-extended\r\n    dext    a0, a0, 0, 32   ; Zero-extend a0 to 64-bit value\r\n    dext    a1, a1, 0, 32   ; Zero-extend a1 to 64-bit value\r\n    daddu   v0, a0, a1      ; 64-bit addition: v0 = a0 + a1\r\n    dsrl    v0, v0, #1      ; 64-bit shift:    v0 = v0 &gt;&gt; 1\r\n    sll     v0, v0, #0      ; Sign-extend result\r\n                            ; Answer in v0\r\n\r\n\/\/ Power64: Assume r3 = a, r4 = b, zero-extended\r\n    add     r3, r3, r4      ; 64-bit addition: r3 = r3 + r4\r\n    rldicl  
r3, r3, 63, 32  ; Extract bits 32 through 1 from result\r\n                            ; (shift + zero-extend in one instruction)\r\n                            ; result in r3\r\n\r\n\/\/ Itanium Ia64: Assume r32 = a, r33 = b, upper 32 bits unknown\r\n    extr.u  r32 = r32, 0, 32    \/\/ zero-extend r32 to 64-bit value\r\n    extr.u  r33 = r33, 0, 32 ;; \/\/ zero-extend r33 to 64-bit value\r\n    add     r8 = r32, r33 ;;    \/\/ 64-bit addition: r8 = r32 + r33\r\n    shr     r8 = r8, 1          \/\/ 64-bit shift:    r8 = r8 &gt;&gt; 1\r\n<\/pre>\n<p>Note that we must ensure that the upper 32 bits of the 64-bit registers are zero, so that any leftover values in bit 32 don&#8217;t infect the sum. The instructions to zero out the upper 32 bits may be elided if you know ahead of time that they are already zero. This is common on x86-64 and AArch64 since those architectures naturally zero-extend 32-bit values to 64-bit values, but not common on Alpha AXP and MIPS64 because those architectures naturally <i>sign<\/i>-extend 32-bit values to 64-bit values.<\/p>\n<p>I find it amusing that the PowerPC, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180815-00\/?p=99495\"> patron saint<\/a> of <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180814-00\/?p=99485\"> ridiculous instructions<\/a>, has an instruction whose name almost literally proclaims its ridiculousness: rldicl. 
(It stands for &#8220;rotate left doubleword by immediate and clear left&#8221;.)<\/p>\n<p>For 32-bit processors with compiler support for multiword arithmetic, you end up with something like this:<\/p>\n<pre>\/\/ x86-32\r\n    mov     eax, a          ; eax = a\r\n    xor     ecx, ecx        ; Zero-extend to 64 bits\r\n    add     eax, b          ; Accumulate low 32 bits in eax, set carry on overflow\r\n    adc     ecx, ecx        ; Accumulate high 32 bits in ecx\r\n                            ; ecx:eax = 64-bit result\r\n    shrd    eax, ecx, 1     ; Multiword shift right\r\n                            ; Answer in eax\r\n\r\n\/\/ ARM 32-bit: Assume r0 = a, r1 = b\r\n    mov     r2, #0          ; r2 = 0\r\n    adds    r0, r0, r1      ; Accumulate low 32 bits in r0, set carry on overflow\r\n    adc     r1, r2, #0      ; Accumulate high 32 bits in r1\r\n                            ; r1:r0 = 64-bit result\r\n    lsrs    r1, r1, #1      ; Shift high 32 bits right one position\r\n                            ; Bottom bit goes into carry\r\n    rrx     r0, r0          ; Rotate bottom 32 bits right one position\r\n                            ; Carry bit goes into top bit\r\n                            ; Answer in r0\r\n\r\n\/\/ SH-3: Assume r4 = a, r5 = b\r\n    ; (MSVC 13.10.3343 code generation here isn't that great)\r\n    clrt                    ; Clear T flag\r\n    mov     #0, r3          ; r3 = 0, zero-extended high 32 bits of a\r\n    addc    r5, r4          ; r4 = r4 + r5 + T, overflow goes into T bit\r\n    mov     #0, r2          ; r2 = 0, zero-extended high 32 bits of b\r\n    addc    r3, r2          ; r2 = r2 + r3 + T, calculate high 32 bits\r\n                            ; r2:r4 = 64-bit result\r\n    mov     #31, r3         ; Prepare for left shift\r\n    shld    r3, r2          ; r2 = r2 &lt;&lt; r3\r\n    shlr    r4              ; r4 = r4 &gt;&gt; 1\r\n    mov     r2, r0          ; r0 = r2\r\n    or      r4, r0          ; r0 = r0 | r4\r\n                   
         ; Answer in r0\r\n\r\n\/\/ MIPS: Assume a0 = a, a1 = b\r\n    addu    v0, a0, a1      ; v0 = a0 + a1\r\n    sltu    a0, v0, a0      ; a0 = 1 if overflow occurred\r\n    sll     a0, a0, 31      ; Move to bit 31\r\n    srl     v0, v0, #1      ; Shift low 32 bits right one position\r\n    or      v0, v0, a0      ; Combine the two parts\r\n                            ; Answer in v0\r\n\r\n\/\/ PowerPC: Assume r3 = a, r4 = b\r\n    ; (gcc 4.8.5 -O3 code generation here isn't that great)\r\n    mr      r9, r3          ; r9 = r3 (low 32 bits of 64-bit a)\r\n    mr      r11, r4         ; r11 = r4 (low 32 bits of 64-bit b)\r\n    li      r8, #0          ; r8 = 0 (high 32 bits of 64-bit a)\r\n    li      r10, #0         ; r10 = 0 (high 32 bits of 64-bit b)\r\n    addc    r11, r11, r9    ; r11 = r11 + r9, set carry on overflow\r\n    adde    r10, r10, r8    ; r10 = r10 + r8, high 32 bits of 64-bit result\r\n    rlwinm  r3, r11, 31, 1, 31 ; r3 = r11 &gt;&gt; 1\r\n    rlwinm  r9, r10, 31, 0, 0 ; r9 = r10 &lt;&lt; 31\r\n    or      r3, r3, r9      ; Combine the two parts\r\n                            ; Answer in r3\r\n\r\n\/\/ RISC-V: Assume a0 = a, a1 = b\r\n    add     a1, a0, a1      ; a1 = a0 + a1\r\n    sltu    a0, a1, a0      ; a0 = 1 if overflow occurred\r\n    slli    a0, a0, 31      ; Shift to bit 31\r\n    srli    a1, a1, 1       ; a1 = a1 &gt;&gt; 1\r\n    or      a0, a0, a1      ; Combine the two parts\r\n                            ; Answer in a0\r\n<\/pre>\n<p>Or if you have access to SIMD registers that are larger than the native register size, you can do the math there. 
(Though crossing the boundary from general-purpose register to SIMD register and back may end up too costly.)<\/p>\n<pre>\/\/ x86-32\r\nunsigned average(unsigned a, unsigned b)\r\n{\r\n    auto a128 = _mm_cvtsi32_si128(a);\r\n    auto b128 = _mm_cvtsi32_si128(b);\r\n    auto sum = _mm_add_epi64(a128, b128);\r\n    auto avg = _mm_srli_epi64(sum, 1);\r\n    return _mm_cvtsi128_si32(avg);\r\n}\r\n\r\n    movd    xmm0, a         ; Load a into bottom 32 bits of 128-bit register\r\n    movd    xmm1, b         ; Load b into bottom 32 bits of 128-bit register\r\n    paddq   xmm1, xmm0      ; Add as 64-bit integers\r\n    psrlq   xmm1, 1         ; Shift 64-bit integer right one position\r\n    movd    eax, xmm1       ; Extract bottom 32 bits of result\r\n\r\n\/\/ 32-bit ARM (A32) has an \"average\" instruction built in\r\nunsigned average(unsigned a, unsigned b)\r\n{\r\n    auto a64 = vdup_n_u32(a);\r\n    auto b64 = vdup_n_u32(b);\r\n    auto avg = vhadd_u32(a64, b64); \/\/ hadd = half of add (average)\r\n    return vget_lane_u32(avg, 0);\r\n}\r\n\r\n    vdup.32 d16, r0         ; Broadcast r0 into both halves of d16\r\n    vdup.32 d17, r1         ; Broadcast r1 into both halves of d17\r\n    vhadd.u32 d16, d16, d17 ; d16 = average of d16 and d17\r\n    vmov.32 r0, d16[0]      ; Extract result\r\n<\/pre>\n<p>But you can still do better, if only you had access to better intrinsics.<\/p>\n<p>In processors that support add-with-carry, you can view the sum of register-sized integers as an (<var>N<\/var> + 1)-bit result, where the bonus bit <var>N<\/var> is the carry bit. 
If the processor also supports rotate-right-through-carry, you can shift the (<var>N<\/var> + 1)-bit result right one place, recovering the correct average without losing the bit that overflows.<\/p>\n<pre>\/\/ x86-32\r\n    mov     eax, a\r\n    add     eax, b          ; Add, overflow goes into carry bit\r\n    rcr     eax, 1          ; Rotate right one place through carry\r\n\r\n\/\/ x86-64\r\n    mov     rax, a\r\n    add     rax, b          ; Add, overflow goes into carry bit\r\n    rcr     rax, 1          ; Rotate right one place through carry\r\n\r\n\/\/ 32-bit ARM (A32)\r\n    mov     r0, a\r\n    adds    r0, b           ; Add, overflow goes into carry bit\r\n    rrx     r0              ; Rotate right one place through carry\r\n\r\n\/\/ SH-3\r\n    clrt                    ; Clear T flag\r\n    mov     a, r0\r\n    addc    b, r0           ; r0 = r0 + b + T, overflow goes into T bit\r\n    rotcr   r0              ; Rotate right one place through carry\r\n<\/pre>\n<p>While there is an intrinsic for the operation of &#8220;add two values and report the result as well as carry&#8221;, we don&#8217;t have one for &#8220;rotate right through carry&#8221;, so we can get only halfway there:<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n#if defined(_MSC_VER)\r\n    unsigned sum;\r\n    auto carry = _addcarry_u32(0, a, b, &amp;sum);\r\n    return _rotr1_carry(sum, carry); \/\/ missing intrinsic!\r\n#elif defined(__clang__)\r\n    unsigned carry;\r\n    auto sum = __builtin_addc(a, b, 0, &amp;carry);\r\n    return __builtin_rotateright1throughcarry(sum, carry); \/\/ missing intrinsic!\r\n#elif defined(__GNUC__)\r\n    unsigned sum;\r\n    auto carry = __builtin_add_overflow(a, b, &amp;sum);\r\n    return __builtin_rotateright1throughcarry(sum, carry); \/\/ missing intrinsic!\r\n#else\r\n#error Unsupported compiler.\r\n#endif\r\n}\r\n<\/pre>\n<p>We&#8217;ll have to fake it, alas. 
Here&#8217;s one way:<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n#if defined(_MSC_VER)\r\n    unsigned sum;\r\n    auto carry = _addcarry_u32(0, a, b, &amp;sum);\r\n    return (sum \/ 2) | (carry &lt;&lt; 31);\r\n#elif defined(__clang__)\r\n    unsigned carry;\r\n    auto sum = __builtin_addc(a, b, 0, &amp;carry);\r\n    return (sum \/ 2) | (carry &lt;&lt; 31);\r\n#elif defined(__GNUC__)\r\n    unsigned sum;\r\n    auto carry = __builtin_add_overflow(a, b, &amp;sum);\r\n    return (sum \/ 2) | (carry &lt;&lt; 31);\r\n#else\r\n#error Unsupported compiler.\r\n#endif\r\n}\r\n\r\n\/\/ _MSC_VER\r\n    mov     ecx, a\r\n    add     ecx, b          ; Add, overflow goes into carry bit\r\n    setc    al              ; al = 1 if carry set\r\n    shr     ecx, 1          ; Shift sum right one position\r\n    movzx   eax, al         ; eax = 1 if carry set\r\n    shl     eax, 31         ; Move to bit 31\r\n    or      eax, ecx        ; Combine\r\n                            ; Result in eax\r\n\r\n\/\/ __clang__\r\n    mov     ecx, a\r\n    add     ecx, b          ; Add, overflow goes into carry bit\r\n    setc    al              ; al = 1 if carry set\r\n    shld    eax, ecx, 31    ; Shift left 64-bit value\r\n                            ; Result in eax\r\n\r\n\/\/ __clang__ with ARM-Thumb2\r\n    adds    r0, r0, r1      ; Calculate sum with flags\r\n    blo     nope            ; Jump if carry clear\r\n    movs    r1, #1          ; Carry is 1\r\n    lsls    r1, r1, #31     ; Move carry to bit 31\r\n    lsrs    r0, r0, #1      ; Shift sum right one position\r\n    adcs    r0, r0, r1      ; Combine\r\n    b       done\r\nnope:\r\n    movs    r1, #0          ; Carry is 0\r\n    lsrs    r0, r0, #1      ; Shift sum right one position\r\n    adds    r0, r0, r1      ; Combine\r\ndone:\r\n\r\n\r\n\/\/ __GNUC__\r\n    mov     eax, a\r\n    xor     edx, edx        ; Preset edx = 0 for later setc\r\n    add     eax, b          ; Add, overflow goes into carry bit\r\n    setc  
  dl              ; dl = 1 if carry set\r\n    shr     eax, 1          ; Shift sum right one position\r\n    shl     edx, 31         ; Move carry to bit 31\r\n    or      eax, edx        ; Combine\r\n<\/pre>\n<p>I considered trying a sneaky trick: Use the rotation intrinsic. (gcc doesn&#8217;t have a rotation intrinsic, so I couldn&#8217;t try it there.)<\/p>\n<pre>unsigned average(unsigned a, unsigned b)\r\n{\r\n#if defined(_MSC_VER)\r\n    unsigned sum;\r\n    auto carry = _addcarry_u32(0, a, b, &amp;sum);\r\n    sum = (sum &amp; ~1) | carry;\r\n    return _rotr(sum, 1);\r\n#elif defined(__clang__)\r\n    unsigned carry;\r\n    auto sum = __builtin_addc(a, b, 0, &amp;carry);\r\n    sum = (sum &amp; ~1) | carry;\r\n    return __builtin_rotateright32(sum, 1);\r\n#else\r\n#error Unsupported compiler.\r\n#endif\r\n}\r\n\r\n\/\/ _MSC_VER\r\n    mov     ecx, a\r\n    add     ecx, b          ; Add, overflow goes into carry bit\r\n    setc    al              ; al = 1 if carry set\r\n    and     ecx, -2         ; Clear bottom bit\r\n    movzx   eax, al         ; Zero-extend byte to 32-bit value\r\n    or      eax, ecx        ; Combine\r\n    ror     eax, 1          ; Rotate right one position\r\n                            ; Result in eax\r\n\r\n\/\/ __clang__\r\n    mov     ecx, a\r\n    add     ecx, b          ; Add, overflow goes into carry bit\r\n    setc    al              ; al = 1 if carry set\r\n    shld    eax, ecx, 31    ; Shift left 64-bit value\r\n\r\n\/\/ __clang__ with ARM-Thumb2\r\n    movs    r2, #0          ; Prepare to receive carry\r\n    adds    r0, r0, r1      ; Calculate sum with flags\r\n    adcs    r2, r2          ; r2 holds carry\r\n    lsrs    r0, r0, #1      ; Shift sum right one position\r\n    lsls    r1, r2, #31     ; Move carry to bit 31\r\n    adds    r0, r1, r0      ; Combine\r\n<\/pre>\n<p>Mixed results. For <code>_MSC_VER<\/code>, the code generation got worse. For <code>__clang__<\/code> for ARM-Thumb2, the code generation got better. 
And for <code>__clang__<\/code> for x86, the compiler realized that it was the same as before, so it just used the previous codegen!<\/p>\n<p><b>Bonus chatter<\/b>: And while I&#8217;m here, here are sequences for processors that don&#8217;t have rotate-right-through-carry.<\/p>\n<pre>\/\/ AArch64 (A64)\r\n    mov     x0, a\r\n    adds    x0, x0, b       ; Add, overflow goes into carry bit\r\n    adc     x1, xzr, xzr    ; Copy carry to x1\r\n    extr    x0, x1, x0, 1   ; Extract bits 64:1 from x1:x0\r\n                            ; Answer in x0\r\n\r\n\/\/ Alpha AXP: Assume a0 = a, a1 = b, both 64-bit values\r\n    addq    a0, a1, v0      ; 64-bit addition: v0 = a0 + a1\r\n    cmpult  v0, a0, a0      ; a0 = 1 if overflow occurred\r\n    srl     v0, #1, v0      ; 64-bit shift:    v0 = v0 &gt;&gt; 1\r\n    sll     a0, #63, a0     ; 64-bit shift:    a0 = a0 &lt;&lt; 63\r\n    or      a0, v0, v0      ; v0 = v0 | a0\r\n                            ; Answer in v0\r\n\r\n\/\/ Itanium Ia64: Assume r32 = a, r33 = b, both 64-bit values\r\n    add     r8 = r32, r33 ;;    \/\/ 64-bit addition: r8 = r32 + r33\r\n    cmp.ltu p6, p7 = r8, r33 ;; \/\/ p6 = true if overflow occurred\r\n(p6) addl   r9 = 1, r0          \/\/ r9 = 1 if overflow occurred\r\n(p7) addl   r9 = 0, r0 ;;       \/\/ r9 = 0 if overflow did not occur\r\n    shrp    r8 = r9, r8, 1      \/\/ r8 = extract bits 64:1 from r9:r8\r\n                                \/\/ Answer in r8\r\n\r\n\/\/ MIPS: Same as multiprecision version\r\n\r\n\/\/ PowerPC: Assume r3 = a, r4 = b\r\n    addc    r3, r3, r4          ; Accumulate low 32 bits in r3, set carry on overflow\r\n    adde    r5, r4, r4          ; Shift carry into bottom bit of r5 (other bits garbage)\r\n    rlwinm  r3, r3, 31, 1, 31   ; Shift r3 right by one position\r\n    rlwinm  r5, r5, 31, 0, 0    ; Shift bottom bit of r5 to bit 31\r\n    or      r3, r3, r5          ; Combine the two parts\r\n\r\n\/\/ RISC-V: Same as multiprecision version\r\n<\/pre>\n<p><b>Bonus 
chatter<\/b>: C++20 adds a <code>std::midpoint<\/code> function that calculates the average of two values (rounding toward <code>a<\/code>).<\/p>\n<p><b>Bonus viewing<\/b>: <a href=\"https:\/\/www.youtube.com\/watch?v=sBtAGxBh-XI\"> std::midpoint? How hard could it be?<\/a><\/p>\n<p><b>Update<\/b>: I was able to trim an instruction off the PowerPC version by realizing that only the bottom bit of <code>r5<\/code> participates in the <code>rlwinm<\/code>, so the other bits can be uninitialized garbage. For the uninitialized garbage, I used <code>r4<\/code>, which I know can be consumed without a stall because the <code>addc<\/code> already consumed it.<\/p>\n<p>Here&#8217;s the original:<\/p>\n<pre>\/\/ PowerPC: Assume r3 = a, r4 = b\r\n    li      r5, #0              ; r5 = 0 (accumulates high 32 bits)\r\n    addc    r3, r3, r4          ; Accumulate low 32 bits in r3, set carry on overflow\r\n    addze   r5, r5              ; Accumulate high bits in r5\r\n    rlwinm  r3, r3, 31, 1, 31   ; Shift r3 right by one position\r\n    rlwinm  r5, r5, 31, 0, 0    ; Shift bottom bit of r5 to bit 31\r\n    or      r3, r3, r5          ; Combine the two parts\r\n<\/pre>\n<p><b>Update 2<\/b>: Peter Cordes pointed out that an instruction can also be trimmed from the AArch64 version by using the <code>uxtw<\/code> extended register operation to combine a <code>uxtw<\/code> with an <code>add<\/code>. 
Here&#8217;s the original:<\/p>\n<pre>\/\/ AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown\r\n    uxtw    x0, w0          ; x0 = w0 zero-extended to 64-bit value\r\n    uxtw    x1, w1          ; x1 = w1 zero-extended to 64-bit value\r\n    add     x0, x0, x1      ; 64-bit addition: x0 = x0 + x1\r\n    ubfx    x0, x0, 1, 32   ; Extract bits 1 through 32 from result\r\n                            ; (shift + zero-extend in one instruction)\r\n                            ; Answer in x0\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Another code golfing adventure.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-106223","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Another code golfing 
adventure.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106223"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106223\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}