{"id":105280,"date":"2021-06-04T07:00:00","date_gmt":"2021-06-04T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105280"},"modified":"2021-06-05T12:04:37","modified_gmt":"2021-06-05T19:04:37","slug":"20210604-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210604-00\/?p=105280","title":{"rendered":"The ARM processor (Thumb-2), part 5: Arithmetic"},"content":{"rendered":"<p>The general format of three-register instructions in Thumb-2 goes like this:\u00b9<\/p>\n<pre>    op      Rd, Rn, #imm12      ; Rd = Rn op decode(imm12)\r\n    op      Rd, Rn, Rm          ; Rd = Rn op Rm\r\n    op      Rd, Rn, Rm, shift   ; Rd = Rn op (Rm with shift applied)\r\n                                ; shift can be LSL, LSR, ASR, ROR\r\n<\/pre>\n<p>The <code>#imm12<\/code> is a constant <a title=\"The ARM processor (Thumb-2), part 4: Single-instruction constants\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210603-00\/?p=105276\"> in a form we discussed last time<\/a>.<\/p>\n<p>For notational convenience, let&#8217;s call this<\/p>\n<pre>    op      Rd, Rn, op2         ; op2 can be #imm12, Rm, or Rm with a shift\r\n<\/pre>\n<p>Sometimes you&#8217;ll see a two-register version, which is shorthand for (and often a more compact encoding than) the three-register version:<\/p>\n<pre>    op      Rd, Rn              ; shorthand for op Rd, Rd, Rn\r\n<\/pre>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180808-00\/?p=99445\"> Like the PowerPC<\/a>, the ARM uses true carry. 
This means that for subtraction, the carry is clear when a borrow occurs, and subtract with carry subtracts an additional unit if the inbound carry is clear.<\/p>\n<p>With that said, here are the basic arithmetic operations:<\/p>\n<pre>    ; add\r\n    add     Rd, Rn, op2         ; Rd = Rn + op2\r\n\r\n    ; add with carry\r\n    adc     Rd, Rn, op2         ; Rd = Rn + op2 + carry\r\n\r\n    ; subtract\r\n    sub     Rd, Rn, op2         ; Rd = Rn - op2\r\n\r\n    ; subtract with carry\r\n    sbc     Rd, Rn, op2         ; Rd = Rn - op2 - !carry\r\n\r\n    ; reverse subtract\r\n    rsb     Rd, Rn, op2         ; Rd = op2 - Rn\r\n\r\n    ; reverse subtract with carry (classic ARM only; no Thumb-2 encoding)\r\n    rsc     Rd, Rn, op2         ; Rd = op2 - Rn - !carry\r\n\r\n    ; copy register from constant, register, or generalized op2\r\n    mov     Rd, #imm8           ; Rd = imm8 (0 to 255)\r\n    mov     Rd, Rm              ; Rd = Rm\r\n    mov     Rd, op2             ; Rd = op2\r\n\r\n    ; copy register from bitwise NOT of register or generalized op2\r\n    mvn     Rd, Rm              ; Rd = ~Rm\r\n    mvn     Rd, op2             ; Rd = ~op2\r\n\r\n    ; all support the S suffix\r\n<\/pre>\n<p>I noted earlier that in traditional RISC, there is no need for an architectural <code>MOV<\/code> instruction because you can treat it as a pseudo-instruction formed by adding zero to a register. Thumb-2 does include it as a special instruction because it has a 16-bit encoding in the case where you are loading a small positive constant, or if you are copying to a low register (even if the source register is high). There&#8217;s also a more traditional <code>op2<\/code> format that takes decoded 12-bit immediates or shifted registers.<\/p>\n<p>The most valuable part of reverse subtraction is that you can use it to subtract from a constant. 
In particular, you can negate a register by subtracting it from zero.<\/p>\n<p>There are also discarding versions of the subtraction instructions, where the sole purpose is setting flags.<\/p>\n<pre>    ; compare (compare Rn with op2)\r\n    cmp     Rn, op2             ; Set flags for Rn - op2\r\n\r\n    ; compare negative (compare Rn with -op2)\r\n    cmn     Rn, op2             ; Set flags for Rn + op2\r\n<\/pre>\n<p>The ARM processor designers are pulling a fast one here. In the <code>MVN<\/code> instruction, the <code>N<\/code> stands for <i>not<\/i>, meaning that it moves the bitwise negation of the <code>op2<\/code>. But in <code>CMN<\/code>, the <code>N<\/code> stands for <i>negative<\/i>, meaning that it compares the arithmetic negative of the <code>op2<\/code>.<\/p>\n<p>There&#8217;s an even more devious trap hiding in the <code>CMN<\/code> instruction, which I will discuss next time.<\/p>\n<p>Multiplication has a few variations. These are the 32 \u00d7 32 \u2192 32 multiplies:<\/p>\n<pre>    ; multiply\r\n    mul     Rd, Rn, Rm          ; Rd = Rn * Rm\r\n    muls    Rd, Rn, Rm          ; Rd = Rn * Rm, set partial flags\r\n\r\n    ; multiply accumulate\r\n    mla     Rd, Rm, Rs, Rn      ; Rd = (Rm * Rs) + Rn\r\n\r\n    ; multiply subtract\r\n    mls     Rd, Rm, Rs, Rn      ; Rd = Rn - (Rm * Rs)\r\n<\/pre>\n<p>The only multiply or divide instruction that has the option to set flags is <code>MULS<\/code>. 
It updates the negative (N) and zero (Z) flags to match the result, but the carry (C) and overflow (V) flags are unmodified.<\/p>\n<p>And here are the 32 \u00d7 32 \u2192 64 multiplies:<\/p>\n<pre>    ; unsigned multiply long\r\n    umull   Rdlo, Rdhi, Rm, Rs  ; Rdhi:Rdlo = Rm * Rs (unsigned)\r\n\r\n    ; signed multiply long\r\n    smull   Rdlo, Rdhi, Rm, Rs  ; Rdhi:Rdlo = Rm * Rs (signed)\r\n\r\n    ; unsigned multiply accumulate long\r\n    umlal   Rdlo, Rdhi, Rm, Rs  ; Rdhi:Rdlo = Rdhi:Rdlo + Rm * Rs (unsigned)\r\n\r\n    ; signed multiply accumulate long\r\n    smlal   Rdlo, Rdhi, Rm, Rs  ; Rdhi:Rdlo = Rdhi:Rdlo + Rm * Rs (signed)\r\n\r\n    ; unsigned multiply accumulate accumulate long\r\n    umaal   Rdlo, Rdhi, Rm, Rs  ; Rdhi:Rdlo = Rdhi + Rdlo + Rm * Rs (unsigned)\r\n<\/pre>\n<p>The &#8220;unsigned multiply accumulate accumulate long&#8221; instruction is a bit of an oddball. Its funny name reflects the fact that the registers of the output register pair are treated as separate integer inputs.<\/p>\n<p>Of the multiply instructions, I&#8217;ve seen the compiler use <code>MUL<\/code>, <code>MLA<\/code>, <code>UMULL<\/code> and <code>SMULL<\/code>. I have yet to see it use <code>UMLAL<\/code>, <code>SMLAL<\/code>, or <code>UMAAL<\/code>.<\/p>\n<p>There are also division instructions, but they are architecturally optional and raise an &#8220;invalid instruction&#8221; trap on processors that don&#8217;t support them.<\/p>\n<pre>    ; unsigned divide\r\n    udiv    Rd, Rn, Rm          ; Rd = Rn \/ Rm (unsigned)\r\n\r\n    ; signed divide\r\n    sdiv    Rd, Rn, Rm          ; Rd = Rn \/ Rm (signed)\r\n<\/pre>\n<p>The division instructions perform integer unsigned or signed division, with the result rounded toward zero. In the special case of signed division of <code>0x80000000 \u00f7 0xFFFFFFFF<\/code>, the processor produces a result of <code>0x80000000<\/code> without trapping. By default, division by zero does not trap; it just returns zero. 
However, some revisions allow the operating system to enable trapping on division by zero. Windows enables trapping when the processor supports it.\u00b2<\/p>\n<p>If hardware support for division is not present, the instructions trap into the kernel, where the operation is emulated. Operating system code generally does not assume hardware division support, and division will call out to a helper function to perform the division.<\/p>\n<p>I&#8217;m skipping over the SIMD and multimedia instructions, like saturating arithmetic and parallel arithmetic. I have yet to see them in compiler-generated code.<\/p>\n<p>Next time, we&#8217;ll look at the lie hiding inside the <code>CMN<\/code> instruction.<\/p>\n<p><b>Bonus chatter<\/b>: Commenter Petteri Aimonen points out that even though the division operation does not produce the remainder, you can recover the remainder with just one additional instruction, thanks to the &#8220;multiply and subtract&#8221; instruction:<\/p>\n<pre>    sdiv    Rq, Rn, Rm          ; Rq = Rn \/ Rm (signed)\r\n    mls     Rr, Rq, Rm, Rn      ; Rr = Rn - (Rq * Rm) = Rn % Rm\r\n<\/pre>\n<p>In practice, the MSVC, gcc and clang compilers default to assuming that <code>sdiv<\/code> is an emulated instruction and performing the division manually rather than risking a trap. The emulated version produces the remainder for free as a by-product. If you tell them to assume armv7ve, then they will enable the native division instruction. The gcc and clang compilers will use <code>mls<\/code> to calculate the remainder. MSVC breaks it into separate <code>mul<\/code> and <code>subs<\/code> instructions.<\/p>\n<p>\u00b9 Classic ARM also supports shifting by an amount provided by a fourth register, leading to instructions like<\/p>\n<pre>    ADD     Rd, Rn, Rm, LSL Rs  ; Rd = Rn + (Rm &lt;&lt; Rs)\r\n<\/pre>\n<p>\u00b2 There is no dedicated &#8220;divide by zero&#8221; trap. 
Instead, if division by zero is attempted, the processor raises an &#8220;invalid instruction&#8221; trap. The trap handler is expected to parse the faulting instruction, identify it as a valid division instruction, and then realize that the divisor is zero.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Starting with basic mathematics.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105280","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Starting with basic mathematics.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105280","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105280"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105280\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105280"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105280"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-jso
n\/wp\/v2\/tags?post=105280"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}