{"id":105311,"date":"2021-06-15T07:00:00","date_gmt":"2021-06-15T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105311"},"modified":"2021-06-15T07:23:05","modified_gmt":"2021-06-15T14:23:05","slug":"20210615-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210615-00\/?p=105311","title":{"rendered":"The ARM processor (Thumb-2), part 12: Control transfer"},"content":{"rendered":"<p>The most basic control transfer is a direct relative branch.<\/p>\n<pre>    b       label       ; unconditional branch\r\n<\/pre>\n<p>The reach of the relative branch is around \u00b116MB, with a compact 16-bit encoding available for branch targets within 2KB.<\/p>\n<p>The relative branch instruction can be conditionalized on the status flags:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Condition<\/th>\n<th>Meaning<\/th>\n<th>Evaluation<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td><code>EQ<\/code><\/td>\n<td>equal<\/td>\n<td>Z = 1<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>NE<\/code><\/td>\n<td>not equal<\/td>\n<td>Z = 0<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>CS<\/code><\/td>\n<td>carry set<\/td>\n<td rowspan=\"2\">C = 1<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>HS<\/code><\/td>\n<td>high or same<\/td>\n<td>unsigned greater than or equal<\/td>\n<\/tr>\n<tr>\n<td><code>CC<\/code><\/td>\n<td>carry clear<\/td>\n<td rowspan=\"2\">C = 0<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>LO<\/code><\/td>\n<td>low<\/td>\n<td>unsigned less than<\/td>\n<\/tr>\n<tr>\n<td><code>MI<\/code><\/td>\n<td>minus<\/td>\n<td>N = 1<\/td>\n<td>signed negative<\/td>\n<\/tr>\n<tr>\n<td><code>PL<\/code><\/td>\n<td>plus<\/td>\n<td>N = 0<\/td>\n<td>signed positive or zero<\/td>\n<\/tr>\n<tr>\n<td><code>VS<\/code><\/td>\n<td>overflow set<\/td>\n<td>V = 1<\/td>\n<td>signed 
overflow<\/td>\n<\/tr>\n<tr>\n<td><code>VC<\/code><\/td>\n<td>overflow clear<\/td>\n<td>V = 0<\/td>\n<td>no signed overflow<\/td>\n<\/tr>\n<tr>\n<td><code>HI<\/code><\/td>\n<td>high<\/td>\n<td>C = 1 and Z = 0<\/td>\n<td>unsigned greater than<\/td>\n<\/tr>\n<tr>\n<td><code>LS<\/code><\/td>\n<td>low or same<\/td>\n<td>C = 0 or Z = 1<\/td>\n<td>unsigned less than or equal<\/td>\n<\/tr>\n<tr>\n<td><code>GE<\/code><\/td>\n<td>greater than or equal<\/td>\n<td>N = V<\/td>\n<td>signed greater than or equal<\/td>\n<\/tr>\n<tr>\n<td><code>LT<\/code><\/td>\n<td>less than<\/td>\n<td>N \u2260 V<\/td>\n<td>signed less than<\/td>\n<\/tr>\n<tr>\n<td><code>GT<\/code><\/td>\n<td>greater than<\/td>\n<td>Z = 0 and N = V<\/td>\n<td>signed greater than<\/td>\n<\/tr>\n<tr>\n<td><code>LE<\/code><\/td>\n<td>less than or equal<\/td>\n<td>Z = 1 or N \u2260 V<\/td>\n<td>signed less than or equal<\/td>\n<\/tr>\n<tr>\n<td><code>AL<\/code><\/td>\n<td>always<\/td>\n<td>always true<\/td>\n<td>unconditional<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The conditions come in pairs (aside from <code>AL<\/code>), and toggling the bottom bit of the four-bit condition code negates the condition. In the 16-bit conditional branch encoding, this maps to the bottom bit of the first byte of the instruction. In the 32-bit conditional branch encoding, you toggle <code>0x40<\/code> in the second byte of the instruction.<\/p>\n<p>The conditions are named after the behavior that is expected if they come directly after a <code>CMP<\/code> instruction. 
For example, a <code>BEQ<\/code> instruction that comes directly after a <code>CMP<\/code> is a conditional branch that is taken if the comparison was between two equal values.<\/p>\n<p>Encoding the condition uses up four bits of instruction space, so a conditional branch cannot reach as far as an unconditional one: about \u00b1254 bytes for the 16-bit encoding (versus \u00b12KB) and about \u00b11MB for the 32-bit encoding (1\/16th of the unconditional \u00b116MB).<\/p>\n<p>There are special conditional branch instructions for testing whether a register is zero.<\/p>\n<pre>    cbz     Rn, label       ; branch if Rn == 0\r\n    cbnz    Rn, label       ; branch if Rn != 0\r\n<\/pre>\n<p>These are 16-bit instructions which are available only for low registers, and they are capable only of branching <i>forward<\/i> by up to 126 bytes.\u00b9<\/p>\n<p>Subroutine calls are performed by branching to the first instruction of the subroutine and putting the return address in the <var>lr<\/var> register. This should feel familiar, for all of the other non-x86 processors we&#8217;ve reviewed perform subroutine linkage the same way.<\/p>\n<pre>    ; branch and link, stay in Thumb-2\r\n    bl      label           ; lr = next instruction + 1\r\n                            ; execution resumes at label\r\n\r\n    ; branch and link with exchange, switch to classic ARM\r\n    blx     label           ; lr = next instruction + 1\r\n                            ; execution resumes at label\r\n<\/pre>\n<p>These instructions have a reach of approximately \u00b116MB.<\/p>\n<p>Windows uses Thumb-2 exclusively, so you won&#8217;t see the <code>blx<\/code> instruction used in this way. The <code>X<\/code> stands for &#8220;exchange&#8221;, which means that it swaps between Thumb-2 and classic ARM modes.\u00b2<\/p>\n<p>The return address is stored in <var>lr<\/var>, but with the bottom bit set. There&#8217;s a reason for this.<\/p>\n<p>Thumb-2 instructions must be halfword-aligned, and classic ARM instructions must be word-aligned. 
Therefore, the bottom bit of any code address is known to be zero, so the processor uses it to encode the target instruction set: If the bottom bit is clear, then execution resumes in classic ARM; if the bottom bit is set, then execution resumes in Thumb-2. Switching dynamically between the classic ARM and Thumb-2 instruction sets is known as <i>interworking<\/i>.<\/p>\n<p>Windows uses Thumb-2 exclusively, and the convention is that the bottom bit of function pointers is always set. When you look at function pointers in the debugger, they will always be <i>one larger<\/i> than the address itself.<\/p>\n<pre>    ; branch with exchange\r\n    bx      Rn              ; switch to classic ARM if Rn is even\r\n                            ; execution resumes at Rn &amp; ~1\r\n\r\n    ; branch and link with exchange\r\n    blx     Rn              ; lr = next instruction + 1\r\n                            ; switch to classic ARM if Rn is even\r\n                            ; execution resumes at Rn &amp; ~1\r\n<\/pre>\n<p>Even though the <code>X<\/code> instructions can switch to classic ARM, that switching feature is never used in Windows. Function pointers always have the bottom bit set, so the destination of the <code>BLX<\/code> is always Thumb-2.<\/p>\n<p>The last branch instruction is the table-based branch:<\/p>\n<pre>    ; table branch byte\r\n    tbb     [Rn, Rm]            ; jump to pc + 2 * (byte at Rn + Rm)\r\n\r\n    ; table branch halfword\r\n    tbh     [Rn, Rm, lsl #1]    ; jump to pc + 2 * (halfword at Rn + Rm * 2)\r\n<\/pre>\n<p>The base register points to the start of a jump table, and the second register is the index of a byte (<code>TBB<\/code>) or halfword (<code>TBH<\/code>) entry in the table. 
The value read from the table is then treated as a forward relative branch offset in units of halfwords.<\/p>\n<p>Remember that <var>pc<\/var> has moved ahead four bytes when the instruction executes, so the forward branch is relative to the next instruction, not to the <code>TBB<\/code> or <code>TBH<\/code> instruction.<\/p>\n<p>Since the offsets are stored in an unsigned byte or halfword, the reach of the <code>TBB<\/code> instruction is 514 bytes forward, and the reach of the <code>TBH<\/code> instruction is around 128KB forward.<\/p>\n<p>One thing you might notice is that, if you assume that the bottom bit of the register is set, these two instructions are equivalent:<\/p>\n<pre>    bx      Rn          ; jump to Rn\r\n    mov     pc, Rn      ; jump to Rn\r\n<\/pre>\n<p>The second version takes advantage of the fact that storing a value into the <var>pc<\/var> register acts as a control transfer. In practice, you won&#8217;t see the <code>MOV<\/code> version because it takes a 32-bit encoding, whereas <code>BX<\/code> uses a 16-bit encoding.<\/p>\n<p>Nevertheless, other variations of loading a value into <var>pc<\/var> are still useful:<\/p>\n<pre>    ldr     pc, [r0, #4] ; jump to the address stored at r0 + 4\r\n    pop     {pc}         ; pop return address and jump there\r\n<\/pre>\n<p>Popping a value into the instruction pointer is a common pattern. On entry to a function, you push the registers you need to preserve across the call, and on exit you pop them off. 
The two sets of registers line up, so that everything pops back to the original source register, <i>except<\/i> that you pop the old <var>lr<\/var> into <var>pc<\/var>, so that the <code>pop<\/code> instruction is a combined &#8220;pop registers from the stack&#8221; and &#8220;return to caller&#8221; instruction.<\/p>\n<pre>    ; save a bunch of registers, and the return address\r\n    push    {r3-r6,r11,lr}\r\n\r\n    ...\r\n\r\n    ; restore the registers, except that the return\r\n    ; address goes into pc, thereby jumping there\r\n    pop     {r3-r6,r11,pc}\r\n<\/pre>\n<p>Next time, we&#8217;ll look at conditional execution.<\/p>\n<p>\u00b9 The inability to branch backward with <code>CBNZ<\/code> explains why the sample atomic sequence we used last time uses a two-instruction sequence of <code>cmp r3, #0<\/code> followed by <code>bne<\/code>: It can&#8217;t use <code>cbnz<\/code> because it wants to branch backward to retry the operation.<\/p>\n<p>\u00b2 This instruction was clearly named back when there were <a href=\"https:\/\/www.youtube.com\/watch?v=vS-zEH8YmiM&amp;t=28s\"> only two modes<\/a>. 
Nowadays, naming the instruction &#8220;exchange&#8221; would be ambiguous about which of the many modes it is switching to.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s go places.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105311","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Let&#8217;s go places.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105311","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105311"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105311\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105311"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}