{"id":105369,"date":"2021-06-25T07:00:00","date_gmt":"2021-06-25T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105369"},"modified":"2021-08-06T13:07:11","modified_gmt":"2021-08-06T20:07:11","slug":"20210625-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210625-00\/?p=105369","title":{"rendered":"The ARM processor (Thumb-2), part 20: Code walkthrough"},"content":{"rendered":"<p>As is traditional, I wrap up the processor overview series with an annotated walkthrough of a simple function. Here&#8217;s the function again:<\/p>\n<pre>extern FILE _iob[];\r\n\r\nint fclose(FILE *stream)\r\n{\r\n    int result = EOF;\r\n\r\n    if (stream-&gt;_flag &amp; _IOSTRG) {\r\n        stream-&gt;_flag = 0;\r\n    } else {\r\n        int index = stream - _iob;\r\n        _lock_str(index);\r\n        result = _fclose_lk(stream);\r\n        _unlock_str(index);\r\n    }\r\n\r\n    return result;\r\n}\r\n<\/pre>\n<p>Let&#8217;s dive in.<\/p>\n<p>This function takes a single pointer parameter, which therefore is passed in the <var>r0<\/var> register. No parameters are passed on the stack.<\/p>\n<pre>    push    {r3-r6,r11,lr}\r\n<\/pre>\n<p>We start by building our stack frame. From this one instruction we already learn that<\/p>\n<ul>\n<li>This is not a lightweight leaf function, because we are using the stack. Saving the frame pointer <var>r11<\/var> and return address <var>lr<\/var> is therefore required.<\/li>\n<li>We have one word of local variables and outbound parameters. This is inferred by the inclusion of the otherwise-garbage <var>r3<\/var> register. We don&#8217;t actually care about the value of the <var>r3<\/var> register. 
We are pushing it for the side effect of allocating space on the stack.<\/li>\n<li>We need three additional registers: <var>r4<\/var>, <var>r5<\/var>, and <var>r6<\/var>.<\/li>\n<\/ul>\n<pre>    add     r11, sp, #0x10      ; link into stack frame chain\r\n<\/pre>\n<p>The next step in the standard prologue is to point the <var>r11<\/var> register at the place where we saved the previous <var>r11<\/var> register, in order to maintain the stack frame chain.<\/p>\n<pre>    mov     r5, r0              ; r5 = stream\r\n<\/pre>\n<p>We save the stream pointer in a non-volatile register for safekeeping.<\/p>\n<pre>;   int result = EOF;\r\n;   if (stream-&gt;_flag &amp; _IOSTRG) {\r\n\r\n    ldr     r3, [r5,#0xC]       ; r3 = stream-&gt;_flag\r\n    mvn     r6, #0              ; r6 \"result\" = -1\r\n    tst     r3, #0x40           ; Q: Is _IOSTRG set?\r\n    beq     notstring           ; N: Then need to flush for real\r\n<\/pre>\n<p>The compiler interleaved the initialization of the <var>result<\/var> variable (which is evidently being kept in register <var>r6<\/var>) with the test of the <code>_flag<\/code> member.<\/p>\n<p>Initializing <var>result<\/var> is done by moving <code>~0<\/code>, which is the same as <code>0xFFFFFFFF<\/code> or <code>-1<\/code>.<\/p>\n<p>Testing the <code>_IOSTRG<\/code> bit is done by loading the flags into the <var>r3<\/var> register (a scratch register) and using the <code>TST<\/code> instruction, which sets the condition flags based on the result of a bitwise AND operation. If the flag is clear, then the result is zero (&#8220;equal&#8221;), and the jump is taken. 
If the flag is set, then we fall through.<\/p>\n<pre>;   stream-&gt;_flag = 0;\r\n\r\n    movs    r3, #0              ; r3 = 0\r\n    str     r3, [r5,#0xC]       ; stream-&gt;_flag = 0\r\n    b       done                ; end of \"true\" branch\r\n<\/pre>\n<p>If the flag is set, then we enter the &#8220;true&#8221; branch of the <code>if<\/code> statement, which sets the <code>_flag<\/code> to zero. We cannot move a constant directly into memory, so we first load the constant into a scratch register (<var>r3<\/var>) and then store the register to memory.<\/p>\n<p>Note that we use a <code>MOVS<\/code> instruction, which sets flags, even though we don&#8217;t care about the flags. That&#8217;s because the 8-bit immediate <code>MOVS<\/code> instruction has a compact 16-bit encoding, whereas the corresponding <code>MOV<\/code> instruction uses a 32-bit encoding, so switching to <code>MOVS<\/code> reduces code size.<\/p>\n<pre>notstring:\r\n;   int index = stream - _iob;\r\n\r\n    ldr     r3, =|_iob|         ; r3 = address of _iob\r\n    subs    r4, r5, r3          ; r4 = stream - _iob (byte offset)\r\n    asrs    r0, r4, #4          ; r0 = r4 \/ 16 (convert to index)\r\n<\/pre>\n<p>We use the <code>LDR<\/code> pseudo-instruction to load the address of the <code>_iob<\/code> array from the literal pool into a scratch register <var>r3<\/var>. We subtract that from the <var>stream<\/var> variable, producing the byte offset into the preserved register <var>r4<\/var>. 
Shifting that right by 4 is the same as dividing by 16, which produces the index into the <var>r0<\/var> register.<\/p>\n<pre>;   _lock_str(index);\r\n\r\n    bl      |_lock_str|         ; _lock_str(index)\r\n<\/pre>\n<p>The <var>r0<\/var> register is exactly where we pass the <var>index<\/var> parameter to the <code>_lock_str<\/code> function, so we&#8217;re all set to call it.<\/p>\n<pre>;   result = _fclose_lk(stream);\r\n\r\n    mov     r0, r5              ; r0 = stream\r\n    bl      |_fclose_lk|        ; _fclose_lk(stream)\r\n    mov     r6, r0              ; save result\r\n<\/pre>\n<p>Next comes another function call, this time to close the stream. We put the first (and only) parameter into <var>r0<\/var> and call the function. The result comes back in <var>r0<\/var>, and we save it in <var>r6<\/var> so we can return it when we&#8217;re done.<\/p>\n<pre>;   _unlock_str(index);\r\n\r\n    asrs    r0, r4, #4          ; r0 = r4 \/ 16 (convert to index)\r\n    bl      |_unlock_str|       ; _unlock_str(index)\r\n<\/pre>\n<p>To call <code>_unlock_str<\/code>, we recalculate the index from the byte offset (still in <var>r4<\/var>, since <var>r4<\/var> is a preserved register) and put it into <var>r0<\/var>.<\/p>\n<p>It may seem odd to recalculate the index from the byte offset. Why not just save the index the first time?<\/p>\n<p>The reason is that <code>mov r0, r4<\/code> and <code>asrs r0, r4, #4<\/code> are the same size: They both use a 16-bit encoding. Recalculating the value takes the same number of code bytes as copying it, and it avoids having to save the index anywhere, thereby saving two bytes. 
Thanks to the barrel shifter (which the ARM is very proud of, in case you have forgotten), shifting a register is just as fast as copying it.<\/p>\n<p>We now fall through to the end of the function.<\/p>\n<pre>done:\r\n\r\n;   return result;\r\n\r\n    mov     r0, r6              ; return result (r6)\r\n<\/pre>\n<p>The function return value goes into <var>r0<\/var>, so we copy it there from <var>r6<\/var>.<\/p>\n<pre>    pop     {r3-r6,r11,pc}\r\n<\/pre>\n<p>For this function, we can pack the function epilogue into just one instruction: Popping <var>r3<\/var> cleans up our local variables, popping <var>r4<\/var> through <var>r6<\/var> restores the saved registers, popping <var>r11<\/var> unlinks the current stack frame from the stack frame chain, and popping the inbound return address into <var>pc<\/var> transfers control to the return address.<\/p>\n<p>That&#8217;s the end of the function, but we&#8217;re not done yet!<\/p>\n<pre>    __debugbreak                ; recover word alignment\r\n\r\n    dcd     |_iob|\r\n<\/pre>\n<p>We still have the matter of the literal pool we used in the <code>ldr r3, =|_iob|<\/code> pseudo-instruction. That pseudo-instruction turns into the instruction<\/p>\n<pre>    ldr     r3, [pc, #...]      ; load register from memory\r\n<\/pre>\n<p>where the <code>#...<\/code> is the offset to the desired literal. When you use the <var>pc<\/var> register as a base register, its value is rounded down to the nearest multiple of four, and the offset must also be a multiple of four. This means that the literal must be at a word-aligned address. The unreachable <code>__debugbreak<\/code> instruction at the end of the function is just padding so that the <code>|_iob|<\/code> literal can be placed on a word boundary.<\/p>\n<p>So there we have it, our whirlwind tour of the ARM processor in Thumb-2 mode. 
I don&#8217;t know about you, but I&#8217;m exhausted.<\/p>\n<p>\u00b9 Commenter Neil Rashbrook notes that the stack space reserved by pushing the <var>r3<\/var> register is never used. It exists only to satisfy the requirement that the stack be 8-byte aligned.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Putting together what we&#8217;ve learned, with a surprise.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105369","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Putting together what we&#8217;ve learned, with a surprise.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105369","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105369"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105369\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105369"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105369"},{"taxonomy":"post_tag","embeddable":true,"href":"
https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105369"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}