{"id":107066,"date":"2022-08-29T07:00:00","date_gmt":"2022-08-29T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=107066"},"modified":"2022-08-28T07:16:13","modified_gmt":"2022-08-28T14:16:13","slug":"20220829-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220829-00\/?p=107066","title":{"rendered":"The AArch64 processor (aka arm64), part 24: Code walkthrough"},"content":{"rendered":"<p>As is traditional, I wrap up the processor overview series with an annotated walkthrough of a simple function. Here&#8217;s the function again:<\/p>\n<pre>extern FILE _iob[];\r\n\r\nint fclose(FILE *stream)\r\n{\r\n    int result = EOF;\r\n\r\n    if (stream-&gt;_flag &amp; _IOSTRG) {\r\n        stream-&gt;_flag = 0;\r\n    } else {\r\n        int index = stream - _iob;\r\n        _lock_str(index);\r\n        result = _fclose_lk(stream);\r\n        _unlock_str(index);\r\n    }\r\n\r\n    return result;\r\n}\r\n<\/pre>\n<p>Here we go.<\/p>\n<p>This function takes a single pointer parameter, which is therefore passed in the <var>x0<\/var> register. No parameters are passed on the stack.<\/p>\n<pre>; int fclose(FILE *stream)\r\n\r\n    stp     x19, x20, [sp,#-0x20]!\r\n    str     x21, [sp,#0x10]\r\n    stp     fp, lr, [sp,#-0x10]!\r\n    mov     fp, sp\r\n<\/pre>\n<p>We start with the function prologue, which creates the stack frame and saves the nonvolatile registers that we will be using inside the function.<\/p>\n<p>The first instruction reserves 32 bytes of stack and stores <var>x19<\/var> and <var>x20<\/var> into the first two slots. The pre-indexed addressing mode (signaled by the exclamation point) updates the base register <var>sp<\/var> with the effective address, so this instruction both stores the registers to memory and moves the stack pointer.<\/p>\n<p>The second instruction stores the <var>x21<\/var> register into the memory that follows <var>x20<\/var>. 
The last 8 bytes are not used; they were allocated in order to preserve 16-byte stack pointer alignment.<\/p>\n<p>The third instruction pushes the frame pointer and link register onto the stack. Notice that this function adjusted the stack pointer twice. I&#8217;m not sure how the compiler decides whether to reserve stack space all at once, or whether to reserve it little by little, like we did here.<\/p>\n<p>After all the registers have been stored, we set <var>fp<\/var> to the current stack pointer, which makes it point to where we stored the previous <var>fp<\/var>, thereby linking a new node onto the chain of stack frames.<\/p>\n<p>Now that the prologue is out of the way, we can start with the function body.<\/p>\n<pre>    mov     x20, x0             ; x20 = stream\r\n<\/pre>\n<p>The compiler takes the <var>stream<\/var> parameter, which was received in <var>x0<\/var>, and saves it in the nonvolatile register <var>x20<\/var> so it can be preserved across function calls.<\/p>\n<pre>; int result = EOF;\r\n; if (stream-&gt;_flag &amp; _IOSTRG) {\r\n\r\n    ldr     w8, [x20, #0xC]     ; w8 = stream-&gt;_flag\r\n    mov     w21, #-1            ; w21 = EOF\r\n    tbz     x8, #6, nostring    ; branch if _IOSTRG bit is zero\r\n<\/pre>\n<p>The work for the next two lines of code is interleaved. The compiler appears to have chosen to use <var>w21<\/var> to hold the <code>result<\/code> variable, so it initializes it to <code>-1<\/code>. The disassembler shows it as a <code>MOV<\/code>, but the raw instruction is really a <code>MOVN w21, #0, LSL #0<\/code>.<\/p>\n<p>The initialization of the <code>result<\/code> variable is sandwiched between the two halves of the test for the <code>_IOSTRG<\/code> bit. 
We load the value of <code>_flag<\/code> into the <var>w8<\/var> register and test bit 6, which is the bit that corresponds to <code>_IOSTRG<\/code>, branching if the bit is clear (test bit zero).<\/p>\n<pre>;    stream-&gt;_flag = 0;\r\n; } else {\r\n\r\n    str     wzr, [x20, #0xC]    ; set _flag to zero\r\n    b       done                ; end of \"true\" branch\r\n<\/pre>\n<p>If the branch is not taken, we fall through and store a 32-bit zero to <code>_flag<\/code>. That&#8217;s the end of the &#8220;true&#8221; branch.<\/p>\n<pre>;   int index = stream - _iob;\r\n\r\nnostring:\r\n    adrp    x8, sample+0x2000   ; load high bits of pointer\r\n    add     x8, x8, #0x0180     ; x8 -&gt; _iob\r\n    sub     x9, x20, x8         ; calculate byte offset\r\n    asr     x19, x9, #4         ; x19 = element index\r\n<\/pre>\n<p>In the &#8220;false&#8221; branch, we calculate the stream index. First, we load up the address of <code>_iob<\/code>. This takes two instructions: the first loads the address of the page that holds the <code>_iob<\/code> variable, and the second adds the offset of <code>_iob<\/code> within that page.<\/p>\n<p>We subtract <code>_iob<\/code> from <code>stream<\/code> to get the byte offset, and convert it to an index by dividing by the size of a single <code>FILE<\/code>, which happens to be 16, so the division can be done with an arithmetic right shift by 4. The index is kept in <var>x19<\/var>.<\/p>\n<pre>;   _lock_str(index);\r\n\r\n    mov     w0, w19             ; parameter is the index\r\n    bl      _lock_str\r\n<\/pre>\n<p>The index is the sole parameter to <code>_lock_str<\/code>, so we put it into <var>w0<\/var> and call the function.<\/p>\n<pre>;   result = _fclose_lk(stream);\r\n\r\n    mov     x0, x20             ; parameter is the stream\r\n    bl      _fclose_lk          ; call _fclose_lk\r\n    mov     w21, w0             ; save the result\r\n<\/pre>\n<p>Next up is calling <code>_fclose_lk<\/code> with the stream as the parameter. 
We save the return value into <var>w21<\/var>, which represents the <code>result<\/code> variable.<\/p>\n<pre>;   _unlock_str(index);\r\n; }\r\n\r\n    mov     w0, w19             ; parameter is the index\r\n    bl      _unlock_str\r\n<\/pre>\n<p>Unlocking the stream is done by index, which is fortunately still sitting around in the <var>w19<\/var> register.<\/p>\n<pre>; return result;\r\n\r\ndone:\r\n    mov     w0, w21\r\n<\/pre>\n<p>The function return value goes into <var>x0<\/var>, so we move <var>w21<\/var> (representing <code>result<\/code>) into the lower 32 bits of the <var>x0<\/var> register.<\/p>\n<pre>; }\r\n    ldp     fp, lr, [sp], #0x10\r\n    ldr     x21, [sp, #0x10]\r\n    ldp     x19, x20, [sp], #0x20\r\n    ret\r\n<\/pre>\n<p>And we&#8217;re done. Now it&#8217;s time to clean up. We pop off the previous frame pointer and return address, then restore and pop the other nonvolatile registers we had saved. Finally, we perform a <code>ret<\/code> to jump back to the return address in <var>lr<\/var>.<\/p>\n<p>When I do these walkthroughs, I look to see if there is anything I could do to tighten up the code. The interesting thing that the compiler failed to recognize is that the lifetimes of <code>result<\/code> and <code>stream<\/code> do not overlap in any meaningful way, so they could share the same register. This reduces the number of nonvolatile registers by one, which saves 16 bytes of stack since we no longer need to save <var>x21<\/var>.<\/p>\n<p>Another trick is to fold the <code>asr<\/code> into the <code>mov<\/code> instruction that sets up the <code>index<\/code> parameter, saving an instruction.<\/p>\n<pre>; int fclose(FILE *stream)\r\n    stp     x19, x20, [sp,#-0x10]!  ; NEW! Need only 0x10 bytes\r\n                                ; NEW! 
Don't need to save x21\r\n    stp     fp, lr, [sp,#-0x10]!\r\n    mov     fp, sp\r\n\r\n    mov     x20, x0             ; x20 = stream\r\n\r\n; int result = EOF;\r\n; if (stream-&gt;_flag &amp; _IOSTRG) {\r\n\r\n    ldr     w8, [x20, #0xC]     ; w8 = stream-&gt;_flag\r\n    tbz     x8, #6, nostring    ; branch if _IOSTRG bit is zero\r\n\r\n;    stream-&gt;_flag = 0;\r\n; } else {\r\n\r\n    str     wzr, [x20, #0xC]    ; set _flag to zero\r\n                                ; NEW! \"stream\" is dead, so\r\n                                ;      w20 now represents \"result\"\r\n    mov     w20, #-1            ; result = EOF\r\n    b       done                ; end of \"true\" branch\r\n\r\n;   int index = stream - _iob;\r\n\r\nnostring:\r\n    adrp    x8, sample+0x2000   ; load high bits of pointer\r\n    add     x8, x8, #0x0180     ; x8 -&gt; _iob\r\n    sub     x19, x20, x8        ; calculate byte offset (x19)\r\n\r\n;   _lock_str(index);\r\n\r\n                                ; NEW! Convert byte offset to index\r\n                                ;      on the fly\r\n    asr     w0, w19, #4         ; parameter is the index\r\n    bl      _lock_str\r\n\r\n;   result = _fclose_lk(stream);\r\n;   _unlock_str(index);\r\n; }\r\n\r\n    mov     x0, x20             ; parameter is the stream\r\n    bl      _fclose_lk          ; call _fclose_lk\r\n\r\n                                ; NEW! \"stream\" is dead, so\r\n                                ;      w20 now represents \"result\"\r\n    mov     w20, w0             ; save the result\r\n\r\n                                ; NEW! Convert byte offset to index\r\n                                ;      on the fly\r\n    asr     w0, w19, #4         ; parameter is the index\r\n    bl      _unlock_str\r\n\r\n; return result;\r\n\r\ndone:\r\n    mov     w0, w20\r\n\r\n; }\r\n    ldp     fp, lr, [sp], #0x10\r\n                                ; NEW! 
Don't need to restore x21\r\n    ldp     x19, x20, [sp], #0x10\r\n    ret\r\n<\/pre>\n<p>This is really just recreational optimization at this point. The few extra instructions in the compiler-generated code are not going to be noticeable here, seeing as the <code>fclose<\/code> function is probably going to do things like close file handles, which are far more expensive than just a few instructions.<\/p>\n<p>This concludes our quick overview of the ARM processor in 64-bit mode. Now when you have to look at a crash dump on an ARM64 system, you might have a clue about what you&#8217;re looking at.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Putting theory into practice.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-107066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Putting theory into 
practice.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=107066"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107066\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=107066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=107066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=107066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}