{"id":101046,"date":"2019-02-10T23:00:00","date_gmt":"2019-02-11T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=101046"},"modified":"2019-03-18T11:19:53","modified_gmt":"2019-03-18T18:19:53","slug":"20190211-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190210-00\/?p=101046","title":{"rendered":"The Intel 80386, part 16: Code walkthrough"},"content":{"rendered":"<p>Let&#8217;s put into practice what we&#8217;ve learned so far by walking through a simple function and studying its disassembly. <\/p>\n<pre>\n#define _lock_str(s)                    _lock(s+_STREAM_LOCKS)\n#define _unlock_str(s)                  _unlock(s+_STREAM_LOCKS)\n\nextern FILE _iob[];\n\nint fclose(FILE *stream)\n{\n    int result = EOF;\n\n    if (stream-&gt;_flag &amp; _IOSTRG) {\n        stream-&gt;_flag = 0;\n    } else {\n        int index = stream - _iob;\n        _lock_str(index);\n        result = _fclose_lk(stream);\n        _unlock_str(index);\n    }\n\n    return result;\n}\n<\/pre>\n<p>This is a function from the C runtime library, so the functions use the <code>__cdecl<\/code> calling convention. This means that the parameters are pushed right-to-left, and the caller is responsible for cleaning them from the stack. <\/p>\n<pre>\n_fclose:\n    push    ebx\n    push    esi\n    push    edi\n<\/pre>\n<p>This code was compiled back in the days when frame pointer omission was fashionable. The function does not create a traditional stack frame with the <var>ebp<\/var> register acting as frame pointer. <\/p>\n<p>The 80386 calling convention says that the <var>ebx<\/var>, <var>esi<\/var>, the <var>edi<\/var>, and <var>ebp<\/var> registers must be preserved across the call. <\/p>\n<pre>\n    mov     esi,dword ptr [esp+10h] ; esi = stream\n<\/pre>\n<p>We will be using the <var>stream<\/var> variable a lot, so we&#8217;ll load it into a register for convenient access. <\/p>\n<pre>\n; int result = EOF;\n    mov     edi,0FFFFFFFFh          ; edi = result = EOF\n<\/pre>\n<p>The other variable is <var>result<\/var>, which we will keep in the <var>edi<\/var> register, and we set it to its initial value of &minus;1. This is a straight <code>MOV<\/code> instruction, which is five-byte encoding (one opcode byte plus a four-byte immediate). A smaller encoding would have been <code>or edi, -1<\/code>, which uses two bytes for the opcode and one for the 8-bit signed immediate. But the smaller encoding comes at a perfornance cost because it creates a false dependency on the <var>edi<\/var> register. (Mind you, the 80386 did not have out-of-order execution, so dependencies really aren&#8217;t a factor yet.) <\/p>\n<pre>\n; if (stream-&gt;_flag &amp; _IOSTRG) {\n    test    byte ptr [esi+0Ch],40h  ; is this a string?\n    je      not_string              ; N: then need a true flush\n<\/pre>\n<p>Even though <var>_flag<\/var> is a 32-bit field, we use a byte test to save code size. This takes advantage of the fact that testing a single bit can be done by testing a single bit in a 32-bit field, or by testing a single bit in an 8-bit subfield. The <var>_flag<\/var> field is at offset <code>0Ch<\/code>, and the value of <code>_IOSTRG<\/code> is <code>0x40<\/code>, so the bit we want is in the first byte. <\/p>\n<p>We learned some time ago that this size optimization defeats the <a HREF=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/\">store-to-load forwarder<\/a>, but the 80386 didn&#8217;t have a store-to-load forwarder, so that wasn&#8217;t really a factor. <\/p>\n<pre>\n; stream-&gt;_flag = 0;\n    mov     dword ptr [esi+0Ch],0\n<\/pre>\n<p>Again, the compiler chooses a full 32-bit immediate instead of using a smaller instruction. An alternative would have been <code>and dword ptr [esi+0Ch], 0<\/code>, using a sign-extended 8-bit immediate instead of a 32-bit immediate, but at a cost of incurring a read-modify-write rather than simply a write. <\/p>\n<pre>\n; return result;\n    mov     eax,edi                 ; eax = return value\n    pop     edi\n    pop     esi\n    pop     ebx\n    ret\n<\/pre>\n<p>The compiler chose to inline the common <code>return<\/code> instruction into this branch of the <code>if<\/code> statement. The value being returned is in the <var>result<\/var> variable, which we had enregistered in the <var>edi<\/var> register. The return value goes in the <var>eax<\/var> register, so we move it there. And then we restore the registers we had saved on the stack and return to the caller. Since this function uses the <code>__cdecl<\/code> calling convention, the function does no stack cleanup; it is the caller&#8217;s responsibility to clean the stack. <\/p>\n<pre>\n    nop\n<\/pre>\n<p>This <code>nop<\/code> instruction is padding to bring the next instruction, a jump target, to an address that is a multiple of 16. The 80386 fetches instructions in 16-byte chunks, and putting jump targets at the start of a 16-byte chunk means that all of the fetched bytes are potentially executable. <\/p>\n<pre>\nnot_string:\n; int index = stream - _iob;\n    mov     ebx,esi                 ; ebx = stream\n    sub     ebx,77E243F0h           ; ebx = stream - _iob (byte offset)\n    sar     ebx,5                   ; ebx = stream - _iob (element offset)\n<\/pre>\n<p>This sequence of instructions calculates the value for the <var>index<\/var> local variable, which the compiler chose to enregister in the <var>ebx<\/var> register. We start with the value in the <var>esi<\/var> register, which is the <var>stream<\/var> variable. Next, we subtract the offset of the <var>_iob<\/var> variable, which is a global variable, so its address looks like a constant in the code stream. We then take that byte offset and shift it right by 5, which means dividing by 32, which is the size of a <var>FILE<\/var> structure in this particular implementation. The result now sits in the <var>ebx<\/var> register. <\/p>\n<pre>\n; _lock_str(index) &rArr; _lock(index+_STREAM_LOCKS)\n    add     ebx,19h                 ; add _STREAM_LOCKS\n    push    ebx                     ; the sole parameter\n    call    _lock                   ; call the function\n    add     esp,4                   ; clean stack arguments\n<\/pre>\n<p>The <var>_lock_str<\/var> macro is a wrapper around the <var>_lock<\/var> function. We add <var>STREAM_<\/var><var>LOCKS<\/var>, which happens to be 25, or <code>0x19<\/code>, and the push it onto the stack as the sole parameter for the <var>_lock<\/var> function. Since this is a <code>__cdecl<\/code> function, it is the caller&#8217;s responsibility to clean the stack, so we add 4 (the number of bytes of parameters) to the <var>esp<\/var> register to drop them from the stack. <\/p>\n<pre>\n; result = _fclose_lk(stream)\n    push    esi                     ; the sole parameter\n    call    _fclose_lk              ; call the function\n    add     esp,4                   ; clean stack arguments\n    mov     edi,eax                 ; save in edi = result\n<\/pre>\n<p>Another function call: We push the sole parameter, call the function, and clean the stack. The return value was placed in the <var>eax<\/var> register, so we move it into the <var>edi<\/var> register, which we saw represents the <var>result<\/var> variable. <\/p>\n<pre>\n; _unlock_str(index) &rArr;_unlock(index+_STREAM_LOCKS)\n    push    ebx                     ; the sole parameter\n    call    _unlock                 ; call the function\n    add     esp,4                   ; clean stack arguments\n<\/pre>\n<p>The compiler realized it could pull out the common subexpression <var>s+_STREAM_<\/var><var>LOCKS<\/var> and stored the value of that subexpression in the <var>ebx<\/var> register. It could therefore push the precomputed value (helpfully saved in the <var>ebx<\/var> register) as the parameter for the <var>_lock<\/vAR> function. <\/p>\n<pre>\n; return result;\n    mov     eax,edi                 ; eax = return value\n    pop     edi\n    pop     esi\n    pop     ebx\n    ret\n<\/pre>\n<p>And this is the same code we saw last time. The return value (<var>result<\/var>) is moved to the <var>eax<\/var> register, which is where the <code>__cdecl<\/code> calling convention places it. We then restore the registers we had saved at entry and return to our caller, leavving our caller to clean the stack parameters. <\/p>\n<p>The resulting function size is 81 bytes. <\/p>\n<p>Okay, now let&#8217;s see how we could optimize this function further. Let&#8217;s look closely at the calculation of <var>index + _STREAM_<\/var><var>LOCKS<\/var>. <\/p>\n<pre>\n    mov     ebx,esi                 ; ebx = stream\n    sub     ebx,77E243F0h           ; ebx = stream - _iob (byte offset)\n    sar     ebx,5                   ; ebx = stream - _iob (element offset)\n    add     ebx,19h                 ; add _STREAM_LOCKS\n<\/pre>\n<p>The first thing you might think of is combining the first two instructions into a single <code>LEA<\/code> instruction: <\/p>\n<pre>\n    lea     ebx,[esi+881dbc10h]     ; ebx = stream - _iob (byte offset)\n<\/pre>\n<p>The <code>LEA<\/code> instruction lets us perform an addition operation in a single instruction by taking advantage of the effective address computation circuitry in the memory unit. The operation we want to perform is subtraction of a constant, which we can transform into an addition of the negative of that constant. <\/p>\n<p>Unfortunately, the trick doesn&#8217;t work in this case because the &#8220;constant&#8221; is a relocatable address, and there is no loader fixup type for &#8220;negative of the address of a variable.&#8221; <\/p>\n<p>But all is not lost. There&#8217;s another trick we could use: Fold in the subsequent addition. <\/p>\n<table CLASS=\"cp3\" CELLPADDING=\"3\" BORDER=\"0\">\n<tr>\n<td><var>ebx<\/var><\/td>\n<td>= ((<var>esi<\/var> &minus; 77E243F0h) &gt;&gt; 5) + 19h<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>= ((<var>esi<\/var> &minus; 77E243F0h) &gt;&gt; 5) + (320h &gt;&gt; 5)<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>= (<var>esi<\/var> &minus; 77E243F0h + 320h) &gt;&gt; 5 <\/tr>\n<tr>\n<td><\/td>\n<td>= (<var>esi<\/var> &minus; 77E240D0h) &gt;&gt; 5 <\/tr>\n<\/table>\n<p>Another way to do this calculation: <\/p>\n<table CELLPADDING=\"0\" BORDER=\"0\">\n<tr>\n<td><code>adjusted_index&nbsp;<\/code><\/td>\n<td><code>= stream - _iob + 0x19<\/code><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><code>= stream - (_iob - 0x19)<\/code><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><code>= stream - &amp;_iob[-0x19]<\/code><\/td>\n<\/tr>\n<\/table>\n<p>Either way, the result is this: <\/p>\n<pre>\n    mov     ebx,esi                 ; ebx = stream\n    sub     ebx,77E240D0h           ; ebx = stream - &amp;_iob[-0x19] (byte offset)\n    sar     ebx,5                   ; ebx = stream - &amp;_iob[-0x19] (element offset)\n<\/pre>\n<p>Another observation is that <var>stream<\/var> and <var>result<\/var> do not have overlapping useful lifetimes. The useful lifetime of <var>result<\/var> doesn&#8217;t start until it receives the value from <var>_fclose_lk<\/var>. Prior to that, its value is known at compile time to be <code>EOF<\/code>, so there&#8217;s no need to devote a register to it. <\/p>\n<p>And we can combine the <code>add esp, 4<\/code> with the subsequent <code>push<\/code> (which decrements the <var>esp<\/var> register) by simply storing the new value into the top-of-stack slot. <\/p>\n<p>The case of a string-based stream does not use the <var>ebx<\/var> register, so we can use a technique know as <i>shrink-wrapping<\/i>, where we start with one stack frame, and then expand it to a larger one on certain code paths. In this case, we start by saving only the <var>esi<\/var> register, and then later save the <var>ebx<\/var> register only if we realize that we need it. <\/p>\n<p>A simple size\/speed optimization (in favor of size) is to use the <code>pop<\/code> instruction to pop a value off the stack (and ignore it). This replaces a three-byte <code>add esp,4<\/code> with a one-byte register <code>pop<\/code>. <\/p>\n<p>A very aggressive size optimization would be to replace the two-byte instructions <code>mov eax, r<\/code> or <code>mov r, eax<\/code> with the one-byte <code>xchg eax, r<\/code> instruction. This assumes you need to move the value into or out of the <var>eax<\/var> register and you don&#8217;t care about the source any more. <\/p>\n<p>Finally, a string-based stream is quite uncommon (and certainly the case of closing a string-based stream), so we&#8217;ll make that the out-of-line case, and we won&#8217;t bother optimizing the fetch of the jump target for the same reason. <\/p>\n<pre>\n_fclose:\n    push    esi                     ; save register\n    mov     esi,dword ptr [esp+0Ch] ; esi = stream\n    test    byte ptr [esi+0Ch],40h  ; Is this an _IOSTRG?\n    jnz     is_string                  \n\n    push    ebx                     ; shrink-wrap\n    mov     ebx,esi\n    sub     ebx,77E240D0h           ; ebx = stream - &amp;_iob[-0x19] (byte offset)\n    sar     ebx,5                   ; ebx = index + _STREAM_LOCKS\n    push    ebx\n    call    _lock                   ; call the function\n\n    mov    [esp],esi                ; parameter for _fclose_lk\n    call   _fclose_lk               ; close the stream\n\n    mov    [esp],ebx                ; parameter for _unlock\n    mov    ebx,eax                  ; ebx = result\n    call   _unlock\n\n    pop    eax                      ; clean the stack once\n    mov    eax,ebx                  ; eax = result\n    pop    ebx\n    pop    esi\n    ret\n\nis_string:\n    mov    dword ptr [esi+0Ch],0    ; stream-&gt;_flag = 0\n    or     eax,-1                   ; return EOF\n    pop    esi\n    ret\n<\/pre>\n<p>This reduces the function size to 65 bytes. <\/p>\n<p>Yet another trick is to pre-push the parameters for multiple function calls. <\/p>\n<pre>\n_fclose:\n    mov     ecx,dword ptr [esp+8]   ; ecx = stream\n    test    byte ptr [ecx+0Ch],40h  ; Is this an _IOSTRG?\n    jnz     is_string               ; Y: handle strings out of line\n\n    push    ebx                     ; shrink-wrap\n    mov     ebx,ecx\n    sub     ebx,77E240D0h           ; ebx = stream - &amp;_iob[-0x19] (byte offset)\n    sar     ebx,5                   ; ebx = index + _STREAM_LOCKS\n    push    ecx                     ; push for _fclose_lk\n    push    ebx                     ; push for _lock\n    call    _lock                   ; call the function\n    pop     eax                     ; discard arg to _lock\n    call    _fclose_lk              ; close the stream\n    mov     dword ptr [esp],ebx     ; parameter for _unlock\n    mov     ebx,eax                 ; save result\n    call    _unlock\n    pop     eax                     ; discard arg to _unlock\n    mov     eax,ebx                 ; recover result\n    pop     ebx\n    ret\n\nis_string:\n    mov    dword ptr [ecx+0Ch],0    ; stream-&gt;_flag = 0\n    or     eax,-1                   ; return EOF\n    ret\n<\/pre>\n<p>This brings us down to 57 bytes. <\/p>\n<p>If we abandon the idea of enregistering the <var>result<\/var>, we can do this: <\/p>\n<pre>\n_fclose:\n    mov     ecx,dword ptr [esp+8]   ; ecx = stream\n    test    byte ptr [ecx+0Ch],40h  ; Is this an _IOSTRG?\n    jnz     is_string               ; Y: handle strings out of line\n\n    mov     eax,ecx\n    sub     eax,77E240D0h           ; ebx = stream - &amp;_iob[-0x19] (byte offset)\n    sar     eax,5                   ; ebx = index + _STREAM_LOCKS\n    push    ecx                     ; garbage (for future result)\n    push    eax                     ; push for _unlock\n    push    ecx                     ; push for _fclose_lk\n    push    eax                     ; push for _lock\n    call    _lock                   ; call the function\n    pop     eax                     ; discard arg to _lock\n    call    _fclose_lk              ; close the stream\n    mov     dword ptr [esp+0Ch],eax ; save result\n    pop     eax                     ; discard arg to _lock\n    call    _unlock\n    pop     eax                     ; discard arg to _unlock\n    pop     eax                     ; recover result\n    ret\n\nis_string:\n    mov    dword ptr [ecx+0Ch],0    ; stream-&gt;_flag = 0\n    or     eax,-1                   ; return EOF\n    ret\n<\/pre>\n<p>But this comes out to 59 bytes. <\/p>\n<p><a HREF=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190212-00\/?p=101048\">Next time<\/a>, a bonus chapter on future developments to this architecture. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Putting the information into practice.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-101046","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Putting the information into practice.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/101046","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=101046"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/101046\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=101046"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=101046"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=101046"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}