{"id":102796,"date":"2019-08-22T07:00:00","date_gmt":"2019-08-22T14:00:00","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/oldnewthing\/?p=102796"},"modified":"2019-09-13T22:00:41","modified_gmt":"2019-09-14T05:00:41","slug":"20190822-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190822-00\/?p=102796","title":{"rendered":"The SuperH-3, part 14: Patterns for function calls"},"content":{"rendered":"<p>Function calls on the SH-3 are rather cumbersome. <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190816-00\/?p=102788\"> The <code>BSR<\/code> instruction has a reach of only 4KB<\/a>, which makes it impractical for compiler-generated code because the compiler doesn&#8217;t know where the linker is going to put the function it&#8217;s calling. In practice, all function calls in compiler-generated code are performed with the <code>JSR<\/code> instruction, which calls a function whose address is given by a register.<\/p>\n<p>The typical case of a direct function call goes like this:<\/p>\n<pre>    MOV.L   r3, @(16, r15)          ; parameter 5 passed on the stack\r\n    MOV     r8, r7                  ; parameter 4 copied from another register\r\n    MOV     #20, r6                 ; parameter 3 is address of local variable\r\n    ADD     r15, r6                 ; r6 = r15 + 20\r\n    MOV     #8, r5                  ; parameter 2 is calculated in place\r\n    <span style=\"color: blue;\">MOV.L   #function, r0           ; r0 = function to call\r\n    JSR     @r0                     ; call the function<\/span>\r\n    MOV     @(24,r15), r4           ; parameter 1 copied from the stack\r\n                                    ; (in the branch delay slot)\r\n<\/pre>\n<p>We load the function address into some register. The compiler usually uses one of the non-parameter scratch registers for this purpose, <var>r0<\/var> through <var>r3<\/var>. 
Note that we wrote this as a 32-bit immediate, but that is a pseudo-instruction which the assembler converts to a PC-relative load, with a constant embedded in the code segment.<\/p>\n<pre>    ; You write\r\n    MOV.L   #function_address, r0   ; r0 = function to call\r\n\r\n    ; Assembler produces\r\n    MOV.L   @(n, PC), r0            ; r0 = function to call\r\n\r\n    ... around n+4 bytes later ...\r\n    .data.l function_address        ; constant stored in code segment\r\n<\/pre>\n<p>The notation used by the Microsoft SH-3 assembler is that the name of a label is treated as its address. You don&#8217;t need to say <code>offset<\/code> like you do in the Microsoft 80386 assembler.<\/p>\n<p>We also prepare the parameters for the call. As we noted when we discussed the calling convention, the first four parameters go in registers <var>r4<\/var> through <var>r7<\/var>, and the rest go on the stack.<\/p>\n<p>In practice, the parameters will be prepared in whatever order the compiler finds convenient, and they will be interleaved with the code that prepares the function address (and with each other) in order to improve scheduling.<\/p>\n<p>The final instruction for setting up the parameters can go into the branch delay slot, provided it does not use a PC-relative addressing mode.<\/p>\n<pre>    MOV.L   #function, r0           ; r0 = function to call\r\n    MOV.L   @(24, r15), r5          ; r5 = local variable\r\n    JSR     @r0                     ; call the function\r\n    MOV.L   #large_constant, r4     ; r4 = some large constant\r\n    ^^^^^ ILLSLOT EXCEPTION         ; (in the branch delay slot)\r\n<\/pre>\n<p>The <code>MOV.L #large_constant, r4<\/code> will be encoded by the assembler as a PC-relative load, which is illegal in a branch delay slot. 
Fortunately, the assembler will not let you do this:<\/p>\n<pre>error A151: Can't compute PC displacement in a delay slot\r\n<\/pre>\n<p>To fix this, you&#8217;ll have to move the PC-relative load out of the delay slot, preferably by swapping it with some instruction that it does not depend on.<\/p>\n<pre>    MOV.L   #function, r0           ; r0 = function to call\r\n    <span style=\"color: blue;\">MOV.L   #large_constant, r4     ; r4 = some large constant<\/span>\r\n    JSR     @r0                     ; call the function\r\n    <span style=\"color: blue;\">MOV.L   @(24, r15), r5          ; r5 = local variable<\/span>\r\n                                    ; (in the branch delay slot)\r\n<\/pre>\n<p>Calling a function through a global variable function pointer (such as through the import address table, in the case of a function that was declared as <code>__declspec(dllimport)<\/code>) involves two memory accesses: one to get the address of the global variable, and another to get the code pointer.<\/p>\n<pre>    MOV.L   #variable, r0           ; r0 = variable that holds the fptr\r\n    MOV.L   @r0, r0                 ; r0 = the address to call\r\n    JSR     @r0                     ; call it\r\n<\/pre>\n<p>Here and in the subsequent examples, I&#8217;ve removed the parameter-loading instructions.<\/p>\n<p>Calling a virtual function means getting the function address from the object&#8217;s vtable.<\/p>\n<pre>    MOV     r8, r4                  ; r4 = \"this\" for function call\r\n    MOV.L   @r4, r0                 ; load vtable pointer into r0\r\n    MOV.L   @(n, r0), r0            ; load function pointer from vtable into r0\r\n    JSR     @r0                     ; call it\r\n<\/pre>\n<p>And calling a na\u00efvely-imported function means calling a stub.<\/p>\n<pre>    MOV.L   #stub_address, r0       ; r0 = pointer to stub function\r\n    JSR     @r0                     ; call it\r\n\r\n    ...\r\nstub:\r\n    MOV.L   #__imp__Function, r0    ; r0 = pointer to IAT 
entry\r\n    MOV.L   @r0, r0                 ; r0 = the address to call\r\n    JMP     @r0                     ; and jump there\r\n    NOP                             ; (branch delay slot)\r\n    .data.l __imp__Function         ; address of IAT entry\r\n                                    ; (constant for first MOV.L instruction)\r\n<\/pre>\n<p>Our last common pattern for today is the dense switch statement.<\/p>\n<pre>    switch (value) {\r\n    case 1: ...\r\n    case 2: ...\r\n    case 3: ...\r\n    case 4: ...\r\n    case 5: ...\r\n    default: ...\r\n    }\r\n\r\n        ADD     #-1,r4              ; bias by lowest valid value\r\n        MOV     #4,r3               ; is it in the range of our jump table?\r\n        CMP\/HI  r3,r4\r\n        BT      default             ; N: go to default case\r\n        MOV.L   #jump_table, r2     ; get address of jump table\r\n        MOV     r4,r0               ; prepare for indexed addressing\r\n        MOV.B   @(r0,r2),r0         ; r0 = instruction offset for case\r\n        NOP                         ; (we'll see more about this nop later)\r\n        BRAF    r0                  ; jump to appropriate handler\r\n        NOP                         ; (nothing in the branch delay slot)\r\n\r\n    ...\r\njump_table:\r\n        .data.b 0x0\r\n        .data.b 0x1a\r\n        .data.b 0x2c\r\n        .data.b 0x42\r\n        .data.b 0x78\r\n<\/pre>\n<p>The code first subtracts the lowest non-default case value, producing an index so that all the interesting cases are in the range 0 to <var>n<\/var> for some <var>n<\/var>. If the value is not in that range, then we jump to the <code>default:<\/code>. 
Otherwise, we use the index as an index into a jump table of bytes, and use a <code>BRAF<\/code> instruction to perform a relative jump.<\/p>\n<p>If there is a case label more than 127 bytes away from the <code>BRAF<\/code>, then the jump table expands to contain word offsets, and the index needs to be doubled before being looked up.<\/p>\n<pre>        ADD     #-1,r4              ; bias by lowest valid value\r\n        MOV     #4,r3               ; is it in the range of our jump table?\r\n        CMP\/HI  r3,r4\r\n        BT      default             ; N: go to default case\r\n        MOV.L   #jump_table, r2     ; get address of jump table\r\n        MOV     r4,r0               ; prepare for indexed addressing\r\n        <span style=\"color: blue;\">ADD     r0,r0               ; convert byte offset to word offset\r\n        MOV.W   @(r0,r2),r0         ; r0 = instruction offset for case<\/span>\r\n        BRAF    r0                  ; jump to appropriate handler\r\n        NOP                         ; (nothing in the branch delay slot)\r\n<\/pre>\n<p>We double the index by adding it to itself (<code>add r0, r0<\/code>). This is where the extra <code>NOP<\/code> from the previous case comes into play. The compiler leaves a <code>NOP<\/code> in its code generation so it can choose the size of the jump table later without having to go back and recalculate all its offsets.<\/p>\n<p>In theory the compiler could have emitted the jump table directly into the code rather than dropping just the address of the jump table, which then needs to be indirected through in order to access the actual jump table. That has its drawbacks though: You have a potentially large jump table in your code, which pushes the jump targets further away and makes it more likely you&#8217;re going to <a href=\"https:\/\/www.youtube.com\/watch?v=2I91DJZKRxs&amp;t=33s\"> need a bigger table<\/a>. 
And having the possibility of a variable-sized table means that the calculation of jump offsets requires multiple passes until all the consequences have stabilized. It&#8217;s easier for the compiler to just generate a pointer to a jump table and figure out the jump table later.<\/p>\n<p>I guess in theory, if there is more than 64KB of code in the <code>switch<\/code> statement, the jump table might have to contain longword offsets, and the <code>NOP<\/code> becomes a <code>SHLL2<\/code> to scale the index up so it can access a longword array. I&#8217;ve never seen a function so large that this became an issue, though.<\/p>\n<p>Next time, we&#8217;ll wrap up this whirlwind tour of the SH-3 processor by <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190823-00\/?p=102798\"> walking through some actual code<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Different ways of setting up a function call.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-102796","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Different ways of setting up a function 
call.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102796","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=102796"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102796\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=102796"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=102796"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=102796"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}