{"id":90791,"date":"2015-07-30T07:00:00","date_gmt":"2015-07-30T14:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20150730-00\/?p=90791\/"},"modified":"2019-03-13T12:17:56","modified_gmt":"2019-03-13T19:17:56","slug":"20150730-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150730-00\/?p=90791","title":{"rendered":"The Itanium processor, part 4: The Windows calling convention, leaf functions"},"content":{"rendered":"<p><a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20150729-00\/?p=90801\">Last time<\/a>, we looked at the general rules for parameter passing on the Itanium. But those rules are relaxed for leaf functions (functions which call no other functions). <\/p>\n<p>Before we start, I need to correct some of the explanation I had given when introducing the calling convention. I used that explanation because it makes for an easier conceptual model, but the reality is slightly different. <\/p>\n<p>First of all, I said that the <code>alloc<\/code> instruction shuffles the registers around and lays out the new local region and output registers. In reality, it is the <code>br.call<\/code> instruction that moves the registers and the <code>alloc<\/code> which sets up the register frame. Since the first instruction of a function is normally <code>alloc<\/code>, it doesn&#8217;t make much difference how the work is distributed between the <code>br.call<\/code> and the <code>alloc<\/code>, because they execute one right after the other. The only time you notice the difference is if you happen to break into the debugger immediately between those two instructions. 
<\/p>\n<p>More precisely, here&#8217;s what the <code>br.call<\/code> instruction does: <\/p>\n<ul>\n<li>Copy the current register frame state (and some other stuff)     to the <code>pfs<\/code> register.<\/li>\n<li>Rotate the registers so that the first output register is now     <var>r32<\/var>.<\/li>\n<li>Create a new register frame as follows:<\/li>\n<ul>\n<li>input registers = caller&#8217;s output registers<\/li>\n<li>no local registers<\/li>\n<li>no output registers<\/li>\n<li>no rotating registers<\/li>\n<\/ul>\n<li>Set the <var>rp<\/var> register to the return address.<\/li>\n<li>Transfer control to the target.<\/li>\n<\/ul>\n<p>In other words, the register stack changes like this: <\/p>\n<table BORDER=\"0\" CELLPADDING=\"3\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr>\n<td>r32<\/td>\n<td>r33<\/td>\n<td>r34<\/td>\n<td>r35<\/td>\n<td>r36<\/td>\n<td>r37<\/td>\n<td>r38<\/td>\n<td>r39<\/td>\n<td>r40<\/td>\n<td>r41<\/td>\n<td>r42<\/td>\n<td>r43<\/td>\n<td COLSPAN=\"8\"><\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"4\" BGCOLOR=\"#ffbbff\" STYLE=\"border: solid 1px black\"><var>f<\/var>&#8216;s Input<\/td>\n<td COLSPAN=\"5\" BGCOLOR=\"#C0FF97\" STYLE=\"border: solid 1px black\"><var>f<\/var>&#8216;s Local<\/td>\n<td COLSPAN=\"3\" BGCOLOR=\"#FFBBBB\" STYLE=\"border: solid 1px black\"><var>f<\/var>&#8216;s Output<\/td>\n<td COLSPAN=\"8\">&emsp;Before <var>f<\/var> does a <code>br.call<\/code><\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"20\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"9\"><\/td>\n<td>r32<\/td>\n<td>r33<\/td>\n<td>r34<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"9\">On entry to <var>g<\/var>&emsp;<\/td>\n<td COLSPAN=\"3\" BGCOLOR=\"#FFBBBB\" STYLE=\"border: solid 1px black\"><var>g<\/var>&#8216;s Input<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"20\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td 
COLSPAN=\"9\"><\/td>\n<td>r32<\/td>\n<td>r33<\/td>\n<td>r34<\/td>\n<td>r35<\/td>\n<td>r36<\/td>\n<td>r37<\/td>\n<td>r38<\/td>\n<td>r39<\/td>\n<td>r40<\/td>\n<td>r41<\/td>\n<td>r42<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"9\">After <var>g<\/var> does an <code>alloc<\/code>&emsp;<\/td>\n<td COLSPAN=\"3\" BGCOLOR=\"#FFBBBB\" STYLE=\"border: solid 1px black\"><var>g<\/var>&#8216;s Input<\/td>\n<td COLSPAN=\"4\" BGCOLOR=\"#FFFF99\" STYLE=\"border: solid 1px black\"><var>g<\/var>&#8216;s Local<\/td>\n<td COLSPAN=\"4\" BGCOLOR=\"#ACF3FD\" STYLE=\"border: solid 1px black\"><var>g<\/var>&#8216;s Output<\/td>\n<\/tr>\n<\/table>\n<p>A consequence of this division of labor between <code>br.call<\/code> and <code>alloc<\/code> is that leaf functions can take advantage of this default register frame: If a leaf function can do all its work with just <\/p>\n<ul>\n<li>its input registers<\/li>\n<li>scratch registers<\/li>\n<li>the     <a HREF=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2004\/01\/13\/58199.aspx\">red zone<\/a><\/li>\n<\/ul>\n<p>then it doesn&#8217;t need to perform an <code>alloc<\/code> at all! It can use the default register allocation of &#8220;Caller&#8217;s output registers become my input registers, and I have no local registers or output registers.&#8221; When finished, the function just does a <code>br.ret rp<\/code> to return to the caller. <\/p>\n<p>Note that this optimization is available only to leaf functions. If the function calls another function, then the <code>br.call<\/code> will overwrite the <code>pfs<\/code> and <code>rp<\/code> registers, which will make it hard to return to your caller when you&#8217;re done. <\/p>\n<p>The red zone is officially known as the <i>scratch area<\/i>. The first 16 bytes on the stack are available for use by the currently executed function. If you want values to be preserved across a function call, you need to move them out of the scratch area, because they become the scratch area for the function being called! 
In other words, the scratch area is not preserved across function calls. <\/p>\n<p>A more obscure consequence of this division of labor between <code>br.call<\/code> and <code>alloc<\/code> is that a function could in principle perform <code>alloc<\/code> more than once in order to change the size of its local region or the number of output registers. For example, a function might start by saying, &#8220;I have five local registers and two output registers,&#8221; and then later realize, &#8220;Oh, wait, I need to call a function with six parameters. I will issue a new <code>alloc<\/code> instruction that requests five local registers and <i>six<\/i> output registers.&#8221; Although technically legal, it doesn&#8217;t often occur in practice because it&#8217;s usually easier just to set up your register state once and stick with it for the duration of the function. <\/p>\n<p>A more common reason for an <code>alloc<\/code> to appear somewhere other than the start of a function is an early exit that can be determined using only leaf-available resources. <\/p>\n<pre>\nextern HANDLE LogFile;\n\nvoid Log(char *message, char *file, int line)\n{\n if (!LogFile) return;\n ... complicated logging code goes here ...\n}\n<\/pre>\n<p>If profiling feedback indicates that logging is rarely enabled, then the compiler can avoid setting up all the registers and stack for the complicated logging code, at least until it knows that logging is enabled. <\/p>\n<pre>\n.Log:\n      addl    r30 = -205584, gp ;; \/\/ get address of LogFile variable\n      ld8     r30 = [r30] ;;      \/\/ fetch the value\n      cmp.eq  p6, p0 = r30, r0    \/\/ is it zero?\n(p6)  br.ret  rp                  \/\/ return if so\n\n  \/\/ Okay, we are really logging. Set up our stack.\n      alloc   r35 = ar.pfs, 3, 5, 6, 0 \/\/ set up register frame\n      sub     sp = sp, 0x240      \/\/ set up stack buffers\n      mov     r36 = rp            \/\/ save return address\n\n      ... 
do complicated logging ...\n\n      mov     rp = r36            \/\/ return address\n      mov.i   ar.pfs = r35        \/\/ restore pfs\n      br.ret.sptk.many  rp ;;     \/\/ return to caller\n<\/pre>\n<p>The first instruction calculates the effective address of the <code>Log&shy;File<\/code> variable. We&#8217;ll learn more about the <var>gp<\/var> register later. <\/p>\n<p>The second instruction loads an 8-byte value from that address, thereby retrieving the value of <code>Log&shy;File<\/code>. <\/p>\n<p>The third instruction compares the value against <var>r0<\/var>, which is a hard-coded zero register. It asks for an equality comparison, putting the answer in the predicate register <var>p6<\/var> (and putting the complement of the answer in <var>p0<\/var>, which effectively throws it away). <\/p>\n<p>The fourth instruction conditionally returns from the function if the comparison was true. In the common case where logging is not enabled, the function returns at this point. Only if logging is enabled do the <code>alloc<\/code> and related instructions execute to set up the stack frame and then perform the complicated logging. <\/p>\n<p>This is an example of an optimization known as <i>shrink-wrapping<\/i>. Shrink-wrapping occurs when a function does some work with a temporary stack frame, and then expands to a full stack frame only if it is needed. (Shrink-wrapping entails a few extra entries in the unwind exception table because the unwinding needs to take place differently depending on where in the function the exception occurred. I&#8217;ll spare you the details.) <\/p>\n<p>Okay, that&#8217;s all for leaf functions and getting to the bottom of the whole <code>br.call<\/code> \/ <code>alloc<\/code> dance. <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20150731-00\/?p=90771\">Next time<\/a>, we&#8217;ll look at function pointers and the funky <var>gp<\/var> register. 
Here&#8217;s something to whet your appetite: On ia64, a function pointer is not the address of the first instruction in the function&#8217;s code. In fact, it&#8217;s nowhere near the function&#8217;s code. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Just use what you&#8217;ve got.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-90791","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>Just use what you&#8217;ve got.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/90791","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=90791"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/90791\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=90791"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=90791"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=90791"}],"curies":[{"name":"wp","href"
:"https:\/\/api.w.org\/{rel}","templated":true}]}}