{"id":105351,"date":"2021-06-23T07:00:00","date_gmt":"2021-06-23T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105351"},"modified":"2021-06-23T06:20:40","modified_gmt":"2021-06-23T13:20:40","slug":"20210623-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210623-00\/?p=105351","title":{"rendered":"The ARM processor (Thumb-2), part 18: Other kinds of prologues and epilogues"},"content":{"rendered":"<p>Last time, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210622-00\/?p=105332\">we looked at the standard function prologue and epilogue<\/a>. There are some variations to the standard that you may encounter from time to time.<\/p>\n<p>Lightweight leaf functions are functions which meet all of the following criteria:<\/p>\n<ul>\n<li>Modify only the non-preserved registers: <var>r0<\/var> through <var>r3<\/var> and <var>r12<\/var>, and <var>d0<\/var> through <var>d7<\/var> and <var>d16<\/var> through <var>d31<\/var>, and flags.<\/li>\n<li>Do not use any stack aside from inbound parameter space.<\/li>\n<\/ul>\n<p>Lightweight leaf functions do not create a stack frame. They must keep the return address in the <var>lr<\/var> register for their entire lifetime so that the kernel can unwind the function to its caller. The requirement that they use only non-preserved registers allows the kernel to unwind without using any unwind codes, since there are no registers that need to be restored during unwinding.<\/p>\n<p>Conversely, any function that lacks unwind codes is assumed to be a lightweight leaf function.<\/p>\n<p>Another variation is the shrink-wrapped function. This is a function that starts out with a small stack frame (or no stack frame at all, pretending to be a lightweight leaf function), in the hope that it can early-out. 
If not, then it expands to a full stack frame.<\/p>\n<p>If a function uses 16 or fewer bytes of local variables and outbound parameters, it can add up to four dummy registers to the initial <code>push<\/code>:<\/p>\n<pre>    push    {r0-r7,r11,lr}\r\n<\/pre>\n<p>The part you recognize is the saving of registers <var>r4<\/var> through <var>r7<\/var>, plus the frame pointer and return address. The sneaky part is that it also saves registers <var>r0<\/var> through <var>r3<\/var>. These extra registers are pushed, not so much because the function wants to save them, but because pushing four additional registers implicitly subtracts 4 \u00d7 4 = 16 bytes from the <var>sp<\/var> register, allocating the local variables and outbound parameters as part of the initial <code>push<\/code>.<\/p>\n<p>In the epilogue, you can use the reverse trick to clean up those extra 16 bytes as part of the final <code>pop<\/code>:<\/p>\n<pre>    pop     {r0-r7,r11,pc}\r\n<\/pre>\n<p>However, if your function needs to return a value in <var>r0<\/var> (and possibly <var>r1<\/var>), you can&#8217;t pop them in your optimized epilogue, because that would clobber your return value. You&#8217;ll have to use an old-fashioned <code>add sp, sp, #n<\/code> to discard those bytes from the stack.<\/p>\n<p>If the function is variadic, it will probably start with a<\/p>\n<pre>    push    {r0-r3}\r\n<\/pre>\n<p>This pushes the first 16 bytes of parameters onto the stack, so that they line up exactly adjacent to the stack-based parameters. 
That way, the code that walks the parameter list can just walk through memory uniformly.<\/p>\n<p>This extra push instruction in the prologue requires a change to the epilogue, because our usual trick of popping the return address into <var>pc<\/var> isn&#8217;t going to work.<\/p>\n<pre>    add     sp, sp, #0x20       ; free locals and outbound stack parameters\r\n    pop     {r4-r7,r11}         ; restore registers but leave return address\r\n    ldr     pc, [sp], #0x14     ; return and clean extra stack space\r\n<\/pre>\n<p>Things start out innocently enough, but this time, the <code>pop<\/code> instruction leaves the return address on the stack, and the <var>r0<\/var> through <var>r3<\/var> registers are still on the stack, too. At this point, we have this diagram:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse    TITLE=;\" border=\"0\" frame=\"\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"border: solid 1px black; border-top: none; text-align: center;\">\u00a0<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">return address<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">previous <var>r11<\/var><\/td>\n<td>\u2190 <var>r11<\/var> (frame chain)<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">\u22ee<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">stack param<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">saved <var>r3<\/var><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">saved <var>r2<\/var><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">saved <var>r1<\/var><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">saved 
<var>r0<\/var><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; text-align: center;\">return address<\/td>\n<td>\u2190 <var>sp<\/var><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The magic instruction that finishes the function is<\/p>\n<pre>    ldr     pc, [sp], #0x14     ; return and clean extra stack space\r\n<\/pre>\n<p>Let&#8217;s take this instruction apart.<\/p>\n<p>First, it loads <var>pc<\/var> from the memory that the stack pointer addresses. Loading a value into <var>pc<\/var> acts like a jump instruction, so the next instruction to execute when this one is complete will be the instruction at the return address.<\/p>\n<p>The <code>, #0x14<\/code> suffix means that this is using the post-increment addressing mode. After the register is loaded from memory, the base register (<var>sp<\/var>) is incremented by <code>0x14<\/code>. This moves the stack pointer past the saved return address as well as the 16 bytes occupied by the registers <var>r0<\/var> through <var>r3<\/var> we had pushed at function entry.<\/p>\n<p>The last trick I&#8217;ll talk about is tail call optimization. The epilogue of a function that ends in a tail call goes like this:<\/p>\n<pre>    add     sp, sp, #0x20       ; free locals and outbound stack parameters\r\n    pop     {r4-r7,r11,lr}      ; restore registers and set lr to return address\r\n    b       next_function\r\n<\/pre>\n<p>After cleaning up the local variables and outbound stack parameters, we pop off everything that we saved, but instead of putting the return address into <var>pc<\/var> like we usually do, we pop it back into <var>lr<\/var>. This preserves the requirement that on entry to a function, the <var>lr<\/var> register holds the return address. We can now jump directly to the entry point of the tail call target.<\/p>\n<p>Well, that was an exciting tour of function prologues and epilogues. 
Next time, we&#8217;ll look at common code sequences you should learn to recognize.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Taking shortcuts and combining steps, or omitting them entirely.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105351","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Taking shortcuts and combining steps, or omitting them entirely.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105351"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105351\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}