The ARM processor (Thumb-2), part 18: Other kinds of prologues and epilogues

Last time, we looked at the standard function prologue and epilogue. There are some variations to the standard that you may encounter from time to time.

Lightweight leaf functions are functions which meet all of the following criteria:

Modify only the non-preserved registers: r0 through r3 and r12, d0 through d7 and d16 through d31, and the flags.
Do not use any stack aside from inbound parameter space.

Lightweight leaf functions do not create a stack frame. They must keep the return address in the lr register for their entire lifetime so that the kernel can unwind the function to its caller. The requirement that they modify only non-preserved registers allows the kernel to unwind without using any unwind codes, since there are no registers that need to be restored during unwinding.

Conversely, any function that lacks unwind codes is assumed to be a lightweight leaf function.
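
As a rough illustration, a small C function like the one below is the sort of thing a compiler could plausibly turn into a lightweight leaf function: it calls nothing, and its handful of values fit in the non-preserved registers. The name clamp_to_byte is hypothetical, and whether a particular compiler really omits the stack frame depends on the compiler and its optimization settings.

```c
/* Hypothetical example: no calls and no address-taken locals, so the
   argument and result can plausibly live in r0 with no stack frame.
   The actual code generation is up to the compiler. */
unsigned clamp_to_byte(unsigned value)
{
    return value > 255u ? 255u : value;
}
```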

Another variation is the shrink-wrapped function. This is a function that starts out with a small stack frame (or no stack frame at all, pretending to be a lightweight leaf function), in the hope that it can early-out. If not, then it expands to a full stack frame.

If a function uses 16 or fewer bytes of local variables and outbound parameters, it can include up to four dummy registers in the initial push:

    push    {r0-r7,r11,lr}

The part you recognize is the saving of registers r4 through r7, plus the frame pointer and return address. The sneaky part is that it also saves registers r0 through r3. These extra registers are pushed, not so much because the function wants to save them, but because pushing four additional registers implicitly subtracts 4 × 4 = 16 bytes from the sp register, allocating the local variables and outbound parameters as part of the initial push.

In the epilogue, you can use the reverse trick to clean up those extra 16 bytes as part of the final pop:

    pop     {r0-r7,r11,pc}

However, if your function needs to return a value in r0 (and possibly r1), you can’t pop them in your optimized epilogue, because that would clobber your return value. You’ll have to use an old-fashioned add sp, sp, #n to discard those bytes from the stack.
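
The arithmetic behind the dummy push can be checked with a small sketch. The helper push_bytes and the two mask constants are hypothetical names made up for this illustration. The sketch only counts registers in a push list and multiplies by four, and it ignores anything else a real prologue has to worry about.

```c
#include <stdint.h>

/* Register masks: bit n set means rn is in the push list (lr is r14). */
enum {
    STANDARD_PUSH       = 0x48F0,   /* push {r4-r7,r11,lr} */
    SHRINK_WRAPPED_PUSH = 0x48FF    /* push {r0-r7,r11,lr} */
};

/* Bytes subtracted from sp by a push of the registers in reg_mask:
   four bytes for each register in the list. */
unsigned push_bytes(uint16_t reg_mask)
{
    unsigned count = 0;
    while (reg_mask != 0) {
        reg_mask &= (uint16_t)(reg_mask - 1);   /* clear the lowest set bit */
        count++;
    }
    return count * 4u;
}
```

Under these assumptions, push_bytes(SHRINK_WRAPPED_PUSH) minus push_bytes(STANDARD_PUSH) comes to 16, which is the space the four dummy registers buy.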

If the function is variadic, it will probably start with a

    push    {r0-r3}

This pushes the first 16 bytes of parameters onto the stack, so that they line up exactly adjacent to the stack-based parameters. That way, the code that walks the parameter list can just walk through memory uniformly.
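
On the C side, the code that walks the parameter list is usually hiding behind va_arg. Here is a minimal sketch of a variadic function. The name sum_ints is hypothetical, and how a given compiler implements va_arg is an assumption here. The point is only that the walking code does not have to care which arguments arrived in registers and which arrived on the stack.

```c
#include <stdarg.h>

/* Hypothetical variadic function: adds up `count` int arguments.
   With the prologue described above, `count` would arrive in r0, the
   next three arguments in r1-r3, and the rest on the stack. After
   push {r0-r3}, all of them sit next to each other in memory. */
int sum_ints(unsigned count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (unsigned i = 0; i < count; i++) {
        total += va_arg(args, int);   /* step to the next parameter */
    }
    va_end(args);
    return total;
}
```

For example, sum_ints(8u, 1, 2, 3, 4, 5, 6, 7, 8) passes nine arguments in total, more than fit in r0 through r3, and the loop treats all of the variable ones alike.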

This extra push instruction in the prologue requires a change to the epilogue, because our usual trick of popping the return address into pc isn’t going to work.

    add     sp, sp, #0x20       ; free locals and outbound stack parameters
    pop     {r4-r7,r11}         ; restore registers but leave return address
    ldr     pc, [sp], #0x14     ; return and clean extra stack space

Things start out innocently enough, but this time, the pop instruction leaves the return address on the stack, and the r0 through r3 registers are still on the stack, too. At this point, we have this diagram:


    return address
    previous r11      ← r11 (frame chain)
    ⋮
    stack param
    saved r3
    saved r2
    saved r1
    saved r0
    return address    ← sp

The magic instruction that finishes the function is

    ldr     pc, [sp], #0x14     ; return and clean extra stack space

Let’s take this instruction apart.

First, it loads pc from the memory that the stack pointer points to. Loading a value into pc acts like a jump instruction, so the next instruction to execute when this one is complete will be the instruction at the return address.

The , #0x14 suffix means that this is using the post-increment addressing mode. After the register is loaded from memory, the base register (sp) is incremented by 0x14. This moves the stack pointer past the saved return address as well as the 16 bytes occupied by the registers r0 through r3 we had pushed at function entry.
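
To double-check that the stack comes out balanced, here is a small bookkeeping sketch. It assumes the variadic prologue is push {r0-r3}, then push {r4-r7,r11,lr}, then a sub sp for the locals and outbound parameters. That prologue is not spelled out above, so treat it as an assumption. The function names are hypothetical.

```c
enum { BYTES_PER_REGISTER = 4 };

/* Bytes the assumed prologue takes from the stack. */
unsigned prologue_bytes(unsigned local_bytes)
{
    return 4u * BYTES_PER_REGISTER     /* push {r0-r3} */
         + 6u * BYTES_PER_REGISTER     /* push {r4-r7,r11,lr} */
         + local_bytes;                /* sub sp, sp, #local_bytes */
}

/* Bytes skipped by the post-increment on the final ldr:
   the return address plus the saved r0 through r3. */
unsigned post_increment_bytes(void)
{
    return 1u * BYTES_PER_REGISTER + 4u * BYTES_PER_REGISTER;
}

/* Bytes the epilogue gives back to the stack. */
unsigned epilogue_bytes(unsigned local_bytes)
{
    return local_bytes                 /* add sp, sp, #local_bytes */
         + 5u * BYTES_PER_REGISTER     /* pop {r4-r7,r11} */
         + post_increment_bytes();     /* ldr pc, [sp], #0x14 */
}
```

With local_bytes set to 0x20, both totals come to 0x48, and post_increment_bytes() is the 0x14 that appears in the instruction.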

The last trick I’ll talk about is tail call optimization. The epilogue for a function that ends with a tail call goes like this:

    add     sp, sp, #0x20       ; free locals and outbound stack parameters
    pop     {r4-r7,r11,lr}      ; restore registers and set lr to return address
    b       next_function

After cleaning up the local variables and outbound stack parameters, we pop off everything that we saved, but instead of putting the return address into pc like we usually do, we pop it back into lr. This preserves the requirement that on entry to a function, the lr register holds the return address. We can now jump directly to the entry point of the tail call target.
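
In C terms, a tail call is a call whose result is returned unchanged as the very last thing the function does. The sketch below uses hypothetical functions. The name next_function merely echoes the label in the listing above, and a compiler is permitted, but not required, to turn the final call into a branch. A function this small would probably not need the full stack frame shown above either.

```c
/* Hypothetical tail call target. */
int next_function(int value)
{
    return value * 2;
}

/* The call to next_function is the last thing this function does, and
   its result is returned as-is, so nothing remains to be done after the
   call returns. That makes it a candidate for "b next_function". */
int caller_with_tail_call(int value)
{
    int adjusted = value + 1;
    return next_function(adjusted);
}
```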

Well, that was an exciting tour of function prologues and epilogues. Next time, we’ll look at common code sequences you should learn to recognize.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

1 comment

紅樓鍮 June 23, 2021 · Edited

NB to anyone who is horrendously confused by the variadics epilogue as I was:

The caller will clean up the stack params, i. e. those past the first 16 bytes of the param struct. We additionally push the first 16 bytes onto the stack for convenience, but they’re not stack params, and we have to clean them up.

A typical prologue is either

    push    {r0-r3}
    push    {r4-r7,r11,lr}
    sub     sp, sp, #0x44       ; e. g. if we use up to 68 bytes for locals

or

    push    {r0-r3}
    push    {r0-r7,r11,lr}      ; we use 16 bytes for locals,
                                ; which the additional dummy push of r0-r3 will do

The first line of either one is what Raymond means by “it will probably start with a push {r0-r3}”. Do not confuse it with the dummy push, which is the second line of the latter example above.

Here’s an example. Suppose we use 68 bytes for locals, and our signature is

    int us(unsigned int number_of_arguments_following, ...);

and a caller calls us with, for example,

    // in caller code:
    us(8u, 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef);

then after our prologue, the stack looks like

                      sp+0x80 (caller frame above, do not touch...)
                   /- sp+0x7c 0xef ----------------------------------\
                  |   sp+0x78 0xcd                                    |
                  |   sp+0x74 0xab                                     > caller pushed, caller to clean up
                  |   sp+0x70 0x89                                    |
    param struct <    sp+0x6c 0x67 ----------------------------------/
                  |   sp+0x68 0x45 ----------\ ----------------------\
                  |   sp+0x64 0x23             > push {r0-r3}         |
                  |   sp+0x60 0x01            |                       |
                   \- sp+0x5c 8u ------------/                        |
                      sp+0x58 <saved lr> ----\                        |
                      sp+0x54 <caller's r11>  |                       |
                      sp+0x50 <caller's r7>    > push {r4-r7,r11,lr}   > we pushed, we clean up
                      sp+0x4c <caller's r6>   |                       |
                      sp+0x48 <caller's r5>   |                       |
                      sp+0x44 <caller's r4> -/                        |
                      sp+0x40 uninitialized -\                        |
                      ...                      > sub sp, sp, #0x44    |
                      sp+0    uninitialized -/ ----------------------/

Then our epilogue will be

    add     sp, sp, #0x44       ; cancel "sub sp, sp, #0x44",
                                ; sp now at <caller's r4>
    pop     {r4-r7,r11}         ; partially cancel "push {r4-r7,r11,lr}", leaving <saved lr> on stack,
                                ; sp now at <saved lr>
    ldr     pc, [sp], #0x14     ; load pc = [sp], i. e. pc = <saved lr>,
                                ; then both kill the left-over <saved lr> (0x4 bytes)
                                ; and cancel "push {r0-r3}" (0x10 bytes),
                                ; cleanup on our part is complete

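Do not confuse ldr pc, [sp], #0x14 (post-indexed: load from sp, then add 0x14 to sp) with ldr pc, [sp, #0x14] (offset: load from sp + 0x14 and leave sp alone) or ldr pc, [sp, #0x14]! (pre-indexed: add 0x14 to sp, then load from the new sp)!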
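
The three forms can be told apart with a toy model in C. Everything here is a stand-in: memory is a word array, the base register is a byte offset into it, and the type and function names are hypothetical. It is a sketch of the addressing modes only, not of a real processor.

```c
#include <stdint.h>

typedef struct {
    uint32_t loaded;   /* value that ends up in the destination register */
    uint32_t base;     /* base register after the instruction completes */
} LoadResult;

/* ldr rd, [rn], #imm : load from rn, then rn = rn + imm */
LoadResult ldr_post_indexed(const uint32_t *memory, uint32_t base, uint32_t imm)
{
    LoadResult result = { memory[base / 4], base + imm };
    return result;
}

/* ldr rd, [rn, #imm] : load from rn + imm, rn unchanged */
LoadResult ldr_offset(const uint32_t *memory, uint32_t base, uint32_t imm)
{
    LoadResult result = { memory[(base + imm) / 4], base };
    return result;
}

/* ldr rd, [rn, #imm]! : rn = rn + imm, then load from the new rn */
LoadResult ldr_pre_indexed(const uint32_t *memory, uint32_t base, uint32_t imm)
{
    LoadResult result = { memory[(base + imm) / 4], base + imm };
    return result;
}
```

Only the post-indexed form both loads the word at the current base (the saved lr in the epilogue) and moves the base past the 0x14 bytes. The other two would load whatever sits 0x14 bytes higher.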
