May 21st, 2025

Silly parlor tricks: Promoting a 32-bit value to a 64-bit value when you don’t care about garbage in the upper bits

Suppose you have a function that wants to pass a 32-bit value to a function that takes a 64-bit value. You don’t care what goes into the upper 32 bits because that value is a passthrough value that gets passed to your callback function, and the callback function will truncate it to a 32-bit value. And for whatever reason, you are concerned about the performance impact of that single instruction that the compiler normally generates to extend the 32-bit value to a 64-bit value.
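As a concrete (made-up) illustration of that scenario, here is a sketch of a hypothetical enumeration function that forwards a 64-bit passthrough context to a callback, which truncates it back to 32 bits. The names `enumerate_things` and `my_callback` are inventions for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration: the 64-bit "context" parameter is a pure
 * passthrough value, and the callback only ever looks at its low
 * 32 bits, so garbage in the upper 32 bits would be harmless. */
typedef void (*callback_t)(int64_t context);

static int32_t g_seen; /* records what the callback observed */

static void my_callback(int64_t context)
{
    g_seen = (int32_t)context; /* truncate back to the 32-bit value */
}

static void enumerate_things(callback_t cb, int64_t context)
{
    cb(context); /* forward the passthrough value unchanged */
}
```

Because only the low 32 bits survive the truncation, the caller gains nothing from a correct extension of the value it passes in.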

My first take is “Don’t worry yet.” I suspect that single instruction is not going to be a performance bottleneck in your program.

But still, I took up the challenge, just for fun.

What I came up with uses gcc/clang inline assembly to say “I can produce a 64-bit value from a 32-bit value by executing no instructions.”

int64_t int32_to_64_garbage(int32_t i32)
{
    int64_t i64;
    __asm__("" :        // do nothing
            "=r"(i64) : // produces result in register
            "0"(i32));  // from this input
    return i64;
}

The first argument to the __asm__ inline directive is the code to generate. We pass an empty string, so no code is generated at all! All the effects we want come from the declarations of the inputs and outputs.

Next come the outputs, of which we have only one. The "=r"(i64) means that our inline assembly will put the overwritten (=) value of i64 in a register r of the compiler’s choosing, which the inline assembly will refer to as %0. (Operands are numbered starting at zero, outputs first.)

Finally, we have the inputs, of which we have only one. The "0"(i32) means that the input should be put in the same place as output number zero.

All of the work was done by our constraints on the inputs and outputs. There’s no actual code. We tell the compiler “Put i32 in a register, and then cover your eyes, and when you open them, i64 will be in that same register!”
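As a quick sanity check (compilable only with gcc or clang, since it relies on their extended asm), we can verify that the low 32 bits round-trip. The upper 32 bits of the result are unspecified, so that is all we may test.

```c
#include <assert.h>
#include <stdint.h>

/* The trick restated: produce a 64-bit value from a 32-bit value by
 * executing no instructions (gcc/clang extended asm only). Only the
 * low 32 bits of the result are meaningful. */
static int64_t int32_to_64_garbage(int32_t i32)
{
    int64_t i64;
    __asm__(""            /* no instructions emitted */
            : "=r"(i64)   /* output %0: any register */
            : "0"(i32));  /* input: same register as output %0 */
    return i64;
}
```

Truncating the result back to 32 bits must recover the original value; anything we observe in the upper bits is whatever happened to be in the register.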

Running gcc at optimization level 3 shows that the conversion is completely elided.

void somewhere(int64_t);

void sample1(int32_t v)
{
    somewhere(v);
}

void sample2(int32_t v)
{
    somewhere(int32_to_64_garbage(v));
}

The result is

// x86-64
sample1(int):
        movsx   rdi, edi
        jmp     somewhere(long)
sample2(int):
        jmp     somewhere(long)

// arm32
sample1(int):
        asrs    r1, r0, #31
        b       somewhere(long long)
sample2(int):
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        b       somewhere(long)

The first version contains an explicit sign extension instruction before making the tail call. The second version is a direct tail call, using whatever garbage is in the upper 32 bits of the rdi register.

Another compiler that supports gcc extended inline assembly syntax is icc, and this trick seems to work there too.

// x86-64
sample1(int):
        movsxd    rdi, edi
        jmp       somewhere(long)
sample2(int):
        jmp       somewhere(long)    

The clang compiler also supports gcc extended inline assembly syntax. However, it not only generates a conversion but also loses the tail call.

// x86-64
sample1(int):
        movsxd  rdi, edi
        jmp     somewhere(long)@PLT

sample2(int):
        push    rax
        mov     edi, edi
        call    somewhere(long)@PLT
        pop     rax
        ret

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)

sample2(int):
        push    {r11, lr}
        sub     sp, sp, #8
        mov     r1, #0
        bl      somewhere(long long)
        add     sp, sp, #8
        pop     {r11, pc}

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)

sample2(int):
        sub     sp, sp, #32
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        mov     w0, w0
        bl      somewhere(long)
        ldp     x29, x30, [sp, #16]
        add     sp, sp, #32
        ret

Update: It seems that the current version of clang (as of this writing) restores the tail call, though it still performs a 32-to-64 zero extension, so the cost is basically the same.

// x86-64
sample1(int):
        movsxd  rdi, edi
        jmp     somewhere(long)@PLT

sample2(int):
        mov     edi, edi
        jmp     somewhere(long)@PLT

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)

sample2(int):
        mov     r1, #0
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)

sample2(int):
        mov     w0, w0
        b       somewhere(long)

The Microsoft Visual C++ compiler does not support gcc extended inline assembly syntax, so we can’t check that one.

Since it doesn’t work at all with MSVC and it doesn’t provide any benefit on clang, I would enable this optimization only when compiling with gcc or icc and live with the extra instruction everywhere else.
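One way to apply that policy is to gate the trick on the compilers where it helps and fall back to a plain conversion elsewhere. This is only a sketch, and the helper name `promote_garbage` is my own choosing:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: use the no-op asm trick on gcc and icc, and an ordinary
 * conversion (one sign-extension instruction) everywhere else. */
#if (defined(__GNUC__) && !defined(__clang__)) || defined(__INTEL_COMPILER)
static inline int64_t promote_garbage(int32_t i32)
{
    int64_t i64;
    __asm__("" : "=r"(i64) : "0"(i32)); /* no code; constraints only */
    return i64;
}
#else
static inline int64_t promote_garbage(int32_t i32)
{
    return i32; /* plain conversion; the compiler emits the extension */
}
#endif
```

Either branch produces a value whose low 32 bits are the original, which is the only property the caller is entitled to rely on.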

(But really, I wouldn’t use this anywhere unless I had to. This is just code golfing.)

 

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

5 comments


  • Kevin Norris

    I looked at the LLVM IR documentation, and after thinking about it for a bit, I'm now rather puzzled by this discrepancy.

    LLVM IR represents everything in SSA form (i.e. every variable in a block is assigned at most once). That means an asm block is not an in-place mutation (there's no such thing) but instead a function-like operation that takes (in this case) a 32-bit argument and returns a 64-bit result. But during lowering, Clang presumably notices that %0 has to be a 64-bit register (or else we can't produce a 64-bit value as our result). And here's the problem:...

    • Raymond Chen (Microsoft employee, Author)

      You can’t use “m” even on little-endian systems because a 64-bit load may reach beyond the end of a page, resulting in a possible access violation.

  • Joshua Hudson

    Another solution:

    Run architecture detect for 64 bit instruction set (at compile time).

    Run second check for architectures that pass 32 bit aligned 32 bit integers on the stack (approximately none; and if it’s the last argument or between two 64 bit arguments you don’t need to)

    If so, cast the function to a function pointer of the appropriate type with a 32 bit parameter in that slot.

    Your code works.

    This is a bad idea in portable code for numerous reasons.

    • Raymond Chen (Microsoft employee, Author)

      I just tried casting the function pointer. Not a good idea on clang because even though it suppresses the extension, it also suppresses inlining. gcc can inline through the function cast, but it sign extends the 32-bit value to a 64-bit value.

      • Joshua Hudson

        I totally misunderstood and thought it was about functions like EnumWindows that aren’t going to be inlined in any case.