Suppose you have a function that wants to pass a 32-bit value to a function that takes a 64-bit value. You don’t care what goes into the upper 32 bits because that value is a passthrough value that gets passed to your callback function, and the callback function will truncate it to a 32-bit value. And for whatever reason, you are concerned about the performance impact of that single instruction that the compiler normally generates to extend the 32-bit value to a 64-bit value.
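Concretely, the scenario might look something like this (the API and names here are invented purely for illustration):

#include <stdint.h>

// Hypothetical API: the 64-bit context value is passed through unchanged
// to the callback; the callback only ever looks at the low 32 bits.
void enumerate_widgets(void (*callback)(int64_t context), int64_t context);

void my_callback(int64_t context)
{
    int32_t value = (int32_t)context;  // truncate back to 32 bits
    // ... use value ...
}

void start_enumeration(int32_t value)
{
    // The implicit int32_t-to-int64_t conversion here is the single
    // extension instruction we are (perhaps unwisely) trying to avoid.
    enumerate_widgets(my_callback, value);
}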
My first take is “Don’t worry yet.” I suspect that that one instruction is not going to be a performance bottleneck in your program.
But still, I took up the challenge, just for fun.
What I came up with was using gcc/clang inline assembly that says “I can produce a 64-bit value from a 32-bit value by executing no instructions.”
int64_t int32_to_64_garbage(int32_t i32)
{
    int64_t i64;
    __asm__("" :        // do nothing
        "=r"(i64) :     // produces result in register
        "0"(i32));      // from this input
    return i64;
}
The first argument to the __asm__ inline directive is the code to generate. We pass an empty string, so there is in fact no code generated at all! All the effects we want are in the declarations of inputs and outputs.
Next come the outputs, of which we have only one. The "=r"(i64) means that our inline assembly will put the overwritten (=) value of i64 in a register (r) of the compiler’s choosing, which the inline assembler will refer to as %0. (The outputs are numbered starting at zero.)
Finally, we have the inputs, of which we have only one. The "0"(i32) means that the input should be put in the same place as output number zero.
All of the work was done by our constraints on the inputs and outputs. There’s no actual code. We tell the compiler, “Put i32 in a register, and then cover your eyes, and when you open them, i64 will be in that same register!”
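If your passthrough value happens to be unsigned, the same constraints work unchanged; here is a sketch of that variant (my own addition, same idea as above):

#include <stdint.h>

uint64_t uint32_to_64_garbage(uint32_t u32)
{
    uint64_t u64;
    __asm__("" :        // still no instructions
        "=r"(u64) :     // output %0: u64 in a register of the compiler's choosing
        "0"(u32));      // input: u32 goes in that same register
    return u64;
}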
Running gcc at optimization level 3 shows that the conversion was completely elided.
void somewhere(int64_t);

void sample1(int32_t v)
{
    somewhere(v);
}

void sample2(int32_t v)
{
    somewhere(int32_to_64_garbage(v));
}
The result is
// x86-64
sample1(int):
        movsx   rdi, edi
        jmp     somewhere(long)
sample2(int):
        jmp     somewhere(long)

// arm32
sample1(int):
        asrs    r1, r0, #31
        b       somewhere(long long)
sample2(int):
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        b       somewhere(long)
The first version contains an explicit sign extension instruction before making the tail call. The second version is a direct tail call, using whatever garbage is in the upper 32 bits of the rdi register.
Another compiler that supports gcc extended inline assembly syntax is icc, and this trick seems to work there too.
// x86-64
sample1(int):
        movsxd  rdi, edi
        jmp     somewhere(long)
sample2(int):
        jmp     somewhere(long)
The clang compiler also supports gcc extended inline assembly syntax. However, it not only generates a conversion but also loses the tail call.
// x86-64
sample1(int):
        movsxd  edi, edi
        jmp     somewhere(long)@PLT
sample2(int):
        push    rax
        mov     edi, edi
        call    somewhere(long)@PLT
        pop     rax
        ret

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)
sample2(int):
        push    {r11, lr}
        sub     sp, sp, #8
        mov     r1, #0
        bl      somewhere(long long)
        add     sp, sp, #8
        pop     {r11, pc}

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        sub     sp, sp, #32
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        mov     w0, w0
        bl      somewhere(long)
        ldp     x29, x30, [sp, #16]
        add     sp, sp, #32
        ret
Update: It seems that the current version of clang (as of this writing) restores the tail call, though it still does a 32-to-64 unsigned conversion, so the cost is basically the same.
// x86-64
sample1(int):
        movsxd  edi, edi
        jmp     somewhere(long)@PLT
sample2(int):
        mov     edi, edi
        jmp     somewhere(long)@PLT

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)
sample2(int):
        mov     r1, #0
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        mov     w0, w0
        b       somewhere(long)
The Microsoft Visual C++ compiler does not support gcc extended inline assembly syntax, so we can’t check that one.
Since it doesn’t work at all with msvc and it doesn’t provide any benefit on clang, I would enable this optimization only when compiling with gcc or icc and live with the extra instruction everywhere else.
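If you did want to gate it, one possible arrangement looks something like this (a sketch; int32_to_64_passthrough is a name I made up, and classic icc is detected via __INTEL_COMPILER):

#include <stdint.h>

#if defined(__INTEL_COMPILER) || (defined(__GNUC__) && !defined(__clang__))
// gcc and classic icc: the empty-asm trick elides the conversion entirely.
static inline int64_t int32_to_64_passthrough(int32_t i32)
{
    int64_t i64;
    __asm__("" : "=r"(i64) : "0"(i32));
    return i64;
}
#else
// Everyone else (clang, msvc, ...): live with the one extension instruction.
static inline int64_t int32_to_64_passthrough(int32_t i32)
{
    return i32;
}
#endif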
(But really, I wouldn’t use this anywhere unless I had to. This is just code golfing.)
I looked at the LLVM IR documentation, and after thinking about it for a bit, I'm now rather puzzled by this discrepancy.
LLVM IR represents everything in SSA form (i.e. every variable in a block is assigned at most once). That means an asm block is not an in-place mutation (there's no such thing) but instead a function-like operation that takes (in this case) a 32-bit argument and returns a 64-bit result. But during lowering, Clang presumably notices that %0 has to be a 64-bit register (or else we can't produce a 64-bit value as our result). And here's the problem:...
You can’t use “m” even on little-endian systems because a 64-bit load may reach beyond the end of a page, resulting in a possible access violation.
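To make that hazard concrete, here is a sketch (hypothetical code, not from the comment, and expressed without inline assembly) of the over-read a memory-operand version would amount to:

#include <stdint.h>

int64_t int32_to_64_overread(const int32_t *p)
{
    // On a little-endian system the low 32 bits of this load are *p, but the
    // load itself touches 8 bytes. If *p occupies the last 4 bytes of a mapped
    // page, the extra 4 bytes can fault. (It is also undefined behavior.)
    return *(const int64_t *)(const void *)p;
}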
Another solution:
Run an architecture check for a 64-bit instruction set (at compile time).
Run a second check for architectures that pass 32-bit-aligned 32-bit integers on the stack (approximately none; and if it’s the last argument or between two 64-bit arguments, you don’t need to).
If so, cast the function to a function pointer of the appropriate type with a 32-bit parameter in that slot (sketched below).
Your code works.
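For concreteness, the cast being described would look something like this (a hypothetical sketch; calling a function through an incompatibly-typed function pointer is undefined behavior in standard C):

#include <stdint.h>

void somewhere(int64_t);

void sample3(int32_t v)
{
    // Pretend somewhere() takes a 32-bit value in that slot, relying on the
    // calling convention to leave the upper half of the register as garbage.
    void (*hack)(int32_t) = (void (*)(int32_t))somewhere;
    hack(v);
}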
This is a bad idea in portable code for numerous reasons.
I just tried casting the function pointer. Not a good idea on clang because even though it suppresses the extension, it also suppresses inlining. gcc can inline through the function cast, but it sign extends the 32-bit value to a 64-bit value.
I totally misunderstood and thought it was about functions like EnumWindows that aren’t going to be inlined in any case.