Suppose you have a function that wants to pass a 32-bit value to a function that takes a 64-bit value. You don’t care what goes into the upper 32 bits because that value is a passthrough value that gets passed to your callback function, and the callback function will truncate it to a 32-bit value. And for whatever reason, you are concerned about the performance impact of that single instruction that the compiler normally generates to extend the 32-bit value to a 64-bit value.
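Concretely, the scenario might look something like this (the API and names here are invented purely for illustration):

#include <stdint.h>

// Hypothetical API: the 64-bit context value is passed through unchanged
// to the callback; the callback only ever looks at the low 32 bits.
void enumerate_widgets(void (*callback)(int64_t context), int64_t context);

void my_callback(int64_t context)
{
    int32_t value = (int32_t)context;  // truncate back to 32 bits
    // ... use value ...
}

void start_enumeration(int32_t value)
{
    // The implicit int32_t-to-int64_t conversion here is the single
    // extension instruction we are (perhaps unwisely) trying to avoid.
    enumerate_widgets(my_callback, value);
}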
My first take is “Don’t worry yet.” I suspect that that one instruction is not going to be a performance bottleneck in your program.
But still, I took up the challenge, just for fun.
What I came up with was using gcc/clang inline assembly that says “I can produce a 64-bit value from a 32-bit value by executing no instructions.”
int64_t int32_to_64_garbage(int32_t i32)
{
    int64_t i64;
    __asm__("" :        // do nothing
        "=r"(i64) :     // produces result in register
        "0"(i32));      // from this input
    return i64;
}
The first argument to the __asm__ inline directive is the code to generate. We pass an empty string, so there is in fact no code generated at all! All the effects we want are in the declarations of inputs and outputs.
Next come the outputs, of which we have only one. The "=r"(i64) means that our inline assembly will put the overwritten (=) value of i64 in a register (r) of the compiler’s choosing, which the inline assembler will refer to as %0. (The outputs are numbered starting at zero.)
Finally, we have the inputs, of which we have only one. The "0"(i32) means that the input should be put in the same place as output number zero.
All of the work was done by our constraints on the inputs and outputs. There’s no actual code. We tell the compiler, “Put i32 in a register, and then cover your eyes, and when you open them, i64 will be in that same register!”
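If your passthrough value happens to be unsigned, the same constraints work unchanged; here is a sketch of that variant (my own addition, same idea as above):

#include <stdint.h>

uint64_t uint32_to_64_garbage(uint32_t u32)
{
    uint64_t u64;
    __asm__("" :        // still no instructions
        "=r"(u64) :     // output %0: u64 in a register of the compiler's choosing
        "0"(u32));      // input: u32 goes in that same register
    return u64;
}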
Running gcc at optimization level 3 shows that the conversion was completely elided.
void somewhere(int64_t);

void sample1(int32_t v)
{
    somewhere(v);
}

void sample2(int32_t v)
{
    somewhere(int32_to_64_garbage(v));
}
The result is
// x86-64
sample1(int):
        movsx   rdi, edi
        jmp     somewhere(long)
sample2(int):
        jmp     somewhere(long)

// arm32
sample1(int):
        asrs    r1, r0, #31
        b       somewhere(long long)
sample2(int):
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        b       somewhere(long)
The first version contains an explicit sign extension instruction before making the tail call. The second version is a direct tail call, using whatever garbage is in the upper 32 bits of the rdi register.
Another compiler that supports gcc extended inline assembly syntax is icc, and this trick seems to work there too.
// x86-64
sample1(int):
        movsxd  rdi, edi
        jmp     somewhere(long)
sample2(int):
        jmp     somewhere(long)
The clang compiler also supports gcc extended inline assembly syntax. However, it not only generates a conversion but also loses the tail call.
// x86-64
sample1(int):
        movsxd  edi, edi
        jmp     somewhere(long)@PLT
sample2(int):
        push    rax
        mov     edi, edi
        call    somewhere(long)@PLT
        pop     rax
        ret

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)
sample2(int):
        push    {r11, lr}
        sub     sp, sp, #8
        mov     r1, #0
        bl      somewhere(long long)
        add     sp, sp, #8
        pop     {r11, pc}

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        sub     sp, sp, #32
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        mov     w0, w0
        bl      somewhere(long)
        ldp     x29, x30, [sp, #16]
        add     sp, sp, #32
        ret
Update: It seems that the current version of clang (as of this writing) restores the tail call, though it still does a 32-to-64 unsigned conversion, so the cost is basically the same.
// x86-64
sample1(int):
        movsxd  edi, edi
        jmp     somewhere(long)@PLT
sample2(int):
        mov     edi, edi
        jmp     somewhere(long)@PLT

// arm32
sample1(int):
        asr     r1, r0, #31
        b       somewhere(long long)
sample2(int):
        mov     r1, #0
        b       somewhere(long long)

// arm64
sample1(int):
        sxtw    x0, w0
        b       somewhere(long)
sample2(int):
        mov     w0, w0
        b       somewhere(long)
The Microsoft Visual C++ compiler does not support gcc extended inline assembly syntax, so we can’t check that one.
Since it doesn’t work at all with msvc and it doesn’t provide any benefit on clang, I would enable this optimization only when compiling with gcc or icc and live with the extra instruction everywhere else.
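If you did want to gate it, one possible arrangement looks something like this (a sketch; int32_to_64_passthrough is a name I made up, and classic icc is detected via __INTEL_COMPILER):

#include <stdint.h>

#if defined(__INTEL_COMPILER) || (defined(__GNUC__) && !defined(__clang__))
// gcc and classic icc: the empty-asm trick elides the conversion entirely.
static inline int64_t int32_to_64_passthrough(int32_t i32)
{
    int64_t i64;
    __asm__("" : "=r"(i64) : "0"(i32));
    return i64;
}
#else
// Everyone else (clang, msvc, ...): live with the one extension instruction.
static inline int64_t int32_to_64_passthrough(int32_t i32)
{
    return i32;
}
#endif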
(But really, I wouldn’t use this anywhere unless I had to. This is just code golfing.)
I looked at the LLVM IR documentation, and after thinking about it for a bit, I'm now rather puzzled by this discrepancy.
LLVM IR represents everything in SSA form (i.e. every variable in a block is assigned at most once). That means an asm block is not an in-place mutation (there's no such thing) but instead a function-like operation that takes (in this case) a 32-bit argument and returns a 64-bit result. But during lowering, Clang presumably notices that %0 has to be a 64-bit register (or else we can't produce a 64-bit value as our result). And here's the problem:...
You can’t use “m” even on little-endian systems because a 64-bit load may reach beyond the end of a page, resulting in a possible access violation.
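To make that hazard concrete, here is a sketch (hypothetical code, not from the comment, and expressed without inline assembly) of the over-read a memory-operand version would amount to:

#include <stdint.h>

int64_t int32_to_64_overread(const int32_t *p)
{
    // On a little-endian system the low 32 bits of this load are *p, but the
    // load itself touches 8 bytes. If *p occupies the last 4 bytes of a mapped
    // page, the extra 4 bytes can fault. (It is also undefined behavior.)
    return *(const int64_t *)(const void *)p;
}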
Another solution:
Run an architecture check for a 64-bit instruction set (at compile time).
Run a second check for architectures that pass 32-bit-aligned 32-bit integers on the stack (approximately none; and if it’s the last argument or between two 64-bit arguments, you don’t need to).
If so, cast the function to a function pointer of the appropriate type with a 32-bit parameter in that slot (sketched below).
Your code works.
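For concreteness, the cast being described would look something like this (a hypothetical sketch; calling a function through an incompatibly-typed function pointer is undefined behavior in standard C):

#include <stdint.h>

void somewhere(int64_t);

void sample3(int32_t v)
{
    // Pretend somewhere() takes a 32-bit value in that slot, relying on the
    // calling convention to leave the upper half of the register as garbage.
    void (*hack)(int32_t) = (void (*)(int32_t))somewhere;
    hack(v);
}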
This is a bad idea in portable code for numerous reasons.
I just tried casting the function pointer. Not a good idea on clang because even though it suppresses the extension, it also suppresses inlining. gcc can inline through the function cast, but it sign extends the 32-bit value to a 64-bit value.
I totally misunderstood and thought it was about functions like EnumWindows that aren’t going to be inlined in any case.