Matt Godbolt, probably best known for being the proprietor of Compiler Explorer, wrote a brief article on why x86 compilers love the xor eax, eax instruction.
The answer is that it is the most compact way to set a register to zero on x86. In particular, it is several bytes shorter than the more obvious mov eax, 0 since it avoids having to encode the four-byte constant. The x86 architecture does not have a dedicated zero register, so if you need to zero out a register, you’ll have to do it ab initio.
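The size difference is easy to see in the machine-code encodings. As a quick illustration (bytes shown are the classic 32-bit forms from the instruction set reference):

```nasm
31 C0             xor eax, eax    ; 2 bytes
29 C0             sub eax, eax    ; 2 bytes
B8 00 00 00 00    mov eax, 0      ; 5 bytes: opcode plus a 4-byte immediate
```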
But Matt doesn’t explain why everyone chooses xor over some other mathematical operation that is guaranteed to produce zero. In particular, what’s wrong with sub eax, eax? It encodes to the same number of bytes and executes in the same number of cycles. And its behavior with respect to flags is even better:
| Flag | xor eax, eax | sub eax, eax |
|---|---|---|
| OF | clear | clear |
| SF | clear | clear |
| ZF | set | set |
| AF | undefined | clear |
| PF | set | set |
| CF | clear | clear |
Observe that xor eax, eax leaves the AF flag undefined, whereas sub eax, eax clears it.
I don’t know why xor won the battle, but I suspect it was just a case of swarming.
In my hypothetical history, xor and sub started out with roughly similar popularity, but xor took a slight lead due to some fluke, perhaps because it felt more “clever”.
When early compilers used xor to zero out a register, this started the snowball, because people would see the compiler generate xor and think, “Well, those compiler writers are smart; they must know something I don’t. Since I was on the fence between xor and sub, this tiny data point is enough to tip it toward xor.”
The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”. The front-end detection also breaks dependency chains: Normally, the output of an xor or sub is dependent on its inputs, but in this special case of xor-ing or sub-ing a register with itself, we know that the output is zero, independent of input.
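To illustrate the dependency-breaking point, here is a sketch (not from the article) of the kind of sequence that benefits:

```nasm
imul eax, ebx      ; long-latency producer of eax
xor  eax, eax      ; recognized zeroing idiom: renamed to zero at the
                   ; front end, no execution unit needed, and the false
                   ; dependency on the old eax value is broken
add  eax, ecx      ; can issue right away; without idiom recognition it
                   ; would have to wait for the imul to complete
```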
Even though Intel added support for both xor-detection and sub-detection, Stack Overflow worries that other CPU manufacturers may have special-cased xor but not sub, so that makes xor the winner in this ultimately meaningless battle.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
Bonus chatter: One of my former colleagues was partial to using sub r, r to zero a register, and when I was reading assembly code, I could tell that he was the author due to the use of sub to zero a register rather than the more popular xor.
Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination.
Using XOR should cause less power consumption.
In order to do sub, the logic needs to complement the number and then run an adder, which, in order to reach high performance, cannot use a simple carry look-ahead.
XOR requires literally just an XOR gate per bit, so the total energy per op is orders of magnitude lower.
Keep in mind, however, that this was probably important many eons ago. With modern technologies we're probably talking less than a picojoule saved.
The energy needed for the instruction fetch and decode alone, especially in a CISC processor, is much higher, therefore the real...
The AF (auxiliary carry flag) is only used by the BCD instructions DAA, DAS, AAA, and AAS.
Few (if any) compilers ever emitted any of these, and they’re quite unusual even in hand-written assembly code. In long mode (64-bit code) they’re no longer available.
Bottom line: it’s exceedingly rare that leaving AF in a defined state matters (at all).
AF is part of the state saved by PUSHF. If for some reason you’re trying to verify that something runs *identically* on different machines, memory could plausibly differ because of different AF results. (That kind of verification seems more likely to be done for a kernel than for most user-space programs.) Still exceedingly rare, but it doesn’t depend on the use of BCD instructions.
But yeah, this is fortunately pretty much a non-problem even if you care about rare CPUs that set AF with XOR.
There are reserved bits whose values aren’t guaranteed either, so if you do a pushf, there are already bits you should ignore. At least in 64-bit code, you almost certainly want to ignore AF in any case.
Another reason non-x86 ISAs (like Itanium, but also AArch64) can't have dep-breaking zeroing idioms for XOR is memory dependency ordering. (memory_order_consume). Only x86 treats every load as acquire. Others need architectural rules about carrying dependencies to guarantee that you can do things like ptr[tmp-tmp] (in asm load / sub / load), and still have that load ordered after an earlier tmp=data_ready.load( consume). (Typically with a branch in between on the data-ready flag, but then also using it for the pointed-to data.)
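A sketch of the dependency chain that comment describes, in AArch64-style pseudo-assembly (register names are illustrative):

```asm
ldr  x0, [x4]        ; tmp = data_ready.load(memory_order_consume)
sub  x1, x0, x0      ; x1 = tmp - tmp = 0, but the ISA must still treat
                     ; x1 as *carrying* a dependency on x0
ldr  x2, [x3, x1]    ; ptr[tmp - tmp]: ordered after the first load only
                     ; because of that carried dependency, so sub here
                     ; cannot be a dependency-breaking idiom
```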
Also, there's less reason to spend transistors on checking for zeroing idioms on RISCs or VLIWs...
Wasn’t xor better on Z80, and people just kept doing it on 8080 et seq mostly out of habit? I tried to confirm, but the world is much younger than me these days so I couldn’t find a quick answer.
I bet it’s a case of folklore aka cargo cult programming. Fifty years ago (!), XOR was in fact faster than SUB on many machines.
On the PDP-10/KA, introduced in 1968, XOR was 25+% faster than SUB, which had to wait for the carry to propagate. The KA-10 was the main machine of the MIT AI Lab (a major center of hacker culture) until the late 1970s, and was probably the last major architecture that didn’t have fast carry.
That said, XOR was not the usual way of zeroing a register on the 10.
I believe it has to do with the condition flags: xor is simpler in that it always clears carry and overflow. Also, it can be represented as a two-byte opcode.
SUB same,same also always clears carry and overflow. (Because of the resulting value, so you have to think it through instead of looking at the manual and seeing it say the instruction always does that for every input, like with XOR.) So possibly. But like Raymond explained, the actual FLAGS results are if anything better, having a well-defined zero for AF with SUB vs. “undefined” for XOR.
On modern CPUs, of course, zeroing idioms are all the same, with AF=0, which works for SUB and is allowed for XOR. But this zeroing idiom became a thing long before it...
I assume Intel officially recommended XOR after it was already the most common idiom.
They even went so far as to have Silvermont only recognize XOR as a zeroing idiom, not SUB, and only at 32-bit operand size, not 64-bit, as I mentioned in my answer you linked.
I’m now curious if current E cores are still like that!
I wish Raymond had linked that one instead of the how many ways answer where I wrote “maybe” :/
Will edit it when I’m not on my phone.
The Motorola 68000 did have a CLR instruction, but CLR.L (32-bit) on a data register was slower than CLR.W or CLR.B due to the 16-bit ALU (6 cycles vs 4). However, there was also MOVEQ, which loaded a sign-extended 8-bit immediate into a data register. It was a single-word instruction (the immediate was stored in the low byte) and also took only 4 cycles. Therefore, MOVEQ #0, Dn was preferable to CLR.L Dn.
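In 68000 assembly, that trade-off looks like this (cycle counts as given above):

```m68k
clr.l   d0        ; clears all 32 bits, but takes 6 cycles on the 68000
moveq   #0,d0     ; sign-extends the 8-bit immediate 0 to 32 bits;
                  ; single-word instruction, 4 cycles
```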
I’m surprised that Intel hasn’t used the same “preferred alias” trick as AArch64, to specify a CLR r instruction that is the same bitpattern as XOR r,r but disassembles as CLR r instead of XOR r,r by default.
In AArch64, the most common instruction to see this with (there’s a few that have preferred aliases) is ORR; you write MOV Wd, #imm or MOV Xd, #imm, and it’s encoded as ORR Wd, WZR, #imm or ORR Xd, XZR, #imm – doing the “move immediate” operation as “logical or the zero register with the immediate” operation.
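The register-to-register form of MOV works the same way, and is perhaps the simplest illustration (a sketch; register names are arbitrary):

```asm
mov  x0, x1            ; what you write
orr  x0, xzr, x1       ; what is actually encoded: OR the zero register
                       ; with the source, which the disassembler prints
                       ; back as the preferred alias "mov"
```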
That would be a pseudo-instruction rather than an architectural instruction. Pseudo-instructions are provided by assemblers (and reverse-compiled by disassemblers), not CPUs, so it wouldn’t really be within Intel’s control. They could strongly recommend that assemblers and disassemblers support the pseudo-instruction, but it would take time for toolchains to get on board with it. The only pseudo-instruction on x86 that I know about is “nop” (xchg ax, ax), and it became a proper instruction on x86-64. [Because xchg eax, eax on x86-64 would normally zero out the upper 32 bits of rax.]