Matt Godbolt, probably best known for being the proprietor of Compiler Explorer, wrote a brief article on why x86 compilers love the xor eax, eax instruction.
The answer is that it is the most compact way to set a register to zero on x86. In particular, it is several bytes shorter than the more obvious mov eax, 0 since it avoids having to encode the four-byte constant. The x86 architecture does not have a dedicated zero register, so if you need to zero out a register, you’ll have to do it ab initio.
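To make the size difference concrete, here is a small sketch that lists one common encoding of each instruction and counts the bytes. The table name `ENCODINGS` and the helper `encoded_length` are made up for illustration. The byte values are the standard 32-bit encodings, though an assembler is free to pick an equivalent alternative form (for example `33 C0` instead of `31 C0` for the xor).

```python
# One common machine-code encoding for three ways to zero EAX.
# ENCODINGS and encoded_length are illustrative names, not part of any tool.
ENCODINGS = {
    # B8 is "mov eax, imm32": one opcode byte plus a four-byte constant.
    "mov eax, 0": bytes([0xB8, 0x00, 0x00, 0x00, 0x00]),
    # 31 /r is "xor r/m32, r32"; ModRM byte C0 selects eax, eax.
    "xor eax, eax": bytes([0x31, 0xC0]),
    # 29 /r is "sub r/m32, r32"; ModRM byte C0 selects eax, eax.
    "sub eax, eax": bytes([0x29, 0xC0]),
}


def encoded_length(instruction: str) -> int:
    """Return the number of bytes in the listed encoding."""
    return len(ENCODINGS[instruction])


if __name__ == "__main__":
    for text, code in ENCODINGS.items():
        print(f"{text:<14} {code.hex(' ')}  ({len(code)} bytes)")
```

The mov needs five bytes because the zero is spelled out as a full 32-bit immediate, whereas the xor and sub forms need only an opcode and a register-selection byte.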
But Matt doesn’t explain why everyone chooses xor as opposed to some other mathematical operation that is guaranteed to result in a zero. In particular, what’s wrong with sub eax, eax? It encodes to the same number of bytes and executes in the same number of cycles. And its behavior with respect to flags is even better:
| Flag | xor eax, eax | sub eax, eax |
|---|---|---|
| OF | clear | clear |
| SF | clear | clear |
| ZF | set | set |
| AF | undefined | clear |
| PF | set | set |
| CF | clear | clear |
Observe that xor eax, eax leaves the AF flag undefined, whereas sub eax, eax clears it.
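The sub column can be checked with a little arithmetic. The following is a rough software model of how a 32-bit sub sets the six status flags, not a description of any real hardware. The function name `flags_after_sub` is invented for this sketch. There is no matching model for xor because its AF result is architecturally undefined.

```python
MASK32 = 0xFFFFFFFF


def flags_after_sub(a: int, b: int) -> dict:
    """Model the status flags after a 32-bit `sub a, b` (hypothetical helper)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        # Signed overflow: operands differ in sign and the result's sign differs from a's.
        "OF": bool((((a ^ b) & (a ^ result)) >> 31) & 1),
        # Sign flag is a copy of the top bit of the result.
        "SF": bool(result >> 31),
        "ZF": result == 0,
        # Auxiliary carry: a borrow was needed out of the low four bits.
        "AF": (a & 0xF) < (b & 0xF),
        # Parity flag is set when the low byte has an even number of 1 bits.
        "PF": bin(result & 0xFF).count("1") % 2 == 0,
        # Carry flag: an unsigned borrow occurred.
        "CF": a < b,
    }


if __name__ == "__main__":
    # Subtracting a value from itself gives the same flags whatever the value is.
    for value in (0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        print(hex(value), flags_after_sub(value, value))
```

When both operands are the same, every borrow and overflow condition is false and the result is zero, which is why the sub column has no undefined entries.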
I don’t know why xor won the battle, but I suspect it was just a case of swarming.
In my hypothetical history, xor and sub started out with roughly similar popularity, but xor took a slight lead due to some fluke, perhaps because it felt more “clever”.
When early compilers used xor to zero out a register, this started the snowball, because people would see the compiler generate xor and think, “Well, those compiler writers are smart, they must know something I don’t. Since I was on the fence between xor and sub, this tiny data point is enough to tip it toward xor.”
The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection to the instruction decoding front-end, which renames the destination to an internal zero register and bypasses execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”. The front-end detection also breaks dependency chains: normally, the output of an xor or sub is dependent on its inputs, but in this special case of xor’ing or sub’ing a register with itself, we know that the output is zero, independent of input.
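The dependency-breaking part can be pictured with a toy model. This is only a sketch of the idea, assuming two-operand register-to-register instructions, and says nothing about how a real decoder is built. The names `ZEROING_OPS` and `input_dependencies` are hypothetical.

```python
# Toy model: which registers must be ready before an instruction can produce its result?
# Only register-to-register forms are modeled; all names here are illustrative.
ZEROING_OPS = {"xor", "sub"}


def input_dependencies(op: str, dst: str, src: str) -> set:
    """Return the registers whose previous values the result depends on."""
    if op in ZEROING_OPS and dst == src:
        # Zeroing idiom: the result is zero no matter what the register held.
        return set()
    if op == "mov":
        # A plain move overwrites dst, so only src matters.
        return {src}
    # Ordinary two-operand arithmetic reads both dst and src.
    return {dst, src}


if __name__ == "__main__":
    print(input_dependencies("xor", "eax", "eax"))
    print(input_dependencies("xor", "eax", "ebx"))
    print(input_dependencies("sub", "ecx", "ecx"))
```

In this model, an instruction with an empty dependency set never has to wait for an earlier, slow computation that wrote the same register, which is the practical benefit of the special case.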
Even though Intel added support for both xor-detection and sub-detection, Stack Overflow worries that other CPU manufacturers may have special-cased xor but not sub, so that makes xor the winner in this ultimately meaningless battle.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
Bonus chatter: One of my former colleagues was partial to using sub r, r to zero a register, and when I was reading assembly code, I could tell that he was the author due to the use of sub to zero a register rather than the more popular xor.
Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination.
Using XOR should cause less power consumption.
In order to do sub, the logic needs to complement the number, then run an adder, which in order to reach high performance cannot use a simple carry look ahead.
XOR requires literally just an XOR gate per bit, therefore the total energy per op is orders of magnitude lower.
Keep in mind, however, that this was probably important many eons ago. With modern technologies we're probably talking less than a picojoule saved.
The energy needed for the instruction fetch and decode alone, especially in a CISC processor, is much higher, therefore the real...
On most modern processors, both xor r, r and sub r, r are recognized as a register clearing operation. They’re normally handled by register renaming, so they don’t involve carrying out an operation on the contents of a register at all.
Thanks for the clarification. The historical explanation still applies though!
The AF (auxiliary carry flag) is only used by the BCD instructions DAA, DAS, AAA, and AAS.
Few (if any) compilers ever emitted any of these, and they’re quite unusual even in hand-written assembly code. In long mode (64-bit code) they’re no longer available.
Bottom line: it’s exceedingly rare that leaving AF in a defined state matters (at all).
AF is part of the state saved by PUSHF. If for some reason you’re trying to verify that something runs *identically* on different machines, memory could plausibly differ because of different AF results. (That kind of verification seems more likely to be done for a kernel than for most user-space programs.) Still exceedingly rare, but not dependent on use of BCD instructions.
But yeah, this is fortunately pretty much a non-problem even if you care about rare CPUs that set AF with XOR.
There are reserved bits whose values aren’t guaranteed either, so if you do a pushf, there are already bits you should ignore. At least in 64-bit code, you almost certainly want to ignore AF in any case.
Another reason non-x86 ISAs (like Itanium, but also AArch64) can't have dep-breaking zeroing idioms for XOR is memory dependency ordering (memory_order_consume). Only x86 treats every load as acquire. Others need architectural rules about carrying dependencies to guarantee that you can do things like ptr[tmp-tmp] (in asm load / sub / load), and still have that load ordered after an earlier tmp=data_ready.load(consume). (Typically with a branch in between on the data-ready flag, but then also using it for the pointed-to data.)
Also, there's less reason to spend transistors on checking for zeroing idioms on RISCs or VLIWs...
Wasn’t xor better on Z80, and ppl just kept doing it on 8080 et seq mostly out of habit ? I tried to confirm, but the world is much younger than me these days so I couldn’t find a quick answer.
I went looking at datasheets - on the 8008, 8080, 8085 and Z80, both SUB and XOR are equivalent in terms of instruction byte code and clock cycles; additionally, it looks like all ALU operations (including both SUB and XOR) affect the flags, so you don't even get that benefit from XOR. The 4004 and 4040 didn't have XOR, and neither the MC6800 nor the 6502 have an XOR r,r instruction.
I'm guessing, based on what I've found so far, that you'd have to leave the world of microprocessors and go to minicomputers to find a CPU where XOR is better...