Matt Godbolt, probably best known for being the proprietor of Compiler Explorer, wrote a brief article on why x86 compilers love the xor eax, eax instruction.
The answer is that it is the most compact way to set a register to zero on x86. In particular, it is several bytes shorter than the more obvious mov eax, 0 since it avoids having to encode the four-byte constant. The x86 architecture does not have a dedicated zero register, so if you need to zero out a register, you’ll have to do it ab initio.
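The size difference is easy to see in the machine-code encodings. As a quick illustration (bytes shown are the classic 32-bit forms from the instruction set reference):

```nasm
31 C0             xor eax, eax    ; 2 bytes
29 C0             sub eax, eax    ; 2 bytes
B8 00 00 00 00    mov eax, 0      ; 5 bytes: opcode plus a 4-byte immediate
```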
But Matt doesn’t explain why everyone chooses xor over some other mathematical operation that is guaranteed to produce zero. In particular, what’s wrong with sub eax, eax? It encodes to the same number of bytes and executes in the same number of cycles. And its behavior with respect to flags is even better:
| Flag | xor eax, eax | sub eax, eax |
|---|---|---|
| OF | clear | clear |
| SF | clear | clear |
| ZF | set | set |
| AF | undefined | clear |
| PF | set | set |
| CF | clear | clear |
Observe that xor eax, eax leaves the AF flag undefined, whereas sub eax, eax clears it.
I don’t know why xor won the battle, but I suspect it was just a case of swarming.
In my hypothetical history, xor and sub started out with roughly similar popularity, but xor took a slight lead due to some fluke, perhaps because it felt more “clever”.
When early compilers used xor to zero out a register, this started the snowball, because people would see the compiler generate xor and think, “Well, those compiler writers are smart; they must know something I don’t. Since I was on the fence between xor and sub, this tiny data point is enough to tip it toward xor.”
The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”. The front-end detection also breaks dependency chains: Normally, the output of an xor or sub is dependent on its inputs, but in this special case of xor-ing or sub-ing a register with itself, we know that the output is zero, independent of input.
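To illustrate the dependency-breaking point, here is a sketch (not from the article) of the kind of sequence that benefits:

```nasm
imul eax, ebx      ; long-latency producer of eax
xor  eax, eax      ; recognized zeroing idiom: renamed to zero at the
                   ; front end, no execution unit needed, and the false
                   ; dependency on the old eax value is broken
add  eax, ecx      ; can issue right away; without idiom recognition it
                   ; would have to wait for the imul to complete
```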
Even though Intel added support for both xor-detection and sub-detection, Stack Overflow worries that other CPU manufacturers may have special-cased xor but not sub, so that makes xor the winner in this ultimately meaningless battle.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
Bonus chatter: One of my former colleagues was partial to using sub r, r to zero a register, and when I was reading assembly code, I could tell that he was the author due to the use of sub to zero a register rather than the more popular xor.
Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination.
Using XOR should cause less power consumption.
In order to do sub, the logic needs to complement the number and then run an adder, which, in order to reach high performance, cannot use a simple carry look-ahead.
XOR requires literally just an XOR gate per bit, so the total energy per op is orders of magnitude lower.
Keep in mind, however, that this was probably important many eons ago. With modern technologies we're probably talking less than a picojoule saved.
The energy needed for the instruction fetch and decode alone, especially in a CISC processor, is much higher, therefore the real...
The AF (auxiliary carry flag) is only used by the BCD instructions DAA, DAS, AAA, and AAS.
Few (if any) compilers ever emitted any of these, and they’re quite unusual even in hand-written assembly code. In long mode (64-bit code) they’re no longer available.
Bottom line: it’s exceedingly rare that leaving AF in a defined state matters (at all).
AF is part of the state saved by PUSHF. If for some reason you’re trying to verify that something runs *identically* on different machines, memory could plausibly differ because of different AF results. (That kind of verification seems more likely to be done for a kernel than for most user-space programs.) Still exceedingly rare, but it doesn’t depend on the use of BCD instructions.
But yeah, this is fortunately pretty much a non-problem even if you care about rare CPUs that set AF with XOR.
There are reserved bits whose values aren’t guaranteed either, so if you do a pushf, there are already bits you should ignore. At least in 64-bit code, you almost certainly want to ignore AF in any case.
Another reason non-x86 ISAs (like Itanium, but also AArch64) can't have dep-breaking zeroing idioms for XOR is memory dependency ordering. (memory_order_consume). Only x86 treats every load as acquire. Others need architectural rules about carrying dependencies to guarantee that you can do things like ptr[tmp-tmp] (in asm load / sub / load), and still have that load ordered after an earlier tmp=data_ready.load( consume). (Typically with a branch in between on the data-ready flag, but then also using it for the pointed-to data.)
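A sketch of the dependency chain that comment describes, in AArch64-style pseudo-assembly (register names are illustrative):

```asm
ldr  x0, [x4]        ; tmp = data_ready.load(memory_order_consume)
sub  x1, x0, x0      ; x1 = tmp - tmp = 0, but the ISA must still treat
                     ; x1 as *carrying* a dependency on x0
ldr  x2, [x3, x1]    ; ptr[tmp - tmp]: ordered after the first load only
                     ; because of that carried dependency, so sub here
                     ; cannot be a dependency-breaking idiom
```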
Also, there's less reason to spend transistors on checking for zeroing idioms on RISCs or VLIWs...
Wasn’t xor better on Z80, and people just kept doing it on 8080 et seq mostly out of habit? I tried to confirm, but the world is much younger than me these days so I couldn’t find a quick answer.
I bet it’s a case of folklore aka cargo cult programming. Fifty years ago (!), XOR was in fact faster than SUB on many machines.
On the PDP-10/KA, introduced in 1968, XOR was 25+% faster than SUB, which had to wait for the carry to propagate. The KA-10 was the main machine of the MIT AI Lab (a major center of hacker culture) until the late 1970s, and was probably the last major architecture that didn’t have fast carry.
That said, XOR was not the usual way of zeroing a register on the 10.
I believe it has to do with the condition flags: xor is simpler in that it always clears carry and overflow. Also, it can be represented as a two-byte opcode.
SUB same,same also always clears carry and overflow. (Because of the resulting value, so you have to think it through instead of looking at the manual and seeing it say the instruction always does that for every input, like with XOR.) So possibly. But like Raymond explained, the actual FLAGS results are if anything better, having a well-defined zero for AF with SUB vs. “undefined” for XOR.
On modern CPUs, of course, zeroing idioms are all the same, with AF=0, which works for SUB and is allowed for XOR. But this zeroing idiom became a thing long before it...
I assume Intel officially recommended XOR after it was already the most common idiom.
They even went so far as to have Silvermont only recognize XOR as a zeroing idiom, not SUB, and only at 32-bit operand size, not 64-bit, as I mentioned in my answer you linked.
I’m now curious if current E cores are still like that!
I wish Raymond had linked that one instead of the how many ways answer where I wrote “maybe” :/
Will edit it when I’m not on my phone.
The Motorola 68000 did have a CLR instruction, but CLR.L (32-bit) on a data register was slower than CLR.W or CLR.B due to the 16-bit ALU (6 cycles vs 4). However, there was also MOVEQ, which loaded a sign-extended 8-bit immediate into a data register. It was a single-word instruction (the immediate was stored in the low byte) and also took only 4 cycles. Therefore, MOVEQ #0, Dn was preferable to CLR.L Dn.
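In 68000 assembly, that trade-off looks like this (cycle counts as given above):

```m68k
clr.l   d0        ; clears all 32 bits, but takes 6 cycles on the 68000
moveq   #0,d0     ; sign-extends the 8-bit immediate 0 to 32 bits;
                  ; single-word instruction, 4 cycles
```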
I’m surprised that Intel hasn’t used the same “preferred alias” trick as AArch64, to specify a CLR r instruction that is the same bitpattern as XOR r,r but disassembles as CLR r instead of XOR r,r by default.
In AArch64, the most common instruction to see this with (there’s a few that have preferred aliases) is ORR; you write MOV Wd, #imm or MOV Xd, #imm, and it’s encoded as ORR Wd, WZR, #imm or ORR Xd, XZR, #imm – doing the “move immediate” operation as “logical or the zero register with the immediate” operation.
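The register-to-register form of MOV works the same way, and is perhaps the simplest illustration (a sketch; register names are arbitrary):

```asm
mov  x0, x1            ; what you write
orr  x0, xzr, x1       ; what is actually encoded: OR the zero register
                       ; with the source, which the disassembler prints
                       ; back as the preferred alias "mov"
```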
That would be a pseudo-instruction rather than an architectural instruction. Pseudo-instructions are provided by assemblers (and reverse-compiled by disassemblers), not CPUs, so it wouldn’t really be within Intel’s control. They could strongly recommend that assemblers and disassemblers support the pseudo-instruction, but it would take time for toolchains to get on board with it. The only pseudo-instruction on x86 that I know about is “nop” (xchg ax, ax), and it became a proper instruction on x86-64. [Because xchg eax, eax on x86-64 would normally zero out the upper 32 bits of rax.]