The time the x86 emulator team found code so bad that they fixed it during emulation

Raymond Chen

During an exchange of war stories, a colleague of mine told one from back in the days when Windows included a processor emulator for x86-32 on systems that natively ran some other processor. (This has happened many times. And no, I don’t know which processor this particular story applied to.)

This particular emulator employed binary translation, generating native code to perform the equivalent operations of the original x86-32 code. This offered a significant performance improvement over emulation via interpreter. You can imagine that x86-32 is just a bytecode, and the emulator is a JIT compiler.

Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtract 65536 from the stack pointer, and then initialize the memory in a small, tight loop.

But using a loop to initialize the memory was too mundane for whatever compiler was used to compile this code. Instead of generating a loop to initialize each byte of the buffer, the compiler “optimized” the code by unrolling the loop into 65,536 individual “write byte to memory” instructions, each 4 bytes long.

All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.
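
To make the arithmetic concrete, here is a minimal sketch in C of the tight loop and of the code-size calculation. The function names and the fill value are hypothetical; the story does not say what value the buffer was initialized with, or what the original code looked like. Only the 64KB buffer size and the 4-byte instruction length come from the story.

```c
#include <assert.h>
#include <stddef.h>

enum {
    BUFFER_SIZE = 65536,        /* 64KB buffer, from the story */
    STORE_INSTRUCTION_SIZE = 4  /* bytes per "write byte" instruction, from the story */
};

/* The tight loop: one store per iteration, so the amount of code
 * stays the same no matter how large the buffer is.
 * (Hypothetical helper; the fill value is an assumption.) */
void fill_buffer(unsigned char *buffer, size_t count, unsigned char value)
{
    assert(buffer != NULL);
    for (size_t i = 0; i < count; i++) {
        buffer[i] = value;
    }
}

/* Code size in bytes of a fully unrolled initializer:
 * one store instruction for every byte of data. */
size_t unrolled_code_size(size_t count, size_t instruction_size)
{
    return count * instruction_size;
}
```

Under these assumptions, `unrolled_code_size(BUFFER_SIZE, STORE_INSTRUCTION_SIZE)` comes to 262,144 bytes, which is the 256 kilobytes of code, whereas the loop version is a handful of instructions regardless of the buffer size.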

This offended the team so much that they added special code to the translator to detect this horrible function and replace it with the equivalent tight loop.

Topics

History

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

All comments


pac 2000 June 16, 2026

This is actually done in modern CPUs. Unrolling the loops in the fetch-ahead pipeline. And it is a thing in computer science. I specifically design my code to take advantage of it.

There is an expression: you can either have slow, memory-efficient code or fast code that hogs memory. But I don’t think it applies to using 256kb to initialise 64kb lol. That is a bit much 🤣🤣🤣
- Antonio Rodríguez June 16, 2026
  
  Apart from what Liam Proven has said, with which I fully agree, loop unrolling is very effective even on a much smaller scale. If a loop has a 1:3 efficiency (the loop executes 2 bookkeeping instructions for each instruction of actual work), fully unrolling will yield a 3x performance improvement. But you don’t need to unroll all the instructions. For the given case, filling 64 KB of memory, you can create a sequence of, say, 16 unrolled instructions and put them in a loop that executes 4096 times. That would get you to a 16:18 efficiency, almost three times the original 1:3 and quite close to the theoretical maximum of 1:1, using only a few dozen bytes of program memory. Doubling loop size from 16 to 32 also doubles memory use, but gets you a meager speed improvement of about 6%. As performance gain asymptotically approaches 1:1 while memory use has no bounds, you soon reach a point where you are wasting large amounts of memory to get an infinitesimal performance improvement. Unrolling all 64K instructions is plainly insane.
  
  - Jamey Kirby June 16, 2026
    
    I’ve implemented a Duff’s device on the 8088 back in the day. Fun stuff. Back when loop unrolling had gains.
  - Antonio Rodríguez June 16, 2026
    
    Simon, of course you are right. Real optimization implies precise measurement, period. Instruction dependencies, register renaming, three levels of cache (each with its own geometry and bandwidth) and memory latency can produce some nasty surprises. I was just trying to explain *theoretically* how the sweet spot is in the middle of simple loops and fully unrolled code.
  - Simon Farnsworth June 16, 2026
    
    Note that it’s worth measuring actual execution time on your hardware of choice, not just loop efficiency; OoOE can “hide” the cost of some bookkeeping instructions from you, either “naturally” (because the bookkeeping uses different EUs to the actual work, and it’s just using otherwise wasted power to do useful work – such as an integer loop counter for a SIMD loop), or because it recognizes the idiom you’ve used and is able to do something efficient.
    
    This means that you can get surprised because your unrolled loop accidentally introduces a false dependency between data items that wasn’t present in the 1:3 loop, so where the 1:3 loop takes 1 cycle per iteration, plus has up to 4 cycles after the loop before the output is ready, your 16:18 loop takes 76 cycles per iteration (5 per item in the unrolled loop, minus 4 for the loop restart) and has up to 4 cycles after the loop before the output is ready.
    
    Even without surprises, the gain may be less (or more) than you expect – your 1:3 loop might handle 1 data item per cycle, while your 16:18 loop might handle 2 data items per cycle before running out of EUs to use, and thus gets a 2x speed-up, not a nearly 3x performance improvement, and you could easily find that with just 2 EUs, a 2:4 loop gets you all the available speed-up. Equally, your 16:18 loop might find 6 EUs to use, and thus take 3 cycles to process 16 items (or 5.33 items per cycle) – but then if you’d unrolled to 6:8, you’d have 1 cycle to process 6 items, and be faster.
    
  - Neil Rashbrook June 16, 2026
    
    When the code size gets too large you first start overflowing the instruction cache, which might cancel out any potential speed improvement from executing fewer instructions, although I don’t know enough about caching to know whether you should aim to limit the loop to a single cache line or any other particular metric. Even after the various levels of cache are taken into account, once you hit 4K of code you need additional paging, which really kills performance. As such, even a fully rolled loop would be significantly faster than 256K of instructions.
- Liam Proven June 16, 2026
  
  Just let me check… you are trying to explain loop-unrolling to _Raymond Chen_?!
  
  What next? Write a letter to Donald Knuth with this new quick-sort routine you invented? Tweet John Carmack with a handy tip about using background buffers for off-screen redraw? Email Fabrice Bellard to tip him off about your Arm emulator on x86?
  - Scott Jones June 16, 2026
    
    Comments are not only for the blog author; they’re also for readers.
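
The partial-unrolling arithmetic discussed in the comments above can be sketched in C as follows. This is only an illustration under the simple instruction-counting model from that discussion (a fixed 2 bookkeeping instructions per loop iteration); the function names are hypothetical, and a real compiler may re-roll, vectorize, or otherwise transform the loop, so the model says nothing about measured speed.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 16  /* stores per loop iteration, as in the 16:18 example */

/* Fill count bytes using 16 stores per iteration.
 * Assumes count is a multiple of 16 (64KB gives 4096 iterations). */
void fill_unrolled16(unsigned char *p, size_t count, unsigned char v)
{
    assert(count % UNROLL == 0);
    for (size_t i = 0; i < count; i += UNROLL) {
        p[i + 0]  = v; p[i + 1]  = v; p[i + 2]  = v; p[i + 3]  = v;
        p[i + 4]  = v; p[i + 5]  = v; p[i + 6]  = v; p[i + 7]  = v;
        p[i + 8]  = v; p[i + 9]  = v; p[i + 10] = v; p[i + 11] = v;
        p[i + 12] = v; p[i + 13] = v; p[i + 14] = v; p[i + 15] = v;
    }
}

/* Fraction of executed instructions that do actual work, assuming
 * each iteration runs `unroll` stores plus `bookkeeping` overhead
 * instructions. This is a counting model, not a timing measurement. */
double loop_efficiency(unsigned unroll, unsigned bookkeeping)
{
    return (double)unroll / (double)(unroll + bookkeeping);
}
```

In this model, `loop_efficiency(1, 2)` is 1/3 and `loop_efficiency(16, 2)` is 16/18, a ratio of about 2.67. Going from 16 to 32 stores per iteration raises the figure from 16/18 to 32/34, a gain of a little under 6%, while doubling the size of the loop body. As the thread points out, only measurement on real hardware tells you where the sweet spot actually is.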