Why are there trivial functions like Copy­Rect and Equal­Rect?

Raymond Chen

Raymond

If you dig into the bag of tricks inside user32, you’ll see some seemingly-trivial functions like Copy­Rect and Equal­Rect. Why do we even need functions for things that could be done with the = and == operators?

Because those operators generate a lot of code.

Copying a rectangle would go like this:

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 02     mov  ax, es:[bx+2]  ; ax = source.top
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 02     mov  es:[bx+2], ax  ; dest.top = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 04     mov  ax, es:[bx+4]  ; ax = source.right
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 04     mov  es:[bx+4], ax  ; dest.right = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 06     mov  ax, es:[bx+6]  ; ax = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 06     mov  es:[bx+6], ax  ; dest.bottom = ax

This takes 54 bytes of code. It’s rather inefficient because the 8086 processor could indirect only through the bx, bp, si, and di registers. The bp register was reserved for use as the frame pointer, so that was off the table. The si and di registers were used as register variables, so they are busy holding something important. That leaves bx as the only register that can be used to dereference pointers.

Since this is a 16:16 pointer, we also need a segment register, and the 8086 has only four segment registers: cs (code segment), ds (data segment), ss (stack segment), es (extra segment). Three of them have dedicated purposes, so the only one left is es. Even if we could borrow si or di temporarily, we would still be bottlenecked on es.

If we move Copy­Rect to a function, then we can save a bunch of code:

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
53              push bx
06              push es
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
53              push bx
06              push es
9a xx xx xx xx  call CopyRect

Only 15 bytes. Less than a third the size.

This was the era in which developers counted bytes, and any trick to save a few bytes was worth considering, especially since you had “only” 256KB of memory.¹

And since copying and comparing rectangles were common operations, factoring the code into a function saved a lot of bytes.

Of course, nowadays, it’s not a lot of code to copy a rectangle manually: An entire rectangle fits into a single 128-bit register.

    mov    eax, [sourcerect]
    movups xmm0, [eax]
    mov    eax, [destrect]
    movups [eax], xmm0

Bonus code golf: We could have squeezed out a few instructions by moving two integers at a time. This requires that the two rectangles be non-overlapping in memory (to avoid data aliasing), but that’s probably a safe assumption because the original code didn’t work anyway in that case.

int v[5];
*(RECT*)&v[0] = *(RECT*)&v[1]; // bad idea

Switching to moving two integers at a time doesn’t break anything that wasn’t already broken, so let’s do it:

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
26 8b 57 02     mov  dx, es:[bx+2]  ; dx = source.top
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax
26 89 57 02     mov  es:[bx+2], dx  ; dest.top = dx

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 04     mov  ax, es:[bx+4]  ; ax = source.right
26 8b 57 06     mov  dx, es:[bx+6]  ; dx = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 04     mov  es:[bx+4], ax  ; dest.right = ax
26 89 57 06     mov  es:[bx+6], dx  ; dest.bottom = dx

That dropped us down to 42 bytes. It helps, but it’s still a lot of code.

If we’re willing to spill one of our other register variables, say, si, then we can squeeze it even further.

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
26 8b 57 02     mov  dx, es:[bx+2]  ; dx = source.top
26 8b 4f 04     mov  cx, es:[bx+4]  ; cx = source.right
26 8b 77 06     mov  si, es:[bx+6]  ; si = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax
26 89 57 02     mov  es:[bx+2], dx  ; dest.top = dx
26 89 4f 04     mov  es:[bx+4], cx  ; dest.right = cx
26 89 77 06     mov  es:[bx+6], si  ; dest.bottom = si

Only 36 bytes. Getting better. But still twice as big as calling CopyRect, and it cost us a register.

Another trick: Copy the rectangle through the stack.

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 ff 37        push es:[bx]        ; push source.left
26 ff 77 02     push es:[bx+2]      ; push source.top
26 ff 77 04     push es:[bx+4]      ; push source.right
26 8b 77 06     push es:[bx+6]      ; push source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 8f 47 06     pop  es:[bx+6]      ; pop dest.bottom
26 8f 47 04     pop  es:[bx+4]      ; pop dest.right
26 8f 47 02     pop  es:[bx+2]      ; pop dest.top
26 8f 47        pop  es:[bx]        ; pop dest.left

Hm, same code size as using registers.

Okay, how about borrowing the ds register as well the si and di registers?

1e              push ds
c5 7e ec        lds  di, [bp-14]
c4 76 f0        les  si, [bp-10]
fc              cld
a5              movsw
a5              movsw
a5              movsw
a5              movsw
1f              pop  ds

Thirteen bytes, yay, though it did cost us register spills that are not immediately visible.

This version is a tightrope walk because any operation that yields the processor risks discarding the former ds segment, which will cause problems because we will restore it to an invalid value and corrupt memory!

¹ The word “only” in in quotation marks because 256KB seems like a tiny amount of memory today, but at the time, that was the maximum amount of memory you could get for an IBM PC XT! At least not without resorting to expansion cards.

13 comments

Comments are closed.

  • Avatar
    Kalle Niemitalo

    I was going to post that you need to do “les si, [bx-10]” before “lds di, [bx-14]”, so that [bx-10] is read from the correct segment. However, if the addresses were [bp-10] and [bp-14] like in the other versions, then there would be no problem, because the segment would default to ss rather than ds.

    Does 16-bit Windows have structured exception handling, and does it restore ds before it invokes the exception handlers?

    • Raymond Chen
      Raymond Chen

      The overhead of memcpy is bigger than the overhead of CopyRect. memcpy would push an extra int (the number of bytes to copy, which takes 4 bytes since 8086 had no ‘push immediate’ instruction), and then follow up with an add sp, 10.

  • Yuhong Bao
    Yuhong Bao

    This reminds me of MulDiv. One problem is that the divide operator in C on x86-32 always calls a helper function for divides even for 64 by 32 divides because of the “divide overflow” exception, and the same was true on 16-bit too.

    • Avatar
      Neil Rashbrook

      The other advantage of MulDiv is that it took three 16-bit parameters and returned a 16-bit result, which was readily expressed in assembler but not in C… I mean, you could write `a*(long)b/c` but that division is going to be a library call on the 8086. (And don’t even think about using floats… although I think Raymond already covered that.)

  • Avatar
    Neil Rashbrook

    Having all those “trivial” functions as part of Windows actually made it possible to write trivial and maybe some not quite so trivial applications without linking the CRT. As I recall, although there was an API for it, there was no UI to reboot, except by pressing Ctrl+Alt+Del twice, which I’m not sure counts, as I don’t think it was a proper shutdown. Just by using the basic MSVC compiler options (i.e. not playing around with building an EXE manually) I was able to get the size of the resulting EXE down to 608 bytes, most of that from not linking the CRT, although the icon also weighed in at 1200 bytes.