# Why are there trivial functions like `Copy­Rect` and `Equal­Rect`?

Raymond C

If you dig into the bag of tricks inside `user32`, you’ll see some seemingly-trivial functions like `Copy­Rect` and `Equal­Rect`. Why do we even need functions for things that could be done with the `=` and `==` operators?

Because those operators generate a lot of code.

Copying a rectangle would go like this:

```c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 02     mov  ax, es:[bx+2]  ; ax = source.top
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 02     mov  es:[bx+2], ax  ; dest.top = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 04     mov  ax, es:[bx+4]  ; ax = source.right
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 04     mov  es:[bx+4], ax  ; dest.right = ax

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 06     mov  ax, es:[bx+6]  ; ax = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 06     mov  es:[bx+6], ax  ; dest.bottom = ax
```

This takes 54 bytes of code. It’s rather inefficient because the 8086 processor could indirect only through the `bx`, `bp`, `si`, and `di` registers. The `bp` register was reserved for use as the frame pointer, so that was off the table. The `si` and `di` registers were used as register variables, so they are busy holding something important. That leaves `bx` as the only register that can be used to dereference pointers.

Since this is a 16:16 pointer, we also need a segment register, and the 8086 has only four segment registers: `cs` (code segment), `ds` (data segment), `ss` (stack segment), `es` (extra segment). Three of them have dedicated purposes, so the only one left is `es`. Even if we could borrow `si` or `di` temporarily, we would still be bottlenecked on `es`.

If we move `Copy­Rect` to a function, then we can save a bunch of code:

```c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
53              push bx
06              push es
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
53              push bx
06              push es
9a xx xx xx xx  call CopyRect
```

Only 15 bytes. Less than a third the size.

This was the era in which developers counted bytes, and any trick to save a few bytes was worth considering, especially since you had “only” 256KB of memory.¹

And since copying and comparing rectangles were common operations, factoring the code into a function saved a lot of bytes.

Of course, nowadays, it’s not a lot of code to copy a rectangle manually: An entire rectangle fits into a single 128-bit register.

```    mov    eax, [sourcerect]
movups xmm0, [eax]
mov    eax, [destrect]
movups [eax], xmm0
```

Bonus code golf: We could have squeezed out a few instructions by moving two integers at a time. This requires that the two rectangles be non-overlapping in memory (to avoid data aliasing), but that’s probably a safe assumption because the original code didn’t work anyway in that case.

```int v[5];
*(RECT*)&v[0] = *(RECT*)&v[1]; // bad idea
```

Switching to moving two integers at a time doesn’t break anything that wasn’t already broken, so let’s do it:

```c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
26 8b 57 02     mov  dx, es:[bx+2]  ; dx = source.top
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax
26 89 57 02     mov  es:[bx+2], dx  ; dest.top = dx

c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 47 04     mov  ax, es:[bx+4]  ; ax = source.right
26 8b 57 06     mov  dx, es:[bx+6]  ; dx = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 47 04     mov  es:[bx+4], ax  ; dest.right = ax
26 89 57 06     mov  es:[bx+6], dx  ; dest.bottom = dx
```

That dropped us down to 42 bytes. It helps, but it’s still a lot of code.

If we’re willing to spill one of our other register variables, say, `si`, then we can squeeze it even further.

```c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 8b 07        mov  ax, es:[bx]    ; ax = source.left
26 8b 57 02     mov  dx, es:[bx+2]  ; dx = source.top
26 8b 4f 04     mov  cx, es:[bx+4]  ; cx = source.right
26 8b 77 06     mov  si, es:[bx+6]  ; si = source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 89 07        mov  es:[bx], ax    ; dest.left = ax
26 89 57 02     mov  es:[bx+2], dx  ; dest.top = dx
26 89 4f 04     mov  es:[bx+4], cx  ; dest.right = cx
26 89 77 06     mov  es:[bx+6], si  ; dest.bottom = si
```

Only 36 bytes. Getting better. But still twice as big as calling `CopyRect`, and it cost us a register.

Another trick: Copy the rectangle through the stack.

```c4 5e f0        les  bx, [bp-10]    ; es:bx -> source rect
26 ff 37        push es:[bx]        ; push source.left
26 ff 77 02     push es:[bx+2]      ; push source.top
26 ff 77 04     push es:[bx+4]      ; push source.right
26 8b 77 06     push es:[bx+6]      ; push source.bottom
c4 5e ec        les  bx, [bp-14]    ; es:bx -> destination rect
26 8f 47 06     pop  es:[bx+6]      ; pop dest.bottom
26 8f 47 04     pop  es:[bx+4]      ; pop dest.right
26 8f 47 02     pop  es:[bx+2]      ; pop dest.top
26 8f 47        pop  es:[bx]        ; pop dest.left
```

Hm, same code size as using registers.

Okay, how about borrowing the `ds` register as well the `si` and `di` registers?

```1e              push ds
c5 7e ec        lds  di, [bp-14]
c4 76 f0        les  si, [bp-10]
fc              cld
a5              movsw
a5              movsw
a5              movsw
a5              movsw
1f              pop  ds
```

Thirteen bytes, yay, though it did cost us register spills that are not immediately visible.

This version is a tightrope walk because any operation that yields the processor risks discarding the former `ds` segment, which will cause problems because we will restore it to an invalid value and corrupt memory!

¹ The word “only” in in quotation marks because 256KB seems like a tiny amount of memory today, but at the time, that was the maximum amount of memory you could get for an IBM PC XT! At least not without resorting to expansion cards.

• Kalle Niemitalo

I was going to post that you need to do “les si, [bx-10]” before “lds di, [bx-14]”, so that [bx-10] is read from the correct segment. However, if the addresses were [bp-10] and [bp-14] like in the other versions, then there would be no problem, because the segment would default to ss rather than ds.

Does 16-bit Windows have structured exception handling, and does it restore ds before it invokes the exception handlers?

• Oops, should have been bp. 16-bit Windows did not have structured exception handling. It didn’t really have exception handling at all! There was a way to hook into the global fault handler, but that was usually a path to sadness.

• Kalle Niemitalo

Also, movs copies from [ds:si] to [es:di].

• Piotr Siódmak

Couldn’t you just use memcpy to copy a rectangle? It’s already there and probably already used by other parts of code, so it’s 0 bytes of overhead.

• The overhead of memcpy is bigger than the overhead of CopyRect. memcpy would push an extra int (the number of bytes to copy, which takes 4 bytes since 8086 had no ‘push immediate’ instruction), and then follow up with an `add sp, 10`.

• Yuhong Bao

This reminds me of MulDiv. One problem is that the divide operator in C on x86-32 always calls a helper function for divides even for 64 by 32 divides because of the “divide overflow” exception, and the same was true on 16-bit too.

• Neil Rashbrook

The other advantage of MulDiv is that it took three 16-bit parameters and returned a 16-bit result, which was readily expressed in assembler but not in C… I mean, you could write `a*(long)b/c` but that division is going to be a library call on the 8086. (And don’t even think about using floats… although I think Raymond already covered that.)

• Yuhong Bao

MulDiv is not a trivial function either though.

• Neil Rashbrook

Well, I guess the rounding (which I keep forgetting about) and overflow checking made it non-trivial; the core of the function would otherwise just be a couple of instructions.

• Neil Rashbrook

Having all those “trivial” functions as part of Windows actually made it possible to write trivial and maybe some not quite so trivial applications without linking the CRT. As I recall, although there was an API for it, there was no UI to reboot, except by pressing Ctrl+Alt+Del twice, which I’m not sure counts, as I don’t think it was a proper shutdown. Just by using the basic MSVC compiler options (i.e. not playing around with building an EXE manually) I was able to get the size of the resulting EXE down to 608 bytes, most of that from not linking the CRT, although the icon also weighed in at 1200 bytes.

• 紅樓鍮

Code in the .NET BCL uses “throw helpers” to throw exceptions for the same reason. I always wondered why this is not done automatically by the compiler.

• Chris Guzak

Is the conclusion “Don’t use CopyRect anymore?”