August 3rd, 2020

Peeking inside the implementation of AnsiUpper and AnsiLower in Windows 1.0

Windows 1.0 had functions called AnsiUpper and AnsiLower. You passed these functions a pointer to a string, and they converted the string in place to uppercase or lowercase, respectively. If the segment portion of the pointer was zero, then the offset was treated as a character code, and the function returned the converted character code in the low byte of the return value.

The single-character version could anachronistically be wrapped like this:¹

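inline char AnsiUpperChar(char c)
{
    // Go through an integer type: a pointer cannot be reinterpret_cast
    // directly to char.
    return static_cast<char>(reinterpret_cast<UINT_PTR>(
        AnsiUpper(reinterpret_cast<LPSTR>(
            static_cast<unsigned char>(c)))));
}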

This is an anachronism because in 1983, there was no reinterpret_cast, no static_cast, no inline functions, and no C++.

It was more likely to be a macro.

#define AnsiUpperChar(c) ((char)AnsiUpper((LPSTR)(unsigned char)(c)))

The implementation of these functions is entirely in assembly language.

; Entry: pointer on stack
; Exit:  If single character, AL = converted character
;        If string, DX:AX = original pointer

AnsiUpper proc far
        mov bx, sp          ; custom stack frame
        push di             ; save registers
        push si
        les di, ss:[bx+4]   ; es:di -> string
        mov cx, es          ; cx:ax -> string
        mov ax, di
        call UpperChar      ; uppercase the character in AL
        jcxz aup90          ; Exit if CX = 0
        call UpperString    ; uppercase the string in ES:DI
        mov dx, es          ; return the original pointer
        mov ax, ss:[bx+4]
aup90:  pop si
        pop di
        ret 4
AnsiUpper endp

; Entry: AL = character
; Exit:  AL = uppercase version of character
; Modifies: No other registers

UpperChar proc near
        cmp al, 0x61        ; Q: Less than 'a'?
        jb uch90            ; Y: Nothing to do
        cmp al, 0x7a        ; Q: Less than or equal to 'z'?
        jbe uch80           ; Y: Convert to uppercase
        cmp al, 0xe0        ; Q: Less than 'à'?
        jb uch90            ; Y: Nothing to do
        cmp al, 0xfe        ; Q: More than 'þ'?
        ja uch90            ; Y: Nothing to do
uch80:  sub al, 0x20        ; Convert lowercase to uppercase
uch90:  ret
UpperChar endp

; Entry: ES:DI -> string to convert to uppercase
; Exit: String has been converted to uppercase in place
; Modifies: SI, DI, AL

UpperString proc near
        cld                 ; Ensure we walk forward
        mov si, di          ; ES:SI and ES:DI both -> string
ust10:  lodsb es:[si]       ; Load character and advance SI
        call UpperChar      ; Convert to uppercase
        stosb               ; Save result and advance DI
        or al, al           ; Q: End of string?
        jnz ust10           ; N: Keep converting
        ret
UpperString endp

The AnsiLower function is entirely analogous, so I won’t bother writing it out.

The AnsiUpper function doesn’t use the usual BP stack frame. To save code space, it uses BX as the stack frame pointer. That way, it doesn’t need to do all the usual frame setup and teardown stuff. This code does not call out to other code segments, so we won’t trigger any segment-not-present thunks that would require stack patching, so the lack of a proper BP frame is not going to cause a problem.

The structure of the AnsiUpper function is rather odd. It first assumes that you’re calling it with a single character and converts the low byte of the offset from lowercase to uppercase. Only after the conversion does it check whether you actually called it that way. If so, then it jumps to the exit with the converted character. Otherwise, it throws away all the work it did and starts over by converting the pointed-to string.

Why does it structure the code this way? Because it saves an instruction. Instead of

    if condition goto branch2
    do_branch1
    goto end
branch2:
    do_branch2
end:

you speculatively front-load one of the branches and discard it if it turns out to be the wrong branch.

    do_branch2
    if condition goto end
    do_branch1
end:

This removes the goto end from the instruction stream, saving two bytes.

Of course, this trick requires that do_branch2 has no side effects, or at least that the side effects can be rolled back if the speculation turns out to have been unwarranted.
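Here is a rough C++ model of that control flow, as a sketch rather than a transcription of the original. The names upper_char and ansi_upper_model are made up for illustration, and the segment_base parameter is an assumption of the sketch: it stands in for the memory that the segment refers to, and it is touched only when the segment is nonzero.

```cpp
#include <cstdint>

// Model of UpperChar: a-z and the 0xE0-0xFE block move down by 0x20.
inline std::uint8_t upper_char(std::uint8_t al)
{
    bool lower = (al >= 0x61 && al <= 0x7A) || (al >= 0xE0 && al <= 0xFE);
    return lower ? static_cast<std::uint8_t>(al - 0x20) : al;
}

// Hypothetical model of AnsiUpper. seg and off are the two halves of the
// far pointer. The return value packs DX:AX into 32 bits.
inline std::uint32_t ansi_upper_model(std::uint16_t seg, std::uint16_t off,
                                      std::uint8_t* segment_base)
{
    // Speculative step: pretend the offset is a character and convert AL.
    std::uint16_t ax = static_cast<std::uint16_t>(
        (off & 0xFF00) | upper_char(static_cast<std::uint8_t>(off & 0xFF)));
    if (seg == 0) {
        return ax;          // the guess was right, and we are done
    }

    // The guess was wrong: discard ax and convert the string in place.
    std::uint8_t* p = segment_base + off;
    std::uint8_t c;
    do {
        c = upper_char(*p);
        *p++ = c;
    } while (c != 0);

    // Return the original pointer.
    return (static_cast<std::uint32_t>(seg) << 16) | off;
}
```

The speculative conversion only touches a register, so there is nothing to roll back when the guess is wrong. In the assembly, DX is never set on the single-character path; the zero high half returned by the model is a simplification.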

The UpperChar function has a custom register-based calling convention. This is common in hand-written assembly language, allowing you to tailor the calling convention to the usage pattern.

You may have noticed that the UpperChar function doesn’t consult any code page tables to figure out which characters are uppercase and which are lowercase. It just hard-codes the special knowledge of code page 1252, which was the ANSI code page that Windows 1.0 used.

In the layout of code page 1252, the letters are in two blocks: One from A to Z, and another from À to Þ. Furthermore, the uppercase and lowercase versions are exactly 32 slots apart, so adding 32 gets you from uppercase to lowercase, and subtracting 32 gets you from lowercase to uppercase.
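The 32-slot relationship can be written out in a few lines of C++. This is only a sketch: the function names are invented, and the ranges used for the lowercase direction are an assumption, mirrored from UpperChar, since the AnsiLower code is not listed here.

```cpp
#include <cstdint>

// Byte values that UpperChar treats as lowercase.
constexpr bool is_lower_1252(std::uint8_t c)
{
    return (c >= 0x61 && c <= 0x7A) || (c >= 0xE0 && c <= 0xFE);
}

// The same two ranges shifted down by 32 slots.
// Assumption: AnsiLower tests these mirrored ranges.
constexpr bool is_upper_1252(std::uint8_t c)
{
    return (c >= 0x41 && c <= 0x5A) || (c >= 0xC0 && c <= 0xDE);
}

// Subtracting 32 goes from lowercase to uppercase.
constexpr std::uint8_t to_upper_1252(std::uint8_t c)
{
    return is_lower_1252(c) ? static_cast<std::uint8_t>(c - 32) : c;
}

// Adding 32 goes from uppercase to lowercase.
constexpr std::uint8_t to_lower_1252(std::uint8_t c)
{
    return is_upper_1252(c) ? static_cast<std::uint8_t>(c + 32) : c;
}
```

For example, to_upper_1252(0xE9) maps é to 0xC9 (É), and to_lower_1252(0xC9) maps it back.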

Okay, back to AnsiUpper. If it turns out that we have a string, then the work is done by the UpperString function. This function takes advantage of the special LODSB and STOSB instructions to load a single byte from the string and to write a single byte to the string. These are single-byte instructions that replace two larger instructions (load a byte and increment the index register), so they are handy when trying to squeeze every code byte out of your program.
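If you prefer to read loops in C++, the UpperString loop looks roughly like this. It is a sketch with invented names; the convert parameter stands in for the call to UpperChar and exists only so that the example is self-contained.

```cpp
#include <cstdint>

// Hypothetical model of the UpperString loop.
template <typename Convert>
void convert_in_place(std::uint8_t* str, Convert convert)
{
    const std::uint8_t* si = str;   // read cursor
    std::uint8_t* di = str;         // write cursor
    std::uint8_t al;
    do {
        al = *si++;                 // lodsb: load byte, advance SI
        al = convert(al);           // call UpperChar
        *di++ = al;                 // stosb: store byte, advance DI
    } while (al != 0);              // or al, al / jnz
}
```

Note that the null terminator also goes through the conversion and is written back before the loop exits. That is harmless because zero converts to zero.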

You may have spotted some quirks in this conversion code.

The UpperChar function treats U+00D7 × as the uppercase version of U+00F7 ÷. If you ask for the uppercase version of the division sign, you get the multiplication sign, and conversely when converting from uppercase to lowercase.

Another quirk is that the code doesn’t try to capitalize ß to SS. It just leaves it as ß. There is no uppercase ẞ in code page 1252.

Believe it or not, there was a point to this exercise beyond just digging up ancient code designed under very different constraints and marveling how it worked. We’ll put this function into context next time.

¹ You might be tempted to use this:

inline char AnsiUpperChar(char c)
{
    return static_cast<char>(reinterpret_cast<UINT_PTR>(
        AnsiUpper(reinterpret_cast<LPSTR>(c))));
}

but that doesn’t work because char is probably a signed type, so the char will be sign-extended, which means that a character in the 0x80 to 0xFF range will produce a pointer of the form 0xFFFF:0xFFxx. Since this does not have zero in the high word, it will be treated as a pointer and corrupt random memory.
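The sign extension is easy to see in isolation. The sketch below uses invented helper names and spells the parameter as signed char so that the behavior does not depend on whether plain char is signed on your compiler. It widens to 32 bits, the size of a 16-bit Windows far pointer.

```cpp
#include <cstdint>

// Widen a character directly. Values 0x80..0xFF are negative as a
// signed char, so the upper bits fill with ones.
constexpr std::uint32_t widen_signed(signed char c)
{
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(c));
}

// Widen through unsigned char first, so the upper bits stay zero.
constexpr std::uint32_t widen_via_unsigned(signed char c)
{
    return static_cast<std::uint32_t>(static_cast<unsigned char>(c));
}
```

For the byte 0xE9, widen_signed produces 0xFFFFFFE9, which has a nonzero high word and would be taken for a pointer, whereas widen_via_unsigned produces 0x000000E9.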

Topics
History

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

12 comments

  • Jonathan Harston

    What strikes me about so many character code pages is that - after the effort made in 7-bit ASCII to ensure there was a logical relationship between upper case and lower case - how little effort was made in any logical relationship between upper case and lower case characters in top-bit-set characters.

    Plus, there's no logic to the ordering of linedraw characters. I grew up with computers where top-bit-set characters had logical upper/lower case, and linedraw...

  • googolplex3

    Did you disassemble this or did you get it from source? I would love to see Windows 1.0 open-sourced, but then again NT probably still has code from it lol

    • Gareth Poole

      If they open sourced Windows 1 and 2 like they did with the early versions of DOS, I would be pretty happy. Of course, if they open sourced Windows 3.x it would be like Christmas morning. Win16 is something I’ve always had a lot of fun with.

  • Neil Rashbrook

    I wonder what the point of allowing char to be signed was, apart from forcing everyone to cast to unsigned char when they wanted to do anything useful with it. I do hope char8_t catches on.

    • Raymond Chen (Microsoft employee, Author)

      For some processors, unsigned char is more expensive than signed char. For example, the SuperH-3 has “load 8-bit value with sign extension” but no “load 8-bit value with zero extension”. Loading an 8-bit value with zero extension requires a second instruction to zero out the high-order bits. Those processors are much more efficient if char is signed.

      • Neil Rashbrook

        Ah, so basically it’s only useful for someone who’s not using the 8th bit.

  • Alexis Ryan

    Specifically windows 1 only?

    Removed from Windows 2 cause there was a standard library equivalent?

    • Neil Rashbrook

      Removed? Don’t be silly. It has to be exported from all 16-bit kernels (including the one in NTVDM in 32-bit Windows 10) so that all of your Windows 1.0 applications still work.

      (The actual fate of the function itself is that it has since been renamed to CharUpper, with a macro to help you port your Windows 1.0 application.)

  • Mystery Man

    Wow. I lost the will to live! Computing must have been very awful in those days.

    • Waleri Todorov

      On the contrary! It was fun!

      • Mystery Man

        For me, it certainly was. I never had any interest in anything other than computing. But for plain users?

        At least, back then, the software worked as intended. Right now, I’m commenting on a blog that allows me to mark only my own comment as spam! This is not what I call “the software that works as intended”.

  • Peter Cooper Jr.

    Well, I haven’t tried finding a better citation, but the Wikipedia article on Windows-1252 that you linked says that “The first version of the codepage 1252 used in Microsoft Windows 1.0 did not have positions D7 and F7 defined.” As those are the multiplication and division signs, it might have been perfect sense at the time to not worry about how the uppercase & lowercase functions worked on them.