March 9th, 2020

Why does MS-DOS put an int 20h at byte 0 of the COM file program segment?

The MS-DOS .com file format is very simple: It is just a memory dump of the 16-bit address space starting at offset 0100h, and continuing for the size of the program.

The memory below 0100h also had a specific format, known as the Program Segment Prefix. There’s a lot of stuff in there, but the parts that are interesting for today’s discussion are the following:

  • At offset 0000h is an int 20h instruction.
  • At offset 0005h is a jmp instruction.
  • At offset 005Ch is a file control block that contains the first command line argument, parsed as if it were a file name.
  • At offset 006Ch is a file control block that contains the second command line argument, parsed as if it were a file name.
  • At offset 0080h is the command line.
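
To make the layout above concrete, here is a rough sketch in Python that fills in those fields of a 256-byte prefix. It is an illustration under simplifying assumptions, not how MS-DOS itself does it: the helper names `build_psp` and `parse_fcb_name` are made up for this example, drive-letter and wildcard handling in the file control blocks is ignored, and the jmp at offset 0005h is left out.

```python
PSP_SIZE = 0x100  # the prefix occupies offsets 0000h through 00FFh


def parse_fcb_name(arg):
    """Format an argument as the 11-byte NAME + EXT field of an FCB.

    Simplified (hypothetical helper): no drive letters, no wildcards.
    """
    name, _, ext = arg.upper().partition(".")
    return name[:8].ljust(8).encode("ascii") + ext[:3].ljust(3).encode("ascii")


def build_psp(command_tail):
    """Build a simplified PSP image for the given command tail."""
    psp = bytearray(PSP_SIZE)

    # Offset 0000h: int 20h, encoded as the two bytes CD 20.
    psp[0x00:0x02] = b"\xCD\x20"

    # Offsets 005Ch and 006Ch: the first two arguments, parsed as file names.
    # A missing argument leaves a blank (space-filled) name.
    args = (command_tail.split() + ["", ""])[:2]
    for offset, arg in zip((0x5C, 0x6C), args):
        psp[offset] = 0  # drive byte: 0 means "default drive"
        psp[offset + 1:offset + 12] = parse_fcb_name(arg)

    # Offset 0080h: length byte, then the command tail, then a carriage return.
    tail = command_tail.encode("ascii")[:126]
    psp[0x80] = len(tail)
    psp[0x81:0x81 + len(tail)] = tail
    psp[0x81 + len(tail)] = 0x0D
    return psp
```

For example, `build_psp(" foo.txt bar.dat")` puts `FOO     TXT` at offset 005Dh and `BAR     DAT` at offset 006Dh, with the raw text still available at 0081h.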

The int 20h is the “exit program” system call. One theory is that it is placed at offset 0000h so that if execution runs off the end of the code segment, the instruction pointer will wrap back around to zero, and then the program will terminate.

An interesting theory, but unlikely. The odds of execution running harmlessly off the end of the code segment are slim to none.

These specific bytes are significant because they line up exactly with how CP/M organized its zero page. Keeping these important addresses the same made it easier to port CP/M programs to MS-DOS.

And CP/M put the “exit program” system call at offset 0000h because it started each program with 0000h on the stack. If the program executed a ret instruction, it would return back to zero, and exit the program. Just like if you do a return from main.
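
The return-to-zero trick can be sketched with a toy interpreter. This is only a model, written in Python with x86 encodings for familiarity (90 for nop, C3 for a near ret, CD 20 for int 20h); the function name `run` and the initial stack pointer of FFFEh are assumptions made for the example, and only those three instructions are understood.

```python
NOP = 0x90              # x86 nop
RET = 0xC3              # x86 near ret
INT_20H = b"\xCD\x20"   # x86 int 20h


def run(program):
    """Run a tiny program loaded at 0100h; return the list of visited IPs.

    Toy model: one 64KB segment, three supported instructions.
    """
    mem = bytearray(0x10000)
    mem[0x0000:0x0002] = INT_20H                    # exit call at offset 0000h
    mem[0x0100:0x0100 + len(program)] = program     # the program image

    sp = 0xFFFE                                     # assumed initial stack pointer
    mem[sp:sp + 2] = (0).to_bytes(2, "little")      # the word 0000h on the stack

    ip = 0x0100
    trace = []
    while True:
        trace.append(ip)
        opcode = mem[ip]
        if opcode == NOP:
            ip = (ip + 1) & 0xFFFF
        elif opcode == RET:
            # Pop the return address; with 0000h on the stack this lands on int 20h.
            ip = int.from_bytes(mem[sp:sp + 2], "little")
            sp = (sp + 2) & 0xFFFF
        elif opcode == 0xCD and mem[(ip + 1) & 0xFFFF] == 0x20:
            return trace                            # int 20h: program terminates
        else:
            raise ValueError(f"unsupported opcode {opcode:02X} at {ip:04X}")
```

Running `run(bytes([NOP, RET]))` visits 0100h, 0101h, and then 0000h, where the exit call ends the program.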

And although int 21h was the primary system call for MS-DOS, it supported the CP/M system call address: call 0005h. To further ease the porting effort from CP/M to MS-DOS, MS-DOS chose system call function codes to match the CP/M function codes.
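
One way to picture the shared function codes is a single dispatch table reached from two entry points. The sketch below is a loose Python model, not the actual MS-DOS dispatcher: the names `int_21h`, `call_5`, and `HANDLERS` are invented for illustration, the register passing (function code in AH for int 21h, in CL for call 0005h) is reduced to a plain parameter, and only two functions that I believe carry the same number on both systems are shown (02h, console output, and 09h, print a `$`-terminated string).

```python
def console_output(ch, out):
    """Function 02h: write one character."""
    out.append(ch)


def print_string(text, out):
    """Function 09h: write a string up to, but not including, the '$'."""
    out.append(text.split("$", 1)[0])


# One table of function codes serves both entry points.
HANDLERS = {
    0x02: console_output,
    0x09: print_string,
}


def int_21h(ah, arg, out):
    """Native entry point: function code arrives in AH."""
    HANDLERS[ah](arg, out)


def call_5(cl, arg, out):
    """CP/M-style entry point: function code arrives in CL, same numbering."""
    int_21h(cl, arg, out)
```

Because the numbering matches, a ported program that issues function 09h through `call_5` gets the same behavior as a native program that issues it through `int_21h`.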

In other words, the int 20h is at offset 0000h for backward compatibility with CP/M.

Bonus chatter: The CP/M history also calls out how unlikely it is for execution to run off the end of the segment and wrap around. In order for that to happen, it would have to somehow execute through the operating system itself, because CP/M put the operating system at the highest available address. (Also, the highest available address may not be 0xFFFF because the system could very well have less than 64KB of memory.)

Follow-up: Commenter Jim Nelson points out that the jump instruction at offset 0005h deserves an entire article by itself, and fortunately he also provided a link to that article. It’s a wild tale of deception, lies, and the A20 line.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

13 comments


  • Thomas Harte

    For full CP/M compatibility, the OS has to be present at the top of address space, _and_ begin on a 256-byte boundary. So $FEFF is actually the potential top of RAM.

    Specifically: many applications, including Microsoft BASIC, crib the target address from the JMP at $0005 and dynamically modify themselves slightly to reduce calling costs. So the JMP needs to be to an actual JMP to a real address.

    Some, such as Wordstar 4, go beyond that...

    • John Elliott

      In real CP/M, the jump at 5 is to $xx06, because the first 6 bytes of the OS are the serial number.

  • Dmitry

    I’ve been telling (and showing) this stuff to students learning x86 assembly programming for several years by now. Believe it or not but this year discussing call and ret is going to happen the next week, and the fact about zero at the top of the com-program’s stack has already been given. Yet another case of your posts being somehow synchronized with what people around the world do. Cool!

  • John Elliott

    This is connected to the long-standing bug with PSPs created by DEBUG where the CALL 5 points to the wrong place - the code at SAVSTK creates a stack, pushes 0 onto it, and then subtracts 128 words -- so the resulting address (which is then used to calculate the CALL 5 jump) isn't paragraph-aligned. It was still like that in the DEBUG.COM that came with Windows 10 x86, last time I checked.

    Simplest fix is...

    • Jim Nelson

      For an interesting look at the PSP, call 0005h, the A20 line, and how MS-DOS attempted to emulate CP/M back in the day, check out http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/

      Of particular interest is interrupt 30h, which DOS overwrote with code to support call 0005h and offset 06h (which CP/M's page zero used to indicate the top of program memory). This is why Ralf Brown's interrupt list has NOT A VECTOR! for interrupt 30h (http://www.ctyme.com/intr/int-30.htm)

    • Antonio Rodríguez

      If a known bug has been around for nearly 40 years, chances are that someone is relying on it.

      • R Wells

        Or no one for the previous 40 years was developing an application using Call 5 on MSDOS 2 or later and having to debug it with DEBUG.

  • Pierre Baillargeon

    The reason to put int 20h at zero may also be to cover calling through a null function pointer.

    IIRC, the Amiga had nothing at address zero for this reason. (And also so that writing through a null pointer was less likely to corrupt something important. The Motorola 68000 didn't have memory protection.) Amusingly, the Amiga also had a very important piece of data at address 4, the pointer to the kernel entry point. (Well, technically, the...

    • David Walker

      “The Amiga had nothing” at address zero? The bits weren’t there, or there was a black hole, or what? How can there be “nothing” at a given address?

      • Tuomas Tynkkynen

        As others corrected, Amiga having “nothing” at address zero is not really correct, but here is a simplified explanation of how “nothing” can be at a given address:

        When a CPU does a read, it puts out the address it wants to read from on the address pins of the CPU chip. Then every other chip connected to the bus (like ROM or RAM chip) will look at the address to decide if the address belongs to...

      • Muzer

        Should probably read "nothing important". I think it was mapped to a piece of RAM or ROM (I forget which) which isn't used by anything.

        EDIT: Though having said that, I just looked it up and it turns out the first exception vector is stored in address 0 - but this is only used during a power on or a reset so is unlikely to be a problem. Address 4 was the second exception vector so...

      • smf

        Address 0 is where the 68000 fetches the stack pointer during reset.
        Address 4 is where it fetches the program counter during reset.

        However during reset the Kickstart ROM is mirrored at location $0 so the values can be fetched & is then switched out quite quickly so that RAM appears there.

        Once the Amiga has booted then Address 4 is where the address of exec base is stored. You won't get very far executing it, because...

  • Alexis Ryan

    A Very Old new thing. I always find details of how MS-DOS works fascinating.