The MS-DOS .com file format is very simple: It is just a memory dump of the 16-bit address space starting at offset 0100h, and continuing for the size of the program.
The memory below 0100h also had a specific format, known as the Program Segment Prefix. There’s a lot of stuff in there, but the parts that are interesting for today’s discussion are the following:
- At offset 0000h is an int 20h instruction.
- At offset 0005h is a jmp instruction.
- At offset 005Ch is a file control block that contains the first command line argument, parsed as if it were a file name.
- At offset 006Ch is a file control block that contains the second command line argument, parsed as if it were a file name.
- At offset 0080h is the command line.
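
To make the layout concrete, here is a minimal Python sketch that builds a made-up 256-byte PSP image and reads back the fields listed above. The helper names (`build_sample_psp`, `parse_fcb_name`, `parse_psp`) and the sample arguments are hypothetical. The sketch also assumes the usual conventions that a file control block starts with a drive byte followed by a space-padded 8.3 name, and that the command line is stored as a length byte followed by the text and a carriage return.

```python
PSP_SIZE = 0x100  # the PSP occupies offsets 0000h through 00FFh


def build_sample_psp(arg1: str, arg2: str) -> bytearray:
    """Build a made-up PSP image; a real one is filled in by the DOS loader."""
    psp = bytearray(PSP_SIZE)
    psp[0x00:0x02] = b"\xCD\x20"  # machine code for int 20h
    for offset, arg in ((0x5C, arg1), (0x6C, arg2)):
        name, _, ext = arg.upper().partition(".")
        # Drive byte (0 = default drive), then the 8.3 name padded with spaces.
        fcb = (b"\x00"
               + name[:8].ljust(8).encode("ascii")
               + ext[:3].ljust(3).encode("ascii"))
        psp[offset:offset + 12] = fcb
    tail = f" {arg1} {arg2}".encode("ascii")
    psp[0x80] = len(tail)                  # length byte at 0080h
    psp[0x81:0x81 + len(tail)] = tail      # command line text
    psp[0x81 + len(tail)] = 0x0D           # carriage return terminator
    return psp


def parse_fcb_name(psp: bytes, offset: int):
    """Return (drive, 'NAME.EXT') from the start of an FCB at the given offset."""
    drive = psp[offset]
    name = psp[offset + 1:offset + 9].decode("ascii").rstrip()
    ext = psp[offset + 9:offset + 12].decode("ascii").rstrip()
    return drive, (f"{name}.{ext}" if ext else name)


def parse_psp(psp: bytes) -> dict:
    """Pull out the fields discussed in the list above."""
    length = psp[0x80]
    return {
        "exit_stub": bytes(psp[0x00:0x02]),
        "fcb1": parse_fcb_name(psp, 0x5C),
        "fcb2": parse_fcb_name(psp, 0x6C),
        "command_line": psp[0x81:0x81 + length].decode("ascii"),
    }


if __name__ == "__main__":
    print(parse_psp(build_sample_psp("foo.txt", "bar.dat")))
```

The offset 0005h entry is deliberately left out of the sketch, since decoding it depends on details not covered here.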
The int 20h is the “exit program” system call. One theory is that it is placed at offset 0000h so that if execution runs off the end of the code segment, the instruction pointer will wrap back around to zero, and then the program will terminate.
An interesting theory, but unlikely. The odds of execution running harmlessly off the end of the code segment are slim to none.
These specific bytes are significant because they line up exactly with how CP/M organized its zero page. Keeping these important addresses the same made it easier to port CP/M programs to MS-DOS.
And CP/M put the “exit program” system call at offset 0000h because it started each program with 0000h on the stack. If the program executed a ret instruction, it would return to zero, and exit the program. Just like if you do a return from main.
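
The return-to-zero trick can be illustrated with a toy simulation. This is only a sketch of the MS-DOS .com case, not an emulator: it assumes a full 64KB segment with the stack pointer starting at FFFEh and a zero word on top of the stack, and it understands just two x86 encodings, C3 (near ret) and CD 20 (int 20h). The function name `run_com_that_just_returns` is made up for illustration.

```python
def run_com_that_just_returns():
    """Trace a .com program whose only instruction is ret."""
    mem = bytearray(0x10000)               # one 64KB segment
    mem[0x0000:0x0002] = b"\xCD\x20"       # int 20h at offset 0000h (in the PSP)
    mem[0x0100] = 0xC3                     # the program: a single near ret

    sp = 0xFFFE                            # assumed initial stack pointer
    mem[sp:sp + 2] = (0).to_bytes(2, "little")  # 0000h sitting on the stack

    ip = 0x0100                            # .com programs start at 0100h
    trace = []
    while True:
        opcode = mem[ip]
        if opcode == 0xC3:                 # ret: pop the return address into ip
            trace.append((ip, "ret"))
            ip = int.from_bytes(mem[sp:sp + 2], "little")
            sp = (sp + 2) & 0xFFFF
        elif opcode == 0xCD and mem[ip + 1] == 0x20:
            trace.append((ip, "int 20h"))  # exit program
            return trace
        else:
            raise ValueError(f"unhandled opcode {opcode:02X} at {ip:04X}")


if __name__ == "__main__":
    for address, instruction in run_com_that_just_returns():
        print(f"{address:04X}  {instruction}")
```

The ret pops 0000h into the instruction pointer, so the next instruction executed is the int 20h sitting at the start of the PSP.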
And although int 21h was the primary system call for MS-DOS, it supported the CP/M system call address: call 0005h. To further ease the porting effort from CP/M to MS-DOS, MS-DOS chose system call function codes to match the CP/M function codes.
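
As a rough illustration of the shared numbering, the sketch below routes both entry points through one table. The table is a small hand-picked subset of function codes that mean the same thing in CP/M and in MS-DOS, and the Python functions (`dispatch`, `int_21h`, `call_5`) are hypothetical stand-ins, not how DOS is actually implemented. It assumes the usual register conventions: the function code travels in AH for int 21h and in CL for the CP/M-style call 0005h.

```python
# A few function codes with the same meaning in CP/M and in MS-DOS int 21h.
SHARED_FUNCTIONS = {
    0x01: "console input",
    0x02: "console output",
    0x09: "print $-terminated string",
    0x0A: "buffered console input",
    0x0B: "console status",
}


def dispatch(function_code: int) -> str:
    """Look up a function code in the shared table."""
    return SHARED_FUNCTIONS.get(function_code, "unknown")


def int_21h(ah: int) -> str:
    """Native MS-DOS entry point: function code in AH."""
    return dispatch(ah)


def call_5(cl: int) -> str:
    """CP/M-compatible entry point: function code in CL."""
    return dispatch(cl)


if __name__ == "__main__":
    print(int_21h(0x09))
    print(call_5(0x09))
```

Because the numbers line up, a ported program could keep its function codes unchanged regardless of which entry point it used.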
In other words, the int 20h is at offset 0000h for backward compatibility with CP/M.
Bonus chatter: The CP/M history also calls out how unlikely it is for execution to run off the end of the segment and wrap around. In order for that to happen, it would have to somehow execute through the operating system itself, because CP/M put the operating system at the highest available address. (Also, the highest available address may not be 0xFFFF because the system could very well have less than 64KB of memory.)
Follow-up: Commenter Jim Nelson points out that this jump instruction deserves an entire article by itself, and fortunately he also provided a link to that article. It’s a wild tale of deception, lies, and the A20 line.
For full CP/M compatibility, the OS has to be present at the top of address space, _and_ begin on a 256-byte boundary. So $FEFF is actually the potential top of RAM.
Specifically: many applications, including Microsoft BASIC, crib the target address from the JMP at $0005 and dynamically modify themselves slightly to reduce calling costs. So the JMP needs to be to an actual JMP to a real address.
Some, such as Wordstar 4, go beyond that...
In real CP/M, the jump at 5 is to $xx06, because the first 6 bytes of the OS are the serial number.
I’ve been telling (and showing) this stuff to students learning x86 assembly programming for several years now. Believe it or not, this year we discuss call and ret next week, and the fact about the zero at the top of a .com program’s stack has already been covered. Yet another case of your posts somehow being synchronized with what people around the world are doing. Cool!
This is connected to the long-standing bug with PSPs created by DEBUG where the CALL 5 points to the wrong place - the code at SAVSTK creates a stack, pushes 0 onto it, and then subtracts 128 words -- so the resulting address (which is then used to calculate the CALL 5 jump) isn't paragraph-aligned. It was still like that in the DEBUG.COM that came with Windows 10 x86, last time I checked.
Simplest fix is...
For an interesting look at the PSP, call 0005h, the A20 line, and how MS-DOS attempted to emulate CP/M back in the day, check out http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/
Of particular interest is interrupt 30h, which DOS overwrote with code to support call 0005h and offset 06h (which CP/M's page zero used to indicate the top of program memory). This is why Ralf Brown's interrupt list has NOT A VECTOR! for interrupt 30h (http://www.ctyme.com/intr/int-30.htm)
If a known bug has been around for nearly 40 years, chances are that someone is relying on it.
Or no one for the previous 40 years was developing an application using Call 5 on MSDOS 2 or later and having to debug it with DEBUG.
The reason to put int 20h at zero may also be to cover calling through a null function pointer.
IIRC, the Amiga had nothing at address zero for this reason. (And also so that writing through a null pointer was less likely to corrupt something important. The Motorola 68000 didn't have memory protection.) Amusingly, the Amiga also had a very important piece of data at address 4, the pointer to the kernel entry point. (Well, technically, the...
“The Amiga had nothing” at address zero? The bits weren’t there, or there was a black hole, or what? How can there be “nothing” at a given address?
As others have pointed out, saying the Amiga had “nothing” at address zero is not really correct, but here is a simplified explanation of how “nothing” can be at a given address:
When a CPU does a read, it puts out the address it wants to read from on the address pins of the CPU chip. Then every other chip connected to the bus (like ROM or RAM chip) will look at the address to decide if the address belongs to...
Should probably read "nothing important". I think it was mapped to a piece of RAM or ROM (I forget which) which isn't used by anything.
EDIT: Though having said that, I just looked it up and it turns out the first exception vector is stored in address 0 - but this is only used during a power on or a reset so is unlikely to be a problem. Address 4 was the second exception vector so...
Address 0 is where the 68000 fetches the stack pointer during reset.
Address 4 is where it fetches the program counter during reset.
However during reset the Kickstart ROM is mirrored at location $0 so the values can be fetched & is then switched out quite quickly so that RAM appears there.
Once the Amiga has booted then Address 4 is where the address of exec base is stored. You won't get very far executing it, because...
A Very Old New Thing. I always find the details of how MS-DOS worked fascinating.