If I issue multiple overlapped I/O requests against the same region of a file, will they execute in the order I issued them?

Raymond Chen

Suppose you open a file in overlapped mode, allowing you to issue multiple overlapped I/O requests against the handle. You then issue two I/O write operations like this:

  1. Write 8192 bytes starting at offset 0.
  2. Write 8192 bytes starting at offset 4096.

What can be guaranteed about the contents of the file after both I/O operations have completed? Let’s assume that 4096 is a multiple of the sector size, so all writes are to full sectors.

Can we assume that when everything is finished, the contents of the overlapping region come from the second write request?

No, that is not a valid assumption. The term overlapped in “overlapped I/O” refers to temporal overlap, and it means that multiple requests can be outstanding against a single handle, but each request has an independent lifetime. It is possible for one request to race ahead of another request that had been submitted earlier. This is a common occurrence for rotational storage media, because the driver will reorder I/O requests to reduce the amount of seeking required.
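To make the consequence concrete, here is a minimal sketch in Python. It is a toy model, not the Windows API: the 12 KB in-memory "file", the fill bytes `A` and `B`, and the helper name `apply_writes` are all assumptions made for illustration. It applies the two writes from the example in both possible orders and looks at the overlapping region.

```python
def apply_writes(order):
    """Apply the article's two writes to a zero-filled 12 KB buffer.

    `order` lists request numbers in the order they take effect,
    which need not be the order in which they were issued.
    """
    data = bytearray(12288)
    writes = {
        1: (0, b"A" * 8192),     # first request: 8192 bytes at offset 0
        2: (4096, b"B" * 8192),  # second request: 8192 bytes at offset 4096
    }
    for request in order:
        offset, payload = writes[request]
        data[offset:offset + len(payload)] = payload
    return bytes(data)


in_issue_order = apply_writes([1, 2])
reordered = apply_writes([2, 1])

# The overlapping region is bytes 4096..8191.
print(in_issue_order[4096:8192] == b"B" * 4096)  # True: the second write wins
print(reordered[4096:8192] == b"A" * 4096)       # True: the first write wins
```

Both results are legitimate outcomes of the same two requests, which is why the contents of the overlapping region cannot be assumed.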

The driver is also likely to coalesce I/O operations that target the same physical sectors, and there is no requirement that overlapping writes be applied in the order of issue. It might be that the I/O passes through other layers before it reaches the disk driver, and the later write request might pass through the layer faster than the earlier one.

For example, in our brief look at the I/O layering, we see that I/O passes through the volume snapshot service, which can turn a single write I/O into a write, read, and write. In theory, then, this could happen:

First I/O (write to sector N)                     Second I/O (write to sector N)

Check if sector is part of a lazy snapshot
Yes: Need to back up original data
  Read sector N
  Reserve sector N′ for snapshot
  Write sector N′
  Mark sector N as not part of a lazy snapshot
                                                  Check if sector is part of a lazy snapshot
                                                  No: Don’t need to back up original data
                                                  Write sector N
Write sector N

If a second write to sector N occurs at just the right time, it will race ahead of the first write, causing the two writes to reach the disk driver out of order.
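The timeline above can be replayed with a small Python sketch. This is a toy model of a snapshot layer, not the real volume snapshot service: the generator `snapshot_write`, the sector number 7, and the hand-written `schedule` are all hypothetical. Each `yield` marks a point where the other request is allowed to run.

```python
def snapshot_write(name, sector, lazy_sectors, disk_order):
    """One write request passing through a toy snapshot layer."""
    if sector in lazy_sectors:
        yield f"{name}: sector {sector} is in a lazy snapshot, read original data"
        yield f"{name}: reserve a backup sector and write the original data there"
        lazy_sectors.discard(sector)
        yield f"{name}: mark sector {sector} as not part of a lazy snapshot"
    disk_order.append(name)  # the write reaches the disk driver here
    yield f"{name}: write sector {sector}"


lazy_sectors = {7}  # sector 7 still needs to be backed up
disk_order = []     # order in which writes reach the disk driver
requests = {
    "first": snapshot_write("first", 7, lazy_sectors, disk_order),
    "second": snapshot_write("second", 7, lazy_sectors, disk_order),
}

# The second request arrives just after the first has finished its backup.
schedule = ["first", "first", "first", "second", "first"]
for who in schedule:
    print(next(requests[who]))

print(disk_order)  # ['second', 'first']
```

The first request pays for the backup, and the second request benefits from it, so the later write reaches the disk driver first.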

Okay, so the writes can complete out of order to storage. But will they at least be atomic? Can I be sure that the final results will either be the entirety of the first write or the entirety of the second?

Not really.

If the write is spread out over multiple sectors, then the individual writes can get split. In the above example, if the write request involved two sectors, and only one of them was part of a lazy snapshot, then the other sector could complete first while the one that is part of the lazy snapshot deals with being backed up in the snapshot.
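As a rough illustration, suppose two requests each write the same two sectors, and each sector's pair of writes can land in either order independently. The sketch below is a toy model under that assumption (the names `final_image` and `orders`, and the fill letters, are made up for the example); it enumerates every possible final image.

```python
from itertools import product

SECTORS = 2  # both requests cover the same two sectors


def final_image(per_sector_order):
    """Return the final contents, one letter per sector.

    Request 1 writes "A" and request 2 writes "B". For each sector,
    the request that is applied last determines what is left on disk.
    """
    image = []
    for order in per_sector_order:
        contents = None
        for request in order:  # later entries overwrite earlier ones
            contents = "A" if request == 1 else "B"
        image.append(contents)
    return "".join(image)


orders = [(1, 2), (2, 1)]  # possible application orders for a single sector
outcomes = sorted({final_image(combo) for combo in product(orders, repeat=SECTORS)})
torn = [image for image in outcomes if len(set(image)) > 1]

print(outcomes)  # ['AA', 'AB', 'BA', 'BB']
print(torn)      # ['AB', 'BA']
```

The two torn images are the ones that correspond to neither request in its entirety.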

Okay, but if the write is a single sector, then will those be atomic? Am I sure that I will end up either with the entire sector from the first write or the entire sector from the second write?

I’m not so sure about even that.

So-called Advanced Format drives have a mode known as 512e where the firmware reports a sector size of 512 bytes, but the physical storage unit is something larger, like 4KB. In that case, writes that are not exact multiples of the physical storage unit are internally performed by the firmware as read-modify-write operations. You can run into pathological cases where a partition is created at an offset that is not an exact multiple of the physical storage unit size. Or you might have a VHD file (whose virtual sector size is always 512 bytes)¹ being stored on a host drive whose physical storage unit is 4KB.
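The alignment arithmetic can be sketched as follows. The function `physical_blocks_touched` is hypothetical, and the two partition start offsets (63 and 2048 logical sectors) are illustrative assumptions rather than values taken from the text. A physical block that is only partially covered by a write is one the firmware would have to read, modify, and write back.

```python
PHYSICAL = 4096  # physical storage unit, in bytes
LOGICAL = 512    # sector size reported by a 512e drive, in bytes


def physical_blocks_touched(partition_start, offset, length):
    """Return (block_index, is_partial) for each physical block a write touches.

    `partition_start` is the partition's byte offset on the drive, and
    `offset` is the write's byte offset within the partition.
    """
    start = partition_start + offset
    end = start + length  # exclusive
    result = []
    for block in range(start // PHYSICAL, (end - 1) // PHYSICAL + 1):
        block_start = block * PHYSICAL
        block_end = block_start + PHYSICAL
        partial = start > block_start or end < block_end
        result.append((block, partial))
    return result


# Partition starting at logical sector 2048 (a multiple of 4096 bytes).
aligned = physical_blocks_touched(2048 * LOGICAL, 0, 4096)
# Partition starting at logical sector 63 (not a multiple of 4096 bytes).
misaligned = physical_blocks_touched(63 * LOGICAL, 0, 4096)

print(aligned)     # [(256, False)]
print(misaligned)  # [(7, True), (8, True)]
```

In the misaligned case, a single 4KB write from the file system's point of view turns into two partial updates of physical blocks.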

I haven’t worked out the math, but I wouldn’t be surprised if there was some horrible pathological case where you have a misaligned VHD stored on a misaligned partition, mounted inside a virtual machine, hosted on a 4KB-native drive, and then exactly the wrong sequence of I/O operations is issued.

So just assume it can happen. If you need writes to complete in a specific order, then don’t overlap them. Wait for the previous write to complete before issuing the next one.
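Here is a minimal sketch of that advice, with Python threads standing in for overlapped I/O. It does not use the Windows API; the helper `write_at`, the temporary file, and the fill bytes are assumptions made for the example. The point is only the shape of the code: the second write is not issued until the first has reported completion.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_at(path, offset, payload):
    """Write payload at the given offset, then close the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)


# Create a zero-filled 12 KB scratch file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(bytes(12288))

with ThreadPoolExecutor(max_workers=2) as pool:
    first = pool.submit(write_at, path, 0, b"A" * 8192)
    first.result()   # wait for the first write to complete
    second = pool.submit(write_at, path, 4096, b"B" * 8192)
    second.result()  # only then issue and wait for the second write

with open(path, "rb") as f:
    contents = f.read()
os.remove(path)

print(contents[4096:8192] == b"B" * 4096)  # True
```

Because the requests never overlap in time, the overlapping region reliably comes from the second write, at the cost of giving up the concurrency.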

Bonus chatter: If the target of the write isn’t even a physical device but is instead something like an in-memory cache, then even more possibilities open up. In theory, you could even get byte-by-byte inconsistency, if the CPU scheduler is sufficiently devious.

¹ The VHDX virtual sector size is 4KB, so this sort of misalignment on Advanced Format drives is not going to happen for VHDX files, assuming the partition is 4KB-aligned.



  • Antonio Rodríguez 0

    All this could be summarized by “parallel operations, by default, run in parallel”.

    If you are issuing multiple writes to the same part of a file from the same thread, you should go back, think about what you are doing, and coalesce those multiple writes into a single, larger one, perhaps by placing the data in a memory buffer. Processors nowadays are thousands of times faster than SSDs (and millions faster than physical hard drives), so this is an obvious optimization.

    I assume this article talks about operations issued from the same thread. If they were issued from different threads or processes, it wouldn’t make sense to talk about the order they were issued in the first place. Threads from the same process could be arranged to write to a common buffer, or there could be one thread in charge of writing to the file. It would add complexity, but it could be done. Writes from different processes would be a horse of a different color: you could end up having to create a server process of some kind, which maybe would not be worth the effort.

    • Simon Farnsworth 0

      Storage stacks often do a lot of these optimizations anyway, even if you don’t – the OS kernel has writes go to an in-memory cache, and the OS then coalesces writes etc before sending them onto the storage controller. This acts the same way as the server process you propose for writes from different processes; as a nice side effect, it means that if you have multiple writes overlapping in both time and space, only one of them is normally sent to the storage device.

      In turn, storage controllers also do these sorts of optimizations; every HDD I’ve owned, and all the NVMe and SATA SSDs I’ve had don’t write straight to the non-volatile medium. Instead, they write to an on-device memory cache, and then write that cache out-of-order to the physical medium. Depending on the host interface, you either get completion of the I/O request when it’s in the device volatile cache, or when it’s written to physical medium (and there’s a *huge* set of complexity for OSes because there are also commands for waiting until the volatile cache is flushed to disk).

      • Antonio Rodríguez 0

        Yes, the good old “I don’t need to do this because somebody will end up doing it”. As in, I don’t need to do the dishes because my flatmate will eventually do them for me. Why bother with destroying windows or closing files at process end, then? The OS will do it, too…

        • Simon Farnsworth 1

          Indeed – why bother closing files at process end, when the OS will do it for you, and faster? Why do the dishes by hand if you have a dishwasher that does them for you? Why duplicate what the OS already does for you, when you can rely on the OS doing a good job?

          I’ve written more than one utility that just leaks all the dynamically allocated memory and files it opens – it’s only going to run for a few seconds, and the OS-level cleanup is faster than having my code do the cleanup, and then exiting.
