In the commit-on-demand pattern, what happens if an access violation straddles multiple pages?

Raymond Chen

Some time ago, I discussed the technique of reserving a block of address space and committing memory on demand. In the code, I left the exercise

    // Exercise: What happens if the faulting memory access
    // spans two pages?

As far as I can tell, nobody has addressed the exercise, so I’ll answer it.

If the faulting memory access spans two pages, neither of which is present, then an access violation is raised for one of the pages. (The processor chooses which one.) The exception handler commits that page and then requests execution to continue.

When execution continues, the instruction tries to access the memory again, and the access still fails because the other required page is still missing. But this time the faulting address will be an address on that missing page.

In practice, what happens is that the access violation is raised repeatedly until all of the problems are fixed. Each time it is raised, an address is reported which, if repaired, would allow the instruction to make further progress. The hope is that eventually, you will fix all of the problems,¹ and execution can resume normally.
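The retry loop can be sketched as a small simulation in portable C rather than real Win32 code. Everything here is hypothetical: the `committed` array models the reserved region, `run_access` plays the role of the exception handler plus the restarted instruction, and reporting the lowest missing page first is an assumption (the processor chooses which one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 16u

/* Hypothetical model: which pages of the reserved region are committed. */
static bool committed[NUM_PAGES];

/* Returns the index of a page that the access [addr, addr + size) needs
   but which is not committed, or -1 if every required page is present.
   Assumption: the lowest missing page is the one reported. */
static int first_missing_page(uint32_t addr, uint32_t size)
{
    assert(size > 0 && addr + size <= NUM_PAGES * PAGE_SIZE);
    uint32_t first = addr / PAGE_SIZE;
    uint32_t last = (addr + size - 1) / PAGE_SIZE;
    for (uint32_t p = first; p <= last; p++) {
        if (!committed[p]) return (int)p;
    }
    return -1;
}

/* Retries the access until it succeeds, committing one page per fault.
   Returns the number of faults taken along the way. */
static int run_access(uint32_t addr, uint32_t size)
{
    int faults = 0;
    int page;
    while ((page = first_missing_page(addr, size)) >= 0) {
        committed[page] = true; /* stand-in for committing the page */
        faults++;
    }
    return faults;
}
```

In this model, a 4-byte access at offset 4094 takes two faults the first time (one per page) and none on a repeat, since both pages are then committed.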

Bonus chatter: For the x86-64 and x86-32 instruction sets, I think the largest number of pages required by a single instruction is six, for the movsw instruction. This reads two bytes from ds:rsi/esi, and writes them to es:rdi/edi. If both addresses straddle a page boundary, that’s four data pages. And the instruction itself is two bytes, so it can straddle two code pages, for a total of six. (There are other things that could go wrong, like an LDT page miss, but those will be handled in kernel mode and are not observable in user mode.)
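The page count can be checked with a little arithmetic. This is only a sketch: `pages_touched` and `movsw_pages` are hypothetical helpers, the page size is assumed to be 4KB, and the total assumes the code, source, and destination ranges do not share any pages.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of distinct pages touched by an access of `size` bytes
   (size > 0) starting at `addr`. */
static unsigned pages_touched(uint64_t addr, unsigned size)
{
    return (unsigned)((addr + size - 1) / PAGE_SIZE - addr / PAGE_SIZE) + 1;
}

/* Pages needed by one movsw: a 2-byte instruction fetch, a 2-byte read
   from the source, and a 2-byte write to the destination.
   Assumption: the three ranges are on disjoint pages. */
static unsigned movsw_pages(uint64_t code, uint64_t src, uint64_t dst)
{
    return pages_touched(code, 2) + pages_touched(src, 2) +
           pages_touched(dst, 2);
}
```

With every address placed on the last byte of a page, for example `movsw_pages(0x1FFF, 0x3FFF, 0x5FFF)`, each of the three accesses touches two pages, giving six. With page-aligned addresses, the same instruction needs only three.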

Bonus exercises: I may as well answer the other exercises on that page. We don’t have to worry about integer overflow in the calculation of sizeof(WCHAR) * (Result + 1) because we have already verified that Result is in the range [1, MaxChars), so Result + 1 ≤ MaxChars, and we also know that MaxChars = Buffer.Length / sizeof(WCHAR), so multiplying both sides by sizeof(WCHAR) tells us that sizeof(WCHAR) * (Result + 1) ≤ Buffer.Length.
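The overflow argument can be written out as a range check followed by the multiplication. This is a sketch under assumptions: `WCHAR16` stands in for the 16-bit Windows `WCHAR`, `copy_size` is a hypothetical helper, and the `size_t` parameter types are chosen for illustration rather than taken from the original code.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t WCHAR16; /* stand-in for the 16-bit Windows WCHAR */

/* Returns sizeof(WCHAR) * (result + 1), or 0 if result is outside
   [1, max_chars), where max_chars = buffer_length / sizeof(WCHAR). */
static size_t copy_size(size_t buffer_length, size_t result)
{
    size_t max_chars = buffer_length / sizeof(WCHAR16);
    if (result < 1 || result >= max_chars) {
        return 0;
    }
    /* Here result + 1 <= max_chars, so the product is at most
       buffer_length and therefore cannot overflow. */
    return sizeof(WCHAR16) * (result + 1);
}
```

Even at the extreme, with `buffer_length` equal to `SIZE_MAX` and `result` equal to `max_chars - 1`, the product comes out to `SIZE_MAX - 1`, which still fits.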

For the final exercise, we use CopyMemory instead of StringCchCopy because the result may contain embedded nulls, and we don’t want to stop copying at the first null.
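The difference is easy to see in a portable sketch, where `memcpy` stands in for CopyMemory and `wcscpy` stands in for StringCchCopy (which also stops at the first null). The helper names `copy_counted` and `copy_string` are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counted copy: transfers exactly `count` characters, embedded nulls
   included. memcpy is the portable stand-in for CopyMemory. */
static void copy_counted(wchar_t *dst, const wchar_t *src, size_t count)
{
    memcpy(dst, src, count * sizeof(wchar_t));
}

/* String copy: stops after the first null terminator. wcscpy is the
   portable stand-in for a null-terminated copy such as StringCchCopy. */
static void copy_string(wchar_t *dst, const wchar_t *src)
{
    wcscpy(dst, src);
}
```

Given the six-character buffer `L"ab\0cd"` (counting the final terminator), `copy_counted` with a count of 6 preserves the `cd` after the embedded null, whereas `copy_string` copies only `ab` and its terminator.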

¹ Though it’s possible that your attempt to fix one problem may undo a previous fix, putting you into an infinite cycle of repair.

11 comments

LH_Mouse November 25, 2025

> This reads two bytes from es:rsi/esi, and writes them to ds:rdi/edi.

The segment registers are reversed. For x86-32 it’s `movsw es:[edi], ds:[esi]` i.e. from DS:ESI to ES:EDI.
Gil-Ad Ben Or November 22, 2025

Regarding the linked article about alignment and page faults on bank-switched graphics cards: how were buffers with 24 bits of data per pixel handled? Or were those modes just unsupported?

The issue is that since 64k bytes is not divisible by 3, and you usually need a pixel granularity if you aren’t using some kind of buffering…
- Simon Farnsworth November 24, 2025
  
  I can’t find a card out there that does packed 24 bit per pixel (as opposed to 32 bit per pixel with 8 unused bits, or separate R, G and B planes), and that also doesn’t support either linear addressing of the entirety of VRAM or two independent windows into VRAM.
  
  I therefore suspect that cards like you describe weren’t handled because Microsoft never encountered them.
  
  Note that x86 processors can’t do pixel granularity accesses to 24 bit packed pixels – they offer 8 bit and 16 bit accesses up until the 80286, 32 bit in the 80386, and 64 bit added in the Pentium. Thus, you have to either split your 24 bit accesses into a 16 bit and an 8 bit (or three 8 bit accesses) or use a 32 bit access and write to multiple pixels at once. This also means that, should such a card exist, you already handled the problem anyway, because you had no pixel granularity access to use.
  
  - Thiago Macieira November 25, 2025
    
    The AVX2 versions have a comment “A fault exits the instruction” (VGATHERDPS, VPGATHERDD), which is missing from the AVX512 versions. I’m not certain the instructions can be restarted, in any version. There is no architecturally-visible state changing while the instruction is running. In fact, the docs say “A given implementation of this instruction is repeatable – given the same input values and architectural state, the same set of elements to the left of the faulting one will be gathered” implying that it will re-gather the elements it had previously gathered and therefore all 32 pages must be present.
    
    Interestingly, the instruction page(s) do not. Once the instruction has been decoded and dispatched, the front-end no longer requires the pages to be present to execute. So the VMM could evict the instruction pages to only keep the 32 data pages.
    
    AMX’s TILELOADD instruction could load from more than 16 locations – it can load 16 rows and with a non-zero stride. But like the REP string functions, TILELOADD (and the store version) are specifically documented to be restartable and do modify architectural state.
    
  - Raymond Chen Author November 24, 2025
    
    @Fabian on VGATHER: That’s hilarious. Though technically, it’s not 34 simultaneously present pages. You presumably need only 4 simultaneously present pages (2 for the instruction, and 2 for the data being gathered). So you could cycle through the data pages 2 at a time and eventually finish. The funny thing about movsw is that you need 6 pages to be present simultaneously!
  - Fabian Giesen November 24, 2025
    
    As of the introduction of VGATHER vector instructions (AVX2) and VSCATTER (AVX-512), the max number of page faults caused by a single x86 instruction has gone way up!
    
    These use the same strategy as the REP MOVS/CMPS/SCAS etc. family of instructions: they are carefully specified so they can make partial progress up to the location of the first page fault, and then save their current state to registers so they can resume from there (rather than retrying from the beginning). Namely, they update RSI/RDI/RCX with the current source/dest pointers and remaining count “as they go”. They don’t literally increment it every byte or anything of the sort, but crucially, they are specified in such a way that this would be a valid implementation. This ensures forward progress.
    
    Gather/Scatter work the same way. Specifically, they _always_ take a mask register (that, from the user PoV, ends up cleared when the instruction finishes) for which vector lanes are active. If these instructions hit a page fault, they clear the mask for the lanes they already processed, so that (hopefully) with every page fault, the number of remaining unprocessed vector lanes decreases. So something like a 16-lane AVX-512 VPGATHERDD with every address unaligned and spanning a page boundary can end up producing something like 34 page faults from a single instruction. (2 from the instruction itself, if it spans a page boundary, and 32 from the data accesses.)
    
  - Fabian Giesen November 24, 2025
    
    I did use SVGAs with bank switching in the mid-90s with 24-bit pixels and the answer is that bank switches could and did happen in the middle of pixels.
    
    In practice, in the ISA days, the bus was 8-bit anyway. As long as you wrote 24-bit pixels with 3 byte writes, no problem. If you tried to do unaligned 2- or 4-byte writes, yes, you couldn’t do that while crossing 64k boundaries (that would require bank switching) and expect it to work.
Adam Rosenfield November 21, 2025

> As far as I can tell, nobody has addressed the exercise, so I’ll answer it.

I suspect the exercise was previously answered in the comments, but those comments have long since been deleted. It’s true that there are no comments on that 2012 blog post now, but I do recall that this blog has gone through multiple backend migrations over its long history, and at one point all of the old comments were deleted. All of the other blog posts from that era have no comments on them as well. The oldest snapshot on The Wayback Machine is from 2019, and that shows no comments from that time too. There may be older snapshots under a different URL, but I’m not sure what the previous URL(s) were.

IIRC it was because of GDPR regulations that required users to be allowed to delete their comments, and there wasn’t a sufficiently strong identity mechanism attached to the old comments to allow the original authors to log in and delete them, so the only way to comply was to migrate to a stronger authentication system and delete all of the old comments.

- LB November 22, 2025
  
  Wasn’t that all originally explained by Mr. Chen himself when it happened? He might have access to the archived comments that can’t be published.
- Brian Boorman November 21, 2025
  
  The original blog post had 14 comments. Unfortunately it doesn’t look like Wayback was able to capture the comments due to how the old blog site scripting worked.
  
  Wayback Link to old blogs.msdn.com
Melissa P November 20, 2025

“For the x86-64 and x86-32 instruction sets, I think the most number of pages required by a single instruction is six” — depends on if you count e.g. “rep movsw” as a single instruction or treat it as multiple instructions where rip doesn’t change until rcx is zero; theoretically you can have nearly unlimited page faults with that one; the CPU trace flag triggers after every single data transfer so I’d say it’s not a single instruction

the trickier question is now… what’s the order of the 6 page faults with “movsw”… code first of course… then data… but then load/store order is undefined by AMD/Intel so it could be possible that the instruction fetcher pulls the 2nd code page first and then the 1st page; or for read/write, the memory fetcher may pull the 2nd write page first, then the 1st read page and then the rest.. who knows; and yes write instructions need to pull memory first into the D caches

fascinating stuff… I wonder what happens to the order if you movsw by reading from the B/C page boundary writing into the A/B page boundary
