{"id":110486,"date":"2024-11-07T07:00:00","date_gmt":"2024-11-07T15:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110486"},"modified":"2024-11-08T08:19:48","modified_gmt":"2024-11-08T16:19:48","slug":"20241107-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241107-00\/?p=110486","title":{"rendered":"Why do I observe reads from a memory-mapped file when writing large blocks?"},"content":{"rendered":"<p>A customer had created a memory-mapped file, and wanted to set large chunks of the memory to zero. For concreteness, consider this function that takes a memory-mapped file mapping (<code>map<\/code>) and a collection of blocks, each represented as a file offset and length. Its job is to set each block to zero.<\/p>\n<pre>void ZeroOutBlocks(uint8_t* map,\r\n    std::initializer_list&lt;\r\n        std::pair&lt;uint32_t, uint32_t&gt;&gt; blocks)\r\n{\r\n    for (auto&amp;&amp; block : blocks) {\r\n        memset(map + block.first, 0, block.second);\r\n    }\r\n}\r\n<\/pre>\n<p>The customer found that even though this function is performing only write operations, a performance trace showed that the system was nevertheless reading from the file.<\/p>\n<p>Memory-mapped files work by trapping the first access to each page. When the access occurs, the entire page is read from the disk, mapped into memory, and then the memory access operation is permitted to proceed.<\/p>\n<p>The system does this regardless of whether the access was a read or a write. After all, that one written byte has to be merged with the existing content of the page.<\/p>\n<p>&#8220;But in my case, the offsets and lengths are all page multiples, so there is no need to read the entire page into memory.&#8221;<\/p>\n<p>Well, you know that, but the CPU doesn&#8217;t.<\/p>\n<p>The CPU sees the first write to the page and traps to the kernel. The operating system doesn&#8217;t go to the trouble of analyzing the code surrounding the fault and realizing that this code sequence:<\/p>\n<pre>@@: vmovntdq ymmword ptr [rcx],ymm0\r\n    vmovntdq ymmword ptr [rcx+20h],ymm0\r\n    vmovntdq ymmword ptr [rcx+40h],ymm0\r\n    vmovntdq ymmword ptr [rcx+60h],ymm0\r\n    vmovntdq ymmword ptr [rcx+80h],ymm0\r\n    vmovntdq ymmword ptr [rcx+0A0h],ymm0\r\n    vmovntdq ymmword ptr [rcx+0C0h],ymm0\r\n    vmovntdq ymmword ptr [rcx+0E0h],ymm0\r\n    add     rcx,100h\r\n    sub     r8,100h\r\n    cmp     r8,100h\r\n    jae     @B\r\n<\/pre>\n<p>is a <code>memset<\/code> loop that writes <code>r8<\/code> \u00f7 256 \u00d7 8 copies of the <code>ymm0<\/code> register to consecutive bytes of memory.<\/p>\n<p>But suppose the operating system had code to recognize the top 50 most common implementations of <code>memset<\/code>. And suppose that it saw that the <code>memset<\/code> was going to write an entire page of zeroes. In that case, could it avoid the useless read?<\/p>\n<p>I guess the operating system could do that. It would have to realize that this was a full-page memset and perform the same memset on the newly-mapped page before making the memory visible to the process (in case other threads read from the page before the faulting thread finishes the loop).<\/p>\n<p>Still, that&#8217;s a lot of <code>memset<\/code> detection, since it would have to run at every write page fault. I suspect the write page faults that are due to <code>memset<\/code> of more than one page represent too small a fraction of total write page faults to be worth the trouble.<\/p>\n<p>But who knows. <a title=\"Zeroing out my memory does cause them to page in faster after all\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20161109-00\/?p=94675\"> My intuition on this has been wrong before<\/a>.<\/p>\n<p><b>Update<\/b>: Commenter Nir Lichtman pointed out that memory-mapped files and I\/O are not necessarily coherent, so the <code>Write\u00adFile<\/code> trick doesn&#8217;t work.<\/p>\n<p><del> What you can do instead is write zeroes to the memory-mapped file the old-fashioned way: with <code>Write\u00adFile<\/code>. Since Windows NT unifies memory-mapped files with the file cache, these writes will be coherent with the memory mapping. If you want to get fancy, you can use overlapped writes and wait for them all to complete.<\/del><\/p>\n<p><del> You need only set aside one page of zeroes for this trick. For regions up to a page in size, you can issue a <code>Write\u00adFile<\/code> from your special zero-buffer. Regions larger than a page can be broken up into page-sized chunks, or you can use <code>Write\u00adFile\u00adGather<\/code> to write that page to consecutive pages in the file.<\/del><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The CPU doesn&#8217;t see the entire write at once.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110486","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>The CPU doesn&#8217;t see the entire write at once.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110486","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110486"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110486\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110486"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110486"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110486"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}