{"id":109636,"date":"2024-04-12T07:00:00","date_gmt":"2024-04-12T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=109636"},"modified":"2024-04-06T08:53:14","modified_gmt":"2024-04-06T15:53:14","slug":"20240412-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20240412-00\/?p=109636","title":{"rendered":"The case of the string being copied from a mysterious pointer to invalid memory"},"content":{"rendered":"<p>A customer ran some stress tests on their program with Application Verifier enabled. Thanks for doing that!<\/p>\n<p>They found that there were rare but repeated crashes where their program appeared to be copying a string from invalid memory due to a pointer that didn&#8217;t appear to match the member variable it was supposed to have come from. One such hit might be chalked up to a flaky CPU, but they had three, from three different machines.<\/p>\n<p>Since this shows up in stress testing, it&#8217;s not practical to collect a Time Travel Trace, but let&#8217;s see what we can do with the crash dumps that were produced.<\/p>\n<pre>This dump file has an exception of interest stored in it.\r\nThe stored exception information can be accessed via .ecxr.\r\n(5db8.4180): Access violation - code c0000005 (first\/second chance not available)\r\nFor analysis of this file, run !analyze -v\r\n\r\nrax=000001e081072fe0 rbx=000001e080fe8fc8 rcx=000001e081072fe0\r\nrdx=000001e081070ff0 rsi=000001e081070fe0 rdi=000000000000000f\r\nrip=00007ffc08ab143a rsp=000000b3ed0ffa28 rbp=000001e081072fe0\r\n r8=0000000000000010  r9=0000000000000000 r10=00007ffc08ab0000\r\nr11=000001e0fa3bfde0 r12=0000000000000000 r13=0000000000000000\r\nr14=000000000000001f r15=000000000000022c\r\nVCRUNTIME140!memcpy+0x12a:\r\n00007ffc`08ab143a movdqu  xmm2,xmmword ptr [rdx+r8-10h] ds:000001e0`81070ff0=????????\r\n0:005&gt; kn\r\n # RetAddr           Call Site\r\n00 00007ff6`06144342 VCRUNTIME140!memcpy+0x12a\r\n01 (Inline Function) contoso!std::_Char_traits&lt;char,int&gt;::copy+0x13\r\n02 (Inline Function) contoso!std::string::_Construct+0xba\r\n03 (Inline Function) contoso!std::string::{ctor}+0xd3\r\n04 00007ff6`06149dcf contoso!Widget::GetId+0x202\r\n05 00007ff6`0614742d contoso!&lt;lambda_9cd1b560a470dbb5c6e1d5da28bd3866&gt;::operator()+0xc0\r\n09 00007ffc`22785976 ntdll!TppSimplepExecuteCallback+0xa3\r\n0a 00007ffc`2172257d ntdll!TppWorkerThread+0x8f6\r\n0b 00007ffc`227aaa78 kernel32!BaseThreadInitThunk+0x1d\r\n0c 00000000`00000000 ntdll!RtlUserThreadStart+0x28\r\n<\/pre>\n<p>So we have a lambda that called <code>Widget::GetId<\/code>, and we crashed trying to copy a string. Here&#8217;s what <code>Widget::GetId<\/code> looks like:<\/p>\n<pre>struct Widget\r\n{\r\n    std::string GetId()\r\n    {\r\n        if (m_uniqueId.empty())\r\n        {\r\n            m_uniqueId = SlowGetId();\r\n        }\r\n        return m_uniqueId; \/\/ &lt;&lt;&lt; crash here\r\n    }\r\n\r\nprivate:\r\n    std::string m_uniqueId;\r\n\r\n    std::string SlowGetId();\r\n};\r\n<\/pre>\n<p>If we dig into the STL code, we are in the <code>copy<\/code> method at the point where we copy the characters from the string into a newly-allocated buffer.<\/p>\n<pre>        _CSTD memcpy(_First1, _First2, _Count * sizeof(_Elem));\r\n<\/pre>\n<p>The address we are trying to copy from is in the <code>rdx<\/code> register, <code>000001e0`81070ff0<\/code>, which is presumably the string buffer hiding inside <code>m_uniqueId<\/code>, but when we ask the debugger for the contents of <code>m_uniqueId<\/code>, we see something different:<\/p>\n<pre style=\"white-space: pre-wrap;\">0:005&gt; ?? this-&gt;m_uniqueId\r\nclass std::string\r\n   +0x000 _Mypair          : std::_Compressed_pair&lt;std::allocator&lt;char&gt;,std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;,1&gt;\r\n0:005&gt; ?? this-&gt;m_uniqueId._Mypair\r\nclass std::_Compressed_pair&lt;std::allocator&lt;char&gt;,std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;,1&gt;\r\n   +0x000 _Myval2          : std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;\r\n0:005&gt; ?? this-&gt;m_uniqueId._Mypair._Myval2\r\nclass std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;\r\n   +0x000 _Bx              : std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;::_Bxty\r\n   +0x010 _Mysize          : ??\r\n   +0x018 _Myres           : 0x10\r\n0:005&gt; ?? this-&gt;m_uniqueId._Mypair._Myval2._Bx\r\nunion std::_String_val&lt;std::_Simple_types&lt;char&gt; &gt;::_Bxty\r\n   +0x000 _Buf             : [16]  \"???\"\r\n   +0x000 _Ptr             : <span style=\"border: solid 1px currentcolor;\">0x000001e0`81074ff0  \"fdf551a3ebd7f381\"<\/span>\r\n   +0x000 _Alias           : [16]  \"???\"\r\n0:005&gt;\r\n<\/pre>\n<p>In memory, the <code>m_uniqueId<\/code> of the widget is a reasonable-looking string of hex digits, and the pointer is nothing like the address we crashed on.<\/p>\n<p>I suspected that at the time we started copying, the string was indeed at <code>000001e0`81070ff0<\/code>, but while we were copying the string, another thread came in and changed the <code>m_uniqueId<\/code>, which freed the string out from under us.<\/p>\n<p>Since this crash occurred when running under Application Verifier, we can ask Application Verifier for the histories of these two memory blocks.<\/p>\n<pre>0:005&gt; !avrf -?\r\nVerifier package version &gt;= 3.00 \r\nApplication verifier debugger extension                      \r\n                                                             \r\n...\r\n!avrf -hp N           dump last N entries from heap log.     \r\n!avrf -hp -a ADDR     searches ADDR in the heap log.         \r\n...\r\n<\/pre>\n<p>Great, we can use the <code>!avrf -hp -a<\/code> command to ask AppVerifier to tell us what it knows about an address on the heap.<\/p>\n<pre>0:005&gt; !avrf -hp -a 0x000001e0`81070ff0\r\n\r\nSearching call tracker @ 000001e0fe042fc0 with 422 valid entries ...\r\n--------------------------------------------------------------\r\n2024-04-12T07:00:20.140Z <span style=\"border: solid 1px currentcolor;\">GlobalIndex 19E1E6<\/span> ThreadId NA\r\nHeapFree: 1E081070FF0 11 1E0FBB60000 <span style=\"border: solid 1px currentcolor;\">7870<\/span>\r\n\r\n    00007ffc1fae37eb: ucrtbase!_free_base+0x1B\r\n    00007ffb45614ff7: vfbasics!AVrfp_ucrt_free+0x57\r\n    00007ff606144219: contoso!Widget::GetId+0xD9\r\n    00007ff606149400: contoso!&lt;lambda_963292f271044b3ead3564f3abbc4b26&gt;::operator()+0xb7\r\n    00007ffc2279e4a3: ntdll!TppSimplepExecuteCallback+0xA3\r\n    00007ffc22785976: ntdll!TppWorkerThread+0x8F6\r\n    00007ffc2172257d: KERNEL32!BaseThreadInitThunk+0x1D\r\n    00007ffc227aaa78: ntdll!RtlUserThreadStart+0x28\r\n\r\n--------------------------------------------------------------\r\n2024-04-12T07:00:20.139Z <span style=\"border: solid 1px currentcolor;\">GlobalIndex 19E1E0<\/span> ThreadId NA\r\nHeapAlloc: 1E081070FF0 11 1E0FBB60000 <span style=\"border: solid 1px currentcolor;\">4180<\/span>\r\n\r\n    00007ffc22854438: ntdll!RtlDebugAllocateHeap+0x48\r\n    00007ffc2280d6f0: ntdll!RtlpAllocateHeap+0x7EAB0\r\n    00007ffc2278cd49: ntdll!RtlpAllocateHeapInternal+0x6C9\r\n    00007ffbf8bcc3dc: vrfcore!VfCoreRtlAllocateHeap+0x2C\r\n    00007ffb456137d5: vfbasics!AVrfpRtlAllocateHeap+0x155\r\n    00007ffc1fae1b06: ucrtbase!_malloc_base+0x36\r\n    00007ffb45614e20: vfbasics!AVrfp_ucrt_malloc+0x40\r\n    00007ff6061491f3: contoso!operator new+0x1F\r\n    00007ff606144365: contoso!Widget::SlowGetId+0x25\r\n    00007ff606144185: contoso!Widget::GetId+0x45\r\n    00007ff60614742d: contoso!&lt;lambda_9cd1b560a470dbb5c6e1d5da28bd3866&gt;::operator()+0xc0\r\n    00007ffc22785976: ntdll!TppSimplepExecuteCallback+0xa3\r\n    00007ffc22785976: ntdll!TppWorkerThread+0x8F6\r\n    00007ffc2172257d: KERNEL32!BaseThreadInitThunk+0x1D\r\n    00007ffc227aaa78: ntdll!RtlUserThreadStart+0x28\r\n<\/pre>\n<p>And the history for the string in <code>m_uniqueId<\/code> right now is<\/p>\n<pre>0:005&gt; !avrf -hp -a 0x000001e0`81074ff0\r\n\r\nSearching call tracker @ 000001e0fe042fc0 with 422 valid entries ...\r\n--------------------------------------------------------------\r\n2024-04-12T07:00:20.140Z <span style=\"border: solid 1px currentcolor;\">GlobalIndex 19E1E3<\/span> ThreadId NA\r\nHeapAlloc: 1E081074FF0 11 1E0FBB60000 <span style=\"border: solid 1px currentcolor;\">7870<\/span>\r\n\r\n    00007ffc22854438: ntdll!RtlDebugAllocateHeap+0x48\r\n    00007ffc2280d6f0: ntdll!RtlpAllocateHeap+0x7EAB0\r\n    00007ffc2278cd49: ntdll!RtlpAllocateHeapInternal+0x6C9\r\n    00007ffbf8bcc3dc: vrfcore!VfCoreRtlAllocateHeap+0x2C\r\n    00007ffb456137d5: vfbasics!AVrfpRtlAllocateHeap+0x155\r\n    00007ffc1fae1b06: ucrtbase!_malloc_base+0x36\r\n    00007ffb45614e20: vfbasics!AVrfp_ucrt_malloc+0x40\r\n    00007ff6061491f3: contoso!operator new+0x1F\r\n    00007ff606144365: contoso!Widget::SlowGetId+0x25\r\n    00007ff606144185: contoso!Widget::GetId+0x45\r\n    00007ff606149400: contoso!&lt;lambda_963292f271044b3ead3564f3abbc4b26&gt;::operator()+0xb7\r\n    00007ffc2279e4a3: ntdll!TppSimplepExecuteCallback+0xA3\r\n    00007ffc22785976: ntdll!TppWorkerThread+0x8F6\r\n    00007ffc2172257d: KERNEL32!BaseThreadInitThunk+0x1D\r\n    00007ffc227aaa78: ntdll!RtlUserThreadStart+0x28\r\n<\/pre>\n<p>The timestamps and GlobalIndex let us reconstruct the chronology of what happened.<\/p>\n<p>First (GlobalIndex <tt>19E1E0<\/tt>), the original string was allocated when a lambda called <code>Widget::<wbr \/>GetId<\/code> from thread <tt>4180<\/tt>. Which happens to be the thread we&#8217;re on right now:<\/p>\n<pre>0:005&gt; ~.\r\n.  5  Id: 5db8.<span style=\"border: solid 1px currentcolor;\">4180<\/span> Suspend: 0 Teb: 000000b3`ecb84000 Unfrozen\r\n      Start: ntdll!TppWorkerThread (00007ffc`22785080)\r\n      Priority: 0  Priority class: 32  Affinity: ff\r\n<\/pre>\n<p>The <code>GetId()<\/code> called <code>SlowGetId()<\/code>, and that&#8217;s what actually allocated the string.<\/p>\n<p>Next (GlobalIndex <tt>19E1E3<\/tt>), thread <tt>7870<\/tt> allocated a replacement string, also through <code>SlowGetId()<\/code>.<\/p>\n<p>And then (GlobalIndex <tt>19E1E6<\/tt>), thread <tt>7870<\/tt> freed the first string.<\/p>\n<p>And finally, we crashed because thread <tt>4180<\/tt> (the current thread) tried to copy from that first string, which was already freed.<\/p>\n<p>My guess is that we had a race condition where two threads called <code>GetId()<\/code> at the same time, and both of them decided to do the lazy initialization of <code>m_uniqueId<\/code>.<\/p>\n<p>Thread <tt>4180<\/tt> enters <code>GetId()<\/code> and sees that <code>m_uniqueId<\/code> is empty, so it calls <code>SlowGetId()<\/code> to get the ID.<\/p>\n<p>While <code>SlowGetId()<\/code> is doing its slow work, thread <tt>7870<\/tt> calls <code>GetId()<\/code>, and it too sees that <code>m_uniqueId<\/code> is empty, so it also calls <code>SlowGetId()<\/code> to get the ID.<\/p>\n<p>Thread <tt>4180<\/tt> finishes getting the ID from <code>SlowGetId()<\/code> and saves it in <code>m_uniqueId<\/code>. It then makes a copy of <code>m_uniqueId<\/code> to return to the caller.<\/p>\n<p>While this copy is being made, thread <tt>7870<\/tt> finishes its call to <code>SlowGetId()<\/code> and (here&#8217;s where things go bad) saves it in <code>m_uniqueId<\/code>, which causes the previous string to become freed, even though thread <tt>4180<\/tt> is still copying from it!<\/p>\n<p>The problem is therefore conflicting multithreaded access to <code>m_uniqueId<\/code>: One thread is reading while another is writing.<\/p>\n<p>The design of the <code>Widget<\/code> class is apparently that multithreaded use is allowed, but simultaneous multithreaded use is permitted only for read operations. You cannot have a write operation concurrent with a read or another write.<\/p>\n<p>The lambdas that called <code>GetId()<\/code> were careful to lock a <code>shared_mutex<\/code> in shared mode, thinking that <code>GetId()<\/code> was a read-only operation. I mean, look at its name: It&#8217;s &#8220;Get&#8221;. It just reads something!<\/p>\n<p>Unfortunately, the lazy initialization of <code>m_uniqueId<\/code> made <code>GetId()<\/code> a read-write operation, even though its name sure sounds like a read-only operation.<\/p>\n<p>One way to fix this is to make sure all callers of <code>GetId()<\/code> take an exclusive lock rather than a shared lock before calling <code>GetId()<\/code>.<\/p>\n<p>Another way to fix this is to have <code>GetId()<\/code> apply internal locking:<\/p>\n<pre>    std::string GetId()\r\n    {\r\n        if (auto lock = std::shared_lock(m_sharedMutex);\r\n            !m_uniqueId.empty())\r\n        {\r\n            return m_uniqueId;\r\n        }\r\n\r\n        auto uniqueId = SlowGetId();\r\n        auto lock = std::unique_lock(m_sharedMutex);\r\n        if (m_uniqueId.empty()) {\r\n            m_uniqueId = std::move(uniqueId);\r\n        }\r\n        return m_uniqueId;\r\n    }\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Using AppVerifier to deduce the heap allocation history.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-109636","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Using AppVerifier to deduce the heap allocation history.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109636","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=109636"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109636\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=109636"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=109636"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=109636"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}