{"id":106786,"date":"2022-06-24T07:00:00","date_gmt":"2022-06-24T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106786"},"modified":"2022-06-24T06:46:48","modified_gmt":"2022-06-24T13:46:48","slug":"20220624-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220624-00\/?p=106786","title":{"rendered":"The case of the mysterious over-release from deep inside the marshaling infrastructure"},"content":{"rendered":"<p>A customer had a reference-counting bug where a certain scenario caused a COM object to be over-released. We started by taking a <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows-hardware\/drivers\/debugger\/time-travel-debugging-overview\"> Time Travel Trace<\/a> and gathering all the points where the <code>AddRef<\/code> or <code>Release<\/code> methods were called.<\/p>\n<p>The Time Travel Debugging object model makes it easy to find all the places a method was called and print the results in a nice table.<\/p>\n<pre style=\"white-space: pre-wrap;\">0:040&gt; dx -g @$cursession.TTD.Calls(\"contoso!CWidget::AddRef\").Where(c =&gt; (__int64)c.Parameters[0]==0x1a792dd0).OrderBy(c =&gt; c.TimeStart).Select(c =&gt; new {TimeStart = c.TimeStart, RefCount = c.ReturnValue, ThreadId = c.ThreadId})\r\n<\/pre>\n<p>This expression breaks down as follows:<\/p>\n<ul>\n<li><code>@$cursession.TTD.Calls(\"contoso!CWidget::AddRef\")<\/code> \u2014 Find all the calls to this function<\/li>\n<li><code>.Where(c =&gt; (__int64)c.Parameters[0]==0x1a792dd0)<\/code> \u2014 Filtered to the calls where the <code>this<\/code> pointer (the invisible first parameter) is a specific value<\/li>\n<li><code>.OrderBy(c =&gt; c.TimeStart)<\/code> \u2014 Sorted chronologically<\/li>\n<li><code>.Select(c =&gt; new {TimeStart = c.TimeStart, RefCount = c.ReturnValue, ThreadId = c.ThreadId})<\/code> \u2014 Print these fields<\/li>\n<\/ul>\n<p>Therefore, the command finds all the times in the trace where <code>contoso!CWidget::AddRef<\/code> was called for the object that was over-released, sorts them chronologically, and prints the timestamp, the resulting reference count, and the thread that issued the call.<\/p>\n<p>A similar command finds all the calls to <code>CWidget::<wbr \/>Release<\/code>.<\/p>\n<p>Sort the two lists together by position to get a chronology of events.<\/p>\n<p>Now we can use the <code>!tt<\/code> command to go to each of those time positions and get a stack trace. Many of the <code>AddRef<\/code> and <code>Release<\/code> calls are easily paired because the calls are made from the same function. Two of the <code>AddRef<\/code> calls are never released beause they correspond to references being held by objects that are still live at the end of the trace.<\/p>\n<p>And then there is the batch of unmatched calls, and we need to match them up and figure out which are the unmatched <code>Release<\/code> calls.<\/p>\n<p>The unmatched <code>AddRef<\/code> calls comes from <code>combase!<wbr \/>CDestObjectWrapper::<wbr \/>MarshalInterface<\/code>, which is an internal function called from <code>CoMarshal\u00adInterface<\/code>.<\/p>\n<p>There are two categories of unmatched <code>Release<\/code> calls. One of them belongs to the <code>Release<\/code> of another object <code>CNamedWidget<\/code> that contains the <code>CWidget<\/code> as a member, and another comes from <code>combase!<wbr \/>CStaticMarshaler::<wbr \/>ReleaseMarshalData<\/code>, which is an internal function called from <code>CoRelease\u00adMarshal\u00adData<\/code>.<\/p>\n<p>And it turns out that the two categories of <code>Release<\/code> calls collectively outnumber the number of unmatched <code>AddRef<\/code> calls. So one of those categories of calls is wrong, but which one?<\/p>\n<p>From what we learned earlier, <a title=\"Understanding the marshaling flags: The free-threaded marshaler\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220622-00\/?p=106774\"> when a normal-marshaled object is unmarshaled, you do not observe any change to the reference count<\/a> because the ownership of the reference count is transfered from the stream directly to the unmarshaled object. A breakpoint on the reference count is not going to fire because there is no change to the reference count. It happens behind your back.<\/p>\n<p>If an object is normal-marshaled, it should either be unmarshaled, or the marshal data should be released, but here we&#8217;re doing both, which is the source of the double-release. Why is it doing both?<\/p>\n<p>The stack for the <code>AddRef<\/code> looks like this:<\/p>\n<pre>contoso!CWidget::AddRef\r\ncontoso!CWidget::QueryInterface+0x31\r\ncombase!CStaticMarshaler::MarshalInterface+0x728\r\ncombase!CDestObjectWrapper::MarshalInterface+0x2f4\r\ncombase!CoMarshalInterface+0x2dc\r\ncontoso!CNamedWidget::MarshalInterface+0x208b63\r\ncombase!CDestObjectWrapper::MarshalInterface+0x2f4\r\ncombase!CoMarshalInterface+0x2dc\r\ncombase!GitRegisterHlp+0x2af\r\ncombase!CGIPTable::RegisterInterfaceInGlobalHlp+0x292\r\ncombase!CGIPTable::RegisterInterfaceInGlobal+0x1b\r\ncombase!RoGetAgileReference+0x8a1\r\n<\/pre>\n<p>The marshaling request is coming from the creation of an agile reference, and that is marshaling the <code>CNamedWidget<\/code>, which is in turn marshaling the <code>CWidget<\/code>. After some digging, we noticed an anomaly: For the outer call to <code>Co\u00adMarshal\u00adInterface<\/code> coming from <code>Git\u00adRegister\u00adHlp<\/code>, the flags are<\/p>\n<pre>rax=00007ffb71634d20 rbx=0000000000000000 rcx=00000000042ea568\r\nrdx=00007ffb5a43cbb0 rsi=0000000000000000 rdi=0000000017c009e8\r\nrip=00007ffb7143ebf0 rsp=00000000042ea4d8 rbp=00000000042ea5b1\r\n r8=000000001ac11338  r9=0000000000000003 r10=0000000000000000\r\nr11=3ff8000000000003 r12=0000000000000001 r13=00007ffb5a43cbb0\r\nr14=0000000017c00a38 r15=00000000042ea678\r\niopl=0         nv up ei pl nz na po nc\r\ncs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206\r\ncombase!CoMarshalInterface:\r\n00007ffb`7143ebf0 4055            push    rbp\r\n0:000&gt; dps @rsp+28 L2\r\n00000000`042ea500  00000000`00000000\r\n00000000`042ea508  00007ffb`00000001\r\n<\/pre>\n<p>From this we see that the parameters passed to <code>Co\u00adMarshal\u00adInterface<\/code> are<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Parameter<\/th>\n<th>Passed in<\/th>\n<th>Value<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td><code>pStm<\/code><\/td>\n<td><code>rcx<\/code><\/td>\n<td><code>0x42ea568<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>riid<\/code><\/td>\n<td><code>rdx<\/code><\/td>\n<td><code>0x00007ffb5a43cbb0<\/code><\/td>\n<td>The interface being marshaled<\/td>\n<\/tr>\n<tr>\n<td><code>pUnk<\/code><\/td>\n<td><code>r8<\/code><\/td>\n<td><code>0x1ac11338<\/code><\/td>\n<td>The <code>CNamed\u00adWidget<\/code> being marshaled<\/td>\n<\/tr>\n<tr>\n<td><code>dwDestContext<\/code><\/td>\n<td><code>r9<\/code><\/td>\n<td><code>3<\/code><\/td>\n<td><code>MSHCTX_INPROC<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>pvDestContext<\/code><\/td>\n<td><code>[rsp+28]<\/code><\/td>\n<td><code>nullptr<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>mshlflags<\/code><\/td>\n<td><code>[rsp+30]<\/code><\/td>\n<td><code>1<\/code><\/td>\n<td><code>MSHLFLAGS_TABLESTRONG<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, when the <code>CNamed\u00adWidget<\/code> goes to marshal the inner <code>CWidget<\/code>, we see this:<\/p>\n<pre>rax=0000000000000000 rbx=000000001ac11330 rcx=00000000042ea568\r\nrdx=00007ffb6d31c4b0 rsi=0000000000000003 rdi=00000000042ea568\r\nrip=00007ffb7143ebf0 rsp=00000000042ea178 rbp=00000000042ea2d0\r\n r8=000000001a792dd0  r9=0000000000000003 r10=00000000161a3080\r\nr11=000000000cf59986 r12=00000000042ea568 r13=0000000000000004\r\nr14=0000000000000001 r15=00007ffb5a43cbb0\r\niopl=0         nv up ei pl zr na po nc\r\ncs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246\r\ncombase!CoMarshalInterface:\r\n00007ffb`7143ebf0 4055            push    rbp\r\n0:000&gt; dps @rsp+28 L2\r\n00000000`042ea1a0  00000000`00000000\r\n00000000`042ea1a8  00000000`00000000\r\n<\/pre>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Parameter<\/th>\n<th>Passed in<\/th>\n<th>Value<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td><code>pStm<\/code><\/td>\n<td><code>rcx<\/code><\/td>\n<td><code>0x42ea568<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>riid<\/code><\/td>\n<td><code>rdx<\/code><\/td>\n<td><code>0x00007ffb6d31c4b0<\/code><\/td>\n<td>The interface being marshaled<\/td>\n<\/tr>\n<tr>\n<td><code>pUnk<\/code><\/td>\n<td><code>r8<\/code><\/td>\n<td><code>0x1a792dd0<\/code><\/td>\n<td>The <code>CWidget<\/code> being marshaled<\/td>\n<\/tr>\n<tr>\n<td><code>dwDestContext<\/code><\/td>\n<td><code>r9<\/code><\/td>\n<td><code>3<\/code><\/td>\n<td><code>MSHCTX_INPROC<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>pvDestContext<\/code><\/td>\n<td><code>[rsp+28]<\/code><\/td>\n<td><code>nullptr<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>mshlflags<\/code><\/td>\n<td><code>[rsp+30]<\/code><\/td>\n<td><code>0<\/code><\/td>\n<td><code>MSHLFLAGS_NORMAL<\/code> \u2190 huh?<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Okay, that seems awfully strange. The outer object is being strong-marshaled but the inner object is only normal-marshaled.<\/p>\n<p>That explains why we are seeing a double-release of the inner object: The <code>Ro\u00adGet\u00adAgile\u00adReference<\/code> function strong-marshaled the object, which means that it is going to call <i>both<\/i> <code>Unmarshal\u00adInterface<\/code> <i>and<\/i> <code>Release\u00adMarshal\u00adData<\/code>. However, the <code>CWidget<\/code> was normal-marshaled, which means that it expected to receive <i>either<\/i> <code>Unmarshal\u00adInterface<\/code> <i>or<\/i> <code>Release\u00adMarshal\u00adData<\/code>, but not both. If you call both, then the marshal data gets double-destroyed, and that&#8217;s where the double-release is coming from.<\/p>\n<p>Here&#8217;s a sketch of the marshaling code for the <code>CNamed\u00adWidget<\/code>:<\/p>\n<pre>HRESULT CNamedWidget::MarshalInterface(...)\r\n{\r\n  if (\u27e6want to marshal by shallow copy\u27e7) {\r\n    RETURN_IF_FAILED(IStream_WriteStr(pstm, m_name));\r\n    RETURN_IF_FAILED(CoMarshalInterface(pstm, __uuidof(m_widget.Get()), m_widget.Get(),\r\n                              dwDestCtx, pvDestCtx, <span style=\"color: red;\">MSHLFLAGS_NORMAL<\/span>));\r\n    return S_OK;\r\n  }\r\n  \u27e6delegate to standard marshaler\u27e7\r\n}\r\n<\/pre>\n<p>Observe that the <code>Marshal\u00adInterface<\/code> method always marshals the <code>CWidget<\/code> with <code>MSHLFLAGS_<wbr \/>NORMAL<\/code> instead of using the same marshal flags that it was given. That&#8217;s the source of the problem.<\/p>\n<p>The customer confirmed that making that one change fixed their problem.<\/p>\n<p><b>Bonus chatter<\/b>: Here are the other marshaling and unmarshaling methods:<\/p>\n<pre>HRESULT CNamedWidget::UnmarshalInterface(...)\r\n{\r\n  *ppv = nullptr;\r\n  RETURN_IF_FAILED(IStream_ReadStr(pstm, &amp;m_name));\r\n  RETURN_IF_FAILED(CoUnmarshalInterface(pstm, IID_PPV_ARGS(&amp;m_widget)));\r\n  RETURN_IF_FAILED(QueryInterface(riid, ppv));\r\n  return S_OK;\r\n}\r\n\r\nHRESULT CNamedWidget::ReleaseMarshalData(...)\r\n{\r\n  if (!m_widget) {\r\n    \/\/ Read the string and throw it away.\r\n    CComHeapPtr&lt;wchar_t&gt; name;\r\n    RETURN_IF_FAILED(IStream_ReadStr(pstm, &amp;name));\r\n    RETURN_IF_FAILED(CoReleaseMarshalData(pstm));\r\n  }\r\n  return S_OK;\r\n}\r\n<\/pre>\n<p>You might notice that there&#8217;s also a bug in the <code>Release\u00adMarshal\u00adData<\/code> method: That method is supposed to clean up the marshal data unconditionally, but this version does so only if the unmarshaler hasn&#8217;t yet been used to unmarshal anything: It releases the marshal data only if the <code>m_widget<\/code> is still empty. Furthermore, in the case where it decides not to release the marshal data, it doesn&#8217;t even bother to move the stream pointer past the marshal data. It just leaves the stream pointer where it was, causing the next object in the stream to receive the <code>CNamed\u00adWidget<\/code>&#8216;s unmarshal data instead of the data it expects.<\/p>\n<p>Fortunately, in practice, the unmarshaler is nearly always empty, because <code>Co\u00adRelease\u00adMarshal\u00adData<\/code> will create a brand new unmarshaler in order to call its <code>Release\u00adMarshal\u00adData<\/code>. So this bug ends up masked, but they made a note to fix it anyway.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The tedium of monitoring reference counts.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-106786","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>The tedium of monitoring reference counts.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106786","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106786"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106786\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106786"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}