{"id":635,"date":"2018-10-09T21:39:41","date_gmt":"2018-10-09T21:39:41","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/maoni\/?p=635"},"modified":"2018-10-09T21:39:41","modified_gmt":"2018-10-09T21:39:41","slug":"you-should-never-see-this-callstack-in-production","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/you-should-never-see-this-callstack-in-production\/","title":{"rendered":"You Should Never See This Callstack in Production"},"content":{"rendered":"<p>A customer who just experienced some server outage asked us for help as they thought it was due to some very long GC pauses. I thought this diagnostics exercise might be useful for other folks as well so I am sharing it with you.<\/p>\n<p>We had them collect a GCCollectOnly trace which showed there were some very long induced gen2 blocking GCs. And they are using Server GC which means we wanted stacks for the GCTrigger event which would show us exactly who induced these GCs. For Server GC the GCStart event is actually fired on one of the Server GC threads which would mean it&#8217;s not useful for figuring out who triggered the GC so we have this GCTrigger event to help exactly with that. However GCCollectOnly traces do not include stacks for CLR events so we asked them to collect some more traces for us with this perfview command line:<\/p>\n<p><span style=\"color: blue;font-family: lucida console;font-size: small\">PerfView.exe \/nogui \/accepteula \/BufferSizeMB:4096 \/CircularMB:2000 \/CollectMultiple:3 \/StopOnGcOverMsec:3000 \/DelayAfterTriggerSec:0 \/KernelEvents:Memory,VirtualAlloc,Default \/Process:&#8221;w3wp&#8221; \/OnlyProviders:&#8221;ClrPrivate:1:5,Clr:0x40000001:5&#8243; \/NoRundown \/NoNGenRundown \/NoClrRundown \/Merge:true \/Zip:true collect<\/span><\/p>\n<p>Yes I know it\u2019s a very long command so let me break it down for you. I am intentionally putting things on separated lines so it\u2019s easier to read &#8211;<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">Collect a trace with default kernel events + some memory events (specified with <span style=\"color: blue\">\/KernelEvents:Memory,VirtualAlloc,Default<\/span> \u2013 Default is there for things like being able to decode process names so you don\u2019t get a trace where each process is only indicated by its process ID and it also includes the CPU sample events which we want in this case as we\u2019d like to see which part of the work is expensive in GC); since we are collecting a trace to diagnose GC problems we are generally also interested in the OS memory events)<\/span><\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">and CLR GC events at verbose level with stacks for the public GC events (specified with <span style=\"color: blue\">\/OnlyProviders:&#8221;ClrPrivate:1:5,Clr:0x40000001:5&#8243;<\/span> \u2013 keyword 1 for both CLR public and private providers means GC and the 0x40000000 in the Clr public provider means collect stacks for those events. We do this so we get stacks with the GCTriggered event)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">and when you detect a GC longer than 3s (specified with <span style=\"color: blue\">\/StopOnGcOverMsec:3000<\/span>)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">in the w3wp.exe process (specified with <span style=\"color: blue\">\/Process:&#8221;w3wp&#8221;<\/span>, note that you specify w3wp, not w3wp.exe here)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">stop the trace immediately when you detect that long GC (specified with <span style=\"color: blue\">\/DelayAfterTriggerSec:0<\/span>)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">don\u2019t do any rundown events (specified with <span style=\"color: blue\">\/NoRundown \/NoNGenRundown \/NoClrRundown<\/span>)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">merge and zip the trace (specified with <span style=\"color: blue\">\/Merge:true \/Zip:true<\/span>. You should definitely do merge if you need to analyze the trace on a different machine. I can\u2019t tell you how many times people have failed to do this and then couldn\u2019t get symbols decoded and had to recollect the trace as they already detected the unmerged traces from the machine where they collected on)<\/p>\n<p><span style=\"color: Tomato;font-family: lucida console;font-size: small\">and do that 3 times total (specified with <span style=\"color: blue\">\/CollectMultiple:3<\/span>)<\/p>\n<p><\/span><\/p>\n<p>So if you were to use it for your own purpose the args to change are<\/p>\n<p><span style=\"color: blue;font-family: lucida console;font-size: small\">\/StopOnGcOverMsec:3000<\/span> \u2013 replace 3000 with the pause (in ms) you want to stop the trace for<\/p>\n<p><span style=\"color: blue;font-family: lucida console;font-size: small\">\/Process:&#8221;w3wp&#8221;<\/span> \u2013 replace w3wp with the .exe you want to detect the long GC pause in<\/p>\n<p>When we got the trace GCStats showed that there was an induced full blocking GC in one of the w3wp processes that took 7.8 seconds! And to see what induced this GC, goto Events view and find the <span style=\"color: black;font-family: lucida console;font-size: small\">Microsoft-Windows-DotNETRuntime\/GC\/Triggered event<\/span> (if you type \u201ctri\u201d in the Filter textbox it\u2019ll filter to only events whose names have tri in them so you can quickly find this event). Click on it and it\u2019ll display all the instances of that event in all processes. Find the process and the GC of interest. The \u201cRest\u201d column for this event will say <span style=\"color: black;font-family: lucida console;font-size: small\">HasStack=\u201dTrue\u201d<\/span> which means there\u2019s a stack associated with this event and if you right click on the timestamp and do <span style=\"color: black;font-family: lucida console;font-size: small\">\u201cOpen any stacks\u201d<\/span> it\u2019ll show you the stack for this event which will reveal what induced this GC:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1.jpg\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1.jpg\" alt=\"\" class=\"alignnone size-medium wp-image-645\" width=\"600\" height=\"408\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1.jpg 1057w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1-300x204.jpg 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1-768x521.jpg 768w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-1-1024x695.jpg 1024w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>This is saying that someone requested to walk the heap with ETW (this is one of the mechanisms Perfview uses for taking a heap snapshot).<\/p>\n<p>Now it leaves the question why this GC was so long. To see what the GC is actually doing that takes so long, open the CPU stack view with the time range set to the beginning and ending of this GC (which you can get from the GCStats view). And it\u2019s pretty clear what\u2019s taking up the time. We are spending 6.6s in this method called <span style=\"color: black;font-family: lucida console;font-size: small\">clr!SVR::GCProfileWalkHeapWorker<\/span>!<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0.jpg\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0.jpg\" alt=\"\" class=\"alignnone size-medium wp-image-655\" width=\"600\" height=\"156\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0.jpg 1650w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0-300x78.jpg 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0-768x201.jpg 768w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2018\/10\/HeapWalkInProd-0-1024x267.jpg 1024w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>Normally <span style=\"color: black;font-family: lucida console;font-size: small\">do_post_gc<\/span> should take very, very little time but in the case of someone actually requesting a heap walk, <span style=\"color: black;font-family: lucida console;font-size: small\">GCProfileWalkHeap<\/span> would do a full blown walk of the heap which obviously can be very expensive. You should never seen this callstack in production (unless you are intentionally taking a heap snapshot but that should just be a one time thing, or at best once in a long while)! I\u2019ve seen this happen a couple of times where folks were completely oblivious some tool was asking for this continuously and it of course made GC pauses very long.<\/p>\n<p>So something was asking for GC heap walks (it was not due to the perfview I showed above as that doesn\u2019t include the keywords for heap walk events). It turned out the team just deployed a diagnostics tool that was doing this. After getting rid of the tool they no longer saw these long induced full blocking GCs. <\/p>\n<p>Curious readers would also notice that the When column shows fairly low CPU usage (click on \u2018?\u2019 for the When column to see an explanation how to interpret what you see here). It was because the tool also induced 2 back to back full blocking GCs in the other w3wp process at the same time which made each GC much slower. Very destructive tool \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A customer who just experienced some server outage asked us for help as they thought it was due to some very long GC pauses. I thought this diagnostics exercise might be useful for other folks as well so I am sharing it with you. We had them collect a GCCollectOnly trace which showed there were [&hellip;]<\/p>\n","protected":false},"author":3542,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3009],"tags":[3011],"class_list":["post-635","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-performance","tag-maoniposts"],"acf":[],"blog_post_summary":"<p>A customer who just experienced some server outage asked us for help as they thought it was due to some very long GC pauses. I thought this diagnostics exercise might be useful for other folks as well so I am sharing it with you. We had them collect a GCCollectOnly trace which showed there were [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/635","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/3542"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=635"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/635\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=635"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=635"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=635"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}