{"id":237509,"date":"2022-05-02T07:30:48","date_gmt":"2022-05-02T14:30:48","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/visualstudio\/?p=237509"},"modified":"2022-05-03T13:24:21","modified_gmt":"2022-05-03T20:24:21","slug":"case-study-double-performance-in-under-30-minutes","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/visualstudio\/case-study-double-performance-in-under-30-minutes\/","title":{"rendered":"Case Study: Double performance in under 30 minutes"},"content":{"rendered":"<h2>Summary<\/h2>\n<p>Recently I was converting some decompression code to C# so that we could use it cross platform and to aid in our team\u2019s effort to migrate our new analysis process to .NET 6. After I got the initial implementation done with the simplest, cleanest code that I could, I proceeded to profile it to make sure I wasn\u2019t doing anything silly. I want to go over my normal process for this and what I found. Spoiler, I got a 2x performance improvement with under 30 minutes of work.<\/p>\n<h2>What is the code<\/h2>\n<p>The code in question has 2 main components, <code>BlockCompressorStream<\/code> and <code>XPressStream<\/code>.<\/p>\n<ul>\n<li><code>BlockCompressorStream<\/code> \u2013 Reads a block header containing the compressed and decompressed size and holds the block, max 65K, in memory to return to the next stream.<\/li>\n<li><code>XPressStream<\/code> \u2013 Reads a compressed byte stream and decompresses it on the fly using the XPress decompression algorithm. The general algorithm is to read a byte or two, determine the \u201ccodeword\u201d, then use that to inflate by either adding the next few bytes or copying them from the already decompressed stream.<\/li>\n<\/ul>\n<p>The general structure of the code looks like so:\n<a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream.png\" alt=\"XPress decompression path\" width=\"4019\" height=\"1131\" class=\"alignnone size-full wp-image-237708\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream.png 4019w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream-300x84.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream-1024x288.png 1024w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream-768x216.png 768w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream-1536x432.png 1536w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Stream-2048x576.png 2048w\" sizes=\"(max-width: 4019px) 100vw, 4019px\" \/><\/a><\/p>\n<h2>Digging for performance<\/h2>\n<p>To get a general idea of the performance I first started with the <a href=\"https:\/\/docs.microsoft.com\/en-us\/visualstudio\/profiling\/cpu-usage?view=vs-2022\" rel=\"noopener\" target=\"_blank\">CPU Usage tool<\/a>, I was curious how long it would take and what part of the algorithm is the hot spot. While I could launch a Visual Studio patch with my code, I opted for an easier route that scoped what I was profiling to just my decompression algorithm. During development I created a bunch of unit tests to verify the compression. One such test was to take a large, compressed file and decompress it to a known file that I can then check the hashes against. I wrote a simple console application in my solution that instantiates the unit test class and then just invokes the unit test, allowing me to reuse assets and quickly profile things as the \u201cstartup project\u201d. After running the CPU Usage tool against this exe, I had the following:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture1.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237510\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture1.png\" alt=\"CPU Usage Top Functions - XPressStream.ReadBlock\" width=\"780\" height=\"69\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture1.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture1-300x27.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture1-768x68.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>This is what I expected to see since <code>XPressStream.ReadBlock()<\/code> is doing the actual decompression which is good, my mental model of what my code does matches reality.<\/p>\n<p>What is a bit odd is that less than a third of the time is spent in the function (self), the rest of the time is in child functions (total &#8211; self). Looking at the CPU tool\u2019s new flame chart I see most of the time seems to be spent reading in <code>UInt16<\/code>.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture2.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237511\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture2.png\" alt=\"Flame Chart showing time spent readying Uint16\" width=\"780\" height=\"331\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture2.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture2-300x127.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture2-768x326.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>I assumed reading in the data would take some time, <strong>but I didn\u2019t expect it to take 2\/3 of my algorithms time<\/strong>. Peeking at the <code>BlockCompressor.Read<\/code> I can see it spending most of the time reading the stream which got me thinking maybe I\u2019m not buffering enough. Looking at where I construct the BufferedStream, sure enough I\u2019m using the default buffer size which normally is fine, but I know that my blocks can never be more than <code>65,536<\/code>.<\/p>\n<pre class=\"prettyprint\">\r\npublic BlockCompressorStream(Stream stream)\r\n{\r\n    \/\/ We add a buffer stream around the stream to handle the buffering for us\r\n    this.stream = new BufferedStream(stream);\r\n\r\n    if (!this.ReadHeader())\r\n    {\r\n        throw new InvalidDataException();\r\n    }\r\n}\r\n<\/pre>\n<p>I could buffer that much by default though it seems it bit much, instead I chose to buffer half as much <code>32,768<\/code>. If tuning this was super important, I would profile the code with various sizes to determine the correct tradeoff of memory size to performance, but I\u2019m not overly concerned and instead will just leave a comment for the next dev to hit this code. Rerunning the CPU Usage, I find that I didn\u2019t quite double performance, but I got pretty close.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture4.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237512\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture4.png\" alt=\"Improved performance of XPressStream.ReadBlock\" width=\"780\" height=\"42\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture4.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture4-300x16.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture4-768x41.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>Poking around a little more I don\u2019t immediately notice anything that is jumping out. Since we are using a buffered stream and I don\u2019t allocate anything in the algorithm I figured I should verify my allocations are at a steady state. To do that I ran the <a href=\"https:\/\/docs.microsoft.com\/en-us\/visualstudio\/profiling\/dotnet-alloc-tool?view=vs-2022\" rel=\"noopener\" target=\"_blank\">.NET Object Allocation profiler<\/a> on the test case and got the following:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture5.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237513\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture5.png\" alt=\"Allocation state of decompression\" width=\"780\" height=\"370\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture5.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture5-300x142.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture5-768x364.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>Immediately I noticed the two GCs which I really didn\u2019t expect. Maybe the <code>BufferedStream<\/code> allocation put us over budget for 1 GC but 2 seems a bit suspect. Looking at the collections tab I actually have 3 GCs, and they are due to \u201cAllocSmall\u201d indicating we are allocating a bunch of small objects and trying to collect them:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture6.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237514\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture6.png\" alt=\"Top collected types showing System.Byte[1]\" width=\"780\" height=\"442\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture6.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture6-300x170.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture6-768x435.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>What is interesting is the top type we are allocating is a byte array of length 1, which is almost never correct. Jumping over to allocations I can confirm I\u2019m actually allocating over 200K <code>byte[1]<\/code> arrays which is terrible. Looking at the backtrace it is coming from <code>Stream.ReadByte<\/code>, which is the default implementation my <code>BlockCompressorStream<\/code> is using.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture7.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237515\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture7.png\" alt=\"Backtrace of System.Byte[1] allocations\" width=\"780\" height=\"288\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture7.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture7-300x111.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture7-768x284.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>Navigating to source through the context menu I see the code and I\u2019m horrified to realize that yep, it is allocating a <code>byte[1]<\/code> so it can call into <code>Read(byte[], int, int)<\/code>. Since <code>BufferedStream<\/code> is holding the buffered data in memory it should be able to return the byte from its current location. Checking on ReferenceSource I can confirm it is doing just that: <a href=\"https:\/\/referencesource.microsoft.com\/#mscorlib\/system\/io\/bufferedstream.cs,afb5cf7431ff6f02\">bufferedstream.cs (microsoft.com)<\/a><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Allocating.jpg\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Allocating-300x290.jpg\" alt=\"Batman meme about allocating garbage\" width=\"300\" height=\"290\" class=\"alignnone size-medium wp-image-237710\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Allocating-300x290.jpg 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Allocating-24x24.jpg 24w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Allocating.jpg 508w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Overriding <code>Stream.ReadByte<\/code> in my <code>BlockCompressorStream<\/code> and using the <code>BufferedStream<\/code> implementation I\u2019m now in a much better allocation situation:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture8.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237516\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture8.png\" alt=\"System.Byte[] allocations after fix, no more System.Byte[1]\" width=\"412\" height=\"232\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture8.png 412w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture8-300x169.png 300w\" sizes=\"(max-width: 412px) 100vw, 412px\" \/><\/a><\/p>\n<p>The .NET GC is really fast but giving it a bunch of garbage work to do is going to slow things down. Rerunning the CPU Usage tool to see if removing all these allocations I can confirm that it had a significant impact.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture9.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-237517\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture9.png\" alt=\"XPressStream.ReadBlock doubled in performance\" width=\"780\" height=\"58\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture9.png 780w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture9-300x22.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2022\/04\/Picture9-768x57.png 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><\/p>\n<p>I reduced the time by almost 1\/3 again. Compared with my original baseline trace I went from 2610 -&gt; 1116 (2.3x), more than doubling performance with a little bit of investigation. At this point while I could continue to dig for performance, I got most of the low hanging fruit and I really shouldn\u2019t spend time on this unless it is a hot path in the overall application.<\/p>\n<h2>Conclusion<\/h2>\n<p>Taking a little time to confirm my mental model of my code with real world data is enlightening. I spent less than 30 minutes and more than doubled the performance on my code and learned more about how Stream is implemented in .NET. Let me know what you found while profiling your code and what factor of improvement you were able to achieve with the tools!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary Recently I was converting some decompression code to C# so that we could use it cross platform and to aid in our team\u2019s effort to migrate our new analysis process to .NET 6. After I got the initial implementation done with the simplest, cleanest code that I could, I proceeded to profile it to [&hellip;]<\/p>\n","protected":false},"author":54885,"featured_media":255385,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[155],"tags":[6848,6847,6803],"class_list":["post-237509","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-visual-studio","tag-net-object-allocation","tag-cpu-usage","tag-visual-studio-profiler"],"acf":[],"blog_post_summary":"<p>Summary Recently I was converting some decompression code to C# so that we could use it cross platform and to aid in our team\u2019s effort to migrate our new analysis process to .NET 6. After I got the initial implementation done with the simplest, cleanest code that I could, I proceeded to profile it to [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts\/237509","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/users\/54885"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/comments?post=237509"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts\/237509\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/media\/255385"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/media?parent=237509"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/categories?post=237509"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/tags?post=237509"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}