{"id":234842,"date":"2021-10-11T12:37:44","date_gmt":"2021-10-11T19:37:44","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/visualstudio\/?p=234842"},"modified":"2021-10-11T12:37:44","modified_gmt":"2021-10-11T19:37:44","slug":"case-study-using-visual-studio-profiler-to-reduce-memory-allocations-in-the-windows-terminal-console-host-startup-path","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/visualstudio\/case-study-using-visual-studio-profiler-to-reduce-memory-allocations-in-the-windows-terminal-console-host-startup-path\/","title":{"rendered":"Case Study: Using Visual Studio Profiler to reduce memory allocations in the Windows Terminal console host startup path"},"content":{"rendered":"<h2>Setting The Stage<\/h2>\n<p>Around the holidays of 2020 it was a bit quieter, and I decided it might be a great time to go investigate how the Visual Studio profiling tools worked. I have been a long-time user of the <a href=\"https:\/\/docs.microsoft.com\/windows-hardware\/test\/wpt\/windows-performance-analyzer\">Windows Performance Analyzer (WPA)<\/a>, which is a great tool for doing systems-level performance analysis. It is, however, quite complex and difficult to learn, and I&#8217;d heard that the <a href=\"https:\/\/docs.microsoft.com\/visualstudio\/profiling\/?view=vs-2019&amp;preserve-view=true\">Visual Studio Profiler<\/a> had improved a lot since I last looked at it.<\/p>\n<p>I could have chosen a toy program to look at, but it felt like I&#8217;d learn more by approaching a real codebase of significant size. The <a href=\"https:\/\/github.com\/Microsoft\/Terminal\">Windows Terminal<\/a> project was a great one because it is a sizeable C++ codebase with plenty of history and legacy. If I could find anything cool to optimize, it would give me a chance to contribute back as it&#8217;s open source. 
As a bonus, I happened to know from some internal conversations that conhost.exe is launched over 1 billion times a day, so if I found any optimizations they might have a measurable impact on CPU cycles spent across the globe.<\/p>\n<h2>Cloning and getting the first trace<\/h2>\n<p>I began by forking and cloning the Windows Terminal repo. If you&#8217;d like to follow along at home and repeat my steps, you can also clone that repo to the commit I had at the time, <a href=\"https:\/\/github.com\/microsoft\/terminal\/commit\/551cc9a98b72ef82a3ac2ba28cfd8c5e7f97b0d1\">551cc9a9<\/a>. I followed the Windows Terminal repo readme to ensure I had all the right dependencies set up, then opened OpenConsole.sln in Visual Studio 2019.<\/p>\n<p>Once in Visual Studio, I switched the Solution Configuration to Release since performance investigations should almost always be done on Release builds to see the full effects of any optimizations the toolchain will perform. Then I switched the Solution Platform to x64 since I have a 64-bit machine and figure most people do too by now. 
Lastly, I changed the Startup Project to Host.EXE which is where OpenConsole.exe builds &#8211; this is the open source version of conhost.exe that I wanted to look at.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project.png\"><img decoding=\"async\" width=\"452\" height=\"438\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project-small.png\" alt=\"The context menu on the Host.EXE project, with the mouse hovered over 'Set as Startup Project'\" srcset=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project-small.png 452w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project-small-300x291.png 300w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project-small-24x24.png 24w, https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-set-startup-project-small-48x48.png 48w\" sizes=\"(max-width: 452px) 100vw, 452px\" \/><\/a><\/p>\n<p>I built the solution then launched the Visual Studio Profiler. You can go to <strong>Debug &gt; Performance Profiler<\/strong>, but I like to launch it with <kbd>Alt+F2<\/kbd> as a shortcut. The profiler then prompts you to choose what you want to investigate: CPU, GPU, memory, etc.<\/p>\n<p>I decided I wanted to look at memory since I have some experience in that area. Memory allocations can have significant CPU costs because the heap must do work to maintain a list of freed allocations, buckets, and various other heap-y details. 
Usually reducing allocations will noticeably reduce CPU time, so it&#8217;s a cool way to improve memory and CPU at the same time. I checked the Memory Usage option and clicked Start.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-launching-vs-profiler.png\" alt=\"The Performance Profiler tools available, with the 'Memory Usage' tool checked.\" \/><\/p>\n<p>At that point the console app starts up, and I can see a command-line with a blinking cursor as expected. In Visual Studio, I&#8217;m offered the option to take a memory snapshot:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-take-snapshot.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-take-snapshot-small.png\" alt=\"The 'Take snapshot' UI in the profiler.\" \/><\/a><\/p>\n<p>I click this button to snapshot the memory that has been allocated so far in the startup process, then click Stop Collection to finish this profiling session and close the app. Note that the first time you do this, Visual Studio may spend several minutes loading symbols and indexing, but it&#8217;s much faster on later iterations &#8211; so be patient the first time. The time to do this has been substantially improved in Visual Studio 2022 as well.<\/p>\n<p>The exact results you see may vary by machine. 
On my machine I saw this:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-snapshot-1.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-snapshot-1-small.png\" alt=\"4.74 MB of memory and 29,461 allocations as the baseline.\" \/><\/a><\/p>\n<p>I clicked on the link for &#8220;29,461 allocations&#8221; and saw this:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-initial-collapsed-heap.png\" alt=\"A callstack of allocations that total up to 29,461.\" \/><\/p>\n<p>That view is not very useful yet, but we can expand the callstack to look for where these come from in our code. Expanding out several nodes in RtlUserThreadStart where the vast majority of the allocations are to look for something interesting, I see this:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-1.png\" alt=\"An expanded callstack, highlighting an interesting near-leaf-node called ROW::ROW that accounts for over 18,000 of the allocations.\" \/><\/p>\n<h2>Discovering a suspicious allocation<\/h2>\n<p>What this shows is that we start off with 29,461 total allocations in this trace. 27,077 of these allocations belong to the <code>ConsoleIoThread<\/code> function and the things it calls. Digging in we can see that 18,002 of these come from <code>ROW::ROW<\/code>. Wow, that seems like a large percentage of the total allocations! 
This is basically a leaf node in the allocation callstack and starting at a leaf node is nice because it should be the thing directly doing the memory allocation, so I&#8217;ll start there.<\/p>\n<p>It seems like a lot of these ROW objects are being allocated, but why? At this point, I started reading a little code near this callstack and found this loop in the <code>TextBuffer<\/code> constructor:<\/p>\n<pre class=\"prettyprint\">\/\/ initialize ROWs\r\nfor (size_t i = 0; i &lt; static_cast&lt;size_t&gt;(screenBufferSize.Y); ++i)\r\n{\r\n    _storage.emplace_back(static_cast&lt;SHORT&gt;(i), screenBufferSize.X, _currentAttributes, this);\r\n}\r\n<\/pre>\n<p>The next question is: what is the value of <code>screenBufferSize.Y<\/code>? Debugging through this code will reveal that it is, in fact, 9001 &#8211; but where does that come from? It&#8217;s from <code>Settings::ApplyDesktopSpecificDefaults<\/code>, where <code>_dwScreenBufferSize.Y = 9001<\/code>, which happens very early in startup and gets used as the default number of rows in the console.<\/p>\n<p>Ok, cool &#8211; so for some reason or other (likely compatibility) the default number of rows of text in the buffer is 9001 &#8211; but why do we need to allocate 9001 times to do this? Shouldn&#8217;t we be able to just do one allocation of an array of 9001 of these <code>ROW<\/code> objects?<\/p>\n<p>Looking at TextBuffer.hpp, the <code>ROW<\/code> objects are stored in a <code>std::deque<\/code>:<\/p>\n<pre class=\"prettyprint\">std::deque&lt;ROW&gt; _storage;\r\n<\/pre>\n<p>A <a href=\"https:\/\/docs.microsoft.com\/cpp\/standard-library\/deque-class\">std::deque<\/a> stores its elements in separately allocated blocks, and in Microsoft&#8217;s STL implementation a block holds just one element when the element type is larger than 8 bytes, so for an object as large as <code>ROW<\/code> it will allocate every time an element is inserted into it. By contrast, a <a href=\"https:\/\/docs.microsoft.com\/cpp\/standard-library\/vector-class\">std::vector<\/a> keeps all of its elements in a single contiguous array. So in this case when we initially set up the <code>TextBuffer::_storage<\/code>, we insert 9,001 <code>ROW<\/code> objects into it. 
This results in the <code>ROW<\/code> constructor running 9,001 times as well as 9,001 allocations inside of the <code>std::deque<\/code>.<\/p>\n<p>This should be an easy fix, right? Can we just swap it out for a <code>std::vector<\/code>? Let&#8217;s do that. I changed <code>_storage<\/code> to be a <code>std::vector<\/code> and added <code>#include &lt;vector&gt;<\/code> at the top of the file to see where I get.<\/p>\n<h2>The simple fixes usually require a little iteration<\/h2>\n<p>Of course, it&#8217;s not that simple, as I hit some issues with test code. Usually while hacking around and trying to understand something, I find it helpful to unload the tests from the solution to improve build iteration times until something is promising enough to run functional tests on. Performance investigations can have many dead-ends and waiting to run and fix up tests for any changes to data structures can be put off until later. So, for the purposes of this case study, I unloaded all the test projects to improve iteration times &#8211; but of course before the final PR was submitted I ran and fixed all the tests.<\/p>\n<p>With tests unloaded, it&#8217;s still not quite as easy as changing a <code>std::deque<\/code> to a <code>std::vector<\/code> &#8211; this code was using <code>deque<\/code> as a circular buffer so it&#8217;s using <code>pop_front()<\/code> which <code>vector<\/code> does not have. 
For now, let&#8217;s ignore that and hack away just to get some numbers.<\/p>\n<p>Let&#8217;s turn this loop in <code>TextBuffer::ResizeTraditional<\/code>:<\/p>\n<pre class=\"prettyprint\">while (&amp;newTopRow != &amp;_storage.front())\r\n{\r\n    _storage.push_back(std::move(_storage.front()));\r\n    _storage.pop_front();\r\n}\r\n<\/pre>\n<p>Into this loop:<\/p>\n<pre class=\"prettyprint\">for (int i = 0; i &lt; TopRowIndex; i++)\r\n{\r\n    _storage.emplace_back(std::move(_storage.front()));\r\n    _storage.erase(_storage.begin());\r\n}\r\n<\/pre>\n<p>Ok, now everything builds again, and I can re-run numbers to see if this had the expected effect of reducing allocations.<\/p>\n<h2>Small victories can have big impacts<\/h2>\n<p>On my machine, with just that one change from a <code>deque<\/code> to a <code>vector<\/code>, I see this:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-2.png\" alt=\"Total allocations reduced to 20,458.\" \/><\/p>\n<p>Good news! This reduced the total allocation count during startup from 29,461 to 20,458 (by 9,003!). Wow, this small change reduced the allocations in the startup path by just over 30%. How cool! This is what gets me so excited about doing performance work &#8211; when it works, the numbers are so compelling and measurable.<\/p>\n<p>Still, I think I can do better. <code>std::vector<\/code> will start out with little or no capacity, then when you <code>push_back<\/code> beyond its capacity it will re-allocate a larger buffer (1.5 times as large in Microsoft&#8217;s STL implementation; the growth factor is implementation-defined) and move or copy all elements over into it. So initially we&#8217;ll start out with a small array and end up re-allocating it multiple times on the way to fitting all 9,001 of these. 
After this is completed, we&#8217;ll also have an over-sized buffer with some &#8220;waste&#8221; on the end.<\/p>\n<p>We can fix this simply by telling the <code>vector<\/code> to <a href=\"https:\/\/docs.microsoft.com\/cpp\/standard-library\/vector-class#reserve\">reserve<\/a> space for the right number of elements before inserting all of the <code>ROW<\/code> objects. In the <code>TextBuffer<\/code> constructor, add this as the first line:<\/p>\n<pre class=\"prettyprint\">_storage.reserve(static_cast&lt;size_t&gt;(screenBufferSize.Y));\r\n<\/pre>\n<p>Re-running the trace on my machine shows this:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-3.png\" alt=\"Total allocations are unaffected, but total memory used has reduced by approximately 300 KB.\" \/><\/p>\n<p>I don&#8217;t understand why the allocation count didn&#8217;t drop. I expected a few fewer allocations, since the storage for all 9,001 <code>ROW<\/code> objects gets reserved right away. But at least it right-sized the allocation: total memory consumption dropped by about 300 KB, so now there&#8217;s no regression in total memory consumption &#8211; in fact, it is a little lower than at the starting point.<\/p>\n<p>So far so good! Can we do even better than this? Yes, I think we can. Each of those <code>ROW<\/code> instances shows it is allocating in a <code>std::vector&lt;TextAttributeRun&gt;<\/code>, but this is a bit misleading as the <code>ROW<\/code> doesn&#8217;t have any <code>std::vector&lt;TextAttributeRun&gt;<\/code> members. Each <code>ROW<\/code> instance owns an <code>ATTR_ROW<\/code> and that is what actually owns the <code>std::vector&lt;TextAttributeRun&gt;<\/code>. 
The reason it shows up this way in the profiler UI is that the <code>ATTR_ROW<\/code> constructor is so small it gets inlined into <code>ROW::ROW<\/code> &#8211; this is something to watch out for in optimized code, where inlining can make callstacks a bit confusing.<\/p>\n<p>If you were to look at this in memory (which a tool like <a href=\"https:\/\/devblogs.microsoft.com\/performance-diagnostics\/sizebench-a-new-tool-for-analyzing-windows-binary-size\/\">SizeBench<\/a> can help you do), each instance of ROW is laid out like this:<\/p>\n<h3>ROW structure in memory<\/h3>\n<table>\n<tbody>\n<tr>\n<td align=\"right\">Offset<\/td>\n<td>Member<\/td>\n<td align=\"right\">Member Size in Bytes<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td>CharRow _charRow<\/td>\n<td align=\"right\">40<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+40<\/td>\n<td>ATTR_ROW _attrRow<\/td>\n<td align=\"right\">32<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+72<\/td>\n<td>short _id<\/td>\n<td align=\"right\">2<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+74<\/td>\n<td>&lt;alignment padding&gt;<\/td>\n<td align=\"right\">6<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+80<\/td>\n<td>int64 _rowWidth<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+88<\/td>\n<td>TextBuffer* _pParent<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That <code>ATTR_ROW<\/code> member takes up 32 bytes. 
Let&#8217;s expand that in memory too and it looks like this:<\/p>\n<h3>ATTR_ROW structure in memory<\/h3>\n<table>\n<tbody>\n<tr>\n<td align=\"right\">Offset<\/td>\n<td>Member<\/td>\n<td align=\"right\">Member Size in Bytes<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td>std::vector&lt;TextAttributeRun&gt; _list<\/td>\n<td align=\"right\">24<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+24<\/td>\n<td>int64 _cchRowWidth<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Ah, but that <code>_list<\/code> member, now that&#8217;s interesting &#8211; that vector is only 24 bytes because internally a vector is storing a pointer off to the allocated array of data it contains. So that vector might be one element long or 1000 elements long, and it&#8217;ll remain 24 bytes in the <code>ATTR_ROW<\/code>. That&#8217;s unfortunate in our case because most rows of a console have a single <code>TextAttributeRun<\/code> yet we pay to allocate the array elsewhere in memory and then indirectly access via that pointer when we want to get to the elements.<\/p>\n<h2>Paying attention to common patterns and practices<\/h2>\n<p>There are many codebases that run into situations like this and want a <code>vector<\/code> that can store some number of elements inside of itself and only go out to the heap when they spill over that default amount. This is such a common pattern that <a href=\"https:\/\/docs.microsoft.com\/cpp\/standard-library\/cpp-standard-library-reference\">the STL<\/a> even has it implemented for a very important type &#8211; <code>std::basic_string<\/code>, the underlying type behind <code>std::string<\/code> and <code>std::wstring<\/code>. This so-called &#8220;small string optimization&#8221; keeps strings up to a certain size directly in the <code>std::[w]string<\/code> without allocating from the heap, only reaching out to the heap for larger strings. 
Unfortunately <code>std::vector<\/code> has no such optimization, but a number of variations of it exist in many codebases.<\/p>\n<p>For Windows Terminal, I decided to use the implementation in boost called <a href=\"https:\/\/www.boost.org\/doc\/libs\/1_76_0\/doc\/html\/boost\/container\/small_vector.html\">boost::container::small_vector<\/a>. With a <code>small_vector<\/code>, you can specify the type for it to hold (just like <code>std::vector<\/code>), and you also specify how many elements it should reserve space for. Because I know most <code>ROW<\/code> objects have just a single <code>TextAttributeRun<\/code>, I replaced this line in AttrRow.hpp:<\/p>\n<pre class=\"prettyprint\">std::vector&lt;TextAttributeRun&gt; _list;\r\n<\/pre>\n<p>With this line:<\/p>\n<pre class=\"prettyprint\">boost::container::small_vector&lt;TextAttributeRun, 1&gt; _list;\r\n<\/pre>\n<p>After adjusting some build files to allow the right boost header to be located, I began to hit problems with <code>small_vector<\/code> not being exactly the same as <code>vector<\/code>. Instead of calling <code>push_back<\/code>, I needed to call <code>emplace_back<\/code> in a few places and this was a good time to sprinkle in a bit of <code>noexcept<\/code> because <code>small_vector<\/code> makes more guarantees about when it can and cannot throw compared to <code>vector<\/code>. 
All the changes in AttrRow.cpp, AttrRow.hpp, AttrRowIterator.cpp and AttrRowIterator.hpp can be seen in <a href=\"https:\/\/github.com\/microsoft\/terminal\/pull\/8489\">the Pull Request for the full change<\/a>.<\/p>\n<p>Re-building and re-running the trace from there shows these results:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-4.png\" alt=\"Total allocations reduced to 11,449, and TextBuffer::TextBuffer reducing by 9,001.\" \/><\/p>\n<p>Nice &#8211; another 9,001 allocations knocked out. Now the memory layout of each ROW looks like this:<\/p>\n<h3>ROW structure in memory after using small_vector<\/h3>\n<table>\n<tbody>\n<tr>\n<td align=\"right\">Offset<\/td>\n<td>Member<\/td>\n<td align=\"right\">Member Size in Bytes<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td>CharRow _charRow<\/td>\n<td align=\"right\">40<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+40<\/td>\n<td>ATTR_ROW _attrRow<\/td>\n<td align=\"right\">56<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+96<\/td>\n<td>short _id<\/td>\n<td align=\"right\">2<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+98<\/td>\n<td>&lt;alignment padding&gt;<\/td>\n<td align=\"right\">6<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+104<\/td>\n<td>int64 _rowWidth<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+112<\/td>\n<td>TextBuffer* _pParent<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That <code>ATTR_ROW<\/code> member now takes up 56 bytes &#8211; so what&#8217;s it doing with that extra space? 
Let&#8217;s expand that in memory too and it looks like this:<\/p>\n<h3>ATTR_ROW structure in memory after using small_vector<\/h3>\n<table>\n<tbody>\n<tr>\n<td align=\"right\">Offset<\/td>\n<td>Member<\/td>\n<td align=\"right\">Member Size in Bytes<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td>small_vector&lt;TextAttributeRun, 1&gt; _list<\/td>\n<td align=\"right\">48<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">+48<\/td>\n<td>int64 _cchRowWidth<\/td>\n<td align=\"right\">8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So, isn&#8217;t this worse, since things are bigger? Nope. As seen in the callstack screenshots, the total memory usage of the app during startup went from 4,836,438 to 4,834,754 &#8211; it actually went down just a bit. Why&#8217;s that? Because now these <code>small_vector<\/code> instances contain space to store a single <code>TextAttributeRun<\/code> internally without needing to allocate &#8211; this saves us an allocation (and thus CPU time spent in the heap) while also saving a bit of memory by not needing the vector&#8217;s internal pointer and the array it pointed to.<\/p>\n<p>If a specific <code>ATTR_ROW<\/code> instance needs space for more than one <code>TextAttributeRun<\/code> at runtime, that&#8217;s no problem &#8211; it will just allocate at that time. For the common case of a single run, this will keep memory constrained and also speed up startup when the buffer is filled with single run <code>ROW<\/code> objects.<\/p>\n<h2>Optimization confusion in callstacks and when to profile in Debug configuration<\/h2>\n<p>Next up: what are the other 9,001 allocations in the <code>ROW<\/code> constructor? I puzzled about that for a while unsure of what it could be. Due to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Inline_expansion\">inlining<\/a>, I couldn&#8217;t see what was causing it. Therefore, it was a good time to switch into Debug mode and build and run the trace there in hopes it would show a callstack easier to follow. 
Doing that shows this:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-5.png\" alt=\"A deeper callstack because it was taken in Debug mode where inlining is not happening.\" \/><\/p>\n<p>The exact allocation counts and amount of memory differ substantially here, and this is normal in many codebases. This codebase may have extra debug information stored that is not present in Release builds or things like that. What&#8217;s interesting is to see the deeper callstack of things like <code>std::make_unique&lt;TextBuffer...&gt;<\/code> that were not present in the Release trace, because that call was completely inlined in a Release build. Likewise, we can now see that the constructor for <code>ROW<\/code> is calling the constructor of <code>CharRow<\/code> which is where there is yet another <code>std::vector<\/code> &#8211; this time holding <code>CharRowCell<\/code> instances, and this is the source of our other mysterious 9,001 allocations.<\/p>\n<h2>Rinse and repeat<\/h2>\n<p>With that discovered, it looked like we could use the same playbook as before and see if <code>std::vector<\/code> could be replaced with a <code>small_vector<\/code>. I switched back to Release builds to have comparable numbers with the earlier traces now that I know where to focus attention. Looking at <code>CharRow<\/code> and that vector of <code>CharRowCell<\/code> suggests another opportunity to change to a <code>boost::container::small_vector<\/code> to place the initial storage into the <code>CharRow<\/code> instance, thus avoiding another allocation per row. 
So, I changed the <code>_data<\/code> member from this:<\/p>\n<pre class=\"prettyprint\">std::vector&lt;value_type&gt; _data;\r\n<\/pre>\n<p>To this:<\/p>\n<pre class=\"prettyprint\">boost::container::small_vector&lt;value_type, 120&gt; _data;\r\n<\/pre>\n<p>How did I know to pre-size it to 120? Once again, with some debugging this comes back to that same function we saw earlier, <code>Settings::ApplyDesktopSpecificDefaults<\/code>. It sets the default screen width in a variable called <code>_dwScreenBufferSize.X<\/code> to 120, which eventually flows through to the constructors of each <code>ROW<\/code> and then to each <code>CharRow<\/code>. So, by pre-sizing this to 120 elements it avoids allocating for any <code>CharRow<\/code>&#8216;s initial storage of <code>CharRowCell<\/code> objects.<\/p>\n<p>This again requires a few fixups throughout CharRowCell.hpp and CharRowCell.cpp, which you can see in detail in <a href=\"https:\/\/github.com\/microsoft\/terminal\/pull\/8489\">the Pull Request for the full change<\/a>. Once those changes are made, let&#8217;s see what this data structure change did to the trace. As a reminder, this is now back in Release mode so I&#8217;ll compare numbers to the last run we did in Release mode (not the previous one showing Debug that helped navigate the callstack):<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-234846\" src=\"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-content\/uploads\/sites\/4\/2021\/10\/windows-terminal-startup-allocations-callstack-expanded-6.png\" alt=\"Total allocations reduced to 2,448 and TextBuffer::TextBuffer all the way down to 9 allocations.\" \/><\/p>\n<p>Wow! Another large reduction of allocations. So much so that the entire <code>ConsoleIoThread<\/code> isn&#8217;t even at the top of allocations under <code>BaseThreadInitThunk<\/code> anymore. 
The total allocations have now dropped to 2,448, meaning we&#8217;ve reduced allocations by almost 92% from where they started (29,461 to 2,448). Total memory consumption has gone from 4,968,062 to 4,906,762, a much more modest reduction of 1.2% but still a reduction. So basically, this changes the layout of memory so that far fewer calls to the heap are needed, without increasing total memory consumption.<\/p>\n<p>At this point it looks like the allocation reduction well is drying up, so this case study will stop here. The <a href=\"https:\/\/github.com\/microsoft\/terminal\/pull\/8489\">final Pull Request<\/a> included a few other small changes that helped with memory usage and CPU usage a bit, but those are outside the scope of this case study.<\/p>\n<h2>To ask the right question is already half the solution of a problem<\/h2>\n<p>I was not familiar with the Windows Terminal codebase when approaching this and was very pleasantly surprised to see how quickly the Visual Studio Profiling tools helped me find interesting questions to ask and interesting code to dig into. It can be easy to get lost in large swaths of code, but great tools like this help to focus attention on things that really matter, and we can see that just a few relatively tactical changes to a large codebase can result in substantial savings. Even in code that is decades old, there&#8217;s bound to be something cool to find!<\/p>\n<p>If this were a cheesy 80s cartoon ending, I would say &#8220;Now you know, and knowing is half the battle!&#8221; and we&#8217;d all chuckle as the outro music played \ud83d\ude01<\/p>\n<p>But seriously, I want to extend my thanks to the Windows Terminal team for asking good questions during the Pull Request process and helping shepherd my change through to being accepted. 
You can see the results of this change in Windows Terminal 1.6 and later, and the changes also flowed back into the Windows 11 version of conhost.exe.<\/p>\n<p>Hopefully this provides some ideas on strategies for reducing memory allocations in your codebase and shows how to use the very nice Visual Studio memory profiling tools to zero in on interesting problems.<\/p>\n<p>Happy Profiling,<\/p>\n<p>Austin<\/p>\n<p>If you give this tool a try and have any feedback for us, please fill out this survey, we\u2019d love to hear from you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Setting The Stage Around the holidays of 2020 it was a bit quieter, and I decided it might be a great time to go investigate how the Visual Studio profiling tools worked. I have been a long-time user of the Windows Performance Analyzer (WPA), which is a great tool for doing systems-level performance analysis. It [&hellip;]<\/p>\n","protected":false},"author":70643,"featured_media":228872,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[155],"tags":[],"class_list":["post-234842","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-visual-studio"],"acf":[],"blog_post_summary":"<p>Setting The Stage Around the holidays of 2020 it was a bit quieter, and I decided it might be a great time to go investigate how the Visual Studio profiling tools worked. I have been a long-time user of the Windows Performance Analyzer (WPA), which is a great tool for doing systems-level performance analysis. 
It [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts\/234842","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/users\/70643"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/comments?post=234842"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/posts\/234842\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/media\/228872"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/media?parent=234842"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/categories?post=234842"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/visualstudio\/wp-json\/wp\/v2\/tags?post=234842"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}