{"id":35603,"date":"2005-05-19T07:00:00","date_gmt":"2005-05-19T14:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2005\/05\/19\/loading-the-dictionary-part-6-taking-advantage-of-our-memory-allocation-pattern\/"},"modified":"2005-05-19T07:00:00","modified_gmt":"2005-05-19T14:00:00","slug":"loading-the-dictionary-part-6-taking-advantage-of-our-memory-allocation-pattern","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050519-00\/?p=35603","title":{"rendered":"Loading the dictionary, part 6:  Taking advantage of our memory allocation pattern"},"content":{"rendered":"<p>\nAfter our latest round of optimization, the 100ms barrier teased us,\njust milliseconds away.\nProfiling the resulting program reveals that\n60% of the CPU is spent in <code>operator new<\/code>.\nIs there anything we can do about that?\n<\/p>\n<p>\nIndeed, we can.\nNotice that the memory allocation pattern for the strings in our\ndictionary is quite special:\nOnce a string is allocated into the dictionary,\nit is never modified or freed while the dictionary is in use.\nWhen the dictionary is freed, all the strings are deleted at once.\nThis means that we can design an allocator tailored to this\nusage pattern.\n<\/p>\n<p>\nI don&#8217;t know whether there is a standard name for this thing,\nso I&#8217;m just going to call it a <code>StringPool<\/code>.\nA string pool has the following characteristics:\n<\/p>\n<ul>\n<li>Once you allocate a string, you can&#8217;t modify or free it\n    as long as the pool remains in existence.<\/li>\n<li>If you destroy the string pool, all the strings in it are destroyed.<\/li>\n<\/ul>\n<p>\nWe implement it by using the same type of fast allocator that\nthe CLR uses:  A single pointer.\n[25 May 2005: The blog server software corrupts the diagram, sorry.]\n<\/p>\n<table CELLPADDING=\"0\" CELLSPACING=\"0\">\n<tr>\n<td ALIGN=\"center\" STYLE=\"border: solid 1px black;width: 20em\">allocated<\/td>\n<td 
ALIGN=\"center\" STYLE=\"border: solid 1px black;border-left: none;width: 15em\">free<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td ALIGN=\"left\"><span>&uarr;<\/span><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td ALIGN=\"left\"><span>p<\/span><\/td>\n<\/tr>\n<\/table>\n<p>\nTo allocate memory, we just increment <code>p<\/code> by the number\nof bytes we need.\nIf we run out of memory, we just allocate a new block, point <code>p<\/code>\nto its start, and carve the memory out of the new block.\nDestroying the pool consists of freeing all the blocks.\n<\/p>\n<p>\nNote also that this memory arrangement has very good locality.\nInstead of scattering the strings all over the heap, they are\ncollected into one location.  Furthermore, they are stored in\nmemory <strong>in exactly the order we&#8217;re going to access them<\/strong>,\nwhich means no wasted page faults or cache lines.\n(Well, you don&#8217;t know that&#8217;s the order we&#8217;re going to access them,\nbut it&#8217;s true.\nThis is one of those\n&#8220;<a HREF=\"http:\/\/blogs.msdn.com\/oldnewthing\/archive\/2005\/05\/11\/416430.aspx\">performance-guided designs<\/a>&#8221;\nI mentioned a little while ago.)\n<\/p>\n<pre>\n<font COLOR=\"blue\">class StringPool\n{\npublic:\n StringPool();\n ~StringPool();\n LPWSTR AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd);\nprivate:\n union HEADER {\n  struct {\n   HEADER* m_phdrPrev;\n   SIZE_T  m_cb;\n  };\n  WCHAR alignment;\n };\n enum { MIN_CBCHUNK = 32000,\n        MAX_CHARALLOC = 1024*1024 };\nprivate:\n WCHAR*  m_pchNext;   \/\/ first available byte\n WCHAR*  m_pchLimit;  \/\/ one past last available byte\n HEADER* m_phdrCur;   \/\/ current block\n DWORD   m_dwGranularity;\n};<\/font> \/\/ colorization fixed 25 May\n<\/pre>\n<p>\nEach block of memory we allocate begins with a\n<code>StringPool::HEADER<\/code> structure, which we use\nto maintain a linked list of blocks as well as providing enough\ninformation for us to free the block when we&#8217;re 
done.\n<\/p>\n<p>\n<strong>Exercise<\/strong>: Why is <code>HEADER<\/code> a union\ncontaining a structure rather than just being a structure?\nWhat is the significance of the <code>alignment<\/code> member?\n<\/p>\n<pre>\n<font COLOR=\"blue\">inline DWORD RoundUp(DWORD cb, DWORD units)\n{\n    return ((cb + units - 1) \/ units) * units;\n}\nStringPool::StringPool()\n : m_pchNext(NULL), m_pchLimit(NULL), m_phdrCur(NULL)\n{\n SYSTEM_INFO si;\n GetSystemInfo(&amp;si);\n m_dwGranularity = RoundUp(sizeof(HEADER) + MIN_CBCHUNK,\n                           si.dwAllocationGranularity);\n}<\/font>\n<\/pre>\n<p>\nAt construction, we compute the size of our chunks.\nWe base it on the system allocation granularity, choosing\nthe smallest multiple of that granularity\nthat is at least <code>sizeof(HEADER) + MIN_CBCHUNK<\/code> in size.\nSince a chunk is supposed to be a comfortably large block of\nmemory, we need to enforce a minimum chunk size to avoid having\nan enormous number of tiny chunks if we happen to be running on\na machine with a very fine allocation granularity.\n<\/p>\n<pre>\n<font COLOR=\"blue\">LPWSTR StringPool::AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd)\n{\n size_t cch = pszEnd - pszBegin + 1;\n LPWSTR psz = m_pchNext;\n if (m_pchNext + cch &lt;= m_pchLimit) {\n  m_pchNext += cch;\n  lstrcpynW(psz, pszBegin, cch);\n  return psz;\n }\n if (cch &gt; MAX_CHARALLOC) goto OOM;\n DWORD cbAlloc = RoundUp(cch * sizeof(WCHAR) + sizeof(HEADER),\n                                                          m_dwGranularity);\n BYTE* pbNext = reinterpret_cast&lt;BYTE*&gt;(\n                  VirtualAlloc(NULL, cbAlloc, MEM_COMMIT, PAGE_READWRITE));\n if (!pbNext) {\nOOM:\n  static std::bad_alloc OOM;\n  throw(OOM);\n }\n m_pchLimit = reinterpret_cast&lt;WCHAR*&gt;(pbNext + cbAlloc);\n HEADER* phdrCur = reinterpret_cast&lt;HEADER*&gt;(pbNext);\n phdrCur-&gt;m_phdrPrev = m_phdrCur;\n phdrCur-&gt;m_cb = cbAlloc;\n m_phdrCur = phdrCur;\n m_pchNext = 
reinterpret_cast&lt;WCHAR*&gt;(phdrCur + 1);\n return AllocString(pszBegin, pszEnd);\n}<\/font>\n<\/pre>\n<p>\nTo allocate a string, we first try to carve it out of the\nremainder of the current chunk.  This nearly always succeeds.\n<\/p>\n<p>\nIf the string doesn&#8217;t fit in the chunk, we allocate a new chunk\nbased on our allocation granularity.\nTo avoid integer overflow in the computation of the desired\nchunk size, we check against a fixed &#8220;maximum allocation&#8221; and\ngo straight to the out-of-memory handler if it&#8217;s too big.\n<\/p>\n<p>\nOnce we have a new chunk, we link it into our list of\n<code>HEADER<\/code>s and abandon the old chunk.\n(Yes, this wastes some memory, but in our usage pattern,\nit&#8217;s not much, and trying to squeeze out those last few bytes\nisn&#8217;t worth the added complexity.)\nOnce that&#8217;s done, we try to allocate again; this second time\nwill certainly succeed since we made sure the new chunk was big\nenough.  (And any decent compiler will detect this as a tail\nrecursion and turn it into a &#8220;goto&#8221;.)\n<\/p>\n<p>\nThere is a subtlety here.  Notice that we do not update\n<code>m_pchNext<\/code> until after we&#8217;re sure we either\nsatisfied the allocation or allocated a new chunk.\nThis ensures that our member variables are stable at the points\nwhere exceptions can be thrown.\nWriting exception-safe code is hard, and\nseeing the difference between code that is and isn&#8217;t exception\nsafe is often quite difficult.\n<\/p>\n<pre>\n<font COLOR=\"blue\">StringPool::~StringPool()\n{\n HEADER* phdr = m_phdrCur;\n while (phdr) {\n  HEADER hdr = *phdr;\n  VirtualFree(phdr, 0, MEM_RELEASE);\n  phdr = hdr.m_phdrPrev;\n }\n}<\/font>\n<\/pre>\n<p>\nTo destroy the string pool, we walk our list of chunks and\nfree each one.  
Note the importance of copying the <code>HEADER<\/code>\nout of the chunk before we free it!\n<\/p>\n<p>\nUsing this string pool requires only small changes to the\nrest of our program.\n<\/p>\n<pre>\nstruct DictionaryEntry\n{\n bool Parse(const WCHAR *begin, const WCHAR *end, <font COLOR=\"blue\">StringPool&amp; pool<\/font>);\n<font COLOR=\"blue\"><strike> \/\/ void Destruct() {\n \/\/  delete[] m_pszTrad;\n \/\/  delete[] m_pszSimp;\n \/\/  delete[] m_pszPinyin;\n \/\/  delete[] m_pszEnglish;\n \/\/ }<\/strike><\/font>\n LPWSTR m_pszTrad;\n LPWSTR m_pszSimp;\n LPWSTR m_pszPinyin;\n LPWSTR m_pszEnglish;\n};\nclass Dictionary\n{\npublic:\n Dictionary();\n <font COLOR=\"blue\"><strike>\/\/ ~Dictionary();<\/strike><\/font>\n int Length() { return v.size(); }\nprivate:\n vector&lt;DictionaryEntry&gt; v;\n <font COLOR=\"blue\">StringPool m_pool;<\/font>\n};\n<font COLOR=\"blue\"><strike>\/\/ Dictionary::~Dictionary()\n\/\/ {\n\/\/  for (vector&lt;DictionaryEntry&gt;::iterator i = v.begin();\n\/\/       i != v.end(); i++) {\n\/\/   i-&gt;Destruct();\n\/\/  }\n\/\/ }<\/strike><\/font>\n<\/pre>\n<p>\nWe no longer need to free the strings in the <code>DictionaryEntry<\/code>\nmanually, so the <code>Destruct<\/code> method and the\n<code>Dictionary<\/code> destructor can go.\n<\/p>\n<pre>\nbool DictionaryEntry::Parse(\n   const WCHAR *begin, const WCHAR *end,\n   <font COLOR=\"blue\">StringPool&amp; pool<\/font>)\n{\n const WCHAR* pch = std::find(begin, end, L' ');\n if (pch &gt;= end) return false;\n m_pszTrad = <font COLOR=\"blue\">pool.<\/font>AllocString(begin, pch);\n begin = std::find(pch, end, L'[') + 1;\n if (begin &gt;= end) return false;\n pch = std::find(begin, end, L']');\n if (pch &gt;= end) return false;\n m_pszPinyin = <font COLOR=\"blue\">pool.<\/font>AllocString(begin, pch);\n begin = std::find(pch, end, L'\/') + 1;\n if (begin &gt;= end) return false;\n for (pch = end; *--pch != L'\/'; ) { }\n if (begin &gt;= pch) return false;\n m_pszEnglish = <font 
COLOR=\"blue\">pool.<\/font>AllocString(begin, pch);\n return true;\n}\nDictionary::Dictionary()\n{\n ...\n    if (de.Parse(buf, buf + cchResult, <font COLOR=\"blue\">m_pool<\/font>)) {\n ...\n}\n<\/pre>\n<p>\nAnd finally, we pass our string pool to\n<code>DictionaryEntry::Parse<\/code> so it knows where\nto get memory for its strings from.\n<\/p>\n<p>\nWith these changes, the dictionary loads in 70ms\n(or 80ms if you include the time it takes to destroy the\ndictionary).\nThis is 70% faster than the previous version,\nand over three times as fast if you include the destruction time.\n<\/p>\n<p>\nAnd now that we&#8217;ve reached our 100ms goal, it&#8217;s a good time to stop.\nWe&#8217;ve gotten the running time of dictionary loading down from\nan uncomfortable 2080ms to a peppier 70ms, a nearly 30-fold improvement,\nby repeatedly profiling and focusing on where the most time is\nbeing spent.\n(I have some more simple tricks that shave a few\nmore milliseconds off the startup time.\nPerhaps I&#8217;ll bring them into play if other changes to startup\npush us over the 100ms boundary.\nAs things stand, the largest CPU consumers are\n<code>MultiByteToWideChar<\/code> and <code>lstrcpynW<\/code>,\nso that&#8217;s where I would focus next.)\n<\/p>\n<p>\nThat&#8217;s the end of the first stage.  
The next stage will be\ndisplaying the dictionary in an owner-data listview, but you&#8217;ll\nhave to wait until next month.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Designing a memory allocator that exploits our memory allocation pattern.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-35603","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Designing a memory allocator that exploits our memory allocation pattern.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/35603","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=35603"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/35603\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=35603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=35603"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=35603"}],"curies":[{"name":"wp","href"
:"https:\/\/api.w.org\/{rel}","templated":true}]}}