{"id":35643,"date":"2005-05-13T09:04:26","date_gmt":"2005-05-13T09:04:26","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2005\/05\/13\/loading-the-dictionary-part-3-breaking-the-text-into-lines\/"},"modified":"2005-05-13T09:04:26","modified_gmt":"2005-05-13T09:04:26","slug":"loading-the-dictionary-part-3-breaking-the-text-into-lines","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050513-26\/?p=35643","title":{"rendered":"Loading the dictionary, part 3:  Breaking the text into lines"},"content":{"rendered":"<p>\nEven after moving the character conversion out of the\n<code>getline<\/code> function, profiling reveals that\n<code>getline<\/code> is still taking nearly 50% of our CPU.\nThe fastest code is code that isn&#8217;t there, so let&#8217;s get rid of\n<code>getline<\/code> altogether.  Oh wait, we still need to break\nthe file into lines.\nBut maybe we can break the file into lines faster than\n<code>getline<\/code> did.\n<\/p>\n<pre>\n<font COLOR=\"blue\">#include &lt;algorithm&gt;\nclass MappedTextFile\n{\npublic:\n MappedTextFile(LPCTSTR pszFile);\n ~MappedTextFile();\n const CHAR *Buffer() { return m_p; }\n DWORD Length() const { return m_cb; }\nprivate:\n PCHAR   m_p;\n DWORD   m_cb;\n HANDLE  m_hf;\n HANDLE  m_hfm;\n};\nMappedTextFile::MappedTextFile(LPCTSTR pszFile)\n    : m_hfm(NULL), m_p(NULL), m_cb(0)\n{\n m_hf = CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,\n                   NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);\n if (m_hf != INVALID_HANDLE_VALUE) {\n  DWORD cb = GetFileSize(m_hf, NULL);\n  m_hfm = CreateFileMapping(m_hf, NULL, PAGE_READONLY, 0, 0, NULL);\n  if (m_hfm != NULL) {\n   m_p = reinterpret_cast&lt;PCHAR&gt;\n                 (MapViewOfFile(m_hfm, FILE_MAP_READ, 0, 0, cb));\n   if (m_p) {\n    m_cb = cb;\n   }\n  }\n }\n}\nMappedTextFile::~MappedTextFile()\n{\n if (m_p) UnmapViewOfFile(m_p);\n if (m_hfm) CloseHandle(m_hfm);\n if (m_hf != 
INVALID_HANDLE_VALUE) CloseHandle(m_hf);\n}<\/font>\n<\/pre>\n<p>\nThis very simple class babysits a read-only memory-mapped file.\n(Yes, there is a bit of oddness with files greater than 4GB,\nbut let&#8217;s ignore that for now, since it&#8217;s a distraction from our\nmain point.)\n<\/p>\n<p>\nNow that the file is memory-mapped, we can just scan it directly.\n<\/p>\n<pre>\nDictionary::Dictionary()\n{\n <font COLOR=\"blue\">MappedTextFile mtf(TEXT(\"cedict.b5\"));<\/font>\n typedef std::codecvt&lt;wchar_t, char, mbstate_t&gt; widecvt;\n std::locale l(\".950\");\n const widecvt&amp; cvt = _USE(l, widecvt); \/\/ use_facet&lt;widecvt&gt;(l);\n <font COLOR=\"blue\">const CHAR* pchBuf = mtf.Buffer();\n const CHAR* pchEnd = pchBuf + mtf.Length();\n while (pchBuf &lt; pchEnd) {\n  const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\\n');\n  if (*pchBuf != '#') {\n   size_t cchBuf = pchEOL - pchBuf;\n   wchar_t* buf = new wchar_t[cchBuf];<\/font>\n   mbstate_t state = 0;\n   const char* nextsrc;\n   wchar_t* nextto;\n   if (cvt.in(state, <font COLOR=\"blue\">pchBuf, pchEOL<\/font>, nextsrc,\n                   buf, buf + <font COLOR=\"blue\">cchBuf<\/font>, nextto) == widecvt::ok) {\n    wstring line(buf, nextto - buf);\n    DictionaryEntry de;\n    if (de.Parse(line)) {\n     v.push_back(de);\n    }\n   }\n   delete[] buf;\n  }\n  <font COLOR=\"blue\">pchBuf = pchEOL + 1;<\/font>\n }\n}\n<\/pre>\n<p>\nWe simply scan the memory-mapped file for a <code>'\\n'<\/code>\ncharacter, which tells us where the line ends.\nThat gives us the location and length of the line,\nwhich we use to convert it to Unicode and continue our parsing.\n<\/p>\n<p><strong>Exercise<\/strong>: Why don&#8217;t we have to worry about\n<a href=\"http:\/\/blogs.msdn.com\/oldnewthing\/archive\/2004\/03\/18\/91899.aspx\">the carriage\nreturn that comes before the linefeed<\/a>?\n<\/p>\n<p>\n<strong>Exercise<\/strong>: Why don&#8217;t we have to worry about\npossibly reading past the end of the file when we 
check\n<code>*pchBuf != '#'<\/code>?\n<\/p>\n<p>\nWith this change, the program now loads the dictionary in 480ms\n(or 550ms if you include the time it takes to destroy the\ndictionary).  That&#8217;s over twice as fast as the previous version.<\/p>\n<p>\nBut it&#8217;s still not fast enough.  A half-second delay between hitting\n<code>Enter<\/code> and getting the visual feedback is still\nunsatisfying.  We can do better.\n<\/p>\n<p>\n[Raymond is currently on vacation; this message was pre-recorded.]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Even after moving the character conversion out of the getline function, profiling reveals that getline is still taking nearly 50% of our CPU. The fastest code is code that isn&#8217;t there, so let&#8217;s get rid of getline altogether. Oh wait, we still need to break the file into lines. But maybe we can break the [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-35643","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Even after moving the character conversion out of the getline function, profiling reveals that getline is still taking nearly 50% of our CPU. The fastest code is code that isn&#8217;t there, so let&#8217;s get rid of getline altogether. Oh wait, we still need to break the file into lines. 
But maybe we can break the [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/35643","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=35643"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/35643\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=35643"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=35643"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=35643"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}