{"id":17913,"date":"2009-06-12T07:00:00","date_gmt":"2009-06-12T14:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2009\/06\/12\/a-concrete-illustration-of-practical-running-time-vs-big-o-notation\/"},"modified":"2009-06-12T07:00:00","modified_gmt":"2009-06-12T14:00:00","slug":"a-concrete-illustration-of-practical-running-time-vs-big-o-notation","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20090612-00\/?p=17913","title":{"rendered":"A concrete illustration of practical running time vs big-O notation"},"content":{"rendered":"<p><P>\nOne of the\n<A HREF=\"http:\/\/channel9.msdn.com\/Showpost.aspx?postid=116704\">\nfive things every Win32 programmer needs to know<\/A>\nis that memory latency\ncan throw your big-<I>O<\/I> computations out the window.\nBack in 2003, I ran into a concrete example of this.\n<\/P>\n<P>\nSomebody started with the algorithm presented in\n<A HREF=\"http:\/\/www.cs.princeton.edu\/~rs\/strings\/\">\n<I>Fast Algorithms for Sorting and Searching Strings<\/I><\/A>\nby\n<A HREF=\"http:\/\/www.cs.bell-labs.com\/cm\/cs\/pearls\/\">\nJon L.&nbsp;Bentley<\/A>\nand\n<A HREF=\"http:\/\/www.cs.princeton.edu\/~rs\/\">\nRobert Sedgewick<\/A>,\nimplemented it in C#, and compared the performance\nagainst a <CODE>HashTable<\/CODE>, <CODE>TernaryTree<\/CODE>\nand <CODE>SortedList<\/CODE>.\nSurprisingly, the hash table won on insertion and retrieval of\ntens of thousands of randomly-generated strings.\nWhy?\n<\/P>\n<P>\nRemember, big-<I>O<\/I> notation hides the constants,\nand those constants can get pretty big.\nWhat&#8217;s more, the impact of those constants is critical\nfor normal-sized workloads.\nThe big-<I>O<\/I> notation allows you to compare algorithms\nwhen the data sets become extremely large,\nbut you have to keep the constants in mind\nto see when the balance tips.\n<\/P>\n<P>\nThe central point of my presentation at the PDC was\nthat complexity analysis typically ignores memory bandwidth effects\nand assumes that all memory accesses perform equally.\nThis is rarely true in practice.\nAs we saw, leaving L2 is a big hit on performance,\nand accessing the disk is an even greater hit.\n<\/P>\n<P>\nThe tree doesn&#8217;t rebalance,\nso inserting strings in alphabetical order\nwill result in a bad search tree.\n(They do call this out in their paper.)\nTo locate a <I>k<\/I>-character string,\nBentley-Sedgewick traverses at least <I>k<\/I> pointers, usually more.\n(&#8220;How much more&#8221; depends on how many prefixes are shared.\nMore shared prefixes = more pointers traversed.)\nIt also requires <I>k<\/I>(<I>4p<\/I>) bytes of memory to store that string,\nwhere <I>p<\/I> is the size of a pointer.\nRemember those pesky constants.\nHigh constant factor overhead starts to kill you\nwhen you have large datasets.\n<\/P>\n<P>\nMore on those constants:\nComplexity analysis assumes that\nan add instruction executes in the same amount of time as a memory access.\nThis is not true in practice,\nbut the difference is a bounded constant factor,\nso it can be ignored for big-<I>O<\/I> purposes.\nNote, however, that that constant often exceeds\none million if you take a page fault.\nOne million is a big constant.\n<\/P>\n<P>\nGoing back to memory bandwidth effects:\nAt each node, you access one character and one pointer.\nSo you use only 6 bytes out of a 64-byte cache line.\nYou&#8217;re wasting 90% of your bus bandwidth and you will certainly fall out of L2.\n<\/P>\n<P>\nBentley-Sedgewick says that this is beaten out\nby not traversing the entire string being searched for in the case of a miss.\n<I>I.e., their algorithm is tuned for misses<\/I>.\nIf you expect most of your probes to be misses, this can be a win.\n(The entire string is traversed on a hit,\nof course, so there is no gain for hits.)\n<\/P>\n<P>\nNote also that this &#8220;cost&#8221; for traversing the string on a miss\nis overstated due to memory bandwidth effects.\nThe characters in a string are contiguous,\nso traversing the string costs you only <I>L<\/I>\/64 cache lines,\nwhere <I>L<\/I> is the length of the string,\nand one potential page fault,\nassuming your string is less than 4KB.\nTraversing the tree structure costs you at least <I>L<\/I> cache lines\nand probably more depending on your branching factor,\nas well as <I>L<\/I> potential page faults.\n<\/P>\n<P>\nLet&#8217;s look at the testing scenario again.\nThe testing was only on hits,\nso the improved performance on misses was overlooked entirely.\nWhat&#8217;s more,\nthe algorithm takes advantage of strings with common prefixes,\nbut the testing scenario used randomly-generated strings,\nwhich generates a data set opposite from the one the algorithm\nwas designed for,\nsince randomly-generated strings are spread out across the problem space\ninstead of being clustered with common prefixes.\n<\/P>\n<P>\nThose are some general remarks; here are some performance notes\nspecific to the CLR.\n<P>\nI don&#8217;t know whether it does or not, but\nI would not be surprised if <CODE>System.String.GetHashCode<\/CODE>\ncaches the hash value in the string,\nwhich would mean that the cost of computing the hash\nis shared by everybody who uses it in a hashing operation.\n(Therefore, if you count the cost incurred only by the algorithm,\nhashing is free.)\n<P>\nNote also that Bentley-Sedgewick&#8217;s <CODE>insert()<\/CODE> function\nstores the object back into the tree in the recursive case.\nMost of the time, the value being stored is the same\nas the value that was already there.\nThis dirties the cache line for no reason\n(forcing memory bandwidth to be wasted on a flush)\nand&mdash;particularly\n<A HREF=\"http:\/\/msdn2.microsoft.com\/en-us\/library\/ms973837.aspx\">\npainful for the CLR<\/A>&mdash;you hit\nthe\n<A HREF=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.54.1655\">\nwrite barrier<\/A>\nand end up dirting a whole boatload of\n<A HREF=\"http:\/\/portal.acm.org\/citation.cfm?doid=66068.66077\">\ncards<\/A>.\nA very small change avoids this problem:\nChange\n<\/P>\n<PRE>\n    p-&gt;eqkid = insert(p-&gt;eqkid, ++s); \n<\/PRE>\n<P>\nto\n<\/P>\n<PRE>\n    Tptr newkid = insert(p-&gt;eqkid, ++s);\n    if (p-&gt;eqkid != newkid) p-&gt;eqkid = newkid;\n<\/PRE>\n<P>\n(and similarly in the other branches).\n&#8220;This code is short but subtle, and worth careful study.&#8221;  How very true.\n<\/P>\n<P>\nNote also that if you use their &#8220;sleazy hack&#8221; of\ncoercing a string to a <CODE>Tptr<\/CODE>,\nyou had to have changed the type of <CODE>eqkid<\/CODE> from\n<CODE>Tptr<\/CODE> to <CODE>object<\/CODE>.\nThis introduces a CLR type-check into the inner loop.\nCongratulations, you just tubed the inner loop performance.\n<\/P>\n<P>\nNow go to the summary at the end of the article.\n<OL>\n<LI>&#8220;Ternary trees do not incur extra overhead\n    for insertion or successful searches.&#8221;\n    I&#8217;m not sure what &#8220;extra&#8221; means here,\n    but hash tables have the same behavior.\n<LI>&#8220;Ternary trees are usually substantially faster\n    than hashing for unsuccessful searches.&#8221;\n    Notice that they are optimized for misses.\n<LI>&#8220;Ternary trees gracefully grow and shrink;\n    hash tables need to be rebuilt after large size changes.&#8221;\n    True, but the CLR hashtable does this so you\n    don&#8217;t have to. Somebody wrote it for you.\n<LI>&#8220;Ternary trees support advanced searches\n    such as partial-match and near-neighbor search.&#8221;\n    These operations weren&#8217;t tested.\n<LI>&#8220;Ternary trees support many other operations,\n    such as traversal to report items in sorted order.&#8221;\n    These operations weren&#8217;t tested either.\n<\/OL>\n<P>\nNotice that the article doesn&#8217;t claim that ternary trees\nare faster than hashing for successful searches.\nSo if that&#8217;s all you&#8217;re testing, you&#8217;re testing the wrong thing.\nOne of the big benefits of ternary trees\nis the new operations available (4 and 5),\nbut if you&#8217;re not going to perform those operations,\nthen you ended up paying for something you don&#8217;t use.\n<\/P>\n<P>\nThere are several morals of the story.\n<\/P>\n<OL>\n<LI>Constants are important.\n<LI>Memory bandwith is important.\n<LI>Performance optimizations for unmanged code\n    do not necessarily translate to managed code.\n<LI>What are you really testing?\n<\/OL>\n<P>\nMind you, Bentley and Sedgewick are not morons.  They know all this.\n<\/P>\n<P>\n[Typo fixed 11am, thanks Nathan_works and Jachym Kouba.]\n<\/P><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Memory access time changes the math.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-17913","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Memory access time changes the math.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/17913","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=17913"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/17913\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=17913"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=17913"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=17913"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}