January 19th, 2006

The cost of trying too hard: String searching

There are many algorithms for fast string searching, but the running of a string search is inherently O(n), where n is the length of the string being searched: If m is the length of the string being searched for (which I will call the “target string”), then any algorithm that accesses fewer than n/m elements of the string being searched will have a gap of m unaccessed elements, which is enough room to hide the target string. More advanced string searching algorithms can take advantage of characteristics of the target string, but in the general case, where the target string is of moderate size and is not pathological, all that the fancy search algorithms give you over the naive search algorithm is a somewhat smaller multiplicative constant. In the overwhelming majority of cases, then, a naive search algorithm is adequate. As long as you’re searching for normal strings and not edge cases like “Find aaaaaaaaaaaaaaab in the string aaaaaaaaaaaaaabaaaaaaaaaaaaaaab“. If you have a self-similar target string, the running time of a naive search is O(mn) where m is the length of the target string. The effort in the advanced searching algorithms goes towards diminishing the effect of m, but pay for it by requiring preliminary analysis of the target string. If your searches are for “relatively short” “normal” target strings, then the benefit of this analysis doesn’t merit the cost.

That’s why nearly all library functions that do string searching use the naive algorithm. The naive algorithm is the correct algorithm over 99% of the time.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.

Feedback