{"id":109742,"date":"2024-05-10T07:00:00","date_gmt":"2024-05-10T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=109742"},"modified":"2024-05-13T14:23:38","modified_gmt":"2024-05-13T21:23:38","slug":"an-informal-comparison-of-the-three-major-implementations-of-stdstring","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20240510-00\/?p=109742","title":{"rendered":"An informal comparison of the three major implementations of <CODE>std::string<\/CODE>"},"content":{"rendered":"<p>[Note: This article has been updated since original publication.]<\/p>\n<p>We saw some time ago that <a title=\"Inside STL: The string\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20230803-00\/?p=108532\"> the three major implementations of <code>std::string<\/code> are all quite different<\/a>. To summarize:<\/p>\n<pre>\/\/ gcc\r\nstruct string\r\n{\r\n    char* ptr;\r\n    size_t size;\r\n    union {\r\n        size_t capacity;\r\n        char buf[16];\r\n    };\r\n\r\n    bool is_large() { return ptr != buf; }\r\n    auto data() { return ptr; }\r\n    auto size() { return size; }\r\n    auto capacity() { return is_large() ? capacity : 15; }\r\n};\r\n\r\n\/\/ msvc\r\nstruct string\r\n{\r\n    union {\r\n        char* ptr;\r\n        char buf[16];\r\n    };\r\n    size_t size;\r\n    size_t capacity;\r\n\r\n    bool is_large() { return capacity &gt; 15; }\r\n    auto data() { return is_large() ? ptr : buf; }\r\n    auto size() { return size; }\r\n    auto capacity() { return capacity; }\r\n};\r\n\r\n\/\/ clang\r\nunion string\r\n{\r\n    struct {\r\n        size_t capacity;\r\n        size_t size;\r\n        char* ptr;\r\n    } large;\r\n\r\n    struct {\r\n        unsigned char is_large:1;\r\n        unsigned char size:7;\r\n        char buf[sizeof(large) - 1];\r\n    } small;\r\n\r\n    bool is_large() { return small.is_large; }\r\n    auto data() { return is_large() ? large.ptr : small.buf; }\r\n    auto size() { return is_large() ? large.size : small.size; }\r\n    auto capacity() { return is_large() ? large.capacity : sizeof(large) - 2; }\r\n};\r\n<\/pre>\n<p>Note that these implementations above are for expository purposes only. The actual implementations are far more complicated. (For example, the real implementations are uglified, and they have to store the allocator somewhere.)<\/p>\n<p><b>Update<\/b>: In the original version of this article, I got the sense of the &#8220;small\/large&#8221; bit backward in the clang implementation. This in turn led to redoing the code generation and new code golfing results.<\/p>\n<p>We&#8217;ll compare these versions based on the complexity of some commonly-used operations.<\/p>\n<p>Detecting whether the string is small or large is a single member comparison with msvc and clang, but on gcc, it involves comparing a member against the address of another member, so it will take an extra instruction to calculate that address.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc is_large<\/th>\n<th>msvc is_large<\/th>\n<th>clang is_large<\/th>\n<\/tr>\n<tr>\n<td valign=\"baseline\"><code>lea rdx, [rcx].buf<\/code><br \/>\n<code>cmp rdx, [rcx].ptr<\/code><br \/>\n<code>jnz large<\/code><\/td>\n<td valign=\"baseline\"><code>cmp [rcx].capacity, 15<\/code><br \/>\n<code>ja large<\/code><\/td>\n<td valign=\"baseline\"><code>test [rcx].is_large, 1<\/code><br \/>\n<code>jnz large<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note: gcc could have shaved an instruction by reordering the members so that the <code>buf<\/code> comes first (thereby avoiding the need to calculate its address). On the other hand, it increases the cost of accessing <code>ptr<\/code> on some processors: On the x86 family, it forces a larger encoding because the offset is nonzero. On the Itanium, it requires two instructions because the Itanium cannot perform an offset load in a single instruction. On most other processors, the offset can be folded into the load instruction at no extra cost. My guess is that gcc biased their design to optimize for x86.<\/p>\n<p>On the other hand, gcc wins the race to access the <code>data()<\/code>, since the <code>ptr<\/code> is always valid, and that&#8217;s probably why they chose their design.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc data()<\/th>\n<th>msvc data()<\/th>\n<th>clang data()\u00b9<\/th>\n<\/tr>\n<tr>\n<td valign=\"baseline\"><code>mov rdx, [rcx].ptr<\/code><\/td>\n<td valign=\"baseline\"><code>lea rdx, [rcx].buf<\/code><br \/>\n<code>cmp [rcx].capacity, 15<\/code><br \/>\n<code>cmova rdx, [rdx]<\/code><\/td>\n<td valign=\"baseline\"><code>lea rdx, [rcx].small.buf<\/code><br \/>\n<code>test [rcx].small.is_large, 1<\/code><br \/>\n<code>jz @1<\/code><br \/>\n<code>mov rdx, [rcx].large.ptr<\/code><br \/>\n<code>@1:<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The clang implementation also has extra work to calculate the size.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc size()<\/th>\n<th>msvc size()<\/th>\n<th>clang size()\u00b2<\/th>\n<\/tr>\n<tr>\n<td valign=\"baseline\"><code>mov rdx, [rcx].size<\/code><\/td>\n<td valign=\"baseline\"><code>mov rdx, [rcx].size<\/code><\/td>\n<td valign=\"baseline\"><code>movzx eax, [rcx].small.is_large<\/code><br \/>\n<code>test al, 1<\/code><br \/>\n<code>jz @1<\/code><br \/>\n<code>mov rax, [rcx].large.size<\/code><br \/>\n<code>jmp @2<\/code><br \/>\n<code>@1: shr eax, 1<\/code><br \/>\n<code>@2:<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A special case of checking the size is checking whether the string is empty.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc empty()<\/th>\n<th>msvc empty()<\/th>\n<th>clang empty()\u00b3<\/th>\n<\/tr>\n<tr>\n<td valign=\"baseline\"><code>cmp [rcx].size, 0<\/code><br \/>\n<code>jz empty<\/code><\/td>\n<td valign=\"baseline\"><code>cmp [rcx].size, 0<\/code><br \/>\n<code>jz empty<\/code><\/td>\n<td valign=\"baseline\"><code>movzx eax, [rcx].small.is_large<\/code><br \/>\n<code>test al, 1<\/code><br \/>\n<code>jz @1<\/code><br \/>\n<code>mov rax, [rcx].large.size<\/code><br \/>\n<code>jmp @2<\/code><br \/>\n<code>@1: shr eax, 1<\/code><br \/>\n<code>@2: test rax, rax<\/code><br \/>\n<code>jz empty<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The capacity comes into play behind the scenes when extending the string. For example, <code>append(char)<\/code> can do a fast-append if there is excess capacity, and delegate to a function call if the capacity needs to be increased. Here, msvc has an edge.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc capacity()<\/th>\n<th>msvc capacity()<\/th>\n<th>clang capacity()<\/th>\n<\/tr>\n<tr>\n<td style=\"font-size: 90%;\" valign=\"baseline\"><code>lea rax, [rcx].buf<\/code><br \/>\n<code>cmp rax, [rcx].ptr<\/code><br \/>\n<code>je @1<\/code><br \/>\n<code>mov rax, [rcx].large.capacity<\/code><br \/>\n<code>jmp @2<\/code><br \/>\n<code>@1: mov eax, 15<\/code><br \/>\n<code>@2:<\/code><\/td>\n<td style=\"font-size: 90%;\" valign=\"baseline\"><code>mov rax, [rcx].capacity<\/code><\/td>\n<td style=\"font-size: 90%;\" valign=\"baseline\"><code>test [rcx].small.is_large, 1<\/code><br \/>\n<code>mov eax, 22<\/code><br \/>\n<code>jz @1<\/code><br \/>\n<code>mov rax, [rcx].large.capacity<\/code><br \/>\n<code>@1:<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The clang implementation does have an edge in terms of memory usage: Despite an overall smaller size, it has a larger small-string capacity in 64-bit mode.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>sizeof \/ SSO capacity<\/th>\n<th>gcc<\/th>\n<th>msvc<\/th>\n<th>clang<\/th>\n<\/tr>\n<tr>\n<td>32-bit mode<\/td>\n<td align=\"right\">24 \/ 15<\/td>\n<td align=\"right\">24 \/ 15<\/td>\n<td align=\"right\">12 \/ 11<\/td>\n<\/tr>\n<tr>\n<td>64-bit mode<\/td>\n<td align=\"right\">32 \/ 15<\/td>\n<td align=\"right\">32 \/ 15<\/td>\n<td align=\"right\">24 \/ 22<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If you <code>reserve()<\/code> a lot of space for a string, but use only a little bit of it, and then call <code>shrink_<wbr \/>to_<wbr \/>fit()<\/code>, you can potentially get into a mixed state where the string is allocated externally (as if it were a large string), even though the capacity is smaller than the capacity of a small string.<\/p>\n<p>The msvc implementation uses the capacity to detect whether it is using the small string optimization, so this mixed state is illegal for msvc, and it must convert large strings to small strings if <code>shrink_<wbr \/>to_<wbr \/>fit()<\/code> shrinks the string below the small-string threshold.<\/p>\n<p>The gcc and clang implementations allow external allocations to have a small capacity. Nevertheless, both gcc and clang force the conversion of externally-allocated strings to small strings if they shrink below the small-string threshold.<\/p>\n<p><b>Update<\/b>: A previous version of this article erroneously said that <code>shrink_<wbr \/>to_<wbr \/>fit()<\/code> is a nop on gcc.<\/p>\n<p>One final point of comparison is how the three implementations deal with static initialization.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>gcc<\/th>\n<th>msvc<\/th>\n<th>clang<\/th>\n<\/tr>\n<tr>\n<td><code>{ buf, 0, { 0 } }<\/code><\/td>\n<td><code>{ { 0 }, 0, 15 }<\/code><\/td>\n<td><code>{ 0, 0, 0, ... }<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A statically-initialized empty string in gcc consists of a pointer to the internal buffer, a constant 0 (size), and a bunch of zeroes (buf). The presence of a pointer introduces a relocation into the data segment and <a title=\"Just how constexpr is C++20's std::string?\" href=\"https:\/\/quuxplusone.github.io\/blog\/2023\/09\/08\/constexpr-string-firewall\/\"> silently messes up string&#8217;s <code>constexpr<\/code>-ness<\/a>.<\/p>\n<p>Statically-initialized empty strings in msvc and clang consist of integer constant data; no pointers. This means no relocations and a better shot at <code>constexpr<\/code>-ness.<\/p>\n<p>Okay, so let&#8217;s summarize all this information into a table.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>\u00a0<\/th>\n<th>gcc<\/th>\n<th>msvc<\/th>\n<th>clang<\/th>\n<\/tr>\n<tr>\n<th>is_large<\/th>\n<td>slower<\/td>\n<td>faster<\/td>\n<td>faster<\/td>\n<\/tr>\n<tr>\n<th>data()<\/th>\n<td>fast<\/td>\n<td>slower<\/td>\n<td>slower<\/td>\n<\/tr>\n<tr>\n<th>size()<\/th>\n<td>fast<\/td>\n<td>fast<\/td>\n<td>much slower<\/td>\n<\/tr>\n<tr>\n<th>empty()<\/th>\n<td>fast<\/td>\n<td>fast<\/td>\n<td>much slower\u00b3<\/td>\n<\/tr>\n<tr>\n<th>capacity()<\/th>\n<td>slowest<\/td>\n<td>fast<\/td>\n<td>slower<\/td>\n<\/tr>\n<tr>\n<th>32-bit size<\/th>\n<td>24<\/td>\n<td>24<\/td>\n<td>12<\/td>\n<\/tr>\n<tr>\n<th>64-bit size<\/th>\n<td>32<\/td>\n<td>32<\/td>\n<td>24<\/td>\n<\/tr>\n<tr>\n<th>32-bit SSO capacity<\/th>\n<td>15<\/td>\n<td>15<\/td>\n<td>11<\/td>\n<\/tr>\n<tr>\n<th>64-bit SSO capacity<\/th>\n<td>15<\/td>\n<td>15<\/td>\n<td>22<\/td>\n<\/tr>\n<tr>\n<th>ABI supports mixed state?<\/th>\n<td>yes<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<\/tr>\n<tr>\n<th>implementation uses mixed state<\/th>\n<td>no<\/td>\n<td>forbidden<\/td>\n<td>no<\/td>\n<\/tr>\n<tr>\n<th>Static initialization<\/th>\n<td>relocation<\/td>\n<td>no relocation<\/td>\n<td>no relocation<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\u00b9 I don&#8217;t see clang generating this slightly smaller alternative<\/p>\n<pre>lea rdx, [rcx].small.buf\r\ntest [rcx].small.is_large, 1\r\ncmovnz rdx, [rcx].large.ptr\r\n<\/pre>\n<p>perhaps because the <code>cmov<\/code> instruction always reads from its second parameter even if the value is not used, and there might be a store-to-load forwarding penalty because in the case of a small string, the read is unlikely to match the size of the previous write.<\/p>\n<p>\u00b2 I don&#8217;t see clang generating this slightly smaller alternative<\/p>\n<pre>movzx eax, [rcx].small.is_large\r\nshr eax, 1\r\njnc @1\r\nmov rax, [rcx].large.size\r\n@1:\r\n<\/pre>\n<p>probably because &#8220;shift right and look at carry&#8221; is not something natively expressible in C++. If you really went for it, you could also fold in a <code>cmov<\/code>.<\/p>\n<pre>movzx eax, [rcx].small.is_large\r\nshr eax, 1\r\ncmovc rax, [rcx].large.size\r\n<\/pre>\n<p>\u00b3 <span style=\"text-decoration: line-through;\">My hand-golfed version of clang <code>empty()<\/code> brings the performance of clang <code>empty()<\/code> to be comparable to gcc and msvc:<\/span><\/p>\n<pre><span style=\"text-decoration: line-through;\">cmp byte ptr [rcx].small.is_large, 0<\/span>\r\n<span style=\"text-decoration: line-through;\">jz empty<\/span>\r\n<\/pre>\n<p><span style=\"text-decoration: line-through;\">The trick here is that since clang always uses SSO when possible (no mixed state), the <code>is_large<\/code> is sufficient to tell us whether the string is empty. A large string is never empty, and it will have the <code>is_large<\/code> bit set, so a large string will never have zero as its initial byte. A small string has the <code>is_large<\/code> bit clear and the string size in the remaining bits, so comparing the entire byte against zero tests the size and the <code>is_large<\/code> bit simultaneously.<\/span><\/p>\n<p>While it&#8217;s true that clang always uses SSO when possible, it&#8217;s still a valid state for a large string to be empty, because it might have a large capacity but no contents. So we just have to get the size and test it against zero. (Though we can cheat and omit the shift-right since zero shifted left or right by one is still zero.)<\/p>\n<pre>movzx eax, [rcx].small.is_large\r\ntest al, 1\r\ncmovnz rax, [rcx].large.size\r\ntest rax, rax\r\njz empty\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Pros and cons.<\/p>\n","protected":false},"author":1069,"featured_media":100998,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-109742","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Pros and cons.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109742","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=109742"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109742\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/100998"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=109742"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=109742"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=109742"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}