{"id":103749,"date":"2020-05-14T07:00:00","date_gmt":"2020-05-14T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=103749"},"modified":"2020-05-14T10:17:45","modified_gmt":"2020-05-14T17:17:45","slug":"20200514-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20200514-00\/?p=103749","title":{"rendered":"Inside <CODE>std::function<\/CODE>, part 2: Storage optimization"},"content":{"rendered":"<p><a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20200513-00\/?p=103745\"> Last time<\/a>, we looked at the basic idea behind <code>std::function<\/code>. For each callable type, it creates a custom <code>callable<\/code> wrapper class that knows how to invoke the instance of that type. Each <code>callable<\/code> derives from a common base interface, and the main <code>std::function<\/code> holds a pointer to that interface.<\/p>\n<p>As I noted, the standard requires that if the callable is a plain function pointer or a <code>reference_wrapper<\/code>, then no additional allocations are permitted.<\/p>\n<p>The way the optimization works is that the <code>function<\/code> also carries with it a small buffer of raw bytes. If the <code>callable<\/code> is small (as it will be for a function pointer or or a <code>reference_wrapper<\/code>), then the <code>callable<\/code> is stored directly in the buffer, avoiding an extra allocation. If the <code>callable<\/code> is large, then a separate allocation is done as before.\u00b9<\/p>\n<p>The concept behind the optimization is simple, but the execution is rather complicated. Since the memory is now being partly managed by the class itself, we can&#8217;t use a <code>unique_ptr<\/code> any more. Instead, we keep a raw pointer and do manual memory management.<\/p>\n<p>Let&#8217;s start with what we need in our <code>function<\/code>:<\/p>\n<pre>inline constexpr size_t storage_size =\r\n  std::max(sizeof(callable&lt;void(*)()&gt;),\r\n           sizeof(reference_wrapper&lt;char&gt;));\r\n\r\nusing storage_t = typename\r\n           aligned_storage&lt;storage_size&gt;::type;\r\n\r\nstruct function\r\n{\r\n  callable_base* m_callable = nullptr;\r\n\r\n  storage_t m_storage;\r\n\r\n  void* storage() { return addressof(m_storage); }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>In addition to the <code>callable_base<\/code> pointer, we also have some storage that we can use for small callables. In order to satisfy the standard, our storage needs to be large enough to hold a callable function pointer or a callable reference wrapper.<\/p>\n<p>There&#8217;s a bit of a trick here: Calculating the size of an arbitrary function pointer and an arbitrary reference wrapper. Fortunately, the standard requires that all function pointers have the same size,\u00b2 due to the rule that says that you can <code>reinterpret_cast<\/code> among function pointers, and if you <code>reinterpret_cast<\/code> back to where you started, the result is the same. A <code>reference_wrapper<\/code> is a wrapper around a pointer, and the largest pointer is the <code>void*<\/code>, which the standard requires to be the same size as a <code>char*<\/code>.\u00b3<\/p>\n<p>The idea is that the <code>m_callable<\/code> points either to a heap-allocated <code>callable&lt;T&gt;<\/code> (if it is large) or to a <code>callable&lt;T&gt;<\/code> constructed via placement new into the <code>m_storage<\/code> (if it is small).<\/p>\n<p>It&#8217;s a simple idea, but the implementation gets kind of messy, because the <code>callable&lt;T&gt;<\/code>&#8216;s <code>clone<\/code> method has to do different things depending on whether it is a large or small object.<\/p>\n<pre>struct function\r\n{\r\n  ...\r\n\r\n  template&lt;typename T&gt;\r\n  function(T&amp;&amp; t)\r\n  {\r\n    if constexpr (sizeof(*new callable(t)) &lt;= storage_size) {\r\n      m_callable = new(storage()) callable(t);\r\n    } else {\r\n      m_callable = new callable(t);\r\n    }\r\n  }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>To create a <code>function<\/code>, we take the functor object and construct it either in-place (if it is small) or on the heap (if it is large).<\/p>\n<pre>struct function\r\n{\r\n  ...\r\n\r\n  function(const function&amp; other)\r\n  {\r\n    if (other.m_callable) {\r\n      m_callable = other.m_callable.clone(m_storage);\r\n    }\r\n  }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>To copy a <code>function<\/code>, we take the source&#8217;s <code>m_callable<\/code> and ask it to clone itself, using our storage if it can.<\/p>\n<pre>struct function\r\n{\r\n  ...\r\n\r\n  function(function&amp;&amp; other)\r\n  {\r\n    if (other.m_callable == other.storage()) {\r\n      m_callable = other.m_callable.move_to(m_storage);\r\n      exchange(other.m_callable, nullptr)-&gt;~callable();\r\n    } else if (other.m_callable) {\r\n      m_callable = exchange(other.m_callable, nullptr);\r\n    }\r\n  }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>To move a <code>function<\/code>, there are three cases.<\/p>\n<ul>\n<li>If the callable is small (the <code>m_callable<\/code> points to its storage), then move it into the destination&#8217;s storage and destruct the original, leaving the source empty.<\/li>\n<li>If the callable is large (the <code>m_callable<\/code> is non-null but does not point to its storage), then transfer the pointer into the new object and leave the source empty.<\/li>\n<li>If there is no callable (<code>m_callable<\/code> is <code>nullptr<\/code>), then leave both source and destination empty.<\/li>\n<\/ul>\n<p>The order of operations in the <code>move_to<\/code> case is tricky. The <code>move_to<\/code> comes first, because we don&#8217;t want to make any changes to the source if the operation fails. Once the move succeeds, we need to null out the <code>other.m_callable<\/code> before destructing it, so that an exception during destruction will not result in double-destruction later.<\/p>\n<pre>struct function\r\n{\r\n  ...\r\n\r\n  ~function()\r\n  {\r\n    if (m_callable == storage()) {\r\n      m_callable-&gt;~callable();\r\n    } else {\r\n      delete m_callable;\r\n    }\r\n  }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>To destruct a <code>function<\/code>, there are again three cases, but two of them collapse.<\/p>\n<ul>\n<li>If the callable is small (the <code>m_callable<\/code> points to our storage), then destruct it in place.<\/li>\n<li>Otherwise the callable is large or null. We take advantage of the fact that <code>delete<\/code>-ing a null pointer is harmless.<\/li>\n<\/ul>\n<p>The <code>operator()(int, char*)<\/code> is the same as before.<\/p>\n<p>We need to make some adjustments to <code>callable_base<\/code>:<\/p>\n<pre>struct callable_base\r\n{\r\n  callable_base() = default;\r\n  virtual ~callable_base() { }\r\n  virtual bool invoke(int, char*) = 0;\r\n  virtual callable_base* clone(storage_t&amp; storage) = 0;\r\n  virtual callable_base* move_to(storage_t&amp; storage) = 0;\r\n};\r\n<\/pre>\n<p>The <code>callable_base<\/code> has new <code>clone<\/code> and <code>move_to<\/code> methods.<\/p>\n<pre>template&lt;typename T&gt;\r\nstruct callable : callable_base\r\n{\r\n  T m_t;\r\n\r\n  callable(T const&amp; t) : m_t(t) {}\r\n  callable(T&amp;&amp; t) : m_t(move(t)) {}\r\n\r\n\r\n  bool invoke(int a, char* b) override\r\n  {\r\n    return m_t(a, b);\r\n  }\r\n\r\n  ...\r\n};\r\n<\/pre>\n<p>The constructors and <code>invoke<\/code> method haven&#8217;t changed.<\/p>\n<pre>template&lt;typename T&gt;\r\nstruct callable : callable_base\r\n{\r\n  ...\r\n\r\n  callable_base* clone(storage_t&amp; storage) override\r\n  {\r\n    if constexpr (sizeof(*this) &lt;= sizeof(storage)) {\r\n      return new(addressof(storage)) callable(m_t);\r\n    } else {\r\n      return new callable(m_t);\r\n    }\r\n  }\r\n\r\n  callable_base* move_to(storage_t&amp; storage) override\r\n  {\r\n    if constexpr (sizeof(*this) &lt;= sizeof(storage)) {\r\n      return new(addressof(storage)) callable(move(m_t));\r\n    } else {\r\n      return nullptr; \/\/ should not be called\r\n    }\r\n  }\r\n};\r\n<\/pre>\n<p>The <code>clone<\/code> and <code>move_to<\/code> methods use the provided storage if it is large enough. Otherwise, the <code>clone<\/code> allocates the cloned object on the heap. (On the other hand <code>move_to<\/code> should never be called for heap-allocated objects, for in those cases, we move the pointer to the callable instead of the callable itself.)<\/p>\n<p>I think that&#8217;s the necessary changes to keep track of the memory bookkeeping for our miniature <code>function<\/code>. I left out a bunch of methods that are required by the standard, particularly the assignment operators. I&#8217;m not even going to leave them as an exercise, because you should just be using the <code>std::function<\/code> implementation that came with your compiler.<\/p>\n<p>The purpose of this entry was to give you an idea of what&#8217;s happening under the hood. This will help you when you need to debug one of these things, and the techniques may come in handy later. Like next time, for instance. Stay tuned.<\/p>\n<p>\u00b9 Implementations typically do some tuning to decide how big this small buffer should be. You want the buffer to be small, so that the function object consumes less memory. But you want the buffer to be large enough to hold the most common callables. (Plus, of course, the two callables required by the standard.)<\/p>\n<p>\u00b2 Note that the same requirement <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20040209-00\/?p=40713\"> does not apply to pointers to member functions<\/a>.<\/p>\n<p>\u00b3 Technically, I guess, you could have an implementation where function pointers were different sizes, but where <code>reinterpret_cast<\/code> applies appropriate conversions to &#8220;squeeze out the air&#8221; from large function pointers so they can be recovered from small function pointers;\u2074 or an implementation where <code>reference_<\/code><code>wrapper<\/code> did something weird based on the type. But since <code>std::function<\/code> is part of the implementation, it can make these sorts of implementation-dependent assumptions.<\/p>\n<p>\u2074 For example, you might have &#8220;fast function pointers&#8221; which are fat (say, because they consist of a code pointer plus a table of contents) and &#8220;slow function pointers&#8221; which are smaller but slower (consisting of just the code pointer). The idea is that the bytes immediately before the code pointer hold the table of contents. Calling through a fast function pointer consists of loading the table of contents into the global pointer register, and calling the code pointer.<\/p>\n<pre>  ; assume fast function pointer is in r34\/r35\r\n  mov   gp, r34                 \/\/ set up global pointer\r\n  mov   b6, r35                 \/\/ set up branch target\r\n  br.call.dptk.many rp = b6 ;;  \/\/ call it\r\n<\/pre>\n<p>Calling through a slow function pointer requires the call site to recover the table of contents from memory before calling the target.<\/p>\n<pre>  ; assume slow function pointer is in r34\r\n  add   r30 = r34, -8           \/\/ get address of TOC\r\n  mov   b6, r34 ;;              \/\/ set up branch target\r\n  mov   gp, [r30]               \/\/ set up global pointer\r\n  br.call.dptk.many rp = b6 ;;  \/\/ call the target\r\n<\/pre>\n<p>The slow version costs a memory access and takes an extra cycle. It may also burn an extra cache line, because we have to access the data that comes before the start of the code, and the code may have to satisfy certain alignment requirements which cause it tend to appear at the start of a cache line.<\/p>\n<p>To convert a fast function pointer to a slow one, discard the table of contents pointer. To convert a slow function pointer to a fast one, fetch the table of contents pointer from the memory immediately before the code address.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Preallocating the storage yourself.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-103749","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Preallocating the storage yourself.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/103749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=103749"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/103749\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=103749"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=103749"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=103749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}