{"id":34833,"date":"2024-11-01T16:06:51","date_gmt":"2024-11-01T16:06:51","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=34833"},"modified":"2024-11-01T11:09:05","modified_gmt":"2024-11-01T11:09:05","slug":"analyzing-the-performance-of-the-proxy-library","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/analyzing-the-performance-of-the-proxy-library\/","title":{"rendered":"Analyzing the Performance of the &#8220;Proxy&#8221; Library"},"content":{"rendered":"<p>Since the <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/announcing-the-proxy-3-library-for-dynamic-polymorphism\/\">recent announcement of the Proxy 3 library<\/a>, we have received much positive feedback, and there have been numerous inquiries regarding the library&#8217;s actual performance. Although the &#8220;Proxy&#8221; library is designed to be <strong>fast<\/strong>, fulfilling one of our six core missions, it is not immediately clear how fast &#8220;Proxy&#8221; can be across different platforms and scenarios.<\/p>\n<p>To better understand the performance of the &#8220;Proxy&#8221; library, we designed <a href=\"https:\/\/github.com\/microsoft\/proxy\/tree\/main\/benchmarks\">15 benchmarks<\/a>, tested in four different environments, and automated them in <a href=\"https:\/\/github.com\/microsoft\/proxy\/actions\/workflows\/pipeline-ci.yml\">our GitHub pipeline<\/a> to generate benchmarking reports for every code change in the future. Everyone can download the reports and raw benchmarking data attached to each build. The rest of this article delves into the benchmarking details. The numbers shown below were generated from <a href=\"https:\/\/github.com\/microsoft\/proxy\/actions\/runs\/11550031482#artifacts\">a recent CI build<\/a>.<\/p>\n<h2>Indirect Invocation<\/h2>\n<p>Both <a href=\"https:\/\/microsoft.github.io\/proxy\/docs\/proxy.html\"><code>proxy<\/code><\/a> objects and virtual functions can perform indirect invocations. 
However, since they have different semantics and memory layouts, it is interesting to see how they compare.<\/p>\n<p>Because <a href=\"https:\/\/microsoft.github.io\/proxy\/docs\/make_proxy.html\"><code>make_proxy<\/code><\/a> can effectively place a small object alongside metadata (similar to &#8220;small buffer optimization&#8221; in some other C++ libraries), the benchmarks are divided into two categories: invocation on small objects (4 bytes) and on large objects (48 bytes). By invoking 1,000,000 objects of 100 different types, we got the first two rows of the report:<\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>MSVC on Windows Server 2022 (x64)<\/th>\n<th>GCC on Ubuntu 24.04 (x64)<\/th>\n<th>Clang on Ubuntu 24.04 (x64)<\/th>\n<th>Apple Clang on macOS 15 (ARM64)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Indirect invocation on small objects via <code>proxy<\/code> vs. virtual functions<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>261.7% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>44.6% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>71.6% faster<\/strong><\/td>\n<td>\ud83d\udfe1<code>proxy<\/code> is about <strong>4.0% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Indirect invocation on large objects via <code>proxy<\/code> vs. virtual functions<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>186.1% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>15.5% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>17.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>10.5% faster<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>From the report, <code>proxy<\/code> is faster in all four environments, especially on Windows Server. 
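The layout difference can be illustrated with a hand-rolled sketch: a classic virtual-function hierarchy next to a "fat pointer" that carries its dispatch metadata alongside the object pointer rather than behind it. The names below (`DrawableBase`, `AreaView`, and so on) are illustrative assumptions for this sketch and do not reflect the library's actual implementation:

```cpp
#include <cassert>

// Classic virtual dispatch: the vptr lives inside the object itself,
// so each call must first reach the object, then its vtable, then the
// target function.
struct DrawableBase {
  virtual ~DrawableBase() = default;
  virtual int Area() const = 0;
};
struct Rect : DrawableBase {
  int w, h;
  Rect(int w_, int h_) : w(w_), h(h_) {}
  int Area() const override { return w * h; }
};
int area_via_virtual(const DrawableBase& d) { return d.Area(); }

// Hand-rolled "fat pointer": the dispatch metadata travels next to the
// object pointer, so finding the target function needs no extra hop
// through the object. This sketches the layout idea behind pro::proxy;
// it is NOT the library's implementation.
struct PlainRect {
  int w = 0, h = 0;
  int AreaImpl() const { return w * h; }
};
struct AreaView {
  const void* obj;
  int (*area)(const void*);
  template <class T>
  explicit AreaView(const T& t)
      : obj(&t),
        area([](const void* p) { return static_cast<const T*>(p)->AreaImpl(); }) {}
};
int area_via_view(const AreaView& v) { return v.area(v.obj); }
```

Both paths dispatch to the same arithmetic; the difference is purely where the metadata lives, which is what the cache-friendliness argument below is about.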
This result is expected because the implementation of <code>proxy<\/code> directly stores the metadata of the underlying object, making it more cache-friendly.<\/p>\n<h2>Lifetime Management<\/h2>\n<p>In many applications, lifetime management of various objects can become a performance hotspot compared to indirect invocations. We benchmarked this scenario by creating 600,000 small or large objects within a single <code>std::vector<\/code> (with reserved space).<\/p>\n<p>Besides <code>proxy<\/code>, there are three typical standard options for storing arbitrary types: <code>std::unique_ptr<\/code>, <code>std::shared_ptr<\/code>, and <code>std::any<\/code>. <code>std::variant<\/code> is not included because it is essentially a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tagged_union\">tagged union<\/a> and can only provide storage for a known set of types (though useful in data context management).<\/p>\n<p>For small objects, <code>proxy<\/code> and <code>std::any<\/code> usually won&#8217;t allocate additional storage. 
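As a standard-library-only sketch of this scenario, the hypothetical `make_mixed` helper below stores a few heterogeneous values in a reserved `std::vector<std::any>`, the same container shape the benchmark uses (the benchmark fills it with 600,000 objects; three are enough to show the idea):

```cpp
#include <any>
#include <cassert>
#include <string>
#include <vector>

// std::any, like proxy via make_proxy, can keep a sufficiently small
// value inline rather than on the heap; the exact small-buffer
// threshold is implementation-defined.
std::vector<std::any> make_mixed() {
  std::vector<std::any> v;
  v.reserve(3);                          // reserved space, as in the benchmark
  v.emplace_back(42);                    // small: typically stored inline
  v.emplace_back(std::string("hello"));  // larger: may heap-allocate
  v.emplace_back(3.25);
  return v;
}
```

Retrieval goes through `std::any_cast`, e.g. `std::any_cast<int>(v[0])`, which throws `std::bad_any_cast` on a type mismatch.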
For large objects, <code>proxy<\/code> and <code>std::shared_ptr<\/code> offer allocator support (via <a href=\"https:\/\/microsoft.github.io\/proxy\/docs\/allocate_proxy.html\"><code>pro::allocate_proxy<\/code><\/a> and <a href=\"https:\/\/learn.microsoft.com\/cpp\/standard-library\/memory-functions?view=msvc-170#allocate_shared\"><code>std::allocate_shared<\/code><\/a>) to improve performance, while there is no direct API to customize allocation for <code>std::unique_ptr<\/code> or <code>std::any<\/code>.<\/p>\n<p>Here are the types we used in the benchmarks:<\/p>\n<table>\n<thead>\n<tr>\n<th>Small types<\/th>\n<th>Large types<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>int<\/code><\/td>\n<td><code>std::array&lt;char, 100&gt;<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>std::shared_ptr&lt;int&gt;<\/code><\/td>\n<td><code>std::array&lt;std::string, 3&gt;<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>std::unique_lock&lt;std::mutex&gt;<\/code><\/td>\n<td><code>std::unique_lock&lt;std::mutex&gt;<\/code> + <code>void*[15]<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>By comparing <code>proxy<\/code> with other solutions, we got the following numbers:<\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>MSVC on Windows Server 2022 (x64)<\/th>\n<th>GCC on Ubuntu 24.04 (x64)<\/th>\n<th>Clang on Ubuntu 24.04 (x64)<\/th>\n<th>Apple Clang on macOS 15 (ARM64)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Basic lifetime management for small objects with <code>proxy<\/code> vs. <code>std::unique_ptr<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>467.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>413.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>430.1% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>341.1% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for small objects with <code>proxy<\/code> vs. 
<code>std::shared_ptr<\/code> (without memory pool)<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>639.2% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>509.3% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>492.5% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>484.2% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for small objects with <code>proxy<\/code> vs. <code>std::shared_ptr<\/code> (with memory pool)<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>198.4% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>696.1% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>660.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>188.5% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for small objects with <code>proxy<\/code> vs. <code>std::any<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>55.3% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>311.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>323.0% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>18.3% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> (without memory pool) vs. <code>std::unique_ptr<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>17.4% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>14.8% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>29.7% faster<\/strong><\/td>\n<td>\ud83d\udd34<code>proxy<\/code> is about <strong>6.3% slower<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> (with memory pool) vs. 
<code>std::unique_ptr<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>283.6% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>109.6% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>204.6% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>88.6% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> vs. <code>std::shared_ptr<\/code> (both without memory pool)<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>29.2% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>6.4% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>6.5% faster<\/strong><\/td>\n<td>\ud83d\udfe1<code>proxy<\/code> is about <strong>4.8% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> vs. <code>std::shared_ptr<\/code> (both with memory pool)<\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>10.8% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>9.9% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>8.3% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>53.2% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> (without memory pool) vs. <code>std::any<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>13.4% faster<\/strong><\/td>\n<td>\ud83d\udfe1<code>proxy<\/code> is about <strong>1.3% slower<\/strong><\/td>\n<td>\ud83d\udfe1<code>proxy<\/code> is about <strong>0.9% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>9.5% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Basic lifetime management for large objects with <code>proxy<\/code> (with memory pool) vs. 
<code>std::any<\/code><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>270.7% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>80.1% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>136.9% faster<\/strong><\/td>\n<td>\ud83d\udfe2<code>proxy<\/code> is about <strong>120.4% faster<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>From the benchmarking results:<\/p>\n<ul>\n<li><code>proxy<\/code> is much faster than the other three options when the underlying object is small or managed with memory pools.<\/li>\n<li><code>proxy<\/code> can be slightly slower than <code>std::unique_ptr<\/code> on some platforms (as on macOS ARM64 above) when the underlying object is large and not managed with a memory pool.<\/li>\n<li>The performance of <code>std::any<\/code> varies in different environments, but is generally slower than <code>proxy<\/code>.<\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>Although the test environments (<a href=\"https:\/\/docs.github.com\/en\/actions\/using-github-hosted-runners\/using-github-hosted-runners\/about-github-hosted-runners\">GitHub-hosted runners<\/a>) may differ from actual production environments, the test results show significant performance advantages of <code>proxy<\/code> in both indirect invocations and lifetime management. 
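The "with memory pool" variants measured above can be sketched with the standard library alone, using a <code>std::pmr</code> pool resource together with <code>std::allocate_shared</code>; <code>pro::allocate_proxy</code> accepts an allocator in the same spirit. The <code>Big</code> type and <code>make_pooled</code> helper below are hypothetical stand-ins for this sketch, not the benchmark harness:

```cpp
#include <cassert>
#include <memory>
#include <memory_resource>

// A 100-byte payload, like the std::array<char, 100> large type used in
// the lifetime-management benchmarks.
struct Big { char payload[100] = {}; };

// std::allocate_shared accepts any allocator, including one backed by a
// std::pmr memory resource, so both the object and the shared_ptr
// control block come out of the pool.
std::shared_ptr<Big> make_pooled(std::pmr::memory_resource& pool) {
  std::pmr::polymorphic_allocator<Big> alloc(&pool);
  return std::allocate_shared<Big>(alloc);
}
```

A caller owns the pool and must keep it alive for as long as any pooled object exists, e.g. `std::pmr::unsynchronized_pool_resource pool; auto p = make_pooled(pool);`.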
If you have more ideas for benchmarking the &#8220;Proxy&#8221; library, we welcome contributions to <a href=\"https:\/\/github.com\/microsoft\/proxy\">our GitHub repository<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article analyzes the performance of the &#8220;Proxy&#8221; library in various scenarios, demonstrating its significant advantages in indirect invocations and lifetime management across different platforms.<\/p>\n","protected":false},"author":98503,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,256],"tags":[],"class_list":["post-34833","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","category-experimental"],"acf":[],"blog_post_summary":"<p>This article analyzes the performance of the &#8220;Proxy&#8221; library in various scenarios, demonstrating its significant advantages in indirect invocations and lifetime management across different 
platforms.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/34833","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/98503"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=34833"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/34833\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=34833"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=34833"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=34833"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}