November 1st, 2024

Analyzing the Performance of the “Proxy” Library

Mingxin Wang
Senior Software Engineer

Since the recent announcement of the Proxy 3 library, we have received much positive feedback, and there have been numerous inquiries regarding the library’s actual performance. Although the “Proxy” library is designed to be fast, fulfilling one of our six core missions, it is not immediately clear how fast “Proxy” can be across different platforms and scenarios.

To better understand the performance of the “Proxy” library, we designed 15 benchmarks, ran them in four different environments, and automated them in our GitHub pipeline so that a benchmarking report is generated for every future code change. Everyone can download the reports and raw benchmarking data attached to each build. The rest of this article delves into the benchmarking details. The numbers shown below were generated from a recent CI build.

Indirect Invocation

Both proxy objects and virtual functions can perform indirect invocations. However, since they have different semantics and memory layouts, it is worth seeing how they compare to each other.

Because make_proxy can effectively place a small object alongside metadata (similar to “small buffer optimization” in some other C++ libraries), the benchmarks are divided into two categories: invocation on small objects (4 bytes) and on large objects (48 bytes). By invoking 1,000,000 objects of 100 different types, we got the first two rows of the report:

| Benchmark | MSVC on Windows Server 2022 (x64) | GCC on Ubuntu 24.04 (x64) | Clang on Ubuntu 24.04 (x64) | Apple Clang on macOS 15 (ARM64) |
|---|---|---|---|---|
| Indirect invocation on small objects via proxy vs. virtual functions | 🟢 proxy is about 261.7% faster | 🟢 proxy is about 44.6% faster | 🟢 proxy is about 71.6% faster | 🟡 proxy is about 4.0% faster |
| Indirect invocation on large objects via proxy vs. virtual functions | 🟢 proxy is about 186.1% faster | 🟢 proxy is about 15.5% faster | 🟢 proxy is about 17.0% faster | 🟢 proxy is about 10.5% faster |

From the report, proxy is faster in all four environments, especially on Windows Server. This result is expected because the implementation of proxy directly stores the metadata of the underlying object, making it more cache-friendly.
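To make the cache-friendliness argument more concrete, here is a minimal C++17 sketch that contrasts a virtual call with a hand-rolled handle that keeps a pointer to its dispatch metadata next to the object pointer. It is not the Proxy implementation: Shape, VirtualSquare, PlainSquare, Meta, kMetaFor, and Handle are hypothetical names invented for illustration, and the real library's layout may differ.

```cpp
#include <memory>
#include <vector>

// Classic virtual dispatch: each object carries a hidden vptr, so a call
// loads the object first, then the vtable, then the function pointer.
struct Shape {
  virtual ~Shape() = default;
  virtual int Area() const = 0;
};

struct VirtualSquare : Shape {
  explicit VirtualSquare(int s) : side(s) {}
  int Area() const override { return side * side; }
  int side;
};

// A plain type with no virtual functions and therefore no vptr.
struct PlainSquare {
  int side;
  int Area() const { return side * side; }
};

// Hypothetical dispatch metadata for the hand-rolled handle below.
struct Meta {
  int (*area)(const void*);
  void (*destroy)(void*);
};

// One metadata instance per erased type, built from captureless lambdas.
template <class T>
inline constexpr Meta kMetaFor{
    [](const void* p) { return static_cast<const T*>(p)->Area(); },
    [](void* p) { delete static_cast<T*>(p); }};

// Hypothetical owning handle (an assumption, not the Proxy layout): the
// metadata pointer lives in the handle, so the call target can be located
// without reading the pointed-to object first.
class Handle {
 public:
  template <class T>
  explicit Handle(T* owned) : meta_(&kMetaFor<T>), obj_(owned) {}
  Handle(Handle&& other) noexcept : meta_(other.meta_), obj_(other.obj_) {
    other.obj_ = nullptr;
  }
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;
  ~Handle() {
    if (obj_ != nullptr) meta_->destroy(obj_);
  }
  int Area() const { return meta_->area(obj_); }

 private:
  const Meta* meta_;  // dispatch metadata, stored in the handle itself
  void* obj_;         // the type-erased object
};

// The two loops a benchmark of this kind would time.
int SumVirtual(const std::vector<std::unique_ptr<Shape>>& shapes) {
  int sum = 0;
  for (const auto& s : shapes) sum += s->Area();
  return sum;
}

int SumHandle(const std::vector<Handle>& handles) {
  int sum = 0;
  for (const auto& h : handles) sum += h.Area();
  return sum;
}
```

In SumHandle, iterating the vector already brings the metadata pointer into cache along with the handle, whereas SumVirtual has to dereference each object before it can find the vtable. Whether that difference is measurable depends on the platform, as the table above suggests.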

Lifetime Management

In many applications, lifetime management of objects is a bigger performance hotspot than indirect invocation. We benchmarked this scenario by creating 600,000 small or large objects within a single std::vector (with reserved space).

Besides proxy, there are three typical standard options for storing arbitrary types: std::unique_ptr, std::shared_ptr, and std::any. std::variant is not included because it is essentially a tagged union and can only provide storage for a known set of types (though useful in data context management).

For small objects, proxy and std::any usually won’t allocate additional storage. For large objects, proxy and std::shared_ptr offer allocator support (via pro::allocate_proxy and std::allocate_shared) to improve performance, while there is no direct API to customize std::unique_ptr or std::any.
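As a rough illustration of the standard-library side of this comparison, the C++17 sketch below fills a reserved std::vector using std::unique_ptr, std::any, and std::allocate_shared. It is not the benchmark code from the repository: the function names are hypothetical, and using a std::pmr pool resource as the "memory pool" is an assumption. The proxy counterparts named in the article (pro::make_proxy and pro::allocate_proxy) are left out so that the sketch depends only on the standard library.

```cpp
#include <any>
#include <array>
#include <cstddef>
#include <memory>
#include <memory_resource>
#include <vector>

// One of the "large" types listed in the article.
using Large = std::array<char, 100>;

// std::unique_ptr: one heap allocation per element, no allocator hook.
std::vector<std::unique_ptr<Large>> FillUnique(std::size_t n) {
  std::vector<std::unique_ptr<Large>> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) v.push_back(std::make_unique<Large>());
  return v;
}

// std::any: the implementation decides whether a value is stored inline or
// on the heap; there is no API to supply an allocator.
std::vector<std::any> FillAny(std::size_t n) {
  std::vector<std::any> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) v.emplace_back(Large{});
  return v;
}

// std::shared_ptr with a memory pool: std::allocate_shared takes the object
// and its control block from the supplied resource. The resource must
// outlive every returned pointer.
std::vector<std::shared_ptr<Large>> FillSharedPooled(
    std::size_t n, std::pmr::memory_resource& pool) {
  std::pmr::polymorphic_allocator<Large> alloc(&pool);
  std::vector<std::shared_ptr<Large>> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    v.push_back(std::allocate_shared<Large>(alloc));
  }
  return v;
}
```

A caller would typically create a std::pmr::unsynchronized_pool_resource before the vector and pass it to FillSharedPooled, so that the pool is destroyed only after all the elements are gone.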

Here are the types we used in the benchmarks:

| Small types | Large types |
|---|---|
| int | std::array<char, 100> |
| std::shared_ptr<int> | std::array<std::string, 3> |
| std::unique_lock<std::mutex> | std::unique_lock<std::mutex> + void*[15] |

By comparing proxy with other solutions, we got the following numbers:

| Benchmark | MSVC on Windows Server 2022 (x64) | GCC on Ubuntu 24.04 (x64) | Clang on Ubuntu 24.04 (x64) | Apple Clang on macOS 15 (ARM64) |
|---|---|---|---|---|
| Basic lifetime management for small objects with proxy vs. std::unique_ptr | 🟢 proxy is about 467.0% faster | 🟢 proxy is about 413.0% faster | 🟢 proxy is about 430.1% faster | 🟢 proxy is about 341.1% faster |
| Basic lifetime management for small objects with proxy vs. std::shared_ptr (without memory pool) | 🟢 proxy is about 639.2% faster | 🟢 proxy is about 509.3% faster | 🟢 proxy is about 492.5% faster | 🟢 proxy is about 484.2% faster |
| Basic lifetime management for small objects with proxy vs. std::shared_ptr (with memory pool) | 🟢 proxy is about 198.4% faster | 🟢 proxy is about 696.1% faster | 🟢 proxy is about 660.0% faster | 🟢 proxy is about 188.5% faster |
| Basic lifetime management for small objects with proxy vs. std::any | 🟢 proxy is about 55.3% faster | 🟢 proxy is about 311.0% faster | 🟢 proxy is about 323.0% faster | 🟢 proxy is about 18.3% faster |
| Basic lifetime management for large objects with proxy (without memory pool) vs. std::unique_ptr | 🟢 proxy is about 17.4% faster | 🟢 proxy is about 14.8% faster | 🟢 proxy is about 29.7% faster | 🔴 proxy is about 6.3% slower |
| Basic lifetime management for large objects with proxy (with memory pool) vs. std::unique_ptr | 🟢 proxy is about 283.6% faster | 🟢 proxy is about 109.6% faster | 🟢 proxy is about 204.6% faster | 🟢 proxy is about 88.6% faster |
| Basic lifetime management for large objects with proxy vs. std::shared_ptr (both without memory pool) | 🟢 proxy is about 29.2% faster | 🟢 proxy is about 6.4% faster | 🟢 proxy is about 6.5% faster | 🟡 proxy is about 4.8% faster |
| Basic lifetime management for large objects with proxy vs. std::shared_ptr (both with memory pool) | 🟢 proxy is about 10.8% faster | 🟢 proxy is about 9.9% faster | 🟢 proxy is about 8.3% faster | 🟢 proxy is about 53.2% faster |
| Basic lifetime management for large objects with proxy (without memory pool) vs. std::any | 🟢 proxy is about 13.4% faster | 🟡 proxy is about 1.3% slower | 🟡 proxy is about 0.9% faster | 🟢 proxy is about 9.5% faster |
| Basic lifetime management for large objects with proxy (with memory pool) vs. std::any | 🟢 proxy is about 270.7% faster | 🟢 proxy is about 80.1% faster | 🟢 proxy is about 136.9% faster | 🟢 proxy is about 120.4% faster |

From the benchmarking results:

  • proxy is much faster than the other three options when the underlying object is small or is managed with a memory pool.
  • For large objects without a memory pool, proxy is close to std::unique_ptr: 14.8% to 29.7% faster in the three x64 environments, and 6.3% slower with Apple Clang on macOS.
  • The performance of std::any varies across environments, but it is generally slower than proxy.

Summary

Although the test environments (GitHub-hosted runners) may differ from actual production environments, the test results show significant performance advantages of proxy in both indirect invocations and lifetime management. If you have more ideas for benchmarking the “Proxy” library, we welcome contributions to our GitHub repository.

3 comments


  • Jean Gautier

    Hi Mingxin,

    You did not include std::variant in your benchmark because it is bound to a list of types and serves as a tagged union for storage.
    I can understand that.

    However, it is widely used in designs that strive for static polymorphism.
    In that regard, it would be really interesting, and would help evaluate the benefit of lib proxy, to also be able to compare it to a std::variant approach.

    Thank you,
    Jean

    • Mingxin Wang (Microsoft employee, Author)

      Hi Jean,

      Thank you for your interest in our work. As you said, std::variant was excluded from the benchmarking because "it is bound to a (fixed) list of types". To avoid misleading readers, we decided not to compare the performance of std::variant because of the difference in scope. However, it is true that virtual functions and proxy can theoretically be replaced by std::variant within a single binary, at the cost of the engineering effort needed to manage templates.

      I forked the repository and created a branch to add benchmarks for std::variant with the existing configurations. For your reference, here's the result generated from my branch (copied...

      • Jean Gautier

        Hi Mingxin,

        I am not surprised by the result, and I feel the benchmark is more “complete” with all known alternatives to “classic” polymorphism.
        Proxy has a huge (IMO) benefit over variant: it “decouples” the implementation and the consumers of an ABI, as virtual functions do (and as std::variant does not).

        We are looking forward to using proxy in our experiments soon!
        Thank you very much for taking the time for this additional benchmark.
        Jean