Profile Guided Optimization (PGO) has long been one of the most powerful tools in the MSVC compiler’s arsenal for improving the runtime performance of C and C++ applications. By using execution profile data collected from representative workloads, PGO enables the compiler to make smarter decisions about inlining, code layout, and hot/cold code separation – decisions that are impossible to make from static analysis alone. In practice, PGO can deliver large performance improvements for C/C++ code.
Today, we’re introducing Sample Profile Guided Optimization (SPGO), a new approach to profile-guided optimization that makes it dramatically easier to bring PGO-quality optimizations to your codebase without the overhead and complexity of traditional instrumented PGO. SPGO is available in all versions of Visual Studio 2022 and Visual Studio 2026.
Limitations of Traditional PGO
Traditional PGO (sometimes called “instrumented PGO”) works in three phases:
- Instrument: Compile your application with special instrumentation probes inserted at key points in the code.
- Train: Run the instrumented binary through representative workloads to collect execution counts.
- Optimize: Recompile using the collected profile data to guide optimizations.
While this approach produces high-quality profile data, it comes with significant practical challenges:
- Performance overhead: The instrumented binary runs significantly slower than the release build, making it impractical to deploy to production.
- Training burden: You need to create and maintain representative training scenarios that accurately reflect real-world usage patterns.
- Workflow complexity: The three-phase build process adds complexity to your CI/CD pipeline and release workflow.
- Staleness: Profile data can go stale as your code evolves, requiring frequent re-training.
- Deployment constraints: Instrumented binaries cannot be shipped to customers, so the training scenarios may not perfectly represent real customer workloads.
For many teams, these barriers mean that PGO, despite its significant performance benefits, remains out of reach.
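To make the overhead point concrete, here is a minimal sketch of what an instrumentation probe amounts to. It is hand-written C++ for illustration only: the names `probe`, `g_counts`, and `classify` are hypothetical, and the real instrumented PGO runtime uses its own counters and file format.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical probe table: one counter per instrumented basic block.
enum Block : std::size_t { kEntry, kFastPath, kSlowPath, kBlockCount };

static std::array<std::uint64_t, kBlockCount> g_counts{};

// Every probe is an extra memory write on the executed path, which is
// one reason an instrumented build runs slower than a release build.
inline void probe(Block b) {
    assert(b < kBlockCount);
    ++g_counts[b];
}

// An ordinary function with a probe at the start of each basic block.
int classify(int value) {
    probe(kEntry);
    if (value >= 0) {
        probe(kFastPath);
        return value * 2;
    }
    probe(kSlowPath);
    return -value;
}
```

After a training run, the counters hold exact execution counts per block. Sampling trades that exactness for the ability to leave the shipped code untouched.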
Enter SPGO: Profile Guidance from Production Sampling
SPGO takes a fundamentally different approach. Instead of instrumenting your binary and running it through synthetic training scenarios, SPGO uses hardware performance counter sampling collected from your actual release binaries. Modern processors provide hardware sampling capabilities such as Last Branch Records (LBR) and retired instruction counters. These can be collected with negligible runtime overhead, making it practical to gather runtime profiles directly from production.
Because SPGO profiles release bits, not instrumented builds, it enables much more flexibility in where and how you collect data. You can gather runtime profiles from production servers, developer workstations, performance labs, or any combination. The upper end of SPGO’s performance range is unlocked by the quality, completeness, and consistency of the input data you provide.
The SPGO Workflow
The SPGO workflow is an iterative cycle of building, collecting, converting, and rebuilding. For those familiar with traditional PGO, the key difference is that there is no dedicated instrumentation step; everything is done with fast release binaries:
Step 1: Build with Link-Time Code Generation (LTCG) and SPGO Switch
Compile your application with LTCG and add the /spgo linker switch. During this build, the compiler produces a Sample Profile Database (SPD) file (one per binary) containing the static structure of your code: control flow graphs, block layouts, and inline expansion information. Save the SPD files alongside your build output. For example:
cl /EHsc /GL /O2 app.cpp /link /debug /spgo
Step 2: Collect Hardware Samples
Run your application under representative workloads with hardware sampling enabled. xperf, part of the Windows Performance Toolkit, is commonly used for collection. There are two collection modes:
1. IP (Instruction Pointer) sampling – periodic snapshots of where the CPU is executing. Effective on all platforms. For example:
xperf -on LOADER+PROC_THREAD+PROFILE -MinBuffers 4096 -BufferSize 4096 -setProfInt Timer 1221 -stackwalk profile
2. LBR (Last Branch Records) – records of recently taken branches, providing consecutive source/destination address pairs. LBR data is preferred because it reveals edge frequencies in the control flow graph, not just block hit counts. LBR is available on Intel Haswell CPUs (4th gen Core, 2013) or later, AMD Zen 4 (2022) or later, and ARM64 ARMv9.2-A (2020) or later. For example:
xperf -on LOADER+PROC_THREAD+PMC_PROFILE -MinBuffers 4096 -MaxBuffers 4096 -BufferSize 4096 -pmcprofile BranchInstructionRetired -LastBranch PmcInterrupt -setProfInt BranchInstructionRetired 16384
A typical collection session runs for around 5 minutes on a representative workload and generates an Event Trace Log (ETL) file with the results. The raw ETL trace is then converted to Sample Profile Trace (SPT) format. It is common to collect many SPT files from different scenarios and machines.
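The difference between the two collection modes can be sketched in a few lines of C++. This is an illustration of the idea, not the ETL or SPT format: `BranchRecord`, `countHits`, and `countEdges` are hypothetical names, and real traces carry far more context than a bare address.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Address = std::uint64_t;

// Hypothetical shape of one LBR entry: a taken branch from source to target.
struct BranchRecord {
    Address source;
    Address target;
};

// IP sampling only tells us where execution was observed.
std::map<Address, std::uint64_t> countHits(const std::vector<Address>& ipSamples) {
    std::map<Address, std::uint64_t> hits;
    for (Address ip : ipSamples) ++hits[ip];
    return hits;
}

// LBR records tell us which control flow edge was taken, and how often.
std::map<std::pair<Address, Address>, std::uint64_t>
countEdges(const std::vector<BranchRecord>& records) {
    std::map<std::pair<Address, Address>, std::uint64_t> edges;
    for (const BranchRecord& r : records)
        ++edges[std::make_pair(r.source, r.target)];
    return edges;
}
```

With hit counts alone, a branch that goes one way 90% of the time has to be inferred from the counts of its successors. With edge counts, that bias is observed directly.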
Step 3: Convert with SPDConvert
The SPDConvert tool correlates the raw hardware samples in the SPT files against the static code structure in the SPD:
spdconvert app.spd scenario1.spt scenario2.spt scenario3.spt
This step performs sample correlation, flow smoothing and size/speed decisions, producing an enriched SPD file with execution counts annotated on the flow graph.
You can combine data from multiple sources in a single conversion – lab benchmarks, internal monitoring, and production profile counts can all contribute SPT files. To emphasize the importance of a particular scenario, you can specify its SPT file multiple times (e.g., listing a critical benchmark SPT three times effectively triples its weight).
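The weighting effect of repeating an input can be pictured with a small merge routine. This sketch assumes that inputs are combined by summing their counts, which matches the tripling behavior described above but is not a statement about how SPDConvert is implemented. `SampleCounts` and `mergeTraces` are hypothetical names, and keying by function name is a simplification.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one trace: sample counts keyed by function name.
using SampleCounts = std::map<std::string, std::uint64_t>;

// Sums counts across all inputs. An input that appears N times in the
// list contributes N times, so repetition acts as an integer weight.
SampleCounts mergeTraces(const std::vector<SampleCounts>& inputs) {
    SampleCounts merged;
    for (const SampleCounts& trace : inputs)
        for (const auto& entry : trace)
            merged[entry.first] += entry.second;
    return merged;
}
```

For example, merging a benchmark trace three times alongside one production trace makes the benchmark's hot functions count three times as much in the combined profile.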
Step 4: Optimized Rebuild
Rebuild your application, passing the enriched SPD to the linker:
cl /O2 /GL app.cpp /link /LTCG /spdin:app.spd
The compiler uses the sample-derived profile data to guide optimization decisions. Importantly, the build also produces a new SPD file; save this for the next iteration of the cycle.
Iterate
The workflow is designed for continuous improvement. Each build produces a fresh SPD that can be enriched with new samples from the latest version. Over time, your profile data becomes increasingly comprehensive and reflective of real usage.
Data Collection Strategies
The best performance results come from thoughtful investment in profile data collection. SPGO provides a means of combining data from multiple sources over time to get incremental benefits.
Lab Benchmarks
Performance labs are an important source of high-quality samples. Many teams already have benchmarks in place for key scenarios. Ensure that these benchmarks run long enough and with sufficient sample density to produce a meaningful profile. Increasing the sample rate (using the xperf -setProfInt switch) or extending the test runtime can improve coverage.
Internal Monitoring
Usage by your own team, especially on release candidate builds, provides valuable real-world data that correlates to a single binary version. You can gather it by collecting profiles during normal development activity.
Production Profile Counts
Collecting from production workloads provides the most representative data, especially for large-scale services. In cloud environments, sampling can be distributed across many nodes with very small per-node overhead, making it effectively undetectable. For products with a large user base, opt-in profile collection can be distributed across millions of machines.
Blending Sources
The overall best performance is obtained by blending profiles from production usage with targeted lab benchmarks. Data can be scaled to adjust the significance of each source. Start with what you have easily available and incrementally add more scenarios over time.
How the Compiler Improves Optimization using Profile Data
The compiler uses the collected sample data to populate counts on each block and edge in the program’s control flow graph. These counts drive the same powerful optimizations as traditional PGO:
- Profile-guided inlining: Aggressively inline hot call sites while avoiding code bloat from inlining cold paths.
- Hot/cold code separation: Move rarely executed code to different sections of the binary, improving instruction cache utilization and paging behavior.
- Function layout: Place functions that call each other frequently near each other in the binary, reducing page faults and improving locality. Optimized functions are organized into high affinity COFF groups in the binary.
- Size/speed decisions: Compile hot functions for speed and cold functions for size. Routines with no observed profile hits may be compiled for size rather than speed, limiting optimizations like inlining and loop unrolling in those cold paths.
- Speculative devirtualization: When sampling reveals that an indirect call consistently targets the same function, SPGO can speculate on that target and inline it, with a fallback for the uncommon case.
For applications with an important steady-state, the most valuable data comes from that steady-state; collecting on warm-up or shutdown is generally not useful.
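The speculative devirtualization item in the list above is easier to picture in source form. The following is a hand-written sketch of the transformation's shape, not compiler output: the `Shape` hierarchy is hypothetical, and the `typeid` comparison stands in for whatever target check the compiler actually emits.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle final : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// What the source says: an indirect call through the vtable.
double areaIndirect(const Shape& s) { return s.area(); }

// Shape of the speculated form, assuming the profile showed that this
// call site almost always sees a Circle.
double areaSpeculated(const Shape& s) {
    if (typeid(s) == typeid(Circle)) {
        // Common case: the body of Circle::area, inlined at the call site.
        const Circle& c = static_cast<const Circle&>(s);
        return 3.141592653589793 * c.r * c.r;
    }
    // Uncommon case: fall back to the ordinary indirect call.
    return s.area();
}
```

Both functions return the same result for every `Shape`; the speculated form simply makes the common target cheaper to reach and visible to further optimization.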
Handling Sparse and Evolving Data
Sampling data is inherently statistical, so raw counts are sparse and noisy. SPGO addresses this with several techniques:
- Minimum Cost Flow (MCF) smoothing reconstructs consistent flow graph counts from sparse sample data, ensuring that block and edge counts are flow consistent.
- Interprocedural light-up propagates liveness information across the call graph, so even functions with too few samples to be directly observed can benefit from being called by sampled functions.
- Identical Code Folding (ICF) awareness correctly handles cases where the linker has merged identical function bodies, redistributing sample counts to all original functions.
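As an illustration of what "flow consistent" means in the smoothing item above, here is a small checker. It does not implement Minimum Cost Flow; it only tests the property that smoothing is meant to establish. `Edge` and `isFlowConsistent` are hypothetical names, and blocks are identified by plain integer indices for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge {
    std::size_t from;
    std::size_t to;
    std::uint64_t count;
};

// Returns true when, for every block, the incoming edge counts sum to the
// block count (except at the entry block) and the outgoing edge counts sum
// to the block count (except at the exit block).
bool isFlowConsistent(const std::vector<std::uint64_t>& blockCounts,
                      const std::vector<Edge>& edges,
                      std::size_t entryBlock, std::size_t exitBlock) {
    std::vector<std::uint64_t> in(blockCounts.size(), 0);
    std::vector<std::uint64_t> out(blockCounts.size(), 0);
    for (const Edge& e : edges) {
        assert(e.from < blockCounts.size() && e.to < blockCounts.size());
        out[e.from] += e.count;
        in[e.to] += e.count;
    }
    for (std::size_t b = 0; b < blockCounts.size(); ++b) {
        if (b != entryBlock && in[b] != blockCounts[b]) return false;
        if (b != exitBlock && out[b] != blockCounts[b]) return false;
    }
    return true;
}
```

Raw samples for an if/else diamond might show 100 hits at the entry but only 70 and 20 on the two arms, which cannot all be true at once. A smoothed profile would adjust the counts (for example to 70 and 30) so that every block's inflow and outflow agree.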
SPGO also handles code staleness gracefully. When your source code changes between profile collection and the optimized build, SPGO continues to use existing profile data for unchanged routines and for routines where the control flow graph is unaltered. The ideal approach is to recollect fresh data regularly, but partial staleness does not invalidate your entire profile. The linker reports a “Total Dynamic Instructions Optimized using profile data %” metric that indicates how much of your existing profile remains applicable. Even a small drop (3-5%) in this value can signal that it’s time to refresh your profiles.
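A build pipeline could turn that guidance into a simple gate. The sketch below assumes you have already extracted the reported percentage from two builds (how you parse the linker output is up to you) and that the 3-5% figure is read as percentage points, which is an interpretation rather than something stated above. `profileNeedsRefresh` is a hypothetical helper.

```cpp
#include <cassert>

// Flags a profile as stale when the "optimized using profile data" percentage
// has dropped by at least maxDropPoints percentage points since the baseline.
bool profileNeedsRefresh(double baselinePercent, double currentPercent,
                         double maxDropPoints = 3.0) {
    assert(maxDropPoints > 0.0);
    return (baselinePercent - currentPercent) >= maxDropPoints;
}
```

For instance, a build that reported 96.0 right after collection and 92.5 a few weeks later would be flagged, while a drop to 95.0 would not.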
For teams adopting SPGO incrementally, there is also a linker option to avoid penalizing functions that lack profile data, compiling them with standard LTCG optimizations instead. This is particularly useful during early adoption when profile coverage is still growing.
Practical Tips
- Store SPD files as versioned packages (e.g., NuGet). This enables rollback, debugging with specific versions, and tracking profile evolution over time.
- Use /EMBEDSPD as a linker option to embed the SPD inside the PDB for convenient distribution alongside debug information. Be aware this increases PDB size.
- SPGO matches by PDB GUID and age – always use the SPD from the exact build that produced your binary. Mismatches will cause spdconvert to report a warning and skip the data.
- Emphasizing important runtime scenarios – add SPT files multiple times on the SPDConvert step to underscore their importance.
SPGO and Cloud Cost Optimization
One of the most compelling use cases for SPGO is reducing the cost of goods sold (COGS) for large-scale cloud services. When you’re running thousands of instances of a service, even a 5-15% improvement in CPU efficiency translates directly to significant cost savings. Cloud environments are ideally suited for SPGO: you can collect lightweight production profiles continuously, aggregate data from many nodes, and feed enriched profiles back into your build pipeline – creating a virtuous cycle of measurement and optimization.
Summary
SPGO brings the performance benefits of profile-guided optimization to a much wider audience by removing the need for instrumented builds and synthetic training scenarios. By leveraging hardware sampling from release binaries, SPGO provides:
- 5-15% performance improvements for C/C++ applications, with the upper range unlocked by high-quality profile data
- Real-world profile data reflecting actual customer and production usage
- Near-zero collection overhead safe for production deployment
- Flexible data blending from labs, internal monitoring, and production profile counts
- Graceful handling of sparse and stale data through sophisticated smoothing and staleness management
- The same powerful optimizations as traditional PGO: inlining, code layout, hot/cold separation, and speculative devirtualization
- An iterative workflow that improves over time as profiles are refreshed and enriched
SPGO is available now in MSVC Build Tools version 14.51 as part of Visual Studio 2026 version 18.6. We believe it represents a significant step forward in making profile-guided optimization accessible and practical for C++ developers building high-performance applications with MSVC. We would love to hear your feedback! You can share your comments below, or, after installing Visual Studio, via Help > Send Feedback > Report a Problem in-product or on Visual Studio Developer Community.