We’ve blogged before about the benefits of Profile Guided Optimization. One of the biggest pieces of feedback we’ve received is that the instrumented binaries are too slow – making it very difficult to train certain classes of applications, such as games.
This is something we’ve tried to address in VS 2015: there were a number of behind-the-scenes changes to code generation of the instrumented binary and the associated PGO runtime to increase runtime performance and minimize the overhead of the instrumentation. We’ve seen up to a 30% throughput increase in some scenarios. Everyone gets this for free; you don’t have to do anything other than use PGO on VS 2015. But what I wanted to talk about today was a few optimizations we couldn’t quite turn on by default and why, and their associated command line options.
To review, to use PGO in VS 2013 you pass /LTCG:PGI to the linker to produce an instrumented binary, and /LTCG:PGU to the linker to produce a PGO optimized binary.
In VS 2015, the PGO specific options have been centralized to a top level switch to link.exe, with several of their own subswitches. From link /?:
/GENPROFILE[:{COUNTER32 |COUNTER64 |EXACT |MEMMAX=# |MEMMIN=# |NOEXACT |NOPATH |NOTRACKEH |PATH |PGD=filename |TRACKEH}]
It was necessary to make it a top level switch in order to give it subswitches. The first rule of GENPROFILE: all default behaviors are identical to VS 2013. Passing /GENPROFILE with no subswitches is exactly the same as /LTCG:PGI in VS 2013, and in VS 2015 as well, for that matter: we still accept the old switches for compatibility reasons.
COUNTER32 vs. COUNTER64: COUNTER64 is the default – use a 64 bit value for the probe counters, COUNTER32 will use a 32 bit value. This is obviously important if any single probe value gets close to or exceeds 2^32 – but it turns out, almost no probes ever do. The overhead of a 64 bit increment versus a 32 bit increment might not seem too much, but remember there are A LOT of probes in an instrumented build, approximately one for every two basic blocks, so the overhead in both codesize and perf adds up on x86.
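To make the risk concrete, here is a minimal Python sketch of what a fixed-width counter does when a probe fires more often than the counter can represent. It assumes plain wraparound semantics; the helper `bump` and the hit count are made up for illustration, and the real PGO runtime may handle overflow differently.

```python
def bump(counter: int, times: int, bits: int) -> int:
    """Add `times` increments to a counter that wraps at 2**bits (assumed semantics)."""
    return (counter + times) % (1 << bits)


# Hypothetical hot probe that fires 2**32 + 5 times during training.
hits = (1 << 32) + 5

wrapped = bump(0, hits, bits=32)  # the 32-bit counter has wrapped around
exact = bump(0, hits, bits=64)    # the 64-bit counter keeps the true count
print(wrapped, exact)
```

Under that assumption the 32 bit counter ends up reporting a tiny count for what was actually the hottest probe in the run, which is exactly the kind of misleading profile data you want to avoid.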
So how do you know when it’s safe to use COUNTER32? Well, we added some helpful output to pgomgr /summary:
C:\temp>pgomgr foo.pgd /summary
Microsoft (R) Profile Guided Optimization Manager 14.00.23022.0
Copyright (C) Microsoft Corporation. All rights reserved.
PGD File: foo.pgd
03/05/2014 00:20:07
Module Count: 1 Function Count: 11362 Arc Count: 12256 Value Count: 377
Phase Name:
Max Probe Counter: 0x0000000000DE0467 (0.34%)
Consider /GENPROFILE:COUNTER32 for increased training performance.
It tells us that the maximum probe counter value in the training scenario was 0xDE0467 (about 14.5 million), which is 0.34% of the 32 bit counter space (about 4.3 billion). It’s not even close. Based on this, you see the output recommending COUNTER32.
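The percentage in that output is easy to reproduce. The following sketch recomputes it from the hexadecimal value shown above. The function names and the 50% safety margin are hypothetical, chosen for illustration only; the post does not say what threshold pgomgr uses before it recommends COUNTER32.

```python
COUNTER32_RANGE = 2**32  # number of distinct values in a 32-bit counter


def counter32_usage(max_probe_counter: str) -> float:
    """Return the max probe counter as a percentage of the 32-bit range."""
    return int(max_probe_counter, 16) / COUNTER32_RANGE * 100


def counter32_looks_safe(max_probe_counter: str, margin_pct: float = 50.0) -> bool:
    """True if usage stays under a hypothetical safety margin (not pgomgr's rule)."""
    return counter32_usage(max_probe_counter) < margin_pct


usage = counter32_usage("0x0000000000DE0467")
print(f"{usage:.2f}%")  # prints 0.34%
```

If you script this over several training runs, leave yourself generous headroom: a longer or heavier scenario later on can push the hottest probe much higher than today's maximum.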
The vast majority of training scenarios will be perfectly fine using COUNTER32 – in fact, internally I have never seen one which isn’t. However, you can imagine the consequences of overflowing a 32 bit counter are very bad, and existing external customers may very well have training scenarios requiring a 64 bit counter – so COUNTER64 needs to remain the default.
EXACT vs. NOEXACT: NOEXACT is the default. EXACT is a renamed version of the old /POGOSAFEMODE switch, which has been deprecated. With EXACT, probes use thread-safe interlocked increments; with NOEXACT they do not. EXACT is a good idea if you have a heavily multithreaded program and your training quality is being harmed as a result. /POGOSAFEMODE is still accepted for compatibility reasons.
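To see why non-interlocked increments can hurt training quality, consider this deterministic Python simulation of several threads bumping one shared probe. It is a toy model, not the PGO runtime: `run_interleaved` is a made-up helper, and the non-atomic branch deliberately picks the worst-case interleaving, where every thread reads the counter before any of them writes it back.

```python
def run_interleaved(threads: int, increments: int, atomic: bool) -> int:
    """Simulate `threads` threads each incrementing one shared counter `increments` times."""
    counter = 0
    for _ in range(increments):
        if atomic:
            # Interlocked: each read-modify-write completes before the next starts.
            for _ in range(threads):
                counter += 1
        else:
            # Non-interlocked worst case: all threads read the same value first...
            snapshots = [counter] * threads
            # ...then each writes back its stale value plus one.
            for seen in snapshots:
                counter = seen + 1
    return counter


print(run_interleaved(threads=4, increments=1000, atomic=True))   # prints 4000
print(run_interleaved(threads=4, increments=1000, atomic=False))  # prints 1000
```

Real programs rarely hit the worst case on every increment, so the loss is usually much smaller, which is presumably why the cheaper non-interlocked increment is an acceptable default.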
MEMMAX=# and MEMMIN=#: These values specify the maximum and minimum memory reservation size in bytes for the training data in memory. Internally PGO uses a heuristic to estimate the amount of memory needed and reserves the space. Because it is unlikely to be able to expand the space later (the reservation needs to be contiguous and stable), this initial reservation is very aggressive. In some scenarios, especially when multiple instrumented binaries are present in the same process, this can result in running out of address space and eventually crashing with out of memory errors.
MEMMAX and MEMMIN provide a way to specify a ceiling and floor to the heuristic used internally by PGO when estimating the needed memory. PGO will still make its estimate, but respect the MEMMAX and MEMMIN values as appropriate.
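Conceptually, this is a clamp applied to the heuristic's estimate. The sketch below shows the idea in Python. The function name is hypothetical, and the choice to let the ceiling win if someone passes a floor larger than the ceiling is an assumption made here, not documented PGO behavior.

```python
from typing import Optional


def reservation_size(estimate: int,
                     memmin: Optional[int] = None,
                     memmax: Optional[int] = None) -> int:
    """Clamp a heuristic estimate (in bytes) to an optional floor and ceiling."""
    size = estimate
    if memmin is not None:
        size = max(size, memmin)  # never reserve less than the floor
    if memmax is not None:
        size = min(size, memmax)  # never reserve more than the ceiling (assumed to win)
    return size


print(reservation_size(24576, memmax=8192))  # prints 8192
print(reservation_size(1024, memmin=4096))   # prints 4096
```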
So how do you know what value to use here? We added some helpful output here as well, but this time at merge time:
C:\temp>pgomgr /merge foo.pgd
Microsoft (R) Profile Guided Optimization Manager 14.00.23022.0
Copyright (C) Microsoft Corporation. All rights reserved.
Merging foo!1.pgc
foo!1.pgc: Used 14.7% (3608 / 24576) of total space reserved. 0.0% of the counts were dropped due to overflow.
In this small example, the memory reservation size was 24576 bytes, of which only 3608 bytes were needed for training. If these values are consistent among all PGC files, you’d be safe in specifying a lower MEMMAX size when producing the instrumented binary. The other output estimates how much data was lost because the available space filled up. If this value ever stopped being 0%, you might want to specify a higher MEMMIN size.
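If you have many PGC files, you may want to check these numbers with a script rather than by eye. Here is a rough Python sketch that pulls the values out of a merge line. The regular expression is written against the single sample line shown above, so treat the pattern and the function name as assumptions; the exact wording of pgomgr's output may differ between versions.

```python
import re

# Pattern derived from the one sample line in this post (an assumption).
MERGE_LINE = re.compile(
    r"Used\s+[\d.]+%\s+\((\d+)\s*/\s*(\d+)\)\s+of total space reserved\.\s+"
    r"([\d.]+)% of the counts were dropped"
)


def parse_merge_line(line: str) -> tuple:
    """Return (bytes_used, bytes_reserved, dropped_pct) from a pgomgr merge line."""
    match = MERGE_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized merge line: {line!r}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


sample = ("foo!1.pgc: Used 14.7% (3608 / 24576) of total space reserved. "
          "0.0% of the counts were dropped due to overflow.")
used, reserved, dropped = parse_merge_line(sample)
print(f"{used / reserved * 100:.1f}% used, {dropped}% dropped")
```

A consistently low usage figure suggests room to lower MEMMAX, while any nonzero dropped percentage is a hint to raise MEMMIN.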
I suspect not many people will ever need this option, but if you find yourself running into memory issues during training, it’s something to look into. It was added because the only other option when hitting memory issues is to split up the training of multiple binaries into multiple separate training runs, which has a manpower cost associated with it.
PATH vs. NOPATH: PATH is the default. Path profiling is when PGO keeps a separate set of counters for each unique path to a function, allowing for better inline decisions and more accurate profile data after inline decisions have been made. This leads to better overall code generation.
So why would you ever turn this off? Well, the memory overhead here is high: imagine all the different unique callpaths to a given function in your program. With path profiling we keep a separate set of counters for each! With NOPATH, we only keep one. In addition to the memory cost, there is runtime overhead in looking up the correct set of counters during a function’s prolog.
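A toy model helps show where the memory goes. In this Python sketch each call is recorded as the chain of functions leading to it, and the profile is keyed either by the whole chain (path profiling) or by the callee alone. The function names and call chains are invented for illustration, and a real instrumented build keeps a full set of probe counters per entry rather than the single count used here.

```python
from collections import Counter


def profile(calls, path_sensitive: bool) -> Counter:
    """Count calls keyed by full call path, or by callee only when path_sensitive is False."""
    counts = Counter()
    for path in calls:
        key = path if path_sensitive else (path[-1],)
        counts[key] += 1
    return counts


# Hypothetical call chains that all end in the same function.
calls = [
    ("main", "parse", "alloc"),
    ("main", "parse", "alloc"),
    ("main", "render", "alloc"),
    ("main", "render", "draw", "alloc"),
]

print(len(profile(calls, path_sensitive=True)))   # prints 3
print(len(profile(calls, path_sensitive=False)))  # prints 1
```

The path-sensitive profile can tell that `alloc` is hot when reached through `parse` but not through `draw`, which is the information that improves inlining decisions. It also has to store and look up one entry per distinct path.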
If your memory usage is so high and runtime perf of the instrumented binaries is so bad that you’re considering not using PGO at all, try NOPATH.
We really like path profiling and measure significant gains over non-path profiling, so we’re not willing to turn it off by default. However, we do want people to use PGO, and non-path profiling still gives significant wins over LTCG. So it’s here as a last resort.
TRACKEH vs. NOTRACKEH: TRACKEH is the default. With TRACKEH, every callsite has two counters around it, one before and one after, to keep accurate counts in the event that the call throws an exception and control flow resumes somewhere else. If your binary doesn’t typically use EH, or doesn’t use EH during the training scenarios you’re running, you can safely specify NOTRACKEH to omit these call probes for a minor codesize and speed win. NOTRACKEH isn’t the default because it harms training accuracy in the presence of EH.
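The effect of the second counter can be shown with a small Python model. Each training iteration makes one call that either returns normally or throws. With call probes we count both sides of the call; without them, this sketch assumes the code after the call is credited with the same count as the code before it. That assumption, and the `train` helper itself, are illustrative only and not a description of the actual instrumentation.

```python
def train(outcomes, track_eh: bool) -> tuple:
    """Return (before_call, after_call) counts for a callsite.

    `outcomes` holds one bool per call: True if that call threw an exception.
    """
    before = after = 0
    for threw in outcomes:
        before += 1
        if track_eh and not threw:
            after += 1  # only reached when the call returns normally
    if not track_eh:
        after = before  # assumed: no probe after the call, so counts are shared
    return before, after


outcomes = [False, True, False, True]  # hypothetical run: half of the calls throw
print(train(outcomes, track_eh=True))   # prints (4, 2)
print(train(outcomes, track_eh=False))  # prints (4, 4)
```

When nothing throws during training, both modes report identical counts, which is why omitting the call probes is safe for code that does not use EH.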
PGD=filename: Similar to EXACT, this is the old /PGD switch demoted from a top level switch to a subswitch of /GENPROFILE. /PGD is still accepted for compatibility reasons.
So that covers /GENPROFILE. You might notice another switch that looks very similar, /FASTGENPROFILE:
/FASTGENPROFILE[:{COUNTER32 |COUNTER64 |EXACT |MEMMAX=# |MEMMIN=# |NOEXACT |NOPATH |NOTRACKEH |PATH |PGD=filename |TRACKEH}]
In fact, it is exactly the same: the only difference is the default values. GENPROFILE defaults to COUNTER64, NOEXACT, PATH, and TRACKEH (the equivalent of VS 2013 behavior), whereas FASTGENPROFILE defaults to COUNTER32, NOEXACT, NOPATH, and NOTRACKEH.
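One way to think about the two switches is as two sets of defaults over the same four choices, with explicit subswitches overriding them. The Python sketch below models that. The dictionary layout and the `effective_settings` helper are hypothetical, and MEMMAX, MEMMIN, and PGD are left out for brevity.

```python
# Defaults as described in this post.
DEFAULTS = {
    "GENPROFILE": {"counter": "COUNTER64", "exact": "NOEXACT",
                   "path": "PATH", "eh": "TRACKEH"},
    "FASTGENPROFILE": {"counter": "COUNTER32", "exact": "NOEXACT",
                       "path": "NOPATH", "eh": "NOTRACKEH"},
}

# Which setting each subswitch controls.
GROUP = {
    "COUNTER32": "counter", "COUNTER64": "counter",
    "EXACT": "exact", "NOEXACT": "exact",
    "PATH": "path", "NOPATH": "path",
    "TRACKEH": "eh", "NOTRACKEH": "eh",
}


def effective_settings(switch: str, *subswitches: str) -> dict:
    """Start from the switch's defaults, then apply each explicit subswitch."""
    settings = dict(DEFAULTS[switch])
    for sub in subswitches:
        settings[GROUP[sub]] = sub
    return settings


# GENPROFILE with three overrides matches the FASTGENPROFILE defaults.
fast = effective_settings("GENPROFILE", "COUNTER32", "NOPATH", "NOTRACKEH")
print(fast == DEFAULTS["FASTGENPROFILE"])  # prints True
```

In other words, FASTGENPROFILE is a convenient starting point if you want the faster training configuration and then wish to turn individual features such as PATH back on.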
For using a profile, we have a new /USEPROFILE switch:
/USEPROFILE[:PGD=filename]
This is the equivalent of /LTCG:PGU in VS 2013 (and, as you’d expect, /LTCG:PGU is still accepted for compatibility). The PGD option here is the same as for /GENPROFILE, which is to say it is the old /PGD switch in VS 2013.
If you are currently using, or planning to use, PGO from the IDE: we have not yet updated our property pages to accept these new Profile Guided Optimization switches, and they still point to the ones we had for VS 2013. This is on our radar, and the changes to the property pages should show up in a VS 2015 update. For now, please use the Linker command line property pages.
So there you have it. In VS 2015 we cleaned up the mix of PGO switches and provided an array of options for controlling the code generation and training fidelity of instrumented PGO builds. We also implemented a ton of behind-the-scenes changes that don’t affect training quality. So give PGO in VS 2015 a try; we’d love to hear your feedback!