Inlining Decisions in Visual Studio

Terry Mahaffey

Terry

Introduction

My name is Terry Mahaffey and I work on the code generation team in MSVC. Lately I’ve been doing some work on our inliner and I wanted to give a brief introduction to it before later diving into some of the changes we’ll be shipping.

Inlining is perhaps the most important optimization a compiler performs. In addition to removing call overhead, an inline decision is most useful when it exposes additional optimization opportunities not present in either the caller or callee by themselves. For example:

It is a really good idea to inline bar into foo in this example; once that is done the compiler is able to evaluate the entire function, and the final code generation of foo will directly return 16.

Contrast that with this example:

Here it is less clear that inlining bar into foo, baz, and zoo is a win. Because the parameter passed in is not constant, the compiler won’t be able to figure out the final value in each case at compile time. Inlining does avoid the call overhead, but that must be weighed against the fact that the body of bar will appear at least 4 times in the final program increasing code size and hurting cache locality.

The goal of a good inliner is therefore to estimate the “benefit” from inlining at a call site and weigh it against the “cost” (typically measured in code growth). In addition, in MSVC we have a notion of a global budget where we’ll stop inlining regardless of additional benefit to prevent code growth explosion.

In Visual Studio we have been working on expanding the capability of our inliner to both be smarter about what we inline (be able to realize there are benefits in places we didn’t before) and more aggressive (raising the inlining budget, lowering the threshold). We’ll have more to say in upcoming blog posts. But for now, I wanted to give some context to our inliner in MSVC and what makes it different from inliners in other compilers.

We inline pre-optimized code

In contrast to other compilers, when MSVC inlines a call it actually inlines pre-optimized code by reading in the original unoptimized version of the function for that callee again, even though the callee usually has been compiled and optimized by this point. This also means that it makes (and repeats) inline decisions in each context a function might be called in, perhaps leading to different results. For example, consider a function f which calls g which calls h, where h is inlined into g. If g is inlined into f, h will not necessarily be inlined into f as well. A compiler which inlines post-optimized code (here, a version of g which already has h inlined into it) implicitly “replay” inline decisions made for callees.

We feel this is a strength, as it might not necessarily be best to replay the same inline decisions in each context. However, this has a big compile-time cost as frequently the same inline decisions are made and the same code is optimized over again. We’re currently exploring a middle option where we can replay some of the obvious post-optimized versions of functions.

We inline first

Inlining is the first optimization done by the compiler. So not only is the compiler inlining pre-optimized versions of callees, it is also inlining into a pre-optimized version of the caller. One consequence is it currently doesn’t realize there are some obvious optimization opportunities. Revisiting the first example the compiler does great at realizing bar should be inlined into foo, but if foo were changed to look like this:

The MSVC inliner would currently consider “x+1” as a non-constant parameter, and not apply a bonus inside the inline heuristic based on the parameter usage in bar.

Another consequence is that indirect and virtual calls which could have been converted to direct calls via constant propagation haven’t been optimized to do so yet, so we do not inline through them. So frequently you’ll see an indirect call to a small function converted to a direct call and emitted in the final binary as such, leaving the programmer to wonder why the compiler missed such an obviously good inlining opportunity. It’s an ordering issue; sometimes the optimizer performs the optimizations the inliner needs after the inliner has already run.

These are also issues we’d like to address in the near future by performing a limited set of optimizations before or during inlining.

A word about our implementation

The MSVC inliner, at a high level, looks like this:

  1. Identify all inline candidates (first set of legality checks)
  2. For each candidate,
    1. Read the body of the candidate, run a second series of legality checks
    2. Run a series of inlining heuristics
    3. If it looks like a good inline candidate, recursively inline into the candidate
    4. Run a final series of legality checks

First, note that it is a “depth first” inliner. Moving towards a breadth first approach is an area that is on the roadmap to explore in the future. There are advantages and disadvantages to each approach.

These legality checks and heuristics are a set of tables of function pointers we iterate over. If any legality check fails, inlining is aborted for that candidate. If any heuristic check succeeds, inlining moves forward.
The three legality steps occur first based on only what we know about the potential inlinee before reading it in, second after it has read in the inlinee, and finally after we’re recursively expanded into the callee.

Legality checks tend to speak to limitations in the inliner, typically corner cases which were never implemented. Things like arguments with complex user types passed by value, inlining across different user defined parts of a binary, inlining functions with try blocks, inlining functions with setjmp, a inlining depth check where we have a hard limit on how deep we inline, etc.

The heuristics are not all created equal. There is one heuristic in particular called “callgraph decision” which is what I consider the “real” inline decision maker. It is where all of the benefit estimating code around constant parameters described above is implemented. A call graph decision depends on bottom up compilation order, because certain information is gathered about the callee during it’s compilation (such as its use of its parameters) which is then used during inlining. There are other simple heuristics such as the inlinee being a forceinline function, a very small function, and a “simple decision” heuristic for cases where a call graph decision can’t be made.

This framework is flexible and easy to understand. Adding a new legality check or heuristic is as simple as adding an entry into a table. Profile Guided Optimization, or PGO, utilizes its own inlining decision engine based on profiling data, and it implements this simply by having its own entry in the table. Similarly, for instrumented builds PGO prefers no inlining occur to help gather the most accurate set of counts possible. PGO implements turning off inlining for instrumented builds by a simple legality check which always says “no”.

If you want to see this in action, run your build with the /d2inlinestats switch. This will print out a table of what legality checks failed and how often, as well as what heuristics are driving the successful inline instances.

Conclusion

I hope you found this helpful. Over the next few months I plan on writing a few more blog posts to give some pointers on how to open up the hood even more and get more visibility into what specifically is happening with our inliner, as well as talk about features we have in development to address some of the problems. And if there are any inlining topics you’d like to see addressed, please leave a message in the comments below!

We’d love for you to download Visual Studio 2019 and give it a try. As always, we welcome your feedback. We can be reached via the comments below or via email (visualcpp@microsoft.com). If you encounter problems with Visual Studio or MSVC, or have a suggestion for us, please let us know through Help > Send Feedback > Report A Problem / Provide a Suggestion in the product, or via Developer Community. You can also find us on Twitter (@VisualC) and Facebook (msftvisualcpp).

Terry Mahaffey
Terry Mahaffey

Principal Software Engineer, Visual C++

Follow Terry   

3 comments

  • Avatar
    akil mieres

    decent explanation, helpful since I’m not originally a Comp Sci or Elec/Comp Eng major, it is really helpful. I will pay more attention to it since I am lazy with comments in code and not inlining helps me to read my my code when I leave it for a few months. Still in hobbyist transition mode. I have on wuestion though on the use of function pointers and table for inlining optimization. Would inline functions be easier to implement using F# instead of C++ (i’m presuming it is written in C++ of course), since it (F#) is allegedly a functional Pr.Lang and not originally intended for OOP. I’ve only dabbled with F# and honestly my flute doesn’t play the F# note for some odd reason at lower octaves. Hypothetically, F# would ideal for certain scenarios (from my limited understanding and experience with the Pr.Lang.) e.g. mem virtualization, mem. management even shadow copies based on the simplicity of its (alledgedly) functional execution paradigm. I am correct in saying that? If I am, it is a possible way to increase the over all performance of all Windows machines/clones/phones to match or even surpass Apple products. right? that IS the point? this is PERSONAL, since my paternal unit is still cracking jokes about Windows. All in good fun, but I have to win this one. Not asking to redesign the OS, it’s perfect as it is as far as I’m concerned, but if I am even close to being correct on this, I’m sure a high performance version for industrial/Return on Core (CPU/GPU) Investment (R.O.C.I.) uses (example cryptocurrency mining) would spice up the Lasagna a bit

    • Avatar
      akil mieres

      pls do not misunderstand my interjection of F#, I know my facebook [profile is quite controversial. I don’t like what facebook or social netowrking has become. an abuse of the privacy act really, when you consider that any excuse is used to remotely monitor some one (vis a vis truman show, sliver). dangerous ground being tread like that. However, the languages supported by VS & MS (C, C++, C# and F#) are inspired by musical notes, hence the name Visual STUDIO, so in my lay person’s opinion the OS should function using all of the languages in their sphere of speciality, that is the only reason I even thought of F#’s functionality in increasing performance for memory/ALU intensive calls which is what your inline application seems to be doing. No disrespect or cheekiness meant, I am actually currently sharpening my C++ skills to back up my C# and eventually C and F# as well. I am looking into the effects of 802.11 standard on atmospheric reactions (e.g. Ozone, NOXs, SOXs). From a Quantum Chemical perspective it is possibile that equillibrium states of atmospheric gases may be affected by inefficiencies in wireless protocols and the OSI stack in general which possibly can lead to an increased acidity of rain (sick rain) and global warming. Of course it is quite difficult as a freelance reseacher without funding yadda yadda yadda. At worst, I might have to write and energy efficient wireless 802.11 network driver or a connection manager that specifically selects wireless channels to reduce energy content/requirement of a single wireless transmission. I’m working alone so I’ll be definitely following your blog :). Save the trees right?

  • Avatar
    akil mieres

    Note: I base my assumptions about the F#/C++ debate on the fact that C/C++ uses/abuses the curly brackets and that F originallly didn’t really depend on or relied on objectification of code, a fact which is still a feature of F# correct? Hence, it’s prolonged uses in Linux/Unix for engineering/biochemical/chemical and mathematical research. Not so? 

Leave a comment