Header files and the preprocessor – Can't Live With 'em, Can't Live Without 'em
Hello, my name is Richard Russo. I’m the newest member of the Visual C++ compiler front-end team, having started in January. I’m not new to Microsoft, though: I spent the last three years working on Windows Vista. I’m excited to be on the front-end team because compiler development has been a hobby of mine for a few years.
Most posts on this blog discuss new features that have been added, or what daily work is like here in various positions. Instead, I’d like to discuss my thoughts on a particular aspect of C/C++ that I think deserves more design attention: header files and the preprocessor. I’ll probably be discussing a few hypothetical features here, and I want to say up front that this does not necessarily mean the Visual C++ team will be working on or delivering these features. In writing up these ideas, mainly what I’m hoping to do is spark discussion and get your feedback. There should be a link below to leave comments, so please do!
First off, what are header files really for? Well, without doing research into the design rationale of Kernighan and Ritchie, their most basic purpose is to let us maintain units of code for separate compilation in a declare-before-use language. You can imagine that without the preprocessor we’d have to repeat the same declarations in every source file; we’d quickly get tired of that and probably hack together something that looks a lot like the current preprocessor. Sure, we use the preprocessor for lots of clever things, but to me that is its essential purpose.
What are some frustrations with header files? Well, they seem to contribute to really long build times. Have you ever used the preprocessor modes of cl.exe? You can access those with /E (preprocess to stdout), /P (preprocess to a file), and /EP (preprocess to stdout, but don’t emit #line directives). Give it a try; for instance, “cl /P foo.cpp” will produce “foo.i”. On my system, I wrote a quick Windows “hello world” with the MessageBox function. When I preprocessed this 5-line program, it wound up being roughly 200,000 lines of source code for the compiler to parse. Now imagine what it is like in your project if every source file includes windows.h. You can imagine that parsing 200k lines of declarations slows down the compiler a bit.

What else? Well, header files seem redundant. We have to type class names, method names, parameter lists, etc. twice. While that’s certainly not the most costly part of developing a C++ project, it probably does slow your thought process down a little. It also creates additional maintenance: if you change a parameter list in one place, you need to change it in another. Again, not a huge cost, but a real one.

We’re still not done yet; I can add a few more potential issues. Header files change depending on the context in which they are preprocessed, or to say it more succinctly, they have isolation problems. What if you are integrating two libraries that both have a foo.h and both use the guard macro FOO_H? Well, that’s something you as a coder have to take time to deal with. Along the same lines, if you have a really big project without carefully designed headers, you might notice differences in compile-time (or even potentially run-time) behavior depending on the order in which you include headers. It’s not the end of the world, but you have to pay a cost to investigate and fix the problem. I think most C/C++ coders would agree that the preprocessor comes with a price tag and a long list of potential pitfalls.
Well, it’s not all bad, right? Certainly not. The first benefit I’m thinking of is the most interesting to me. In C/C++ we have to declare before use, and we put all those shared declarations in header files. This lets us easily produce separate compilation units that are mutually dependent. And because the mutual dependencies are satisfied by the header files, in general all of our source files can be compiled in parallel. I think most people would agree that this parallelism is a good thing. Consider a counter-example. Say you’re compiling a C# program. First off, you probably rarely compile that program one source file at a time; you pass a collection of source files to the compiler, the same way you pass those files to the C/C++ compiler. But the C# compiler must treat that batch of files differently. The C/C++ compiler can compile them all separately and then finally consider them as a unit for linking purposes. In the C# case, the compiler has to consider all of those source files somewhat simultaneously, because they can have inter-dependencies: you can refer to a class in another source file without having to give a declaration of it first. In short, in C/C++ a translation unit is always a single source file plus its headers, whereas in C# the translation unit is potentially multiple source files. At the very least, that makes the C# build process seem more difficult to parallelize. No doubt it is doable, but there will be some overhead associated with this issue. You can think of the C/C++ coder as paying for that overhead up front by maintaining header files. No doubt there are other benefits to header files you can think of. For instance, you can use the macro preprocessor to add rudimentary syntax to the language, to help “automate” some tasks.
What can we do about it? I’m going to discuss two proposals. One of them is not mine at all and regards modules for C++. The other is perhaps a more practical tool that might help you diagnose issues with header files in your codebase.
The word “module” has a lot of different meanings, even in the context of the coding world. I would say that many programming systems out there (notice this does not include C and C++) provide module functionality which is some joining of the concepts of separate compilation and namespaces. C and C++ give us facilities for separate compilation, and C++ gives us namespaces, but it does not give us a unified feature that encapsulates these together, allowing us to refer to a separately compiled module and import selected symbols from it. Enter Vandevoorde’s Modules for C++ proposal (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2073.pdf). This gives us a mechanism by which we refer to potentially previously-compiled modules instead of including source code of declarations in our projects. This has some important benefits in terms of the caveats pointed out above; for example it does have isolation properties and it does not have the maintenance burden associated with the redundancy of header files. I can see usage of this feature negatively impacting the parallelism property if used in certain ways. For example (and similar to the C# case), if each source file in your project is a separate module, the compiler will have to analyze dependencies before attempting to compile the files in parallel. Luckily, the “import” statements give it some clues about these dependencies and can probably be scanned quickly, so there are no doubt ways to solve that problem but there will be some amount of overhead. I don’t have much else to add to Vandevoorde’s discussion in that paper, it is very thorough, and if you’re interested in the design of such a module system and potential issues I encourage you to read it.
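To give a flavor of the idea, here is a rough sketch of what module code might look like. The exact keywords are my paraphrase of the surface syntax discussed in the paper, not something any of today’s compilers accept, and next_id/counter are invented names; see the paper itself for the real grammar.

```cpp
// lib.cpp -- hypothetical module definition (sketch, not compilable today)
export module Lib;           // this translation unit defines module Lib
export int next_id();        // exported: visible to importers
static int counter = 0;      // internal: invisible outside the module
int next_id() { return ++counter; }

// client.cpp
import Lib;                  // loads Lib's compiled interface; there is no
                             // textual inclusion and no macro leakage
int id = next_id();
```

The key contrast with headers: the client consumes a compiled description of Lib’s interface rather than re-parsing declaration source text, and nothing Lib does with macros internally can affect the client.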
As fun as it is to consider redesigning the world, what could we do today, without changing the infrastructure of header files, to make things better? I can envision tools that generate headers, or that scan your header files and wider code base and attempt to diagnose problems for you. Most of what I argued above as problems with header files had to do with costs involving work for a human programmer. Well, maybe an automated tool can reduce or eliminate that cost in some cases.
The first suggestion is around “automagically” generating headers. Think of a compiler which analyzes your C and/or C++ source and extracts just the declarations into a minimal header file. You might have a source file which includes windows.h and then declares several classes. The header file would need to contain declarations of those classes, and probably just a few of the typedefs declared in windows.h and the various headers it includes. The result would most likely be fairly short; perhaps all you need to declare those classes is a few typedefs from windows.h like HMODULE, HWND, etc. This tool could potentially be integrated into your IDE so that when you change the source file, it automatically updates the header if necessary. For efficiency, the tool could assume that system headers like windows.h don’t change. Another efficiency suggestion might be to have the tool generate a header for an entire library instead of working at the single-translation-unit level.
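Here is a sketch of what such a tool’s output might look like. Suppose a window.cpp includes windows.h and defines a Window class; instead of making every client re-parse all of windows.h, the tool could emit a minimal header containing only the handful of typedefs the class interface actually needs. Everything below is invented for illustration, and the HMODULE/HWND stand-ins are defined locally so the sketch is self-contained (the real typedefs live in windows.h):

```cpp
// ---- hypothetical generated header: window.generated.h ----
using HMODULE = struct HMODULE__*;   // opaque handles, mirroring the
using HWND    = struct HWND__*;      // windows.h typedefs the class uses

class Window {
public:
    explicit Window(HMODULE instance);
    HWND handle() const;
private:
    HMODULE instance_;
    HWND    hwnd_;
};
// ---- end generated header ----

// Definitions that would stay in window.cpp:
Window::Window(HMODULE instance) : instance_(instance), hwnd_(nullptr) {}
HWND Window::handle() const { return hwnd_; }
```

A client including this generated header parses a dozen lines instead of 200,000, yet sees everything it needs to hold and use a Window.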
Such a tool seems like it could potentially help with the maintenance and build time issues. That whole windows.h header would need to be parsed by the header-generating tool, but that cost is amortized. It is paid only when you modify that source file, and you reap the rewards every time you compile another source file that includes the associated header. I see potential problems in this area, but if you have a code-base that is partitioned the right way, you might even be able to check in those generated headers and save even more compile time throughout your development team.
What about other issues, such as isolation? The header-generating tool might help with some of them by creating “well behaved” header files. For instance, it might generate preprocessor code that saves and restores the state of macros within the context of the file, or generate headers that don’t use any preprocessor features at all, except perhaps guard macros at the beginning and end of the file. But what would likely be more useful is some sort of tool to analyze your headers. I’ll give a few simple examples. It could look for isolation issues, such as two different header files in your INCLUDE path that have the same file name or use the same guard macro. It could find headers that don’t have proper guard macros. It could build a dependency graph that shows you how one header’s definitions affect the definitions in another, and what the include graph looks like for particular headers and source files.
Such an analysis tool might help you get your build times down as well. For instance, it could scan your source files and tell you that it is unnecessary to include a particular header because none of its declarations are used. Or it might tell you that instead of including foo.h, which includes bar.h, you could just include the latter, because this particular source file only uses declarations from bar.h.
I’m afraid I’ve used up my time and space for this post, but I hope it was an interesting read, and got you thinking about what features regarding the preprocessor and header files might make you more productive in your C and C++ codebase. Please leave comments if you liked these ideas, didn’t like them, or have your own.
Thanks for reading!