Tag Parsing C++
Hello, my name is Thierry Miceli and I am a developer on the Visual C++ Compiler Front End team. Although our team is mostly known for writing and maintaining the part of the C++ compiler that analyzes your source code and builds an internal representation from it, a great deal of our effort in the last few years has been directed into servicing the IDE and improving the IntelliSense experience.
Today, I am going to write about a new parser that has been specifically created to provide a fast and scalable way to extract information from C++ source code. This parser is one of our new additions to Visual Studio 2010 and we call it the “tag parser”.
The tag parser is used in Visual Studio 2010 to populate the SQL database that supersedes the NCB file. All of the browsing features of VC++ rely in some way on results provided by the tag parser. These include Class View, Call Hierarchy, Go To Definition/Declaration, Find All References, Quick Search, the Navigation Bar and VCCodeModel.
A Fuzzy Parser
It is a fuzzy parser, which means that instead of trying to strictly recognize and validate the full C++ syntax (we have an excellent compiler front-end to do that), it lazily matches an input stream of tokens against a set of patterns. This parser doesn’t populate a symbol table during parsing. It has no notion of types apart from built-in ones, it doesn’t build a full macro context, and its unit of translation is a single file (i.e. it doesn’t follow through #include directives). Nevertheless, the parser is able to deal with all of C++, C++/CLI and IDL.
High level of tolerance to incomplete code and errors.
The tag parser doesn’t try to make sense of every symbol or identifier in the source code. It is satisfied with being able to recognize the different parts of a declaration and their positions in the source file. If a name in the type specification of a declaration cannot be resolved by our C++ compiler, this does not prevent the tag parser from recognizing the declaration, and the declaration still shows up in Class View.
The tag parser is somewhat analogous to a human reader of the source code who is looking at a single declaration without knowing much about the rest of the project. That reader may not know what most identifiers actually represent, but can tell with a high level of confidence what the declaration is and locate its subparts.
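To make the idea concrete, here is a toy sketch of pattern matching without type knowledge. It is not the tag parser's actual algorithm; the type DeclarationMatch and the function match_function_declaration are hypothetical names invented for this illustration, and the pattern only covers the simple shape "<anything> <identifier> ( ... ) ;". It assumes a C++17 compiler for std::optional.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical result type: the recognized function name and where it starts.
struct DeclarationMatch {
    std::string name;
    std::size_t name_offset;
};

// Toy pattern: <anything> <identifier> ( ... ) ;
// No identifier is ever looked up, so unknown type names do not matter.
std::optional<DeclarationMatch> match_function_declaration(const std::string& text) {
    const std::size_t npos = std::string::npos;
    const std::size_t open = text.find('(');
    const std::size_t close = text.rfind(')');
    if (open == npos || close == npos || close < open) return std::nullopt;
    if (text.find(';', close) == npos) return std::nullopt;  // incomplete declaration

    // Walk back from '(' over whitespace, then over identifier characters.
    std::size_t end = open;
    while (end > 0 && std::isspace(static_cast<unsigned char>(text[end - 1]))) --end;
    std::size_t begin = end;
    while (begin > 0 && (std::isalnum(static_cast<unsigned char>(text[begin - 1])) ||
                         text[begin - 1] == '_')) {
        --begin;
    }
    if (begin == end) return std::nullopt;  // no identifier before '('

    // Something must precede the name (the type spelling), whatever it is.
    const std::size_t first = text.find_first_not_of(" \t");
    if (first >= begin) return std::nullopt;

    return DeclarationMatch{text.substr(begin, end - begin), begin};
}
```

Calling match_function_declaration("UnknownType* Compute(Foo a, Bar b);") reports the name Compute and its offset even though UnknownType, Foo and Bar are never resolved, while a truncated input such as "UnknownType* Compute(Foo a" is rejected as incomplete.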
In addition to the tolerance to ‘semantically’ incorrect code, which is a property of fuzzy parsers, the tag parser has heuristic-based error recovery for the most common causes of erroneous code during editing. For example, it will try to detect incomplete declarations and unclosed function bodies.
Dealing with preprocessor conditional directives.
The tag parser’s main role is to extract information from the source code that is then consumed by the IDE browsing features. Because browsing features closely relate to the editing experience, it is more useful for the tag parser to generate a structured representation of the full source code as it appears in the editor than a representation of the code that would get compiled under a specific project configuration.
The tag parser deals with preprocessor conditional directives (#if, #ifdef, #ifndef, #else, #elif, #endif) in a special way. It incorporates the full code in each of the branches of preprocessor conditional directives but still only parses complete declarations. For example, both the inactive and active branches are parsed and Class View shows both function declarations.
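As a rough illustration of that behavior, the sketch below collects complete declarations from every branch of a conditional block without evaluating any condition. It is a toy, line-based stand-in and not how the tag parser is implemented; the sample source kBothBranches, the macro USE_WIDE_TEXT and the function declarations_in_all_branches are all hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sample: the same function declared differently in two branches.
const char* const kBothBranches =
    "#ifdef USE_WIDE_TEXT\n"
    "void Print(const wchar_t* text);\n"
    "#else\n"
    "void Print(const char* text);\n"
    "#endif\n";

// Toy sketch: return every line that looks like a complete declaration,
// regardless of which conditional branch it sits in.
std::vector<std::string> declarations_in_all_branches(const std::string& source) {
    std::vector<std::string> found;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;  // blank line
        if (line[first] == '#') continue;          // directive: skipped, never evaluated
        const std::size_t last = line.find_last_not_of(" \t\r");
        if (line[last] == ';') {
            found.push_back(line.substr(first, last - first + 1));
        }
    }
    return found;
}
```

Running declarations_in_all_branches(kBothBranches) yields both Print declarations, which mirrors what Class View would show for such a file.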
The tag parser is also able to deal with more complex cases where a declaration is interrupted by one or more preprocessor conditional directives. For example, when a declaration is split across two branches, both of the declarations that the branches can induce are parsed and reported.
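The interrupted case can be sketched in the same toy style. The code below assumes a single, non-nested #if/#else/#endif block (no #elif) and stitches the shared text together with each branch in turn, producing the two declarations the branches can induce. The sample kInterrupted, the macro USE_FLOAT_COORDS and the function induce_declarations are hypothetical, and the real parser works on tokens rather than lines.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sample: one declaration whose parameter list depends on a macro.
const char* const kInterrupted =
    "void Draw(\n"
    "#ifdef USE_FLOAT_COORDS\n"
    "    float x, float y\n"
    "#else\n"
    "    int x, int y\n"
    "#endif\n"
    ");\n";

// Toy sketch: returns {declaration using the #if branch, declaration using the #else branch}.
std::pair<std::string, std::string> induce_declarations(const std::string& source) {
    enum class Region { Before, IfBranch, ElseBranch, After };
    Region region = Region::Before;
    std::string before, if_part, else_part, after;

    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;  // blank line
        const std::size_t last = line.find_last_not_of(" \t\r");
        const std::string text = line.substr(first, last - first + 1);

        if (text[0] == '#') {  // a directive only switches the region
            if (text.compare(0, 3, "#if") == 0) region = Region::IfBranch;
            else if (text.compare(0, 5, "#else") == 0) region = Region::ElseBranch;
            else if (text.compare(0, 6, "#endif") == 0) region = Region::After;
            continue;
        }
        std::string& target = region == Region::Before     ? before
                            : region == Region::IfBranch   ? if_part
                            : region == Region::ElseBranch ? else_part
                                                           : after;
        target += text;
        target += ' ';
    }
    // Drop the trailing separator added after the last piece.
    auto trim = [](std::string s) { if (!s.empty()) s.pop_back(); return s; };
    return {trim(before + if_part + after), trim(before + else_part + after)};
}
```

For kInterrupted this produces one Draw declaration taking floats and one taking ints, so both can be reported regardless of which branch a given build configuration would compile.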
Faster and scalable
Tag parsing scales because it is incremental – it doesn’t need to re-parse hundreds (or thousands) of compilation units after a header file is changed, as is often the case in an actual build. It is also faster than a full compiler (despite its heuristics) because it is not burdened by macro expansion and full semantic resolution. Thus it is well suited to capture real-time information for even the largest projects.
No built-in semantic resolution
Since the tag parser operates strictly on a per-file basis, certain semantic resolutions are left to its clients. For example, since function declarations and definitions typically appear in separate files, the tag parser reports a function declaration and its definition separately without any binding information. Therefore Class View has to match a function declaration and its definition so that they appear as a single entry in the Class View tree.
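The kind of matching a client has to do can be sketched as follows. This is only an illustration of the idea, not Class View's implementation: the record types FunctionRecord and MergedEntry and the function merge_records are hypothetical, and pairing purely by qualified name is a simplification that ignores overloads.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-file result, as a parser with no cross-file knowledge might report it.
struct FunctionRecord {
    std::string qualified_name;
    std::string file;
    bool is_definition;
};

// Hypothetical merged view: one entry per function, pointing at both locations.
struct MergedEntry {
    std::string declared_in;
    std::string defined_in;
};

// The client, not the parser, pairs each declaration with its definition.
std::map<std::string, MergedEntry> merge_records(const std::vector<FunctionRecord>& records) {
    std::map<std::string, MergedEntry> entries;
    for (const FunctionRecord& record : records) {
        MergedEntry& entry = entries[record.qualified_name];
        if (record.is_definition) {
            entry.defined_in = record.file;
        } else {
            entry.declared_in = record.file;
        }
    }
    return entries;
}
```

Given a declaration of Widget::Draw reported from widget.h and its definition reported from widget.cpp, merge_records returns a single entry that carries both file names.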
The tag parser is light-weight and this comes with some responsibilities on the side of the consumers of the parser results. The good thing here is that clients only have to incur the cost of building the semantic knowledge that they need and they can dig into the data with SQL now.
We tried to make the tag parser as standalone as possible. It doesn’t need to know about any kind of project configuration (include paths, compiler switches, etc.). In many cases the tag parser could be invoked with a source file name as its only argument and it would do an excellent job at extracting detailed information about the code in this file. The only caveat is preprocessor macros that interfere with the C++ syntax so badly that fuzzy parsing and error recovery heuristics cannot make sense out of the code. One example of such a macro is STDMETHOD, which expands into a member function signature from something like:
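A hypothetical declaration of that shape might look like the sketch below. The interface and method names are invented for illustration, and the HRESULT, STDMETHODCALLTYPE and STDMETHOD definitions are simplified stand-ins for the Windows SDK ones so that the snippet compiles on any platform.

```cpp
// Simplified stand-ins for the Windows SDK definitions (assumptions made so the
// sketch is self-contained; the real definitions live in the SDK headers).
typedef long HRESULT;
#define STDMETHODCALLTYPE
#define STDMETHOD(method) virtual HRESULT STDMETHODCALLTYPE method

// Hypothetical interface.
struct IExampleCounter {
    virtual ~IExampleCounter() {}
    // Without the macro definition, this line does not look like a declaration at all.
    STDMETHOD(GetCount)(int* count) = 0;
};

// Hypothetical implementation, showing that the macro really yields a virtual method.
struct ExampleCounter : IExampleCounter {
    STDMETHOD(GetCount)(int* count) override {
        *count = 3;
        return 0;
    }
};
```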
You’ll have a hard time guessing what the above line means if you don’t know what STDMETHOD is. Since the tag parser doesn’t follow through #include directives and doesn’t perform SQL lookups into the symbol database*, it cannot discover macro definitions by itself. Nevertheless, its macro state can be preconfigured with what we call a ‘hint file’. A hint file simply contains the definitions of macros that are needed for the tag parser to correctly recognize your source code in the presence of macros that fundamentally interfere with the C++ syntax.
If you have Beta1 installed, you will find a “cpp.hint” file in your Visual Studio 2010 install directory under VC\vcpackages. This is the hint file for the VC and SDK library headers. Very often the tag parser will do just fine with only this preset hint file. Nevertheless, if your code or some third party library code you are using contains macros that tamper with the C++ syntax, you may need to set up your own hint file. The IDE will look for files named “cpp.hint” in the directory where your source files are located and in all the parent directories up to the root directory or until a file named “cpp.stop” is found. All the hint files that are found will be preprocessed to build the macro context before your files actually get to be parsed. I won’t go into more details about hint files for now but feel free to ask questions and, by the way, they will be thoroughly documented on MSDN.
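The directory walk described above can be sketched as follows. This is a guess at the shape of the search, not the IDE's code: the function find_hint_files is hypothetical, the existence check is injected so the walk can be tried without touching the disk, and two details are assumptions that this post does not settle, namely that a directory containing “cpp.stop” still contributes its own “cpp.hint” and that results are listed nearest directory first. It assumes a C++17 compiler and a starting path without a trailing separator.

```cpp
#include <filesystem>
#include <functional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: collects "cpp.hint" candidates from `dir` upward,
// stopping at the root or at the first directory that contains "cpp.stop".
std::vector<fs::path> find_hint_files(
    fs::path dir,
    const std::function<bool(const fs::path&)>& exists =
        [](const fs::path& p) { return fs::exists(p); }) {
    std::vector<fs::path> hints;
    for (;;) {
        if (exists(dir / "cpp.hint")) hints.push_back(dir / "cpp.hint");
        if (exists(dir / "cpp.stop")) break;  // assumption: stop after this directory
        const fs::path parent = dir.parent_path();
        if (parent.empty() || parent == dir) break;  // reached the top
        dir = parent;
    }
    return hints;
}
```

With hint files in /proj, /proj/src and /proj/src/lib/ui, and a “cpp.stop” in /proj/src, a search starting at /proj/src/lib/ui would pick up the hint files in /proj/src/lib/ui and /proj/src and never look at /proj.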
Don’t worry too much if this machinery seems complex. Most of the time you won’t have to define your own hint files, or you’ll just need to drop a “cpp.hint” file with a few macro definitions in your project or solution directory.
In the future we are planning to work on tools that will help you decide where hint files are needed and possibly generate them for you. And we will also work on making the tag parser act smarter in the presence of macros so that fewer hints need to be added to a hint file.
*In theory the tag parser could query the database for macro definitions when additional information is needed to recognize or disambiguate a declaration, but a reliable implementation of symbol lookups (even if it were only for macros) would push the tag parser in the opposite direction of being light-weight, standalone, incremental and independent from project configurations.
This post was updated on March 25 2013 to remove broken links to missing images, and references in the blog post to those images.