October 23rd, 2018

Exploring Clang Tooling Part 2: Examining the Clang AST with clang-query

This post is part of a regular series of posts where the C++ product team and other guests answer questions we have received from customers. The questions can be about anything C++ related: MSVC toolset, the standard language and library, the C++ standards committee, isocpp.org, CppCon, etc.

Today’s post is by guest author Stephen Kelly, who is a developer at Havok, a contributor to Qt and CMake and a blogger. This post is part of a series where he is sharing his experience using Clang tooling in his current team.

In the last post, we created a new clang-tidy check following documented steps and encountered the first limitation in our own knowledge – how can we change both declarations and expressions such as function calls?

In order to create an effective refactoring tool, we need to understand the code generated by the create_new_check.py script and learn how to extend it.

Exploring C++ Code as C++ Code

When Clang processes C++, it creates an Abstract Syntax Tree representing the code. The AST needs to be able to represent all of the possible complexity that can appear in C++ code – variadic templates, lambdas, operator overloading, declarations of various kinds etc. If we can use the AST representation of the code in our tooling, we won’t be discarding any of the meaning of the code in the process, as we would if we limit ourselves to processing only text.

Our goal is to harness the complexity of the AST so that we can describe patterns in it, and then replace those patterns with new text. The Clang AST Matcher API and FixIt API satisfy those requirements respectively.

The level of complexity in the AST means that detailed knowledge is required in order to comprehend it. Even for an experienced C++ developer, the number of classes and how they relate to each other can be daunting. Luckily, there is a rhythm to it all. We can identify patterns, use tools to discover what makes up the Clang model of the C++ code, and get to the point of having an instinct about how to create a clang-tidy check quickly.

Exploring a Clang AST

Let’s dive in and create a simple piece of test code so we can examine the Clang AST for it:

 
int addTwo(int num) 
{ 
    return num + 2; 
} 

int main(int, char**) 
{ 
    return addTwo(3); 
} 

There are multiple ways to examine the Clang AST, but the most useful when creating AST Matcher based refactoring tools is clang-query. We need to build up our knowledge of AST matchers and the AST itself at the same time via clang-query.

So, let’s return to MyFirstCheck.cpp which we created in the last post. The MyFirstCheckCheck::registerMatchers method contains the following line:

 
Finder->addMatcher(functionDecl().bind("x"), this); 

The first argument to addMatcher is an AST matcher, an Embedded Domain Specific Language of sorts. This is a predicate language which clang-tidy uses to traverses the AST and create a set of resulting ‘bound nodes’. In the above case, a bound node with the name x is created for each function declaration in the AST. clang-tidy later calls MyFirstCheckCheck::check for each set of bound nodes in the result.

Let’s start clang-query passing our test file as a parameter and following it with two dashes. Similar to use of clang-tidy in Part 1, this allows us to specify compile options and avoid warnings about a missing compilation database.

This command drops us into an interactive interpreter which we can use to query the AST:

$ clang-query.exe testfile.cpp -- 

clang-query>

Type help for a full set of commands available in the interpreter. The first command we can examine is match, which we can abbreviate to m. Let’s paste in the matcher from MyFirstCheck.cpp:

clang-query> match functionDecl().bind("x") 

Match #1: 
 
testfile.cpp:1:1: note: "root" binds here 
int addTwo(int num) 
^~~~~~~~~~~~~~~~~~~ 
testfile.cpp:1:1: note: "x" binds here 
int addTwo(int num) 
^~~~~~~~~~~~~~~~~~~ 
 
Match #2: 
 
testfile.cpp:6:1: note: "root" binds here 
int main(int, char**) 
^~~~~~~~~~~~~~~~~~~~~ 
testfile.cpp:6:1: note: "x" binds here 
int main(int, char**) 
^~~~~~~~~~~~~~~~~~~~~ 
2 matches. 

clang-query automatically creates a binding for the root element in a matcher. This gets noisy when trying to match something specific, so it makes sense to turn that off if defining custom binding names:

clang-query> set bind-root false 
clang-query> m functionDecl().bind("x") 

Match #1: 

testfile.cpp:1:1: note: "x" binds here 
int addtwo(int num) 
^~~~~~~~~~~~~~~~~~~ 

Match #2: 

testfile.cpp:6:1: note: "x" binds here 
int main(int, char**) 
^~~~~~~~~~~~~~~~~~~~~ 
2 matches. 

So, we can see that for each function declaration that appeared in the translation unit, we get a resulting match. clang-tidy will later use these matches one at a time in the check method in MyFirstCheck.cpp to complete the refactoring.

Use quit to exit the clang-query interpreter. The interpreter must be restarted each time C++ code is changed in order for the new content to be matched.

Nesting matchers

The AST Matchers form a ‘predicate language’ where each matcher in the vocabulary is itself a predicate, and those predicates can be nested. The matchers fit into three broad categories as documented in the AST Matchers Reference.

functionDecl() is an AST Matcher which is invoked for each function declaration in the source code. In normal source code, there will be hundreds or thousands of results coming from external headers for such a simple matcher.

Let’s match only functions with a particular name:

clang-query> m functionDecl(hasName("addTwo")) 

Match #1: 

testfile.cpp:1:1: note: "root" binds here 
int addTwo(int num) 
^~~~~~~~~~~~~~~~~~~ 
1 match. 

This matcher will only trigger on function declarations which have the name “addTwo“. The middle column of the documentation indicates the name of each matcher, and the first column indicates the kind of matcher that it can be nested inside. The hasName documentation is not listed as being usable with the Matcher<FunctionDecl>, but instead with Matcher<NamedDecl>.

Here, a developer without prior experience with the Clang AST needs to learn that the FunctionDecl AST class inherits from the NamedDecl AST class (as well as DeclaratorDecl, ValueDecl and Decl). Matchers documented as usable with each of those classes can also work with a functionDecl() matcher. That familiarity with the inheritance structure of Clang AST classes is essential to proficiency with AST Matchers. The names of classes in the Clang AST correspond to “node matcher” names by making the first letter lower-case. In the case of class names with an abbreviation prefix CXX such as CXXMemberCallExpr, the entire prefix is lowercased to produce the matcher name cxxMemberCallExpr.

So, instead of matching function declarations, we can match on all named declarations in our source code. Ignoring some noise in the output, we get results for each function declaration and each parameter variable declaration:

clang-query> m namedDecl() 
... 
Match #8: 

testfile.cpp:1:1: note: "root" binds here 
int addTwo(int num) 
^~~~~~~~~~~~~~~~~~~ 

Match #9: 

testfile.cpp:1:12: note: "root" binds here 
int addTwo(int num) 
           ^~~~~~~ 

Match #10: 

testfile.cpp:6:1: note: "root" binds here 
int main(int, char**) 
^~~~~~~~~~~~~~~~~~~~~ 

Match #11: 

testfile.cpp:6:10: note: "root" binds here 
int main(int, char**) 
         ^~~ 

Match #12: 

testfile.cpp:6:15: note: "root" binds here 
int main(int, char**) 
              ^~~~~~

Parameter declarations are in the match results because they are represented by the ParmVarDecl class, which also inherits NamedDecl. We can match only parameter variable declarations by using the corresponding AST node matcher:

clang-query> m parmVarDecl() 

Match #1: 

testfile.cpp:1:12: note: "root" binds here 
int addTwo(int num) 
           ^~~~~~~ 

Match #2: 

testfile.cpp:6:10: note: "root" binds here 
int main(int, char**) 
         ^~~ 

Match #3: 

testfile.cpp:6:15: note: "root" binds here 
int main(int, char**) 
              ^~~~~~

clang-query has a code-completion feature, triggered by pressing TAB, which shows the matchers which can be used at any particular context. This feature is not enabled on Windows however.

Discovery Through Clang AST Dumps

clang-query gets most useful as a discovery tool when exploring deeper into the AST and dumping intermediate nodes.

Let’s query our testfile.cpp again, this time with the output set to dump:

clang-query> set output dump 
clang-query> m functionDecl(hasName(“addTwo”)) 

Match #1: 

Binding for "root": 
FunctionDecl 0x17a193726b8 <testfile.cpp:1:1, line:4:1> line:1:5 used addTwo 'int (int)' 
|-ParmVarDecl 0x17a193725f0 <col:12, col:16> col:16 used num 'int' 
`-CompoundStmt 0x17a19372840 <line:2:1, line:4:1> 
  `-ReturnStmt 0x17a19372828 <line:3:5, col:18>
      `-BinaryOperator 0x17a19372800 <col:12, col:18> 'int' '+' 
          |-ImplicitCastExpr 0x17a193727e8 <col:12> 'int' <LValueToRValue>
            | `-DeclRefExpr 0x17a19372798 <col:12> 'int' lvalue ParmVar 0x17a193725f0 'num' 'int' 
            `-IntegerLiteral 0x17a193727c0 <col:18> 'int' 2

There is a lot here to take in, and a lot of noise which is not relevant to what we are interested in to make a matcher, such as pointer addresses, the word used appearing inexplicably and other content whose structure is not obvious. For the sake of brevity in this blog post, I will elide such content in further listings of AST content.

The reported match has a FunctionDecl at the top level of a tree. Below that, we can see the ParmVarDecl nodes which we matched previously, and other nodes such as ReturnStmt. Each of these corresponds to a class name in the Clang AST, so it is useful to look them up to see what they inherit and know which matchers are relevant to their use.

The AST also contains source location and source range information, the latter denoted by angle brackets. While this detailed output is useful for exploring the AST, it is not as useful for exploring the source code. Diagnostic mode can be re-entered with set output diag for source code exploration. Unfortunately, both outputs (dump and diag) can not currently be enabled at once, so it is necessary to switch between them.

Tree Traversal

We can traverse this tree using the has() matcher:

clang-query> m functionDecl(has(compoundStmt(has(returnStmt(has(callExpr())))))) 

Match #1: 

Binding for "root": 
FunctionDecl <testfile.cpp:6:1, line:9:1> line:6:5 main 'int (int, char **)' 
|-ParmVarDecl <col:10> col:13 'int' 
|-ParmVarDecl <col:15, col:20> col:21 'char **' 
`-CompoundStmt <line:7:1, line:9:1> 
  `-ReturnStmt <line:8:5, col:20> 
      `-CallExpr <col:12, col:20> 'int' 
          |-ImplicitCastExpr <col:12> 'int (*)(int)'
            | `-DeclRefExpr <col:12> 'int (int)' 'addTwo'
            `-IntegerLiteral <col:19> 'int' 3      

With some distracting content removed, we can see that the AST dump contains some source ranges and source locations. The ranges are denoted by angle brackets, which have a beginning and possibly an end position. To avoid repeating the filename and the keywords line and col, only difference from the previously printed source location are printed. For example, <testfile.cpp:6:1, line:9:1> describes a span from line 6 column 1 in testfile.cpp to line 9 column 1 also in testfile.cpp. The range <col:15, col:20> describes the span from column 15 to column 20 in line 6 (from a few lines above) in testfile.cpp as that is the last filename printed.

Because each of the nested predicates match, the top-level functionDecl() matches and we get a binding for the result. We can additionally use a nested bind() call to add nodes to the result set:

clang-query> m functionDecl(has(compoundStmt(has(returnStmt(has(callExpr().bind("functionCall"))))))) 

Match #1: 

Binding for "functionCall": 
CallExpr <testfile.cpp:8:12, col:20> 'int' 
|-ImplicitCastExpr <col:12> 'int (*)(int)'
| `-DeclRefExpr <col:12> 'int (int)' 'addTwo'
`-IntegerLiteral <col:19> 'int' 3 

Binding for "root": 
FunctionDecl <testfile.cpp:6:1, line:9:1> line:6:5 main 'int (int, char **)' 
|-ParmVarDecl <col:10> col:13 'int' 
|-ParmVarDecl <col:15, col:20> col:21 'char **' 
`-CompoundStmt <line:7:1, line:9:1> 
  `-ReturnStmt <line:8:5, col:20> 
      `-CallExpr <col:12, col:20> 'int' 
          |-ImplicitCastExpr <col:12> 'int (*)(int)'
            | `-DeclRefExpr <col:12> 'int (int)' 'addTwo'
            `-IntegerLiteral <col:19> 'int' 3 

The hasDescendant() matcher can be used to match the same node as above in this case:

clang-query> m functionDecl(hasDescendant(callExpr().bind("functionCall")))

Note that over-use of the has() and hasDescendant() matchers – and their complements hasParent() and hasAncestor() – is usually an anti-pattern and can lead to unintended results, particularly while matching nested Expr subclasses in source code. Usually, higher-level matchers should be used instead. For example, while has() may be used to match a desired IntegerLiteral argument in the case above, it would not be possible to specify which argument we wish to match in a function which has multiple arguments. The hasArgument() matcher should be used in the case of callExpr() to resolve this issue, as it can specify which argument should be matched if there are multiple:

clang-query> m callExpr(hasArgument(0, integerLiteral()))

The above matcher will match on every function call whose zeroth argument is an integer literal.

Usually we want to use more narrowing criteria to only match on a particular category of matches. Most matchers accept multiple arguments and behave as though they have an implicit allOf() within them. So, we can write:

clang-query> m callExpr(hasArgument(0, integerLiteral()), callee(functionDecl(hasName("addTwo"))))

to match calls whose zeroth argument is an integer literal only if the function being called has the name “addTwo“.

A matcher expression can sometimes be obvious to read and understand, but harder to write or discover. The particular node types which may be matched can be discovered by examining the output of clang-query. However, the callee() matcher here may be difficult to independently discover because it did not appear to be referenced in the AST dumps from clang-query and it is only one matcher in the long list in the reference documentation. The code of the existing clang-tidy checks are educational both to discover matchers which are commonly used together, and to find a context where particular matchers should be used.

A nested matcher creating a binding in clang-query is another important discovery technique. If we have source code such as:

 
int add(int num1, int num2) 
{
  return num1 + num2; 
} 

int add(int num1, int num2, int num3) 
{
  return num1 + num2 + num3; 
} 

int main(int argc, char**) 
{ 
  int i = 42; 

  return add(argc, add(42, i), 4 * 7); 
}

and we intend to introduce a safe_int type to use instead of int in the signature of add. All existing uses of add must be ported to some new pattern of code.

The basic workflow with clang-query is that we must first identify source code which is exemplary of what we want to port and then determine how it is represented in the Clang AST. We will need to identify the locations of arguments to the add function and their AST types as a first step.

Let’s start with callExpr() again:

clang-query> m callExpr() 

Match #1: 

testfile.cpp:15:10: note: "root" binds here 
    return add(argc, add(42, i), 4 * 7); 
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

Match #2: 

testfile.cpp:15:20: note: "root" binds here 
    return add(argc, add(42, i), 4 * 7); 
                     ^~~~~~~~~~ 

This example uses various different arguments to the add function: the first argument is a parameter from a different function, then a return value of another call, then an inline multiplication. clang-query can help us discover how to match these constructs. Using the hasArgument() matcher we can bind to each of the three arguments, and using bind-root false for brevity:

clang-query> set bind-root false 
clang-query> m callExpr(hasArgument(0, expr().bind("a1")), hasArgument(1, expr().bind("a2")), hasArgument(2, expr().bind("a3"))) 

Match #1: 

testfile.cpp:15:14: note: "a1" binds here 
return add(argc, add(42, i), 4 * 7); 
           ^~~~ 

testfile.cpp:15:20: note: "a2" binds here 
return add(argc, add(42, i), 4 * 7); 
                 ^~~~~~~~~~ 

testfile.cpp:15:32: note: "a3" binds here 
return add(argc, add(42, i), 4 * 7); 
                             ^~~~~

Changing the output to dump and re-running the same matcher:

clang-query> set output dump 
clang-query> m callExpr(hasArgument(0, expr().bind("a1")), hasArgument(1, expr().bind("a2")), hasArgument(2, expr().bind("a3"))) 

Match #1: 

Binding for "a1": 
DeclRefExpr <testfile.cpp:15:14> 'int' 'argc'

Binding for "a2": 
CallExpr <testfile.cpp:15:20, col:29> 'int' 
|-ImplicitCastExpr <col:20> 'int (*)(int, int)' 
| `-DeclRefExpr <col:20> 'int (int, int)' 'add' 
|-IntegerLiteral <col:24> 'int' 42 
`-ImplicitCastExpr <col:28> 'int' 
  `-DeclRefExpr <col:28> 'int' 'i' 

Binding for "a3": 
BinaryOperator <testfile.cpp:15:32, col:36> 'int' '*' 
|-IntegerLiteral <col:32> 'int' 4 
`-IntegerLiteral <col:36> 'int' 7 

We can see that the top-level AST nodes of the arguments are DeclRefExpr, CallExpr and BinaryOperator respectively. When implementing our refactoring tool, we might want to wrap the argc as safe_int(argc), ignore the nested add() call, as its return type will be changed to safe_int, and change the BinaryOperator to some safe operation.

As we learn about the AST we are examining, we can also replace the expr() with something more specific to explore further. Because we now know the second argument is a CallExpr, we can use a callExpr() matcher to check the callee. The callee() matcher only works if we specify callExpr() instead of expr():

clang-query> m callExpr(hasArgument(1, callExpr(callee(functionDecl().bind("func"))).bind("a2"))) 

Match #1: 

Binding for "a2": 
CallExpr <testfile.cpp:15:20, col:29> 'int' 
|-ImplicitCastExpr <col:20> 'int (*)(int, int)'
| `-DeclRefExpr <col:20> 'int (int, int)' 'add'
|-IntegerLiteral <col:24> 'int' 42 
`-ImplicitCastExpr <col:28> 'int' 
  `-DeclRefExpr <col:28> 'int' 'i' 

Binding for "func": 
FunctionDecl <testfile.cpp:1:1, line:4:1> line:1:5 add 'int (int, int)' 
... etc 

1 match. 
clang-query> set output diag 
clang-query> m callExpr(hasArgument(1, callExpr(callee(functionDecl().bind("func"))).bind("a2"))) 

Match #1: 

testfile.cpp:15:20: note: "a2" binds here 
return add(argc, add(42, i), 4 * 7); 
                 ^~~~~~~~~~ 

testfile.cpp:1:1: note: "func" binds here 
int add(int num1, int num2) 
^~~~~~~~~~~~~~~~~~~~~~~~~~~ 

Avoiding the Firehose

Usually when you need to examine the AST it will make sense to run clang-query on your real source code instead of a single-file demo. Starting off with a callExpr() matcher will result in a firehose problem – there will be tens of thousands of results and you will not be able to determine how to make your matcher more specific for the lines of source code you are interested in. Several tricks can come to your aid in this case.

First, you can use isExpansionInMainFile() to limit the matches to only the main file, excluding all results from headers. That matcher can be used with Exprs, Stmts and Decls, so it is useful for everything you might want to start matching.

Second, if you still get too many results from your matcher, the has Ancestor matcher can be used to limit the results further.

Third, often particular names of variables can anchor your match to some particular piece of code of interest.

Exploring the AST of code such as

 
void myFuncName() 
{ 
  int i = someFunc() + Point(4, 5).translateX(9);   
} 

might start with a matcher which anchors to the name of the variable, the function it is in and the location in the main file:

varDecl(isExpansionInMainFile(), hasAncestor(functionDecl(hasName("myFuncName"))), hasName("i"))

This starting point will make it possible to explore how the rest of the line is represented in the AST without being drowned in noise.

Conclusion

clang-query is an essential asset while developing a refactoring tool with AST Matchers. It is a prototyping and discovery tool, whose input can be pasted into the implementation of a new clang-tidy check.

In this blog post, we explored the basic use of the clang-query tool – nesting matchers and binding their results – and how the output corresponds to the AST Matcher Reference. We also saw how to limit the scope of matches to enable easy creation of matchers in real code.

In the next blog post, we will explore the corresponding consumer of AST matcher results. This will be the actual re-writing of the source code corresponding to the patterns we have identified as refactoring targets.

Which AST Matchers do you think will be most useful in your code? Let us know in the comments below or contact the author directly via e-mail at stkelly@microsoft.com, or on Twitter @steveire.

I will be showing even more new and future developments in clang-query at code::dive in November. Make sure to put it in your calendar if you are attending!

Author

0 comments

Discussion are closed.