Exploring Clang Tooling Part 3: Rewriting Code with clang-tidy

Stephen

In the previous post in this series, we used clang-query to examine the Abstract Syntax Tree of a simple source code file. Using clang-query, we can prototype an AST Matcher which we can use in a clang-tidy check to refactor code in bulk.

This time, we will complete the rewriting of the source code.

Let’s return to MyFirstCheck.cpp we generated earlier and update the registerMatchers method. First we can refactor it to port both function declarations and function calls, using the callExpr() and callee() matchers we used in the previous post:

Because Matchers are really C++ code, we can extract them into variables and compose them into multiple other Matchers, as done here with nonAwesomeFunction.

In this case, I have narrowed the declaration matcher to match only on function declarations which do not start with awesome_. That matcher is then used once with a binder addAwesomePrefix, then again to specify the callee() of a callExpr(), again binding the relevant expression to the name addAwesomePrefix.
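The idea of extracting a matcher into a variable and reusing it can be illustrated without Clang at all. The following is only a toy model, assuming matchers are plain predicates over hypothetical `Decl` and `Call` structs; the helper names `nameStartsWith`, `unless` and `callee` are stand-ins chosen to echo the matcher vocabulary, not the real AST Matcher API.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy stand-ins for AST nodes (hypothetical, not Clang classes).
struct Decl { std::string name; };
struct Call { const Decl *callee; };

using DeclMatcher = std::function<bool(const Decl &)>;
using CallMatcher = std::function<bool(const Call &)>;

// Matches declarations whose name begins with the given prefix.
DeclMatcher nameStartsWith(std::string prefix) {
  return [prefix](const Decl &d) {
    return d.name.compare(0, prefix.size(), prefix) == 0;
  };
}

// Inverts another declaration matcher.
DeclMatcher unless(DeclMatcher inner) {
  return [inner](const Decl &d) { return !inner(d); };
}

// Lifts a declaration matcher so it applies to the callee of a call.
CallMatcher callee(DeclMatcher inner) {
  assert(inner && "callee() needs a non-empty matcher");
  return [inner](const Call &c) { return c.callee && inner(*c.callee); };
}

// The declaration matcher is defined once and reused in both places.
const DeclMatcher nonAwesomeFunction = unless(nameStartsWith("awesome_"));
const CallMatcher callToNonAwesomeFunction = callee(nonAwesomeFunction);
```

Changing the definition of `nonAwesomeFunction` in this sketch changes both what counts as a matching declaration and what counts as a matching call, which is the maintainability benefit of centralizing it.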

Because large-scale refactoring often involves primarily changing particular expressions, it generally makes sense to define the matchers for the declarations separately from the matchers for the expressions referencing those declarations. In my experience, the matchers for declarations can get complicated, for example with exclusions due to limitations of a reflection system, or with more specifics about functions with particular return types or argument types. Centralizing those cases helps keep your refactoring code maintainable.

Another change I have made is that I renamed the binding from x to addAwesomePrefix. This is notable because it uses verbs to describe what should be done with the matches. It should be clear from reading matcher bindings what the result of invoking the fix is to be. Binding names can then be seen as a weakly-typed string-based language interface between the matcher and the replacement code.
We can then implement MyFirstCheckCheck::check to consume the bindings. A first approximation might look like:

Perhaps a better implementation would reduce the duplication of the diagnostic code:

Because the FunctionDecl and the CallExpr do not share an inheritance hierarchy, we need separate casting conditions for each. Even if they did share an inheritance hierarchy, we would still need to call getLocation in one case and getExprLoc in the other, because Clang records many relevant locations for each AST node. The developer of the clang-tidy check needs to know which location accessor method is appropriate or required for each situation.
A further improvement is to change the casts to accept the relevant base types of FunctionDecl and CallExpr: NamedDecl and Expr respectively.
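The "one binding name, two unrelated node types" situation can be sketched in plain C++. This is a toy model only: `ToyFunctionDecl`, `ToyCallExpr` and `ToyBoundNodes` are hypothetical stand-ins, and locations are simple character offsets rather than Clang SourceLocations. It shows why each type needs its own cast and its own location accessor.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>

// Toy nodes: each records where its name starts, under a different accessor.
struct ToyFunctionDecl {
  unsigned nameOffset;
  unsigned getLocation() const { return nameOffset; }
};
struct ToyCallExpr {
  unsigned calleeOffset;
  unsigned getExprLoc() const { return calleeOffset; }
};

using ToyNode = std::variant<ToyFunctionDecl, ToyCallExpr>;

// Toy map from binding name to node, with a typed lookup that
// yields nullptr when the name is unbound or the type does not match.
class ToyBoundNodes {
public:
  void bind(const std::string &id, ToyNode node) {
    assert(!id.empty() && "binding names must be non-empty");
    nodes_.insert_or_assign(id, node);
  }
  template <typename T> const T *getNodeAs(const std::string &id) const {
    auto it = nodes_.find(id);
    return it == nodes_.end() ? nullptr : std::get_if<T>(&it->second);
  }

private:
  std::map<std::string, ToyNode> nodes_;
};

// Returns the offset at which the prefix would be inserted, or -1 if
// nothing usable is bound to "addAwesomePrefix".
long insertionOffset(const ToyBoundNodes &nodes) {
  if (const auto *decl = nodes.getNodeAs<ToyFunctionDecl>("addAwesomePrefix"))
    return decl->getLocation();
  if (const auto *call = nodes.getNodeAs<ToyCallExpr>("addAwesomePrefix"))
    return call->getExprLoc();
  return -1;
}
```

The string `"addAwesomePrefix"` is the only thing the producer and the consumer of the binding agree on, so a typo or a node of an unexpected type is detected only at run time.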

This change reinforces the idea that the names of bound nodes form a weakly-typed interface between the Matcher code and the Rewriter code. Because the Rewriter code now expects the addAwesomePrefix binding to be used with the base types NamedDecl and Expr, other Matcher code can take advantage of that. We can now re-use the addAwesomePrefix binding name to add a prefix to field declarations or member expressions, for example, because their corresponding Clang AST classes also inherit NamedDecl and Expr respectively:

Notice that this code is comparable to the matchers we wrote for the functionDecl/callExpr pairing. Taking advantage of the binding name interface, we can continue extending our matcher code to port variable declarations without changing the rewriter side of that interface:

Location Location Location

Let’s return to the check implementation and examine it. This method is responsible for implementing the rewriting of the source code as described by the matchers and their bound nodes.
In this case, we have inserted code at the SourceLocation returned by either getLocation() or getExprLoc() of NamedDecl or Expr respectively. Clang AST classes have many methods returning SourceLocation which refer to various places in the source code related to particular AST nodes.
For example, the CallExpr has SourceLocation accessors getBeginLoc, getEndLoc and getExprLoc. It is currently difficult to discover how a particular position in the source code relates to a particular SourceLocation accessor.
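To make the effect of inserting text at a set of locations concrete, here is a small sketch that works on a plain string. It is an illustration under stated assumptions, not how clang-tidy applies fixes internally: locations are modeled as character offsets, and `Insertion` and `applyInsertions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One pending edit: insert `text` before the character at `offset`.
struct Insertion {
  std::size_t offset;
  std::string text;
};

// Applies all insertions to a copy of `src`. Edits are applied from the
// highest offset to the lowest so that earlier offsets stay valid.
std::string applyInsertions(std::string src, std::vector<Insertion> fixes) {
  std::sort(fixes.begin(), fixes.end(),
            [](const Insertion &a, const Insertion &b) {
              return a.offset > b.offset;
            });
  for (const Insertion &fix : fixes) {
    assert(fix.offset <= src.size());
    src.insert(fix.offset, fix.text);
  }
  return src;
}
```

For the input `void foo(); void bar() { foo(); }`, inserting `awesome_` at offsets 5, 17 and 25 (the starts of the two declared names and of the callee) prefixes both declarations and the call.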

clang::VarDecl represents variable declarations in the Clang AST. clang::ParmVarDecl inherits clang::VarDecl and represents parameter declarations. Notice that in all cases, end locations indicate the beginning of the last token, not the end of it. Note also that in the second example below, the source locations of the call used to initialize the variable are not part of the variable. It is necessary to traverse to the initialization expression to access those.

clang::FunctionDecl represents function declarations in the Clang AST. clang::CXXMethodDecl inherits clang::FunctionDecl and represents method declarations. Note that the location of the return type is not always given by getBeginLoc in C++.

clang::CallExpr represents function calls in the Clang AST. clang::CXXMemberCallExpr inherits clang::CallExpr and represents method calls. Note that when calling free functions (represented by a clang::CallExpr), the getExprLoc and the getBeginLoc will be the same. Always choose the semantically correct location accessor, rather than a location which appears to indicate the correct position.

It is important to know that locations on AST classes point to the start of tokens in all cases. This can be initially confusing when examining end locations. Sometimes to get to a desired location, it is necessary to use getLocWithOffset() to advance or retreat a SourceLocation. Advancing to the end of a token can be achieved with Lexer::getLocForEndOfToken.
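The difference between "start of the token" and "end of the token" can be modeled on a plain string. The sketch below is a toy analogue only: it uses character offsets instead of SourceLocations and a deliberately naive notion of a token (a run of identifier characters, or a single other character). `endOfToken` and `withOffset` are hypothetical helpers, not Clang functions.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// True for characters that can appear inside an identifier or number.
bool isWordChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Given the offset of the first character of a token, returns the offset
// just past its last character.
std::size_t endOfToken(const std::string &src, std::size_t start) {
  assert(start < src.size());
  if (!isWordChar(src[start]))
    return start + 1;  // punctuation is treated as a one-character token
  std::size_t end = start;
  while (end < src.size() && isWordChar(src[end]))
    ++end;
  return end;
}

// Moves a location forwards (positive) or backwards (negative).
std::size_t withOffset(std::size_t loc, long offset) {
  return static_cast<std::size_t>(static_cast<long>(loc) + offset);
}
```

In `int r = awesome_foo(1);` the name token starts at offset 8. A location pointing at that token says nothing about its length; `endOfToken` has to scan forward to find offset 19, while `withOffset(8, 8)` lands in the middle of the token, at the `f` of `foo`.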

The source code locations of arguments to the function call are not accessible from the CallExpr, but must be accessed via AST nodes for the arguments themselves.

Every AST node has accessors getBeginLoc and getEndLoc. Expression nodes additionally have a getExprLoc, and declaration nodes have an additional getLocation accessor. More-specific subclasses have more-specific accessors for locations relevant to the C++ construct they represent. Source code locations in Clang are comprehensive, but accessing them can get complex as requirements become more advanced. A future blog post may explore this topic in more detail if there is interest among the readership.

Once we have acquired the locations we are interested in, we need to insert, remove or replace source code fragments at those locations.

Let’s return to MyFirstCheck.cpp:

diag is a method on the ClangTidyCheck base class. Its purpose is to issue diagnostics and messages to the user. It can be called with just a source location and a message, causing a diagnostic to be emitted at the specified location:

Resulting in:

The diag method returns a DiagnosticBuilder to which we can stream fix suggestions using FixItHint.

The CreateRemoval method creates a FixIt for removal of a range of source code. At its heart, a SourceRange is just a pair of SourceLocations. If we wanted to remove the awesome_ prefix from functions which have it, we might expect to write something like this:

The matcher part of this code is fine, but when we run clang-tidy, we find that the removal is applied to the entire function name, not only the awesome_ prefix. The problem is that Clang extends the end of the removal range to the end of the token pointed to by the end. This is symmetric with the fact that AST nodes have getEndLoc() methods which point to the start of the last token. Usually, the intent is to remove or replace entire tokens.

To make a replacement or removal in source code which extends into the middle of a token, we need to indicate that we are replacing a range of characters instead of a range of tokens, using CharSourceRange::getCharRange:
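The two kinds of range can be contrasted with a toy model on a plain string. This is a sketch under stated assumptions rather than Clang code: offsets stand in for SourceLocations, a token is taken to be a run of identifier characters, and `removeTokenRange` and `removeCharRange` are hypothetical helpers that mimic the behavior described above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// True for characters that can appear inside an identifier.
bool isIdentChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Token range: `end` points at (or into) the last token to remove, and the
// removal is extended up to the end of that token.
std::string removeTokenRange(std::string src, std::size_t begin,
                             std::size_t end) {
  assert(begin <= end && end < src.size());
  std::size_t stop = end;
  while (stop < src.size() && isIdentChar(src[stop]))
    ++stop;
  return src.erase(begin, stop - begin);
}

// Character range: removes exactly the half-open range [begin, end).
std::string removeCharRange(std::string src, std::size_t begin,
                            std::size_t end) {
  assert(begin <= end && end <= src.size());
  return src.erase(begin, end - begin);
}
```

With `void awesome_foo();`, the name starts at offset 5 and the prefix ends at offset 13. The token-range version removes the whole name and yields `void ();`, while the character-range version removes only the prefix and yields `void foo();`.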

Conclusion

This concludes the mini-series about writing clang-tidy checks. This series has been an experiment to gauge interest, and there is a lot more content to cover in further posts if there is interest among the readership.

Further posts could cover topics that occur in the real world, such as:

  • Creation of compile databases
  • Creating a stand-alone buildsystem for clang-tidy checks
  • Understanding and exploring source locations
  • Completing more-complex tasks
  • Extending the matcher system with custom matchers
  • Testing refactorings
  • More tips and tricks from the trenches

This would cover everything you need to know in order to quickly and effectively create and use custom refactoring tools on your codebase.

Do you want to see more? Let us know in the comments below or contact the author directly via e-mail at stkelly@microsoft.com, or on Twitter @steveire.

I will be showing even more new and future developments in clang-query and clang-tidy at code::dive tomorrow, including many of the items listed as future topics above. Make sure to schedule it in your calendar if you are attending code::dive!

Stephen Kelly

1 Comment
Andreas Düring 2019-06-05 07:09:23
Thank you very much for this series. It helped me in writing a replacer (atol to strtol and similar).