Preserving comments when parsing and formatting code
How Gazelle keeps comments when modifying BUILD files
I've been tinkering with a new toy language this week, and I wanted to replicate a feature I like from some tools in the Bazel ecosystem. Gazelle, buildozer, and buildifier have the ability to preserve comments when formatting a BUILD
file, even after significant programmatic modifications. Gazelle can also easily read comments when walking the syntax tree, which is important for processing # keep
comments and directives.
In this article, I'll compare Go's parser and formatter with the library used by the Bazel tools. Both libraries can preserve comments, but the Bazel library does it better.
How go/ast
preserves comments
Of course any library that can format parsed code can preserve comments. No one would use a formatter that didn't. There are different ways to implement it though. Let's start by looking at the Go standard library package representing the Go syntax tree.
Here's one of the abstract syntax tree (AST) node types, AssignStmt
. I chose this type arbitrarily; most of them look similar.
type AssignStmt struct {
Lhs []Expr
TokPos token.Pos // position of Tok
Tok token.Token // assignment token, DEFINE
Rhs []Expr
}
This type has no place to record comments. Where are they? Here's the top-level type, File
:
type File struct {
Doc *CommentGroup // associated documentation; or nil
Package token.Pos // position of "package" keyword
Name *Ident // package name
Decls []Decl // top-level declarations; or nil
FileStart, FileEnd token.Pos // start and end of entire file
Scope *Scope // package scope (this file only)
Imports []*ImportSpec // imports in this file
Unresolved []*Ident // unresolved identifiers in this file
Comments []*CommentGroup // list of all comments in the source file
GoVersion string // minimum Go version required by //go:build or // +build directives
}
This has a Comments
field, which is a []*CommentGroup
. What is that?
type CommentGroup struct {
List []*Comment // len(List) > 0
}
type Comment struct {
Slash token.Pos // position of "/" starting the comment
Text string // comment text (excluding '\n' for //-style comments)
}
All the comments are stored in a list attached to the root node of the AST. Each Comment
records its original position in the file. This approach causes two significant problems for refactoring tools.
First, when walking the AST, it's difficult to find the comments related to any particular node in the tree. Some node types like FuncDecl
do have comments, which is helpful for documentation generation, but most node types lack comments.
Second, when the formatter converts the AST into text, it preserves comments by iterating over the AST and the comment list together. Both AST nodes and comments have positions, so before the formatter writes an AST node, it writes comments with earlier positions. This makes difficult to build a refactoring tool that modifies the AST, adding something in the middle of a file. It throws off all the written offsets. It's also hard to programmatically add comments for the same reason.
How the Bazel tools preserve comments
The Bazel tools I mentioned all use the same library to parse and format Starlark files: github.com/bazelbuild/buildtools/build
. This package takes a very different approach. Let's look again at one of the AST node types, AssignExpr
:
type AssignExpr struct {
Comments
LHS Expr
OpPos Position
Op string
LineBreak bool // insert line break between Op and RHS
RHS Expr
}
type Comments struct {
Before []Comment // whole-line comments before this expression
Suffix []Comment // end-of-line comments after this expression
// For top-level expressions only, After lists whole-line
// comments following the expression.
After []Comment
}
type Comment struct {
Start Position
Token string // without trailing newline
}
Every AST node type embeds the Comments
struct type to store comments. This allows programs like Gazelle to walk the AST and read comments in the context they were written instead of searching through a separate list. This is particularly important for Gazelle because it needs to recognize # keep
comments that tell it not to modify specific expressions. Each comment has a position, but that's not particularly important when formatting a file: the formatter can print comments around associated AST nodes easily. This is wonderful if you're adding comments or editing the AST, which is Gazelle's main purpose.
So how does the parser actually read the comments and attach them to the right AST nodes? The important details are in input.assignComments
.
Like most other parsers, the parser in build
is written in two phases: lexical analysis (splitting input text into tokens), and syntax analysis (building an AST from those tokens). The lexer identifies comments and saves them in two lists: line comments that appear by themselves on a line, and suffix comments that appear after other tokens. Comments are usually not included in the token stream, but the lexer does emit _COMMENT
tokens for top-level comments at the beginning of the file. These are handled by the parser as if they were part of the grammar (even though they are technically not).
In most cases though, comments are not part of token stream, so the parser can build the AST without considering them. Comments are attached to the correct AST nodes only after the whole tree is built. This is the clever part. The parser builds two lists of all AST nodes by walking the tree: one from a pre-order traversal (earlier, outer-most nodes come first), the other from a post-order traversal (later, outer-most nodes come last). The parser iterates forward over the pre-order list and attaches line comments to the outer-most nodes that they appear before. Afterward, it iterates backward over the post-order list and attaches suffix comments to the outer-most nodes that they appear after.
To understand this, let's work out a somewhat contrived example:
foo(
# line
a + b) # suffix
The AST looks like this:
call
/ \
foo +
/ \
a b
When we traverse the tree, we get these two lists:
pre-order: call, foo, +, a, b
post-order: foo, a, b, +, call
Each node in these lists has a known start and end position, as do the comments, so we can easily figure out where to attach the comments. For the # line
comment, the first node in the pre-order list with a later start position is the +
, so that's where we attach it. That's what we'd intuitively expect. For the # suffix
comment, the last node with a later end position is the call
node, which is also what we'd expect.
Let's modify the example a little bit:
foo(
# line
a + b, # suffix
)
The # suffix
comment now starts before the call
AST node ends, which means it gets attached to the +
AST node, again, just like we'd expect.
I hope this article is useful to someone. I thought this was clever. It makes refactoring tools a bit easier to write, which is a win in my book.