Structure of the Gypsum compiler (part 1)

Published on 2014-07-20
Tagged: compilers gypsum

Two weeks ago, I introduced Gypsum, a new experimental programming language I've been working on as a side project. Gypsum is a statically-typed, object-oriented language. I'm working on making it functional, too. Check it out over at GitHub!

Today, I'd like to describe how the Gypsum compiler itself works. This will be the first article in a series of three.

Introduction

Gypsum is an experimental language, so the compiler is designed to be very flexible, easy to change and extend. The nice thing about side projects is that you can spend some extra time making sure the code is clean and elegant. You don't have to take on any technical debt to meet deadlines.

I didn't intend this from the beginning though. My initial plan was to build an interpreter in Python as fast as possible, then build the real compiler on top of that. It quickly became clear though that this approach wouldn't scale and I wouldn't be able to experiment with language features I wasn't sure about. So I cleaned up the code and added lots of unit tests. I stuck with Python though; it's probably not the best language for writing a compiler, but it does a surprisingly good job.

Phases of the compiler

The compiler is divided into ten phases. Today, we'll cover the first three.

Lexical analysis

Lexical analysis (lexing) is the process of translating raw characters from source code into tokens: identifiers, keywords, operators, numbers, strings, and such. In the Gypsum compiler, each token consists of a tag (determines how the token is interpreted by the parser), a chunk of text (cut from the source code), and a location in source (used for error reporting later). The output of the lexer is a list of these tokens.

The lexing process is pretty simple. Basically, we have a table of regular expressions and tags, like this:

__expressions = [
  # Whitespace
  (r"\r?\n", NEWLINE),
  (r"[\t ]+", SPACE),
  (r"//[^\r\n]*", COMMENT),
  ...
  # Keywords
  (r"if", RESERVED),
  (r"else", RESERVED),
  (r"while", RESERVED),
  (r"break", RESERVED),
  ...
  # Literals
  (r"[+-]?[0-9]+(?:i[0-9]+)?", INTEGER),
  (r"[+-]?[0-9]+\.[0-9]*(?:[Ee][+-]?[0-9]+)?(?:f[0-9]+)?", FLOAT),
  (r'"(?:\\"|[^"])*"', STRING),
  ...
  # Identifiers
  (r"[!#%&*+\-/:<=>?@\\^|~]+", OPERATOR),
  (r"[A-Za-z_][A-Za-z0-9_-]*", SYMBOL),
]

We start with a "cursor" at the beginning of the source code. We try to match each regular expression in the table. The longest successful match wins, and we emit a token with the tag and matched text and advance the cursor to the end of the matched text. If there were multiple matches with the same length, the first matching entry in the table is used. If there was no match, we report an error. We keep matching and emitting tokens until we reach the end of the file.

I've written about lexing previously in more detail. If you're interested in finding out more, check out the first part of my tutorial on building an interpreter.

Layout analysis

Layout analysis modifies the token stream. It consumes whitespace tokens and injects implicit braces ({, }) and semicolons (;). This is what makes whitespace significant in Gypsum. The parser is not aware of whitespace at all. It expects all blocks to be delimited by braces and all definitions and statements to be terminated with semicolons. Layout analysis inserts this punctuation for you automatically, so you almost never have to write it out.

Here's an example of tokens inserted by layout analysis. Implicit tokens are colored.

class Foo {
  var x = 0;
  def get = x;
};

def factorial(n: i64) = {
  if (n < 1) {
    1;
  } else {
    n * factorial(n - 1);
  };
};

The layout analysis algorithm is kind of complicated. Basically, we track the amount of indentation at the beginning of each line. This is counted as a number of tabs followed by a number of spaces; mixed tabs and spaces are not allowed. Blank lines and comments are ignored. Then:

Note that the inserted tokens are marked as being implicit. The parser will only accept { and } delimiters if they are both explicit or both implicit. So you can't write out { and have the matching } inserted automatically. It would be confusing for anyone reading the code if this were allowed.

Some commentary: I'm not entirely sure I like layout analysis in its current form. I definitely want Gypsum to have significant whitespace because it enforces readability, and I hate typing out extra punctuation which doesn't add meaning. The current rules try to be intuitive and don't require you to mark continuations with \ or blocks with :, as Python does. However, I feel like programmers should be able to easily understand how the compiler interprets their programs, and these rules pretty complicated; for example, I completely glossed over the pattern matching above. I think the right approach will become clear over time, and this can be easily changed in the future.

Syntax analysis

Syntax analysis (parsing) consumes the list of tokens and produces an abstract syntax tree (AST). The AST describes the structure of the source code. It contains a node for each syntactic element of a source file. At the root of the tree is an AstModule node, which represents the whole file. The leafs usually correspond to individual tokens; for example, an AstIntegerLiteral node represents an INTEGER token). The nodes in between represent expressions, types, and other language constructs. Nearly all later stages of the compiler make use of the AST, so this stage is pretty important.

There are several syntactic elements in Gypsum:

To implement the parser, I reused a parser combinator library I originally wrote for the IMP interpreter tutorial. Basically, each parser is a function (actually an object with a __call__ method and some handy operators) which implements some part of the grammar by reading some tokens and producing a value. We start with a bunch of simple parsers that can process the most basic parts of the grammar like keywords, symbols, and literals. Then we use combinators to build more sophisticated parsers out of those. Combinators are higher order functions which build more advanced parsers out of basic ones.

I wrote an article on how to use parser combinators earlier, so I won't go into any more detail here. I will give some examples though.

Here are a couple basic parsers:

symbol = Tag(SYMBOL)
operator = Tag(OPERATOR)

def keyword(kw):
    return Reserved(RESERVED, kw)

There are some operators which let us conveniently invoke combinators. These are pretty intuitive. + is concatenation, and | is alternation. ^ lets us call a given function to produce a value from the matched tokens.

def booleanLiteral():
    return (keyword("true") ^ (lambda p: AstBooleanLiteral(True))) | \
           (keyword("false") ^ (lambda p: AstBooleanLiteral(False)))

def statement():
    return definition() | \
           ((expression() + semi) ^ (lambda parsed: parsed[0]))

def returnExpr():
    def process(parsed):
        (_, e) = parsed
        return AstReturnExpression(e)
    return keyword("return") + Opt(Lazy(expression)) ^ process

You can see some of the other combinators used up there, too. Opt makes its parser optional (returning None if it didn't match). Lazy helps resolve circular references between parsers.

Combinators also make it easy to build lists:

def blockExpr():
    return layoutBlock(Rep(Lazy(statement)) ^ \
           (lambda stmts: AstBlockExpression(stmts)))

def parameters():
    def process(parsed):
        return untangle(parsed)[1] if parsed else []
    return Opt(keyword("(") + \
           Rep1Sep(parameter(), keyword(",")) + \
           keyword(")")) ^ process

In the examples above, Rep runs its parser zero or more times, returning a list of the results. Rep1Sep matches one or more times and requires a separator between each match.

To be continued...

In this article, we covered how the grammar of Gypsum is implemented. Pretty much every compiler or interpreter (or really any program that reads formatted text) has a lexer and parser which fill the same roles as the Gypsum lexer and parser. There are a lot of different ways to implement these components. In my opinion, the approach used in Gypsum is the easiest and most flexible, but it does not give the best performance. We can't always have it all, unfortunately.

Next time, we'll get into the more interesting stages: declaration analysis, inheritance analysis, and type analysis.