Structure of the Gypsum compiler (part 3)

Published on 2014-09-27
Tagged: compilers gypsum

This article is part of the series "Structure of the Gypsum compiler".

Structure of the Gypsum compiler (part 1)
Structure of the Gypsum compiler (part 2)
Structure of the Gypsum compiler (part 3)

In case you missed my last two articles, I'm building a new programming language called Gypsum and a VM to run it called CodeSwitch. You can find both on GitHub. This is the last article in a series of three on how the Gypsum compiler works.

In the previous article, I discussed the intermediate representation, declaration analysis, inheritance analysis, and type / use analysis. Before that, I covered lexing, parsing, and layout analysis. This time, I'll cover the four remaining phases of the compiler:

Lexical analysis
Layout analysis
Syntax analysis
Declaration analysis
Inheritance analysis
Type analysis
Closure conversion
Class flattening
Semantic analysis
Serialization

Closure conversion

In the previous article, I mentioned that the Gypsum compiler's intermediate representation was a flat format. There are no functions within functions, or classes within classes; just primary definitions (functions, classes, globals, and type parameters) and secondary definitions used to build them (fields, variables). The Gypsum language supports nested definitions though. Here's a simple example:

def test-counter(n: i64, inc: i64) =
  def counter =
    var value = n
    n += inc
    value
  print(counter.to-string + "\n")
  print(counter.to-string + "\n")
  print(counter.to-string + "\n")

The function test-counter declares another function, counter, which returns a new value every time you call it. test-counter takes two parameters, n, a starting value, and inc, an amount to increment the counter by. Both of these parameters are referenced by counter. Since they are not defined in counter's scope, we say that these variables are captured, and counter is a closure.

The compiler flattens these definitions by converting them into something like the following:

class test-counter-context(n: i64, inc: i64)

class counter-closure(ctx: test-counter-context)
  def apply =
    var value = ctx.n
    ctx.n += ctx.inc
    value

def test-counter(n: i64, inc: i64) =
  var ctx = text-counter-context(n, inc)
  var counter = counter-closure(ctx)
  print(counter.apply.to-string + "\n")
  print(counter.apply.to-string + "\n")
  print(counter.apply.to-string + "\n")

Note: Gypsum doesn't yet support returning a function as a value. When it does, I'll amend this article with a better example where counter is returned and called elsewhere.

Some additional terminology is valuable here to understand what's happening.

A scope is an area of source code that may contain definitions and other inner scopes. Definitions may be accessed without qualification within the scope where they are defined and in inner scopes. Scopes belong to functions and classes. There's also a global scope, but that doesn't matter here.
A context object (or just context) contains definitions which are accessed in an inner scope.

For functions which contain captured definitions, a context object is constructed at the beginning of the function every time it is called.
For classes, any instance of the class acts as a context object, so no separate object is created.

A context class describes context objects.

For functions, context classes are generated automatically. They just have fields for captured variables.
Classes act as their own context classes.

A closure scope is a scope which contains references to definitions in outer scopes. These definitions are captured definitions. The scopes they are defined in are captured scopes.
A closure object (or just closure) contains a reference to a context object for each captured scope. This is how we can reference captured definitions.
A closure class describes closure objects.

For functions, closure classes are generated automatically, and a closure object is created when the function is declared. The closure object acts as the value of the function. The original function definition becomes a method of the closure class.
Class closures aren't supported yet, but will be soon.

In the example above, both n and inc are captured, since they are defined in test-counter and used in counter. We express this explicitly in the IR. First, the compiler auto-generates a context class for test-counter. In the example, I called it test-counter-context, but in reality, it doesn't have a human-friendly name. Fields are created for both variables. Second, the compiler auto-generates a closure class for counter with a field for a test-counter-context object. Later, during semantic analysis, code is generated to create a context object and a closure object at the beginning of test-context. Whenever n and inc are accessed, their fields are loaded from or stored to a test-counter-context object. The actual parameters are only read once to construct the context.

Closure conversion is the phase in the Gypsum compiler that makes all this happen. During use analysis (part of the previous phase), we annotate all symbolic references in the abstract syntax tree (AST). Closure conversion just iterates over these annotations, searching for references to definitions in outer scopes. When we find a definition that should be captured, we create a context class for the outer (defining) scope (if one doesn't already exist), and we create a closure class for the inner (using) scope (again, if one doesn't already exist). We add the context class to a list of scopes to be captured by the closure scope. If there are any scopes in between the inner and outer scopes, we handle them similarly, passing the context through every level. Finally, we convert the captured definition to a context-friendly form. For example, local variables will be converted into fields. We save information about every step so that the semantic analysis phase can generate the bytecode for constructing the context and closure objects and handle access to them.

Class flattening

Compared to closure conversion, the remaining phases are relatively simple.

In class flattening, we copy fields and methods from base classes into subclasses. At the end of the phase, each IR class has a complete list of fields and methods. This is necessary because bytecode accesses fields and methods by index. When we generate bytecode later, we need to know exactly where these definitions are within the class. Flattening is important for method overriding as well. A method in a subclass can replace a method in a base class, so every class could have a different method list.

There's nothing tricky about the implementation of this phase. We build a new inheritance graph and process classes in topological order. We can't reuse the graph from inheritance analysis, since we can introduce new classes to the IR during closure conversion. Topological order is important so we can avoid traversing every ancestor for each base class. Since we can guarantee we have already processed a class's ancestors, we just need to copy definitions from its immediate base.

CodeSwitch bytecode

Before we get into semantic analysis, let's stop and talk about CodeSwitch bytecode, which is what the Gypsum compiler generates. CodeSwitch is a stack-oriented virtual machine. This means that most instructions pop values from a stack, perform some computation, then push the result back onto the stack. For example, here's how we would evaluate a simple expression and store it into a variable:

// evaluate x = x * y + 4
ldlocal 0   // load x
ldlocal 1   // load y
muli64      // pop x, y, multiply them, push the product
i64     4   // push the integer 4
addi64      // pop the product and 4, add them, push the sum
stlocal 0   // pop the sum, store it into a local variable

You can see a full list of bytecode instructions in opcodes.yaml.

Since bytecode instructions need to be kept consistent in both Gypsum (written in Python) and CodeSwitch (written in C++), the list of opcodes is stored in a .yaml file, and code for them is auto-generated when needed.

Bytecode instructions are organized in basic blocks. A basic block is a linear sequence of instructions with one entry point (the first instruction) and one exit point (the last instruction). Function bodies are represented as a list of basic blocks in the IR (with the first block being the function's entry point). Here's a simple example of how we can count to 10 with a loop:

// var i = 0
// while (i < 10)
//   i += 1

0:
  i64      0  // push 0
  stlocal -1  // pop 0, store in x
  branch   1
1:            // loop condition
  ldlocal -1  // load x
  i64     10
  lti64       // i < 10?
  branchif 2, 3
2:            // loop body
  ldlocal -1  // load x
  i64      1
  addi64      // increment by 1
  stlocal -1  // store x
  branch   1  // go back to condition block
3:            // end of the loop
  ...

You may be wondering why I chose a stack-oriented representation instead of a register-oriented representation. Certainly this format is not a very good format for optimization or performance. However, stack-oriented compilers and interpreters are very easy to write. Keeping the format simple makes it easy for Gypsum (and other compilers in the future) to target CodeSwitch. Eventually, CodeSwitch will be able to translate the bytecode into its own IR, designed for optimization and native code generation. My vision is that compilers for many different languages will be written to target CodeSwitch. This will remove barriers for cross-language projects, since as far as CodeSwitch is concerned, it's all just bytecode. This is actually why I named the VM CodeSwitch: code switching is a linguistics term for when a person alternates between different languages in a single conversation.

Semantic analysis

In the semantic analysis phase, we generate bytecode instructions for each function. When most people think of compilers, they probably think of this phase. Surprisingly, this is one of the simpler phases, since we solved all the hard problems in earlier phases. At this point, we have skeletal definitions for everything in the program (created in declaration analysis), we have full type and use information for every AST node (from type analysis), and we have created classes and annotations to help us access captured definitions (closure conversion). Generating bytecode is all we have left to do.

Like other phases, semantic analysis is performed using the visitor pattern. Every kind of expression is implemented in a different method of CompileVisitor, which is called automatically as we traverse the graph.

Many expressions are treated as function calls by the compiler. This includes variable expressions (if they are nullary function calls), property expressions, this and super expressions, unary and binary operators, and, of course, call expressions. Regular function and method calls just generate callg (call global) and callv (call virtual) instructions. Operators for built-in types are an interesting case, since they frequently don't generate call instructions. Gypsum's built-in types are defined in builtins.yaml. In addition to built-in classes like Object and Exception, this file also describes primitive types like i64. Methods defined for these types may include a list of instructions for the compiler to emit instead of emitting a normal call instruction. For example, the + operator for the i64 type is implemented using an addi64 instruction. This isn't just limited to operators: calling the to-i32 method of i64 will cause the compiler to emit a trunci32 instruction.

Most expressions which don't involve function calls are control-flow expressions such as if, while, and try expressions. To implement these, the compiler creates new basic blocks wherever control flow converges or diverges and evaluates sub-expressions in the appropriate block. Here's an abbreviated and commented version of the implementation for the if expression:

def visitAstIfExpression(self, expr, mode):
  # Compile the condition, push on the stack
  self.visit(expr.condition, COMPILE_FOR_VALUE)

  # Create basic blocks.
  trueBlock = self.newBlock()
  falseBlock = self.newBlock()
  joinBlock = self.newBlock()

  # Branch to the true or false block, depending on the condition.
  self.branchif(trueBlock.id, falseBlock.id)

  # Compile the code on either side of the branch, and make sure
  # they meet up at the end.
  self.setCurrentBlock(trueBlock)
  self.visit(expr.trueExpr, mode)
  self.branch(joinBlock)

  self.setCurrentBlock(falseBlock)
  self.visit(expr.falseExpr, mode)
  self.branch(joinBlock)

  # Continue afterward.
  self.setCurrentBlock(joinBlock)

Serialization

After semantic analysis is done, we have a complete in-memory representation of a package. The only thing left to do is write it out to a file which can be consumed by CodeSwitch.

CodeSwitch uses a simple binary package format which matches the Gypsum IR. A package contains a list of strings, and lists of primary definitions (functions, classes, and type parameters). Globals aren't supported yet but will be soon. Primary definitions may refer to each other by index within these lists. Primary definitions may contain secondary definitions (just fields; variables are not written for now) and types which are not shared or referenced by other primary definitions. Most constants (instruction opcodes, builtin function ids, flags) are defined in yaml files which are shared with CodeSwitch.

Many bytecode instructions contain have immediate values. These are frequently small positive or negative integers used as indices (for example, which virtual method to call, which field to load). These numbers are written using a variable-byte encoding (abbreviated as VBN, loosely inspired by UTF-8). VBNs are 64-bit signed integers encoded using up to ten bytes. To write a number as a VBN, we split it into 7-bit pieces, and write them in little-endian order. The 8th bit of each byte is set if more bytes follow. In CodeSwitch, we decode VBNs by gluing all the pieces together and sign-extending to 64-bits. This allows us to encode the most commonly used integer constants as one or two bytes in the instruction stream.

Conclusion

I hope you enjoyed this series of articles on the Gypsum compiler. I plan to write more about the Gypsum language and the CodeSwitch VM in the future. I'm currently working on refactoring CodeSwitch, and I plan to write about how it manages memory when I'm done. In Gypsum, I'm experimenting with parametric polymorphism (also known as generics).

As mentioned earlier, Gypsum and CodeSwitch can be found on GitHub.

If you missed the earlier parts of this series of article, please check out part 1 and part 2.