Structure of the Gypsum compiler (part 2)
A few weeks ago, I introduced a new programming lanugage I've been working on, Gypsum. You can find Gypsum on GitHub. Now I'm writing a series of articles on how the Gypsum compiler actually works.
My last article discussed lexing, parsing, and layout analysis in the Gypsum compiler. In this article, I'll talk about the next three phases of the Gypsum compiler.
- Lexical analysis
- Layout analysis
- Syntax analysis
- Declaration analysis
- Inheritance analysis
- Type analysis
- Closure conversion
- Class flattening
- Semantic analysis
- Serialization
Intermediate representation
Before we dive into the process of compiling the abstract syntax tree (AST) we parsed earlier, let's talk about what the compiler will actually produce. The output of the Gypsum compiler is a CodeSwitch package (.csp) file, which can be executed by the CodeSwitch VM (part of the same project). CodeSwitch packages are binary files, which just contain a representation of the definitions in the program. Function bodies are represented with a stack-oriented bytecode (which I'll describe in the next article). Packages don't contain any native code and aren't targeted toward any particular machine. They aren't even specific to 32-bit or 64-bit architectures.
The Gypsum compiler's intermediate representation (IR) is pretty much exactly what goes into a package. The IR is designed to be very easy to generate, as opposed to being easy to analyze, transform, and optimize. The assumption is that the VM will have an optimizer which will translate packages into its own IR which will be better suited to optimization.
An important thing to realize about the IR is that it is flat: there are no functions within functions or classes within classes or scoping of any kind. Most of the compiler's job is to simplify the program into this flat representation. As a result, a package ends up being just a list of definitions.
At the top level, a package contains:
- Functions: basic unit of code. A function has a name, a list of type parameters, a return type, a list of parameter types, a list of variables, and a body, which is a list of basic blocks that contain bytecode.
- Classes: used to define new types of objects. A class has a name, a list of type parameters, a list of supertypes, an initializer function, a list of constructor functions, a list of fields, and a list of methods. Constructors, methods, and the initializer are just regular functions that take an instance of the class as their first argument. Note that classes don't contain functions; they reference them in the package's list of functions by index.
- Globals: used to store global variables and constants. These just have a name, a type, and an optional initial value. These are not fully supported yet.
- Type parameters: used to implement parametric polymorphism, also known as generics. These have a name, an upper bound type, and a lower bound type. There is only partial support for these right now.
There are some other kinds of definitions which help define the top-level definitions.
- Fields: Each class has a list of these. They are used to describe the data members of a class. Each field has a name and a type.
- Variables: Each function has a list of these. They are used to describe the parameters and non-captured local variables of a function. Each variable has a name, a type, and a kind (parameter or local). Note that variables are not stored in the generated package; they are only for the compiler's internal use.
- Types: A type describes the set of values a variable, field, or other definition can have. Gypsum is a statically-typed language, so types are used by all the other kinds of definitions.
The IR is defined in ir.py and ir_types.py.
Declaration analysis
Declaration analysis is performed right after the AST is parsed. It has two purposes. First, it creates IR definitions for each node in the AST that defines something. For example, when the analysis encounters an AstFunctionDefinition
, it creates a new Function
and adds it to the package being built. The newly created IR definitions are initially empty. Later phases will fill in details like types and bytecode. Some local definitions, such as variables and fields, are attached to their parent functions and classes as soon as they are defined though.
The second purpose of declaration analysis is to create scope information. The analysis creates a new Scope
object for each AST node that has its own scope, namely the top-level module, functions, classes, and block expressions. A Scope
object is a glorified dictionary. It manages bindings from symbols to the corresponding definitions so that the definitions can be looked up later. Scope
objects are hierarchical: they form a tree with the global scope on top, matching the lexical scope of the program.
While building scopes, we can also do some basic error checking: if two (or more) definitions have the same name in the same scope, we report an error. This is only allowed if all the named definitions are functions, since Gypsum support overloading.
Declaration analysis is performed in scope_analysis.py.
Aside: AST traversal
You will notice that declaration analysis and later passes all traverse the AST and do different things depending on what kind of nodes they are dealing with. Normally, this would involve a lot of boilerplate code. However, we apply the visitor pattern to abstract this traversal into the AstVisitor
class (which itself is a subclass of a more generic Visitor
class.
To use this functionality, we just create a subclass of AstVisitor
and implement methods like visitAstFunctionDefinition
(the emphasized part of the name is the class name of the node to be visited). Then we instantiate our new class and call the visit
method with the node to visit and some other arguments. visit
constructs a method name dynamically using the node's class name. If the method is implemented, visit
calls it.
You can see the full Visitor
class in visitor.py on GitHub. See ScopeVisitor
and DeclarationVisitor
in scope_analysis.py to see how it's used.
Inheritance analysis
Inheritance analysis is performed after scope analysis, since it requires all class definitions to be present. Its main purpose is to establish inheritance relations between classes, as the name would suggest. Every class definition contains an optional supertype. If a supertype is specified, inheritance analysis finds the class using the scope information constructed earlier and records the inheritance. If no supertype is specified, the class implicitly inherits from the root Object
class (much like Java).
While recording these inheritance relations, the analysis constructs an inheritance graph with classes as nodes and inheritances as edges. After traversing the AST, it checks this graph for cycles and reports an error if it finds one.
// This inheritance is not legal. // The compiler will report an error for this. class A <: B class B <: C class C <: A
Gypsum only supports single inheritance for now, but I intend to allow multiple inheritance of traits (like Scala) in the future.
Inheritance analysis is performed in scope_analysis.py.
Type analysis and use analysis
This phase is where things really start to come together. Two analyses are actually performed in the same pass. Use analysis uses the scope information constructed during declaration analysis to link uses of symbols (like variable expressions and class types) back to their original definitions. Type analysis associates a type with every AST node that needs one, including all expressions, patterns, types. We check correctness of types at the same, e.g., whether function arguments have the correct type, whether if
and while
conditions are Booleans, etc.
These analyses used to be performed in separate phases, but since function overloading was added to Gypsum, they need to be performed together. We need type information to figure out exactly which overloaded function we're calling. And we need to know which function we're calling to find its return type.
As with other phases, both type and use analysis are performed by traversing the AST using a visitor. The tree is traversed from the bottom up, i.e., for each expression, we find the types of all its sub-expressions, then use node-specific logic to find the type of the expression itself. Most expressions are syntactic sugar for function calls, so this usually just means we look up the function and use its return type. If the return type of a called function is implicit and has not been inferred yet, we push the current function onto a stack and recursively analyze the called function. If we end up analyzing a function already on the stack, the compiler reports an error. This can only happen for recursive functions or groups of mutually recursive functions, and an explicit return type is required in those cases for exactly this reason.
Type analysis is performed in type_analysis.py.
Conclusion
Hopefully, this article gives you some idea of how we go from a Gypsum abstract syntax tree to the intermediate representation. After these phases are complete, we have built a "skeleton" IR and annotated the AST with information we will need for semantic analysis. We have also checked for the most common kinds of errors: bad symbolic references and type mismatches.
In the next article, I'll cover the remaining phases, which "flesh out" the IR and serialize it in a format the VM understands.