# A simple interpreter from scratch in Python (part 3)

Published on 2011-05-08

Edited on 2014-04-11

Tagged:
compilers imp parsers python

This article is part of the series "A simple interpreter from scratch in Python".

So far, we've written a lexer and a parser combinator library for our interpreter. In this part, we'll create the abstract syntax tree (AST) data structures, and we'll write a parser using our combinator library which will convert the list of tokens returned by the lexer into an AST. Once we have parsed the AST, executing the program will be very easy.

## Defining the AST

Before we can actually start writing our parsers, we need to define the data structures that will be the output of our parser. We will do this with classes. Every syntactic element of IMP will have a corresponding class. Objects of these classes will represent nodes in the AST.

There are three kinds of structures in IMP: arithmetic expressions (used to compute numbers), Boolean expressions (used to compute conditions for if- and while-statements), and statements. We will start with arithmetic expressions, since the other two elements depend on them.

An arithmetic expression can take one of three forms:

- Literal integer constants, such as
`42`

- Variables, such as
`x`

- Binary operations, such as
`x + 42`

. These are made out of other arithmetic expressions.

We can also group expressions together with parenthesis (like `(x + 2) * 3`

. This isn't a different kind of expression so much as a different way to parse the expression.

We will define three classes for these forms, plus a base class for arithmetic expressions in general. For now, the classes won't do much except contain data. We will include a `__repr__`

method, so we can print out the AST for debugging. All AST classes will subclass `Equality`

so we can check if two AST objects are the same. This helps with testing.

from equality import * class Aexp(Equality): pass class IntAexp(Aexp): def __init__(self, i): self.i = i def __repr__(self): return 'IntAexp(%d)' % self.i class VarAexp(Aexp): def __init__(self, name): self.name = name def __repr__(self): return 'VarAexp(%s)' % self.name class BinopAexp(Aexp): def __init__(self, op, left, right): self.op = op self.left = left self.right = right def __repr__(self): return 'BinopAexp(%s, %s, %s)' % (self.op, self.left, self.right)

Boolean expressions are a little more complicated. There are four kinds of Boolean expressions.

- Relational expressions (such as
`x < 10`

) **And**expressions (such as`x < 10 and y > 20`

)**Or**expressions**Not**expressions

The left and right sides of a relational expression are arithmetic expressions. The left and right sides of an "and", "or", or "not" expression are Boolean expressions. Restricting the type like this helps us avoid nonsensical expressions like "x < 10 and 30".

class Bexp(Equality): pass class RelopBexp(Bexp): def __init__(self, op, left, right): ... class AndBexp(Bexp): def __init__(self, left, right): ... class OrBexp(Bexp): def __init__(self, left, right): ... class NotBexp(Bexp): def __init__(self, exp): ...

Statements can contain both arithmetic and Boolean expressions. There are four kinds of statements: assignment, compound, conditional, and loop.

class Statement(Equality): pass class AssignStatement(Statement): def __init__(self, name, aexp): ... class CompoundStatement(Statement): def __init__(self, first, second): ... class IfStatement(Statement): def __init__(self, condition, true_stmt, false_stmt): ... class WhileStatement(Statement): def __init__(self, condition, body): ...

## Primitives

Now that we have our AST classes and a convenient set of parser combinators, it's time to write our parser. When writing a parser, it's usually easiest to start with the most basic structures of the language and work our way up.

The first parser we'll look at is the `keyword`

parser. This is just a specialized version of the `Reserved`

combinator using the `RESERVED`

tag that all keyword tokens are tagged with. Remember, `Reserved`

matches a single token where both the text and tag are the same as the ones given.

def keyword(kw): return Reserved(kw, RESERVED)

`keyword`

is actually a combinator because it is a function that returns a parser. We will use it directly in other parsers though.

The `id`

parser is used to match variable names. It uses the `Tag`

combinator, which matches a token with the specified tag.

id = Tag(ID)

The `num`

parser is used to match integers. It works just like `id`

, except we use the `Process`

combinator (actually the `^`

operator, which calls `Process`

) to convert the token into an actual integer value.

num = Tag(INT) ^ (lambda i: int(i))

## Parsing arithmetic expressions

Parsing arithmetic expressions is not particularly simple, but since we need to parse arithmetic expressions in order to parse Boolean expressions or statements, we will start there.

We'll first define the `aexp_value`

parser, which will convert the values returned by `num`

and `id`

into actual expressions.

def aexp_value(): return (num ^ (lambda i: IntAexp(i))) | \ (id ^ (lambda v: VarAexp(v)))

We're using the `|`

operator here, which is a shorthand for the `Alternate`

combinator. So this will attempt to parse an integer expression first. If that fails, it will try to parse a variable expression.

You'll notice that we defined `aexp_value`

as a zero-argument function instead of a global value, like we did with `id`

and `num`

. We'll do the same thing with all the other parsers. The reason is that we don't want the code for each parser to be evaluated right away. If we defined every parser as a global, each parser would not be able to reference parsers that come after it in the same source file, since they would not be defined yet. This would make recursive parsers impossible to define, and the source code would be awkward to read.

Next, we'd like to support grouping with parenthesis in our arithmetic expressions. Although grouped expressions don't require their own AST class, they do require another parser to handle them.

def process_group(parsed): ((_, p), _) = parsed return p def aexp_group(): return keyword('(') + Lazy(aexp) + keyword(')') ^ process_group

`process_group`

is a function we use with the `Process`

combinator (`^`

operator). It just discards the parenthesis tokens and returns the expression in between. `aexp_group`

is the actual parser. Remember, the `+`

operator is shorthand for the `Concat`

combinator. So this will parse `'('`

, followed by an arithmetic expression (parsed by `aexp`

, which we will define soon), followed by `')'`

. We need to avoid calling `aexp`

directly since `aexp`

will call `aexp_group`

, which would result in infinite recursion. To do this, we use the `Lazy`

combinator, which defers the call to `aexp`

until the parser is actually applied to some input.

Next, we combine `aexp_value`

and `aexp_group`

using `aexp_term`

. An `aexp_term`

expression is any basic, self-contained expression where we don't have to worry about operator precedence with respect to other expressions.

def aexp_term(): return aexp_value() | aexp_group()

Now we come to the tricky part: operators and precedence. It would be easy for us just to define another kind of `aexp`

parser and throw it together with `aexp_term`

. This would result in a simple expression like:

1 + 2 * 3

being parsed incorrectly as:

(1 + 2) * 3

The parser needs to be aware of operator precedence, and it needs to group together operations with higher precedence.

We will define a few helper functions in order to make this work.

def process_binop(op): return lambda l, r: BinopAexp(op, l, r)

`process_binop`

is what actually constructs `BinopAexp`

objects. It takes any arithmetic operator and returns a function that combines a pair of expressions using that operator.

`process_binop`

will be used with the `Exp`

combinator (`*`

operator). `Exp`

parses a list of expressions with a separator between each pair of expressions. The left operand of `Exp`

is a parser that will match individual elements of the list (arithmetic expressions in our case). The right operand is a parser that will match the separators (operators). No matter what separator is matched, the right parser will return a function which, given the matched separator, returns a combining function. The combining function takes the parsed expressions to the left and right of the separator and returns a single, combined expression. Confused yet? We'll go through the usage of `Exp`

shortly. `process_binop`

is actually what will be returned by the right parser.

Next, we'll define our precedence levels and a combinator to help us deal with them.

def any_operator_in_list(ops): op_parsers = [keyword(op) for op in ops] parser = reduce(lambda l, r: l | r, op_parsers) return parser aexp_precedence_levels = [ ['*', '/'], ['+', '-'], ]

`any_operator_in_list`

takes a list of keyword strings and returns a parser that will match any of them. We will call this on `aexp_precedence_levels`

, which contains a list of operators for each precedence level (highest precedence first).

def precedence(value_parser, precedence_levels, combine): def op_parser(precedence_level): return any_operator_in_list(precedence_level) ^ combine parser = value_parser * op_parser(precedence_levels[0]) for precedence_level in precedence_levels[1:]: parser = parser * op_parser(precedence_level) return parser

`precedence`

is the real meat of the operation. Its first argument, `value_parser`

is a parser than can read the basic parts of an expression: numbers, variables, and groups. That will be `aexp_term`

. `precedence_levels`

is a list of lists of operators, one list for each level. We'll use `aexp_precedence_levels`

for this. `combine`

will take a function which, given an operator, returns a function to build a larger expression out of two smaller expressions. That's `process_binop`

.

Inside `precedence`

, we first define `op_parser`

which, for a given precedence level, reads any operator in that level and returns a function which combines two expressions. `op_parser`

can be used as the right-hand argument of `Exp`

. We start by calling `Exp`

with `op_parser`

for the highest precedence level, since these operations need to be grouped together first. We then use the resulting parser as the element parser (`Exp`

's left argument) at the next level. After the loop finishes, the resulting parser will be able to correctly parse any arithmetic expression.

How does this work in practice? Let's work it through.

E_{0}= value_parser E_{1}= E_{0}* op_parser(precedence_levels[0]) E_{2}= E_{1}* op_parser(precedence_levels[1])

`E`

is the same as _{0}`value_parser`

. It can parse numbers, variables, and groups, but no operators. E_{1} can parse expressions containing anything E_{0} can match, separated by operators in the first precedence level. So `E`

could match _{1}`a * b / c`

, but it would raise an error as soon as it encountered a `+`

operator. `E`

can match expressions _{2}`E`

can match, separated by operators in the next precedence level. Since we only have two precedence levels, _{2}`E`

can match any arithmetic expression we support._{2}

Let's do an example. We'll take a complicated arithmetic expression, and gradually replace each part by what matches it.

4 * a + b / 2 - (6 + c) E_{0}(4) * E_{0}(a) + E_{0}(b) / E_{0}(2) - E_{0}(6+c) E_{1}(4*a) + E_{1}(b/2) - E_{1}(6+c) E_{2}((4*a)+(b/2)-(6+c))

We use `precedence`

to directly define `aexp`

:

def aexp(): return precedence(aexp_term(), aexp_precedence_levels, process_binop)

We probably could have defined precedence in a less abstract way, but its strength is that it applies to any situation where operator precedence is a problem. We will use it again to parse Boolean expressions.

## Parsing Boolean expressions

With arithmetic expressions out of the way, we can move on to Boolean expressions. Boolean expressions are actually simpler than arithmetic expressions, and we won't need any new tools to parse them. We'll start with the most basic Boolean expression, the relation:

def process_relop(parsed): ((left, op), right) = parsed return RelopBexp(op, left, right) def bexp_relop(): relops = ['<', '<=', '>', '>=', '=', '!='] return aexp() + any_operator_in_list(relops) + aexp() ^ process_relop

`process_relop`

is just a function that we use with the `Process`

combinator. It just takes three concatenated values and creates a `RelopBexp`

out of them. In `bexp_relop`

, we parse two arithmetic expressions (`aexp`

), separated by a relational operator. We use our old friend `any_operator_in_list`

so we don't have to write a case for every operator. There is no need to use combinators like `Exp`

or `precedence`

since relational expressions can't be chained together in IMP like they can in other languages.

Next, we define the `not`

expression. `not`

is a unary operation with high precedence. This makes it much easier to parse than `and`

and `or`

expressions.

def bexp_not(): return keyword('not') + Lazy(bexp_term) ^ (lambda parsed: NotBexp(parsed[1]))

Here, we just concatenate the keyword `not`

with a Boolean expression term (which we will define next). Since `bexp_not`

will be used to define `bexp_term`

, we need to use the `Lazy`

combinator to avoid infinite recursion.

def bexp_group(): return keyword('(') + Lazy(bexp) + keyword(')') ^ process_group def bexp_term(): return bexp_not() | \ bexp_relop() | \ bexp_group()

We define `bexp_group`

and `bexp_term`

pretty much the same way we did for the arithmetic equivalents. Nothing new here.

Next, we need to define expressions which include the `and`

and `or`

operators. These operators actually work the same way arithmetic operators do; both are evaluated left to right, and `and`

has higher precedence.

bexp_precedence_levels = [ ['and'], ['or'], ] def process_logic(op): if op == 'and': return lambda l, r: AndBexp(l, r) elif op == 'or': return lambda l, r: OrBexp(l, r) else: raise RuntimeError('unknown logic operator: ' + op) def bexp(): return precedence(bexp_term(), bexp_precedence_levels, process_logic)

Just like `process_binop`

, `process_logic`

is intended to be used with the `Exp`

combinator. It takes an operator and returns a function which combines two sub-expressions into an expression using that operator. We pass this along to `precedence`

, just like we did with `aexp`

. Writing generic code pays off here, since we don't have to repeat the complicated expression processing code.

## Parsing statements

With `aexp`

and `bexp`

finished, we can start parsing IMP statements. We'll start with the humble assignment statement.

def assign_stmt(): def process(parsed): ((name, _), exp) = parsed return AssignStatement(name, exp) return id + keyword(':=') + aexp() ^ process

Nothing particularly interesting for this one. Next, we'll look at `stmt_list`

. This will parse a series of statements separated by semicolons.

def stmt_list(): separator = keyword(';') ^ (lambda x: lambda l, r: CompoundStatement(l, r)) return Exp(stmt(), separator)

Remember, we need to use the `Exp`

combinator here instead of doing something simple like `stmt() + keyword(';') + stmt()`

to avoid left recursion.

Next up is the `if`

-statement:

def if_stmt(): def process(parsed): (((((_, condition), _), true_stmt), false_parsed), _) = parsed if false_parsed: (_, false_stmt) = false_parsed else: false_stmt = None return IfStatement(condition, true_stmt, false_stmt) return keyword('if') + bexp() + \ keyword('then') + Lazy(stmt_list) + \ Opt(keyword('else') + Lazy(stmt_list)) + \ keyword('end') ^ process

The only complexity here is the fact that the `else`

-clause is optional. This makes the `process`

function a little more complicated.

Finally, we have the `while`

-loop:

def while_stmt(): def process(parsed): ((((_, condition), _), body), _) = parsed return WhileStatement(condition, body) return keyword('while') + bexp() + \ keyword('do') + Lazy(stmt_list) + \ keyword('end') ^ process

We'll wrap it all up with `stmt`

:

def stmt(): return assign_stmt() | \ if_stmt() | \ while_stmt()

You'll notice that both the `if`

and `while`

statements use `stmt_list`

for their bodies rather than `stmt`

. `stmt_list`

is actually our top level definition. We can't have `stmt`

depend directly on `stmt_list`

, since this would result in a left recursive parser. And since we want `if`

and `while`

statements to be able to have compound statements for their bodies, we use `stmt_list`

directly.

## Putting it all together

We now have a parser for every part of the language. We just need to make a couple top level definitions:

def parser(): return Phrase(stmt_list())

`parser`

will parse an entire program. A program is just a list of statements, but the `Phrase`

combinator ensures that we use every token in the file, rather than ending prematurely if there is are garbage tokens at the end.

def imp_parse(tokens): ast = parser()(tokens, 0) return ast

`imp_parse`

is the function that clients will call to parse a program. It takes the full list of tokens, calls `parser`

, starting with the first token, and returns the resulting AST.

To test out our parsers (in addition to unit tests), here's a simple driver program:

import sys from imp_parser import * if __name__ == '__main__': if len(sys.argv) != 3: sys.stderr.write('usage: %s filename parsername' % sys.argv[0]) sys.exit(1) filename = sys.argv[1] file = open(filename) characters = file.read() file.close() tokens = imp_lex(characters) parser = globals()[sys.argv[2]]() result = parser(tokens, 0) print result

This program will read a file (first argument) and parse it with one of the parsers in imp_parse.py (second argument). For example:

$ echo '1 + 2 * 3' >foo.imp $ python imp_parser_driver.py foo.imp aexp Result(BinopAexp(+, IntAexp(1), BinopAexp(*, IntAexp(2), IntAexp(3))), 5)

This should provide a good sandbox for experimentation.

## Conclusion

In this article, we built a parser combinator library from scratch and used it to write a parser for IMP. In the next and final article in the series, we'll write an evaluator for our parsed AST.

Once again, all the source code for the interpreter is available for download.