A simple interpreter from scratch in Python (part 3)
So far, we've written a lexer and a parser combinator library for our interpreter. In this part, we'll create the abstract syntax tree (AST) data structures, and we'll write a parser using our combinator library which will convert the list of tokens returned by the lexer into an AST. Once we have parsed the AST, executing the program will be very easy.
Defining the AST
Before we can actually start writing our parsers, we need to define the data structures that will be the output of our parser. We will do this with classes. Every syntactic element of IMP will have a corresponding class. Objects of these classes will represent nodes in the AST.
There are three kinds of structures in IMP: arithmetic expressions (used to compute numbers), Boolean expressions (used to compute conditions for if- and while-statements), and statements. We will start with arithmetic expressions, since the other two elements depend on them.
An arithmetic expression can take one of three forms:
- Literal integer constants, such as
42
- Variables, such as
x
- Binary operations, such as
x + 42
. These are made out of other arithmetic expressions.
We can also group expressions together with parenthesis (like (x + 2) * 3
. This isn't a different kind of expression so much as a different way to parse the expression.
We will define three classes for these forms, plus a base class for arithmetic expressions in general. For now, the classes won't do much except contain data. We will include a __repr__
method, so we can print out the AST for debugging. All AST classes will subclass Equality
so we can check if two AST objects are the same. This helps with testing.
from equality import * class Aexp(Equality): pass class IntAexp(Aexp): def __init__(self, i): self.i = i def __repr__(self): return 'IntAexp(%d)' % self.i class VarAexp(Aexp): def __init__(self, name): self.name = name def __repr__(self): return 'VarAexp(%s)' % self.name class BinopAexp(Aexp): def __init__(self, op, left, right): self.op = op self.left = left self.right = right def __repr__(self): return 'BinopAexp(%s, %s, %s)' % (self.op, self.left, self.right)
Boolean expressions are a little more complicated. There are four kinds of Boolean expressions.
- Relational expressions (such as
x < 10
) - And expressions (such as
x < 10 and y > 20
) - Or expressions
- Not expressions
The left and right sides of a relational expression are arithmetic expressions. The left and right sides of an "and", "or", or "not" expression are Boolean expressions. Restricting the type like this helps us avoid nonsensical expressions like "x < 10 and 30".
class Bexp(Equality): pass class RelopBexp(Bexp): def __init__(self, op, left, right): ... class AndBexp(Bexp): def __init__(self, left, right): ... class OrBexp(Bexp): def __init__(self, left, right): ... class NotBexp(Bexp): def __init__(self, exp): ...
Statements can contain both arithmetic and Boolean expressions. There are four kinds of statements: assignment, compound, conditional, and loop.
class Statement(Equality): pass class AssignStatement(Statement): def __init__(self, name, aexp): ... class CompoundStatement(Statement): def __init__(self, first, second): ... class IfStatement(Statement): def __init__(self, condition, true_stmt, false_stmt): ... class WhileStatement(Statement): def __init__(self, condition, body): ...
Primitives
Now that we have our AST classes and a convenient set of parser combinators, it's time to write our parser. When writing a parser, it's usually easiest to start with the most basic structures of the language and work our way up.
The first parser we'll look at is the keyword
parser. This is just a specialized version of the Reserved
combinator using the RESERVED
tag that all keyword tokens are tagged with. Remember, Reserved
matches a single token where both the text and tag are the same as the ones given.
def keyword(kw): return Reserved(kw, RESERVED)
keyword
is actually a combinator because it is a function that returns a parser. We will use it directly in other parsers though.
The id
parser is used to match variable names. It uses the Tag
combinator, which matches a token with the specified tag.
id = Tag(ID)
The num
parser is used to match integers. It works just like id
, except we use the Process
combinator (actually the ^
operator, which calls Process
) to convert the token into an actual integer value.
num = Tag(INT) ^ (lambda i: int(i))
Parsing arithmetic expressions
Parsing arithmetic expressions is not particularly simple, but since we need to parse arithmetic expressions in order to parse Boolean expressions or statements, we will start there.
We'll first define the aexp_value
parser, which will convert the values returned by num
and id
into actual expressions.
def aexp_value(): return (num ^ (lambda i: IntAexp(i))) | \ (id ^ (lambda v: VarAexp(v)))
We're using the |
operator here, which is a shorthand for the Alternate
combinator. So this will attempt to parse an integer expression first. If that fails, it will try to parse a variable expression.
You'll notice that we defined aexp_value
as a zero-argument function instead of a global value, like we did with id
and num
. We'll do the same thing with all the other parsers. The reason is that we don't want the code for each parser to be evaluated right away. If we defined every parser as a global, each parser would not be able to reference parsers that come after it in the same source file, since they would not be defined yet. This would make recursive parsers impossible to define, and the source code would be awkward to read.
Next, we'd like to support grouping with parenthesis in our arithmetic expressions. Although grouped expressions don't require their own AST class, they do require another parser to handle them.
def process_group(parsed): ((_, p), _) = parsed return p def aexp_group(): return keyword('(') + Lazy(aexp) + keyword(')') ^ process_group
process_group
is a function we use with the Process
combinator (^
operator). It just discards the parenthesis tokens and returns the expression in between. aexp_group
is the actual parser. Remember, the +
operator is shorthand for the Concat
combinator. So this will parse '('
, followed by an arithmetic expression (parsed by aexp
, which we will define soon), followed by ')'
. We need to avoid calling aexp
directly since aexp
will call aexp_group
, which would result in infinite recursion. To do this, we use the Lazy
combinator, which defers the call to aexp
until the parser is actually applied to some input.
Next, we combine aexp_value
and aexp_group
using aexp_term
. An aexp_term
expression is any basic, self-contained expression where we don't have to worry about operator precedence with respect to other expressions.
def aexp_term(): return aexp_value() | aexp_group()
Now we come to the tricky part: operators and precedence. It would be easy for us just to define another kind of aexp
parser and throw it together with aexp_term
. This would result in a simple expression like:
1 + 2 * 3
being parsed incorrectly as:
(1 + 2) * 3
The parser needs to be aware of operator precedence, and it needs to group together operations with higher precedence.
We will define a few helper functions in order to make this work.
def process_binop(op): return lambda l, r: BinopAexp(op, l, r)
process_binop
is what actually constructs BinopAexp
objects. It takes any arithmetic operator and returns a function that combines a pair of expressions using that operator.
process_binop
will be used with the Exp
combinator (*
operator). Exp
parses a list of expressions with a separator between each pair of expressions. The left operand of Exp
is a parser that will match individual elements of the list (arithmetic expressions in our case). The right operand is a parser that will match the separators (operators). No matter what separator is matched, the right parser will return a function which, given the matched separator, returns a combining function. The combining function takes the parsed expressions to the left and right of the separator and returns a single, combined expression. Confused yet? We'll go through the usage of Exp
shortly. process_binop
is actually what will be returned by the right parser.
Next, we'll define our precedence levels and a combinator to help us deal with them.
def any_operator_in_list(ops): op_parsers = [keyword(op) for op in ops] parser = reduce(lambda l, r: l | r, op_parsers) return parser aexp_precedence_levels = [ ['*', '/'], ['+', '-'], ]
any_operator_in_list
takes a list of keyword strings and returns a parser that will match any of them. We will call this on aexp_precedence_levels
, which contains a list of operators for each precedence level (highest precedence first).
def precedence(value_parser, precedence_levels, combine): def op_parser(precedence_level): return any_operator_in_list(precedence_level) ^ combine parser = value_parser * op_parser(precedence_levels[0]) for precedence_level in precedence_levels[1:]: parser = parser * op_parser(precedence_level) return parser
precedence
is the real meat of the operation. Its first argument, value_parser
is a parser than can read the basic parts of an expression: numbers, variables, and groups. That will be aexp_term
. precedence_levels
is a list of lists of operators, one list for each level. We'll use aexp_precedence_levels
for this. combine
will take a function which, given an operator, returns a function to build a larger expression out of two smaller expressions. That's process_binop
.
Inside precedence
, we first define op_parser
which, for a given precedence level, reads any operator in that level and returns a function which combines two expressions. op_parser
can be used as the right-hand argument of Exp
. We start by calling Exp
with op_parser
for the highest precedence level, since these operations need to be grouped together first. We then use the resulting parser as the element parser (Exp
's left argument) at the next level. After the loop finishes, the resulting parser will be able to correctly parse any arithmetic expression.
How does this work in practice? Let's work it through.
E0 = value_parser E1 = E0 * op_parser(precedence_levels[0]) E2 = E1 * op_parser(precedence_levels[1])
E0
is the same as value_parser
. It can parse numbers, variables, and groups, but no operators. E1 can parse expressions containing anything E0 can match, separated by operators in the first precedence level. So E1
could match a * b / c
, but it would raise an error as soon as it encountered a +
operator. E2
can match expressions E2
can match, separated by operators in the next precedence level. Since we only have two precedence levels, E2
can match any arithmetic expression we support.
Let's do an example. We'll take a complicated arithmetic expression, and gradually replace each part by what matches it.
4 * a + b / 2 - (6 + c) E0(4) * E0(a) + E0(b) / E0(2) - E0(6+c) E1(4*a) + E1(b/2) - E1(6+c) E2((4*a)+(b/2)-(6+c))
We use precedence
to directly define aexp
:
def aexp(): return precedence(aexp_term(), aexp_precedence_levels, process_binop)
We probably could have defined precedence in a less abstract way, but its strength is that it applies to any situation where operator precedence is a problem. We will use it again to parse Boolean expressions.
Parsing Boolean expressions
With arithmetic expressions out of the way, we can move on to Boolean expressions. Boolean expressions are actually simpler than arithmetic expressions, and we won't need any new tools to parse them. We'll start with the most basic Boolean expression, the relation:
def process_relop(parsed): ((left, op), right) = parsed return RelopBexp(op, left, right) def bexp_relop(): relops = ['<', '<=', '>', '>=', '=', '!='] return aexp() + any_operator_in_list(relops) + aexp() ^ process_relop
process_relop
is just a function that we use with the Process
combinator. It just takes three concatenated values and creates a RelopBexp
out of them. In bexp_relop
, we parse two arithmetic expressions (aexp
), separated by a relational operator. We use our old friend any_operator_in_list
so we don't have to write a case for every operator. There is no need to use combinators like Exp
or precedence
since relational expressions can't be chained together in IMP like they can in other languages.
Next, we define the not
expression. not
is a unary operation with high precedence. This makes it much easier to parse than and
and or
expressions.
def bexp_not(): return keyword('not') + Lazy(bexp_term) ^ (lambda parsed: NotBexp(parsed[1]))
Here, we just concatenate the keyword not
with a Boolean expression term (which we will define next). Since bexp_not
will be used to define bexp_term
, we need to use the Lazy
combinator to avoid infinite recursion.
def bexp_group(): return keyword('(') + Lazy(bexp) + keyword(')') ^ process_group def bexp_term(): return bexp_not() | \ bexp_relop() | \ bexp_group()
We define bexp_group
and bexp_term
pretty much the same way we did for the arithmetic equivalents. Nothing new here.
Next, we need to define expressions which include the and
and or
operators. These operators actually work the same way arithmetic operators do; both are evaluated left to right, and and
has higher precedence.
bexp_precedence_levels = [ ['and'], ['or'], ] def process_logic(op): if op == 'and': return lambda l, r: AndBexp(l, r) elif op == 'or': return lambda l, r: OrBexp(l, r) else: raise RuntimeError('unknown logic operator: ' + op) def bexp(): return precedence(bexp_term(), bexp_precedence_levels, process_logic)
Just like process_binop
, process_logic
is intended to be used with the Exp
combinator. It takes an operator and returns a function which combines two sub-expressions into an expression using that operator. We pass this along to precedence
, just like we did with aexp
. Writing generic code pays off here, since we don't have to repeat the complicated expression processing code.
Parsing statements
With aexp
and bexp
finished, we can start parsing IMP statements. We'll start with the humble assignment statement.
def assign_stmt(): def process(parsed): ((name, _), exp) = parsed return AssignStatement(name, exp) return id + keyword(':=') + aexp() ^ process
Nothing particularly interesting for this one. Next, we'll look at stmt_list
. This will parse a series of statements separated by semicolons.
def stmt_list(): separator = keyword(';') ^ (lambda x: lambda l, r: CompoundStatement(l, r)) return Exp(stmt(), separator)
Remember, we need to use the Exp
combinator here instead of doing something simple like stmt() + keyword(';') + stmt()
to avoid left recursion.
Next up is the if
-statement:
def if_stmt(): def process(parsed): (((((_, condition), _), true_stmt), false_parsed), _) = parsed if false_parsed: (_, false_stmt) = false_parsed else: false_stmt = None return IfStatement(condition, true_stmt, false_stmt) return keyword('if') + bexp() + \ keyword('then') + Lazy(stmt_list) + \ Opt(keyword('else') + Lazy(stmt_list)) + \ keyword('end') ^ process
The only complexity here is the fact that the else
-clause is optional. This makes the process
function a little more complicated.
Finally, we have the while
-loop:
def while_stmt(): def process(parsed): ((((_, condition), _), body), _) = parsed return WhileStatement(condition, body) return keyword('while') + bexp() + \ keyword('do') + Lazy(stmt_list) + \ keyword('end') ^ process
We'll wrap it all up with stmt
:
def stmt(): return assign_stmt() | \ if_stmt() | \ while_stmt()
You'll notice that both the if
and while
statements use stmt_list
for their bodies rather than stmt
. stmt_list
is actually our top level definition. We can't have stmt
depend directly on stmt_list
, since this would result in a left recursive parser. And since we want if
and while
statements to be able to have compound statements for their bodies, we use stmt_list
directly.
Putting it all together
We now have a parser for every part of the language. We just need to make a couple top level definitions:
def parser(): return Phrase(stmt_list())
parser
will parse an entire program. A program is just a list of statements, but the Phrase
combinator ensures that we use every token in the file, rather than ending prematurely if there is are garbage tokens at the end.
def imp_parse(tokens): ast = parser()(tokens, 0) return ast
imp_parse
is the function that clients will call to parse a program. It takes the full list of tokens, calls parser
, starting with the first token, and returns the resulting AST.
To test out our parsers (in addition to unit tests), here's a simple driver program:
import sys from imp_parser import * if __name__ == '__main__': if len(sys.argv) != 3: sys.stderr.write('usage: %s filename parsername' % sys.argv[0]) sys.exit(1) filename = sys.argv[1] file = open(filename) characters = file.read() file.close() tokens = imp_lex(characters) parser = globals()[sys.argv[2]]() result = parser(tokens, 0) print result
This program will read a file (first argument) and parse it with one of the parsers in imp_parse.py (second argument). For example:
$ echo '1 + 2 * 3' >foo.imp $ python imp_parser_driver.py foo.imp aexp Result(BinopAexp(+, IntAexp(1), BinopAexp(*, IntAexp(2), IntAexp(3))), 5)
This should provide a good sandbox for experimentation.
Conclusion
In this article, we built a parser combinator library from scratch and used it to write a parser for IMP. In the next and final article in the series, we'll write an evaluator for our parsed AST.
Once again, all the source code for the interpreter is available for download.