A simple interpreter from scratch in Python (part 3)

Published on 2011-05-09
Edited on 2014-04-12
Tagged: compilers imp parsers python

View All Posts

This article is part of the series "A simple interpreter from scratch in Python".

So far, we've written a lexer and a parser combinator library for our interpreter. In this part, we'll create the abstract syntax tree (AST) data structures, and we'll write a parser using our combinator library which will convert the list of tokens returned by the lexer into an AST. Once we have parsed the AST, executing the program will be very easy.

Defining the AST

Before we can actually start writing our parsers, we need to define the data structures that will be the output of our parser. We will do this with classes. Every syntactic element of IMP will have a corresponding class. Objects of these classes will represent nodes in the AST.

There are three kinds of structures in IMP: arithmetic expressions (used to compute numbers), Boolean expressions (used to compute conditions for if- and while-statements), and statements. We will start with arithmetic expressions, since the other two elements depend on them.

An arithmetic expression can take one of three forms:

We can also group expressions together with parenthesis (like (x + 2) * 3. This isn't a different kind of expression so much as a different way to parse the expression.

We will define three classes for these forms, plus a base class for arithmetic expressions in general. For now, the classes won't do much except contain data. We will include a __repr__ method, so we can print out the AST for debugging. All AST classes will subclass Equality so we can check if two AST objects are the same. This helps with testing.

from equality import *

class Aexp(Equality):
    pass

class IntAexp(Aexp):
    def __init__(self, i):
        self.i = i

    def __repr__(self):
        return 'IntAexp(%d)' % self.i

class VarAexp(Aexp):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return 'VarAexp(%s)' % self.name

class BinopAexp(Aexp):
    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def __repr__(self):
        return 'BinopAexp(%s, %s, %s)' % (self.op, self.left, self.right)

Boolean expressions are a little more complicated. There are four kinds of Boolean expressions.

The left and right sides of a relational expression are arithmetic expressions. The left and right sides of an "and", "or", or "not" expression are Boolean expressions. Restricting the type like this helps us avoid nonsensical expressions like "x < 10 and 30".

class Bexp(Equality):
    pass

class RelopBexp(Bexp):
    def __init__(self, op, left, right):
        ...

class AndBexp(Bexp):
    def __init__(self, left, right):
        ...
class OrBexp(Bexp):
    def __init__(self, left, right):
        ...

class NotBexp(Bexp):
    def __init__(self, exp):
        ...

Statements can contain both arithmetic and Boolean expressions. There are four kinds of statements: assignment, compound, conditional, and loop.

class Statement(Equality):
    pass

class AssignStatement(Statement):
    def __init__(self, name, aexp):
         ...

class CompoundStatement(Statement):
    def __init__(self, first, second):
        ...

class IfStatement(Statement):
    def __init__(self, condition, true_stmt, false_stmt):
        ...

class WhileStatement(Statement):
    def __init__(self, condition, body):
        ...

Primitives

Now that we have our AST classes and a convenient set of parser combinators, it's time to write our parser. When writing a parser, it's usually easiest to start with the most basic structures of the language and work our way up.

The first parser we'll look at is the keyword parser. This is just a specialized version of the Reserved combinator using the RESERVED tag that all keyword tokens are tagged with. Remember, Reserved matches a single token where both the text and tag are the same as the ones given.

def keyword(kw):
    return Reserved(kw, RESERVED)

keyword is actually a combinator because it is a function that returns a parser. We will use it directly in other parsers though.

The id parser is used to match variable names. It uses the Tag combinator, which matches a token with the specified tag.

id = Tag(ID)

The num parser is used to match integers. It works just like id, except we use the Process combinator (actually the ^ operator, which calls Process) to convert the token into an actual integer value.

num = Tag(INT) ^ (lambda i: int(i))

Parsing arithmetic expressions

Parsing arithmetic expressions is not particularly simple, but since we need to parse arithmetic expressions in order to parse Boolean expressions or statements, we will start there.

We'll first define the aexp_value parser, which will convert the values returned by num and id into actual expressions.

def aexp_value():
    return (num ^ (lambda i: IntAexp(i))) | \
           (id  ^ (lambda v: VarAexp(v)))

We're using the | operator here, which is a shorthand for the Alternate combinator. So this will attempt to parse an integer expression first. If that fails, it will try to parse a variable expression.

You'll notice that we defined aexp_value as a zero-argument function instead of a global value, like we did with id and num. We'll do the same thing with all the other parsers. The reason is that we don't want the code for each parser to be evaluated right away. If we defined every parser as a global, each parser would not be able to reference parsers that come after it in the same source file, since they would not be defined yet. This would make recursive parsers impossible to define, and the source code would be awkward to read.

Next, we'd like to support grouping with parenthesis in our arithmetic expressions. Although grouped expressions don't require their own AST class, they do require another parser to handle them.

def process_group(parsed):
    ((_, p), _) = parsed
    return p

def aexp_group():
    return keyword('(') + Lazy(aexp) + keyword(')') ^ process_group

process_group is a function we use with the Process combinator (^ operator). It just discards the parenthesis tokens and returns the expression in between. aexp_group is the actual parser. Remember, the + operator is shorthand for the Concat combinator. So this will parse '(', followed by an arithmetic expression (parsed by aexp, which we will define soon), followed by ')'. We need to avoid calling aexp directly since aexp will call aexp_group, which would result in infinite recursion. To do this, we use the Lazy combinator, which defers the call to aexp until the parser is actually applied to some input.

Next, we combine aexp_value and aexp_group using aexp_term. An aexp_term expression is any basic, self-contained expression where we don't have to worry about operator precedence with respect to other expressions.

def aexp_term():
    return aexp_value() | aexp_group()

Now we come to the tricky part: operators and precedence. It would be easy for us just to define another kind of aexp parser and throw it together with aexp_term. This would result in a simple expression like:

1 + 2 * 3

being parsed incorrectly as:

(1 + 2) * 3

The parser needs to be aware of operator precedence, and it needs to group together operations with higher precedence.

We will define a few helper functions in order to make this work.

def process_binop(op):
    return lambda l, r: BinopAexp(op, l, r)

process_binop is what actually constructs BinopAexp objects. It takes any arithmetic operator and returns a function that combines a pair of expressions using that operator.

process_binop will be used with the Exp combinator (* operator). Exp parses a list of expressions with a separator between each pair of expressions. The left operand of Exp is a parser that will match individual elements of the list (arithmetic expressions in our case). The right operand is a parser that will match the separators (operators). No matter what separator is matched, the right parser will return a function which, given the matched separator, returns a combining function. The combining function takes the parsed expressions to the left and right of the separator and returns a single, combined expression. Confused yet? We'll go through the usage of Exp shortly. process_binop is actually what will be returned by the right parser.

Next, we'll define our precedence levels and a combinator to help us deal with them.

def any_operator_in_list(ops):
    op_parsers = [keyword(op) for op in ops]
    parser = reduce(lambda l, r: l | r, op_parsers)
    return parser

aexp_precedence_levels = [
    ['*', '/'],
    ['+', '-'],
]

any_operator_in_list takes a list of keyword strings and returns a parser that will match any of them. We will call this on aexp_precedence_levels, which contains a list of operators for each precedence level (highest precedence first).

def precedence(value_parser, precedence_levels, combine):
    def op_parser(precedence_level):
        return any_operator_in_list(precedence_level) ^ combine
    parser = value_parser * op_parser(precedence_levels[0])
    for precedence_level in precedence_levels[1:]:
        parser = parser * op_parser(precedence_level)
    return parser

precedence is the real meat of the operation. Its first argument, value_parser is a parser than can read the basic parts of an expression: numbers, variables, and groups. That will be aexp_term. precedence_levels is a list of lists of operators, one list for each level. We'll use aexp_precedence_levels for this. combine will take a function which, given an operator, returns a function to build a larger expression out of two smaller expressions. That's process_binop.

Inside precedence, we first define op_parser which, for a given precedence level, reads any operator in that level and returns a function which combines two expressions. op_parser can be used as the right-hand argument of Exp. We start by calling Exp with op_parser for the highest precedence level, since these operations need to be grouped together first. We then use the resulting parser as the element parser (Exp's left argument) at the next level. After the loop finishes, the resulting parser will be able to correctly parse any arithmetic expression.

How does this work in practice? Let's work it through.

E0 = value_parser
E1 = E0 * op_parser(precedence_levels[0])
E2 = E1 * op_parser(precedence_levels[1])

E0 is the same as value_parser. It can parse numbers, variables, and groups, but no operators. E1 can parse expressions containing anything E0 can match, separated by operators in the first precedence level. So E1 could match a * b / c, but it would raise an error as soon as it encountered a + operator. E2 can match expressions E2 can match, separated by operators in the next precedence level. Since we only have two precedence levels, E2 can match any arithmetic expression we support.

Let's do an example. We'll take a complicated arithmetic expression, and gradually replace each part by what matches it.

4 * a + b / 2 - (6 + c)
E0(4) * E0(a) + E0(b) / E0(2) - E0(6+c)
E1(4*a) + E1(b/2) - E1(6+c)
E2((4*a)+(b/2)-(6+c))

We use precedence to directly define aexp:

def aexp():
    return precedence(aexp_term(),
                      aexp_precedence_levels,
                      process_binop)

We probably could have defined precedence in a less abstract way, but its strength is that it applies to any situation where operator precedence is a problem. We will use it again to parse Boolean expressions.

Parsing Boolean expressions

With arithmetic expressions out of the way, we can move on to Boolean expressions. Boolean expressions are actually simpler than arithmetic expressions, and we won't need any new tools to parse them. We'll start with the most basic Boolean expression, the relation:

def process_relop(parsed):
    ((left, op), right) = parsed
    return RelopBexp(op, left, right)

def bexp_relop():
    relops = ['<', '<=', '>', '>=', '=', '!=']
    return aexp() + any_operator_in_list(relops) + aexp() ^ process_relop

process_relop is just a function that we use with the Process combinator. It just takes three concatenated values and creates a RelopBexp out of them. In bexp_relop, we parse two arithmetic expressions (aexp), separated by a relational operator. We use our old friend any_operator_in_list so we don't have to write a case for every operator. There is no need to use combinators like Exp or precedence since relational expressions can't be chained together in IMP like they can in other languages.

Next, we define the not expression. not is a unary operation with high precedence. This makes it much easier to parse than and and or expressions.

def bexp_not():
    return keyword('not') + Lazy(bexp_term) ^ (lambda parsed: NotBexp(parsed[1]))

Here, we just concatenate the keyword not with a Boolean expression term (which we will define next). Since bexp_not will be used to define bexp_term, we need to use the Lazy combinator to avoid infinite recursion.

def bexp_group():
    return keyword('(') + Lazy(bexp) + keyword(')') ^ process_group

def bexp_term():
    return bexp_not()   | \
           bexp_relop() | \
           bexp_group()

We define bexp_group and bexp_term pretty much the same way we did for the arithmetic equivalents. Nothing new here.

Next, we need to define expressions which include the and and or operators. These operators actually work the same way arithmetic operators do; both are evaluated left to right, and and has higher precedence.

bexp_precedence_levels = [
    ['and'],
    ['or'],
]

def process_logic(op):
    if op == 'and':
        return lambda l, r: AndBexp(l, r)
    elif op == 'or':
        return lambda l, r: OrBexp(l, r)
    else:
        raise RuntimeError('unknown logic operator: ' + op)

def bexp():
    return precedence(bexp_term(),
                      bexp_precedence_levels,
                      process_logic)

Just like process_binop, process_logic is intended to be used with the Exp combinator. It takes an operator and returns a function which combines two sub-expressions into an expression using that operator. We pass this along to precedence, just like we did with aexp. Writing generic code pays off here, since we don't have to repeat the complicated expression processing code.

Parsing statements

With aexp and bexp finished, we can start parsing IMP statements. We'll start with the humble assignment statement.

def assign_stmt():
    def process(parsed):
        ((name, _), exp) = parsed
        return AssignStatement(name, exp)
    return id + keyword(':=') + aexp() ^ process

Nothing particularly interesting for this one. Next, we'll look at stmt_list. This will parse a series of statements separated by semicolons.

def stmt_list():
    separator = keyword(';') ^ (lambda x: lambda l, r: CompoundStatement(l, r))
    return Exp(stmt(), separator)

Remember, we need to use the Exp combinator here instead of doing something simple like stmt() + keyword(';') + stmt() to avoid left recursion.

Next up is the if-statement:

def if_stmt():
    def process(parsed):
        (((((_, condition), _), true_stmt), false_parsed), _) = parsed
        if false_parsed:
            (_, false_stmt) = false_parsed
        else:
            false_stmt = None
        return IfStatement(condition, true_stmt, false_stmt)
    return keyword('if') + bexp() + \
           keyword('then') + Lazy(stmt_list) + \
           Opt(keyword('else') + Lazy(stmt_list)) + \
           keyword('end') ^ process

The only complexity here is the fact that the else-clause is optional. This makes the process function a little more complicated.

Finally, we have the while-loop:

def while_stmt():
    def process(parsed):
        ((((_, condition), _), body), _) = parsed
        return WhileStatement(condition, body)
    return keyword('while') + bexp() + \
           keyword('do') + Lazy(stmt_list) + \
           keyword('end') ^ process

We'll wrap it all up with stmt:

def stmt():
    return assign_stmt() | \
           if_stmt()     | \
           while_stmt()

You'll notice that both the if and while statements use stmt_list for their bodies rather than stmt. stmt_list is actually our top level definition. We can't have stmt depend directly on stmt_list, since this would result in a left recursive parser. And since we want if and while statements to be able to have compound statements for their bodies, we use stmt_list directly.

Putting it all together

We now have a parser for every part of the language. We just need to make a couple top level definitions:

def parser():
    return Phrase(stmt_list())    

parser will parse an entire program. A program is just a list of statements, but the Phrase combinator ensures that we use every token in the file, rather than ending prematurely if there is are garbage tokens at the end.

def imp_parse(tokens):
    ast = parser()(tokens, 0)
    return ast

imp_parse is the function that clients will call to parse a program. It takes the full list of tokens, calls parser, starting with the first token, and returns the resulting AST.

To test out our parsers (in addition to unit tests), here's a simple driver program:

import sys
from imp_parser import *

if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.stderr.write('usage: %s filename parsername' % sys.argv[0])
        sys.exit(1)
    filename = sys.argv[1]
    file = open(filename)
    characters = file.read()
    file.close()
    tokens = imp_lex(characters)
    parser = globals()[sys.argv[2]]()
    result = parser(tokens, 0)
    print result

This program will read a file (first argument) and parse it with one of the parsers in imp_parse.py (second argument). For example:

$ echo '1 + 2 * 3' >foo.imp
$ python imp_parser_driver.py foo.imp aexp
Result(BinopAexp(+, IntAexp(1), BinopAexp(*, IntAexp(2), IntAexp(3))), 5)

This should provide a good sandbox for experimentation.

Conclusion

In this article, we built a parser combinator library from scratch and used it to write a parser for IMP. In the next and final article in the series, we'll write an evaluator for our parsed AST.

Once again, all the source code for the interpreter is available for download.