How Python parses white space

Published on 2017-06-17
Tagged: compilers python

One of the most distinct and remarkable things about Python's syntax is its use of white space. If you've ever wanted to write a parser for Python or a similar language, you might find the grammar a little confusing. I recently replaced Gypsum's layout analysis with a Python-style white space parser, so I'll share what I've learned.

What white space means in Python

Python uses white space for two things: newlines terminate logical lines, and changes in indentation delimit blocks. Both of these behaviors are somewhat contextual.

Python distinguishes physical lines from logical lines. A physical line is an actual line of text: a string of characters, terminated by a newline. A logical line is a syntactic entity: a statement or part of a definition. Multiple physical lines may be joined together into a logical line if the newlines separating them are escaped with backslashes or if the newlines occur within a nest of parentheses or brackets.

# Two logical lines
x = foo()
y = bar()

# One logical line, escaped with backslash
x = foo() + \
    bar()

# One logical line, nested newline.
x = (foo() +
     bar())

Indentation at the beginning of a logical line is syntactically significant. If the space at the beginning of a line is longer than the space at the beginning of the previous line, that signifies the beginning of a block. If the space is less, that signifies the end of possibly multiple blocks. Only the space at the beginning of a logical line matters; continued physical lines can't start or end a block. When ending a block, the indentation must match some earlier indentation level.

if a:     # outer block
  if b:   # inner block
    pass  # inner-inner block
else:     # outer block again (dedent two levels)
  pass

Pros and cons of white space

Much has been said this, and I don't want to get into a holy war, but I do want to highlight a couple points. Significant white space means there's no need for semicolons and balanced pairs of braces. In my opinion, this makes code easier to read because there's less syntactic noise. It's nice to get rid of long slopes of closing braces. It's a matter of taste though. More importantly, it's difficult to write badly indented code because the interpreter will reject it. For teams that don't enforce a uniform coding style, this may be the only thing keeping code readable.

There are some disadvantages, of course. It's more difficult to use auto-formatting tools in Python, since you need to actually type white space correctly. In Go, you can type a bunch of code without worrying about whether it lines up, then let gofmt take care of it when you save the file. For me though, the biggest problem with Python is the lack of multi-line lambdas. If you're doing some functional programming or if you want to pass a callback to an asynchronous function, you will want to pass a lambda value as an argument. In other languages, it's not unusual for this lambda to be several lines long.

# Does not work in Python
list = [1, 2, 3, 4, 5]
map(lst, lambda x:
  if x % 3 == 0:
    return "fizz"
  elif x % 5 == 0:
    return "buzz"
  elif x % 15 == 0:
    return "fizzbuzz"
  else:
    return x)

Some languages (Kotlin, Swift, Ruby) have special syntax for passing a block as the last argument to a function. It would be difficult to add similar syntactic sugar to Python though, since code like this would be unreadable without braces.

How Python is parsed

As in most systems, Python code is read with a lexer and a parser. The lexer splits the code into tokens (keywords, identifiers, numbers, etc.), and the parser assembles the tokens into an abstract syntax tree. Most of the white space magic is in the lexer, which emits three special tokens: NEWLINE, INDENT, and DEDENT. The parser reads these as the end of a logical line, and the beginning and end of a block, respectively.

So how does the lexer work? It's implemented in tokenizer.c if you'd like see the details. tok_get is the primary function.

Let's talk about newlines first, since they're pretty straightforward. The lexer interprets an LF character or a CR LF sequence as a physical newline (CR alone is not valid). If the newline comes at the end of a line with only white space or a comment, the lexer will ignore it and proceed to the next line. The lexer also keeps a count (level) of the number of brackets ('(', '[', '{' characters) it has seen without closing brackets. If this count is non-zero, the lexer will ignore the newline. The lexer will also ignore a newline if it is escaped with a backslash. If the newline is not ignored for any of the reasons above, the lexer will emit a NEWLINE token, which marks the end of a logical line.

Let's talk about indentation next. atbol is a flag that indicates whether the lexer is at the beginning of a line. This flag is set when the lexer is initialized and after each NEWLINE token is emitted. If atbol is set when tok_get is called, the lexer will count the white space columns at the beginning of the line. For indentation purposes, blank lines are ignored but comments are not. Space characters count as one column. Tab characters count as one, then round up to the nearest multiple of eight. The lexer compares the column count to the top element of an indentation stack, which has a count for each nested block. If the count is equal, there was no change in indentation. If the count is greater, the lexer emits an INDENT token and pushes the column count on the stack. If the count is less, the lexer pops the stack until it finds an equal count. If there is no equal count, an error is reported. The lexer will emit a DEDENT token for each count popped from the stack.

You might wonder how Python enforces consistency between tabs and spaces. After all, many editors don't treat tabs as 8 columns, so what Python might interpret as properly formed code could look like nonsense in some environments. Python actually maintains two parallel column counts and indentation stacks. col is the count described above. altcol is an alternate count that treats tabs as 1 column instead of 8. When the indentation of the current line is compared to the stack, both col and altcol need to match. There can still be some inconsistency. For example, Python counts eight spaces followed by one tab the same as one tab followed by eight spaces. Personally, I think a better approach would be to store white space strings in a stack and compare those literally. Gypsum's lexer does this now.

Commentary

I've been a fan of the way Python incorporates white space into its syntax since I first learned the language. I implemented similar white space handling into Gypsum, and I've always been surprised more languages didn't do the same. I never actually understood why anyone would be against it until I started researching this article and read about multi-line lambdas.

The real problem is the way the newline-in-brackets rule interacts with indentation. This prevents you from having anything resembling multi-line lambdas as function call arguments, which is where you really want them. I haven't written a great deal of asynchronous Python code, so I was never really bitten by this, but it seems like a very important use case, and I'm not sure I'm willing to sacrifice that in Gypsum. If you want to be able to emit NEWLINE, INDENT, and DEDENT inside brackets, you need some kind of syntax awareness, which is what Gypsum's layout analysis was. However, this led to a bunch of complicated pattern-matching rules that produced a lot of hard-to-understand errors. The Zen of Python says "Explicit is better than implicit", and "Simple is better than complex". Maybe that applies here.