# Top down operator precedence parsing in Go

One of the key steps in the interpretation/compilation process of a language is the parsing step. This takes the source code of your program (a string) and turns it into what is called an Abstract Syntax Tree (AST). The AST is a tree-like data structure representing the programming instructions contained in your source code. Interpreters, compilers, code analyzers, and any other software handling source code will generally work with an AST representation of that code.

What we'll be implementing over the course of this article is a parser for a small dynamic programming language. It will consist of only a few constructs, but they'll be more than enough to get a good taste of how a parser works. Our language will have the following:

• numbers, strings, booleans
• arrays, tuples
• if/else statements
• while loops
• functions

I picked Go as the language for this project not because it is particularly suited for the job but because I wanted to try a new language, and Go seemed like a decent choice. It seems to be gaining popularity and I wanted to give it a shot. It's not a particularly mind-blowing language, but I personally enjoyed using it.

You can find the full source code for this project on github.

## The parser

There are various parsing algorithms available. Some popular examples include hand coded recursive descent parsers, auto-generated LL parsers, or auto-generated LR parsers. However, the parser we will be implementing is called a top down operator precedence parser. This is a method proposed by Vaughan Pratt in 1973 and is an improved form of recursive descent. It takes less code than recursive descent and is faster.

### Scanning/Lexing

An AST is made of nodes where each node is either a token itself or some other object associated with a token. A token is a basic abstraction built on top of the characters contained in the source code of a program. A lexer (alternatively a scanner) is a program which takes in a string of characters and produces a stream of tokens according to a set of predefined rules. A line of code like:

var x = 5;


will be tokenized into a list of objects looking something like:

[var-stmt, identifier, assignment, number-literal, semicolon]


This process is actually fairly simple and I won't go much into it. A lexer simply goes through the source code character by character and returns a token object when it identifies one. In essence it boils down to things like `if (c == '+') return new PlusToken();`. You can see my lexer implementation for this project on github.
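To make this concrete, here is a toy version of such a lexer (my own sketch, not the project's implementation; the `(IDENT)`/`(NUMBER)` symbols follow the convention used later in this article):

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenize is a minimal sketch of a lexer: it walks the input rune by
// rune and emits a symbol for each token it recognizes. A real lexer
// would also track line/column positions and handle strings, comments, etc.
func tokenize(src string) []string {
	toks := []string{}
	runes := []rune(src)
	for i := 0; i < len(runes); {
		c := runes[i]
		switch {
		case unicode.IsSpace(c):
			i++ // skip whitespace
		case c == '+' || c == '=' || c == ';':
			toks = append(toks, string(c))
			i++
		case unicode.IsDigit(c):
			j := i
			for j < len(runes) && unicode.IsDigit(runes[j]) {
				j++
			}
			toks = append(toks, "(NUMBER)")
			i = j
		case unicode.IsLetter(c):
			j := i
			for j < len(runes) && unicode.IsLetter(runes[j]) {
				j++
			}
			if word := string(runes[i:j]); word == "var" {
				toks = append(toks, word) // keywords keep their own symbol
			} else {
				toks = append(toks, "(IDENT)")
			}
			i = j
		default:
			panic(fmt.Sprint("unexpected character: ", string(c)))
		}
	}
	return toks
}

func main() {
	fmt.Println(tokenize("var x = 5;")) // [var (IDENT) = (NUMBER) ;]
}
```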

What's interesting however about our tokens and our TDOP parser is that our parsing rules are in fact associated with our tokens. This is one of the core ideas of TDOP parsing.

### Top down operator precedence

We define a token type as a struct:

type nudFn func(*token, *parser) *token

type ledFn func(*token, *parser, *token) *token

type stdFn func(*token, *parser) *token

type token struct {
	sym          string
	value        string
	line         int
	col          int
	bindingPower int
	nud          nudFn
	led          ledFn
	std          stdFn
	children     []*token
}


The first four fields are somewhat self-explanatory: the unique symbol which identifies the token, "sym" (i.e. "*", "-", "var", etc.), its value (a string representation of its data), and its line and column index in the source file.

A token will in fact also be a node in our AST, so it maintains references to its children nodes. As an example, a plus node will have two children: the expression on the left and the one on the right. The plus node in A-B+C will have A-B as its first child and C as its second child. A-B will in turn be a minus node with A and B as its two children. The AST looks like this:

      +
     / \
    -   C
   / \
  A   B


The other four values (binding power, nud, led, std) contained in this struct are the more interesting bits of information. They describe the parsing rules of our language. Take the example above: A-B+C. How can a parser know whether to parse this as (A-B)+C or as A-(B+C)? The solution is in the binding power of each token.

The higher the binding power of a token the tighter it binds to tokens next to it. In our example both plus and minus will have the same binding power. Because the binding power of plus is not strictly greater than the binding power of minus, the minus token will be the one to bind to both A and B. Thus the parse tree will be (A-B)+C, i.e. the right one.

Let's now take A-B*C as an example. We need this to be parsed as A-(B*C). To achieve this we simply set the binding power of "*" to be greater than the binding power of minus. The multiplication operator will then have priority over minus when binding to B.

In code, this precedence behavior is captured by the heart of a TDOP parser, the expression function. Let's define a parser as a struct holding a reference to a lexer:

type parser struct {
lexer *lexer
}


and the expression function as a method on the parser struct:

func (self *parser) expression(rbp int) *token {
	var left *token
	t := self.lexer.next()
	left = t.nud(t, self)
	for rbp < self.lexer.peek().bindingPower {
		t := self.lexer.next()
		left = t.led(t, self, left)
	}
	return left
}


It's in this function that you'll also notice two other elements we've yet to address: the nud and the led functions. These two functions and the binding power idea combine to generate the short but very powerful expression function.

The nud method on a token is available on values (variables and literals) and prefix operators. The led method is available on infix and suffix operators. A nud method does not care about the tokens to the left. A led method does. Let's see how these come into play via an example.

Let's walk through A-B*C again. Let's say minus has a binding power of 10 and the multiplication operator 20. A, B, and C are variables and have nud methods defined. - and * are infix operators and have led methods defined. The first expression call is always expression(0). In simplified terms the function calls then happen as follows:

expression(0):

• nud call on A -> returns itself, A
• 0 < 10, the binding power of the next token, minus, so the loop is entered
• led call on minus, taking A as the token on the left
• the led function will:
  • save A as a child of minus
  • call expression(10), as 10 is the binding power of minus, and save the result as its second child

expression(10):

• nud call on B -> returns itself, B
• 10 < 20, the binding power of the next token, *, so the loop is entered
• led call on * taking B as the token on the left
• the led function will:
  • save B as a child of *
  • call expression(20) and save the result as its second child

expression(20):

• nud call on C -> returns itself, C
• 20 is not smaller than 0, the binding power of the next token (expression end)
• the loop is not entered; return left, i.e. C

Back in expression(10), C becomes a child of *. The for loop exits because 10 is not smaller than 0, the binding power of the now next token, the end of the string. In expression(10), left becomes the * node and is returned back in expression(0). In expression(0), in a similar manner, the return value of expression(10), the * node, becomes a child of the minus node, and the for loop exits. The parsing is complete and the minus node, the top node, is returned.

This is the gist of a top down operator precedence parser and if you can wrap your head around it everything else from now on is just more of the same. We will set up some infrastructure to make the declaration of tokens easier but that's pretty much it.
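To see the whole mechanism in one runnable piece, here is a stripped-down miniature of the algorithm (my own sketch, not the project's code). Tokens are single characters: letters are values with a nud, and "+-*/" are infix operators with a led. The parse is returned fully parenthesized so the precedence decisions are visible:

```go
package main

import "fmt"

type tok struct {
	sym string
	bp  int
	nud func(t *tok, p *mini) string
	led func(t *tok, p *mini, left string) string
}

type mini struct {
	toks []*tok
	pos  int
}

func (p *mini) peek() *tok { return p.toks[p.pos] }
func (p *mini) next() *tok { t := p.toks[p.pos]; p.pos++; return t }

// expression is the same loop as in the article: nud for the first
// token, then led as long as the next token binds tighter than rbp.
func (p *mini) expression(rbp int) string {
	t := p.next()
	left := t.nud(t, p)
	for rbp < p.peek().bp {
		t := p.next()
		left = t.led(t, p, left)
	}
	return left
}

func lex(src string) []*tok {
	bps := map[rune]int{'+': 50, '-': 50, '*': 60, '/': 60}
	toks := []*tok{}
	for _, c := range src {
		if bp, ok := bps[c]; ok {
			// infix operator: parenthesize left and the right-hand expression
			toks = append(toks, &tok{sym: string(c), bp: bp,
				led: func(t *tok, p *mini, left string) string {
					return "(" + left + t.sym + p.expression(t.bp) + ")"
				}})
		} else {
			// value: the nud just returns the token's own symbol
			toks = append(toks, &tok{sym: string(c),
				nud: func(t *tok, p *mini) string { return t.sym }})
		}
	}
	// an end marker with binding power 0 always stops the loop
	return append(toks, &tok{sym: "(EOF)"})
}

func parse(src string) string {
	p := &mini{toks: lex(src)}
	return p.expression(0)
}

func main() {
	fmt.Println(parse("A-B+C")) // ((A-B)+C)
	fmt.Println(parse("A-B*C")) // (A-(B*C))
}
```

Equal binding powers come out left associative, and the higher binding power of "*" pulls B away from the minus, exactly as in the walkthrough above.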

Note that it's via this expression function that you can also detect when the syntax is bad. Say you have "*A+B". Inside the expression function you end up trying to call the nud method of "*". The multiplication operator however is an infix one and does not have a nud method defined. This is a syntax error and an exception/error of some sort can be thrown at this point.
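A minimal sketch of that check (a standalone toy with hypothetical names; the project may report errors differently):

```go
package main

import "fmt"

// A toy token carrying only what this example needs: a symbol and an
// optional nud. Infix-only operators like "*" leave nud nil.
type token struct {
	sym string
	nud func(*token) *token
}

// parsePrefix stands in for the start of the expression function: it
// dispatches to the token's nud and reports a syntax error when the
// token has none (i.e. it cannot start an expression).
func parsePrefix(t *token) (*token, error) {
	if t.nud == nil {
		return nil, fmt.Errorf("syntax error: unexpected %q in prefix position", t.sym)
	}
	return t.nud(t), nil
}

func main() {
	ident := &token{sym: "(IDENT)", nud: func(t *token) *token { return t }}
	star := &token{sym: "*"} // infix-only: no nud defined

	_, err := parsePrefix(star)
	fmt.Println(err) // parsing "*A+B" would fail right here, on "*"
	if res, err := parsePrefix(ident); err == nil {
		fmt.Println(res.sym, "is fine in prefix position")
	}
}
```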

## Token definition and creation

To make both the creation and declaration of tokens easier we will create a token registry:

type tokenRegistry struct {
	symTable map[string]*token
}


A register method will add token definitions to the symbol table mapping tokens to their identifying string symbol (we call this method to define tokens).

func (self *tokenRegistry) register(sym string, bp int, nud nudFn, led ledFn, std stdFn) {
	if val, ok := self.symTable[sym]; ok {
		if nud != nil && val.nud == nil {
			val.nud = nud
		}
		if led != nil && val.led == nil {
			val.led = led
		}
		if std != nil && val.std == nil {
			val.std = std
		}
		if bp > val.bindingPower {
			val.bindingPower = bp
		}
	} else {
		self.symTable[sym] = &token{bindingPower: bp, nud: nud, led: led, std: std}
	}
}


A token method will then generate new instances of these tokens whenever they are requested (the lexer calls this method to generate tokens).

func (self *tokenRegistry) token(sym string, value string, line int, col int) *token {
	return &token{
		sym:          sym,
		value:        value,
		line:         line,
		col:          col,
		bindingPower: self.symTable[sym].bindingPower,
		nud:          self.symTable[sym].nud,
		led:          self.symTable[sym].led,
		std:          self.symTable[sym].std,
	}
}


Note how in the register method we supplied nud, led, std, and bindingPower values. These are common across different token instances and saved in tokens inside the symTable. The token method takes the other pieces of information as arguments, combines them with what already exists in the symTable and returns a new token instance.

Because the desired behavior of most infix and prefix operators is usually the same, we can define the following helper functions that supply default led and nud functions:

// a prefix token has a single child, the expression that follows
func (self *tokenRegistry) prefix(sym string) {
	self.register(sym, 0, func(t *token, p *parser) *token {
		t.children = append(t.children, p.expression(100))
		return t
	}, nil, nil)
}

// an infix token has two children, the exp on the left and the one that follows
func (self *tokenRegistry) infix(sym string, bp int) {
	self.register(sym, bp, nil, func(t *token, p *parser, left *token) *token {
		t.children = append(t.children, left)
		t.children = append(t.children, p.expression(t.bindingPower))
		return t
	}, nil)
}


Some tokens will simply be consumed by the parser (like ";") while others will simply return a reference to themselves (literals, variable names, etc). These situations are covered by the following helper functions:

func (self *tokenRegistry) symbol(sym string) {
	self.register(sym, 0, func(t *token, p *parser) *token {
		return t
	}, nil, nil)
}

func (self *tokenRegistry) consumable(sym string) {
	self.register(sym, 0, nil, nil, nil)
}


This is all the infrastructure we need to define large portions of our language.

### Basic types and operators

First off, the basic types, strings, numbers, booleans, nulls, as well as variable names and a few consumables.

t := &tokenRegistry{symTable: make(map[string]*token)}
t.symbol("(IDENT)")
t.symbol("(NUMBER)")
t.symbol("(STRING)")

t.symbol("true")
t.symbol("false")
t.symbol("none")

t.consumable(";")
t.consumable(")")
t.consumable("]")
t.consumable(",")
t.consumable("}")


With four more lines we can add support for most numerical operations.

t.infix("+", 50)
t.infix("-", 50)
t.infix("*", 60)
t.infix("/", 60)


A few more lines and we have comparisons as well:

t.infix("<", 30)
t.infix(">", 30)
t.infix("<=", 30)
t.infix(">=", 30)
t.infix("==", 30)


Let's add support for a Python like ternary operator using a custom led function:

t.infixLed("if", 20, func(token *token, p *parser, left *token) *token {
	cond := p.expression(0)
	token.children = append(token.children, cond)
	token.children = append(token.children, left)
	p.advance("else")
	token.children = append(token.children, p.expression(0))
	return token
})


Here's an example of what this looks like:

num = 0 if x < 100 else 1;


The advance method on the parser simply consumes a token of the given type. If the next token is not what it expects it panics as this represents a syntax error.

func (self *parser) advance(sym string) *token {
	line := self.lexer.line
	col := self.lexer.col
	token := self.lexer.next()
	if token.sym != sym {
		panic(fmt.Sprint("EXPECTED ", sym, " AT ", line, ":", col))
	}
	return token
}


Some prefix operators:

t.prefix("-")
t.prefix("not")


All infix operators defined so far are left associative. We can make right associative operators by reducing the right binding power by one. With "=" at binding power 10, for example, a = b = c then parses as a = (b = c): the led of the first "=" calls expression(9), and since the next "=" has binding power 10, which is greater than 9, it gets to bind to b first. Notice the minus one:

func (self *tokenRegistry) infixRight(sym string, bp int) {
	self.register(sym, bp, nil, func(t *token, p *parser, left *token) *token {
		t.children = append(t.children, left)
		t.children = append(t.children, p.expression(t.bindingPower-1))
		return t
	}, nil)
}


Some right associative operators:

t.infixRight("and", 25)
t.infixRight("or", 25)
t.infixRight("=", 10)
t.infixRight("+=", 10)
t.infixRight("-=", 10)


### Tuples

Tuples are comma delimited values inside parentheses. As parentheses can also be used to group expressions, we need to differentiate between them. If there's nothing between the parentheses then we treat that as an empty tuple (this is later used in function declarations). If there's a comma then it is a tuple as well, i.e. (a,) or (a, b, c) etc. Otherwise it is a simple grouping and the parentheses get ignored.

t.prefixNud("(", func(t *token, p *parser) *token {
	comma := false
	if p.lexer.peek().sym != ")" {
		for {
			if p.lexer.peek().sym == ")" {
				break
			}
			t.children = append(t.children, p.expression(0))
			if p.lexer.peek().sym != "," {
				break
			}
			p.advance(",")
			comma = true
		}
	}
	p.advance(")")
	if len(t.children) == 0 || comma {
		t.sym = "()"
		t.value = "TUPLE"
		return t
	}
	return t.children[0]
})


### Functions

In our language function declarations look like this:

sum = (a, b) -> a + b;
add_one = x -> x + 1;
mult_and_sum = (a, b) -> {
    m = a * b;
    return (m, a+b);
}
print_hello = () -> {
    print("Hello World!!");
}


To parse functions we declare a new infix operator "->" with a custom led function. The function will also check that the left hand side of the operator is valid for a function declaration (only identifiers, empty tuples and tuples containing identifiers are allowed).

t.infixRightLed("->", 10, func(token *token, p *parser, left *token) *token {
	if left.sym != "()" && left.sym != "(IDENT)" {
		panic(fmt.Sprint("INVALID FUNC DECLARATION TUPLE ", left))
	}
	if left.sym == "()" && len(left.children) != 0 {
		named := true
		for _, child := range left.children {
			if child.sym != "(IDENT)" {
				named = false
				break
			}
		}
		if !named {
			panic(fmt.Sprint("INVALID FUNC DECLARATION TUPLE ", left))
		}
	}
	token.children = append(token.children, left)
	if p.lexer.peek().sym == "{" {
		token.children = append(token.children, p.block())
	} else {
		token.children = append(token.children, p.expression(0))
	}
	return token
})


### Statements

The std method of a token handles statements. This is like the nud method on prefix operators with the only difference being it can only be called at the beginning of a statement (and thus not nestable inside expressions). We use the following helper function to declare statements:

func (self *tokenRegistry) stmt(sym string, std stdFn) {
	self.register(sym, 0, nil, nil, std)
}


The if statement allows for conditional execution of a block of statements. If there's an else symbol after the if block we either parse the next block or the next if statement:

t.stmt("if", func(t *token, p *parser) *token {
	t.children = append(t.children, p.expression(0))
	t.children = append(t.children, p.block())
	next := p.lexer.peek()
	if next.value == "else" {
		p.lexer.next()
		next = p.lexer.peek()
		if next.value == "if" {
			t.children = append(t.children, p.statement())
		} else {
			t.children = append(t.children, p.block())
		}
	}
	return t
})


An if statement example:

if a == true and b == not true {
    do_stuff();
} else if not a and b {
    do_other_stuff();
} else {
    do_this_stuff();
}


The block method on the parser simply consumes a block of statements via the std method defined on the "{" token. It panics if the next token is not a "{" token.

func (self *parser) block() *token {
	tok := self.lexer.next()
	if tok.sym != "{" {
		panic(fmt.Sprint("WAS LOOKING FOR BLOCK START", tok))
	}
	block := tok.std(tok, self)
	return block
}


The block start token, "{", is registered as:

t.stmt("{", func(t *token, p *parser) *token {
	t.children = append(t.children, p.statements()...)
	return t
})


The statements method parses all statements up to the end of the file, or to a block closing token, "}".

func (self *parser) statements() []*token {
	stmts := []*token{}
	next := self.lexer.peek()
	for next.sym != "(EOF)" && next.sym != "}" {
		stmts = append(stmts, self.statement())
		next = self.lexer.peek()
	}
	return stmts
}


The statement (not statements) method parses either the next statement or the next expression. Statements are parsed by calling the std method of a token. If this std method is not defined an expression is parsed instead.

func (self *parser) statement() *token {
	tok := self.lexer.peek()
	if tok.std != nil {
		tok = self.lexer.next()
		return tok.std(tok, self)
	}
	res := self.expression(0)
	return res
}


A while statement executes a block of statements in a loop as long as a condition is true. The while keyword is followed by a condition and a block of statements. This we can define as:

t.stmt("while", func(t *token, p *parser) *token {
	t.children = append(t.children, p.expression(0))
	t.children = append(t.children, p.block())
	return t
})


Arrays are a list of comma delimited expressions found inside square brackets. We register the "[" token as a prefix token with a custom nud function.

t.prefixNud("[", func(t *token, p *parser) *token {
	if p.lexer.peek().sym != "]" {
		for {
			if p.lexer.peek().sym == "]" {
				break
			}
			t.children = append(t.children, p.expression(0))
			if p.lexer.peek().sym != "," {
				break
			}
			p.advance(",")
		}
	}
	p.advance("]")
	t.sym = "[]"
	t.value = "ARRAY"
	return t
})


### Examples

Take the following program:

x = 100;
y = 200;
sum = (a, b) -> a + b;
sum(x, y);


This will generate the following tree:

(=
  (sum)
  (->
    (TUPLE
      (a)
      (b))
    (+
      (a)
      (b))))

((
  (sum)
  (x)
  (y))


Use the indentation levels to orient yourself. The trees for the last two statements are shown (the two number assignments at the top produce similar, smaller trees). The first one is an assignment expression. The assignment operator has two children: one is the identifier "sum", the other is a function token. The function operator has two children in turn: the tuple of formal arguments and an addition expression. The second statement is a function call. The top node is the infix operator "(". Its first child is the name of the function; the other children are the arguments passed to the function.
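These lisp-like printouts can be produced by a small recursive walker along the following lines (a sketch of my own; the project's printing code may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// A minimal token for the demo: the article's other fields are omitted.
type token struct {
	sym      string
	value    string
	children []*token
}

// dump renders a token tree in the lisp-like notation used above: each
// node prints its value (falling back to its symbol), and its children
// are printed indented one level deeper.
func dump(t *token, depth int) string {
	label := t.value
	if label == "" {
		label = t.sym
	}
	out := strings.Repeat("  ", depth) + "(" + label
	for _, c := range t.children {
		out += "\n" + dump(c, depth+1)
	}
	return out + ")"
}

func main() {
	// sum = (a, b) -> a + b;  (tree built by hand for the demo)
	tree := &token{value: "=", children: []*token{
		{value: "sum"},
		{value: "->", children: []*token{
			{value: "TUPLE", children: []*token{{value: "a"}, {value: "b"}}},
			{value: "+", children: []*token{{value: "a"}, {value: "b"}}},
		}},
	}}
	fmt.Println(dump(tree, 0))
}
```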

Another example, a self-executing function:

(s -> {
    text = "Hello " + s + "!!!";
    print(text);
})("John Doe");


gets parsed into

((
  (->
    (s)
    ({
      (=
        (text)
        (+
          (+
            (Hello )
            (s))
          (!!!)))
      ((
        (print)
        (text))))
  (John Doe))


Note the operator at the top is the infix function call operator, "(". So "((" is one parenthesis for the lisp-like grouping and another for the actual "(" operator.

An if block example:

if a == true and b == not true {
    do_stuff();
} else if a == not true and b == true {
    do_other_stuff();
} else {
    do_this_stuff();
}


gets the following AST:

(if
  (and
    (==
      (a)
      (true))
    (==
      (b)
      (not
        (true))))
  ({
    ((
      (do_stuff)))
  (if
    (and
      (==
        (a)
        (not
          (true)))
      (==
        (b)
        (true)))
    ({
      ((
        (do_other_stuff)))
    ({
      ((
        (do_this_stuff)))))


Note an if token has three children. The first one is the condition to be checked. The second one is the block to be executed if the condition is true. The third one is the block to be executed if the condition is false.

A while loop example like

i = 0;
list = [1, 2, 3];
while i < len(list) {
    do_stuff(list[i]);
    i += 1;
}


gets parsed into

(=
  (i)
  (0))

(=
  (list)
  (ARRAY
    (1)
    (2)
    (3)))

(while
  (<
    (i)
    ((
      (len)
      (list)))
  ({
    ((
      (do_stuff)
      ([
        (list)
        (i)))
    (+=
      (i)
      (1))))


Note a while loop has two children. The condition to be evaluated and the block to be executed if the condition is true.

## Conclusions

Top down operator precedence parsing is a very cool algorithm. It's easy to follow, easy to understand and implement, and it accomplishes a lot with very little code.

This article has been a review of my own implementation of the method. The entire source code is available on github. Note that in order to keep them shorter and more readable some code snippets present in this article have been truncated. I've also not included everything. I've skipped some tokens and completely ignored the lexer.
