Intro Link to heading
In this chapter we’re going to build a parser for our Tiger language. First, let’s do a quick recap of some important concepts of the theory behind programming language parsers:
- Parser generator is the most common type of compiler-compiler’s. It takes some formal grammar (typically it is a context-free grammar in BNF or EBNF form), that defines a syntax of a programming language.
Tiger grammar with Menhir Link to heading
While reading the current paragraph I highly recommend consulting the Tiger Language Reference Manual that has a precise description (along with a BNF notation) of everything we’re going to define below.
Menhir is an LR(1) parser generator library for OCaml. It
integrates with Dune quiet nicely. All we need to do is to add
the menhir
stanza to our dune
file (the one from the
previous chapter). So the whole file will look like this:
(menhir
(modules parser)
(flags ("--dump" "--explain")))
(ocamllex
(modules lexer))
(library
(name ch3)
(inline_tests)
(libraries core stdio)
(preprocess (pps ppx_inline_test ppx_expect)))
(env (dev (flags (:standard -warn-error -A))))
Notice the --dump
and --explain
switches:
- The
--dump
switch means to write a description of the automaton to the.automaton
file. - The
--explain
switch helps us to understand severe conflicts in terms of a grammar (rather than in terms of automaton), enabling it means to write a textual explanation of detected shift-reduce conflicts to the.conflicts
file. Here is the example of how it looks.
See the conflicts part of the Menhir manual for details.
We’ll use the new syntax for rules (despite the fact that it is considered experimental). First, let’s define some token aliases, priorities and associativity levels. Basically these are the same tokens we used in our lexer.
Base keywords:
%token TYPE "type"
%token VAR "var"
%token FUNCTION "function"
%token BREAK "break"
%token OF "of"
%token END "end"
%token IN "in"
%token NIL "nil" (* nil denotes a value belonging to every record type *)
%token LET "let"
%token ARRAY "array"
Loop-related keywords:
%token DO "do"
%token TO "to"
%token FOR "for"
%token WHILE "while"
Keywords for conditional expression:
%token IF "if"
%token THEN "then"
%token ELSE "else"
Operator tokens:
(* General *)
%token ASSIGN ":="
(* Logical *)
%token OR "|"
%token AND "&"
(* Comparison *)
%token GE ">="
%token GT ">"
%token LE "<="
%token LT "<"
%token NEQ "<>"
%token EQ "="
(* Arithmetics *)
%token DIVIDE "/"
%token TIMES "*"
%token PLUS "+"
%token MINUS "-"
Tokens for separators:
%token DOT "."
%token LBRACE "{"
%token RBRACE "}"
%token LBRACK "["
%token RBRACK "]"
%token LPAREN "("
%token RPAREN ")"
%token SEMICOLON ";"
%token COLON ":"
%token COMMA ","
Strings, numbers, identifiers and the EOF
token:
%token <string> STRING "string"
%token <int> INT "int"
%token <string> ID "id"
%token EOF
Associativity of operators:
%nonassoc "of"
%nonassoc "then"
%nonassoc "else"
%nonassoc "do"
%nonassoc ":="
%left "|"
%left "&"
%nonassoc ">=" ">" "<=" "<" "<>" "="
%left "+" "-"
%left "*" "/"
The grammar rules:
%start <unit> main
%%
let main :=
~ = expr; EOF; <>
Top-level expression:
let expr :=
| primitive
| "nil"
| "break"
| create_rec
| create_arr
| lvalue
| assignment
| local
| conditional
| loop
| fun_call
| unary
| binary
| seq
It might not be obvious from the expr
definition, but it also
includes a thing called no value
(an expression that yields no
value). So when looking at the top-level expression and
comparing it to the language reference manual, please note that
in our grammar no_val := "(" ")"
(empty seq
).
There are two built-in (predefined) primitive types: int
and
string
. The grammar rule for those is pretty straightforward:
let primitive :=
| "string"; { () }
| "int"; { () }
In our language we have only one unary operator – minus:
let unary := "-"; expr
But we have two kinds of binary operators: boolean
and
“logical + arithmetic”, which we’ll simply call bin
:
let binary :=
| bin
| boolean
Scroll down to see the definitions of bin
and boolean
.
We have 2 kinds of loops: the while
loop and the for
loop:
let loop :=
| while_loop
| for_loop
let while_loop := "while"; expr; "do"; expr
let for_loop := "for"; "id"; ":="; expr; "to"; expr; "do"; expr
Nothing new or unusual. And the conditional expression rule is also quite trivial:
let conditional :=
| "if"; expr; "then"; expr; "else"; expr
| "if"; expr; "then"; expr
Syntax for local bindings is going to be exactly like in OCaml or SML:
let local := "let"; decs; "in"; expr_seq; "end"
Now, a slightly more complex part – declarations.
As stated in the Tiger language reference manual, a declaration-sequence is a sequence of type, value, and function declarations. No punctuation separates or terminates individual declarations.
In terms of Menhir we can express it like this:
let decs := list(dec); { () }
let dec :=
| ty_dec (* type *)
| var_dec (* value *)
| fun_dec (* function declaration *)
Ok, we need to define a separate rule for each declaration mentioned above.
To declare a data type in the Tiger language we start with the
keyword type
, after which comes the identifier, the equality
token and the type declaration body:
let ty_dec := "type"; "id"; "="; ty
The type declaration could be a record, an array or a type alias (any identifier):
let ty :=
| braced(ty_fields) (* records *)
| "array"; "of"; "id"; { () } (* arrays *)
| "id"; { () }
let ty_fields := separated_list(",", ty_field); { () }
let ty_field := "id"; ty_ann
As per language reference manual, there are two possible ways to declare a variable: with or without type annotation. The following rule reflects this:
let var_dec :=
| "var"; "id"; ":="; expr
| "var"; "id"; ty_ann; ":="; expr
The record and array creation rules below are also considered top-level expressions:
let create_rec := "id"; braced(init_rec_fields)
let create_arr := "id"; bracketed(expr); "of"; expr
let init_rec_fields := separated_list(",", init_rec_field); { () }
let init_rec_field := "id"; "="; expr
Next, we need a function declaration rule:
let fun_dec :=
(* procedures doesn't return values *)
| fun_head; "="; fun_body
(* functions return values and the type is specified after the colon *)
| fun_head; ":"; "id"; "="; fun_body
let fun_head := "function"; "id"; fun_params
let fun_body := expr
let fun_params := parenthesized(ty_fields)
According to our language spec:
> An l-value
is a location, whose value may be read or assigned.
> Variables, procedure parameters, fields or records, and
elements are all l-values
.
let lvalue :=
(* variable *)
| "id"; { () }
(* everything else *)
| lvalue_t
let lvalue_t :=
(* simple record field *)
| "id"; "."; "id"; { () }
(* compound record field *)
| lvalue_t; "."; "id"; { () }
(* simple array subscript *)
| "id"; bracketed(expr)
(* compund array subscript *)
| lvalue_t; bracketed(expr)
Assignment of an expression to lvalue
:
let assignment := lvalue; ":="; expr
Sequence of expressions delimited by semicolon:
let expr_seq := separated_list(";", expr); { () }
let seq := parenthesized(expr_seq)
Function call rule:
let fun_call := "id"; parenthesized(fun_args)
let fun_args := separated_list(",", expr); { () }
Arithmetic, comparison and equality expressions:
let bin := expr; bin_op; expr
let bin_op == "+" | "-" | "*" | "/" | ">=" | ">" | "<=" | "<" | "<>" | "="
Boolean expressions:
let boolean := expr; boolean_op; expr
let boolean_op == "&" | "|"
Helper rule for a type annotation which we used in the var_dec
and ty_field
above:
let ty_ann := ":"; "id"
Menhir allows us to declare functions, so these are three helper functions that we used everywhere else for delimited expressions:
let parenthesized(x) == delimited("(", x, ")")
let bracketed(x) == delimited("[", x, "]")
let braced(x) == delimited("{", x, "}")
Test suite Link to heading
In order to iterate on our parser quickly we want to automate the process of testing it against the provided example programs. The most convenient way to do that would be to create a test suite that we could run after every change we make.
Some of the provided example Tiger programs are intentionally broken, so we want to skip those:
let skipped =
["test16.tig"; "test17.tig"; "test19.tig";
"test20.tig"; "test25.tig";
"test45.tig"; "test49.tig"]
A couple of helper functions to list the example programs (recursively):
let is_tig_ext filename =
let (_, ext) = Filename.split_extension filename in
match ext with
| Some "tig" -> true
| _ -> false
let is_tig_file f =
Sys.is_file_exn ~follow_symlinks:true f &&
is_tig_ext f
let rec ls_rec dir =
if is_tig_file dir
then
if not(List.mem skipped (Filename.basename dir) ~equal:(=)) then
[dir]
else
[]
else
dir
|> Sys.ls_dir
|> List.concat_map
~f:(fun sub -> ls_rec (Filename.concat dir sub))
Let’s add another helper function to get current position along with a filename as a string. We’ll need it later to output an error location:
let position lexbuf =
let pos = lexbuf.lex_curr_p in
let col = pos.pos_cnum - pos.pos_bol + 1 in
Printf.sprintf "%s:%d:%d" pos.pos_fname pos.pos_lnum col
And another two functions to run the parser and print the error details in case of failure:
let parse_with_error lexbuf =
try
Parser.main Lexer.read lexbuf;
assert_bool "Ok" true
with
| SyntaxError msg ->
assert_failure (position lexbuf ^ " : " ^ msg)
| Parser.Error ->
assert_failure ("Syntax error: " ^ position lexbuf)
let parse filename ch =
let lexbuf = Lexing.from_channel ch in
lexbuf.lex_curr_p <- {
lexbuf.lex_curr_p with pos_fname = Filename.basename filename
};
parse_with_error lexbuf
Here is the function that we’re going to call for each .tig
file:
let run_parser filename _ =
In_channel.with_file filename ~f:(parse filename)
Finally, the test suite is going to look like this:
let suite =
"tiger programs" >:::
let tests_dir = "../../../book" ^ Filename.dir_sep ^ "testcases" in
let tests_path = Filename.(concat parent_dir_name tests_dir) in
let tig_files = ls_rec tests_path in
(List.map ~f:
(fun filename ->
Filename.basename filename >:: run_parser filename)
tig_files)
let () =
run_test_tt_main suite
The dune file for our test suite project will simply run the
testsuite.exe
executable:
(executable
(name testsuite)
(libraries base core ch3 oUnit))
(alias
(name runtest)
(action (run ./testsuite.exe)))
Now it’ll be much easier to work on the parser implementation. Also this test suite might be useful to detect possible regressions if we want to change the parser in the future.
Conclusion Link to heading
Of course, I didn’t come up with this grammar right away. I spent almost a week implementing this parser. I started with a simple parser for “straightline programs” and iterated on it until it evolved to the fully functional Tiger-language grammar. And it took some time for me to figure out how to resolve all the shift-reduce conflicts.
Another cool feature of Menhir is its REPL, which might be helpful for debugging. This is especially useful when you already caught a parser error by running the test suite and now you want to try constructing an invalid expression to check your hypothesis. I used this kind of workflow a lot while working on this grammar.
You can run the interpreter like this:
menhir --interpret --interpret-show-cst parser.mly
Now you can start typing expressions until you find something that breaks your parser. Note that you can’t use token aliases inside this REPL.
For example, here is how you can test a function declaration:
LET FUNCTION ID LPAREN ID COLON ID RPAREN COLON ID EQ IF ID EQ INT THEN INT ELSE INT TIMES ID LPAREN ID MINUS INT RPAREN IN ID LPAREN INT RPAREN END
That’s it for now. The full source code for this chapter is here. Next time we’re going to work on the AST for our Tiger language.