Making of Tiger #3, Parsing

Intro

In this chapter we’re going to build a parser for our Tiger language. First, let’s do a quick recap of some important concepts from the theory behind programming-language parsers:

Tiger grammar with Menhir

While reading this section I highly recommend keeping the Tiger Language Reference Manual close at hand: it gives a precise description (including BNF notation) of everything we’re going to define below.

Menhir is an LR(1) parser generator for OCaml. It integrates with Dune quite nicely: all we need to do is add a menhir stanza to our dune file (the one from the previous chapter), so the whole file will look like this:

(menhir
 (modules parser)
 (flags ("--dump" "--explain")))

(ocamllex
 (modules lexer))

(library
 (name ch3)
 (inline_tests)
 (libraries core stdio)
 (preprocess (pps ppx_inline_test ppx_expect)))

(env (dev (flags (:standard -warn-error -A))))

Notice the --dump and --explain switches:

  • The --dump switch tells Menhir to write a description of the LR(1) automaton to an .automaton file.
  • The --explain switch helps us understand conflicts in terms of the grammar (rather than in terms of the automaton): with it enabled, Menhir writes a textual explanation of any detected shift-reduce conflicts to a .conflicts file. Here is an example of what it looks like.

See the conflicts part of the Menhir manual for details.
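
If you ever want to produce these files without going through Dune, Menhir can also be invoked on the grammar directly (a rough sketch; adjust the path to wherever your parser.mly lives):

menhir --dump --explain parser.mly

This writes parser.automaton and parser.conflicts next to the grammar, which is sometimes more convenient than digging them out of the _build directory.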

We’ll use the new syntax for rules (even though it is still considered experimental). First, let’s define some token aliases, priorities and associativity levels. These are essentially the same tokens we used in our lexer.

Base keywords:

%token TYPE     "type"
%token VAR      "var"
%token FUNCTION "function"
%token BREAK    "break"
%token OF       "of"
%token END      "end"
%token IN       "in"
%token NIL      "nil" (* nil denotes a value belonging to every record type *)
%token LET      "let"
%token ARRAY    "array"

Loop-related keywords:

%token DO    "do"
%token TO    "to"
%token FOR   "for"
%token WHILE "while"

Keywords for conditional expression:

%token IF   "if"
%token THEN "then"
%token ELSE "else"

Operator tokens:

(* General *)
%token ASSIGN ":="

(* Logical *)
%token OR  "|"
%token AND "&"

(* Comparison *)
%token GE  ">="
%token GT  ">"
%token LE  "<="
%token LT  "<"
%token NEQ "<>"
%token EQ  "="

(* Arithmetics *)
%token DIVIDE "/"
%token TIMES  "*"
%token PLUS   "+"
%token MINUS  "-"

Tokens for separators:

%token DOT       "."
%token LBRACE    "{"
%token RBRACE    "}"
%token LBRACK    "["
%token RBRACK    "]"
%token LPAREN    "("
%token RPAREN    ")"
%token SEMICOLON ";"
%token COLON     ":"
%token COMMA     ","

Strings, numbers, identifiers and the EOF token:

%token <string> STRING "string"
%token <int>    INT    "int"
%token <string> ID     "id"

%token EOF

Associativity of operators:

%nonassoc "of"
%nonassoc "then"
%nonassoc "else"
%nonassoc "do"
%nonassoc ":="
%left     "|"
%left     "&"
%nonassoc ">=" ">" "<=" "<" "<>" "="
%left     "+" "-"
%left     "*" "/"

The grammar rules:

%start <unit> main

%%

let main :=
  ~ = expr; EOF; <>
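
The ~ = expr pun binds the value produced by expr, and <> is the identity semantic action, so main simply returns whatever expr produced. For comparison, in the classic (non-experimental) syntax the same rule would read roughly like this (shown only for illustration, we won’t use it):

main: e = expr EOF { e }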

Top-level expression:

let expr :=
  | primitive
  | "nil"
  | "break"
  | create_rec
  | create_arr
  | lvalue
  | assignment
  | local
  | conditional
  | loop
  | fun_call
  | unary
  | binary
  | seq

It might not be obvious from the expr definition, but it also covers the no value expression (an expression that yields no value). So when comparing our top-level expression to the language reference manual, note that in our grammar no_val := "(" ")" (an empty seq).
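
If we wanted to spell that out as a separate rule, it would look something like the sketch below. We don’t actually need it, because seq (defined further down) already accepts "()": separated_list matches zero or more expressions, so the parentheses may be empty.

(* purely illustrative; not part of our grammar *)
let no_val := "("; ")"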

Tiger has two built-in (predefined) primitive types, int and string; the primitive rule matches literals of these types and is pretty straightforward:

let primitive :=
  | "string"; { () }
  | "int"; { () }

In our language we have only one unary operator – minus:

let unary := "-"; expr

But we have two kinds of binary operators: boolean ones and the “arithmetic + comparison” ones, which we’ll simply call bin:

let binary :=
  | bin
  | boolean

Scroll down to see the definitions of bin and boolean.

We have two kinds of loops: the while loop and the for loop:

let loop :=
  | while_loop
  | for_loop

let while_loop := "while"; expr; "do"; expr
let for_loop   := "for"; "id"; ":="; expr; "to"; expr; "do"; expr

Nothing new or unusual. And the conditional expression rule is also quite trivial:

let conditional :=
  | "if"; expr; "then"; expr; "else"; expr
  | "if"; expr; "then"; expr

Syntax for local bindings is going to be exactly like in OCaml or SML:

let local := "let"; decs; "in"; expr_seq; "end"

Now, a slightly more complex part – declarations.

As stated in the Tiger language reference manual, a declaration-sequence is a sequence of type, value, and function declarations. No punctuation separates or terminates individual declarations.

In terms of Menhir we can express it like this:

let decs := list(dec); { () }

let dec :=
  | ty_dec  (* type *)
  | var_dec (* value *)
  | fun_dec (* function declaration *)
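
The list rule comes from Menhir’s standard library; conceptually it is just the usual recursive definition (a simplified sketch of what standard.mly provides):

let list(x) :=
  | { [] }
  | h = x; t = list(x); { h :: t }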

Next, we need to define a separate rule for each kind of declaration mentioned above.

To declare a data type in the Tiger language we start with the keyword type, after which comes the identifier, the equality token and the type declaration body:

let ty_dec := "type"; "id"; "="; ty

The body of a type declaration can be a record, an array or a type alias (any identifier):

let ty :=
  | braced(ty_fields) (* records *)
  | "array"; "of"; "id"; { () } (* arrays *)
  | "id"; { () }

let ty_fields := separated_list(",", ty_field); { () }
let ty_field  := "id"; ty_ann

As per language reference manual, there are two possible ways to declare a variable: with or without type annotation. The following rule reflects this:

let var_dec :=
  | "var"; "id";         ":="; expr
  | "var"; "id"; ty_ann; ":="; expr

The record and array creation rules below are also considered top-level expressions:

let create_rec := "id"; braced(init_rec_fields)
let create_arr := "id"; bracketed(expr); "of"; expr

let init_rec_fields := separated_list(",", init_rec_field); { () }
let init_rec_field  := "id"; "="; expr

Next, we need a function declaration rule:

let fun_dec :=
  (* procedures don't return values *)
  | fun_head; "="; fun_body
  (* functions return values and the type is specified after the colon *)
  | fun_head; ":"; "id"; "="; fun_body

let fun_head   := "function"; "id"; fun_params
let fun_body   := expr
let fun_params := parenthesized(ty_fields)

According to our language spec:

> An l-value is a location whose value may be read or assigned.
> Variables, procedure parameters, fields of records, and elements of arrays are all l-values.

let lvalue :=
  (* variable *)
  | "id"; { () }
  (* everything else *)
  | lvalue_t

let lvalue_t :=
  (* simple record field *)
  | "id"; "."; "id"; { () }
  (* compound record field *)
  | lvalue_t; "."; "id"; { () }
  (* simple array subscript *)
  | "id"; bracketed(expr)
  (* compound array subscript *)
  | lvalue_t; bracketed(expr)

Assignment of an expression to lvalue:

let assignment := lvalue; ":="; expr

Sequence of expressions delimited by semicolon:

let expr_seq := separated_list(";", expr); { () }

let seq := parenthesized(expr_seq)

Function call rule:

let fun_call := "id"; parenthesized(fun_args)
let fun_args := separated_list(",", expr); { () }

Arithmetic, comparison and equality expressions:

let bin := expr; bin_op; expr
let bin_op == "+" | "-" | "*" | "/" | ">=" | ">" | "<=" | "<" | "<>" | "="

Boolean expressions:

let boolean := expr; boolean_op; expr
let boolean_op == "&" | "|"

Helper rule for a type annotation which we used in the var_dec and ty_field above:

let ty_ann := ":"; "id"

Menhir allows us to define parameterized rules, so here are three helper rules that we use everywhere else for delimited expressions:

let parenthesized(x) == delimited("(", x, ")")
let bracketed(x)     == delimited("[", x, "]")
let braced(x)        == delimited("{", x, "}")
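
The delimited rule is itself part of Menhir’s standard library; a simplified sketch of its definition (it is also an inlined rule) looks like this:

let delimited(opening, x, closing) ==
  opening; y = x; closing; { y }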

Test suite

In order to iterate on our parser quickly we want to automate the process of testing it against the provided example programs. The most convenient way to do that would be to create a test suite that we could run after every change we make.

Some of the provided example Tiger programs are intentionally broken, so we want to skip those:

let skipped =
  ["test16.tig"; "test17.tig"; "test19.tig";
   "test20.tig"; "test25.tig";
   "test45.tig"; "test49.tig"]

A couple of helper functions to list the example programs (recursively):

let is_tig_ext filename =
  let (_, ext) = Filename.split_extension filename in
  match ext with
  | Some "tig" -> true
  | _ -> false

let is_tig_file f =
  Sys.is_file_exn ~follow_symlinks:true f &&
  is_tig_ext f

let rec ls_rec dir =
  if is_tig_file dir
  then
    if not (List.mem skipped (Filename.basename dir) ~equal:String.equal) then
      [dir]
    else
      []
  else
    dir
    |> Sys.ls_dir
    |> List.concat_map
      ~f:(fun sub -> ls_rec (Filename.concat dir sub))

Let’s add another helper function that renders the current position (along with the filename) as a string. We’ll need it later to report error locations:

let position lexbuf =
  let pos = lexbuf.lex_curr_p in
  let col = pos.pos_cnum - pos.pos_bol + 1 in
  Printf.sprintf "%s:%d:%d" pos.pos_fname pos.pos_lnum col

And two more functions to run the parser and print the error details in case of failure:

let parse_with_error lexbuf =
  try
    Parser.main Lexer.read lexbuf;
    assert_bool "Ok" true
  with
  | SyntaxError msg ->
    assert_failure (position lexbuf ^ " : " ^ msg)
  | Parser.Error ->
    assert_failure ("Syntax error: " ^ position lexbuf)

let parse filename ch =
  let lexbuf = Lexing.from_channel ch in
  lexbuf.lex_curr_p <- {
    lexbuf.lex_curr_p with pos_fname = Filename.basename filename
  };
  parse_with_error lexbuf

Here is the function that we’re going to call for each .tig file:

let run_parser filename _ =
  In_channel.with_file filename ~f:(parse filename)

Finally, the test suite is going to look like this:

let suite =
  "tiger programs" >:::
  let tests_dir = "../../../book" ^ Filename.dir_sep ^ "testcases" in
  let tests_path = Filename.(concat parent_dir_name tests_dir) in
  let tig_files = ls_rec tests_path in
  (List.map ~f:
     (fun filename ->
        Filename.basename filename >:: run_parser filename)
     tig_files)

let () =
  run_test_tt_main suite

The dune file for our test-suite project defines the testsuite executable and hooks it into the runtest alias, which simply runs testsuite.exe:

(executable
 (name testsuite)
 (libraries base core ch3 oUnit))

(alias
 (name runtest)
 (action (run ./testsuite.exe)))
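
With the runtest alias in place, the whole suite runs with Dune’s standard test command:

dune runtest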

Now it’ll be much easier to iterate on the parser implementation. This test suite will also help us catch regressions if we change the parser in the future.

Conclusion

Of course, I didn’t come up with this grammar right away. I spent almost a week implementing this parser: I started with a simple parser for “straightline programs” and iterated on it until it evolved into the fully functional Tiger grammar. It also took me some time to figure out how to resolve all the shift-reduce conflicts.

Another cool feature of Menhir is its interpreter, a kind of REPL for the grammar, which can be very helpful for debugging. It is especially useful when the test suite has already caught a parser error and you want to construct a minimal input to check your hypothesis. I used this kind of workflow a lot while working on this grammar.

You can run the interpreter like this:

menhir --interpret --interpret-show-cst parser.mly

Now you can type token sequences until you find something that breaks your parser. Note that you can’t use token aliases inside this REPL; you have to spell out the actual token names.

For example, here is how you can test a function declaration:

LET FUNCTION ID LPAREN ID COLON ID RPAREN COLON ID EQ IF ID EQ INT THEN INT ELSE INT TIMES ID LPAREN ID MINUS INT RPAREN IN ID LPAREN INT RPAREN END
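
To make the token soup above a bit more readable, here is one possible Tiger program that lexes to exactly this sequence (the identifiers and numbers are arbitrary, since the interpreter only cares about the token kinds):

let
  function f(x : int) : int =
    if x = 0 then 1 else 2 * f(x - 1)
in
  f(3)
end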

That’s it for now. The full source code for this chapter is here. Next time we’re going to work on the AST for our Tiger language.