Making of Tiger #2, Lexical Analysis

Intro

We need a way to translate a program written in one (human-friendly) language into another (machine-specific) language. Generally, this work is split into two parts: analysis and synthesis.

The synthesis part (the back end) is responsible for code generation and optimization.

The analysis part (the front end) is responsible for breaking the program apart to understand its structure and meaning. There are three commonly used analysis phases:

  1. Lexical – breaking a sequence of characters into a sequence of individual tokens (words)
  2. Syntax – parsing, i.e. checking that we have a valid sequence of tokens
  3. Semantic – gathering the program’s meaning and making sure that declarations and statements are semantically correct; this usually includes type checking

In this post we’ll focus on implementing the lexical analysis phase.

A lexical token is a string with an assigned and identified meaning: a unit in the grammar of a programming language.

  • Example tokens: identifiers, keywords, separators, operators
  • Example non-tokens: white-spaces, tabs, newlines, comments, preprocessor directives, macros
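
For example, once we define the Tiger tokens later in this post, a line like

x := x + 42

will be broken into the token stream below, with the whitespace discarded:

ID "x", ASSIGN, ID "x", PLUS, INT 42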

Now, I’m going to skip everything related to finite automata and regular expressions and go straight to the implementation of a lexer (lexical analyzer). You can find a good intro to lexing and parsing using ocamllex and menhir in the Real World OCaml book.

Example lexer (morse code)

We’ll start with the simplest possible example: building a lexer for Morse code. The full source code is here.

The dune file for our playground project is going to look like this:

(ocamllex
 (modules lexer))

(executable
 (name driver)
 (libraries core stdio)
 (preprocess (pps ppx_deriving.show)))

(env (dev (flags (:standard -warn-error -A))))
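
Assuming conventional file names (lexer.mll for the ocamllex input, plus the dune-project file that dune requires at the project root), the layout looks like this:

.
├── dune
├── dune-project
├── driver.ml
├── lexer.mll
└── syntax.ml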

Nothing fancy here. Next comes the driver.ml module, our entry point. It simply reads the given file, runs our lexer, and prints the resulting tokens.

open Core
open Syntax

(* Collect tokens until the lexer returns EOF *)
let rec tokens buf =
  match Lexer.read buf with
  | EOF -> [EOF]
  | tok -> tok :: tokens buf

(* Lex the whole channel and print each token on its own line *)
let lex_print ch =
  ch
  |> Lexing.from_channel
  |> tokens
  |> List.iter ~f:(fun e -> Format.printf "%s\n" (show_exp e))

let run filename () =
  In_channel.with_file filename ~f:lex_print

let () =
  let spec = Command.Spec.(empty +> anon ("filename" %: string)) in
  run
  |> Command.basic_spec ~summary:"Run the lexer and display tokens" spec
  |> Command.run
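
Once the syntax.ml and lexer.mll modules below are in place, we can build and run the whole thing on an input file (the file name here is just an example):

$ dune exec ./driver.exe hello.morse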

We need a module to describe our tokens, let’s call it syntax.ml:

type exp =
  | SYM of string (* symbol *)
  | SEP (* separator *)
  | EOF (* end of file *)
  [@@deriving show]

And the lexer definition itself (lexer.mll):

{
  open Lexing
  open Syntax

  (* Custom exception type for lexer errors *)
  exception SyntaxError of string
}

(* Regular expressions: *)

(* We use whitespace as a separator, so
   it is a valid token in our language *)
let white   = [' ' '\t']+
let newline = '\r' | '\n' | "\r\n"
let sym     = ['.' '-']+

(* Lexing rules: *)

rule read = parse
  (* New lines are separators too *)
  | newline { new_line lexbuf; SEP }
  | sym     { SYM (lexeme lexbuf) }
  | white   { SEP }
  | _       { raise (SyntaxError (Printf.sprintf "At offset %d: unexpected character.\n" (lexeme_start lexbuf))) }
  | eof     { EOF }
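
ocamllex compiles each rule into a function, so the generated Lexer module exposes read : Lexing.lexbuf -> Syntax.exp, which we can also call directly (a quick sketch, e.g. in utop):

let buf = Lexing.from_string ".- -" in
assert (Lexer.read buf = Syntax.SYM ".-");
assert (Lexer.read buf = Syntax.SEP);
assert (Lexer.read buf = Syntax.SYM "-")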

Now we can lex the “hello world” sequence written in Morse code:

.... . .-.. .-.. ---
.-- --- .-. .-.. -..

The output will be:

(SYM "....") SEP (SYM ".") SEP (SYM ".-..") SEP (SYM ".-..") SEP (SYM "---")
SEP
(SYM ".--") SEP (SYM "---") SEP (SYM ".-.") SEP (SYM ".-..") SEP (SYM "-..")
EOF

Tiger lexer

Having that basic understanding of how to use ocamllex to build lexers, we can try to build one for our Tiger language.

The dune file is going to remain the same as in the “morse code” example above. There is a Tiger Language Reference Manual in Appendix A of the book, where you can find information about the tokens we need: identifiers, comments, declarations, data types, etc.

For now, we don’t need to define a real AST data type; a flat token type in syntax.ml is enough:

type exp =
    | TYPE
    | VAR
    | FUNCTION
    | BREAK
    | OF
    | END
    | IN
    | NIL
    | LET
    | DO
    | TO
    | FOR
    | WHILE
    | ELSE
    | THEN
    | IF
    | ARRAY
    | ASSIGN
    | OR
    | AND
    | GE
    | GT
    | LE
    | LT
    | NEQ
    | EQ
    | DIVIDE
    | TIMES
    | MINUS
    | PLUS
    | DOT
    | RBRACE
    | LBRACE
    | RBRACK
    | LBRACK
    | RPAREN
    | LPAREN
    | SEMICOLON
    | COLON
    | COMMA
    | STRING of string
    | INT of int
    | ID of string
    | EOF
    [@@deriving show]

We want to focus on the lexer part. First, we might want to open some commonly used modules and define a custom exception type for lexing errors.

{
open Lexing
open Syntax
open Base

exception SyntaxError of string
}

Next, we need a few regular expressions:

let digit = ['0'-'9']
(* There are no negative integer literals in Tiger *)
let int = digit+
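(* Note: frac, exp and float below are not used by any rule in this
   post, since Tiger has no floating-point literals *)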
let frac = '.' digit*
let exp = ['e' 'E'] ['-' '+']? int
let float = digit* frac? exp?
let white = [' ' '\t']+
let newline = '\r' | '\n' | "\r\n"

According to Appendix A:

An identifier is a sequence of letters, digits and underscores, starting with a letter. Uppercase letters are distinguished from lowercase.

let alphanum = ['a'-'z' 'A'-'Z' '0'-'9' '_']
(* Per the spec, an identifier starts with a letter of either case,
   e.g. x, x_1 and Square2 are valid, while _x and 1x are not *)
let id = ['a'-'z' 'A'-'Z'] alphanum*

Lexing rules:

rule read = parse
  (* Whitespaces *)
  | white { read lexbuf }

We skip newlines as well, calling new_line so that the positions reported in error messages stay accurate.

  | newline { new_line lexbuf; read lexbuf }

We have a separate function to read strings (see the implementation below).

  | '"' { read_string (Buffer.create 16) lexbuf }

Again, according to Appendix A:

A comment may appear between any two tokens. Comments start with /* and end with */ and may be nested.

We’ll have a separate function for reading comments (scroll down to see its code).

  | "/*" { read_comment [lexbuf.lex_curr_p] lexbuf }

Basic keywords:

  | "type"     { TYPE }
  | "var"      { VAR }
  | "function" { FUNCTION }
  | "break"    { BREAK }
  | "of"       { OF }
  | "end"      { END }
  | "in"       { IN }
  | "nil"      { NIL }
  | "let"      { LET }
  | "array"    { ARRAY }

We need 4 tokens for loops:

  | "do"    { DO }
  | "to"    { TO }
  | "for"   { FOR }
  | "while" { WHILE }

Conditionals (if-then-else):

  | "else" { ELSE }
  | "then" { THEN }
  | "if"   { IF }

Operators:

  (* General *)
  | ":=" { ASSIGN }

  (* Logical *)
  | "|" { OR }
  | "&" { AND }

  (* Comparison *)
  | ">=" { GE }
  | ">"  { GT }
  | "<=" { LE }
  | "<"  { LT }
  | "<>" { NEQ }
  | "="  { EQ }

  (* Arithmetics *)
  | "/" { DIVIDE }
  | "*" { TIMES }
  | "-" { MINUS }
  | "+" { PLUS }

Separators:

  | "." { DOT }
  | "{" { LBRACE }
  | "}" { RBRACE }
  | "[" { LBRACK }
  | "]" { RBRACK }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | ";" { SEMICOLON }
  | ":" { COLON }
  | "," { COMMA }

Numbers, identifiers, error and EOF handling:

  | int { INT (Int.of_string (Lexing.lexeme lexbuf)) }
  | id  { ID (Lexing.lexeme lexbuf) }
  | _   { raise (SyntaxError ("Unexpected character: " ^ Lexing.lexeme lexbuf)) }
  | eof { EOF }
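
Note that ocamllex always prefers the longest match and, among rules matching the same length, the one listed first. This is why the keyword rules above don’t swallow identifiers that merely start with a keyword:

(* "type"  lexes as TYPE       (equal length, keyword rule listed first) *)
(* "types" lexes as ID "types" (the longest match wins) *)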

The rule to match string literals:

and read_string buf = parse
  (* If we reach the terminating double quote, then
   * we return the contents of the buffer as a STRING. *)
  | '"'       { STRING (Buffer.contents buf) }
  (* Handling escape sequences *)
  | '\\' '\\' { Buffer.add_char buf '\\'; read_string buf lexbuf }
  | '\\' '/'  { Buffer.add_char buf '/';  read_string buf lexbuf }
  | '\\' 'b'  { Buffer.add_char buf '\b'; read_string buf lexbuf }
  | '\\' 'f'  { Buffer.add_char buf '\012'; read_string buf lexbuf }
  | '\\' 'n'  { Buffer.add_char buf '\n'; read_string buf lexbuf }
  | '\\' 'r'  { Buffer.add_char buf '\r'; read_string buf lexbuf }
  | '\\' 't'  { Buffer.add_char buf '\t'; read_string buf lexbuf }
  | [^ '"' '\\']+
    { Buffer.add_string buf (Lexing.lexeme lexbuf);
      read_string buf lexbuf
    }
  | _   { raise (SyntaxError ("Illegal string character: " ^ Lexing.lexeme lexbuf)) }
  | eof { raise (SyntaxError "String is not terminated") }
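
The escape sequences are decoded while the string is being read, so the token’s payload ends up containing actual control characters rather than backslashes. For example:

"Hi\tthere\n"         (* string literal in the source *)
STRING "Hi\tthere\n"  (* resulting token, with a real tab and newline inside *)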

The rule to match comments (including nested comments), keeping a stack of the positions where comments were opened:

and read_comment opened = parse
  (* Opening comment *)
  | "/*" { read_comment (lexbuf.lex_curr_p::opened) lexbuf }
  (* Closing comment *)
  | "*/"
    { match opened with
      (* No nested opened comments left, continue parsing. *)
      | _::[] -> read lexbuf
      (* Continue parsing comment. *)
      | _ -> read_comment (List.tl_exn opened) lexbuf
    }
  | newline { new_line lexbuf; read_comment opened lexbuf }
  | _ { read_comment opened lexbuf }
  (* Unexpected end-of-file. Update the current location to
   * point to the opening token that wasn't closed and raise an error. *)
  | eof
    { lexbuf.lex_curr_p <- List.hd_exn opened;
      raise (SyntaxError "Unterminated comment")
    }
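
Putting it all together: running the driver on a tiny Tiger program like the one below (a hypothetical test.tig)

/* square a number */
let
  var x := 10
in
  x * x
end

should produce the token stream LET, VAR, ID "x", ASSIGN, INT 10, IN, ID "x", TIMES, ID "x", END, EOF, with the comment and all whitespace discarded.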

Summary

We’ve learned some new things, like ocamllex, but overall, it wasn’t too hard to follow the book. The full OCaml source code for the second chapter can be found here. In the next part we’re going to write a parser.