Module 06: The Arith language: parsing

Write your team names here:
You may again choose whoever you want to start as the driver. Write your choice here:

In this module, we will explore some rudiments of parsing.

{-# LANGUAGE GADTSyntax #-}

import Data.Char   -- for isSpace, isDigit

Tokenizing

data Token where
  TLit   :: Integer -> Token
  TPlus  :: Token
  TMinus :: Token
  TTimes :: Token
  LParen :: Token
  RParen :: Token
  deriving (Show, Eq)

Write a function tokenize :: String -> [Token] which turns a string such as " ( 2+ 45)" into a list of tokens like [LParen, TLit 2, TPlus, TLit 45, RParen]. Your function should ignore any whitespace (use the isSpace function). Hint: you may also find the span function useful. Try calling span isDigit to see what it does.

Of course, tokenize might fail if the provided String is not a valid Arith expression. Ideally, we would handle the errors in a principled way (e.g. by having tokenize return a Maybe [Token], or something even more informative). For now, your function should simply crash when given invalid input. We will consider error handling in more depth later in the semester.

Postfix

ROTATE ROLES and write the name of the new driver here:

Recall our representation of the Arith language from last time:

data Op where
  Plus  :: Op
  Minus :: Op
  Times :: Op
  deriving (Show, Eq)

data Arith where
  Lit :: Integer -> Arith
  Bin :: Op -> Arith -> Arith -> Arith
  deriving (Show)

Consider the following alternative postfix syntax for Arith:

<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<num>   ::= <digit>+
<op>    ::= '+' | '-' | '*'
<post>  ::= <num>
          | <post> ' ' <post> ' ' <op>

Notice that each operator appears after its operands (instead of between them).

Write down three examples that match the grammar <post>.
Consider the following infix expressions. Write each as an equivalent postfix expression. Be careful about the order of arguments!
- 3 + 4 * 5
- 3 * 4 + 5
- (2 + 3 - (4 - 5)) * 6
Get out a sheet of paper and draw each of the abstract syntax trees corresponding to the above expressions. What is the relationship between their trees and their postfix forms?
Note that <post> expressions do not contain any parentheses. Will this ever cause ambiguity or incorrect parsing? If so, give an example; if not, explain why not.

Class discussion: shunting

Shunting

ROTATE ROLES and write the name of the new driver here:

As discussed in class, the Shunting Yard Algorithm reorders a list of tokens from infix to postfix notation by repeatedly applying the following rules. It makes use of a temporary stack of tokens. \(\newcommand{\opin}{\mathit{op}_{\mathit{in}}}\) \(\newcommand{\opstk}{\mathit{op}_{\mathit{stk}}}\)

If the input is empty, move all operators on the temporary stack to the output.
If the next input token is a literal number, move it to the output.
If the next input token is a left parenthesis, push it onto the stack.
If the next input token is a right parenthesis:
- If the top of the stack is a left parenthesis, discard both parens.
- Otherwise, pop the top token from the stack and move it to the output (keeping the right paren in the input).
This has the effect of moving all operators on the stack to the output up until the first matching left parenthesis.
If the next input token is an operator \(\opin\):
- If the stack is empty, push \(\opin\) on the stack.
- If the top of the stack is a token \(\opstk\):
  - If the precedence of \(\opin\) is \(\leq\) the precedence of \(\opstk\), pop \(\opstk\) from the stack to the output and keep \(\opin\) in the input.
  - Otherwise, keep \(\opstk\) on the stack and push \(\opin\) on top of it.
More concisely, when we see \(\opin\) we pop operators from the stack as long as they have precedence greater than \(\opin\), until either the stack is empty or the top operator on the stack has precedence less than \(\opin\); then we push \(\opin\).

Write a function shunt :: [Token] -> [Token] which converts an infix expression into postfix, implementing the shunting yard algorithm discussed in class.
- Hint 1: make another “helper” function with type [Token] -> [Token] -> [Token], which keeps track of both the input list of tokens as well as the temporary operator stack.
- Hint 2: make a function of type Token -> Int which returns the precedence of each token. Be sure to give parentheses the lowest precedence (this means Rule 5 above does not need any special cases for parentheses). It probably doesn’t matter what precedence you give to number literals, but logically they should have the highest precedence.
Where do you think the given algorithm would need to be modified to accommodate both left- and right-associative operators?

Parsing

Write a function parsePostfix :: [Token] -> Arith which takes a list of tokens in postfix form and builds a corresponding Arith expression. For example, parsePostfix [TLit 2, TLit 45, TPlus] = Bin Plus (Lit 2) (Lit 45).

parsePostfix should work by maintaining an [Arith] stack. Literal tokens are just pushed onto the stack. Each operator in the postfix expression applies to the two previous subexpressions, so when encountering an operator, pop the top two Arith trees from the stack, make them into a new Arith tree with the operator at the root, and push the result back on the stack.

Again, parsePostfix should work by just calling a recursive helper function with an extra parameter for the [Arith] stack (which starts off empty).
Finally, put it all together: write a function eval :: String -> Integer which works as follows:

Concrete syntax (infix) \(\rightarrow\) tokenize \(\rightarrow\) Token list (infix) \(\rightarrow\) shunt \(\rightarrow\) Token list (postfix) \(\rightarrow\) parse \(\rightarrow\) Arith AST \(\rightarrow\) interpret \(\rightarrow\) Integer value