Module 06: The Arith language: parsing
======================================

* Write your team names here:

* You may again choose whoever you want to start as the driver.  Write
  your choice here:

In this module, we will explore some rudiments of parsing.

> {-# LANGUAGE GADTs #-}
>
> import Data.Char   -- for isSpace, isDigit

Tokenizing
----------

> data Token where
>   TLit   :: Integer -> Token
>   TPlus  :: Token
>   TMinus :: Token
>   TTimes :: Token
>   LParen :: Token
>   RParen :: Token
>   deriving (Show, Eq)

* Write a function `tokenize :: String -> [Token]` which turns a
  string such as `" ( 2+ 45)"` into a list of tokens like
  `[LParen, TLit 2, TPlus, TLit 45, RParen]`.  Your function should
  ignore any whitespace (use the `isSpace` function).  *Hint*: you may
  also find the `span` function useful.  Try calling `span isDigit` to
  see what it does.

    Of course, `tokenize` might fail if the provided `String` is not a
    valid Arith expression.  Ideally, we would handle the errors in a
    principled way (*e.g.* by having `tokenize` return a `Maybe
    [Token]`, or something even more informative). For now, your
    function should simply crash when given invalid input.  We will
    consider error handling in more depth later in the semester.

Postfix
-------

* **ROTATE ROLES** and write the name of the new driver here:

Recall our representation of the Arith language from last time:

> data Op where
>   Plus  :: Op
>   Minus :: Op
>   Times :: Op
>   deriving (Show, Eq)
>
> data Arith where
>   Lit :: Integer -> Arith
>   Bin :: Op -> Arith -> Arith -> Arith
>   deriving (Show)

Consider the following alternative *postfix* syntax for Arith:

```
<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<num>   ::= <digit>+
<op>    ::= '+' | '-' | '*'
<post>  ::= <num>
          | <post> ' ' <post> ' ' <op>
```

Notice that each operator appears *after* its operands (instead of
*between* them).

* Write down three examples that match the grammar `<post>`.

* Consider the following infix expressions.  Write each as an
  equivalent postfix expression.  Be careful about the order of arguments!

    + `3 + 4 * 5`
    + `3 * 4 + 5`
    + `(2 + 3 - (4 - 5)) * 6`

* Get out a sheet of paper and draw each of the abstract syntax trees
  corresponding to the above expressions.  What is the relationship
  between their trees and their postfix forms?

* Note that `<post>` expressions do not contain any parentheses.
  Will this ever cause ambiguity or incorrect parsing?  If so, give an
  example; if not, explain why not.

Class discussion: shunting
--------------------------

![](../images/stop.gif)

Shunting
--------

* **ROTATE ROLES** and write the name of the new driver here:

As discussed in class, the Shunting Yard Algorithm reorders a list
of tokens from infix to postfix notation by repeatedly applying the
following rules.  It makes use of a temporary stack of tokens.
$\newcommand{\opin}{\mathit{op}_{\mathit{in}}}$
$\newcommand{\opstk}{\mathit{op}_{\mathit{stk}}}$

1. If the input is empty, move all operators on the temporary stack to
   the output.

2. If the next input token is a literal number, move it to the output.

3. If the next input token is a left parenthesis, push it onto the
   stack.

4. If the next input token is a right parenthesis:

    * If the top of the stack is a left parenthesis, discard both
      parens.

    * Otherwise, pop the top token from the stack and move it to the
      output (keeping the right paren in the input).

    This has the effect of moving all operators on the stack to the
    output up until the first matching left parenthesis.

5. If the next input token is an operator $\opin$:

    * If the stack is empty, push $\opin$ on the stack.
    * If the top of the stack is a token $\opstk$:

        * If the precedence of $\opin$ is $\leq$ the precedence of
          $\opstk$, pop $\opstk$ from the stack to the output and keep $\opin$
          in the input.

        * Otherwise, keep $\opstk$ on the stack and push $\opin$ on
          top of it.

    More concisely, when we see $\opin$ we pop operators from the
    stack as long as they have precedence greater than $\opin$, until
    either the stack is empty or the top operator on the stack has
    precedence less than $\opin$; then we push $\opin$.

* Write a function `shunt :: [Token] -> [Token]` which converts an
  infix expression into postfix, implementing the shunting yard
  algorithm discussed in class.

    * *Hint 1*: make another "helper" function with type `[Token] ->
      [Token] -> [Token]`, which keeps track of both the input list of
      tokens as well as the temporary operator stack.

    * *Hint 2*: make a function of type `Token -> Int` which returns
      the precedence of each token.  Be sure to give parentheses the
      *lowest* precedence (this means Rule 5 above does not need any
      special cases for parentheses). It probably doesn't matter what
      precedence you give to number literals, but logically they
      should have the *highest* precedence.

* Where do you think the given algorithm would need to be modified to
  accommodate both left- and right-associative operators?

Parsing
-------

* Write a function `parsePostfix :: [Token] -> Arith` which takes a
  list of tokens in postfix form and builds a corresponding `Arith`
  expression.  For example, `parsePostfix [TLit 2, TLit 45, TPlus] =
  Bin Plus (Lit 2) (Lit 45)`.

    `parsePostfix` should work by maintaining an `[Arith]` stack.
    Literal tokens are just pushed onto the stack.  Each operator in
    the postfix expression applies to the two previous subexpressions,
    so when encountering an operator, pop the top two `Arith` trees
    from the stack, make them into a new `Arith` tree with the
    operator at the root, and push the result back on the stack.

    Again, `parsePostfix` should work by just calling a recursive
    helper function with an extra parameter for the `[Arith]` stack
    (which starts off empty).

* Finally, put it all together: write a function `eval :: String ->
  Integer` which works as follows:

    Concrete syntax (infix) $\rightarrow$ *tokenize* $\rightarrow$
    Token list (infix) $\rightarrow$ *shunt* $\rightarrow$ Token list
    (postfix) $\rightarrow$ *parse* $\rightarrow$ Arith AST
    $\rightarrow$ *interpret* $\rightarrow$ Integer value