# Module 06: The Arith language: parsing

Write your team names here:

You may again choose whoever you want to start as the driver. Write your choice here:

In this module, we will explore some rudiments of parsing.

```
> {-# LANGUAGE GADTs #-}
>
> import Data.Char -- for isSpace, isDigit
```

## Tokenizing

```
> data Token where
> TLit :: Integer -> Token
> TPlus :: Token
> TMinus :: Token
> TTimes :: Token
> LParen :: Token
> RParen :: Token
> deriving (Show, Eq)
```

Write a function

`tokenize :: String -> [Token]`

which turns a string such as`" ( 2+ 45)"`

into a list of tokens like`[LParen, TLit 2, TPlus, TLit 45, RParen]`

. Your function should ignore any whitespace (use the`isSpace`

function).*Hint*: you may also find the`span`

function useful. Try calling`span isDigit`

to see what it does.Of course,

`tokenize`

might fail if the provided`String`

is not a valid Arith expression. Ideally, we would handle the errors in a principled way (*e.g.*by having`tokenize`

return a`Maybe [Token]`

, or something even more informative). For now, your function should simply crash when given invalid input. We will consider error handling in more depth later in the semester.

## Postfix

**ROTATE ROLES**and write the name of the new driver here:

Recall our representation of the Arith language from last time:

```
> data Op where
> Plus :: Op
> Minus :: Op
> Times :: Op
> deriving (Show, Eq)
>
> data Arith where
> Lit :: Integer -> Arith
> Bin :: Op -> Arith -> Arith -> Arith
> deriving (Show)
```

Consider the following alternative *postfix* syntax for Arith:

```
<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<num> ::= <digit>+
<op> ::= '+' | '-' | '*'
<post> ::= <num>
| <post> ' ' <post> ' ' <op>
```

Notice that each operator appears *after* its operands (instead of *between* them).

Write down three examples that match the grammar

`<post>`

.Consider the following infix expressions. Write each as an equivalent postfix expression. Be careful about the order of arguments!

`3 + 4 * 5`

`3 * 4 + 5`

`(2 + 3 - (4 - 5)) * 6`

Get out a sheet of paper and draw each of the abstract syntax trees corresponding to the above expressions. What is the relationship between their trees and their postfix forms?

Note that

`<post>`

expressions do not contain any parentheses. Will this ever cause ambiguity or incorrect parsing? If so, give an example; if not, explain why not.

## Class discussion: shunting

## Shunting

**ROTATE ROLES**and write the name of the new driver here:

As discussed in class, the Shunting Yard Algorithm reorders a list of tokens from infix to postfix notation by repeatedly applying the following rules. It makes use of a temporary stack of tokens. \(\newcommand{\opin}{\mathit{op}_{\mathit{in}}}\) \(\newcommand{\opstk}{\mathit{op}_{\mathit{stk}}}\)

If the input is empty, move all operators on the temporary stack to the output.

If the next input token is a literal number, move it to the output.

If the next input token is a left parenthesis, push it onto the stack.

If the next input token is a right parenthesis:

If the top of the stack is a left parenthesis, discard both parens.

Otherwise, pop the top token from the stack and move it to the output (keeping the right paren in the input).

This has the effect of moving all operators on the stack to the output up until the first matching left parenthesis.

If the next input token is an operator \(\opin\):

- If the stack is empty, push \(\opin\) on the stack.
If the top of the stack is a token \(\opstk\):

If the precedence of \(\opin\) is \(\leq\) the precedence of \(\opstk\), pop \(\opstk\) from the stack to the output and keep \(\opin\) in the input.

Otherwise, keep \(\opstk\) on the stack and push \(\opin\) on top of it.

More concisely, when we see \(\opin\) we pop operators from the stack as long as they have precedence greater than \(\opin\), until either the stack is empty or the top operator on the stack has precedence less than \(\opin\); then we push \(\opin\).

Write a function

`shunt :: [Token] -> [Token]`

which converts an infix expression into postfix, implementing the shunting yard algorithm discussed in class.*Hint 1*: make another “helper” function with type`[Token] -> [Token] -> [Token]`

, which keeps track of both the input list of tokens as well as the temporary operator stack.*Hint 2*: make a function of type`Token -> Int`

which returns the precedence of each token. Be sure to give parentheses the*lowest*precedence (this means Rule 5 above does not need any special cases for parentheses). It probably doesn’t matter what precedence you give to number literals, but logically they should have the*highest*precedence.

Where do you think the given algorithm would need to be modified to accommodate both left- and right-associative operators?

## Parsing

Write a function

`parsePostfix :: [Token] -> Arith`

which takes a list of tokens in postfix form and builds a corresponding`Arith`

expression. For example,`parsePostfix [TLit 2, TLit 45, TPlus] = Bin Plus (Lit 2) (Lit 45)`

.`parsePostfix`

should work by maintaining an`[Arith]`

stack. Literal tokens are just pushed onto the stack. Each operator in the postfix expression applies to the two previous subexpressions, so when encountering an operator, pop the top two`Arith`

trees from the stack, make them into a new`Arith`

tree with the operator at the root, and push the result back on the stack.Again,

`parsePostfix`

should work by just calling a recursive helper function with an extra parameter for the`[Arith]`

stack (which starts off empty).Finally, put it all together: write a function

`eval :: String -> Integer`

which works as follows:Concrete syntax (infix) \(\rightarrow\)

*tokenize*\(\rightarrow\) Token list (infix) \(\rightarrow\)*shunt*\(\rightarrow\) Token list (postfix) \(\rightarrow\)*parse*\(\rightarrow\) Arith AST \(\rightarrow\)*interpret*\(\rightarrow\) Integer value