Module 06: The Arith language: parsing

In this module, we will explore some rudiments of parsing.

> {-# LANGUAGE GADTs #-}
> import Data.Char   -- for isSpace, isDigit


> data Token where
>   TLit   :: Integer -> Token
>   TPlus  :: Token
>   TMinus :: Token
>   TTimes :: Token
>   LParen :: Token
>   RParen :: Token
>   deriving (Show, Eq)


Recall our representation of the Arith language from last time:

> data Op where
>   Plus  :: Op
>   Minus :: Op
>   Times :: Op
>   deriving (Show, Eq)
> data Arith where
>   Lit :: Integer -> Arith
>   Bin :: Op -> Arith -> Arith -> Arith
>   deriving (Show)

Consider the following alternative postfix syntax for Arith:

<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<num>   ::= <digit>+
<op>    ::= '+' | '-' | '*'
<post>  ::= <num>
          | <post> ' ' <post> ' ' <op>

Notice that each operator appears after its operands (instead of between them).

Class discussion: shunting


As discussed in class, the Shunting Yard Algorithm reorders a list of tokens from infix to postfix notation by repeatedly applying the following rules. It makes use of a temporary stack of tokens. \(\newcommand{\opin}{\mathit{op}_{\mathit{in}}}\) \(\newcommand{\opstk}{\mathit{op}_{\mathit{stk}}}\)

  1. If the input is empty, move all operators on the temporary stack to the output.

  2. If the next input token is a literal number, move it to the output.

  3. If the next input token is a left parenthesis, push it onto the stack.

  4. If the next input token is a right parenthesis:

    • If the top of the stack is a left parenthesis, discard both parens.

    • Otherwise, pop the top token from the stack and move it to the output (keeping the right paren in the input).

    This has the effect of moving all operators on the stack to the output up until the first matching left parenthesis.

  5. If the next input token is an operator \(\opin\):

    • If the stack is empty, push \(\opin\) on the stack.
    • If the top of the stack is a token \(\opstk\):

      • If the precedence of \(\opin\) is \(\leq\) the precedence of \(\opstk\), pop \(\opstk\) from the stack to the output and keep \(\opin\) in the input.

      • Otherwise, keep \(\opstk\) on the stack and push \(\opin\) on top of it.

    More concisely, when we see \(\opin\) we pop operators from the stack as long as they have precedence greater than \(\opin\), until either the stack is empty or the top operator on the stack has precedence less than \(\opin\); then we push \(\opin\).