Module 07: Parser combinators

In practice, one rarely writes parsers from scratch like we did last week. Typically, one uses some sort of tool or framework for constructing parsers, which hides a lot of the complexity of dealing with lexing/tokenizing, precendence and associativity, etc., and allows you to focus more directly on the grammar you wish to parse.

In this module, we will explore a Haskell library for constructing parsers called parsec.

To install parsec, open a command prompt and type

stack install parsec

This module is for the entire week of September 18 and 20. It is due on Tuesday, September 25.

{-# LANGUAGE GADTSyntax #-}

-- Hide some standard operators so we can use
-- variants with more specific types (for now)
import Prelude hiding ((<$>), (<$), (<*>), (<*), (*>))

-- Parsing is a module I have provided for you which wraps up some
-- functionality of parsec into a somewhat easier/simpler interface.
import Parsing

-- Our old friend Arith
data Arith where
  Lit :: Integer -> Arith
  Add :: Arith -> Arith -> Arith
  Sub :: Arith -> Arith -> Arith
  Mul :: Arith -> Arith -> Arith
  deriving (Show)

interpArith :: Arith -> Integer
interpArith (Lit i) = i
interpArith (Add e1 e2) = interpArith e1 + interpArith e2
interpArith (Sub e1 e2) = interpArith e1 - interpArith e2
interpArith (Mul e1 e2) = interpArith e1 * interpArith e2

lexer :: TokenParser u
lexer = makeTokenParser emptyDef

parens :: Parser a -> Parser a
parens     = getParens lexer

reservedOp :: String -> Parser ()
reservedOp = getReservedOp lexer

integer :: Parser Integer
integer    = getInteger lexer

whiteSpace :: Parser ()
whiteSpace = getWhiteSpace lexer

parseArithAtom :: Parser Arith
parseArithAtom = (Lit <$> integer) <|> parens parseArith

parseArith :: Parser Arith
parseArith = buildExpressionParser table parseArithAtom
  where
    table = [ [ Infix (Mul <$ reservedOp "*") AssocLeft ]
            , [ Infix (Add <$ reservedOp "+") AssocLeft
              , Infix (Sub <$ reservedOp "-") AssocLeft
              ]
            ]

arith :: Parser Arith
arith = whiteSpace *> parseArith <* eof

eval :: String -> Maybe Integer
eval s = case parse arith s of
  Left _  -> Nothing
  Right e -> Just (interpArith e)

Token parsers

The first thing to consider is lexer, which is a TokenParser. For now, we are using a simple default TokenParser; later we will see how to customize it. Essentially, lexer is an automatically generated collection of special parsers which do the low-level work of tokenizing. The functions getInteger, getParens, etc. extract these individual parsers from lexer. We have extracted four such token parsers and given them names to make them easier to use: one to parse integers, one for whitespace, one for operators, and one to parse parenthesized things. (See the definition of GenTokenParser for a full list of the available token parsers.)

Parsers have a type like Parser a; for example, integer has type Parser Integer. This means it is a parser which consumes some part of a String and either returns a value of type Integer or fails with parse error. You can think of it like this:

Parser a == String -> Maybe (a, String)

although the actual definition is quite a bit more complicated.

You can use the parseSome function (provided in the Parsing module) to try a parser and see what part of the input it consumes and what part is left to be consumed by subsequent parsers.

Parser combinators

The token parsers provide the primitive building blocks out of which we can construct more complicated parsers. Now we will explore some of the functions for building up more complex parsers. Such functions that allow building more complex things out of simpler parts are known as combinators.

Combinator exercises

Building an Arith parser

Now we will explore how the token parsers and combinators are used to build a parser for the Arith language.

Note that we have defined two parsers of type Parser Arith, namely, parseArith and parseArithAtom (not counting arith, which we will discuss later). This is a common pattern when building these sorts of parsers. The idea is that an “atomic” thing is something which forms an indivisible unit, which we know how to parse just by looking at the first token. A non-atomic thing might be more complicated, for example, it might involve infix operators.

Now look at the definition of parseArith. It uses a function provided by parsec called buildExpressionParser, which deals with parsing infix operators with various precedence levels and associativities (using similar algorithms to those we explored in the previous module).

Notice that table consists of a list with two elements, each of which is itself a list. The first list has one element which refers to Mul; the second list has two elements referencing Add and Sub.

Finally, take a look at the arith parser.

We have now explored the entire Arith parser. Let’s modify it a bit.

The (<*>) operator

There is one more parser operator we need to learn about, namely, (<*>). It wasn’t needed in the Arith parser but will often be useful.

Unfortunately, the type of (<*>) does not give you a good sense of how to use it unless you have already spent a good deal of time thinking about higher-order functions, currying, and the like in functional programming. Instead, we’ll look at some examples.

(In fact, we could also have written ((+) <$> integer <*> integer), which would do exactly the same thing.)

BIN and EBIN parsers

Feedback