Module 07: Parser combinators
- Write your team names here:
In practice, one rarely writes parsers from scratch like we did last week. Typically, one uses some sort of tool or framework for constructing parsers, which hides a lot of the complexity of dealing with lexing/tokenizing, precendence and associativity, etc., and allows you to focus more directly on the grammar you wish to parse.
In this module, we will explore a Haskell library for constructing parsers called parsec
.
To install parsec
, open a command prompt and type
stack install parsec
This module is for the entire week of September 18 and 20. It is due on Tuesday, September 25.
{-# LANGUAGE GADTSyntax #-}
-- Hide some standard operators so we can use
-- variants with more specific types (for now)
import Prelude hiding ((<$>), (<$), (<*>), (<*), (*>))
-- Parsing is a module I have provided for you which wraps up some
-- functionality of parsec into a somewhat easier/simpler interface.
import Parsing
-- Our old friend Arith
data Arith where
Lit :: Integer -> Arith
Add :: Arith -> Arith -> Arith
Sub :: Arith -> Arith -> Arith
Mul :: Arith -> Arith -> Arith
deriving (Show)
interpArith :: Arith -> Integer
interpArith (Lit i) = i
interpArith (Add e1 e2) = interpArith e1 + interpArith e2
interpArith (Sub e1 e2) = interpArith e1 - interpArith e2
interpArith (Mul e1 e2) = interpArith e1 * interpArith e2
lexer :: TokenParser u
lexer = makeTokenParser emptyDef
parens :: Parser a -> Parser a
parens = getParens lexer
reservedOp :: String -> Parser ()
reservedOp = getReservedOp lexer
integer :: Parser Integer
integer = getInteger lexer
whiteSpace :: Parser ()
whiteSpace = getWhiteSpace lexer
parseArithAtom :: Parser Arith
parseArithAtom = (Lit <$> integer) <|> parens parseArith
parseArith :: Parser Arith
parseArith = buildExpressionParser table parseArithAtom
where
table = [ [ Infix (Mul <$ reservedOp "*") AssocLeft ]
, [ Infix (Add <$ reservedOp "+") AssocLeft
, Infix (Sub <$ reservedOp "-") AssocLeft
]
]
arith :: Parser Arith
arith = whiteSpace *> parseArith <* eof
eval :: String -> Maybe Integer
eval s = case parse arith s of
Left _ -> Nothing
Right e -> Just (interpArith e)
Token parsers
- Choose who will start out as the driver, and write their name here:
The first thing to consider is lexer
, which is a TokenParser
. For now, we are using a simple default TokenParser
; later we will see how to customize it. Essentially, lexer
is an automatically generated collection of special parsers which do the low-level work of tokenizing. The functions getInteger
, getParens
, etc. extract these individual parsers from lexer
. We have extracted four such token parsers and given them names to make them easier to use: one to parse integers, one for whitespace, one for operators, and one to parse parenthesized things. (See the definition of GenTokenParser
for a full list of the available token parsers.)
Parsers have a type like Parser a
; for example, integer
has type Parser Integer
. This means it is a parser which consumes some part of a String
and either returns a value of type Integer
or fails with parse error. You can think of it like this:
Parser a == String -> Maybe (a, String)
although the actual definition is quite a bit more complicated.
You can use the parseSome
function (provided in the Parsing
module) to try a parser and see what part of the input it consumes and what part is left to be consumed by subsequent parsers.
Try the
integer
parser usingparseSome
. For example, you could evaluateparseSome integer "23"
in GHCi. Explain howinteger
behaves. Be sure to try lots of different inputs: try spaces before and/or after an integer, try a negative sign, try non-digit characters before and after, etc. (You do not have to record the results of all your experiments.) Explain in detail what theinteger
parser does.Try each of the following and explain what they do. Just as with
integer
, be sure to try many different inputs.reservedOp
whiteSpace
parens
In general, how do token parsers like
integer
,reservedOp
, andparens
handle spaces?
Parser combinators
- ROTATE ROLES and write the name of the new driver here:
The token parsers provide the primitive building blocks out of which we can construct more complicated parsers. Now we will explore some of the functions for building up more complex parsers. Such functions that allow building more complex things out of simpler parts are known as combinators.
Try
integer <|> parens integer
. (As before, “try” means "use theparseSome
function to try it on various example inputs.)What does
integer <|> parens integer
do?What is the type of
(<|>)
?What does
(<|>)
do in general?
Now consider
Lit <$> integer
.What does
Lit <$> integer
do?What is the type of
(<$>)
?What is the type of
Lit
?What is the type of
Lit <$> integer
? What does it do?
Try
integer *> integer
.What does
integer *> integer
do?What is the type of
(*>)
?What does
(*>)
do?
What do you think
(<*)
does? Guess before trying it.Now try it. Were you right?
Try
"I" <$ integer
.What does
"I" <$ integer
do?What is the type of
(<$)
?What does
(<$)
do?How is
(<$)
similar to(<*)
? How is it different?
Combinator exercises
ROTATE ROLES and write the name of the new driver here:
Write a parser which parses two integers, discards the first one, and returns the second wrapped in a
Lit
constructor. Be sure to test it!Write a parser which parses an integer surrounded by
~
on either side (for example,"~23~"
) and returns the integer.- Skim through the
parsec
documentation for the following modules:Note, wherever you see something like
ParsecT s u m a
you should think of it asParser a
(ParsecT s u m a
is a more general version ofParser a
).Pick one of the combinators you find and demonstrate its use. You will have to add an
import
statement to the top of this file of the form:import Text.Parsec.SomeModule (combinatorYouChose)
Building an Arith parser
- ROTATE ROLES and write the name of the new driver here:
Now we will explore how the token parsers and combinators are used to build a parser for the Arith language.
Note that we have defined two parsers of type Parser Arith
, namely, parseArith
and parseArithAtom
(not counting arith
, which we will discuss later). This is a common pattern when building these sorts of parsers. The idea is that an “atomic” thing is something which forms an indivisible unit, which we know how to parse just by looking at the first token. A non-atomic thing might be more complicated, for example, it might involve infix operators.
- Explain what
parseArithAtom
does.
Now look at the definition of parseArith
. It uses a function provided by parsec
called buildExpressionParser
, which deals with parsing infix operators with various precedence levels and associativities (using similar algorithms to those we explored in the previous module).
- Explain what the parser
(Mul <$ reservedOp "*")
does. What is its type?
Notice that table
consists of a list with two elements, each of which is itself a list. The first list has one element which refers to Mul
; the second list has two elements referencing Add
and Sub
.
What do you think this list signifies?
Test your theory by switching the order of the two lists, and doing some test parses. Were you right? (Then switch them back.)
Finally, take a look at the arith
parser.
What does the
eof
parser do? (Try it on some examples, and/or read its documentation.)What does
arith
do?
We have now explored the entire Arith parser. Let’s modify it a bit.
Extend Arith with an exponentiation operator. Remember that exponentiation has higher precedence than multiplication and is right-associative. (The exponentiation operator in Haskell is
(^)
.) You will have to modify theArith
data type, the interpreter, and the parser. Be sure to test that it works correctly.Now extend Arith with prefix negation. For example,
"-(2+3)"
should evaluate to-5
. Again, you should modify theArith
definition, parser, and interpreter. When modifying the parser, you can usePrefix
instead ofInfix
. See the documentation forText.Parsec.Expr
.
The (<*>)
operator
There is one more parser operator we need to learn about, namely, (<*>)
. It wasn’t needed in the Arith parser but will often be useful.
- What is the type of
(<*>)
?
Unfortunately, the type of (<*>)
does not give you a good sense of how to use it unless you have already spent a good deal of time thinking about higher-order functions, currying, and the like in functional programming. Instead, we’ll look at some examples.
- Test the parser
(add <$> integer <*> integer)
. What does it do?
(In fact, we could also have written ((+) <$> integer <*> integer)
, which would do exactly the same thing.)
Write a function
add3
which takes threeInteger
s and adds them.Now test the parser
(add3 <$> integer <*> integer <*> integer)
. What does it do?In general, what do you think
(func <$> parser1 <*> parser2 <*> ... <*> parserN)
does?Write a parser
adder :: Parser Arith
which expects to read two integer values, and returns anArith
value consisting of anAdd
node containing the two integers. For example:parseSome adder "32 9 xyz" Right (Add (Lit 32) (Lit 9), "xyz")
BIN and EBIN parsers
ROTATE ROLES and write the name of the new driver here:
Recall the BIN language from module 4. As a reminder, the concrete syntax of BIN looks like this:
<bin> ::= '#' | '(' <bin> <bin> ')'
Write a parser for BIN using parsec. A few hints:
- Define
symbol = getSymbol lexer
and use it to parse the#
character. UsingreservedOp
will not work since it wants to treat##
as a single operator.
- Define
Now extend your BIN parser to an EBIN parser. Some hints:
Use the parser
digit :: Parser Char
to parse a single digit.You will probably find it useful to write a function
mkLit :: Char -> EBin
which creates anEBin
out of aChar
which is a single digit.
Feedback
How long would you estimate that you spent working on this module?
Were any parts particularly confusing or difficult?
Were any parts particularly fun or interesting?
Record here any other questions, comments, or suggestions for improvement.