In this post, we write the parser for our expression language to an AST, and an AST interpreter.
Introduction
The language that we are going to work with is that of basic arithmetic expressions, with integer values, and addition, subtraction, multiplication and integer division operations. However, our expression language has a small twist: it is possible to introduce a variable using a let binding and use the variable in the expressions in the body of let1. Furthermore, we use the same syntax for let as Haskell does. Here are some examples of valid expressions in our language:
1+2-3*4+5/6/0+1let x =4in x +1let x =4inlet y =5in x + ylet x =4inlet y =5in x +let z = y in z * zlet x =4in (let y =5in x +1) +let z =2in z * zlet x = (let y =3in y + y) in x *3let x =let y =3in y + y in x *3let x =let y =1+let z =2in z * z in y +1in x *3
The only gotcha here is that the body of a let expression extends as far as possible while accounting for nested lets. It becomes clear when we look at parsed expressions later.
The eventual product is a command-line tool that can run different commands. Let’s start with a demo of the tool:
$ arith-vm -h
Bytecode VM for Arithmetic written in Haskell
Usage: arith-vm COMMAND
Available options:
-h,--help Show this help text
Available commands:
parse Parse expression to AST
compile Parse and compile expression to bytecode
disassemble Disassemble bytecode to opcodes
decompile Disassemble and decompile bytecode to expression
interpret-ast Parse expression and interpret AST
interpret-bytecode Parse, compile and assemble expression, and
interpret bytecode
run Run bytecode
generate Generate a random arithmetic expression
$ arith-vm parse -h
Usage: arith-vm parse [FILE]
Parse expression to AST
Available options:
FILE Input file, pass - to read from STDIN (default)
-h,--help Show this help text
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm parse
( let x = 1 in ( let y = 2 in ( y + ( x * 3 ) ) ) )
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm compile > a.tbc
$ hexdump -C a.tbc
00000000 00 01 00 00 02 00 03 01 03 00 00 03 00 06 04 02 |................|
00000010 01 02 01 |...|
00000013
$ arith-vm disassemble a.tbc
OPush 1
OPush 2
OGet 1
OGet 0
OPush 3
OMul
OAdd
OSwap
OPop
OSwap
OPop
$ arith-vm decompile a.tbc
( let v0 = 1 in ( let v1 = 2 in ( v1 + ( v0 * 3 ) ) ) )
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-ast
5
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-bytecode
5
$ arith-vm run a.tbc
5
$ arith-vm generate
(
(
(
( let nD =
( 11046 - -20414 ) in
( let xqf = ( -15165 * nD ) in nD )
) * 26723
) /
(
( let phMuOI =
( let xQ = ( let mmeBy = -28095 in 22847 ) in 606 ) in 25299
) *
( let fnoNQm = ( let mzZaZk = 29463 in 18540 ) in ( -2965 / fnoNQm ) )
)
) * 21400
)
We can parse an expression, or compile it to bytecode. We can also disassemble bytecode to opcodes, or decompile it back to an expression. We can interpret an expression either as an AST or as bytecode. We can also run a bytecode file directly. Finally, we have a handy command to generate random expressions for testing/benchmarking purposes2.
Let’s start.
Expressions
Since this is Haskell, we start with listing many language extensions and imports:
We use the GHC2021 extension here that enables a lot of useful GHC extensions by default. We are using the bytestring and attoparsec libraries for parsing, dlist and containers for compilation, deepseq, mtl and vector for interpreting, and QuickCheck for testing.
dataExpr=Num!Int16|Var!Ident|BinOp!OpExprExpr|Let!IdentExprExprderiving (Eq, Generic)newtypeIdent=Ident {unIdent ::BS.ByteString} deriving (Eq, Ord, Generic)dataOp=Add|Sub|Mul|Divderiving (Eq, Enum, Generic)instanceNFDataExprinstanceShowExprwhereshow= \caseNum n ->show nVar (Ident x) -> BSC.unpack xBinOp op a b ->"("<>show a <>" "<>show op <>" "<>show b <>")"Let (Ident x) a b ->"(let "<> BSC.unpack x <>" = "<>show a <>" in "<>show b <>")"instanceNFDataIdentinstanceShowIdentwhereshow (Ident x) = BSC.unpack xmkIdent ::String->IdentmkIdent =Ident. BSC.packinstanceNFDataOpinstanceShowOpwhereshow= \caseAdd->"+"Sub->"-"Mul->"*"Div->"/"
ArithVMLib.hs
We add Show instances for ADTs so that we can pretty-print the parsed AST3. Now, we can start parsing.
expr ::= term | term space*("+"|"-") termterm ::= factor | factor space*("*"|"/") factorfactor ::= space*(grouping | num | var | let)grouping ::="(" expr space*")"num ::="-"?[1-9][0-9]*var ::= identident ::=([a-z]|[A-Z])+let ::="let" space+ ident space*"=" expr space*"in" space+ expr space*space ::=" "|"\t"|"\n"|"\f"|"\r"
The expr, term, factor, and grouping productions take care of having the right precedence of arithmetic operations. The num and var productions are trivial. Our language is fairly oblivious of whitespaces; we allow zero-or-more spaces at most places.
The let expressions grammar is pretty standard, except we require one-or-more spaces after the let and in keywords to make them unambiguous.
We use the parser combinator library attoparsec for creating the parser. attoparsec works directly with bytestrings so we don’t incur the cost of decoding unicode characters45.
We write the parser in a top-down fashion, same as the grammar, starting with the expr parser:
-- expr ::= term | term space* ("+" | "-") termexprParser ::P.ParserExprexprParser = chainBinOps termParser $ \case'+'->pureAdd'-'->pureSub op ->fail$"Expected '+' or '-', got: "<>show op-- term ::= factor | factor space* ("*" | "/") factortermParser ::P.ParserExprtermParser = chainBinOps factorParser $ \case'*'->pureMul'/'->pureDiv op ->fail$"Expected '*' or '/', got: "<>show opchainBinOps ::P.ParserExpr-> (Char->P.ParserOp) ->P.ParserExprchainBinOps operandParser operatorParser = operandParser >>= restwhere rest !expr = ( do P.skipSpace c <- P.anyChar operator <- operatorParser c operand <- operandParser rest $BinOp operator expr operand ) <|>pure expr{-# INLINE chainBinOps #-}
ArithVMLib.hs
Both exprParser and termParser chain the right higher precedence parsers with the right operators between them6 using the chainBinOps combinator.
-- factor ::= space* (grouping | num | var | let)factorParser ::P.ParserExprfactorParser =do P.skipSpace P.peekChar' >>= \case'('-> groupingParser'-'-> numParser c | P.isDigit c -> numParser c | c /='l'-> varParser _ -> varParser <|> letParser-- grouping ::= "(" expr space* ")"groupingParser ::P.ParserExprgroupingParser = P.char '('*> exprParser <* P.skipSpace <* P.char ')'
ArithVMLib.hs
factorParser uses lookahead to dispatch between one of the primary parsers, which is faster than using backtracking. groupingParser simply skips the parenthesis, and recursively calls exprParser.
-- num ::= "-"? [1-9] [0-9]*numParser ::P.ParserExprnumParser =do n <- P.signed P.decimal P.<?>"number"if validInt16 nthenpure$Num$fromIntegral nelsefail$"Expected a valid Int16, got: "<>show nwhere validInt16 ::Integer->Bool validInt16 i =fromIntegral (minBound@Int16) <= i&& i <=fromIntegral (maxBound@Int16)
ArithVMLib.hs
numParser uses the signed and decimal parsers from the attoparsec library to parse an optionally signed integer. We restrict the numbers to 2-byte integers (-32768–32767 inclusive)7. The <?> helper from attoparsec names parsers so that the error message shown in case of failures point to the right parser.
varParser and identParser are straightforward. We restrict identifiers to upper-and-lowercase ASCII alphabetic letters. We also check that our reserved keywords (let and in) are not used as identifiers.
Finally, we write the parser for let expressions:
-- let ::= "let" space+ ident space* "=" expr space* "in" space+ expr space*letParser ::P.ParserExprletParser =do expect "let"<* skipSpace1!x <- identParser P.skipSpace *> expect "=" assign <- exprParser P.skipSpace *> expect "in"<* skipSpace1 body <- exprParser <* P.skipSpacepure$Let x assign bodywhere expect s = void (P.string s) <|>do found <- P.manyTill P.anyChar (void P.space <|> P.endOfInput)let found' =if found ==""then"end-of-input"else"\""<> found <>"\""fail$"Expected: \""<> BSC.unpack s <>"\", got: "<> found' skipSpace1 = P.space *> P.skipSpace
ArithVMLib.hs
In letParser we use identParser to parse the variable name, and recursively call exprParser to parse the assignment and body expressions, while making sure to correctly parse the spaces. The helper parser expect is used to parse known string tokens (let, = and in), and provide good error messages in case of failures. Talking about error messages …
Error Handling
Let’s figure out an error handling strategy. We use an Error type wrapped in Either to propagate the errors in our program:
The Error type also captures the Pass in which the error is thrown. Result is a type alias that represents either an error or a result. Finally, we put all the parsers together to write the parse function.
The Parser
Our parse function uses the parse function from attoparsec to run the exprParser over an input.
The processResult function deals with intricacies of how attoparsec returns the parsing result. Basically, we inspect the returned result and throw appropriate errors with useful error messages. We use throwError from the MonadError typeclass that works with all its instances, which Either is one of.
The parser is done. But as good programmers, we must make sure that it works correctly. Let’s write some unit tests.
Testing the Parser
We use the hspec library to write unit tests for our program. Each test is written as a spec8.
{-# LANGUAGE GHC2021 #-}{-# LANGUAGE OverloadedStrings #-}moduleMain (main) whereimportArithVMLibimportControl.Monad (forM_, (>=>))importData.ByteString.Char8qualifiedasBSCimportData.Int (Int16)importData.SequencequalifiedasSeqimportTest.HspecimportTest.Hspec.QuickCheckimportTest.QuickCheckqualifiedasQparserSpec ::SpecparserSpec = describe "Parser"$do forM_ parserSuccessTests $ \(input, result) -> it ("parses: \""<> BSC.unpack input <>"\"") $do (show<$> parse input) `shouldBe`Right result forM_ parserErrorTests $ \(input, err) -> it ("fails for: \""<> BSC.unpack input <>"\"") $do parse input `shouldSatisfy` \caseLeft (ErrorParse msg) | err == msg ->True _ ->FalseparserSuccessTests :: [(BSC.ByteString, String)]parserSuccessTests = [ ( "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1","((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)" ), ( "1+2-3*4+5/6/0+1","((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)" ), ( "1 + -1","(1 + -1)" ), ( "let x = 4 in x + 1","(let x = 4 in (x + 1))" ), ( "let x=4in x+1","(let x = 4 in (x + 1))" ), ( "let x = 4 in let y = 5 in x + y","(let x = 4 in (let y = 5 in (x + y)))" ), ( "let x = 4 in let y = 5 in x + let z = y in z * z","(let x = 4 in (let y = 5 in (x + (let z = y in (z * z)))))" ), ( "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z","(let x = 4 in ((let y = 5 in (x + 1)) + (let z = 2 in (z * z))))" ), ( "let x=4in 2+let y=x-5in x+let z=y+1in z/2","(let x = 4 in (2 + (let y = (x - 5) in (x + (let z = (y + 1) in (z / 2))))))" ), ( "let x = (let y = 3 in y + y) in x * 3","(let x = (let y = 3 in (y + y)) in (x * 3))" ), ( "let x = let y = 3 in y + y in x * 3","(let x = (let y = 3 in (y + y)) in (x * 3))" ), ( "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3","(let x = (let y = (1 + (let z = 2 in (z * z))) in (y + 1)) in (x * 3))" ) ]parserErrorTests :: [(BSC.ByteString, String)]parserErrorTests = [ ("", "Not enough input"), ("1 +", "Leftover input: \"+\""), ("1 & 1", "Leftover input: \"& 1\""), ("1 + 1 & 1", "Leftover input: \"& 1\""), ("1 & 1 + 1", "Leftover input: \"& 1 + 1\""), ("(", "Not enough input"), ("(1", "Expected: ')', got: end-of-input"), ("(1 + ", "Expected: ')', got: \"+\""), ("(1 + 2", "Expected: ')', got: end-of-input"), ("(1 + 2}", "Expected: ')', got: \"}\""), ("66666", "Expected a valid Int16, got: 66666"), ("-x", "Expected: number, got: \"-x\""), ("let 1", "Expected: identifier, got: \"1\""), ("let x = 1 in ", "Not enough input"), ( "let let = 1 in 1","Expected identifier, got: \"let\", which is a reversed keyword" ), ( "let x = 1 in in","Expected identifier, got: \"in\", which is a reversed keyword" ), ("let x=1 inx", "Expected: space, got: \"x\""), ("letx = 1 in x", "Leftover input: \"= 1 in x\""), ("let x ~ 1 in x", "Expected: \"=\", got: \"~\""), ("let x = 1 & 2 in x", "Expected: \"in\", got: \"&\""), ("let x = 1 inx", "Expected: space, got: \"x\""), ("let x = 1 in x +", "Leftover input: \"+\""), ("let x = 1 in x in", "Leftover input: \"in\""), ("let x = let x = 1 in x", "Expected: \"in\", got: end-of-input") ]
ArithVMSpec.hs
We have a bunch of tests for the parser, testing both success and failure cases. Notice how spaces are treated in the expressions. Also notice how the let expressions are parsed. We’ll add property-based tests for the parser in the next post.
There is not much we can do with the parsed ASTs at this point. Let’s write an interpreter to evaluate our ASTs.
The AST Interpreter
The AST interpreter is a standard and short recursive interpreter with an environment mapping variables to their values:
interpretAST ::Expr->ResultInt16interpretAST = go Map.emptywhere go env = \caseNum n ->pure nVar x ->case Map.lookup x env ofJust v ->pure vNothing-> throwError .ErrorInterpretAST$"Unknown variable: "<> BSC.unpack (unIdent x)BinOp op a b ->do!a' <- go env a!b' <- go env b interpretOp InterpretAST a' b' opLet x assign body ->do!val <- go env assign go (Map.insert x val env) bodyinterpretOp :: (MonadErrorError m) =>Pass->Int16->Int16->Op-> m Int16interpretOp pass a b = \caseAdd->pure$! a + bSub->pure$! a - bMul->pure$! a * bDiv| b ==0-> throwError $Error pass "Division by zero"Div| b == (-1) && a ==minBound-> throwError $Error pass "Arithmetic overflow"Div->pure$! a `div` b{-# INLINE interpretOp #-}
ArithVMLib.hs
This interpreter serves both as a performance baseline for the bytecode VM we write later, and as a definitional interpreter for testing the VM. We extract the interpretOp helper function for later reuse9. interpretOp is careful in detecting division-by-zero and arithmetic overflow errors, but we ignore possible integer overflow/underflow errors that may be caused by the arithmetic operations.
Testing the Interpreter
We write some unit tests for the interpreter following the same pattern as the parser:
astInterpreterSpec ::SpecastInterpreterSpec = describe "AST interpreter"$do forM_ astInterpreterSuccessTests $ \(input, result) -> it ("interprets: \""<> BSC.unpack input <>"\"") $do parseInterpret input `shouldBe`Right result forM_ astInterpreterErrorTests $ \(input, err) -> it ("fails for: \""<> BSC.unpack input <>"\"") $do parseInterpret input `shouldSatisfy` \caseLeft (ErrorInterpretAST msg) | err == msg ->True _ ->Falsewhere parseInterpret = parse >=> interpretASTastInterpreterSuccessTests :: [(BSC.ByteString, Int16)]astInterpreterSuccessTests = [ ("1", 1), ("1 + 2 - 3 * 4 + 5 / 6 / 1 + 1", -8), ("1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)", -3), ("1 + -1", 0), ("1 * -1", -1), ("let x = 4 in x + 1", 5), ("let x = 4 in let y = 5 in x + y", 9), ("let x = 4 in let y = 5 in x + let z = y in z * z", 29), ("let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z", 13), ("let x = let y = 3 in y + y in x * 3", 18), ("let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3", 18) ]astInterpreterErrorTests :: [(BSC.ByteString, String)]astInterpreterErrorTests = [ ("x", "Unknown variable: x"), ("let x = 4 in y + 1", "Unknown variable: y"), ("let x = y + 1 in x", "Unknown variable: y"), ("let x = x + 1 in x", "Unknown variable: x"), ("1/0", "Division by zero"), ("-32768 / -1", "Arithmetic overflow") ]
ArithVMSpec.hs
Now, we can run the parser and interpreter tests to make sure that everything works correctly.
main ::IO ()main = hspec $do parserSpec astInterpreterSpec
ArithVMSpec.hs
Output of the test run
$ cabal test -O2
Running 1 test suites...
Test suite specs: RUNNING...
Parser
parses: "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1" [✔]
parses: "1+2-3*4+5/6/0+1" [✔]
parses: "1 + -1" [✔]
parses: "let x = 4 in x + 1" [✔]
parses: "let x=4in x+1" [✔]
parses: "let x = 4 in let y = 5 in x + y" [✔]
parses: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
parses: "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z" [✔]
parses: "let x=4in 2+let y=x-5in x+let z=y+1in z/2" [✔]
parses: "let x = (let y = 3 in y + y) in x * 3" [✔]
parses: "let x = let y = 3 in y + y in x * 3" [✔]
parses: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "" [✔]
fails for: "1 +" [✔]
fails for: "1 & 1" [✔]
fails for: "1 + 1 & 1" [✔]
fails for: "1 & 1 + 1" [✔]
fails for: "(" [✔]
fails for: "(1" [✔]
fails for: "(1 + " [✔]
fails for: "(1 + 2" [✔]
fails for: "(1 + 2}" [✔]
fails for: "66666" [✔]
fails for: "-x" [✔]
fails for: "let 1" [✔]
fails for: "let x = 1 in " [✔]
fails for: "let let = 1 in 1" [✔]
fails for: "let x = 1 in in" [✔]
fails for: "let x=1 inx" [✔]
fails for: "letx = 1 in x" [✔]
fails for: "let x ~ 1 in x" [✔]
fails for: "let x = 1 & 2 in x" [✔]
fails for: "let x = 1 inx" [✔]
fails for: "let x = 1 in x +" [✔]
fails for: "let x = 1 in x in" [✔]
fails for: "let x = let x = 1 in x" [✔]
AST interpreter
interprets: "1" [✔]
interprets: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
interprets: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
interprets: "1 + -1" [✔]
interprets: "1 * -1" [✔]
interprets: "let x = 4 in x + 1" [✔]
interprets: "let x = 4 in let y = 5 in x + y" [✔]
interprets: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
interprets: "let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z" [✔]
interprets: "let x = let y = 3 in y + y in x * 3" [✔]
interprets: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "x" [✔]
fails for: "let x = 4 in y + 1" [✔]
fails for: "let x = y + 1 in x" [✔]
fails for: "let x = x + 1 in x" [✔]
fails for: "1/0" [✔]
fails for: "-32768 / -1" [✔]
Finished in 0.0061 seconds
53 examples, 0 failures
Test suite specs: PASS
Awesome, it works! That’s it for this post. In the next part, we write a bytecode compiler for our expression AST.
If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!