Generalizing recursive descent

11.03.2023

The recursive descent parser is the first algorithm that comes to mind when your task is to write expression parser, so the example grammar of a simple language that can add multiple things will look like this:

digit  = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
number = digit {digit};

term   = factor { "*" factor };
factor = number
       | "(" expression ")"
       | "-" factor;

expression = term { "+" term };

Whitespaces are ignored. This will parse into the following expressions:

  • 1+2*3 as (+ 1 (* 2 3))
  • 1+2*4+3 as (+ (+ 1 (* 2 4)) 3)

And so on, grammar is really primitive, so further examples are unnecessary. Main idea of the algorithm is that items of lower precedence are compound of higher precedence items, like so:

t - term
f - factor
p - exponentiation

     t   +   t
     |       |
    p*p     p*p
    | |     | |
     |       |
    f^f     f^f

I added exponentiation to make the schema easier to understand. But what if we want

  • Arbitrary number of operators?
  • Arbitrary number of precedences?

It can be useful to be able to easily tweak the parser or build a more flexible language, so what?

Generalization

The task is to write mathematical expression evaluator that can:

  • Handle arbitrary number of operators
  • Handle arbitrary number of precedences

So the exact algorithm is: write tokenizer (1), write a parser (2), evaluate the result (3)

1. Writing a tokenizer

Tokenizer is the easiest part of writing an evaluator, we just need to break up our text into tokens:

import Data.Char ( isDigit )

data BType = Open | Close
           deriving Show

newtype Operator = Operator String
                 deriving(Show, Eq, Ord) -- Ord will be needed later

data Token = TOperator Operator
           | TNumber   Integer
           | TBracket  BType
           deriving Show

compoundOperatorChars :: String
compoundOperatorChars = "+-*<>/!?"

tokenize :: String -> [Token]

-- Skipping the whitespaces
tokenize (h:t) | h `elem` " \t" =
       tokenize t

-- Brackets
tokenize ('(':t) = TBracket Open  : tokenize t
tokenize (')':t) = TBracket Close : tokenize t

tokenize (h:t) | h `elem` compoundOperatorChars =
       TOperator compound : tokenize tail'
       where (rest, tail') = span (`elem` compoundOperatorChars) t
             compound      = Operator $ h:rest

-- Parsing digits
tokenize (h:t) | isDigit h =
       token : tokenize tail'
       where (rest, tail') = span isDigit t
             token         = TNumber $ read (h:rest)

tokenize (h:t) = error $ "Unexpected char " ++ [h]
tokenize [] = []

Demonstration

ghci> tokenize "1 + 2 + 3 * 4 <> 5"
[TNumber 1,TOperator "+",TNumber 2,TOperator "+",TNumber 3,TOperator "*",TNumber 4,TOperator "<>",TNumber 5]

So far so good, it will parse:

  • Operators compound of +-*<>/!? characters, for example: <$>, <!>, <>, >, <, */ and so on
  • Integer literals: 1, 2, 3, 4, …
  • Brackets

2. Writing a parser

Main objective of this article, basically our parser will consist of two components:

  1. Precedence store - where to store
  2. Expression parser - main logic of the parser

NOTE: The parser will not implement unary operations because they are pretty easy to tweak the parser to implement.

What’s a Precedence store?

Just a map with some additional quirks for hierarchy:

import qualified Data.Map       as M
import qualified Data.Bifunctor as Bi

import Prelude hiding ( scope, lookup )

type Precedence = Integer
data Store = Store { operators   :: M.Map Operator   Precedence
                   , scope       :: M.Map Precedence Integer
                   }
           deriving Show

splitMin :: Store -> Maybe (Precedence, Store)
splitMin (Store operators scope) =
       -- M.keys returns keys in the ascending order
       -- so the first returned key is the least
       case M.keys scope of
           [] -> Nothing
           key:_ ->
              -- Here' we remove the least precedence from scope
              -- if number of operators with that precedence is 1
              -- or just subtract one, if there's more\
              let scope' = M.updateWithKey f key scope
                  f _ v  = if v == 1 then Nothing else Just (v - 1)
              in Just (key, Store operators scope')

lookup :: Operator -> Store -> Precedence
lookup op (Store operators _) =
    case M.lookup op operators of
       Just r -> r
       Nothing -> error $ "No such operator" ++ show op

fromList :: [( Operator, Precedence )] -> Store
fromList list' =
       let -- Construct map of operator:precedence
           operators' = M.fromList list'
           -- Create so-called "scope"
           -- it's just map of precedence:number-of-operators
           scope'     = foldl f M.empty list'

           f map' (op, prec') =
              -- insert or add 1 to existing entry in the scope map
              M.insertWith (+) prec' 1 map'
       in Store operators' scope'

Wait, why is scope even needed? Well, that’s the key aspect of generalization, I call it “splitting”. Still remember the idea of recursive descent? The lower precedence items are compound of higher precedence, the general algorithm will do the following:

  • Pick lowest precedence from the store
  • Remove it from the scope
  • Process it

And it will work… As far as your operators map doesn’t contain operators with same precedence:

1 + 2 - 3

It will successfuly parse (+ 1 2) and return “- 3” as the tail, because - has same precedence as the +. Why is that so? I’ll explain it in further detail later, so pay attention!

Actual parser

We need exactly two things:

  • factor - items with the highest possible precedence
  • expression - items with lower precedence

or not two, there’s one more thing: AST representation and eval function.

data Expr = EBinary Operator Expr Expr
          | ENumber Integer
          deriving Show

-- The function that will reduce two expressions to one
-- (perform binary operation)
type BinEvalFn = Operator -> Integer -> Integer -> Integer

eval :: BinEvalFn -> Expr -> Integer
eval reduce_binary expr =
    case expr of
       EBinary operator lhs rhs ->
           reduce_binary operator (eval' lhs) (eval' rhs)
       ENumber number ->
           number
    where eval' = eval reduce_binary
  1. Skeleton of the expression function will look like this:
type Tailed a = Maybe (a, [Token])

expression :: Store -> [Token] -> Tailed Expr
expression store =
    case splitMin store of
       Just (precedence, whats_left) ->
           -- We still have ways to go down
           undefined
       Nothing ->
           -- Here we on the factor's precedence
           undefined

Key aspect is the case splitMin store of part: splitMin returns lowest precedence on the Store, returns it as the head and what’s left as the tail, so we’ll go down from lowest to highest precedences.

Adding factor:

type Tailed a = Maybe (a, [Token])

-- This function would parse most fundamental things in the expression:
-- 1. numbers
-- 2. expressions in the brackets
--
-- Referencing our first grammar, it's basically the equivalent to:
-- factor = digit {digit}
--        | "(" expression ")";
factor :: Store -> [Token] -> Tailed Expr
factor store (h:t) =
    case h of
       TNumber number -> Just ( ENumber number
                              , t
                              )
       TBracket Open  -> do
           (expr, t') <- expression store t
           case t' of
              -- Expect next token to be a closing bracket
              TBracket Close:t'' ->
                  Just ( expr
                       , t'' )
              -- Fail if it isn't
              _ -> Nothing

       
       -- Unexpected token
       _ -> Nothing
factor _ [] = Nothing

type ParseF = [Token] -> Tailed Expr

expression :: Store -> [Token] -> Tailed Expr
expression store =
    -- Cut head off the store to get least precedence
    case splitMin store of
       Just (precedence, whats_left) ->
           -- We still have ways to go down
           -- whats_left will contain the remaining precedences
           binary precedence $ expression whats_left
       Nothing ->
           -- And eventually we'll fall here, 
           -- On the factor's precedence - highest precedence possible
           factor store
    where
       -- Binary function parses binary expressions, obviously
       -- a {<operator> b}
       -- ^  ^^^^^^^^^^^^
       -- term    |
       --         +-- Any number of operations with the same operator
       binary :: Integer -> ParseF -> [Token] -> Tailed Expr
       binary current_precedence parse tokens =
           parse tokens >>= \(lhs, t) ->
              case t of
                  -- Next token should be infix operator
                  TOperator operator:t' ->
                     -- If we're not parsing operator with that precedence
                     -- return only lhs
                     if lookup operator store == current_precedence then
                         do
                             (rhs, t'') <- parse t'
                             Just ( EBinary operator lhs rhs
                                  , t'' )
                     else
                         Just ( lhs
                              , t )

                  -- Or it's just the end, for example:
                  -- 2: from + to * we got only factor
                  -- but it's still OK
                  _ -> Just ( lhs
                            , t )

But wait, will the binary parse correctly only something like a + b and a + b + c would parse as (+ a b) and return "+ c" as the tail? Definitely! So taking sequential operator applications into account are required as well:

type Tailed a = Maybe (a, [Token])

factor :: Store -> [Token] -> Tailed Expr
factor store (h:t) =
    case h of
       TNumber number -> Just ( ENumber number
                              , t
                              )
       TBracket Open  -> do
           (expr, t') <- expression store t
           case t' of
              -- Expect next token to be a closing bracket
              TBracket Close:t'' ->
                  Just ( expr
                       , t'' )
              -- Fail if it isn't
              _ -> Nothing

       
       -- Unexpected token
       _ -> Nothing
factor _ [] = Nothing

type ParseF = [Token] -> Tailed Expr

-- Function that consumes left hand side expression
-- and remaining tokens and returns pair of resulting expression and the tail
type FoldFn = Expr -> [Token] -> Tailed Expr

expression :: Store -> [Token] -> Tailed Expr
expression store =
    -- Result of this case would be curried
    -- So actually we return `[Token] -> Tailed Expr`
    case splitMin store of
       Just (precedence, whats_left) ->
           -- We still have ways to go down
           binary precedence $ expression whats_left
       Nothing ->
           -- Here we on the factor's precedence
           factor store
    where
       binary :: Integer -> ParseF -> [Token] -> Tailed Expr
       binary current_precedence parse tokens =
           parse tokens >>= \(lhs, t) ->
              case t of
                  -- Next token should be infix operator
                  TOperator operator:t' ->
                     -- If we're not parsing operator with that precedence
                     -- return only lhs
                     if lookup operator store == current_precedence then
                         -- Before we met first operator
                         foldlExpr lhs (leftFoldFn parse $ expectSameOp operator) t
                     else
                         Just ( lhs
                              , t )

                  -- Or it's just the end, for example:
                  -- 2: from + to * we got only factor
                  -- but it's still OK
                  _ -> Just ( lhs
                            , t )

       leftFoldFn :: ParseF -> ([Token] -> Tailed Operator) -> Expr -> [Token] -> Tailed Expr
       leftFoldFn parse expectOperator lhs tokens = do
           (op, t) <- expectOperator tokens
           (rhs, t') <- parse t

           Just ( EBinary op lhs rhs
                , t' )
       expectSameOp :: Operator -> [Token] -> Tailed Operator
       expectSameOp op (TOperator got_op:t)
           | got_op == op = Just (got_op, t)
       expectSameOp _ _   = Nothing

       -- This is just modification of the standard `foldl`
       -- Folds multiple expressions into one
       foldlExpr :: Expr -> FoldFn -> [Token] -> Tailed Expr
       foldlExpr expr fold_fn tokens =
           case fold_fn expr tokens of
              Just (expr', tail') ->
                  foldlExpr expr' fold_fn tail'
              Nothing -> Just (expr, tokens)

For comfortable testing we’ll need to add this:

defaultStore :: Store
defaultStore = fromList [(Operator "+", 1), (Operator "-", 1), (Operator "*", 2)]

parseText :: String -> Tailed Expr
parseText = expression defaultStore . tokenize

-- Parses and returns S-expression
parseTextAsSExpr :: String -> Tailed String 
parseTextAsSExpr =
    fmap (Bi.first toSExpr) . parseText

toSExpr :: Expr -> String
toSExpr (EBinary (Operator op) lhs rhs) =
    "(" ++ op ++ " " ++ toSExpr lhs ++ " " ++ toSExpr rhs ++ ")"
toSExpr (ENumber number) =
    show number

Testing:

ghci> parseTextAsSExpr "1 + 2 * 3"
Just ("(+ 1 (* 2 3))",[])
ghci> parseTextAsSExpr "1 + 2 + 3*3 + 4"
Just ("(+ (+ (+ 1 2) (* 3 3)) 4)",[])

Nice! Works as expected, but what if we try use brackets?

ghci> parseTextAsSExpr "2 + 2 * 2"
Just ("(+ 2 (* 2 2))",[])
ghci> parseTextAsSExpr "(2 + 2) * 2"
Nothing
ghci>

What? It fails, what if…

ghci> parseTextAsSExpr "(2) * 2"
Just ("(* 2 2)",[])
ghci>

Yeah, so just essentially our problem is factor, going in the place…

expression :: Store -> [Token] -> Tailed Expr
expression store =
    -- Result of this case would be curried
    -- So actually we return `[Token] -> Tailed Expr`
    case splitMin store of
       Just (precedence, whats_left) ->
           -- We still have ways to go down
           binary precedence $ expression whats_left
           --                  ^^^^^^^^^^^^^^^^^^^^^
           --                        Cutted off
       Nothing ->
           -- Here we on the factor's precedence
           factor store
           --     ^^^^^

Nah, our store’s scope is empty at the moment where it gets to the factor and factor internally can call an expression, the result is that nested expressions can only be factors:

ghci> parseTextAsSExpr "((2)) * 2"
Just ("(* 2 2)",[])
ghci> parseTextAsSExpr "(((((2))))) * 2"
Just ("(* 2 2)",[])
ghci>

The solution is just to keep track of the “root” scope:

import qualified Data.Map       as M
import qualified Data.Bifunctor as Bi

import Prelude hiding ( scope, lookup )

type Precedence = Integer
data Store = Store { operators   :: M.Map Operator   Precedence
                   , scope       :: M.Map Precedence Integer
                   , root        :: M.Map Precedence Integer
                   }
           deriving Show

splitMin :: Store -> Maybe (Precedence, Store)
splitMin (Store operators scope root) =
       -- M.keys returns keys in the ascending order
       -- so the first returned key is the least
       case M.keys scope of
           [] -> Nothing
           key:_ ->
              -- Here' we remove the least precedence from scope
              -- if number of operators with that precedence is 1
              -- or just subtract one, if there's more\
              let scope' = M.updateWithKey f key scope
                  f _ v  = if v == 1 then Nothing else Just (v - 1)
              in Just (key, Store operators scope' root)

restoreRoot :: Store -> Store
restoreRoot (Store operators _ root) =
    Store operators root root

lookup :: Operator -> Store -> Precedence
lookup op (Store operators _ _) =
    case M.lookup op operators of
       Just r -> r
       Nothing -> error $ "No such operator" ++ show op

fromList :: [( Operator, Precedence )] -> Store
fromList list' =
       let -- Construct map of operator:precedence
           operators' = M.fromList list'
           -- Create so-called "scope"
           -- it's just map of precedence:number-of-operators
           scope'     = foldl f M.empty list'

           f map' (op, prec') =
              -- insert or add 1 to existing entry in the scope map
              M.insertWith (+) prec' 1 map'
       in Store operators' scope' scope'

Here we just save the root precedences list and chopping plain scope, fixing original code:


-- snip --

expression :: Store -> [Token] -> Tailed Expr
expression store =
    -- Result of this case would be curried
    -- So actually we return `[Token] -> Tailed Expr`
    case splitMin store of
       Just (precedence, whats_left) ->
           -- We still have ways to go down
           binary precedence $ expression whats_left
       Nothing ->
           -- Here we on the factor's precedence
           factor $ restoreRoot store
    -- snip --

Rechecking gives…

ghci> parseTextAsSExpr "(2 + 2) * 2"
Just ("(* (+ 2 2) 2)",[])
ghci> parseTextAsSExpr "(2 + 2 * 2) * 2"
Just ("(* (+ 2 (* 2 2)) 2)",[])
ghci>

The right answer, nice!

3. Evaluate

The last part is evaluation, the easiest, since we’ve done the most

evalText :: String -> Integer
evalText text =
    case parseText text of
       Just (tree, []) ->
           eval evaluateBinary tree
       Just (tree, t) ->
           error $ "Failed to parse entire expr (" ++ toSExpr tree ++ "): tail is " ++ show t
       Nothing ->
           error "Failed to parse expression"
    where
        evaluateBinary op lhs rhs =
            case op of
                Operator "+" -> lhs + rhs
                Operator "-" -> lhs - rhs
                Operator "*" -> lhs * rhs

let’s test it:

ghci> evalText "2 + 2 * 2"
6
ghci> evalText "(2 + 2) * 2"
8
ghci> evalText "3*3 + 4*4"
25
ghci> evalText "2 + 2 + 2 - 2"
4
ghci>

Works as expected

Further reading

  1. Crafting Interpreters book - easy to start
  2. Simple but Powerful Pratt Parsing - matklad’s article, alternative to the recursive descent, can be generalized the same way

At the end

As you can see recursive descent is pretty easily can be generalized and, as said in the craftinginterpreters book: It rocks.

Now go write your own language for the great good! Resulting code you can find on the gist