Recursive Descent Parsing

2026-01-29

Recursive descent parsing is a top-down parsing technique where each non-terminal in the grammar is implemented as a recursive procedure. The parser begins with the start symbol and recursively expands non-terminals according to production rules, consuming input tokens in a left-to-right scan with lookahead to select appropriate productions.

Characteristics

Top-down construction: Builds the parse tree from the root (start symbol) down to the leaves (terminals), in contrast to bottom-up parsers which work from leaves to root.
Predictive parsing: Uses lookahead tokens to deterministically select which production to apply
Direct grammar mapping: Each non-terminal maps to a parsing function, making the implementation structurally isomorphic to the grammar specification.
Left-to-right, leftmost derivation: Processes input tokens sequentially from left to right while constructing a leftmost derivation of the input string.

Step by Step

1. One Function Per Grammar Rule

Each grammar rule becomes a function. When a rule says “do A then B”, the function calls A’s function, then B’s function. When a rule says “do A or B”, the function looks at the next token to decide which one to call. The “recursive” part happens when rules reference each other in circles.

Grammar:
  expression → term (('+' | '-') term)*
  term → number

Parser:
  function expression() { ... }
  function term() { ... }

2. Reading the Input

The parser keeps track of:

Current token (what we’re looking at right now)
A way to get the next token (from the lexer)

parser:
  current_token = first token from lexer
  
  function advance():
    current_token = get next token from lexer

3. Matching Terminals

When the grammar says expect this specific token, we check and consume it:

function eat(expected_token_type):
  if current_token.type == expected_token_type:
    advance()  // Move to next token
  else:
    error("Expected " + expected_token_type)

This is like checking: “Do I have the ingredient I need next?”

4. Handling Sequences

When the grammar says “A then B then C”:

Grammar:
  statement → 'if' expression 'then' body

Parser:
function statement():
  eat('if')
  expression()
  eat('then')
  body()

Just call each part in order.

5. Handling Choices

When the grammar says “A or B or C”:

Grammar:
  value → number | string | boolean

Parser:
function value():
  if current_token is NUMBER:
    return number()
  else if current_token is STRING:
    return string()
  else if current_token is BOOLEAN:
    return boolean()
  else:
    error("Expected value")

Look at the current token to decide which path to take.

6. Handling Repetition

When the grammar says “zero or more of X”:

Grammar:
  list → item (',' item)*

Parser:
function list():
  result = [item()]
  
  while current_token is COMMA:
    eat(COMMA)
    result.append(item())
  
  return result

Keep going while the pattern continues.

7. Handling Optionals

When the grammar says “maybe X”:

Grammar:
  number → ['-'] digits

Parser:
function number():
  negative = false
  
  if current_token is MINUS:
    eat(MINUS)
    negative = true
  
  value = digits()
  
  if negative:
    value = -value
  
  return value

Check if the optional part is there, process it if so.

Python Example

  1# Parse arithmetic expressions
  2
  3from enum import Enum
  4from dataclasses import dataclass
  5import re
  6
  7# ============================================================================
  8# TOKENS (Terminal Symbols)
  9# ============================================================================
 10
 11class TokenType(Enum):
 12    # Literals
 13    NUMBER = 'NUMBER'
 14    
 15    # Operators
 16    PLUS = 'PLUS'
 17    MINUS = 'MINUS'
 18    MULTIPLY = 'MULTIPLY'
 19    DIVIDE = 'DIVIDE'
 20    
 21    # Grouping
 22    LPAREN = 'LPAREN'
 23    RPAREN = 'RPAREN'
 24    
 25    # Special
 26    EOF = 'EOF'
 27
 28@dataclass
 29class Token:
 30    type: TokenType
 31    value: any
 32    position: int  # Where in the input this token starts
 33    
 34    def __repr__(self):
 35        return f'Token({self.type.name}, {repr(self.value)}, pos={self.position})'
 36
 37
 38# ============================================================================
 39# LEXER (Tokenizer) - Recognizes Terminal Symbols
 40# ============================================================================
 41
 42class Lexer:
 43    """
 44    Converts input text into a stream of tokens.
 45    Implements the terminal definitions from our grammar.
 46    """
 47    
 48    def __init__(self, text):
 49        self.text = text
 50        self.pos = 0
 51        self.current_char = self.text[0] if text else None
 52    
 53    def error(self, msg):
 54        raise Exception(f"Lexer error at position {self.pos}: {msg}")
 55    
 56    def advance(self):
 57        """Move to next character"""
 58        self.pos += 1
 59        if self.pos < len(self.text):
 60            self.current_char = self.text[self.pos]
 61        else:
 62            self.current_char = None
 63    
 64    def peek(self, offset=1):
 65        """Look ahead without consuming"""
 66        peek_pos = self.pos + offset
 67        if peek_pos < len(self.text):
 68            return self.text[peek_pos]
 69        return None
 70    
 71    def skip_whitespace(self):
 72        """
 73        WHITESPACE → (' ' | '\t' | '\n' | '\r')+
 74        Whitespace is ignored between tokens
 75        """
 76        while self.current_char and self.current_char in ' \t\n\r':
 77            self.advance()
 78    
 79    def read_number(self):
 80        """
 81        NUMBER → DIGIT+ ('.' DIGIT+)?
 82        DIGIT  → '0' | '1' | ... | '9'
 83        
 84        Reads integers (42) or floats (3.14)
 85        """
 86        start_pos = self.pos
 87        result = ''
 88        
 89        # Read digits before decimal point
 90        while self.current_char and self.current_char.isdigit():
 91            result += self.current_char
 92            self.advance()
 93        
 94        # Check for decimal point
 95        if self.current_char == '.' and self.peek() and self.peek().isdigit():
 96            result += self.current_char
 97            self.advance()
 98            
 99            # Read digits after decimal point
100            while self.current_char and self.current_char.isdigit():
101                result += self.current_char
102                self.advance()
103            
104            return Token(TokenType.NUMBER, float(result), start_pos)
105        
106        return Token(TokenType.NUMBER, int(result), start_pos)
107    
108    def get_next_token(self):
109        """
110        Main lexer method - returns next token from input.
111        Implements recognition of all terminal symbols.
112        """
113        while self.current_char:
114            
115            # Skip whitespace (not a token)
116            if self.current_char in ' \t\n\r':
117                self.skip_whitespace()
118                continue
119            
120            # NUMBER terminal
121            if self.current_char.isdigit():
122                return self.read_number()
123            
124            # PLUS terminal: '+'
125            if self.current_char == '+':
126                token = Token(TokenType.PLUS, '+', self.pos)
127                self.advance()
128                return token
129            
130            # MINUS terminal: '-'
131            if self.current_char == '-':
132                token = Token(TokenType.MINUS, '-', self.pos)
133                self.advance()
134                return token
135            
136            # MULTIPLY terminal: '*'
137            if self.current_char == '*':
138                token = Token(TokenType.MULTIPLY, '*', self.pos)
139                self.advance()
140                return token
141            
142            # DIVIDE terminal: '/'
143            if self.current_char == '/':
144                token = Token(TokenType.DIVIDE, '/', self.pos)
145                self.advance()
146                return token
147            
148            # LPAREN terminal: '('
149            if self.current_char == '(':
150                token = Token(TokenType.LPAREN, '(', self.pos)
151                self.advance()
152                return token
153            
154            # RPAREN terminal: ')'
155            if self.current_char == ')':
156                token = Token(TokenType.RPAREN, ')', self.pos)
157                self.advance()
158                return token
159            
160            # Invalid character - not recognized by any terminal
161            self.error(f"Invalid character '{self.current_char}'")
162        
163        # EOF terminal (end of input)
164        return Token(TokenType.EOF, None, self.pos)
165
166
167# ============================================================================
168# AST NODES (Abstract Syntax Tree)
169# ============================================================================
170
171@dataclass
172class Number:
173    """
174    Represents a numeric literal.
175    Created by: factor → NUMBER
176    """
177    value: float
178    
179    def __repr__(self):
180        return f'Number({self.value})'
181
182@dataclass
183class BinOp:
184    """
185    Represents a binary operation (left op right).
186    Created by: expression → term (op term)*
187                term → factor (op factor)*
188    """
189    left: any
190    op: str
191    right: any
192    
193    def __repr__(self):
194        return f'BinOp({self.left} {self.op} {self.right})'
195
196
197# ============================================================================
198# PARSER (Syntax Analyzer) - Implements Grammar Rules
199# ============================================================================
200
201class Parser:
202    """
203    Recursive descent parser implementing the grammar:
204    
205    expression → term (('+' | '-') term)*
206    term       → factor (('*' | '/') factor)*
207    factor     → NUMBER | '(' expression ')'
208    """
209    
210    def __init__(self, lexer):
211        self.lexer = lexer
212        self.current_token = self.lexer.get_next_token()
213    
214    def error(self, msg):
215        raise Exception(f"Parser error at position {self.current_token.position}: {msg}")
216    
217    def eat(self, token_type):
218        """
219        Consume current token if it matches expected type.
220        Move to next token.
221        """
222        if self.current_token.type == token_type:
223            self.current_token = self.lexer.get_next_token()
224        else:
225            self.error(f"Expected {token_type.name}, got {self.current_token.type.name}")
226    
227    def factor(self):
228        """
229        factor → NUMBER | '(' expression ')'
230        
231        A factor is the most basic unit:
232        - A number literal
233        - An expression wrapped in parentheses
234        """
235        token = self.current_token
236        
237        if token.type == TokenType.NUMBER:
238            self.eat(TokenType.NUMBER)
239            return Number(token.value)
240        
241        elif token.type == TokenType.LPAREN:
242            self.eat(TokenType.LPAREN)
243            node = self.expression()
244            self.eat(TokenType.RPAREN)
245            return node
246        
247        self.error(f"Expected NUMBER or '(', got {token.type.name}")
248    
249    def term(self):
250        """
251        term → factor (('*' | '/') factor)*
252        
253        A term handles multiplication and division (high precedence).
254        Reads: factor, then zero or more (operator factor) pairs.
255        """
256        node = self.factor()
257        
258        while self.current_token.type in (TokenType.MULTIPLY, TokenType.DIVIDE):
259            op_token = self.current_token
260            
261            if op_token.type == TokenType.MULTIPLY:
262                self.eat(TokenType.MULTIPLY)
263            elif op_token.type == TokenType.DIVIDE:
264                self.eat(TokenType.DIVIDE)
265            
266            node = BinOp(left=node, op=op_token.value, right=self.factor())
267        
268        return node
269    
270    def expression(self):
271        """
272        expression → term (('+' | '-') term)*
273        
274        An expression handles addition and subtraction (low precedence).
275        Reads: term, then zero or more (operator term) pairs.
276        
277        This is the start symbol of our grammar.
278        """
279        node = self.term()
280        
281        while self.current_token.type in (TokenType.PLUS, TokenType.MINUS):
282            op_token = self.current_token
283            
284            if op_token.type == TokenType.PLUS:
285                self.eat(TokenType.PLUS)
286            elif op_token.type == TokenType.MINUS:
287                self.eat(TokenType.MINUS)
288            
289            node = BinOp(left=node, op=op_token.value, right=self.term())
290        
291        return node
292    
293    def parse(self):
294        """
295        Entry point for parsing.
296        Parses complete expression and ensures we consumed all input.
297        """
298        ast = self.expression()
299        
300        if self.current_token.type != TokenType.EOF:
301            self.error(f"Unexpected token after expression: {self.current_token}")
302        
303        return ast
304
305
306# ============================================================================
307# INTERPRETER (Evaluator) - Walks AST and Computes Result
308# ============================================================================
309
310class Interpreter:
311    """
312    Walks the AST and evaluates the expression to a numeric result.
313    Uses the Visitor pattern.
314    """
315    
316    def visit(self, node):
317        """Dispatch to appropriate visit method"""
318        if isinstance(node, Number):
319            return self.visit_number(node)
320        elif isinstance(node, BinOp):
321            return self.visit_binop(node)
322        else:
323            raise Exception(f"Unknown node type: {type(node)}")
324    
325    def visit_number(self, node):
326        """Number node: just return its value"""
327        return node.value
328    
329    def visit_binop(self, node):
330        """Binary operation: evaluate left and right, apply operator"""
331        left_val = self.visit(node.left)
332        right_val = self.visit(node.right)
333        
334        if node.op == '+':
335            return left_val + right_val
336        elif node.op == '-':
337            return left_val - right_val
338        elif node.op == '*':
339            return left_val * right_val
340        elif node.op == '/':
341            if right_val == 0:
342                raise Exception("Division by zero")
343            return left_val / right_val
344        else:
345            raise Exception(f"Unknown operator: {node.op}")
346
347
348# ============================================================================
349# HELPER FUNCTIONS
350# ============================================================================
351
352def tokenize(text):
353    """Show all tokens for debugging"""
354    lexer = Lexer(text)
355    tokens = []
356    while True:
357        token = lexer.get_next_token()
358        tokens.append(token)
359        if token.type == TokenType.EOF:
360            break
361    return tokens
362
363def print_ast(node, indent=0):
364    """Pretty print the AST"""
365    spacing = "  " * indent
366    if isinstance(node, Number):
367        print(f"{spacing}Number({node.value})")
368    elif isinstance(node, BinOp):
369        print(f"{spacing}BinOp({node.op})")
370        print_ast(node.left, indent + 1)
371        print_ast(node.right, indent + 1)
372
373def calculate(text, verbose=False):
374    """Main function: lex, parse, and evaluate"""
375    if verbose:
376        print(f"\nInput: {text}")
377        print("\n--- TOKENS ---")
378        tokens = tokenize(text)
379        for token in tokens:
380            print(token)
381    
382    lexer = Lexer(text)
383    parser = Parser(lexer)
384    ast = parser.parse()
385    
386    if verbose:
387        print("\n--- AST ---")
388        print_ast(ast)
389    
390    interpreter = Interpreter()
391    result = interpreter.visit(ast)
392    
393    if verbose:
394        print(f"\n--- RESULT ---")
395        print(f"{text} = {result}\n")
396    
397    return result
398
399
400# ============================================================================
401# TESTS
402# ============================================================================
403
404if __name__ == '__main__':
405    # Test cases
406    test_cases = [
407        "42",
408        "3 + 5",
409        "10 - 2 * 3",
410        "(10 - 2) * 3",
411        "7 + 3 * (10 / (12 / (3 + 1) - 1))",
412        "3.14 * 2",
413        "100 / 4 / 5",
414        "2 + 3 * 4 - 5 / 2",
415    ]
416    
417    print("=" * 60)
418    print("ARITHMETIC PARSER - Test Results")
419    print("=" * 60)
420    
421    for expr in test_cases:
422        result = calculate(expr)
423        print(f"{expr:40} = {result}")
424    
425    print("\n" + "=" * 60)
426    print("VERBOSE EXAMPLE")
427    print("=" * 60)
428    
429    # Show detailed breakdown for one expression
430    calculate("3 + 5 * 2", verbose=True)

#ebnf #parsing #lexer #computer-science #programming