Recursive Descent Parsing
Recursive descent parsing is a top-down parsing technique where each non-terminal in the grammar is implemented as a recursive procedure. The parser begins with the start symbol and recursively expands non-terminals according to production rules, consuming input tokens in a left-to-right scan with lookahead to select appropriate productions.
Characteristics
- Top-down construction: Builds the parse tree from the root (start symbol) down to the leaves (terminals), in contrast to bottom-up parsers which work from leaves to root.
- Predictive parsing: Uses lookahead tokens to deterministically select which production to apply
- Direct grammar mapping: Each non-terminal maps to a parsing function, making the implementation structurally isomorphic to the grammar specification.
- Left-to-right, leftmost derivation: Processes input tokens sequentially from left to right while constructing a leftmost derivation of the input string.
Step by Step
1. One Function Per Grammar Rule
Each grammar rule becomes a function. When a rule says “do A then B”, the function calls A’s function, then B’s function. When a rule says “do A or B”, the function looks at the next token to decide which one to call. The “recursive” part happens when rules reference each other in circles.
Grammar:
expression → term (('+' | '-') term)*
term → number
Parser:
function expression() { ... }
function term() { ... }2. Reading the Input
The parser keeps track of:
- Current token (what we’re looking at right now)
- A way to get the next token (from the lexer)
parser:
current_token = first token from lexer
function advance():
current_token = get next token from lexer3. Matching Terminals
When the grammar says expect this specific token, we check and consume it:
function eat(expected_token_type):
if current_token.type == expected_token_type:
advance() // Move to next token
else:
error("Expected " + expected_token_type)This is like checking: “Do I have the ingredient I need next?”
4. Handling Sequences
When the grammar says “A then B then C”:
Grammar:
statement → 'if' expression 'then' body
Parser:
function statement():
eat('if')
expression()
eat('then')
body()Just call each part in order.
5. Handling Choices
When the grammar says “A or B or C”:
Grammar:
value → number | string | boolean
Parser:
function value():
if current_token is NUMBER:
return number()
else if current_token is STRING:
return string()
else if current_token is BOOLEAN:
return boolean()
else:
error("Expected value")Look at the current token to decide which path to take.
6. Handling Repetition
When the grammar says “zero or more of X”:
Grammar:
list → item (',' item)*
Parser:
function list():
result = [item()]
while current_token is COMMA:
eat(COMMA)
result.append(item())
return resultKeep going while the pattern continues.
7. Handling Optionals
When the grammar says “maybe X”:
Grammar:
number → ['-'] digits
Parser:
function number():
negative = false
if current_token is MINUS:
eat(MINUS)
negative = true
value = digits()
if negative:
value = -value
return valueCheck if the optional part is there, process it if so.
Python Example
1# Parse arithmetic expressions
2
3from enum import Enum
4from dataclasses import dataclass
5import re
6
7# ============================================================================
8# TOKENS (Terminal Symbols)
9# ============================================================================
10
11class TokenType(Enum):
12 # Literals
13 NUMBER = 'NUMBER'
14
15 # Operators
16 PLUS = 'PLUS'
17 MINUS = 'MINUS'
18 MULTIPLY = 'MULTIPLY'
19 DIVIDE = 'DIVIDE'
20
21 # Grouping
22 LPAREN = 'LPAREN'
23 RPAREN = 'RPAREN'
24
25 # Special
26 EOF = 'EOF'
27
28@dataclass
29class Token:
30 type: TokenType
31 value: any
32 position: int # Where in the input this token starts
33
34 def __repr__(self):
35 return f'Token({self.type.name}, {repr(self.value)}, pos={self.position})'
36
37
38# ============================================================================
39# LEXER (Tokenizer) - Recognizes Terminal Symbols
40# ============================================================================
41
42class Lexer:
43 """
44 Converts input text into a stream of tokens.
45 Implements the terminal definitions from our grammar.
46 """
47
48 def __init__(self, text):
49 self.text = text
50 self.pos = 0
51 self.current_char = self.text[0] if text else None
52
53 def error(self, msg):
54 raise Exception(f"Lexer error at position {self.pos}: {msg}")
55
56 def advance(self):
57 """Move to next character"""
58 self.pos += 1
59 if self.pos < len(self.text):
60 self.current_char = self.text[self.pos]
61 else:
62 self.current_char = None
63
64 def peek(self, offset=1):
65 """Look ahead without consuming"""
66 peek_pos = self.pos + offset
67 if peek_pos < len(self.text):
68 return self.text[peek_pos]
69 return None
70
71 def skip_whitespace(self):
72 """
73 WHITESPACE → (' ' | '\t' | '\n' | '\r')+
74 Whitespace is ignored between tokens
75 """
76 while self.current_char and self.current_char in ' \t\n\r':
77 self.advance()
78
79 def read_number(self):
80 """
81 NUMBER → DIGIT+ ('.' DIGIT+)?
82 DIGIT → '0' | '1' | ... | '9'
83
84 Reads integers (42) or floats (3.14)
85 """
86 start_pos = self.pos
87 result = ''
88
89 # Read digits before decimal point
90 while self.current_char and self.current_char.isdigit():
91 result += self.current_char
92 self.advance()
93
94 # Check for decimal point
95 if self.current_char == '.' and self.peek() and self.peek().isdigit():
96 result += self.current_char
97 self.advance()
98
99 # Read digits after decimal point
100 while self.current_char and self.current_char.isdigit():
101 result += self.current_char
102 self.advance()
103
104 return Token(TokenType.NUMBER, float(result), start_pos)
105
106 return Token(TokenType.NUMBER, int(result), start_pos)
107
108 def get_next_token(self):
109 """
110 Main lexer method - returns next token from input.
111 Implements recognition of all terminal symbols.
112 """
113 while self.current_char:
114
115 # Skip whitespace (not a token)
116 if self.current_char in ' \t\n\r':
117 self.skip_whitespace()
118 continue
119
120 # NUMBER terminal
121 if self.current_char.isdigit():
122 return self.read_number()
123
124 # PLUS terminal: '+'
125 if self.current_char == '+':
126 token = Token(TokenType.PLUS, '+', self.pos)
127 self.advance()
128 return token
129
130 # MINUS terminal: '-'
131 if self.current_char == '-':
132 token = Token(TokenType.MINUS, '-', self.pos)
133 self.advance()
134 return token
135
136 # MULTIPLY terminal: '*'
137 if self.current_char == '*':
138 token = Token(TokenType.MULTIPLY, '*', self.pos)
139 self.advance()
140 return token
141
142 # DIVIDE terminal: '/'
143 if self.current_char == '/':
144 token = Token(TokenType.DIVIDE, '/', self.pos)
145 self.advance()
146 return token
147
148 # LPAREN terminal: '('
149 if self.current_char == '(':
150 token = Token(TokenType.LPAREN, '(', self.pos)
151 self.advance()
152 return token
153
154 # RPAREN terminal: ')'
155 if self.current_char == ')':
156 token = Token(TokenType.RPAREN, ')', self.pos)
157 self.advance()
158 return token
159
160 # Invalid character - not recognized by any terminal
161 self.error(f"Invalid character '{self.current_char}'")
162
163 # EOF terminal (end of input)
164 return Token(TokenType.EOF, None, self.pos)
165
166
167# ============================================================================
168# AST NODES (Abstract Syntax Tree)
169# ============================================================================
170
171@dataclass
172class Number:
173 """
174 Represents a numeric literal.
175 Created by: factor → NUMBER
176 """
177 value: float
178
179 def __repr__(self):
180 return f'Number({self.value})'
181
182@dataclass
183class BinOp:
184 """
185 Represents a binary operation (left op right).
186 Created by: expression → term (op term)*
187 term → factor (op factor)*
188 """
189 left: any
190 op: str
191 right: any
192
193 def __repr__(self):
194 return f'BinOp({self.left} {self.op} {self.right})'
195
196
197# ============================================================================
198# PARSER (Syntax Analyzer) - Implements Grammar Rules
199# ============================================================================
200
201class Parser:
202 """
203 Recursive descent parser implementing the grammar:
204
205 expression → term (('+' | '-') term)*
206 term → factor (('*' | '/') factor)*
207 factor → NUMBER | '(' expression ')'
208 """
209
210 def __init__(self, lexer):
211 self.lexer = lexer
212 self.current_token = self.lexer.get_next_token()
213
214 def error(self, msg):
215 raise Exception(f"Parser error at position {self.current_token.position}: {msg}")
216
217 def eat(self, token_type):
218 """
219 Consume current token if it matches expected type.
220 Move to next token.
221 """
222 if self.current_token.type == token_type:
223 self.current_token = self.lexer.get_next_token()
224 else:
225 self.error(f"Expected {token_type.name}, got {self.current_token.type.name}")
226
227 def factor(self):
228 """
229 factor → NUMBER | '(' expression ')'
230
231 A factor is the most basic unit:
232 - A number literal
233 - An expression wrapped in parentheses
234 """
235 token = self.current_token
236
237 if token.type == TokenType.NUMBER:
238 self.eat(TokenType.NUMBER)
239 return Number(token.value)
240
241 elif token.type == TokenType.LPAREN:
242 self.eat(TokenType.LPAREN)
243 node = self.expression()
244 self.eat(TokenType.RPAREN)
245 return node
246
247 self.error(f"Expected NUMBER or '(', got {token.type.name}")
248
249 def term(self):
250 """
251 term → factor (('*' | '/') factor)*
252
253 A term handles multiplication and division (high precedence).
254 Reads: factor, then zero or more (operator factor) pairs.
255 """
256 node = self.factor()
257
258 while self.current_token.type in (TokenType.MULTIPLY, TokenType.DIVIDE):
259 op_token = self.current_token
260
261 if op_token.type == TokenType.MULTIPLY:
262 self.eat(TokenType.MULTIPLY)
263 elif op_token.type == TokenType.DIVIDE:
264 self.eat(TokenType.DIVIDE)
265
266 node = BinOp(left=node, op=op_token.value, right=self.factor())
267
268 return node
269
270 def expression(self):
271 """
272 expression → term (('+' | '-') term)*
273
274 An expression handles addition and subtraction (low precedence).
275 Reads: term, then zero or more (operator term) pairs.
276
277 This is the start symbol of our grammar.
278 """
279 node = self.term()
280
281 while self.current_token.type in (TokenType.PLUS, TokenType.MINUS):
282 op_token = self.current_token
283
284 if op_token.type == TokenType.PLUS:
285 self.eat(TokenType.PLUS)
286 elif op_token.type == TokenType.MINUS:
287 self.eat(TokenType.MINUS)
288
289 node = BinOp(left=node, op=op_token.value, right=self.term())
290
291 return node
292
293 def parse(self):
294 """
295 Entry point for parsing.
296 Parses complete expression and ensures we consumed all input.
297 """
298 ast = self.expression()
299
300 if self.current_token.type != TokenType.EOF:
301 self.error(f"Unexpected token after expression: {self.current_token}")
302
303 return ast
304
305
306# ============================================================================
307# INTERPRETER (Evaluator) - Walks AST and Computes Result
308# ============================================================================
309
310class Interpreter:
311 """
312 Walks the AST and evaluates the expression to a numeric result.
313 Uses the Visitor pattern.
314 """
315
316 def visit(self, node):
317 """Dispatch to appropriate visit method"""
318 if isinstance(node, Number):
319 return self.visit_number(node)
320 elif isinstance(node, BinOp):
321 return self.visit_binop(node)
322 else:
323 raise Exception(f"Unknown node type: {type(node)}")
324
325 def visit_number(self, node):
326 """Number node: just return its value"""
327 return node.value
328
329 def visit_binop(self, node):
330 """Binary operation: evaluate left and right, apply operator"""
331 left_val = self.visit(node.left)
332 right_val = self.visit(node.right)
333
334 if node.op == '+':
335 return left_val + right_val
336 elif node.op == '-':
337 return left_val - right_val
338 elif node.op == '*':
339 return left_val * right_val
340 elif node.op == '/':
341 if right_val == 0:
342 raise Exception("Division by zero")
343 return left_val / right_val
344 else:
345 raise Exception(f"Unknown operator: {node.op}")
346
347
348# ============================================================================
349# HELPER FUNCTIONS
350# ============================================================================
351
352def tokenize(text):
353 """Show all tokens for debugging"""
354 lexer = Lexer(text)
355 tokens = []
356 while True:
357 token = lexer.get_next_token()
358 tokens.append(token)
359 if token.type == TokenType.EOF:
360 break
361 return tokens
362
363def print_ast(node, indent=0):
364 """Pretty print the AST"""
365 spacing = " " * indent
366 if isinstance(node, Number):
367 print(f"{spacing}Number({node.value})")
368 elif isinstance(node, BinOp):
369 print(f"{spacing}BinOp({node.op})")
370 print_ast(node.left, indent + 1)
371 print_ast(node.right, indent + 1)
372
373def calculate(text, verbose=False):
374 """Main function: lex, parse, and evaluate"""
375 if verbose:
376 print(f"\nInput: {text}")
377 print("\n--- TOKENS ---")
378 tokens = tokenize(text)
379 for token in tokens:
380 print(token)
381
382 lexer = Lexer(text)
383 parser = Parser(lexer)
384 ast = parser.parse()
385
386 if verbose:
387 print("\n--- AST ---")
388 print_ast(ast)
389
390 interpreter = Interpreter()
391 result = interpreter.visit(ast)
392
393 if verbose:
394 print(f"\n--- RESULT ---")
395 print(f"{text} = {result}\n")
396
397 return result
398
399
400# ============================================================================
401# TESTS
402# ============================================================================
403
404if __name__ == '__main__':
405 # Test cases
406 test_cases = [
407 "42",
408 "3 + 5",
409 "10 - 2 * 3",
410 "(10 - 2) * 3",
411 "7 + 3 * (10 / (12 / (3 + 1) - 1))",
412 "3.14 * 2",
413 "100 / 4 / 5",
414 "2 + 3 * 4 - 5 / 2",
415 ]
416
417 print("=" * 60)
418 print("ARITHMETIC PARSER - Test Results")
419 print("=" * 60)
420
421 for expr in test_cases:
422 result = calculate(expr)
423 print(f"{expr:40} = {result}")
424
425 print("\n" + "=" * 60)
426 print("VERBOSE EXAMPLE")
427 print("=" * 60)
428
429 # Show detailed breakdown for one expression
430 calculate("3 + 5 * 2", verbose=True)