How it works
The Parser
The parser takes the flat token stream from the lexer and builds a tree that represents the structure of the program - nesting, operator precedence, and the relationships between statements and the expressions they contain.
Input
List[Token]
received from previous stage
Output
Program
passed to next stage
Source
parser.py
295 lines
Why a tree?
Source code has inherent structure. An if statement contains a condition and one or two blocks. A block contains statements. A statement might contain an expression. An expression might contain sub-expressions. This hierarchy of containment is exactly what a tree represents.
The Abstract Syntax Tree (AST) is the parser's answer. It is "abstract" because it drops irrelevant syntax tokens (braces, the word "che", newlines) and keeps only the semantically meaningful structure.
The kemlang-py grammar (BNF)
A context-free grammar defines what sequences of tokens form a valid program.
program ::= "kem bhai" statement* "aavjo bhai"
statement ::= print_stmt
| declaration
| assignment
| if_stmt
| while_stmt
| break_stmt
| continue_stmt
print_stmt ::= "bhai bol" expression
declaration ::= "aa" IDENTIFIER "che" expression
assignment ::= IDENTIFIER "che" expression
break_stmt ::= "tame jao"
continue_stmt::= "aagal vado"
if_stmt ::= "jo" expression block ( "nahi to" block )?
while_stmt ::= "farvu" block "jya sudhi" expression
block ::= "{" statement* "}"
expression ::= comparison
comparison ::= term ( ( "==" | "!=" | "<" | ">" | "<=" | ">=" ) term )*
term ::= factor ( ( "+" | "-" ) factor )*
factor ::= unary ( ( "*" | "/" | "%" ) unary )*
unary ::= "-" unary | primary
primary ::= INTEGER | FLOAT | STRING | BOOLEAN
| "bapu tame bolo"
| IDENTIFIER
| "(" expression ")"Operator precedence
Why does 1 + 2 * 3 equal 7and not 9? The grammar encodes precedence by stratification: lower in the grammar = tighter binding = higher precedence.
expression() lowest precedence (entry point)
└── comparison() == != < > <= >=
└── term() + -
└── factor() * / %
└── unary() - (negation)
└── primary() literals, variables, (expr)
highest precedence
Parsing 1 + 2 * 3 :
term() needs a left operand -> calls factor()
factor() needs left -> calls unary() -> primary() -> Literal(1)
factor() sees no * / % -> returns Literal(1)
term() has left=1, sees '+', needs right -> calls factor()
factor() needs left -> calls unary() -> primary() -> Literal(2)
factor() sees '*' -> calls unary() -> primary() -> Literal(3)
factor() returns Binary(*, Literal(2), Literal(3)) <- tighter!
term() returns Binary(+, Literal(1), Binary(*, 2, 3))
Result tree: 1 + (2 * 3) = 7 correctRecursive descent
Each grammar rule has a corresponding Python method. Methods call each other recursively, naturally mirroring the nested structure of the grammar.
def statement(self) -> Stmt | None:
if self.match(TokenType.JO):
return self.if_statement() # dispatch on current token
if self.match(TokenType.AA):
return self.declaration()
# ...
def if_statement(self) -> If:
condition = self.expression() # parse the condition
then_branch = self.block() # parse { ... }
else_branch = None
if self.match(TokenType.ELSE):
else_branch = self.block() # optional nahi to { ... }
return If(condition, then_branch, else_branch)
def expression(self) -> Expr:
return self.comparison() # delegate to next level
def comparison(self) -> Expr:
left = self.term()
while self.match(TokenType.EQUAL, TokenType.NOT_EQUAL, ...):
op = self.previous()
right = self.term()
left = Binary(left, op, right)
return leftAST node types
Every node is an immutable @dataclass. The parser builds the tree; the interpreter reads it without modification.
Parsing a complete example
kem bhai
aa x che 10
jo x > 5 {
bhai bol "big"
} nahi to {
bhai bol "small"
}
aavjo bhaiThe AST contains no braces, no che, no jo, no nahi to. The parser consumed those tokens to understand structure, then discarded them.
Parse errors
The parser raises ParseError when the current token does not match what the grammar expects. It reports both the expected token and the actual one found.
ParseError: expected '{' after 'jo' condition at line 3, column 14
ParseError: expected 'jya sudhi' after loop body at line 7, column 0
ParseError: program must start with 'kem bhai' at line 1, column 0