A simple grammar parsing library in Python.
It also includes a sample programming language implementation called Jura to demonstrate the capabilities of the library.
Language design made easy.
- Index
- Getting Started
- Grammar
- Lexer
- Parser
- Grammar Normalization
- Abstract Syntax Tree (AST)
- Jura Language
- Testing
- Deployment
- Contributing
- Authors
- Disclaimer
- License
You will need Python 3.12 or later to run this library.
# vermin output
Minimum required versions: 3.12
Incompatible versions: 2.x
If you don't have Python 3.12 installed, uv will download and manage the correct version for you.
To install the library for use in another project:
pip install git+https://github.com/Animenosekai/monica
You also need to install Graphviz to visualize ASTs, from the official website: https://www.graphviz.org.
On Linux, you can install it via your package manager.
sudo apt install graphviz
sudo dnf install graphviz
brew install graphviz
For development, we recommend using uv to manage the environment and dependencies (as standard pip does not support dependency groups yet).
Follow the instructions on the official website but a quick install command is:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, clone the repository and install all dependencies:
$ git clone https://github.com/Animenosekai/monica
$ cd monica
# Create virtual environment and install all dependencies (including dev)
$ uv sync
# Activate the environment
$ source .venv/bin/activate
Note
Alternatively, you can use uv run <command> to run commands within the virtual environment without activating it.
This class allows you to parse grammar definition files and use them to parse text according to the defined grammar.
You can load a grammar from a file using the Grammar.load or Grammar.loads method.
from monica import Grammar
grammar = Grammar.load("path/to/grammar.gram")
This will parse the grammar definition file and return a Grammar object that you can use to parse text according to the defined grammar.
You can specify many parameters when loading a grammar, such as the prefixes (for comments, commands, rule definitions, etc.), symbols (for ranges, repetitions, etc.), and wrappers (around non-terminals, groups, etc.) used in the grammar definition file.
Monica also supports range definitions in terminals using the range_symbol (default is -). For example, <nt> ::= a-c is equivalent to <nt> ::= a | b | c.
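The expansion a range performs can be sketched standalone. This is an illustrative, stdlib-only snippet — monica handles this internally when loading the grammar, and its actual implementation may differ:

```python
def expand_range(terminal: str, range_symbol: str = "-") -> list[str]:
    """Expand a single-character range like 'a-c' into its alternatives."""
    if len(terminal) == 3 and terminal[1] == range_symbol:
        start, end = terminal[0], terminal[2]
        # Ranges cover every character code between the two endpoints
        return [chr(code) for code in range(ord(start), ord(end) + 1)]
    return [terminal]  # not a range: keep the terminal as-is

print(expand_range("a-c"))  # ['a', 'b', 'c']
```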
You can dump a grammar to a file using the Grammar.dump or Grammar.dumps method.
from monica import Grammar
grammar = Grammar.load("path/to/grammar.gram")
grammar.dump("path/to/dumped_grammar.gram")
# Or get the grammar as a string
grammar_str = grammar.dumps()
Grammar.dump writes the grammar definition to the specified file, while Grammar.dumps returns it as a string.
You can specify many parameters when dumping a grammar:
- Customize prefixes, symbols, and wrappers
- Control range factorization
- Preserve or modify formatting
- Include or exclude commands
The grammar will be dumped using the specified prefixes, symbols and wrappers and by factoring the ranges as much as possible. The round-trip property is preserved: loading a dumped grammar produces an equivalent grammar.
Commands are special annotations that can be added to grammar rules to modify their behavior.
See the commands documentation for more details.
This class allows you to tokenize input text according to the defined grammar.
This is where you check whether the current text can be parsed as this token type.
There are three possible return values:
- A value of False means 'definitely not': it cannot be of this type.
- A value of True, or a TokenizationContext with TokenizationContext.done == True, means 'definitely': it can be of this type.
- A value of None, or a TokenizationContext with TokenizationContext.done == False, means 'maybe': it could be of this type, but more context is needed (for example, we don't know whether an identifier is a keyword until we have the full identifier).
from typing import override
from monica.lexer import ValuedToken, TokenizationContext
class Identifier(ValuedToken[str]):
    """Token representing an identifier ::= (<alpha> | _) (<alpha> | _ | <digit>)*"""

    @override
    @classmethod
    def can_parse(
        cls,
        value: str,
        context: TokenizationContext | None = None,
    ) -> bool | TokenizationContext | None:
        if Keyword.can_parse(value, context) is True:
            return None
        if not (value[0].isalpha() or value[0] == "_"):
            return False
        return all(c.isalnum() or c == "_" for c in value[1:])
This is where you actually parse the token from the source code.
It should return an instance of the token if it can be parsed, or None if it cannot.
from typing import Self, override
from monica.lexer import ValuedToken, TokenizationContext
class Identifier(ValuedToken[str]):
    """Token representing an identifier ::= (<alpha> | _) (<alpha> | _ | <digit>)*"""

    @override
    @classmethod
    def parse(
        cls,
        value: str,
        line: int,
        column: int,
        context: TokenizationContext | None = None,
    ) -> Self | None:
        return cls(
            line=line,
            column=column,
            length=len(value),
            value=value,
        )
A value of None can also be returned to indicate that the token should be ignored (for example, whitespace tokens).
class Whitespace(Token):
    """Token representing whitespace."""

    @override
    @classmethod
    def can_parse(
        cls,
        value: str,
        context: TokenizationContext | None = None,
    ) -> bool | TokenizationContext | None:
        return value.isspace()

    @override
    @classmethod
    def parse(
        cls,
        value: str,
        line: int,
        column: int,
        context: TokenizationContext | None = None,
    ) -> Self | None:
        return None  # Whitespace tokens are ignored
Some tokens have associated values (for example, an identifier token has the identifier name as its value).
While the basic Token class does not have any associated values, the ValuedToken[T] is a generic class that allows you to define tokens with associated values.
These are tokens that can take one of a predefined set of values (for example, keywords or operators).
This can be done by subclassing the EnumToken class and defining an enumeration for the possible values.
import enum
from monica.lexer import EnumToken
class KeywordEnum(enum.Enum):
    """Enumeration of keywords."""

    IF = "if"
    ELSE = "else"
    FOR = "for"
    WHILE = "while"

class Keyword(EnumToken[KeywordEnum]):
    """Token representing a keyword."""
These are tokens that represent a single character, such as parentheses, braces, or operators.
This can be done by subclassing the CharacterToken class and defining the CHARACTER class attribute.
from monica.lexer import CharacterToken
class LeftParenthesis(CharacterToken):
    """Token representing a left parenthesis."""

    CHARACTER = "("

class RightParenthesis(CharacterToken):
    """Token representing a right parenthesis."""

    CHARACTER = ")"
Once you have defined your tokens, you can assemble them into a list of token types to be used by the lexer.
from monica.lexer import Lexer
lexer = Lexer(Whitespace, Keyword, Identifier, LeftParenthesis, RightParenthesis, ...)
You can tokenize input text using the tokenize method of the Lexer class.
for token in lexer.tokenize(text, "filepath_of_the_source_code.ext"):
    print_token(token)
The lexer supports parallel tokenization using the multiprocessing module.
You can call the parallel_tokenize method of the Lexer class instead of the regular tokenize method to tokenize the input text in parallel.
The output will be stored in a multiprocessing.Queue you provide.
queue: multiprocessing.Queue[Token | None] = multiprocessing.Queue()
lexer.parallel_tokenize(text, "filepath_of_the_source_code.ext", queue=queue)
while (token := queue.get()) is not None:
    print_token(token)
The parser module provides an LL(1) parser implementation to analyze syntax based on the defined grammar.
The parser uses an LL(1) parsing algorithm with a predictive parse table. It processes tokens from left to right and uses a single lookahead token to determine which production rule to apply.
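The mechanism can be sketched independently of monica with a toy grammar and a hand-written parse table. This is an illustrative, stdlib-only sketch — monica's real implementation uses its own table, rule, and token types:

```python
# Toy LL(1) predictive parser for balanced parentheses:
#   S ::= ( S ) S | ε
# The table maps (non-terminal, lookahead) to the production to apply.
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],  # epsilon production
    ("S", "$"): [],  # epsilon at end of input
}

def parse_ll1(tokens: list[str]) -> bool:
    stack = ["$", "S"]          # end marker below the axiom
    tokens = tokens + ["$"]
    position = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[position]
        if top == lookahead:    # terminal (or $): match and advance
            position += 1
        elif (top, lookahead) in TABLE:
            # Push the production right-to-left so it pops left-to-right
            stack.extend(reversed(TABLE[(top, lookahead)]))
        else:
            return False        # no applicable rule: syntax error
    return position == len(tokens)

print(parse_ll1(["(", "(", ")", ")"]))  # True
print(parse_ll1(["(", ")", ")"]))       # False
```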
from monica.parser import parse, create_table
from monica.grammar import Grammar
grammar = Grammar.load("path/to/grammar.gram")
table = create_table(grammar)
# Parse input tokens
parse(
    table,
    grammar.axiom,
    lexer.tokenize(input_text),
    token_to_terminal=lambda token, rule: Terminal(value=token.value),
    conflicts_resolution={},
)
The create_table function generates an LL(1) parse table from a grammar. This table maps (non-terminal, lookahead) pairs to production rules.
from monica.parser import create_table
table = create_table(grammar)
The parse table construction automatically computes:
- FIRST sets: the set of terminals that can begin strings derived from a non-terminal
- FOLLOW sets: the set of terminals that can appear immediately after a non-terminal
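To illustrate what the table construction computes, here is a standalone fixed-point computation of FIRST sets for a toy grammar (illustrative only; not monica's implementation):

```python
# Productions map each non-terminal to its alternatives; [] denotes epsilon.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["id"], ["(", "E", ")"]],
}

def first_sets(grammar: dict[str, list[list[str]]]) -> dict[str, set[str]]:
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows (fixed point)
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                for symbol in alt:
                    if symbol not in grammar:  # terminal: add it and stop
                        if symbol not in first[nt]:
                            first[nt].add(symbol)
                            changed = True
                        break
                    before = len(first[nt])
                    first[nt] |= first[symbol] - {"ε"}
                    if len(first[nt]) != before:
                        changed = True
                    if "ε" not in first[symbol]:
                        break
                else:  # every symbol was nullable (or alt is epsilon)
                    if "ε" not in first[nt]:
                        first[nt].add("ε")
                        changed = True
    return first

print(first_sets(GRAMMAR))
```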
The parser can detect LL(1) grammar conflicts (FIRST/FIRST and FIRST/FOLLOW conflicts) and provides mechanisms to resolve them.
# Detect conflicts
if table.conflicts:
    for non_terminal, conflicts in table.conflicts.items():
        print(f"Conflict in {non_terminal.name}")
        for lookahead, rules in conflicts.items():
            print(f"  Lookahead: {lookahead}")
            for rule in rules:
                print(f"    {rule}")
# Resolve conflicts with a custom resolver function
def conflict_resolver(possible_rules: list[Rule], lookahead: Token) -> Rule:
    # Custom logic to choose the correct rule
    return possible_rules[0]

parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=...,
    conflicts_resolution={
        NonTerminal(name="stmt"): {
            Terminal(value="if"): conflict_resolver,
        },
    },
)
The parser provides detailed error messages for various parsing errors:
- UnrecognizedTokenError: Token that cannot be mapped to a terminal
- UnexpectedTokenError: Token that doesn't match the expected terminal
- NoRuleError: No production rule found for the non-terminal with the given lookahead
- ConflictError: Multiple rules apply for the same (non-terminal, lookahead) pair
- EndOfStackError: Input tokens remain after the parsing stack is empty
- ExpectedTerminalError: Expected a terminal but got a non-terminal
- ExpectedNonTerminalError: Expected a non-terminal but got a terminal
The parser supports listener functions that receive parsing events in real-time:
from monica.parser import parse
def my_listener(item: Token | Rule | ParserError) -> None:
    if isinstance(item, Rule):
        print(f"Applied rule: {item}")
    elif isinstance(item, Token):
        print(f"Matched token: {item}")
    elif isinstance(item, ParserError):
        print(f"Error: {item}")
        # Optionally raise the error to stop parsing
        raise item

parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=...,
    conflicts_resolution={},
    listener=my_listener,
)
The normalization module provides tools to transform grammars into LL(1)-compatible forms.
The normalize function applies a series of transformations to convert a grammar into a weakly equivalent LL(1) grammar:
from monica.normalize import normalize
# Normalize a grammar
normalized_grammar = normalize(
    grammar,
    precedence={"*": 2, "/": 2, "+": 1, "-": 1},
    associativity={"*": "left", "/": "left", "+": "left", "-": "left"},
)
The normalization pipeline includes the following steps:
- Expand optionals and repeats (?, *, +)
- Expand groups (the ( and ) wrappers)
- Eliminate left recursion
- Left factor the grammar
- Apply precedence transformations (optional) (Under Development)
- Apply associativity transformations (optional) (Under Development)
- Remove epsilon productions (Under Development)
- Remove unit productions (Under Development)
- Remove redundant productions
- Prune identical rules
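To make one of these transformations concrete, here is a standalone sketch of eliminating immediate left recursion on a toy rule representation (illustrative only; not monica's internal data structures):

```python
# A rule is a list of alternatives; each alternative is a list of symbols,
# and [] denotes epsilon.
def eliminate_immediate_left_recursion(
    nt: str, alternatives: list[list[str]]
) -> dict[str, list[list[str]]]:
    # Split alternatives into left-recursive ones (starting with nt) and the rest
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    base = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}  # nothing to do
    new_nt = nt + "'"  # fresh helper non-terminal
    return {
        nt: [alt + [new_nt] for alt in base],
        new_nt: [alt + [new_nt] for alt in recursive] + [[]],
    }

# <expr> ::= <expr> + <term> | <term>
print(eliminate_immediate_left_recursion("expr", [["expr", "+", "term"], ["term"]]))
# {'expr': [['term', "expr'"]], "expr'": [['+', 'term', "expr'"], []]}
```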
Left recursion (when a non-terminal can derive itself as the leftmost symbol) is incompatible with LL parsers. The elimination process transforms left-recursive rules:
from monica.normalize import eliminate_left_recursion
# Before: <expr> ::= <expr> + <term> | <term>
# After: <expr> ::= <term> <expr'>
# <expr'> ::= + <term> <expr'> | ε
grammar = eliminate_left_recursion(grammar)
Left factoring eliminates common prefixes in alternative productions to make the grammar suitable for predictive parsing:
from monica.normalize import left_factor
# Before: <stmt> ::= if ( <expr> ) <stmt> else <stmt>
# | if ( <expr> ) <stmt>
# After: <stmt> ::= if ( <expr> ) <stmt> <else_part>
# <else_part> ::= else <stmt> | ε
grammar = left_factor(grammar)
Expansion functions convert shorthand notations into explicit productions:
from monica.normalize import expand_optionals_and_repeats, expand_groups
# Expand optional (?) and repeatable (*, +) operators
grammar = expand_optionals_and_repeats(grammar)
# Expand grouped alternatives
grammar = expand_groups(grammar)
Optimization functions clean up the grammar by removing redundancies:
from monica.normalize import (
    remove_redundant_productions,
    prune_identical_rules,
)

# Remove rules with identical definitions
grammar = remove_redundant_productions(grammar)
# Remove duplicate rules
grammar = prune_identical_rules(grammar)
The AST module provides data structures for representing parse trees.
The TreeNode class represents nodes in an abstract syntax tree:
from monica.ast.tree import TreeNode
# Create a tree
tree = TreeNode("root", [
    TreeNode("child1", [
        TreeNode("grandchild1"),
        TreeNode("grandchild2"),
    ]),
    TreeNode("child2"),
])

# Print the tree representation
print(tree)
The AST module can generate Graphviz DOT diagrams from tree structures:
from monica.ast.visualize import tree_to_dot
from graphviz import Source
# Generate DOT code
dot_code = tree_to_dot(tree)
# Render with Graphviz
src = Source(dot_code)
src.view()
The ParseTreeBuilder class provides an automated way to construct abstract syntax trees during parsing:
from monica.ast.tree import ParseTreeBuilder
from monica.parser import parse
# Create an AST builder listener
builder = ParseTreeBuilder()
# Parse with the builder as listener
parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=token_to_terminal,
    conflicts_resolution={},
    listener=builder,
)

# Access the built AST
ast_tree = builder.root
The AST builder automatically:
- Creates tree nodes for each token encountered
- Organizes nodes according to grammar rules
- Maintains parent-child relationships
- Labels nodes with token types and values
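The core idea behind such a builder can be sketched with a stack of partially built nodes. This is an illustrative sketch, not ParseTreeBuilder's actual internals, and it assumes a simplified event stream with explicit enter/exit markers:

```python
# Events arrive top-down: "enter" opens a node for an applied rule,
# "token" attaches a leaf, and "exit" closes the current node.
class Node:
    def __init__(self, label: str):
        self.label = label
        self.children = []

    def __repr__(self):
        if not self.children:
            return self.label
        return f"{self.label}({', '.join(map(repr, self.children))})"

class TreeBuilder:
    def __init__(self):
        self.root = None
        self._stack = []  # nodes whose children are still being collected

    def __call__(self, event: str, payload: str) -> None:
        if event == "enter":      # a rule was applied
            node = Node(payload)
            if self._stack:
                self._stack[-1].children.append(node)
            else:
                self.root = node
            self._stack.append(node)
        elif event == "token":    # a terminal was matched
            self._stack[-1].children.append(Node(payload))
        elif event == "exit":     # the rule's symbols are fully consumed
            self._stack.pop()

builder = TreeBuilder()
for event in [("enter", "expr"), ("token", "1"), ("token", "+"),
              ("token", "2"), ("exit", "expr")]:
    builder(*event)
print(builder.root)  # expr(1, +, 2)
```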
AST reduction simplifies parse trees by removing unnecessary nodes and combining related elements:
from jura.ast import JuraASTReducer
# Create a reducer
reducer = JuraASTReducer()
# Reduce the AST
reduced_tree = reducer.reduce(ast_tree)
The reducer:
- Removes syntactic noise (brackets, semicolons, etc.)
- Flattens simple statement nodes
- Interns strings to save memory
- Preserves semantic information
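Two of these reductions, dropping syntactic noise and flattening single-child chains, can be sketched independently. This is an assumed simplification pass for illustration, not JuraASTReducer's exact algorithm:

```python
# A node is a (label, children) pair; NOISE lists purely syntactic labels.
NOISE = {"(", ")", "{", "}", ";"}

def reduce_node(node):
    label, children = node
    # Drop syntactic noise and reduce the remaining children recursively
    kept = [reduce_node(child) for child in children if child[0] not in NOISE]
    # Collapse chains like stmt -> expr -> literal into the inner node
    if len(kept) == 1:
        return kept[0]
    return (label, kept)

tree = ("stmt", [("(", []), ("expr", [("literal", [])]), (")", []), (";", [])])
print(reduce_node(tree))  # ('literal', [])
```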
Jura is a sample programming language implementation demonstrating monica's capabilities.
The Jura language includes a command-line interface for lexical analysis, parsing, and conflict detection:
# Lex a Jura file
jura lex jura/examples/HelloWorld.jura
# Lex with error recovery
jura lex jura/examples/errors/LexerError.jura --continue-on-error
# Parse a Jura file
jura parse jura/examples/HelloWorld.jura
# Parse with error recovery
jura parse jura/examples/errors/MultiInheritance.jura --continue-on-error
# Show the AST of a Jura file
jura ast jura/examples/Calculator.jura
# Show the AST with error recovery
jura ast jura/examples/errors/MultiInheritance.jura --continue-on-error
# Show grammar conflicts
jura conflicts
# Show version
jura --version
The Jura lexer tokenizes Jura source code into tokens like keywords, identifiers, operators, numbers, and strings:
jura lex jura/examples/Calculator.jura
This outputs a detailed list of tokens with their line numbers, column numbers, and values.
The Jura parser validates the syntax of Jura programs according to the Jura grammar:
jura parse jura/examples/Animals.jura
The parser will report any syntax errors with detailed error messages including line and column information.
Visualize the abstract syntax tree of a Jura program:
jura ast jura/examples/HelloWorld.jura
This command:
- Parses the Jura file and builds an AST
- Applies AST reduction to simplify the tree
- Generates a Graphviz visualization
- Opens the visualization in your default viewer
The generated AST diagrams are cached in jura/.cache/ast for easy access.
Clear the AST visualization cache:
jura clean
This removes all cached AST visualizations from the .cache directory.
The Jura language includes several example programs demonstrating various features:
- HelloWorld.jura - Basic program structure with variables
- BareBone.jura - Minimal valid Jura program
- Calculator.jura - Classes, methods, and arithmetic operations
- Animals.jura - Object-oriented programming with inheritance
- LogicLoops.jura - Control flow and logical operations
- Recursion.jura - Recursive function calls
- SimpleArithmetic.jura - Simple arithmetic operations
You can run any example:
jura parse jura/examples/Calculator.jura
jura lex jura/examples/Animals.jura
jura ast jura/examples/Recursion.jura
It also includes examples with syntax errors to test error handling:
jura parse jura/examples/errors/MultiInheritance.jura
jura parse jura/examples/errors/MainSignature.jura
The project uses pytest for testing.
To run all tests:
pytest
To run tests with coverage reporting:
pytest --cov=monica --cov=jura --cov-report=term-missing
You can also run tests for a specific package:
pytest tests/jura
This module is currently in development and might contain bugs.
Feel free to use it in production if you judge it suitable, but be aware that you may encounter issues.
Pull requests are welcome. For major changes, please open a discussion first to discuss what you would like to change.
Please make sure to update the tests as appropriate.
- anise — Initial work, Monica — Animenosekai
- dindedouce — Initial work, Parser — dindedouce
- PIGASSOU Clement — Initial work, AST — clement.pigassou
- BESSIERE Lena — Initial work, Jura Grammar — lena.bessiere
- GOMPEL Ingrid — Initial work — ingrid.gompel
This project doesn't handle the generation of executables or interpreters from the parsed grammar yet.
This project is licensed under the MIT License - see the LICENSE file for details.
