A simple grammar parsing library in Python.
It also includes a sample programming language implementation called Jura to demonstrate the capabilities of the library.
Language design made easy.
- Index
- Getting Started
- Grammar
- Lexer
- Parser
- Grammar Normalization
- Abstract Syntax Tree (AST)
- Jura Language
- Testing
- Deployment
- Contributing
- Authors
- Disclaimer
- License
You will need Python 3.12 or later to run this library.
# vermin output
Minimum required versions: 3.12
Incompatible versions: 2.x
If you don't have Python 3.12 installed, uv will download and manage the correct version for you.
To install the library for use in another project:
pip install git+https://github.com/Animenosekai/monica
You also need to install Graphviz to visualize ASTs, from the official website: https://www.graphviz.org.
On Linux, you can install it via your package manager.
sudo apt install graphviz
sudo dnf install graphviz
brew install graphviz
For development, we recommend using uv to manage the environment and dependencies (as standard pip does not support dependency groups yet).
Follow the instructions on the official website but a quick install command is:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, clone the repository and install all dependencies:
$ git clone https://github.com/Animenosekai/monica
$ cd monica
# Create virtual environment and install all dependencies (including dev)
$ uv sync
# Activate the environment
$ source .venv/bin/activate
Note
Alternatively, you can use uv run <command> to run commands within the virtual environment without activating it.
This class allows you to parse grammar definition files and use them to parse text according to the defined grammar.
You can load a grammar from a file using the Grammar.load or Grammar.loads method.
from monica import Grammar
grammar = Grammar.load("path/to/grammar.gram")
This will parse the grammar definition file and return a Grammar object that you can use to parse text according to the defined grammar.
You can specify many parameters when loading a grammar, such as the prefixes (for comments, commands, rule definitions, etc.), symbols (for ranges, repetitions, etc.), and wrappers (around non-terminals, groups, etc.) used in the grammar definition file.
Monica also supports range definitions in terminals using the range_symbol (default is -). For example, <nt> ::= a-c is equivalent to <nt> ::= a | b | c.
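The expansion a range performs can be sketched standalone. This is an illustrative, stdlib-only snippet — monica handles this internally when loading the grammar, and its actual implementation may differ:

```python
def expand_range(terminal: str, range_symbol: str = "-") -> list[str]:
    """Expand a single-character range like 'a-c' into its alternatives."""
    if len(terminal) == 3 and terminal[1] == range_symbol:
        start, end = terminal[0], terminal[2]
        # Ranges cover every character code between the two endpoints
        return [chr(code) for code in range(ord(start), ord(end) + 1)]
    return [terminal]  # not a range: keep the terminal as-is

print(expand_range("a-c"))  # ['a', 'b', 'c']
```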
You can dump a grammar to a file using the Grammar.dump or Grammar.dumps method.
from monica import Grammar
grammar = Grammar.load("path/to/grammar.gram")
grammar.dump("path/to/dumped_grammar.gram")
# Or get the grammar as a string
grammar_str = grammar.dumps()
Grammar.dump writes the grammar definition to the specified file, while Grammar.dumps returns it as a string.
You can specify many parameters when dumping a grammar:
- Customize prefixes, symbols, and wrappers
- Control range factorization
- Preserve or modify formatting
- Include or exclude commands
The grammar will be dumped using the specified prefixes, symbols and wrappers and by factoring the ranges as much as possible. The round-trip property is preserved: loading a dumped grammar produces an equivalent grammar.
Commands are special annotations that can be added to grammar rules to modify their behavior.
See the commands documentation for more details.
This class allows you to tokenize input text according to the defined grammar.
This is where you check whether the current text can be parsed as this token type.
There are three possible return values:
- A value of False means 'definitely not': it cannot be of this type.
- A value of True, or a TokenizationContext with TokenizationContext.done == True, means 'definitely': it can be of this type.
- A value of None, or a TokenizationContext with TokenizationContext.done == False, means 'maybe': it could be of this type, but more context is needed (for example, we don't know whether an identifier is a keyword until we have the full identifier).
from typing import override
from monica.lexer import ValuedToken, TokenizationContext
class Identifier(ValuedToken[str]):
    """Token representing an identifier ::= (<alpha> | _) (<alpha> | _ | <digit>)*"""

    @override
    @classmethod
    def can_parse(
        cls,
        value: str,
        context: TokenizationContext | None = None,
    ) -> bool | TokenizationContext | None:
        if Keyword.can_parse(value, context) is True:
            return None
        if not (value[0].isalpha() or value[0] == "_"):
            return False
        return all(c.isalnum() or c == "_" for c in value[1:])
This is where you actually parse the token from the source code.
It should return an instance of the token if it can be parsed, or None if it cannot.
from typing import Self, override
from monica.lexer import ValuedToken, TokenizationContext
class Identifier(ValuedToken[str]):
    """Token representing an identifier ::= (<alpha> | _) (<alpha> | _ | <digit>)*"""

    @override
    @classmethod
    def parse(
        cls,
        value: str,
        line: int,
        column: int,
        context: TokenizationContext | None = None,
    ) -> Self | None:
        return cls(
            line=line,
            column=column,
            length=len(value),
            value=value,
        )
A value of None can also be returned to indicate that the token should be ignored (for example, whitespace tokens).
class Whitespace(Token):
    """Token representing whitespace."""

    @override
    @classmethod
    def can_parse(
        cls,
        value: str,
        context: TokenizationContext | None = None,
    ) -> bool | TokenizationContext | None:
        return value.isspace()

    @override
    @classmethod
    def parse(
        cls,
        value: str,
        line: int,
        column: int,
        context: TokenizationContext | None = None,
    ) -> Self | None:
        return None  # Whitespace tokens are ignored
Some tokens have associated values (for example, an identifier token has the identifier name as its value).
While the basic Token class does not have any associated values, the ValuedToken[T] is a generic class that allows you to define tokens with associated values.
These are tokens that can take one of a predefined set of values (for example, keywords or operators).
This can be done by subclassing the EnumToken class and defining an enumeration for the possible values.
import enum
from monica.lexer import EnumToken
class KeywordEnum(enum.Enum):
    """Enumeration of keywords."""

    IF = "if"
    ELSE = "else"
    FOR = "for"
    WHILE = "while"

class Keyword(EnumToken[KeywordEnum]):
    """Token representing a keyword."""
These are tokens that represent a single character, such as parentheses, braces, or operators.
This can be done by subclassing the CharacterToken class and defining the CHARACTER class attribute.
from monica.lexer import CharacterToken
class LeftParenthesis(CharacterToken):
    """Token representing a left parenthesis."""

    CHARACTER = "("

class RightParenthesis(CharacterToken):
    """Token representing a right parenthesis."""

    CHARACTER = ")"
Once you have defined your tokens, you can assemble them into a list of token types to be used by the lexer.
from monica.lexer import Lexer
lexer = Lexer(Whitespace, Keyword, Identifier, LeftParenthesis, RightParenthesis, ...)
You can tokenize input text using the tokenize method of the Lexer class.
for token in lexer.tokenize(text, "filepath_of_the_source_code.ext"):
    print_token(token)
The lexer supports parallel tokenization using the multiprocessing module.
You can call the parallel_tokenize method of the Lexer class instead of the regular tokenize method to tokenize the input text in parallel.
The output will be stored in a multiprocessing.Queue you provide.
queue: multiprocessing.Queue[Token | None] = multiprocessing.Queue()
lexer.parallel_tokenize(text, "filepath_of_the_source_code.ext", queue=queue)
while (token := queue.get()) is not None:
    print_token(token)
The parser module provides an LL(1) parser implementation to analyze syntax based on the defined grammar.
The parser uses an LL(1) parsing algorithm with a predictive parse table. It processes tokens from left to right and uses a single lookahead token to determine which production rule to apply.
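The mechanism can be sketched independently of monica with a toy grammar and a hand-written parse table. This is an illustrative, stdlib-only sketch — monica's real implementation uses its own table, rule, and token types:

```python
# Toy LL(1) predictive parser for balanced parentheses:
#   S ::= ( S ) S | ε
# The table maps (non-terminal, lookahead) to the production to apply.
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],  # epsilon production
    ("S", "$"): [],  # epsilon at end of input
}

def parse_ll1(tokens: list[str]) -> bool:
    stack = ["$", "S"]          # end marker below the axiom
    tokens = tokens + ["$"]
    position = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[position]
        if top == lookahead:    # terminal (or $): match and advance
            position += 1
        elif (top, lookahead) in TABLE:
            # Push the production right-to-left so it pops left-to-right
            stack.extend(reversed(TABLE[(top, lookahead)]))
        else:
            return False        # no applicable rule: syntax error
    return position == len(tokens)

print(parse_ll1(["(", "(", ")", ")"]))  # True
print(parse_ll1(["(", ")", ")"]))       # False
```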
from monica.parser import parse, create_table
from monica.grammar import Grammar
grammar = Grammar.load("path/to/grammar.gram")
table = create_table(grammar)
# Parse input tokens
parse(
    table,
    grammar.axiom,
    lexer.tokenize(input_text),
    token_to_terminal=lambda token, rule: Terminal(value=token.value),
    conflicts_resolution={},
)
The create_table function generates an LL(1) parse table from a grammar. This table maps (non-terminal, lookahead) pairs to production rules.
from monica.parser import create_table
table = create_table(grammar)
The parse table construction automatically computes:
- FIRST sets: the set of terminals that can begin strings derived from a non-terminal
- FOLLOW sets: the set of terminals that can appear immediately after a non-terminal
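To illustrate what the table construction computes, here is a standalone fixed-point computation of FIRST sets for a toy grammar (illustrative only; not monica's implementation):

```python
# Productions map each non-terminal to its alternatives; [] denotes epsilon.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["id"], ["(", "E", ")"]],
}

def first_sets(grammar: dict[str, list[list[str]]]) -> dict[str, set[str]]:
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows (fixed point)
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                for symbol in alt:
                    if symbol not in grammar:  # terminal: add it and stop
                        if symbol not in first[nt]:
                            first[nt].add(symbol)
                            changed = True
                        break
                    before = len(first[nt])
                    first[nt] |= first[symbol] - {"ε"}
                    if len(first[nt]) != before:
                        changed = True
                    if "ε" not in first[symbol]:
                        break
                else:  # every symbol was nullable (or alt is epsilon)
                    if "ε" not in first[nt]:
                        first[nt].add("ε")
                        changed = True
    return first

print(first_sets(GRAMMAR))
```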
The parser can detect LL(1) grammar conflicts (FIRST/FIRST and FIRST/FOLLOW conflicts) and provides mechanisms to resolve them.
# Detect conflicts
if table.conflicts:
    for non_terminal, conflicts in table.conflicts.items():
        print(f"Conflict in {non_terminal.name}")
        for lookahead, rules in conflicts.items():
            print(f"  Lookahead: {lookahead}")
            for rule in rules:
                print(f"    {rule}")
# Resolve conflicts with a custom resolver function
def conflict_resolver(possible_rules: list[Rule], lookahead: Token) -> Rule:
    # Custom logic to choose the correct rule
    return possible_rules[0]

parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=...,
    conflicts_resolution={
        NonTerminal(name="stmt"): {
            Terminal(value="if"): conflict_resolver,
        },
    },
)
The parser provides detailed error messages for various parsing errors:
- UnrecognizedTokenError: Token that cannot be mapped to a terminal
- UnexpectedTokenError: Token that doesn't match the expected terminal
- NoRuleError: No production rule found for the non-terminal with the given lookahead
- ConflictError: Multiple rules apply for the same (non-terminal, lookahead) pair
- EndOfStackError: Input tokens remain after the parsing stack is empty
- ExpectedTerminalError: Expected a terminal but got a non-terminal
- ExpectedNonTerminalError: Expected a non-terminal but got a terminal
The parser supports listener functions that receive parsing events in real-time:
from monica.parser import parse
def my_listener(item: Token | Rule | ParserError) -> None:
    if isinstance(item, Rule):
        print(f"Applied rule: {item}")
    elif isinstance(item, Token):
        print(f"Matched token: {item}")
    elif isinstance(item, ParserError):
        print(f"Error: {item}")
        # Optionally raise the error to stop parsing
        raise item

parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=...,
    conflicts_resolution={},
    listener=my_listener,
)
The normalization module provides tools to transform grammars into LL(1)-compatible forms.
The normalize function applies a series of transformations to convert a grammar into a weakly equivalent LL(1) grammar:
from monica.normalize import normalize
# Normalize a grammar
normalized_grammar = normalize(
    grammar,
    precedence={"*": 2, "/": 2, "+": 1, "-": 1},
    associativity={"*": "left", "/": "left", "+": "left", "-": "left"},
)
The normalization pipeline includes the following steps:
- Expand optionals and repeats (?, *, +)
- Expand groups (the ( and ) wrappers)
- Eliminate left recursion
- Left factor the grammar
- Apply precedence transformations (optional) (Under Development)
- Apply associativity transformations (optional) (Under Development)
- Remove epsilon productions (Under Development)
- Remove unit productions (Under Development)
- Remove redundant productions
- Prune identical rules
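To make one of these transformations concrete, here is a standalone sketch of eliminating immediate left recursion on a toy rule representation (illustrative only; not monica's internal data structures):

```python
# A rule is a list of alternatives; each alternative is a list of symbols,
# and [] denotes epsilon.
def eliminate_immediate_left_recursion(
    nt: str, alternatives: list[list[str]]
) -> dict[str, list[list[str]]]:
    # Split alternatives into left-recursive ones (starting with nt) and the rest
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    base = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}  # nothing to do
    new_nt = nt + "'"  # fresh helper non-terminal
    return {
        nt: [alt + [new_nt] for alt in base],
        new_nt: [alt + [new_nt] for alt in recursive] + [[]],
    }

# <expr> ::= <expr> + <term> | <term>
print(eliminate_immediate_left_recursion("expr", [["expr", "+", "term"], ["term"]]))
# {'expr': [['term', "expr'"]], "expr'": [['+', 'term', "expr'"], []]}
```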
Left recursion (when a non-terminal can derive itself as the leftmost symbol) is incompatible with LL parsers. The elimination process transforms left-recursive rules:
from monica.normalize import eliminate_left_recursion
# Before: <expr> ::= <expr> + <term> | <term>
# After: <expr> ::= <term> <expr'>
# <expr'> ::= + <term> <expr'> | ε
grammar = eliminate_left_recursion(grammar)
Left factoring eliminates common prefixes in alternative productions to make the grammar suitable for predictive parsing:
from monica.normalize import left_factor
# Before: <stmt> ::= if ( <expr> ) <stmt> else <stmt>
# | if ( <expr> ) <stmt>
# After: <stmt> ::= if ( <expr> ) <stmt> <else_part>
# <else_part> ::= else <stmt> | ε
grammar = left_factor(grammar)
Expansion functions convert shorthand notations into explicit productions:
from monica.normalize import expand_optionals_and_repeats, expand_groups
# Expand optional (?) and repeatable (*, +) operators
grammar = expand_optionals_and_repeats(grammar)
# Expand grouped alternatives
grammar = expand_groups(grammar)
Optimization functions clean up the grammar by removing redundancies:
from monica.normalize import (
    remove_redundant_productions,
    prune_identical_rules,
)

# Remove rules with identical definitions
grammar = remove_redundant_productions(grammar)
# Remove duplicate rules
grammar = prune_identical_rules(grammar)
The AST module provides data structures for representing parse trees.
The TreeNode class represents nodes in an abstract syntax tree:
from monica.ast.tree import TreeNode
# Create a tree
tree = TreeNode("root", [
    TreeNode("child1", [
        TreeNode("grandchild1"),
        TreeNode("grandchild2"),
    ]),
    TreeNode("child2"),
])

# Print the tree representation
print(tree)
The AST module can generate Graphviz DOT diagrams from tree structures:
from monica.ast.visualize import tree_to_dot
from graphviz import Source
# Generate DOT code
dot_code = tree_to_dot(tree)
# Render with Graphviz
src = Source(dot_code)
src.view()
The ParseTreeBuilder class provides an automated way to construct abstract syntax trees during parsing:
from monica.ast.tree import ParseTreeBuilder
from monica.parser import parse
# Create an AST builder listener
builder = ParseTreeBuilder()
# Parse with the builder as listener
parse(
    table,
    grammar.axiom,
    tokens,
    token_to_terminal=token_to_terminal,
    conflicts_resolution={},
    listener=builder,
)

# Access the built AST
ast_tree = builder.root
The AST builder automatically:
- Creates tree nodes for each token encountered
- Organizes nodes according to grammar rules
- Maintains parent-child relationships
- Labels nodes with token types and values
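The core idea behind such a builder can be sketched with a stack of partially built nodes. This is an illustrative sketch, not ParseTreeBuilder's actual internals, and it assumes a simplified event stream with explicit enter/exit markers:

```python
# Events arrive top-down: "enter" opens a node for an applied rule,
# "token" attaches a leaf, and "exit" closes the current node.
class Node:
    def __init__(self, label: str):
        self.label = label
        self.children = []

    def __repr__(self):
        if not self.children:
            return self.label
        return f"{self.label}({', '.join(map(repr, self.children))})"

class TreeBuilder:
    def __init__(self):
        self.root = None
        self._stack = []  # nodes whose children are still being collected

    def __call__(self, event: str, payload: str) -> None:
        if event == "enter":      # a rule was applied
            node = Node(payload)
            if self._stack:
                self._stack[-1].children.append(node)
            else:
                self.root = node
            self._stack.append(node)
        elif event == "token":    # a terminal was matched
            self._stack[-1].children.append(Node(payload))
        elif event == "exit":     # the rule's symbols are fully consumed
            self._stack.pop()

builder = TreeBuilder()
for event in [("enter", "expr"), ("token", "1"), ("token", "+"),
              ("token", "2"), ("exit", "expr")]:
    builder(*event)
print(builder.root)  # expr(1, +, 2)
```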
AST reduction simplifies parse trees by removing unnecessary nodes and combining related elements:
from jura.ast import JuraASTReducer
# Create a reducer
reducer = JuraASTReducer()
# Reduce the AST
reduced_tree = reducer.reduce(ast_tree)
The reducer:
- Removes syntactic noise (brackets, semicolons, etc.)
- Flattens simple statement nodes
- Interns strings to save memory
- Preserves semantic information
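Two of these reductions, dropping syntactic noise and flattening single-child chains, can be sketched independently. This is an assumed simplification pass for illustration, not JuraASTReducer's exact algorithm:

```python
# A node is a (label, children) pair; NOISE lists purely syntactic labels.
NOISE = {"(", ")", "{", "}", ";"}

def reduce_node(node):
    label, children = node
    # Drop syntactic noise and reduce the remaining children recursively
    kept = [reduce_node(child) for child in children if child[0] not in NOISE]
    # Collapse chains like stmt -> expr -> literal into the inner node
    if len(kept) == 1:
        return kept[0]
    return (label, kept)

tree = ("stmt", [("(", []), ("expr", [("literal", [])]), (")", []), (";", [])])
print(reduce_node(tree))  # ('literal', [])
```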
Jura is a sample programming language implementation demonstrating monica's capabilities.
The Jura language includes a command-line interface for lexical analysis, parsing, and conflict detection:
# Lex a Jura file
jura lex jura/examples/HelloWorld.jura
# Lex with error recovery
jura lex jura/examples/errors/LexerError.jura --continue-on-error
# Parse a Jura file
jura parse jura/examples/HelloWorld.jura
# Parse with error recovery
jura parse jura/examples/errors/MultiInheritance.jura --continue-on-error
# Show the AST of a Jura file
jura ast jura/examples/Calculator.jura
# Show the AST with error recovery
jura ast jura/examples/errors/MultiInheritance.jura --continue-on-error
# Show grammar conflicts
jura conflicts
# Show version
jura --version
The Jura lexer tokenizes Jura source code into tokens like keywords, identifiers, operators, numbers, and strings:
jura lex jura/examples/Calculator.jura
This outputs a detailed list of tokens with their line numbers, column numbers, and values.
The Jura parser validates the syntax of Jura programs according to the Jura grammar:
jura parse jura/examples/Animals.jura
The parser will report any syntax errors with detailed error messages including line and column information.
Visualize the abstract syntax tree of a Jura program:
jura ast jura/examples/HelloWorld.jura
This command:
- Parses the Jura file and builds an AST
- Applies AST reduction to simplify the tree
- Generates a Graphviz visualization
- Opens the visualization in your default viewer
The generated AST diagrams are cached in jura/.cache/ast for easy access.
Clear the AST visualization cache:
jura clean
This removes all cached AST visualizations from the .cache directory.
The Jura language includes several example programs demonstrating various features:
- HelloWorld.jura - Basic program structure with variables
- BareBone.jura - Minimal valid Jura program
- Calculator.jura - Classes, methods, and arithmetic operations
- Animals.jura - Object-oriented programming with inheritance
- LogicLoops.jura - Control flow and logical operations
- Recursion.jura - Recursive function calls
- SimpleArithmetic.jura - Simple arithmetic operations
You can run any example:
jura parse jura/examples/Calculator.jura
jura lex jura/examples/Animals.jura
jura ast jura/examples/Recursion.jura
It also includes examples with syntax errors to test error handling:
jura parse jura/examples/errors/MultiInheritance.jura
jura parse jura/examples/errors/MainSignature.jura
The project uses pytest for testing.
To run all tests:
pytest
To run tests with coverage reporting:
pytest --cov=monica --cov=jura --cov-report=term-missing
You can also run tests for a specific package:
pytest tests/jura
This module is currently in development and might contain bugs.
Feel free to use it in production if you judge it suitable, but be aware that you may encounter issues.
Pull requests are welcome. For major changes, please open a discussion first to discuss what you would like to change.
Please make sure to update the tests as appropriate.
- anise — Initial work, Monica — Animenosekai
- dindedouce — Initial work, Parser — dindedouce
- PIGASSOU Clement — Initial work, AST — clement.pigassou
- BESSIERE Lena — Initial work, Jura Grammar — lena.bessiere
- GOMPEL Ingrid — Initial work — ingrid.gompel
This project doesn't handle the generation of executables or interpreters from the parsed grammar yet.
This project is licensed under the MIT License - see the LICENSE file for details.
