Development

Setup

git clone https://github.com/ranafaraz/DocuMind.git
cd DocuMind
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                        # 38 tests, all should pass

Running the benchmark

python -m evals.harness          # full benchmark, writes evals/RESULTS.md
python -m evals.gate             # CI quality gate (asserts dissociation shape)

Linting

ruff check .
ruff format .
mypy src/documind

Code structure

src/documind/
    types.py          -- Token, BoundingBox, Record, LineItem dataclasses
    config.py         -- Settings (reads env vars), doctype/backend registry
    normalize.py      -- Value canonicalisation, add_money (integer cents), OCR noise
    schema.py         -- Field and table schema definitions per doc type
    geometry.py       -- Spatial predicates: right_of, below, in_column_band
    documents/
        synthetic.py  -- Deterministic document generator + scramble_layout
        io_pdf.py     -- pdfplumber adapter (optional backend)
    extract/
        base.py       -- Shared: label finding, table-region detection, value cleaning
        layout.py     -- Geometry-based value association (right-of / below / columns)
        text.py       -- Reading-order value association (ablation)
        ollama.py     -- Ollama LLM extractor (optional)
        openai.py     -- OpenAI extractor (optional)
    verify.py         -- SchemaVerifier: arithmetic reconciliation, is_valid
    pipeline.py       -- End-to-end pipeline: source → extract → verify → score
    cli.py            -- documind extract | compare | render | extract-pdf | eval

evals/
    metrics.py        -- field_acc, cell_f1, doc_exact, validity aggregation
    harness.py        -- Runs all configs × seeds × doc types, writes RESULTS.md
    gate.py           -- CI gate: asserts dissociation shape and null collapse

tests/               -- 38 pytest tests (unit + integration, all offline)
examples/
    run_extractor.py  -- Minimal usage example
docs/
    ARCHITECTURE.md   -- Design write-up
    DECISIONS.md      -- Key decisions and their rationale

How to add a new document type

1. Define the schema in schema.py: add an entry to the field-schema registry with the field names, the expected value types, and the arithmetic constraints the verifier will check.
2. Add layout rules in documents/synthetic.py: implement a generate_<doctype> function that assigns bounding boxes according to the document's layout (e.g., grid, mixed, single-column), and register it in DocTypeRegistry.
3. Add ground-truth generation: the same function should return the ground-truth record alongside the tokens.
4. Update the verifier rules (the arithmetic constraints in schema.py, checked by SchemaVerifier in verify.py) if the new type has different arithmetic relationships.
5. Add tests in tests/ covering at least: a round trip (generate → extract → score), the null test (scramble → score collapses), and the verifier on a corrupted value.
6. Run the harness (python -m evals.harness) with the new doc type and check that both effects still dissociate.

How to add a new extractor

1. Subclass BaseExtractor in a new file under extract/. The base class provides find_labels, detect_table_region, and clean_value; your extractor implements associate_value(label_token, tokens, table_region).
2. Register the backend in config.py's EXTRACTOR_REGISTRY.
3. Import lazily if the extractor requires an optional dependency: check for the import at call time, not at module load, and raise a helpful error or fall back to the offline path.
4. Add a test that runs the extractor on a synthetic document and checks field accuracy.
5. Verify the ablation property: the new extractor should share base.py's canonicalisation and label finding; only value association should differ. Check that the receipt doc type (the single-column control) gives accuracy comparable to the existing layout extractor's.
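
A minimal sketch of the subclassing and lazy-import steps is shown below. It uses local stand-ins so that it runs on its own: the BaseExtractor and Token definitions here are simplified assumptions, and NearestRightExtractor and load_optional_backend are hypothetical names introduced for illustration. Only the associate_value(label_token, tokens, table_region) signature is taken from the text above.

```python
import importlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:  # stand-in for documind.types.Token (fields assumed)
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class BaseExtractor:  # stand-in for extract/base.py; the real class has more helpers
    def clean_value(self, text: str) -> str:
        return text.strip()

    def associate_value(self, label_token, tokens, table_region):
        raise NotImplementedError


class NearestRightExtractor(BaseExtractor):
    """Hypothetical extractor: pick the closest token to the right on the same row."""

    def associate_value(self, label_token: Token, tokens: list[Token],
                        table_region=None) -> Optional[str]:
        same_row = [
            t for t in tokens
            if t is not label_token
            and t.x0 >= label_token.x1                           # starts right of the label
            and t.y0 < label_token.y1 and t.y1 > label_token.y0  # vertical overlap
        ]
        if not same_row:
            return None
        nearest = min(same_row, key=lambda t: t.x0 - label_token.x1)
        return self.clean_value(nearest.text)  # shared cleaning from the base class


def load_optional_backend(module_name: str):
    """Import an optional dependency at call time, not at module load."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"This backend needs the optional package {module_name!r}; "
            "install it or use an offline extractor."
        ) from exc


label = Token("Total:", 40, 100, 90, 112)
tokens = [label, Token(" 42.00 ", 130, 100, 180, 112), Token("9.99", 130, 140, 180, 152)]
print(NearestRightExtractor().associate_value(label, tokens))  # 42.00
```

Only associate_value differs from the base class here, which is the property step 5 asks you to verify.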

Key invariants

Never diverge extract/base.py between extractors. All extractors must use the same label-finding and table-region detection code. If you need to change base.py, the change must apply to all extractors so the head-to-head comparison remains fair.
Money arithmetic in integer cents. Use normalize.add_money; never use float sums for amounts. The verifier's tolerance is sub-cent.
Deterministic generator. Seed from the fixed _SALT, never from Python's built-in hash(), whose output for strings is randomised per process and would make the benchmark non-reproducible across sessions.
The verifier must not fix mis-associated fields. It only recomputes total from subtotal + tax. If you add new verifier logic, ensure it only touches arithmetic, not field assignment.
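
To make the integer-cents and deterministic-seed invariants concrete, here is a small sketch. The add_money signature, the to_cents and stable_seed helpers, and the _SALT value are assumptions for illustration; the project's own normalize.add_money and seeding code may look different.

```python
import hashlib
from decimal import Decimal

_SALT = "example-salt"  # placeholder; the project's real _SALT value is not shown here


def to_cents(amount: str) -> int:
    """Parse a decimal string such as '19.99' into integer cents."""
    return int((Decimal(amount) * 100).to_integral_value())


def add_money(*amounts: str) -> int:
    """Sum amounts exactly in integer cents (hypothetical signature)."""
    return sum(to_cents(a) for a in amounts)


def stable_seed(doc_type: str, index: int) -> int:
    """Derive a seed that is identical in every process, unlike hash() on a string."""
    digest = hashlib.sha256(f"{_SALT}:{doc_type}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


print(0.1 + 0.2 == 0.3)                               # False: float sums drift
print(add_money("0.10", "0.20") == to_cents("0.30"))  # True: cents are exact
print(stable_seed("invoice", 3) == stable_seed("invoice", 3))  # True in any process
```

A cryptographic digest is not required for seeding; it is used here only because hashlib gives the same bytes in every interpreter session.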
