Development

Rana Faraz edited this page Jun 23, 2026 · 1 revision

Setup

git clone https://github.com/ranafaraz/DocuMind.git
cd DocuMind
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                        # 38 tests, all should pass

Running the benchmark

python -m evals.harness          # full benchmark, writes evals/RESULTS.md
python -m evals.gate             # CI quality gate (asserts dissociation shape)

Linting

ruff check .
ruff format .
mypy src/documind

Code structure

src/documind/
    types.py          -- Token, BoundingBox, Record, LineItem dataclasses
    config.py         -- Settings (reads env vars), doctype/backend registry
    normalize.py      -- Value canonicalisation, add_money (integer cents), OCR noise
    schema.py         -- Field and table schema definitions per doc type
    geometry.py       -- Spatial predicates: right_of, below, in_column_band
    documents/
        synthetic.py  -- Deterministic document generator + scramble_layout
        io_pdf.py     -- pdfplumber adapter (optional backend)
    extract/
        base.py       -- Shared: label finding, table-region detection, value cleaning
        layout.py     -- Geometry-based value association (right-of / below / columns)
        text.py       -- Reading-order value association (ablation)
        ollama.py     -- Ollama LLM extractor (optional)
        openai.py     -- OpenAI extractor (optional)
    verify.py         -- SchemaVerifier: arithmetic reconciliation, is_valid
    pipeline.py       -- End-to-end pipeline: source → extract → verify → score
    cli.py            -- documind extract | compare | render | extract-pdf | eval

evals/
    metrics.py        -- field_acc, cell_f1, doc_exact, validity aggregation
    harness.py        -- Runs all configs × seeds × doc types, writes RESULTS.md
    gate.py           -- CI gate: asserts dissociation shape and null collapse

tests/               -- 38 pytest tests (unit + integration, all offline)
examples/
    run_extractor.py  -- Minimal usage example
docs/
    ARCHITECTURE.md   -- Design write-up
    DECISIONS.md      -- Key decisions and their rationale

How to add a new document type

  1. Define the schema in schema.py: add a new entry to the field-schema registry with field names, expected value types, and the arithmetic constraints (for the verifier).
  2. Add layout rules in documents/synthetic.py: implement a generate_<doctype> function that assigns bounding boxes according to the document's layout (e.g., grid, mixed, single-column). Register the function in DocTypeRegistry.
  3. Add ground-truth generation: the same function should also produce the ground-truth record alongside the tokens.
  4. Update the verifier rules (the arithmetic constraints declared in schema.py and checked by verify.py) if the new type has different arithmetic relationships.
  5. Add tests in tests/ covering at least: a round-trip (generate → extract → score), the null test (scramble → score collapses), and the verifier on a corrupted value.
  6. Run the benchmark (python -m evals.harness) with the new doc type, then run the gate (python -m evals.gate) to confirm that the dissociation shape it asserts still holds.
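
Steps 2 and 3 can be sketched as a single function that returns the tokens and the ground-truth record together. This is a minimal, self-contained illustration and not the project's code: the Token and BoundingBox fields, the "delivery_note" doc type, the _SALT value, and the generate_delivery_note name are all assumptions made for the example. The real dataclasses live in types.py and may differ.

```python
import hashlib
import random
from dataclasses import dataclass


# Hypothetical stand-ins for the dataclasses in types.py; the real field
# names may differ.
@dataclass(frozen=True)
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Token:
    text: str
    box: BoundingBox


_SALT = "documind-example"  # assumed value, for illustration only


def _rng(doctype: str, seed: int) -> random.Random:
    """Derive a reproducible RNG from a fixed salt, not the built-in hash()."""
    digest = hashlib.sha256(f"{_SALT}:{doctype}:{seed}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_delivery_note(seed: int) -> tuple[list[Token], dict[str, int]]:
    """Return (tokens, ground_truth) for a simple 'label ... value' row layout."""
    rng = _rng("delivery_note", seed)
    subtotal = rng.randint(1_000, 99_999)  # amounts are integer cents
    tax = subtotal // 10
    truth = {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}

    tokens: list[Token] = []
    for row, (name, cents) in enumerate(truth.items()):
        y = 100.0 + 20.0 * row
        # Label on the left, value right-aligned on the same row.
        tokens.append(Token(name.capitalize(), BoundingBox(50, y, 120, y + 12)))
        tokens.append(
            Token(f"{cents // 100}.{cents % 100:02d}", BoundingBox(300, y, 360, y + 12))
        )
    return tokens, truth
```

Because the ground truth is built in the same call as the tokens, a round-trip test can compare extractor output against it directly, and calling the function twice with the same seed yields identical documents.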

How to add a new extractor

  1. Subclass BaseExtractor in a new file under extract/. The base class provides find_labels, detect_table_region, and clean_value; your extractor implements only associate_value(label_token, tokens, table_region).
  2. Register the backend in config.py's EXTRACTOR_REGISTRY.
  3. Import lazily if the extractor requires an optional dependency: check for the import at call time, not at module load, and raise a helpful error or fall back to the offline path.
  4. Add a test that runs the extractor on a synthetic document and checks field accuracy.
  5. Verify the ablation property: the new extractor should share base.py's canonicalisation and label-finding; only value association should differ. Check that the receipt doc type (the single-column control) gives accuracy comparable to the existing layout extractor.
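
Steps 1 and 3 might look roughly like the sketch below. It is self-contained, so BaseExtractor and Token here are simplified stand-ins for the real classes in extract/base.py and types.py. NearestRightExtractor and load_optional_backend are hypothetical names, and the associate_value signature is taken from step 1 with table_region assumed to be optional.

```python
import importlib
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Token:  # simplified stand-in for types.Token
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class BaseExtractor(ABC):  # simplified stand-in for extract/base.py
    @abstractmethod
    def associate_value(
        self, label_token: Token, tokens: Sequence[Token], table_region=None
    ) -> Optional[Token]:
        """Return the token holding the value for label_token, or None."""


class NearestRightExtractor(BaseExtractor):
    """Pick the closest token to the right of the label on the same line."""

    def associate_value(self, label_token, tokens, table_region=None):
        def same_line(t: Token) -> bool:
            # Vertical extents overlap with the label's.
            return t.y0 < label_token.y1 and t.y1 > label_token.y0

        candidates = [
            t
            for t in tokens
            if t is not label_token and t.x0 >= label_token.x1 and same_line(t)
        ]
        return min(candidates, key=lambda t: t.x0 - label_token.x1, default=None)


def load_optional_backend(module_name: str):
    """Import an optional dependency at call time, not at module load."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"The '{module_name}' package is required for this extractor; "
            "install it or select an offline extractor."
        ) from exc
```

Only associate_value differs from the other extractors here; label finding and value cleaning stay in the shared base class. That shared base is what keeps the comparison in step 5 meaningful.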

Key invariants

  • Never let extract/base.py diverge between extractors. All extractors must use the same label-finding and table-region detection code. If you need to change base.py, the change must apply to all extractors so the head-to-head comparison remains fair.
  • Money arithmetic in integer cents. Use normalize.add_money; never use float sums for amounts. The verifier's tolerance is sub-cent.
  • Deterministic generator. Seed from the fixed _SALT, never from the built-in hash(): its output for strings is randomised per process (unless PYTHONHASHSEED is set), which makes the benchmark non-reproducible across sessions.
  • The verifier must not fix mis-associated fields. It only recomputes total from subtotal + tax. If you add new verifier logic, ensure it only touches arithmetic, not field assignment.
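
The integer-cents and arithmetic-only invariants can be illustrated with a short sketch. The helpers below (to_cents, add_money_cents, total_reconciles) are hypothetical and are not the real normalize.add_money or SchemaVerifier, whose signatures are not shown on this page. The sketch only checks the arithmetic relationship and never reassigns fields.

```python
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a decimal string such as '12.34' into integer cents, exactly."""
    cents = Decimal(amount) * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {amount!r}")
    return int(cents)


def add_money_cents(*amounts: int) -> int:
    """Sum amounts that are already integer cents (hypothetical helper)."""
    return sum(amounts)


def total_reconciles(record: dict[str, int]) -> bool:
    """Check total == subtotal + tax without modifying the record."""
    return record["total"] == add_money_cents(record["subtotal"], record["tax"])


# Float sums drift; integer cents do not.
assert 0.1 + 0.2 != 0.3
assert add_money_cents(to_cents("0.10"), to_cents("0.20")) == to_cents("0.30")
```

Because total_reconciles only reads the three amounts, a record whose fields were mis-associated upstream still fails the check, and the mistake stays visible in the scores.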
