Rana Faraz edited this page Jun 23, 2026 · 1 revision

DocuMind

CI · Live demo · License: MIT

Layout-aware document key-information extraction — with a measured proof that layout is doing the work.

DocuMind extracts structured records (key fields and line-item tables) from invoices, forms, and receipts using the page geometry an OCR engine emits, not just the text. A schema verifier reconciles arithmetic relationships and repairs OCR-corrupted values. Every extraction is scored against a ground-truth record, so the claims are measured, not asserted.

The interesting question is not "can a model read a document?" but "what is the layout actually buying you?" DocuMind separates two effects:

  • Geometry buys field-association accuracy. Holding the verifier fixed, the layout-aware extractor lifts field accuracy from 59% to 100%. On two-column forms, text-only extraction scores 0%; layout scores 100%.
  • The schema verifier buys arithmetic validity. Holding the extractor fixed, the verifier lifts record validity from 91% to 100% by recomputing an OCR-corrupted total from subtotal + tax.
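
To make the first effect concrete, here is a minimal toy sketch of why reading order alone can fail on a two-column form while geometry succeeds. It is not DocuMind's actual extractor: the `Token` class, the `LABELS` set, the `extract_text_only` and `extract_layout` functions, the tolerance value, and the sample form are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical OCR token: text plus the top-left corner of its box."""
    text: str
    x: float  # left edge
    y: float  # top edge (grows downward)


# Assumed label vocabulary for this toy form.
LABELS = {"Invoice No:", "Date:"}

# Two-column form: labels share one row, values sit directly below them.
TOKENS = [
    Token("Invoice No:", 10, 10), Token("Date:", 200, 10),
    Token("INV-001", 10, 30), Token("2026-01-05", 200, 30),
]
TRUTH = {"Invoice No:": "INV-001", "Date:": "2026-01-05"}


def extract_text_only(tokens):
    """Reading-order baseline: pair each label with the next token in the stream."""
    ordered = sorted(tokens, key=lambda t: (t.y, t.x))
    out = {}
    for i, tok in enumerate(ordered[:-1]):
        if tok.text in LABELS:
            out[tok.text] = ordered[i + 1].text
    return out


def extract_layout(tokens, tol=2.0):
    """Geometry: pair each label with the nearest value to its right or below it."""
    values = [t for t in tokens if t.text not in LABELS]
    out = {}
    for label in (t for t in tokens if t.text in LABELS):
        right_of = [v for v in values if abs(v.y - label.y) <= tol and v.x > label.x]
        below = [v for v in values if abs(v.x - label.x) <= tol and v.y > label.y]
        candidates = right_of + below
        if candidates:
            best = min(candidates, key=lambda v: abs(v.x - label.x) + abs(v.y - label.y))
            out[label.text] = best.text
    return out


def field_accuracy(pred, truth):
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    return sum(pred.get(k) == v for k, v in truth.items()) / len(truth)


if __name__ == "__main__":
    print("text-only:", field_accuracy(extract_text_only(TOKENS), TRUTH))
    print("layout:   ", field_accuracy(extract_layout(TOKENS), TRUTH))
```

On this toy form the reading-order baseline pairs "Invoice No:" with "Date:" because the two labels are adjacent in the token stream, so every field is wrong; the geometric rule looks down the column instead and recovers both values.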

Architecture overview

flowchart LR
    subgraph Source["Document source (env-selectable)"]
        SYN[synthetic<br/>boxes + ground truth · offline]
        PDF[pdf<br/>pdfplumber · optional]
    end
    Source --> TOK[Tokens with bounding boxes]
    subgraph Extract["Extractor (env-selectable)"]
        LAY[layout<br/>geometry: right-of / below / columns]
        TXT[text<br/>reading-order · ablation]
        LLM[ollama / openai<br/>optional extras]
    end
    TOK --> Extract
    Extract --> REC[Record: fields + line items]
    REC --> VER[Schema verifier<br/>amounts → subtotal → +tax → total]
    VER --> FINAL[Reconciled record]
    SYN -.ground truth.-> SCORE[Score vs. ground-truth record]
    FINAL --> SCORE
    SCORE --> M[field acc · cell F1 · doc exact · validity]
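
The verifier stage in the diagram (amounts → subtotal → +tax → total) can be sketched roughly as follows. This is a minimal illustration, not the project's verifier: the record layout, the `reconcile` and `is_valid` function names, and the choice to trust line-item amounts and tax while recomputing the dependent values are all assumptions made for the example.

```python
from decimal import Decimal


def is_valid(record):
    """A record is arithmetically valid if both sums along the chain hold."""
    subtotal = sum((item["amount"] for item in record["line_items"]), Decimal("0"))
    return (record["subtotal"] == subtotal
            and record["total"] == record["subtotal"] + record["tax"])


def reconcile(record):
    """Recompute dependent values along amounts -> subtotal -> +tax -> total.

    Assumption for this sketch: line-item amounts and tax are trusted, and
    subtotal and total are overwritten whenever they disagree with them.
    Returns the repaired copy and the list of fields that were changed.
    """
    fixed = dict(record)
    repairs = []

    subtotal = sum((item["amount"] for item in record["line_items"]), Decimal("0"))
    if fixed["subtotal"] != subtotal:
        fixed["subtotal"] = subtotal
        repairs.append("subtotal")

    total = fixed["subtotal"] + fixed["tax"]
    if fixed["total"] != total:
        fixed["total"] = total
        repairs.append("total")

    return fixed, repairs


# Hypothetical extraction in which OCR misread the total (108.00 became 188.00).
OCR_RECORD = {
    "line_items": [{"amount": Decimal("40.00")}, {"amount": Decimal("60.00")}],
    "subtotal": Decimal("100.00"),
    "tax": Decimal("8.00"),
    "total": Decimal("188.00"),
}

if __name__ == "__main__":
    fixed, repairs = reconcile(OCR_RECORD)
    print(repairs, fixed["total"], is_valid(fixed))
```

`Decimal` is used instead of `float` so that currency sums compare exactly; returning the list of repaired fields keeps each correction auditable against the original extraction.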

Quick start

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                                            # 38 tests, all offline
documind compare --doctype invoice --seed 1          # four configs head-to-head

Wiki pages

  • Architecture — document types, extractor design, verifier, null control, synthetic data
  • Evaluation — benchmark setup, results table, dissociation, reproduce commands
  • Configuration — env vars, backend matrix, .env.example
  • Development — setup, code structure, how to add a new document type or extractor
