Home
Rana Faraz edited this page Jun 23, 2026
Layout-aware document key-information extraction — with a measured proof that layout is doing the work.
DocuMind extracts structured records (key fields and line-item tables) from invoices, forms, and receipts using the page geometry an OCR engine emits, not just the text. A schema verifier reconciles arithmetic relationships and repairs OCR-corrupted values. Every extraction is scored against a ground-truth record, so the claims are measured, not asserted.
The interesting question is not "can a model read a document?" but "what is the layout actually buying you?" DocuMind separates two effects:
- Geometry buys field-association accuracy. Holding the verifier fixed, the layout-aware extractor lifts field accuracy from 59% to 100%. On two-column forms, text-only extraction scores 0%; layout scores 100%.
- The schema verifier buys arithmetic validity. Holding the extractor fixed, the verifier lifts record validity from 91% to 100% by recomputing an OCR-corrupted total from subtotal + tax.
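The second effect can be illustrated with a minimal sketch of arithmetic reconciliation. This is not DocuMind's verifier: the record shape, the field names (`line_amounts`, `subtotal`, `tax`, `total`), and the policy of trusting line amounts over the subtotal are assumptions made for illustration only.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse a money string; return None if OCR damage makes it unparseable."""
    try:
        return Decimal(text.replace(",", ""))
    except (InvalidOperation, AttributeError):
        return None


def reconcile(record):
    """Return (fixed_record, repaired_fields) for a hypothetical record dict.

    Assumed chain: line amounts -> subtotal -> subtotal + tax -> total.
    """
    fixed = dict(record)
    repaired = []
    amounts = [parse_amount(a) for a in record.get("line_amounts", [])]
    subtotal = parse_amount(record.get("subtotal"))
    tax = parse_amount(record.get("tax"))
    total = parse_amount(record.get("total"))

    # Step 1: if every line amount parsed, the subtotal must equal their sum.
    if amounts and all(a is not None for a in amounts):
        expected = sum(amounts, Decimal("0"))
        if subtotal != expected:
            subtotal = expected
            fixed["subtotal"] = str(expected)
            repaired.append("subtotal")

    # Step 2: the total must equal subtotal + tax; recompute it if it does not.
    if subtotal is not None and tax is not None:
        expected = subtotal + tax
        if total != expected:
            fixed["total"] = str(expected)
            repaired.append("total")

    return fixed, repaired


# The total below contains the letter "O" where OCR should have read a zero.
record = {
    "line_amounts": ["60.00", "40.00"],
    "subtotal": "100.00",
    "tax": "8.00",
    "total": "1O8.00",
}
fixed, repaired = reconcile(record)
```

In this toy case only `total` is rewritten, because the subtotal already agrees with the line amounts. Using `Decimal` rather than `float` keeps the equality checks exact for currency values.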
```mermaid
flowchart LR
    subgraph Source["Document source (env-selectable)"]
        SYN[synthetic<br/>boxes + ground truth · offline]
        PDF[pdf<br/>pdfplumber · optional]
    end
    Source --> TOK[Tokens with bounding boxes]
    subgraph Extract["Extractor (env-selectable)"]
        LAY[layout<br/>geometry: right-of / below / columns]
        TXT[text<br/>reading-order · ablation]
        LLM[ollama / openai<br/>optional extras]
    end
    TOK --> Extract
    Extract --> REC[Record: fields + line items]
    REC --> VER[Schema verifier<br/>amounts → subtotal → +tax → total]
    VER --> FINAL[Reconciled record]
    SYN -.ground truth.-> SCORE[Score vs. ground-truth record]
    FINAL --> SCORE
    SCORE --> M[field acc · cell F1 · doc exact · validity]
```
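To make the difference between the `layout` and `text` extractors in the pipeline concrete, here is a small sketch of right-of / below association over tokens with bounding boxes, next to a reading-order baseline. It is an illustration under stated assumptions, not DocuMind's extractor: the `Token` class, both functions, the coordinates, and the set of known label strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def _overlaps(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] share positive length."""
    return min(a1, b1) > max(a0, b0)


def layout_value(label, tokens, label_texts):
    """Nearest value right-of the label on its row, else nearest below it."""
    values = [t for t in tokens if t.text not in label_texts]
    right = [t for t in values
             if t.x0 >= label.x1 and _overlaps(t.y0, t.y1, label.y0, label.y1)]
    if right:
        return min(right, key=lambda t: t.x0 - label.x1).text
    below = [t for t in values
             if t.y0 >= label.y1 and _overlaps(t.x0, t.x1, label.x0, label.x1)]
    if below:
        return min(below, key=lambda t: t.y0 - label.y1).text
    return None


def text_only_value(label, tokens, label_texts):
    """Baseline: first non-label token after the label in reading order."""
    ordered = sorted(tokens, key=lambda t: (t.y0, t.x0))
    for t in ordered[ordered.index(label) + 1:]:
        if t.text not in label_texts:
            return t.text
    return None


# A toy form with two side-by-side columns; each label sits above its value.
form = [
    Token("Name", 0, 0, 40, 10), Token("Date", 300, 0, 335, 10),
    Token("Alice", 0, 20, 45, 30), Token("2024-01-05", 300, 20, 380, 30),
]
labels = {"Name", "Date"}
by_layout = {t.text: layout_value(t, form, labels) for t in form if t.text in labels}
by_text = {t.text: text_only_value(t, form, labels) for t in form if t.text in labels}
```

On this toy form the geometric rule pairs each label with the value in its own column, while the reading-order baseline assigns "Alice" to both labels, because row-major order interleaves the two columns. The toy is not meant to reproduce the benchmark figures quoted above.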
```bash
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                                            # 38 tests, all offline
documind compare --doctype invoice --seed 1          # four configs head-to-head
```

- Architecture: document types, extractor design, verifier, null control, synthetic data
- Evaluation: benchmark setup, results table, dissociation, reproduce commands
- Configuration: env vars, backend matrix, `.env.example`
- Development: setup, code structure, how to add a new document type or extractor