Development
Rana Faraz edited this page Jun 23, 2026 · 1 revision
git clone https://github.com/ranafaraz/DocuMind.git
cd DocuMind
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q # 38 tests, all should pass
python -m evals.harness # full benchmark, writes evals/RESULTS.md
python -m evals.gate # CI quality gate (asserts dissociation shape)
ruff check .
ruff format .
mypy src/documind
src/documind/
  types.py -- Token, BoundingBox, Record, LineItem dataclasses
  config.py -- Settings (reads env vars), doctype/backend registry
  normalize.py -- Value canonicalisation, add_money (integer cents), OCR noise
  schema.py -- Field and table schema definitions per doc type
  geometry.py -- Spatial predicates: right_of, below, in_column_band
  documents/
    synthetic.py -- Deterministic document generator + scramble_layout
    io_pdf.py -- pdfplumber adapter (optional backend)
  extract/
    base.py -- Shared: label finding, table-region detection, value cleaning
    layout.py -- Geometry-based value association (right-of / below / columns)
    text.py -- Reading-order value association (ablation)
    ollama.py -- Ollama LLM extractor (optional)
    openai.py -- OpenAI extractor (optional)
  verify.py -- SchemaVerifier: arithmetic reconciliation, is_valid
  pipeline.py -- End-to-end pipeline: source → extract → verify → score
  cli.py -- documind extract | compare | render | extract-pdf | eval
evals/
  metrics.py -- field_acc, cell_f1, doc_exact, validity aggregation
  harness.py -- Runs all configs × seeds × doc types, writes RESULTS.md
  gate.py -- CI gate: asserts dissociation shape and null collapse
tests/ -- 38 pytest tests (unit + integration, all offline)
examples/
  run_extractor.py -- Minimal usage example
docs/
  ARCHITECTURE.md -- Design write-up
  DECISIONS.md -- Key decisions and their rationale
To add a new document type:
- Define the schema in `schema.py`: add a new entry to the field-schema registry with field names, expected value types, and the arithmetic constraints (for the verifier).
- Add layout rules in `documents/synthetic.py`: implement a `generate_<doctype>` function that assigns bounding boxes according to the document's layout (e.g., grid, mixed, single-column). Register the function in `DocTypeRegistry`.
- Add ground-truth generation: the same function should also produce the ground-truth record alongside the tokens.
- Update the `schema.py` verifier rules if the new type has different arithmetic relationships.
- Add tests in `tests/` covering at least: a round-trip (generate → extract → score), the null test (scramble → score collapses), and the verifier on a corrupted value.
- Run `evals/harness.py` with the new doc type and check that both effects still dissociate.
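
The generator steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `Token` and `BoundingBox` are simplified stand-ins for the dataclasses in `types.py`, and `generate_packing_slip`, `_rng`, the `packing_slip` doc type, and the salt value are all hypothetical. Registration in `DocTypeRegistry` is left out because its interface is not shown on this page. The sketch seeds its random generator through `hashlib` so that the same seed yields the same document in every process.

```python
import hashlib
import random
from dataclasses import dataclass


# Simplified stand-ins for the dataclasses in types.py (real fields may differ).
@dataclass(frozen=True)
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Token:
    text: str
    box: BoundingBox


_SALT = "documind-sketch"  # placeholder, not the project's actual salt


def _rng(doctype: str, seed: int) -> random.Random:
    """Derive a reproducible RNG from salt, doc type and seed via SHA-256."""
    digest = hashlib.sha256(f"{_SALT}:{doctype}:{seed}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_packing_slip(seed: int) -> tuple[list[Token], dict[str, str]]:
    """Return (tokens, ground_truth) for a hypothetical label/value layout."""
    rng = _rng("packing_slip", seed)
    truth = {
        "order_id": f"PS-{rng.randint(10000, 99999)}",
        "ship_date": f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
    }
    tokens: list[Token] = []
    for row, (field, value) in enumerate(truth.items()):
        y = 40.0 + 20.0 * row
        # Label sits in the left column; its value is to the right on the same row.
        tokens.append(Token(field, BoundingBox(50.0, y, 150.0, y + 12.0)))
        tokens.append(Token(value, BoundingBox(200.0, y, 320.0, y + 12.0)))
    return tokens, truth
```

Because tokens and ground truth come out of the same call, a round-trip test can compare an extractor's output directly against `truth`, and calling the function twice with the same seed gives a cheap reproducibility check.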
To add a new extractor backend:
- Subclass `BaseExtractor` in a new file under `extract/`. The base class provides `find_labels`, `detect_table_region`, and `clean_value`; your extractor implements `associate_value(label_token, tokens, table_region)`.
- Register the backend in `config.py`'s `EXTRACTOR_REGISTRY`.
- Import lazily if the extractor requires an optional dependency: check for the import at call time, not at module load, and raise a helpful error or fall back to the offline path.
- Add a test that runs the extractor on a synthetic document and checks field accuracy.
- Verify the ablation property: the new extractor should share `base.py`'s canonicalisation and label-finding; only value association should differ. Check that `receipt` (single-column control) gives comparable accuracy to the existing `layout` extractor.
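
The lazy-import step can be sketched as below. Everything here is illustrative: `MissingBackendError`, `load_optional`, `OptionalBackendExtractor`, and `next_token_fallback` are hypothetical names, tokens are plain strings for brevity, and in the project the class would subclass `BaseExtractor` rather than stand alone. The point is only that the optional module is imported inside the call, so importing the extractor file never fails when the dependency is missing.

```python
import importlib
from typing import Any, Callable, Optional


class MissingBackendError(RuntimeError):
    """Raised when an optional extractor dependency is not installed."""


def load_optional(module_name: str) -> Any:
    """Import module_name at call time; raise an actionable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingBackendError(
            f"The optional dependency '{module_name}' is not installed. "
            f"Install it to use this extractor, or select an offline backend."
        ) from exc


def next_token_fallback(label_token, tokens, table_region=None):
    """Offline stand-in: return the token that follows the label, if any."""
    if label_token not in tokens:
        return None
    idx = tokens.index(label_token)
    return tokens[idx + 1] if idx + 1 < len(tokens) else None


class OptionalBackendExtractor:
    """Hypothetical extractor; would subclass BaseExtractor in the project."""

    def __init__(self, module_name: str, fallback: Optional[Callable] = None):
        # Nothing is imported here, so construction works without the backend.
        self._module_name = module_name
        self._fallback = fallback

    def associate_value(self, label_token, tokens, table_region=None):
        try:
            backend = load_optional(self._module_name)  # import on first use
        except MissingBackendError:
            if self._fallback is None:
                raise  # no offline path configured: surface the helpful error
            return self._fallback(label_token, tokens, table_region)
        return self._call_backend(backend, label_token, tokens, table_region)

    def _call_backend(self, backend, label_token, tokens, table_region):
        # Placeholder: the real model call would go here.
        raise NotImplementedError("backend call not part of this sketch")
```

With a fallback configured, a missing dependency degrades to the offline path; without one, the caller gets `MissingBackendError` at the first `associate_value` call instead of an `ImportError` at import time.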
Rules that every change must respect:
- Never diverge `extract/base.py` between extractors. All extractors must use the same label-finding and table-region detection code. If you need to change `base.py`, the change must apply to all extractors so the head-to-head comparison remains fair.
- Money arithmetic in integer cents. Use `normalize.add_money`; never use float sums for amounts. The verifier's tolerance is sub-cent.
- Deterministic generator. Seed from the fixed `_SALT`, never from `hash()` (which is randomised per process for strings and makes the benchmark non-reproducible across sessions).
- The verifier must not fix mis-associated fields. It only recomputes `total` from `subtotal + tax`. If you add new verifier logic, ensure it only touches arithmetic, not field assignment.
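
To make the integer-cents and arithmetic-only rules concrete, here is a small sketch. It is not the project's `normalize.add_money` or `SchemaVerifier`: `to_cents`, `add_money_cents`, and `reconcile_total` are hypothetical helpers, amounts are assumed to be non-negative decimal strings, and the record is a plain dict rather than the `Record` dataclass.

```python
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a decimal string such as '19.99' into exact integer cents."""
    cents = Decimal(amount) * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"sub-cent amount: {amount!r}")
    return int(cents)


def add_money_cents(*amounts: str) -> int:
    """Sum amounts as integer cents, so no float rounding can creep in."""
    return sum(to_cents(a) for a in amounts)


def reconcile_total(record: dict[str, str]) -> dict[str, str]:
    """Return a copy with total recomputed from subtotal + tax.

    Only the arithmetic field is touched; every other field is copied as-is,
    so a mis-associated field stays visibly wrong. Assumes non-negative amounts.
    """
    fixed = dict(record)
    cents = add_money_cents(record["subtotal"], record["tax"])
    fixed["total"] = f"{cents // 100}.{cents % 100:02d}"
    return fixed


# Float sums drift: in binary floating point, 0.1 + 0.2 is not exactly 0.3.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

record = {"vendor": "ACME", "subtotal": "19.99", "tax": "1.60", "total": "21.95"}
reconciled = reconcile_total(record)
```

Here `reconciled["total"]` becomes `"21.59"` while `vendor`, `subtotal`, and `tax` are left exactly as extracted, which is the separation the last rule asks for.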