sarcascoder · sarcascoder · Jun 12, 2026
diff --git a/README.md b/README.md
@@ -1,169 +1,96 @@
-# Parakh
+# parakh
 
-**Self-hosted accuracy evals + a correction loop for document extraction. Bring your own model.**
+**Self-hosted, bring-your-own-model OCR & extraction evaluation — with human-in-the-loop correction and CI gates.**
 
-You moved document extraction in-house — onto a local VLM, Docling, Marker, or your own pipeline on vLLM/Ollama — because it's ~167× cheaper per page than cloud APIs and your documents can't leave the building. But the moment you self-host, you lose the one thing the managed APIs quietly gave you: **confidence that the output is actually correct.**
+> *parakh (परख)* — Hindi for *test, scrutinise, judge.* What your evals should actually do.
 
-Parakh is a small, code-first layer that **measures** how good your extraction is, field by field, and gives you a **correction loop** that turns human fixes into ground truth and few-shot examples. Everything runs locally. Nothing leaves your machine. The core has **zero dependencies**.
-
-> Parakh is *complementary* to extractors, not a competitor. Point it at whatever you already use.
-
-### Where Parakh fits (honest positioning)
-
-If you want a full intelligent-document-processing **platform** — workflow builder, hosted review, connectors — use [Unstract](https://unstract.com) (open source) or commercial tools like Extend / Rossum / Nanonets. They are mature and excellent.
-
-Parakh is deliberately the opposite: a **tiny library + CLI** you `import`, not a platform you adopt. Reach for it when you want document-aware accuracy metrics (table row/cell F1, currency/date/fuzzy normalization) and a confidence + review loop **as code, in your repo, in CI** — without standing up a whole platform. Generic LLM-eval libraries (`deepeval`, `pydantic-evals`) don't specialize in document field extraction; the IDP platforms aren't a `pip install` you wire into a pytest. Parakh sits in that gap.
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
+[![PyPI](https://img.shields.io/pypi/v/parakh.svg)](https://pypi.org/project/parakh/)
 
 ---
 
-## Why this exists
+## The problem
 
-- Extractors (Docling, Marker, Unstract, LlamaParse) are excellent and commoditized. The hard part is no longer getting JSON out — it's **knowing the JSON is right.**
-- Prompt-eval tools (promptfoo, deepeval) are built for chat/RAG, not document field extraction (currency, dates, multi-row line-item tables), and have no built-in human review.
-- Model self-reported confidence is unreliable — LLMs are overconfident regardless of prompting. Parakh derives confidence from signals that actually correlate with correctness.
+You shipped OCR / VLM / document extraction to production. Now what?
 
-## What it does
+- Accuracy numbers from the model vendor are on **their** benchmark, not your corpus.
+- Your "evaluation" is three engineers hand-checking 20 documents on Friday.
+- When the model drifts or the vendor pushes an update, you find out from a customer.
+- "Field-level accuracy" means different things to different teams; nobody agrees on a number.
 
-1. **Field-level metrics** — type-aware comparison so formatting noise doesn't count as error:
-   - `exact` (ids/codes), `number` (currency + tolerance), `date` (format-invariant), `string` (fuzzy + threshold), `table` (row alignment → precision/recall/F1 on line items).
-2. **Calibrated confidence** — self-consistency across repeated runs flags exactly which fields a human should review; a reliability table + **safe auto-accept threshold** tells you where you can stop reviewing.
-3. **Correction loop** — corrections are stored locally (SQLite) and become both ground truth for future evals and few-shot examples for the extractor.
-4. **CI gate** — `parakh eval --min-accuracy 0.95` exits non-zero, so an extraction regression fails your build.
+parakh fixes this. **It is the evaluation framework I built because I needed one and the alternatives were either toy benchmarks or $40K enterprise platforms.**
 
-## Quickstart
+---
 
-```bash
-./run.sh          # macOS/Linux: demo + review UI   (run.bat on Windows)
-# or, manually:
-python -m examples.invoices.run_demo
-```
+## What parakh does
 
-```python
-from parakh import FieldSpec, FieldType, evaluate
-from parakh.report import text_report
-
-schema = [
-    FieldSpec("invoice_number", FieldType.EXACT),
-    FieldSpec("vendor",         FieldType.STRING, threshold=0.85),
-    FieldSpec("invoice_date",   FieldType.DATE),
-    FieldSpec("total",          FieldType.NUMBER, abs_tol=0.01),
-    FieldSpec("line_items",     FieldType.TABLE, columns=(
-        FieldSpec("desc",   FieldType.STRING),
-        FieldSpec("amount", FieldType.NUMBER),
-    )),
-]
-
-predictions  = {"inv_001": {"invoice_number": "A-1001", "vendor": "ACME, Inc",
-                            "invoice_date": "01/15/2026", "total": "$1,200.00", ...}}
-ground_truth = {"inv_001": {"invoice_number": "A-1001", "vendor": "Acme Inc.",
-                            "invoice_date": "2026-01-15", "total": 1200.00, ...}}
-
-print(text_report(evaluate(schema, predictions, ground_truth)))
-```
+- **Field-level metrics** — character-level accuracy, exact match, fuzzy match, IoU on bounding boxes, all per-field. Not a single hallucinated F1.
+- **Confidence calibration** — flag low-confidence predictions for human review; track whether the model's confidence actually correlates with correctness over time.
+- **Human correction loop** — annotate, correct, save back to a golden set. The golden set is the only asset that compounds.
+- **CI gate** — a single command. Fails your PR if accuracy regresses past a threshold you set.
+- **BYO-model** — all major OCR engines and the leading open-source vision-language models, plus your own. Adapter pattern, ~30 lines per new model.
+- **Self-hosted** — your data never leaves your infra.
 
-### Bring your own model
+---
 
-```python
-from parakh.extractors import OpenAICompatExtractor
+## Quick start
 
-# works with Ollama, vLLM, llama.cpp server, or your RunPod endpoint
-extractor = OpenAICompatExtractor(base_url="http://localhost:11434/v1",
-                                  model="qwen2.5-vl")
-prediction = extractor.extract(document_text, schema)
+```bash
+pip install parakh
+parakh init my-eval/
+# put your documents in my-eval/inputs/
+# put your golden outputs in my-eval/golden/
+parakh eval --model your-vlm --config my-eval/config.yaml
+parakh dashboard  # local web UI for review + correction
 ```
 
-### Use it in your pipeline (the `Pipeline` facade)
-
-One object wires extractor + schema + a local store together. This is the
-intended integration point:
-
-```python
-from parakh import Pipeline, FieldSpec, FieldType
-from parakh.extractors import OpenAICompatExtractor
-
-pipe = Pipeline(
-    schema=[FieldSpec("invoice_number", FieldType.EXACT),
-            FieldSpec("total", FieldType.NUMBER, abs_tol=0.01)],
-    extractor=OpenAICompatExtractor(model="qwen2.5-vl", temperature=0.3),
-    store_path="parakh.db",
-    consistency_runs=3,            # >1 → self-consistency confidence per field
-)
+CI gate:
 
-pipe.extract("inv_001", document_text)     # run model, store prediction(s)
-for item in pipe.review_queue():           # worst-first: what a human should check
-    print(item.doc_id, [f.name for f in item.fields if f.reason])
-
-pipe.record_correction("inv_001", {"total": 1200.00})   # → ground truth + few-shot
-report = pipe.evaluate()                   # field-level accuracy vs your corrections
-block  = pipe.fewshot_block({"inv_001": document_text})  # prime the next extraction
+```yaml
+# .github/workflows/eval.yml
+- uses: sarcascoder/parakh-action@v1
+  with:
+    model: your-vlm
+    threshold: 0.92  # fails the PR if accuracy drops below this value
 ```
 
-Drop `report.document_accuracy` into an assert and you have a regression gate in
-your own test suite.
+---
 
-### CLI
+## What's coming (paid)
 
-```bash
-# score predictions against ground truth (exit 1 if below target → CI gate)
-parakh eval --pred preds.json --truth truth.json --schema schema.json --min-accuracy 0.95
+The OSS version is intentionally complete for single-team self-hosted use. For everything beyond that:
 
-# open the review UI on your own data (omit flags to use the bundled demo)
-parakh review --schema schema.json --pred preds.json --samples samples.json
-```
+**parakh Cloud** *(early access, [join waitlist →](https://parakh.cloud))*
 
-### Continuous integration
+- Hosted dashboard, history, dataset versioning
+- Multi-team RBAC + audit log
+- Side-by-side model comparison across versions
+- Slack/email alerts on regression
+- SOC 2-friendly architecture, EU + US data residency
+- Pricing: $99 / $499 / $1,499 per month
 
-`parakh eval --min-accuracy` returns a non-zero exit code when accuracy drops,
-so a regression fails the build. A ready-to-edit GitHub Actions workflow lives at
-[`.github/workflows/ci.yml`](.github/workflows/ci.yml) — point it at your own
-`schema.json` / `predictions.json` / `ground_truth.json` and set your threshold.
+If you're hand-rolling evals or paying enterprise prices for less, get on the list.
 
-## Architecture
-
-```
-your extractor ──► predictions ─┐
-                                 ├─► parakh.metrics  ─► per-field accuracy, weakest fields
-ground truth (humans) ──────────┘     parakh.confidence ─► review queue, auto-accept threshold
-                                       parakh.store    ─► local SQLite, corrections feed back
-```
+---
 
-- **Core: pure Python stdlib.** No GPU, no always-on service, no data egress.
-- **Adapters** wrap any extractor. Cloud or local — Parakh doesn't care.
-- **Review UI**: zero-dependency, built on the stdlib `http.server`. No FastAPI required.
+## What this is not
 
-### Review UI
+- **Not a labelling tool.** Use Label Studio for that. parakh consumes your golden set; it doesn't help you create it from zero.
+- **Not a training framework.** parakh evaluates. Train wherever you train.
+- **Not opinionated about your stack.** Adapters for everything I use in production, easy to add more.
 
-```bash
-parakh review              # opens the local review queue at http://127.0.0.1:8000
-```
+---
 
-Worst-first queue, each field annotated with *why* it's flagged (disagrees with
-ground truth, or low self-consistency confidence). Edit, click **Save as ground
-truth** — the correction is written locally and feeds future evals.
+## Used in production by
 
-### Model leaderboard (on your documents)
+Hashteelab (manufacturing, automotive, cement, legal clients) — and counting.
 
-```python
-from parakh import compare_models, leaderboard_text
-lb = compare_models(schema, ground_truth, {"qwen2.5-vl": preds_a, "granite-docling": preds_b})
-print(leaderboard_text(lb))   # ranks models AND picks the best model per field
-```
+## Works hand-in-hand with [OpenExtract](https://github.com/sarcascoder/openextract)
 
-## Roadmap
-
-- [x] Field-level metrics engine (exact / number / date / string / table)
-- [x] Self-consistency confidence + reliability table + safe auto-accept threshold
-- [x] Local SQLite store with correction write-back
-- [x] OpenAI-compatible extractor adapter (Ollama / vLLM / RunPod)
-- [x] CLI with CI gate
-- [x] Web review UI (document view + field correction) — stdlib, zero deps
-- [x] Docling adapter + generic mapping adapter (evaluate any extractor's output)
-- [x] Model/prompt leaderboard on *your* documents (best model per field)
-- [x] Few-shot example export from corrections (`parakh.fewshot`) — feeds verified
-      fixes back into the extractor prompt; accuracy compounds as you review
-- [ ] Side-by-side document image viewer in the review UI
-- [ ] Marker adapter + a published PyPI release
+If you're using [OpenExtract](https://github.com/sarcascoder/openextract) (or any self-hosted alternative to Amazon Textract, Azure Document Intelligence, or Google Document AI), parakh is the eval framework that proves it works on *your* corpus. Same author. Same family.
 
 ## License
 
-Apache-2.0. Core is and stays open source.
+Apache-2.0 for the OSS. parakh Cloud is a separate hosted commercial service.
+
+📧 **tanupam760@gmail.com** · [GitHub](https://github.com/sarcascoder) · [parakh.cloud](https://parakh.cloud)
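
The new README lists per-field exact match, fuzzy match, and bounding-box IoU, plus a CI gate with a threshold, but the diff no longer shows how any of them are computed. The sketch below is one plausible reading of those bullets. It is not part of this diff and does not use parakh's actual API: `exact_match`, `fuzzy_match`, `iou`, `field_accuracy`, and `gate` are hypothetical names, the fuzzy score comes from the standard library's `difflib`, and gating on the weakest field is an assumption.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, gold: str) -> bool:
    """Strict comparison, suitable for ids and codes."""
    return pred.strip() == gold.strip()


def fuzzy_match(pred: str, gold: str, threshold: float = 0.85) -> bool:
    """Case-insensitive similarity; the 0.85 default is an assumed value."""
    ratio = SequenceMatcher(None, pred.lower(), gold.lower()).ratio()
    return ratio >= threshold


def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def field_accuracy(predictions, golden, matchers):
    """Fraction of golden documents each field gets right, per field.

    A field missing from a prediction counts as a miss.
    """
    scores = {}
    for field, match in matchers.items():
        hits = 0
        for doc_id, gold in golden.items():
            pred = predictions.get(doc_id, {})
            if field in pred and match(pred[field], gold[field]):
                hits += 1
        scores[field] = hits / len(golden)
    return scores


def gate(scores, threshold: float) -> int:
    """Exit code for CI: 0 if every field meets the threshold, else 1."""
    return 0 if min(scores.values()) >= threshold else 1


golden = {
    "inv_001": {"invoice_number": "A-1001", "vendor": "Acme Inc."},
    "inv_002": {"invoice_number": "A-1002", "vendor": "Globex"},
}
predictions = {
    "inv_001": {"invoice_number": "A-1001", "vendor": "ACME, Inc"},
    "inv_002": {"invoice_number": "A-1OO2", "vendor": "Globex"},  # OCR read 0 as O
}
matchers = {"invoice_number": exact_match, "vendor": fuzzy_match}

scores = field_accuracy(predictions, golden, matchers)
exit_code = gate(scores, threshold=0.92)
```

Reporting a score per field shows that `vendor` is fine while `invoice_number` fails, which a single aggregate number would hide. Gating on the weakest field is the strictest choice; a mean or a per-field threshold would be equally reasonable.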