Secure · Fast · Reliable · Cost-Effective
The document normalization engine RAG has always needed.
Why DOCNEST • Installation • Quick Start • Python API • PDF Parsing • How It Works • Benchmark • Providers • Roadmap
Every RAG pipeline ingests documents the same broken way:
PDF → extract text → split every 512 chars → embed → store → hope
What gets silently destroyed:
| Source | What blind chunking loses |
|---|---|
| Financial report | Table row 45.2% | Q3 | Europe has no column headers |
| Legal contract | Clause split mid-sentence across two chunks |
| API documentation | Code example separated from its description |
| Research paper | Figure caption disconnected from its analysis |
The LLM receives noise and returns approximate answers. This is not a retrieval problem — it is an ingestion problem.
Take a financial report with a revenue table. Here is what each approach gives your LLM:
❌ Blind chunking (LangChain / LlamaIndex default)
chunk_1: "45.2% Q3 Europe 38.1% Q2 Europe 41.7% Q3"
chunk_2: "Asia 29.3% Q2 Asia Americas 52.1% Q3 Ame"
The LLM has numbers. It has no idea what they mean.
✅ DOCNEST
{
"section": "§4.2 Revenue by Region",
"table": {
"caption": "Quarterly revenue breakdown by region",
"headers": ["Region", "Q2 Revenue", "Q3 Revenue", "Change"],
"rows": [
["Europe", "38.1%", "45.2%", "+7.1pp"],
["Asia", "29.3%", "41.7%", "+12.4pp"],
["Americas", "n/a", "52.1%", "—"]
]
},
"summary": "Q3 revenue grew across all regions, led by Asia (+12.4pp)."
}The LLM knows exactly what the numbers mean, where they came from, and how they relate.
DOCNEST reads the structure of a document before touching the content. Every heading becomes a navigable §section. Every table is preserved as { caption, headers, rows[] }. Every section gets a one-sentence summary, a keyword index, and a quantized embedding — computed once at ingest, used forever.
The output is a .udf file — a self-contained portable knowledge base you can share by email, copy to USB, or upload to S3.
pip install docnest-ai pymupdffrom docnest.parsers.pymupdf_pdf import PyMuPDFParser
from docnest.normalizer import SectionNormaliser
from docnest.writer import UDFWriter
from docnest.reader import UDFIndex
# Parse → normalise → save (no LLM, no API key needed)
raw = PyMuPDFParser().parse("your-document.pdf")
doc = SectionNormaliser().normalise(raw)
UDFWriter().write(doc, "my-doc.udf")
# Query
idx = UDFIndex.load("my-doc.udf")
result = idx.query(
"What was Q3 revenue?",
llm_provider="groq",
llm_model="llama-3.3-70b-versatile",
llm_api_key="gsk_...", # free at console.groq.com
)
print(result.answer) # "Q3 revenue was $38M, up 22% YoY."
print(result.layer_used) # 1 — answered from index, 0 LLM tokenspip install docnest-ai# Fast PDF parsing (no ML, no downloads) — recommended for most PDFs
pip install pymupdf
# ML-quality PDF parsing (tables, scanned docs) — requires more RAM
pip install docling
# Fast approximate nearest-neighbour search (large document sets)
pip install faiss-cpu
# Persistent cross-session vector store
pip install chromadb
# Everything at once
pip install docnest-ai pymupdf docling faiss-cpu chromadb# Fastest — PyMuPDF parser, local embeddings, no LLM required
docnest convert report.pdf --pdf-engine pymupdf --fast
# Full quality — Docling parser + LLM enrichment (Groq is free-tier friendly)
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
# Large PDF (>30 pages) — auto page-chunked, full ML quality, bounded RAM
docnest convert big-report.pdf --llm-provider openai --llm-model gpt-4o-mini
# With metadata tags
docnest convert report.pdf \
--owner "Alice Smith" \
--department "Finance" \
--tags "q4,2024,revenue"docnest query report.udf "What was Q3 revenue?"
docnest query report.udf "What are the key risks?" --layers 0,1,2docnest inspect report.udfdocnest view report.udf # opens browser
docnest view report.udf --out report.htmldocnest library init ./docs/
docnest library add ./docs/ report.udf
docnest library add ./docs/ contract.udf
docnest library list ./docs/
docnest library search ./docs/ "revenue forecast"
docnest library remove ./docs/ old-report.udffrom docnest.pipeline import DocNestPipeline
# ── Option 1: Fully local — no API keys, no internet after first download ──
pipeline = DocNestPipeline(
llm_provider="ollama",
llm_model="llama3.2", # ollama pull llama3.2
emb_provider="huggingface",
emb_model="all-MiniLM-L6-v2", # downloaded automatically on first run
)
pipeline.convert("report.pdf") # → report.udf
# ── Option 2: Groq LLM + local HuggingFace embeddings (recommended) ──
pipeline = DocNestPipeline(
llm_provider="groq",
llm_model="llama-3.3-70b-versatile",
llm_api_key="gsk_...", # or set GROQ_API_KEY env var
emb_provider="huggingface",
emb_model="all-MiniLM-L6-v2",
)
pipeline.convert("report.pdf")
# ── Option 3: OpenAI for both LLM and embeddings ──
pipeline = DocNestPipeline(
llm_provider="openai",
llm_model="gpt-4o-mini",
llm_api_key="sk-...",
emb_provider="openai",
emb_model="text-embedding-3-small",
emb_api_key="sk-...",
)
pipeline.convert("report.pdf")
# ── Option 4: Skip intelligence (no LLM, fastest) ──
pipeline = DocNestPipeline(skip_intelligence=True)
pipeline.convert("report.pdf") # embeddings only, no section summaries
# ── Option 5: Custom output path + progress callback ──
pipeline = DocNestPipeline(
llm_provider="groq",
llm_api_key="gsk_...",
)
pipeline.convert(
"report.pdf",
output_path="./output/report.udf",
on_stage_complete=lambda stage, _: print(f"✓ {stage}"),
)from docnest.reader import UDFIndex
# Load the .udf file (instant — no LLM needed to load)
idx = UDFIndex.load("report.udf")
# ── Simple query — escalates through layers automatically ──
result = idx.query(
"What was Q3 revenue?",
llm_provider="groq",
llm_model="llama-3.3-70b-versatile",
llm_api_key="gsk_...",
)
print(result.answer) # "Q3 revenue was $38M, up 22% YoY."
print(result.citations) # ["§3.1"]
print(result.layer_used) # 1 (BM25+cosine, 0 tokens!)
print(result.tokens_used) # 0
# ── Use FAISS for faster search on large documents ──
idx = UDFIndex.load("report.udf", vector="faiss")
# ── ChromaDB — persists across sessions ──
idx = UDFIndex.load("report.udf", vector="chroma", persist_dir="./chroma_store")
# ── Multiple queries on same index (load once, reuse) ──
questions = [
"What were the key risks?",
"What is the revenue forecast for 2025?",
"Who are the main competitors?",
]
for q in questions:
r = idx.query(q, llm_provider="groq", llm_api_key="gsk_...")
print(f"[L{r.layer_used}] {r.answer[:120]}")from docnest.parsers.pdf import DoclingPDFParser
from docnest.parsers.pymupdf_pdf import PyMuPDFParser
from docnest.parsers.factory import ParserFactory
# ── DoclingPDFParser — full ML quality (tables, headings, scanned pages) ──
# Small PDF (≤30 pages) — runs Docling directly
parser = DoclingPDFParser()
raw = parser.parse("report.pdf")
# Large PDF — auto page-chunked in 30-page pieces, same full ML quality
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")
# Explicit chunk size — tune to your RAM
raw = DoclingPDFParser(chunk_pages=10).parse("large.pdf") # low RAM machine
raw = DoclingPDFParser(chunk_pages=50).parse("large.pdf") # high RAM machine
# Scanned PDF (OCR) — requires Docling's Tesseract integration
raw = DoclingPDFParser(ocr=True).parse("scanned-contract.pdf")
# ── PyMuPDFParser — fast, zero ML, works on any machine ──
parser = PyMuPDFParser()
raw = parser.parse("report.pdf")
# ── ParserFactory — auto-selects the right parser by file extension ──
factory = ParserFactory() # default: Docling for PDF
factory = ParserFactory(pdf_engine="pymupdf") # lightweight: PyMuPDF for PDF
raw = factory.get("report.pdf").parse("report.pdf")
raw = factory.get("report.docx").parse("report.docx")
raw = factory.get("data.xlsx").parse("data.xlsx")
raw = factory.get("sales.csv").parse("sales.csv") # also .tsv
raw = factory.get("page.html").parse("page.html")
raw = factory.get("notes.md").parse("notes.md")
# Inspect what was parsed
print(f"Sections: {len(raw.sections)}")
for s in raw.sections:
print(f" L{s.level} {s.title} ({len(s.tables)} tables)")from docnest.parsers.factory import ParserFactory
factory = ParserFactory()
# Switch to PyMuPDF for this session
factory.set_pdf_engine("pymupdf")
# Switch back to Docling
factory.set_pdf_engine("docling")
# Register a completely custom parser
from docnest.parsers.base import IParser
class MyParser(IParser):
def supports(self, path: str) -> bool:
return path.endswith(".myformat")
def parse(self, path: str):
...
factory.register(MyParser())DOCNEST gives you two PDF parsers. Choose based on your document type and available RAM:
DoclingPDFParser |
PyMuPDFParser |
|
|---|---|---|
| Table quality | ✅ ML-grade (TableFormer) | |
| Scanned PDFs | ✅ OCR support | ❌ Text-only |
| Heading detection | ✅ Semantic (Docling layout) | |
| RAM usage | ~1–2 GB (ML models) | ~50 MB |
| First-run download | ~1 GB models | None |
| Speed | Slower | Very fast |
| Best for | Financial reports, contracts, research papers | Resumes, simple reports, fast prototyping |
DoclingPDFParser auto-chunks large PDFs — no quality loss, bounded RAM:
# Auto-chunking kicks in for PDFs > 30 pages
# Peak RAM ≈ RAM_per_chunk, not RAM_for_whole_file
raw = DoclingPDFParser().parse("600-page-report.pdf")
# Tune chunk size to your machine
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf") # 8 GB RAM
raw = DoclingPDFParser(chunk_pages=30).parse("report.pdf") # 16 GB RAM (default)
raw = DoclingPDFParser(chunk_pages=60).parse("report.pdf") # 32 GB+ RAM
# Page images are disabled by default (saves 50-200 MB per page)
# Only enable if you specifically need image assets
raw = DoclingPDFParser(generate_images=True).parse("report.pdf")How chunking works:
- PyMuPDF splits the PDF into N-page temp files
- Docling runs with full ML quality on each chunk
- Sections are merged — output is identical to processing the whole file at once
- Temp files are deleted immediately after each chunk
If you hit std::bad_alloc or OSError: paging file too small, fall back to PyMuPDF:
from docnest.parsers.pymupdf_pdf import PyMuPDFParser
raw = PyMuPDFParser().parse("huge.pdf") # Zero ML, always succeedsDOCNEST runs a 6-stage normalization pipeline on every document:
Stage 1 Structure Extraction (Docling / PyMuPDF) headings, tables, lists, hierarchy
Stage 2 Section Assignment (rule-based) §1, §1.1, §1.2 … every heading = §id
Stage 3 Table Normalization (normaliser) { caption, headers, rows[] } JSON
Stage 4 Section Summarization (LLM) one sentence per section
Stage 5 Document Intelligence (LLM) summary, insights[], key_numbers[]
Stage 6 Embedding + Quantize (local) BM25 keywords + float16 vectors
Stages 1, 2, 3, and 6 run locally — zero LLM cost.
Stages 4–5 call an LLM once per document. Every future query benefits for free.
The result is a .udf file — a self-contained, portable knowledge base:
document.udf (zip)
├── manifest.json format version, embedding model, quantization, DocMeta
├── catalogue.json section index + BM25 keywords + intelligence
├── content.json full section text (loaded on demand)
├── embeddings.bin flat float16 binary blob (~87% smaller than base64)
└── assets/ images, structured tables
embeddings.binis a flat binary blob:N × D × 2 bytes(float16).
Legacy base64-per-section format is still read for backward compatibility.
DOCNEST resolves queries without sending full documents to the LLM:
| Layer | Mechanism | Tokens | Latency | Cost |
|---|---|---|---|---|
| 0 | Pre-computed (summary, insights, key_numbers) | 0 | < 1 ms | $0 |
| 1 | BM25 + cosine → navigate to §section | 0 | < 20 ms | $0 |
| 2 | Section-scoped LLM (~300 tokens) | ~300 | 1–3 s | ~$0.0001 |
| 3 | Multi-section synthesis (~900 tokens) | ~900 | 2–5 s | ~$0.0003 |
| 4 | Full document fallback | ~4,000+ | 5–15 s | ~$0.001 |
| — | Naive RAG (blind chunking) | ~4,000–8,000 | 5–15 s | ~$0.001–0.002 |
| — | Real traditional RAG (full PDF) | ~80,000–120,000 | 5–15 s | ~$0.10–0.15 |
Layers 0 and 1 answer ~70% of real-world questions with zero LLM cost.
DocNest averages ~2,600 tokens on structured files — 67–71% fewer tokens than traditional RAG.
| Format | Parser | Notes |
|---|---|---|
| PDF (text-based) | DoclingPDFParser / PyMuPDFParser |
Full heading hierarchy, table extraction |
| PDF (scanned) | DoclingPDFParser(ocr=True) |
OCR via Docling's Tesseract integration |
| DOCX | DocxParser |
Word documents with styles and heading levels |
| XLSX | ExcelParser |
Each sheet → section, all tables preserved |
| CSV / TSV | CSVParser |
Auto-detect delimiter (, ; | \t); encoding cascade; full table preserved |
| HTML | HTMLParser |
h1–h6 hierarchy via BeautifulSoup |
| Markdown | MarkdownParser |
ATX and Setext headings via mistletoe |
All external dependencies sit behind swappable interfaces. Change the string — no other code changes.
| Value | Notes |
|---|---|
"groq" |
Fast, generous free tier — recommended for getting started |
"openai" |
GPT-4o-mini, GPT-4o |
"ollama" |
Fully local — ollama pull llama3.2 |
"anthropic" |
Claude Haiku, Claude Sonnet |
"google" |
Gemini Flash, Gemini Pro |
"mistral" |
Mistral Large, Mixtral |
"together" |
Together AI hosted models |
"cohere" |
Command R+ |
"bedrock" |
AWS Bedrock (boto3 required) |
| Value | Notes |
|---|---|
"huggingface" |
Local — downloads model once, then offline. Default. |
"openai" |
text-embedding-3-small / text-embedding-3-large |
"ollama" |
Local via Ollama (nomic-embed-text, etc.) |
"google" |
Vertex AI / Gemini embeddings |
"cohere" |
embed-english-v3.0 |
"bedrock" |
AWS Bedrock Titan embeddings |
"nvidia" |
NVIDIA NIM embeddings |
"mistral" |
Mistral embeddings |
| Value | Install | Best for |
|---|---|---|
"numpy" (default) |
built-in | Small docs, zero extra deps |
"faiss" |
pip install faiss-cpu |
Fast ANN on large collections |
"chroma" |
pip install chromadb |
Persistent cross-session store |
from docnest.reader import UDFIndex
idx = UDFIndex.load("report.udf") # numpy (default)
idx = UDFIndex.load("report.udf", vector="faiss") # FAISS
idx = UDFIndex.load("report.udf", vector="chroma", # ChromaDB
persist_dir="./store")| Value | Notes |
|---|---|
"auto" |
Picks best available — bm25 → tfidf → keyword |
"bm25" |
BM25Okapi — best keyword recall |
"tfidf" |
TF-IDF — good fallback |
"keyword" |
Simple term overlap — zero deps |
88 questions · 10 documents · 5 formats — the most comprehensive evaluation of any open-source RAG ingestion library.
Scored using honest factual analysis (not keyword overlap heuristics). Model: Cerebras qwen-3-235b-a22b-instruct-2507.
| Format | Document | Qs | Score | Pass Rate |
|---|---|---|---|---|
| 📊 XLSX | Acme Corp Financial Workbook (10 sheets, formulas, multi-table) | 15 | 8.8 / 10 | 87% |
| 📝 DOCX | TechVision Annual Report (29 sections, 12 tables) | 14 | 9.9 / 10 | ✅ 100% |
| 🌐 HTML | NexusAPI v3 Developer Reference (rate limits, endpoints, SDKs) | 14 | 9.9 / 10 | ✅ 100% |
| 📋 MD | CloudMesh Architecture Spec (DR tiers, observability, compliance) | 15 | 10.0 / 10 | ✅ 100% |
| IPCC AR6 — Summary for Policymakers (122 sections) | 5 | 10.0 / 10 | ✅ 100% | |
| BIS Annual Economic Report 2024 | 5 | 10.0 / 10 | ✅ 100% | |
| GPT-3 Paper — Language Models are Few-Shot Learners | 5 | 7.8 / 10 | 60% | |
| Attention Is All You Need — Transformer Paper | 5 | 8.4 / 10 | 80% | |
| Llama 2 — Open Foundation and Fine-Tuned Chat Models | 5 | 10.0 / 10 | ✅ 100% | |
| Constitutional AI — Harmlessness from AI Feedback | 5 | 10.0 / 10 | ✅ 100% |
| Metric | Value |
|---|---|
| Honest accuracy | 9.55 / 10 |
| Pass rate (≥ 7/10) | 95.5% — 84 / 88 questions |
| Real errors | 4 out of 88 questions |
| Perfect-scoring formats | DOCX · HTML · MD · 3 × PDF |
| Documents | 10 across 5 formats |
| Evaluator | Cerebras qwen-3-235b-a22b-instruct-2507 |
6 out of 10 documents score 10.0/10. Only 4 real retrieval errors in the entire test suite.
Traditional RAG sends a fixed blind slice of every document for each question.
DocNest retrieves only the relevant sections — anywhere in the document.
| Document | Format | Qs | Traditional RAG | DocNest | Saving |
|---|---|---|---|---|---|
| TechVision Annual Report | DOCX | 14 | 35,224 | 11,789 | −67% |
| CloudMesh Architecture Spec | MD | 15 | 29,835 | 8,650 | −71% |
| NexusAPI Reference | HTML | 14 | 19,418 | 9,631 | −50% |
| Acme Corp Financials | XLSX | 15 | 48,840 | 38,991 | −20% |
| Constitutional AI Paper | 5 | 17,280 | 9,216 | −47% | |
| IPCC AR6 | 5 | 17,280 | 22,946 | +33%¹ | |
| GPT-3 Paper | 5 | 17,280 | 46,637 | +170%¹ | |
| Llama 2 Paper | 5 | 17,280 | 46,888 | +171%¹ | |
| Total — 88 questions | 88 | 236,645 | 232,805 | −2% |
¹ Traditional baseline = first 3,456 tokens (blind top-slice — misses content buried in the document).
DocNest retrieves relevant chunks from anywhere in a 50-page paper. More tokens, but the right tokens.
On a real pipeline sending the full document: a 50-page PDF ≈ 80,000–120,000 tokens per question.
DocNest retrieves ~9,000 tokens — a 90 %+ reduction while answering more accurately.
Why this matters for cost:
| Scenario | Model | Tokens / question | Cost / 1,000 Qs |
|---|---|---|---|
| Traditional (full-doc) | GPT-4o-mini | ~100,000 | ~$150 |
| Traditional (blind slice) | GPT-4o-mini | 3,456 | ~$5.20 — misses deep content |
| DocNest | GPT-4o-mini | ~2,600 avg structured · ~9,300 PDF | ~$4–14 — correct answers |
| # | Document | Question | Root Cause |
|---|---|---|---|
| 1 | Acme XLSX | Highest monthly revenue | Retrieved December row instead of November |
| 2 | Acme XLSX | Total Enterprise tier ARR | Missing 4 accounts from retrieval (7,290 vs 7,600) |
| 3 | GPT-3 Paper | Architecture table (96L/96H/d=12288) | Dense figures-embedded table not indexed by PyMuPDF |
| 4 | Attention Paper | Dropout rate P_drop | Ablation row (0) ranked above method section (0.1) |
All 4 errors are retrieval/indexing issues — not LLM hallucination.
DOCX, HTML, MD, and 3 out of 6 PDFs have zero errors.
# Fastest — Cerebras (free tier, no daily token limit)
pip install docnest-ai
export CEREBRAS_API_KEY="your-key"
python eval/rag_accuracy_eval.py --model cerebras/qwen-3-235b-a22b-instruct-2507 --run-id my_run
# Or with Groq (free tier)
export GROQ_API_KEY="gsk_..."
python eval/rag_accuracy_eval.py --model groq/llama-3.3-70b-versatile --run-id my_run
# Or with Gemini
export GOOGLE_API_KEY="your-key"
python eval/rag_accuracy_eval.py --model gemini-2.0-flash --run-id my_run| Phase | Description | Status |
|---|---|---|
| 1 | Core parsers — PDF (Docling + PyMuPDF), DOCX, XLSX, HTML, MD | ✅ Done |
| 2 | Embedding + quantization (10+ providers via LangChain) | ✅ Done |
| 3 | Intelligence engine (summary, insights, key_numbers) | ✅ Done |
| 4 | UDF writer + reader + five-layer query engine | ✅ Done |
| 5 | Large PDF chunking — full ML quality, bounded RAM | ✅ Done |
| 6 | Library mode — multi-document cross-search | ✅ Done |
| 7 | PyPI release pip install docnest-ai |
✅ Done |
| 8 | PPTX parser — slides, speaker notes, embedded tables | 📋 Planned |
| 9 | EPUB parser — chapters, footnotes, embedded images | 📋 Planned |
| 10 | XML parser — configurable tag-to-section mapping | 📋 Planned |
| 11 | CSV / TSV parser — auto-header detection, delimiter sniffing, encoding cascade | ✅ Done |
| 12 | JSON / JSONL parser — nested object flattening, array tables | 📋 Planned |
| 13 | ODT / ODS parser — OpenDocument text and spreadsheet | 📋 Planned |
| 14 | RTF parser — rich text format documents | 📋 Planned |
| 15 | LaTeX / TeX parser — sections, equations, bibliography | 📋 Planned |
| 16 | Connectors: GitHub, Confluence, Notion, Jira | 📋 Planned |
| 17 | Hierarchical supervisor+worker for datasets > 200 MB | 📋 Planned |
Track detailed progress: ROADMAP.md
In-depth articles on how DOCNEST works and the problems it solves:
| # | Title | Published |
|---|---|---|
| 1 | My RAG app confidently told my client the wrong answer. I spent 3 days debugging the wrong thing. | May 2026 |
DOCNEST is community-first. We are building this in the open and want contributors at every level.
| Area | Good for |
|---|---|
| 🧩 New parser (PPTX, EPUB, RST) | Familiar with Docling or document formats |
| 🔌 New vector backend (Qdrant, Weaviate) | Vector database experience |
| 🔌 New connector (SharePoint, Linear) | API integration experience |
| 🧪 Test fixtures | Any skill level — add sample documents for testing |
| 📖 Documentation | Any skill level — improve examples, fix typos |
| 🐛 Bug reports | Any skill level — try it, break it, report it |
| 💡 Architecture discussion | Senior engineers — open a Discussion |
See CONTRIBUTING.md for the full guide.
Give us a ⭐ if DOCNEST solves a problem you have — it helps others find the project.
Full implementation spec: SPEC_DOCNEST_PYPI.md
Covers: architecture, SOLID compliance, design patterns, interfaces, concrete classes, code snippets, test plan, dependency costs.
Open format spec: github.com/tailorgunjan93/udf-spec
MIT — free for commercial use. See LICENSE.
| Product | Description |
|---|---|
| docnest | This library — document normalization engine |
| udf-spec | Open specification for the .udf format |
| knovex | Desktop RAG app powered by DOCNEST |
| udf-reader-vscode | VS Code extension for .udf files |
🔒 Secure · ⚡ Fast · 🛡️ Reliable · 💰 Cost-Effective
Built with ❤️ for the RAG community · github.com/tailorgunjan93/docnest