📐 System Design:
High Level Design → docs/HLD.md
SamvidAI is in active design and prototyping.
- Core architecture and OpticalRAG pipeline are defined
- Experiments use synthetic or publicly available documents
- No confidential or client legal data is used
- Benchmarks are prototype-stage estimates from controlled experiments
- Legal workflow validation is ongoing via practicing professionals
The project prioritizes correctness, safety, and validation before scale.
SamvidAI is a next-generation legal document intelligence system built to analyze long, complex contracts (50–300+ pages) using layout-aware, vision-first retrieval.
Traditional OCR + NLP systems flatten documents into text, losing structure and introducing hallucinations.
SamvidAI instead introduces OpticalRAG — a multimodal retrieval-augmented architecture that preserves spatial context while reducing cost and latency.
We do not replace attorneys. We empower them.
SamvidAI automates extraction, retrieval, and risk flagging so legal professionals can focus on judgment, validation, and strategy.
Human-in-the-loop review is a first-class design principle.
SamvidAI follows a documentation-first, system-design-driven approach.
All major architectural and design decisions are formally documented and versioned.
The High Level Design document covers:
- End-to-end system architecture
- OpticalRAG design rationale
- Component responsibilities
- Model & LLM strategy (Gemini 2.5 Pro usage)
- Data flow, security, ethics, and deployment
- Scalability, risks, and future extensions
📄 Read on GitHub
👉 docs/HLD.md
⬇️ Download full design document (DOCX, 90+ pages)
👉 docs/HLD.docx
The DOCX version is the authoritative long-form design, suitable for deep review and offline reading.
Legal contracts are:
- Long and dense
- Highly structured (clauses, tables, headers)
- Extremely risk-sensitive
Existing approaches fail because:
- OCR destroys layout semantics
- Full-document LLM ingestion is expensive
- Long-context hallucinations are common
- Clause hierarchy and visual grouping are ignored
OpticalRAG is a vision-first RAG pipeline that:
- Treats documents as visual data
- Retrieves only relevant regions
- Converts to text only when necessary
The result:
- 🔻 Significant token reduction
- ⚡ Faster inference
- 🧭 Layout-aware reasoning
- 📄 Scales to very long contracts
- 🧠 Reduced hallucinations
Traditional RAG pipelines fail on massive legal documents due to lossy OCR and limited context windows.
OpticalRAG solves this by design.
```
PDF Contract
     ↓
High-Resolution Page Images (300 DPI)
     ↓
Layout-Aware Segmentation (LayoutLMv3)
     ↓
Semantic Regions (Clauses, Tables, Headers)
     ↓
Multimodal Embeddings (Text + Vision)
     ↓
Vector Retrieval (Query-Aware)
     ↓
LLM Reasoning on Relevant Regions Only
```
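For illustration, here is a minimal sketch of how these stages could be wired together. The helper functions (`segment_page`, `build_vector_index`, `embed_query`, `llm_answer`) are hypothetical placeholders for the stages named above, not the actual SamvidAI API; only `pdf2image` is a real dependency.

```python
# Minimal sketch of the OpticalRAG flow shown above.
# Helper names are illustrative placeholders, not the real SamvidAI modules.
from pdf2image import convert_from_path


def answer_query(pdf_path: str, query: str, top_k: int = 5) -> str:
    # 1. PDF -> high-resolution page images (300 DPI)
    pages = convert_from_path(pdf_path, dpi=300)

    # 2. Layout-aware segmentation into semantic regions (clauses, tables, headers)
    regions = [
        region
        for page_no, image in enumerate(pages)
        for region in segment_page(image, page_no)      # e.g. LayoutLMv3-based
    ]

    # 3. Multimodal (text + vision) embeddings, stored in a vector index
    index = build_vector_index(regions)                  # e.g. OpenCLIP + ChromaDB

    # 4. Query-aware retrieval of only the relevant regions
    hits = index.query(embed_query(query), top_k=top_k)

    # 5. LLM reasoning restricted to the retrieved regions
    return llm_answer(query, context_regions=hits)
```

Only the regions retrieved in step 4 ever reach the LLM, which is where the token, latency, and cost savings come from.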
- Preserves spatial and structural context
- Prevents lost-in-the-middle failures
- Optimized for consumer GPUs
- LLM is used for reasoning, not retrieval
- Vision-first document understanding
- Hierarchical retrieval (page → section → clause)
- Query-aware region selection
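To make the hierarchical, query-aware retrieval concrete, here is a minimal sketch using ChromaDB (listed in the tech stack below). The collection name, metadata fields, and placeholder vectors are assumptions for illustration, not the shipped schema:

```python
# Sketch: index segmented regions with page/section/clause metadata so retrieval
# can narrow from page to section to clause. All names and values are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./contract_index")
regions = client.get_or_create_collection(name="contract_regions")

# Index one segmented region (the embedding would come from the text+vision encoder)
regions.add(
    ids=["doc1-p12-s3-c2"],
    embeddings=[[0.01, -0.42, 0.33]],            # placeholder vector
    documents=["Either party may terminate on 30 days' notice ..."],
    metadatas=[{"doc": "doc1", "page": 12, "section": "Termination", "clause": 2}],
)

# Query-aware retrieval: embed the question, optionally narrowing to a section
hits = regions.query(
    query_embeddings=[[0.02, -0.40, 0.30]],      # placeholder query vector
    n_results=5,
    where={"section": "Termination"},            # hierarchical filter (optional)
)
```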
The risk engine identifies potentially risky clauses such as:
- One-sided obligations
- Unusual termination rights
- Missing liability protections
Risk levels:
- 🔴 High Risk
- 🟠 Review Needed
- 🟢 Standard
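One way clause findings could map onto these three levels (the categories and threshold below are illustrative, not the actual `risk_engine` rules):

```python
# Illustrative mapping from clause classifications to Red/Amber/Green risk levels.
# Categories and the confidence threshold are hypothetical, not the shipped rules.
from enum import Enum


class RiskLevel(Enum):
    HIGH = "🔴 High Risk"
    REVIEW = "🟠 Review Needed"
    STANDARD = "🟢 Standard"


HIGH_RISK = {"one_sided_obligation", "missing_liability_protection"}
NEEDS_REVIEW = {"unusual_termination_right", "auto_renewal"}


def score_clause(category: str, confidence: float) -> RiskLevel:
    # Low-confidence classifications always go to human review.
    if confidence < 0.6:
        return RiskLevel.REVIEW
    if category in HIGH_RISK:
        return RiskLevel.HIGH
    if category in NEEDS_REVIEW:
        return RiskLevel.REVIEW
    return RiskLevel.STANDARD
```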
Role-specific summaries:
- Executive overview
- Key obligations
- Financial exposure
- Termination & liability highlights
Human-in-the-loop review:
- Attorneys can accept or reject AI findings
- Feedback enables iterative improvement
- Designed for assistive decision-making
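A minimal sketch of what an accept/reject feedback record might look like; the field names are assumptions, not the actual schema:

```python
# Hypothetical record of an attorney's accept/reject decision on an AI finding.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewFeedback:
    finding_id: str          # ID of the AI risk finding under review
    region_id: str           # Retrieved contract region the finding points to
    accepted: bool           # True = confirmed by the attorney, False = rejected
    reviewer_note: str = ""  # Optional rationale, reusable for later improvement
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```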
Tech stack:
- Python 3.10+
- FastAPI
- Streamlit
- LayoutLMv3
- OpenCV
- PaddleOCR
- OpenCLIP (ViT-H/14)
- BGE / E5
- ChromaDB
LLM strategy:
- Gemini 2.5 Pro (primary, cloud reasoning)
- Qwen2.5-7B / Mistral-7B (local fallback, quantized)
Gemini is used strictly for reasoning over retrieved regions, not full-document ingestion.
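A minimal sketch of this strategy, assuming the `google-genai` Python SDK for Gemini; the prompt layout, fallback policy, and `run_local_llm` wrapper (for the quantized local models) are illustrative assumptions:

```python
# Sketch: reason over retrieved regions only, with a local quantized fallback.
# Assumes the google-genai SDK; prompt format and fallback policy are illustrative.
from google import genai


def reason_over_regions(query: str, regions: list[str]) -> str:
    prompt = (
        "You are assisting a legal review. Answer ONLY from the excerpts below.\n\n"
        + "\n\n".join(f"[Region {i}]\n{r}" for i, r in enumerate(regions, 1))
        + f"\n\nQuestion: {query}"
    )
    try:
        client = genai.Client()  # API key is read from the environment
        response = client.models.generate_content(
            model="gemini-2.5-pro", contents=prompt
        )
        return response.text
    except Exception:
        # Cloud unavailable: fall back to a local quantized 7B model.
        # run_local_llm is a hypothetical wrapper around Qwen2.5-7B / Mistral-7B.
        return run_local_llm(prompt)
```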
Installation:
```bash
git clone https://github.com/your-username/SamvidAI.git
cd SamvidAI
python -m venv venv
source venv/bin/activate   # Linux / Mac
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```
Recommended hardware:
| Component | Specification |
|---|---|
| GPU | RTX 4060 (8 GB VRAM) |
| RAM | 24 GB |
| CPU | 12-core |
| OS | Windows / Linux |
Optimized for consumer GPUs (RTX 4060, 8 GB VRAM) using quantization.
- 4-bit quantized LLMs
- Batched embeddings
- Lazy region loading
Approximate VRAM usage:
- LayoutLMv3: ~2.1 GB
- OpenCLIP: ~1.8 GB
- LLM (7B, 4-bit): ~3.5 GB
- ✅ Total ≈ 7.4 GB, which runs comfortably on an 8 GB consumer GPU
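A minimal sketch of loading the local fallback model in 4-bit, assuming `transformers` + `bitsandbytes`; the checkpoint id and settings are illustrative:

```python
# Sketch: load a 7B fallback model in 4-bit (NF4) so it fits in ~3.5 GB of VRAM.
# Checkpoint id and settings are illustrative; assumes transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example local fallback checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU automatically
)
```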
Run the app (backend and UI in separate terminals):
```bash
uvicorn api.main:app --reload        # FastAPI backend
streamlit run ui/streamlit_app.py    # Streamlit UI
```
1. Upload a contract PDF
2. Ask a question (e.g. "What are termination risks?")
3. View:
- Highlighted contract regions
- Risk flags
- Explanations
4. Accept or reject AI findings
The figures below are based on controlled experiments with synthetic contracts; they are not production guarantees.
| Metric | OCR + Text RAG | SamvidAI (OpticalRAG) |
|---|---|---|
| Tokens to LLM | High | Significantly Lower |
| Latency | High | Reduced |
| Layout Accuracy | Poor | High |
| Hallucination Risk | High | Lower |
| Approach | Estimated Cost |
|---|---|
| Full-Text GPT-4 | ~$4.20 |
| OCR + RAG | ~$1.90 |
| SamvidAI | ~$0.65 |
```
SamvidAI/
│
├── README.md # Product-facing overview (FIRST IMPRESSION)
├── WEBSITE.md # Landing page copy
├── DEMO.md # Demo links + walkthrough
│
├── docs/ # SYSTEM & ENGINEERING (AUTHORITATIVE DESIGN)
│ ├── HLD.md # High-Level Design
│ ├── HLD.docx # High-Level Design (Long-form, downloadable)
│ ├── LLD.md # Low-Level Design
│ ├── ARCHITECTURE.md # Component & deployment architecture
│ ├── PIPELINE.md # End-to-end data & inference pipeline
│ ├── DATA_REPORTS.md # Metrics, charts, evaluations
│ ├── EXPERIMENTS.md # Ablations, experiments
│ ├── BENCHMARKS.md # Performance comparisons
│ ├── SECURITY.md # Security considerations
│ ├── ETHICS.md # Ethics & safety
│
├── research/ # SCIENTIFIC THINKING
│ ├── related_work.md # Prior research & models
│ ├── papers.md # Paper summaries & links
│ ├── findings.md # Your insights & failures
│
├── product/ # FOUNDER MODE
│ ├── roadmap.md # 30-90-365 day plan
│ ├── monetization.md # Business model
│ ├── user_personas.md # Target users
│ ├── go_to_market.md # Distribution strategy
│
├── src/ # CODE
│ └── samvidai/
│ ├── __init__.py
│ │
│ ├── ingestion/ # PDF → image → layout
│ │ ├── __init__.py
│ │ ├── pdf_to_image.py
│ │ └── preprocess.py
│ │
│ ├── layout/ # Layout-aware segmentation
│ │ ├── __init__.py
│ │ └── layoutlm.py
│ │
│ ├── retrieval/ # OpticalRAG core
│ │ ├── __init__.py
│ │ ├── embeddings.py
│ │ ├── vector_store.py
│ │ └── retriever.py
│ │
│ ├── risk_engine/ # Clause classification & risk scoring
│ │ ├── __init__.py
│ │ ├── classifier.py
│ │ └── scorer.py
│ │
│ ├── llm/ # LLM interfaces
│ │ ├── __init__.py
│ │ ├── prompts.py
│ │ └── inference.py
│ │
│ └── utils/
│ ├── __init__.py
│ └── logger.py
│
├── api/ # BACKEND
│ └── main.py # FastAPI app
│
├── ui/ # FRONTEND
│ └── streamlit_app.py
│
├── assets/ # VISUALS
│ ├── images/
│ ├── videos/
│ └── diagrams/
│
├── tests/
│ └── TESTING_PLAN.md
│
├── docker/
│ └── Dockerfile
│
├── requirements.txt
└── .gitignore
```
SamvidAI incorporates modern retrieval and LLM research, including:
- Hierarchical RAG
- Query-aware retrieval
- Late chunking
- Lost-in-the-middle mitigation
- Contrastive multimodal embeddings
- Hybrid rule-based + LLM reasoning
- Human-in-the-loop active learning
- PDF → image conversion
- Layout segmentation
- Multimodal retrieval
- Query-aware chunking
- Clause classification
- Red / Amber / Green scoring
- Attorney validation
- Feedback storage
- Latency tuning
- Dataset-driven improvements
SamvidAI is built with a startup-first mindset:
- Solves a real legal pain point
- Optimized for limited hardware
- Open-source friendly
- Enterprise-ready foundation
The long-term goal is to evolve SamvidAI into a full legal intelligence platform for contract review, compliance, and dispute risk forecasting.
Contributions, ideas, and discussions are welcome.
If you're interested in:
- Legal AI
- Multimodal RAG
- Human-in-the-loop systems
You’ll feel right at home here.
MIT License
If you like this project, ⭐ star the repo and join the journey.