📐 System Design:
High Level Design → docs/HLD.md
SamvidAI is in active design and prototyping.
- Core architecture and OpticalRAG pipeline are defined
- Experiments use synthetic or publicly available documents
- No confidential or client legal data is used
- Benchmarks are prototype-stage estimates from controlled experiments
- Legal workflow validation is ongoing via practicing professionals
The project prioritizes correctness, safety, and validation before scale.
SamvidAI is a next-generation legal document intelligence system built to analyze long, complex contracts (50–300+ pages) using layout-aware, vision-first retrieval.
Traditional OCR + NLP systems flatten documents into text, losing structure and introducing hallucinations.
SamvidAI instead introduces OpticalRAG — a multimodal retrieval-augmented architecture that preserves spatial context while reducing cost and latency.
We do not replace attorneys. We empower them.
SamvidAI automates extraction, retrieval, and risk flagging so legal professionals can focus on judgment, validation, and strategy.
Human-in-the-loop review is a first-class design principle.
SamvidAI follows a documentation-first, system-design-driven approach.
All major architectural and design decisions are formally documented and versioned.
The High Level Design document covers:
- End-to-end system architecture
- OpticalRAG design rationale
- Component responsibilities
- Model & LLM strategy (Gemini 2.5 Pro usage)
- Data flow, security, ethics, and deployment
- Scalability, risks, and future extensions
📄 Read on GitHub
👉 docs/HLD.md
⬇️ Download full design document (DOCX, 90+ pages)
👉 docs/HLD.docx
The DOCX version is the authoritative long-form design, suitable for deep review and offline reading.
Legal contracts are:
- Long and dense
- Highly structured (clauses, tables, headers)
- Extremely risk-sensitive
Existing approaches fail because:
- OCR destroys layout semantics
- Full-document LLM ingestion is expensive
- Long-context hallucinations are common
- Clause hierarchy and visual grouping are ignored
OpticalRAG is a vision-first RAG pipeline that:
- Treats documents as visual data
- Retrieves only relevant regions
- Converts to text only when necessary
The result:
- 🔻 Significant token reduction
- ⚡ Faster inference
- 🧭 Layout-aware reasoning
- 📄 Scales to very long contracts
- 🧠 Reduced hallucinations
Traditional RAG pipelines fail on massive legal documents due to lossy OCR and limited context windows.
OpticalRAG solves this by design.
```
PDF Contract
     ↓
High-Resolution Page Images (300 DPI)
     ↓
Layout-Aware Segmentation (LayoutLMv3)
     ↓
Semantic Regions (Clauses, Tables, Headers)
     ↓
Multimodal Embeddings (Text + Vision)
     ↓
Vector Retrieval (Query-Aware)
     ↓
LLM Reasoning on Relevant Regions Only
```
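For illustration, here is a minimal sketch of how these stages could be wired together. The helper functions (`segment_page`, `build_vector_index`, `embed_query`, `llm_answer`) are hypothetical placeholders for the stages named above, not the actual SamvidAI API; only `pdf2image` is a real dependency.

```python
# Minimal sketch of the OpticalRAG flow shown above.
# Helper names are illustrative placeholders, not the real SamvidAI modules.
from pdf2image import convert_from_path


def answer_query(pdf_path: str, query: str, top_k: int = 5) -> str:
    # 1. PDF -> high-resolution page images (300 DPI)
    pages = convert_from_path(pdf_path, dpi=300)

    # 2. Layout-aware segmentation into semantic regions (clauses, tables, headers)
    regions = [
        region
        for page_no, image in enumerate(pages)
        for region in segment_page(image, page_no)      # e.g. LayoutLMv3-based
    ]

    # 3. Multimodal (text + vision) embeddings, stored in a vector index
    index = build_vector_index(regions)                  # e.g. OpenCLIP + ChromaDB

    # 4. Query-aware retrieval of only the relevant regions
    hits = index.query(embed_query(query), top_k=top_k)

    # 5. LLM reasoning restricted to the retrieved regions
    return llm_answer(query, context_regions=hits)
```

Only the regions retrieved in step 4 ever reach the LLM, which is where the token, latency, and cost savings come from.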
- Preserves spatial and structural context
- Prevents lost-in-the-middle failures
- Optimized for consumer GPUs
- LLM is used for reasoning, not retrieval
- Vision-first document understanding
- Hierarchical retrieval (page → section → clause)
- Query-aware region selection
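To make the hierarchical, query-aware retrieval concrete, here is a minimal sketch using ChromaDB (listed in the tech stack below). The collection name, metadata fields, and placeholder vectors are assumptions for illustration, not the shipped schema:

```python
# Sketch: index segmented regions with page/section/clause metadata so retrieval
# can narrow from page to section to clause. All names and values are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./contract_index")
regions = client.get_or_create_collection(name="contract_regions")

# Index one segmented region (the embedding would come from the text+vision encoder)
regions.add(
    ids=["doc1-p12-s3-c2"],
    embeddings=[[0.01, -0.42, 0.33]],            # placeholder vector
    documents=["Either party may terminate on 30 days' notice ..."],
    metadatas=[{"doc": "doc1", "page": 12, "section": "Termination", "clause": 2}],
)

# Query-aware retrieval: embed the question, optionally narrowing to a section
hits = regions.query(
    query_embeddings=[[0.02, -0.40, 0.30]],      # placeholder query vector
    n_results=5,
    where={"section": "Termination"},            # hierarchical filter (optional)
)
```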
The risk engine identifies potentially risky clauses such as:
- One-sided obligations
- Unusual termination rights
- Missing liability protections
Risk levels:
- 🔴 High Risk
- 🟠 Review Needed
- 🟢 Standard
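One way clause findings could map onto these three levels (the categories and threshold below are illustrative, not the actual `risk_engine` rules):

```python
# Illustrative mapping from clause classifications to Red/Amber/Green risk levels.
# Categories and the confidence threshold are hypothetical, not the shipped rules.
from enum import Enum


class RiskLevel(Enum):
    HIGH = "🔴 High Risk"
    REVIEW = "🟠 Review Needed"
    STANDARD = "🟢 Standard"


HIGH_RISK = {"one_sided_obligation", "missing_liability_protection"}
NEEDS_REVIEW = {"unusual_termination_right", "auto_renewal"}


def score_clause(category: str, confidence: float) -> RiskLevel:
    # Low-confidence classifications always go to human review.
    if confidence < 0.6:
        return RiskLevel.REVIEW
    if category in HIGH_RISK:
        return RiskLevel.HIGH
    if category in NEEDS_REVIEW:
        return RiskLevel.REVIEW
    return RiskLevel.STANDARD
```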
Role-specific summaries:
- Executive overview
- Key obligations
- Financial exposure
- Termination & liability highlights
Human-in-the-loop review:
- Attorneys can accept or reject AI findings
- Feedback enables iterative improvement
- Designed for assistive decision-making
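A minimal sketch of what an accept/reject feedback record might look like; the field names are assumptions, not the actual schema:

```python
# Hypothetical record of an attorney's accept/reject decision on an AI finding.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewFeedback:
    finding_id: str          # ID of the AI risk finding under review
    region_id: str           # Retrieved contract region the finding points to
    accepted: bool           # True = confirmed by the attorney, False = rejected
    reviewer_note: str = ""  # Optional rationale, reusable for later improvement
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```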
Tech stack:
- Python 3.10+
- FastAPI
- Streamlit
- LayoutLMv3
- OpenCV
- PaddleOCR
- OpenCLIP (ViT-H/14)
- BGE / E5
- ChromaDB
LLM strategy:
- Gemini 2.5 Pro (primary, cloud reasoning)
- Qwen2.5-7B / Mistral-7B (local fallback, quantized)
Gemini is used strictly for reasoning over retrieved regions, not full-document ingestion.
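A minimal sketch of this strategy, assuming the `google-genai` Python SDK for Gemini; the prompt layout, fallback policy, and `run_local_llm` wrapper (for the quantized local models) are illustrative assumptions:

```python
# Sketch: reason over retrieved regions only, with a local quantized fallback.
# Assumes the google-genai SDK; prompt format and fallback policy are illustrative.
from google import genai


def reason_over_regions(query: str, regions: list[str]) -> str:
    prompt = (
        "You are assisting a legal review. Answer ONLY from the excerpts below.\n\n"
        + "\n\n".join(f"[Region {i}]\n{r}" for i, r in enumerate(regions, 1))
        + f"\n\nQuestion: {query}"
    )
    try:
        client = genai.Client()  # API key is read from the environment
        response = client.models.generate_content(
            model="gemini-2.5-pro", contents=prompt
        )
        return response.text
    except Exception:
        # Cloud unavailable: fall back to a local quantized 7B model.
        # run_local_llm is a hypothetical wrapper around Qwen2.5-7B / Mistral-7B.
        return run_local_llm(prompt)
```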
Installation:
```bash
git clone https://github.com/your-username/SamvidAI.git
cd SamvidAI
python -m venv venv
source venv/bin/activate   # Linux / Mac
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```
Recommended hardware:
| Component | Specification |
|---|---|
| GPU | RTX 4060 (8 GB VRAM) |
| RAM | 24 GB |
| CPU | 12-core |
| OS | Windows / Linux |
Optimized for consumer GPUs (RTX 4060, 8 GB VRAM) using quantization.
- 4-bit quantized LLMs
- Batched embeddings
- Lazy region loading
Approximate VRAM usage:
- LayoutLMv3: ~2.1 GB
- OpenCLIP: ~1.8 GB
- LLM (7B, 4-bit): ~3.5 GB
- ✅ Total ≈ 7.4 GB, which runs comfortably on an 8 GB consumer GPU
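A minimal sketch of loading the local fallback model in 4-bit, assuming `transformers` + `bitsandbytes`; the checkpoint id and settings are illustrative:

```python
# Sketch: load a 7B fallback model in 4-bit (NF4) so it fits in ~3.5 GB of VRAM.
# Checkpoint id and settings are illustrative; assumes transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example local fallback checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU automatically
)
```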
Run the app (backend and UI in separate terminals):
```bash
uvicorn api.main:app --reload        # FastAPI backend
streamlit run ui/streamlit_app.py    # Streamlit UI
```
1. Upload a contract PDF
2. Ask a question (e.g. "What are termination risks?")
3. View:
- Highlighted contract regions
- Risk flags
- Explanations
4. Accept or reject AI findings
The figures below are based on controlled experiments with synthetic contracts; they are not production guarantees.
| Metric | OCR + Text RAG | SamvidAI (OpticalRAG) |
|---|---|---|
| Tokens to LLM | High | Significantly Lower |
| Latency | High | Reduced |
| Layout Accuracy | Poor | High |
| Hallucination Risk | High | Lower |
| Approach | Estimated Cost |
|---|---|
| Full-Text GPT-4 | ~$4.20 |
| OCR + RAG | ~$1.90 |
| SamvidAI | ~$0.65 |
```
SamvidAI/
│
├── README.md # Product-facing overview (FIRST IMPRESSION)
├── WEBSITE.md # Landing page copy
├── DEMO.md # Demo links + walkthrough
│
├── docs/ # SYSTEM & ENGINEERING (AUTHORITATIVE DESIGN)
│ ├── HLD.md # High-Level Design
│ ├── HLD.docx # High-Level Design (Long-form, downloadable)
│ ├── LLD.md # Low-Level Design
│ ├── ARCHITECTURE.md # Component & deployment architecture
│ ├── PIPELINE.md # End-to-end data & inference pipeline
│ ├── DATA_REPORTS.md # Metrics, charts, evaluations
│ ├── EXPERIMENTS.md # Ablations, experiments
│ ├── BENCHMARKS.md # Performance comparisons
│ ├── SECURITY.md # Security considerations
│ ├── ETHICS.md # Ethics & safety
│
├── research/ # SCIENTIFIC THINKING
│ ├── related_work.md # Prior research & models
│ ├── papers.md # Paper summaries & links
│ ├── findings.md # Your insights & failures
│
├── product/ # FOUNDER MODE
│ ├── roadmap.md # 30-90-365 day plan
│ ├── monetization.md # Business model
│ ├── user_personas.md # Target users
│ ├── go_to_market.md # Distribution strategy
│
├── src/ # CODE
│ └── samvidai/
│ ├── __init__.py
│ │
│ ├── ingestion/ # PDF → image → layout
│ │ ├── __init__.py
│ │ ├── pdf_to_image.py
│ │ └── preprocess.py
│ │
│ ├── layout/ # Layout-aware segmentation
│ │ ├── __init__.py
│ │ └── layoutlm.py
│ │
│ ├── retrieval/ # OpticalRAG core
│ │ ├── __init__.py
│ │ ├── embeddings.py
│ │ ├── vector_store.py
│ │ └── retriever.py
│ │
│ ├── risk_engine/ # Clause classification & risk scoring
│ │ ├── __init__.py
│ │ ├── classifier.py
│ │ └── scorer.py
│ │
│ ├── llm/ # LLM interfaces
│ │ ├── __init__.py
│ │ ├── prompts.py
│ │ └── inference.py
│ │
│ └── utils/
│ ├── __init__.py
│ └── logger.py
│
├── api/ # BACKEND
│ └── main.py # FastAPI app
│
├── ui/ # FRONTEND
│ └── streamlit_app.py
│
├── assets/ # VISUALS
│ ├── images/
│ ├── videos/
│ └── diagrams/
│
├── tests/
│ └── TESTING_PLAN.md
│
├── docker/
│ └── Dockerfile
│
├── requirements.txt
└── .gitignore
```
SamvidAI incorporates modern retrieval and LLM research, including:
- Hierarchical RAG
- Query-aware retrieval
- Late chunking
- Lost-in-the-middle mitigation
- Contrastive multimodal embeddings
- Hybrid rule-based + LLM reasoning
- Human-in-the-loop active learning
- PDF → image conversion
- Layout segmentation
- Multimodal retrieval
- Query-aware chunking
- Clause classification
- Red / Amber / Green scoring
- Attorney validation
- Feedback storage
- Latency tuning
- Dataset-driven improvements
SamvidAI is built with a startup-first mindset:
- Solves a real legal pain point
- Optimized for limited hardware
- Open-source friendly
- Enterprise-ready foundation
The long-term goal is to evolve SamvidAI into a full legal intelligence platform for contract review, compliance, and dispute risk forecasting.
Contributions, ideas, and discussions are welcome.
If you're interested in:
- Legal AI
- Multimodal RAG
- Human-in-the-loop systems
You’ll feel right at home here.
MIT License
If you like this project, ⭐ star the repo and join the journey.