amannanda-22/DOCMINDPRO

DocuMind Pro: Multi-PDF RAG Intelligence System

╔══════════════════════════════════════════════════════════════════════════════╗
║                                                                              ║
║  ██████╗  ██████╗  ██████╗██╗   ██╗███╗   ███╗██╗███╗   ██╗██████╗           ║
║  ██╔══██╗██╔═══██╗██╔════╝██║   ██║████╗ ████║██║████╗  ██║██╔══██╗          ║
║  ██║  ██║██║   ██║██║     ██║   ██║██╔████╔██║██║██╔██╗ ██║██║  ██║          ║
║  ██║  ██║██║   ██║██║     ██║   ██║██║╚██╔╝██║██║██║╚██╗██║██║  ██║          ║
║  ██████╔╝╚██████╔╝╚██████╗╚██████╔╝██║ ╚═╝ ██║██║██║ ╚████║██████╔╝          ║
║  ╚═════╝  ╚═════╝  ╚═════╝ ╚═════╝ ╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚═════╝           ║
║                                                                              ║
║             Multi-PDF RAG Intelligence System v1.0.0                         ║
║ FastAPI · Streamlit · Gemini 2.5 Flash · ChromaDB · BM25 · RAGAS · Docker    ║
╚══════════════════════════════════════════════════════════════════════════════╝

Architecture

                           ┌────────────────────────────────────────────┐
                           │         Streamlit Frontend (8501)          │
                           │ Premium UI • Chat • Metrics • Upload       │
                           └──────────────────┬─────────────────────────┘
                                              │ HTTP REST
                           ┌──────────────────▼─────────────────────────┐
                           │          FastAPI Backend (8000)            │
                           │ Upload • Query • Metrics • Cache           │
                           └──────────────────┬─────────────────────────┘
                                              │
                         ┌────────────────────▼─────────────────────┐
                         │          Document Processing Pipeline    │
                         │                                          │
                         │ PDF Upload (Multiple PDFs)               │
                         │              │                           │
                         │ PDFPlumber Extraction                    │
                         │              │                           │
                         │ Semantic Chunking                        │
                         │              │                           │
                         │ Gemini Embedding-001 (3072 Dimensions)   │
                         │              │                           │
                         └───────┬───────────────┬──────────────────┘
                                 │               │
                                 ▼               ▼
                         ChromaDB Vector DB     BM25 Index
                                 │               │
                                 └───────┬───────┘
                                         ▼
                            Reciprocal Rank Fusion (RRF)
                                         │
                                         ▼
                            CrossEncoder Re-ranking
                                         │
                                         ▼
                            Gemini 2.5 Flash Generator
                                         │
                                         ▼
                       Citation-aware Answer + Confidence Score
                                         │
                            ┌────────────┴────────────┐
                            ▼                         ▼
                     Semantic Cache             RAGAS Evaluation
                            │                         │
                            └────────────┬────────────┘
                                         ▼
                                  SQLite Metrics

Features

  • Multi-PDF Upload & Indexing
  • PDFPlumber Text Extraction
  • Semantic-aware Chunking
  • Gemini Embedding-001 (3072-Dimensional Embeddings)
  • ChromaDB Vector Database
  • BM25 Sparse Retrieval
  • Hybrid Retrieval Pipeline
  • Reciprocal Rank Fusion (RRF)
  • CrossEncoder Re-ranking
  • Gemini 2.5 Flash Answer Generation
  • Citation-aware Responses
  • Confidence Score
  • Semantic Cache
  • RAGAS Evaluation
  • SQLite Metrics Dashboard
  • FastAPI REST API
  • Streamlit Premium Dashboard
  • LangSmith Tracing Support
  • Apify Web Context (Optional)
  • Docker & Docker Compose Support
  • Dark Premium Responsive UI

RAG Pipeline

Document Indexing

  1. Upload multiple PDF documents.
  2. Extract text page-by-page using PDFPlumber.
  3. Perform semantic-aware chunking.
  4. Generate Gemini Embedding-001 vectors.
  5. Store embeddings in ChromaDB.
  6. Build BM25 sparse index.
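The README does not say how the chunking in step 3 is implemented. As a rough sketch only, assuming a paragraph-aligned strategy with a character budget, it could look like the following. The function name `chunk_paragraphs` and the `max_chars` parameter are hypothetical and not taken from the project.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: the project's real chunking logic is not shown in
    this README. A single paragraph longer than max_chars is kept whole.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraph boundaries intact means a chunk never starts mid-sentence, which tends to help both the embedding and the BM25 side of the index.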

Query Pipeline

  1. Receive user question.
  2. Check Semantic Cache.
  3. Dense Retrieval (ChromaDB Top-20)
  4. Sparse Retrieval (BM25 Top-20)
  5. Reciprocal Rank Fusion
  6. CrossEncoder Re-ranking
  7. Context Assembly
  8. Gemini 2.5 Flash Response Generation
  9. Citation Generation
  10. Confidence Score Calculation
  11. Background RAGAS Evaluation
  12. Store Metrics in SQLite
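Step 5 merges the dense and sparse result lists. A minimal sketch of Reciprocal Rank Fusion, where each chunk scores the sum of 1 / (k + rank) over every list it appears in, is shown below. The constant `k = 60` is a commonly used default and an assumption here; the README does not state the value the project uses, and the chunk IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from ChromaDB (dense) and BM25 (sparse).
dense = ["c1", "c2", "c3"]
sparse = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([dense, sparse])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists accumulate two contributions, so they rise above chunks found by only one retriever. The fused list would then be passed to the re-ranking step.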

Quick Start

Prerequisites

  • Python 3.11+
  • Docker Desktop
  • Google Gemini API Key

Clone Project

git clone <repository-url> documind-pro
cd documind-pro

Configure Environment

cp .env.example .env

Fill in your API keys in .env. Only GOOGLE_API_KEY is required; the LangSmith and Apify keys are optional.

Install

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt

Run Backend

uvicorn api.main:app --reload

Backend

http://localhost:8000

Swagger

http://localhost:8000/docs

Run Frontend

streamlit run frontend/app.py

Frontend

http://localhost:8501

Run Docker

docker compose up --build

API Reference

| Method | Endpoint            | Description           |
|--------|---------------------|-----------------------|
| POST   | /upload             | Upload and index PDFs |
| POST   | /query              | Ask questions         |
| GET    | /documents          | List indexed documents |
| DELETE | /documents/{doc_id} | Delete document       |
| GET    | /metrics            | Evaluation metrics    |
| GET    | /cache/stats        | Cache statistics      |
| DELETE | /cache              | Clear semantic cache  |
| GET    | /health             | Health check          |
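As an illustration of calling the API from Python with only the standard library, the sketch below builds a POST /query request. The JSON field name `question` and the assumption that responses are JSON are guesses, since the README does not document the request or response schema; the Swagger UI at /docs is the place to check the real shapes. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the Quick Start


def build_query_request(
    question: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST /query request.

    The "question" field name is an assumption, not the documented schema.
    """
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode a JSON response (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `send(build_query_request("Summarize chapter 2"))` would return the decoded response, provided the assumed field name matches the real schema.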

Required API Keys

| Environment Variable | Service       | Required |
|----------------------|---------------|----------|
| GOOGLE_API_KEY       | Google Gemini | Yes      |
| LANGCHAIN_API_KEY    | LangSmith     | Optional |
| APIFY_API_TOKEN      | Apify         | Optional |

Manual Configuration

  1. Copy .env.example to .env.
  2. Add Google Gemini API Key.
  3. Start Backend.
  4. Start Frontend.

Project Structure

documind-pro/
├── api/
├── cache/
├── evaluation/
├── frontend/
├── pipeline/
├── rag/
├── retrieval/
├── workflows/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

Tech Stack

| Layer            | Technology               |
|------------------|--------------------------|
| Backend          | FastAPI                  |
| Frontend         | Streamlit                |
| LLM              | Gemini 2.5 Flash         |
| Embeddings       | Gemini Embedding-001     |
| Orchestration    | LangChain                |
| Dense Retrieval  | ChromaDB                 |
| Sparse Retrieval | BM25                     |
| Re-ranking       | HuggingFace CrossEncoder |
| Evaluation       | RAGAS                    |
| Database         | SQLite                   |
| Web Context      | Apify (Optional)         |
| Observability    | LangSmith                |
| Containerization | Docker                   |
| Language         | Python 3.11              |

Troubleshooting

Invalid Gemini API Key

Verify your GOOGLE_API_KEY in .env.

Embedding Model Error

Use:

models/gemini-embedding-001

Gemini Rate Limit

Wait for the free-tier quota to reset or use another API key.

ChromaDB Issues

Delete the local data/chroma directory and re-index your documents.

Gmail Notification Not Working

Reconnect Gmail OAuth inside n8n.

API Not Reachable

Ensure the backend is running on port 8000 before starting Streamlit.
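To check whether anything is listening on the backend port before launching Streamlit, a small TCP probe is enough. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that some process accepts connections on that port, not that it is the DocuMind backend.

```python
import socket


def is_port_open(
    host: str = "localhost", port: int = 8000, timeout: float = 1.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"Backend port 8000 is {state}")
```

For a stronger check, request the /health endpoint listed in the API Reference once the port is open.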

License

MIT License

Author

Aman Nanda

B.Sc. Artificial Intelligence

Generative AI • RAG Engineering • LLM Applications • AI Automation • FastAPI • Streamlit • LangChain • Gemini • ChromaDB • Docker

About

Production-grade Multi-PDF RAG Intelligence System with FastAPI, Streamlit, LangChain, Gemini, ChromaDB, BM25, CrossEncoder, and RAGAS.
