| title | Document AI Analyst |
|---|---|
| emoji | π§ |
| colorFrom | indigo |
| colorTo | purple |
| sdk | docker |
| app_port | 7860 |
| pinned | true |
| license | mit |
| short_description | Enterprise Agentic RAG β upload PDFs and chat with AI |
βββββββ βββββββ ββββββββ ββββββ ββββββββββββββββββββββββββββββββββββ ββββββ ββββ ββββββββββββ
ββββββββββββββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ββββββββββββ
βββββββββββ βββββββββ βββββββββββββββββββββββββββββββββββ βββ ββββββββββββββ βββ βββ
βββββββ βββ βββββββββ βββββββββββββββββββββββββββββββββββ βββ ββββββββββββββββββ βββ
βββ βββββββββββ βββ ββββββββββββββββββββββββββββββ βββ βββ ββββββ ββββββ βββ
βββ βββββββ βββ βββ ββββββββββββββββββββββββββββββ βββ βββ ββββββ βββββ βββ
βββββββ ββββββ βββββββ
ββββββββββββββββββββββββ
βββββββββββββββββββ ββββ
βββββββββββββββββββ βββ
βββ ββββββ ββββββββββββ
βββ ββββββ βββ βββββββ
Upload Β· Embed Β· Retrieve Β· Chat β A production-grade AI document assistant built end-to-end with an agentic RAG pipeline, streaming responses, and per-user data isolation.
## π GirlScript Summer of Code 2026
This project is an official participant in GirlScript Summer of Code 2026 (GSSoC'26) and welcomes contributions from the community.
Features Β· Tech Stack Β· Getting Started Β· Architecture Β· RAG Pipeline Β· API Reference Β· Deployment Β· Contributing
Thanks to all the amazing people who have contributed to PDF-Assistant-RAG! π
π Want to join them? Check out CONTRIBUTING.md for contribution guidelines and look for good first issues to get started!
PDF-Assistant-RAG is a complete, production-ready AI document assistant that lets users upload complex PDFs, financial reports, legal contracts, and research papers β then chat with an AI that provides accurate, cited answers powered by a multi-stage Retrieval-Augmented Generation pipeline.
The system uses hybrid search (vector + BM25) with Reciprocal Rank Fusion and cross-encoder reranking to find the most relevant document chunks, streams AI-generated answers token-by-token, and highlights exact source citations with page numbers β all inside a modern Next.js frontend with JWT-secured per-user data isolation.
Contributor note: see docs/ARCHITECTURE.md for a route-by-route system map, request-flow diagrams, and Swagger/OpenAPI documentation guidance.
graph TD
subgraph Frontend["Frontend (Next.js 16)"]
UI["Dashboard UI (React + Zustand)"]
Chat["Chat Panel (SSE Streaming)"]
Viewer["PDF Viewer"]
end
subgraph Backend["Backend (FastAPI 0.115+)"]
API["API Router (/api/v1)"]
Auth["Auth (JWT + bcrypt)"]
DB[(PostgreSQL / SQLite)]
Redis[(Redis)]
subgraph RAG["RAG Pipeline"]
Upload["Celery Ingestion Task"]
Embed["Local Embeddings (all-MiniLM-L6-v2)"]
EmbedCache["Embedding Cache (Redis + LRU)"]
BM25["BM25 Index"]
Retriever["Hybrid Retriever (Vector + BM25 + RRF)"]
Rerank["Cross-Encoder Reranker (BGE-v2-m3)"]
Agent["Agent / Generator"]
end
end
subgraph Storage["Storage"]
Chroma[(ChromaDB)]
Uploads[("File Storage")]
end
subgraph External["External Services"]
HF["HuggingFace Inference API (Qwen2.5-72B)"]
end
UI <-->|REST / Auth| API
Chat <-->|SSE Streaming| API
Viewer -->|Serve PDF| API
API <--> Auth
API <--> DB
API --> Upload
Upload --> Embed
Embed --> EmbedCache
Embed -->|Store Vectors| Chroma
Upload --> BM25
API <--> Retriever
Retriever -->|Semantic Search| Chroma
Retriever -->|Keyword Search| BM25
Retriever --> Rerank
Rerank --> Agent
Agent <-->|LLM Generation| HF
Upload -->|Store Files| Uploads
Redis <-->|Task Queue| Upload
- User uploads a document via the Next.js frontend.
- FastAPI queues a Celery ingestion task backed by Redis.
- The worker chunks the document, generates local embeddings (cached via Redis/LRU), builds a BM25 index, and stores vectors in ChromaDB.
- At query time, hybrid search merges vector and BM25 results via Reciprocal Rank Fusion.
- A cross-encoder reranker refines the top candidates.
- The agent assembles a prompt and calls the HuggingFace Inference API.
- Streamed SSE tokens are returned to the frontend chat panel.
| Technology | Purpose | |
|---|---|---|
| Next.js 14 | React framework (App Router) | |
| TypeScript | Frontend language | |
| Tailwind CSS | Utility-first styling |
- π€ Discord Bot Integration
- β‘ Celery + Redis Background PDF Processing
- π§ Email Verification Workflow
- π§ RAGAS Evaluation Pipeline
- π Response Caching with Redis
- π³ Optimized Docker Deployment
|
|
|
PDF-Assistant-RAG/
β
βββ backend/
β βββ app/
β β βββ main.py # FastAPI app β lifespan, middleware, routers
β β βββ config.py # Pydantic settings (env vars)
β β βββ models.py # SQLAlchemy ORM models
β β βββ schemas.py # Pydantic request/response schemas
β β βββ database.py # Engine, session, migrations
β β βββ auth.py # JWT helpers
β β βββ tasks.py # Celery task definitions
β β β
β β βββ routes/
β β β βββ auth.py # Register, login, OAuth
β β β βββ documents.py # Upload, list, delete, status
β β β βββ chat.py # Ask, stream, history, export
β β β βββ health.py # Deep health check endpoint
β β β βββ admin.py # Admin stats
β β β βββ workspaces.py # Workspace management
β β β
β β βββ rag/
β β β βββ chunker.py # PDF/DOCX/TXT extraction + chunking
β β β βββ embeddings.py # Local embeddings + Redis/LRU cache
β β β βββ vectorstore.py # ChromaDB operations
β β β βββ bm25.py # BM25 index per document
β β β βββ retriever.py # Hybrid search + RRF + reranking
β β β βββ reranker.py # Cross-encoder reranker
β β β βββ vision.py # Image caption extraction
β β β βββ url_extractor.py # PDF URL/link extraction
β β β βββ graph_builder.py # Knowledge graph (GraphRAG)
β β β βββ agent.py # LLM answer generation
β β β βββ summarizer.py # Document summarisation
β β β
β β βββ services/
β β βββ document_ingestion.py # End-to-end ingestion pipeline
β β
β βββ tests/ # pytest test suite
β βββ requirements.txt
β βββ migrate_add_extracted_urls.py
β
βββ frontend/
β βββ src/
β β βββ app/ # Next.js App Router pages
β β βββ components/ # React components
β β βββ store/ # Zustand state stores
β β βββ lib/ # API client, auth, utilities
β β βββ services/ # API service layer
β βββ e2e/ # Playwright E2E + snapshot tests
β βββ next.config.ts
β βββ playwright.config.ts
β
βββ docs/
β βββ ARCHITECTURE.md
β
βββ .github/
β βββ workflows/
β βββ ci.yml # Backend CI
β βββ e2e.yml # Playwright E2E + visual regression
β βββ deploy.yml # Docker build (main branch)
β βββ devsecops.yml # Security scans
β
βββ docker-compose.yml # CPU + GPU + debug profiles + log rotation
βββ Dockerfile # Multi-stage backend build
βββ frontend/Dockerfile # Multi-stage frontend build (nginx)
βββ .env.example
Python 3.11+
Node.js 20+
Docker + Docker Compose (recommended)
HuggingFace API token β huggingface.co/settings/tokens (free)
git clone https://github.com/param20h/PDF-Assistant-RAG.git
cd PDF-Assistant-RAGcp .env.example .envEdit .env:
SECRET_KEY=your-strong-random-secret
DATABASE_URL=postgresql://pdf_rag_user:pdf_rag_pass@localhost:5432/pdf_rag
HF_TOKEN=hf_your_huggingface_token_here
UPLOAD_DIR=./data/uploads
CHROMA_PERSIST_DIR=./data/chroma_db
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1Get your free HuggingFace token at huggingface.co/settings/tokens
FRONTEND_URL=http://localhost:3000
MAIL_USERNAME=your_smtp_username
MAIL_PASSWORD=your_smtp_or_gmail_app_password
MAIL_FROM=your_sender_email@example.com
MAIL_SERVER=smtp.gmail.com
MAIL_PORT=587
MAIL_STARTTLS=True
MAIL_SSL_TLS=FalseWithout SMTP settings, registration returns a local verification link so contributors can test without email credentials.
# CPU-only (no GPU needed)
docker compose --profile cpu up --build
# GPU-accelerated (requires NVIDIA Container Toolkit)
docker compose --profile gpu up --build
# Also start pgAdmin at http://localhost:5050
docker compose --profile cpu --profile debug up --build| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:7860 |
| API Docs | http://localhost:7860/docs |
| pgAdmin | http://localhost:5050 (debug profile) |
# Backend
cd backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 7860
# Celery worker (separate terminal)
celery -A app.celery_app.celery_app worker --loglevel=info
# Frontend (separate terminal)
cd frontend
npm install
npm run devcrawl4ai-setup βββββββββββββββββββββββββββββββββββββββββββββββ
β PDF / DOCX / TXT / MD Upload β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββ
β PyMuPDF / pdfplumber / python-docx Parser β
β + Image caption extraction β
β + PDF URL/link annotation extraction β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββ
β Recursive Character Text Splitter β
β chunk_size=1000 | overlap=200 β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββ
β all-MiniLM-L6-v2 (local embeddings) β
β 384-dim Β· Redis + LRU cache (24h TTL) β
ββββββββββββββββ¬βββββββββββββββββββββββββββββββ
β
ββββββββββββββββ΄βββββββββββββββ
βΌ βΌ
ββββββββββββββββββββ βββββββββββββββββββββββ
β ChromaDB vectors β β BM25 keyword index β
β (per-user coll.) β β (per-document .pkl)β
ββββββββββββββββββββ βββββββββββββββββββββββ
ββ At Query Time ββ
User Question βββΆ Embed (cached) βββΆ Vector Search (Top-K=20)
β
ββββΆ BM25 Search (Top-K=20)
β
βΌ
Reciprocal Rank Fusion (RRF, k=60)
β
βΌ
BGE-Reranker-v2-m3 Cross-Encoder (Top-K=8)
β
βΌ
Prompt Assembly (system + context + question)
β
βΌ
Qwen2.5-72B-Instruct (HF Inference API)
β
βΌ
Streamed SSE tokens βββΆ Frontend ChatPanel
| Method | Endpoint | Auth | Description |
|---|---|---|---|
POST |
/api/v1/auth/register |
β | Create a new user account |
POST |
/api/v1/auth/login |
β | Login and receive JWT tokens |
GET |
/api/v1/auth/me |
β | Get current user profile |
POST |
/api/v1/documents/upload |
β | Upload PDF/DOCX/TXT and enqueue ingestion (202) |
POST |
/api/v1/documents/urlupload |
β | Crawl a URL and ingest as document |
GET |
/api/v1/documents/ |
β | List documents (pagination + ?q= name filter) |
GET |
/api/v1/documents/{id} |
β | Get document metadata (incl. extracted URLs) |
GET |
/api/v1/documents/{id}/status |
β | Poll ingestion progress |
DELETE |
/api/v1/documents/{id} |
β | Soft-delete document |
POST |
/api/v1/chat/ask/stream |
β | Ask a question (SSE streaming) |
GET |
/api/v1/chat/history/{doc_id} |
β | Get chat history for a document |
DELETE |
/api/v1/chat/history/{doc_id} |
β | Clear chat history |
GET |
/api/v1/chat/export/{doc_id} |
β | Export transcript as MD / TXT / PDF |
GET |
/api/v1/chat/sessions |
β | List chat sessions |
POST |
/api/v1/chat/sessions |
β | Create chat session |
GET |
/api/v1/health/status |
β | Deep health check (DB, Redis, Celery, ChromaDB) |
GET |
/api/health |
β | Basic liveness check |
Full interactive docs at
/docs(Swagger UI) when running locally.
| Variable | Required | Default | Description |
|---|---|---|---|
SECRET_KEY |
β | β | JWT signing secret. Generate: python -c "import secrets; print(secrets.token_urlsafe(32))" |
HF_TOKEN |
β | β | HuggingFace API token for LLM inference |
DATABASE_URL |
β | sqlite:///./data/app.db |
SQLAlchemy connection string (SQLite or PostgreSQL) |
CELERY_BROKER_URL |
β | redis://localhost:6379/0 |
Redis broker for Celery |
CELERY_RESULT_BACKEND |
β | redis://localhost:6379/1 |
Redis backend for Celery results |
REDIS_URL |
β | β | Redis URL for response + embedding cache |
UPLOAD_DIR |
β | ./data/uploads |
File storage directory |
CHROMA_PERSIST_DIR |
β | ./data/chroma_db |
ChromaDB persistence directory |
EMBEDDING_MODEL |
β | sentence-transformers/all-MiniLM-L6-v2 |
Local embedding model |
EMBEDDING_CACHE_TTL |
β | 86400 |
Embedding cache TTL in seconds (24h) |
LLM_MODEL |
β | Qwen/Qwen2.5-72B-Instruct |
HuggingFace model for answer generation |
LLM_TEMPERATURE |
β | 0.3 |
LLM sampling temperature |
RERANKER_MODEL |
β | BAAI/bge-reranker-v2-m3 |
Cross-encoder reranker model |
USE_HYBRID_SEARCH |
β | True |
Enable BM25 + vector hybrid search |
RRF_K |
β | 60 |
RRF smoothing constant |
CHUNK_SIZE |
β | 1000 |
Characters per document chunk |
CHUNK_OVERLAP |
β | 200 |
Overlap between consecutive chunks |
TOP_K_RETRIEVAL |
β | 20 |
Candidates retrieved from vector store |
TOP_K_RERANK |
β | 8 |
Final chunks after reranking |
VISION_PROVIDER |
β | β | Set to openai to use GPT-4o-mini for image captions |
OPENAI_API_KEY |
β | β | Required when VISION_PROVIDER=openai |
ENVIRONMENT |
β | development |
Set to production to lock CORS |
FRONTEND_URL |
β | http://localhost:3000 |
Public frontend URL for OAuth + email links |
NEXT_PUBLIC_API_URL |
β | http://localhost:7860 |
Backend URL injected at frontend build time |
| Command | Description |
|---|---|
uvicorn app.main:app --reload |
Start FastAPI with hot reload |
celery -A app.celery_app.celery_app worker --loglevel=info |
Start Celery worker |
python migrate_add_extracted_urls.py |
Run URL extraction column migration |
python scripts/run_ragas_eval.py --user-id <id> |
Run RAGAS evaluation (vector vs GraphRAG) |
| Command | Description |
|---|---|
npm run dev |
Start Next.js dev server |
npm run build |
Production build |
npm run test |
Run Vitest unit tests |
npm run test:e2e |
Run Playwright E2E tests |
npx playwright test e2e/snapshots.spec.ts --update-snapshots |
Regenerate visual regression baselines |
| Command | Description |
|---|---|
docker compose --profile cpu up --build |
Full stack β CPU only |
docker compose --profile gpu up --build |
Full stack β GPU accelerated |
docker compose --profile debug up |
Also start pgAdmin at http://localhost:5050 |
docker compose down |
Stop all containers |
GPU profile requires NVIDIA Container Toolkit.
- Fork this repo and create a new Space at huggingface.co/new-space (SDK: Docker)
- Set Space secrets:
HF_TOKEN,SECRET_KEY,DATABASE_URL - Push to the
hfremote:
git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
git push hf maindocker compose --profile cpu up -d --build
# App at http://your-server:7860
# Frontend at http://your-server:3000This project is participating in GirlScript Summer of Code! We welcome contributors of all skill levels.
Branch Strategy:
| Branch | Purpose |
|---|---|
main |
Production β HuggingFace deployed (admin only) |
dev |
All contributor PRs target here |
feature/* / fix/* / docs/* |
Your working branches |
# Always branch from dev
git checkout -b feature/my-feature upstream/devQuick links:
- π Good First Issues
- π Contributing Guide
- π¬ Discussions
Distributed under the MIT License. See LICENSE for more information.
Built with π by the open-source community
If you found this project helpful, please give it a β β it helps contributors discover it!