NaitikVerma6776/PDF-Assistant-RAG

title: Document AI Analyst
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
license: mit
short_description: Enterprise Agentic RAG - upload PDFs and chat with AI

PDF ASSISTANT RAG

Enterprise Agentic Retrieval-Augmented Generation System


FastAPI Next.js Python PostgreSQL ChromaDB HuggingFace Celery Docker License: MIT


Upload · Embed · Retrieve · Chat: a production-grade AI document assistant built end-to-end with an agentic RAG pipeline, streaming responses, and per-user data isolation.


## 🌟 GirlScript Summer of Code 2026

This project is an official participant in GirlScript Summer of Code 2026 (GSSoC'26) and welcomes contributions from the community.


Features · Tech Stack · Getting Started · Architecture · RAG Pipeline · API Reference · Deployment · Contributing


🀝 Contributors

Thanks to all the amazing people who have contributed to PDF-Assistant-RAG! πŸŽ‰


🌟 Want to join them? Check out CONTRIBUTING.md for contribution guidelines and look for good first issues to get started!



🌟 Overview

PDF-Assistant-RAG is a complete, production-ready AI document assistant that lets users upload complex PDFs, financial reports, legal contracts, and research papers, then chat with an AI that provides accurate, cited answers powered by a multi-stage Retrieval-Augmented Generation pipeline.

The system uses hybrid search (vector + BM25) with Reciprocal Rank Fusion and cross-encoder reranking to find the most relevant document chunks, streams AI-generated answers token-by-token, and highlights exact source citations with page numbers, all inside a modern Next.js frontend with JWT-secured per-user data isolation.


πŸ—οΈ Architecture

Contributor note: see docs/ARCHITECTURE.md for a route-by-route system map, request-flow diagrams, and Swagger/OpenAPI documentation guidance.

graph TD
    subgraph Frontend["Frontend (Next.js 16)"]
        UI["Dashboard UI (React + Zustand)"]
        Chat["Chat Panel (SSE Streaming)"]
        Viewer["PDF Viewer"]
    end

    subgraph Backend["Backend (FastAPI 0.115+)"]
        API["API Router (/api/v1)"]
        Auth["Auth (JWT + bcrypt)"]
        DB[(PostgreSQL / SQLite)]
        Redis[(Redis)]

        subgraph RAG["RAG Pipeline"]
            Upload["Celery Ingestion Task"]
            Embed["Local Embeddings (all-MiniLM-L6-v2)"]
            EmbedCache["Embedding Cache (Redis + LRU)"]
            BM25["BM25 Index"]
            Retriever["Hybrid Retriever (Vector + BM25 + RRF)"]
            Rerank["Cross-Encoder Reranker (BGE-v2-m3)"]
            Agent["Agent / Generator"]
        end
    end

    subgraph Storage["Storage"]
        Chroma[(ChromaDB)]
        Uploads[("File Storage")]
    end

    subgraph External["External Services"]
        HF["HuggingFace Inference API (Qwen2.5-72B)"]
    end

    UI <-->|REST / Auth| API
    Chat <-->|SSE Streaming| API
    Viewer -->|Serve PDF| API
    API <--> Auth
    API <--> DB
    API --> Upload
    Upload --> Embed
    Embed --> EmbedCache
    Embed -->|Store Vectors| Chroma
    Upload --> BM25
    API <--> Retriever
    Retriever -->|Semantic Search| Chroma
    Retriever -->|Keyword Search| BM25
    Retriever --> Rerank
    Rerank --> Agent
    Agent <-->|LLM Generation| HF
    Upload -->|Store Files| Uploads
    Redis <-->|Task Queue| Upload

🔄 System Flow Overview

  1. User uploads a document via the Next.js frontend.
  2. FastAPI queues a Celery ingestion task backed by Redis.
  3. The worker chunks the document, generates local embeddings (cached via Redis/LRU), builds a BM25 index, and stores vectors in ChromaDB.
  4. At query time, hybrid search merges vector and BM25 results via Reciprocal Rank Fusion.
  5. A cross-encoder reranker refines the top candidates.
  6. The agent assembles a prompt and calls the HuggingFace Inference API.
  7. Streamed SSE tokens are returned to the frontend chat panel.

🛠 Tech Stack

Backend

| Technology | Purpose |
| --- | --- |
| FastAPI | Async web framework + routing |
| Python 3.11 | Runtime environment |
| PostgreSQL / SQLite | Relational database (SQLAlchemy ORM) |
| JWT + bcrypt | Authentication & password hashing |
| ChromaDB | Local vector store (embeddings) |
| HuggingFace Inference API | LLM answer generation |
| sentence-transformers | Local embedding model (all-MiniLM-L6-v2) |

Frontend

| Technology | Purpose |
| --- | --- |
| Next.js 14 | React framework (App Router) |
| TypeScript | Frontend language |
| Tailwind CSS | Utility-first styling |

AI / ML Pipeline

| Technology | Purpose |
| --- | --- |
| sentence-transformers (all-MiniLM-L6-v2) | Generates vector embeddings for document chunks |
| ChromaDB | Stores + retrieves embeddings locally |
| HuggingFace Inference API | Generates answers from retrieved context |
| BAAI/bge-reranker-v2-m3 | Cross-encoder reranking for retrieval quality |
| Knowledge Graph (GraphRAG) | Entity extraction + relationship graphs |
| PyMuPDF + pdfplumber + python-docx | Document text extraction |

DevOps & Tooling

| Technology | Purpose |
| --- | --- |
| Docker Multi-Stage | Containerised deployment |
| GitHub Actions | CI/CD (E2E, security, deploy) |
| Playwright | E2E + visual regression tests |
| Prometheus + Grafana | Metrics & observability |
| HuggingFace Spaces | Production deployment |

✨ Key Features

πŸ†• Recent Updates

  • πŸ€– Discord Bot Integration
  • ⚑ Celery + Redis Background PDF Processing
  • πŸ“§ Email Verification Workflow
  • 🧠 RAGAS Evaluation Pipeline
  • πŸš€ Response Caching with Redis
  • 🐳 Optimized Docker Deployment

πŸ‘€ Users

  • πŸ” JWT-secured register, login & email verification
  • πŸ“„ Upload PDF, DOCX, TXT, and Markdown
  • 🌐 URL ingestion via web crawler
  • πŸ’¬ Ask questions in natural language
  • 🌊 Streaming AI responses token-by-token
  • πŸ“š Inline source citations with page numbers
  • πŸ“₯ Export chat as Markdown, TXT, or PDF
  • πŸ—‚οΈ Per-user complete data isolation

πŸ€– RAG Pipeline

  • πŸ”ͺ Smart recursive text chunking (configurable size & overlap)
  • 🧠 Local embeddings β€” no data leaves your machine
  • ⚑ Embedding cache (Redis + LRU) β€” skip redundant computation
  • πŸ” Hybrid search β€” vector + BM25 merged via RRF
  • πŸ† Cross-encoder reranking for precision answers
  • πŸ–ΌοΈ Image caption extraction from PDF figures
  • πŸ”— URL extraction from PDF link annotations
  • πŸ—ΊοΈ Knowledge graph (GraphRAG) per document
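
To make the "configurable size & overlap" idea concrete, here is a minimal sketch of overlapping character windows using the README's defaults (1000 characters, 200 overlap). It is a simplified fixed-window splitter, not the project's recursive splitter: the function name `chunk_text` is hypothetical, and a recursive splitter would additionally prefer to break on paragraph and sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores paragraph and
    sentence boundaries, which a recursive splitter would try to respect.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    pieces = chunk_text(sample)
    print([len(p) for p in pieces])
```

Each chunk repeats the last `overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.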

⚙️ Engineering

  • 🚀 Async FastAPI with SSE streaming
  • 🗄️ PostgreSQL metadata + ChromaDB vectors
  • 🔄 Celery + Redis async ingestion pipeline
  • 🐳 Multi-stage Docker with CPU & GPU profiles
  • 📊 Prometheus metrics + Grafana dashboard
  • 🩺 Deep health endpoint: DB, Redis, Celery, ChromaDB
  • 🔒 Rate limiting, CORS, file validation, JWT expiry
  • 🧪 Playwright E2E + visual regression tests

📁 Project Structure

PDF-Assistant-RAG/
│
├── backend/
│   ├── app/
│   │   ├── main.py                 # FastAPI app: lifespan, middleware, routers
│   │   ├── config.py               # Pydantic settings (env vars)
│   │   ├── models.py               # SQLAlchemy ORM models
│   │   ├── schemas.py              # Pydantic request/response schemas
│   │   ├── database.py             # Engine, session, migrations
│   │   ├── auth.py                 # JWT helpers
│   │   ├── tasks.py                # Celery task definitions
│   │   │
│   │   ├── routes/
│   │   │   ├── auth.py             # Register, login, OAuth
│   │   │   ├── documents.py        # Upload, list, delete, status
│   │   │   ├── chat.py             # Ask, stream, history, export
│   │   │   ├── health.py           # Deep health check endpoint
│   │   │   ├── admin.py            # Admin stats
│   │   │   └── workspaces.py       # Workspace management
│   │   │
│   │   ├── rag/
│   │   │   ├── chunker.py          # PDF/DOCX/TXT extraction + chunking
│   │   │   ├── embeddings.py       # Local embeddings + Redis/LRU cache
│   │   │   ├── vectorstore.py      # ChromaDB operations
│   │   │   ├── bm25.py             # BM25 index per document
│   │   │   ├── retriever.py        # Hybrid search + RRF + reranking
│   │   │   ├── reranker.py         # Cross-encoder reranker
│   │   │   ├── vision.py           # Image caption extraction
│   │   │   ├── url_extractor.py    # PDF URL/link extraction
│   │   │   ├── graph_builder.py    # Knowledge graph (GraphRAG)
│   │   │   ├── agent.py            # LLM answer generation
│   │   │   └── summarizer.py       # Document summarisation
│   │   │
│   │   └── services/
│   │       └── document_ingestion.py  # End-to-end ingestion pipeline
│   │
│   ├── tests/                      # pytest test suite
│   ├── requirements.txt
│   └── migrate_add_extracted_urls.py
│
├── frontend/
│   ├── src/
│   │   ├── app/                    # Next.js App Router pages
│   │   ├── components/             # React components
│   │   ├── store/                  # Zustand state stores
│   │   ├── lib/                    # API client, auth, utilities
│   │   └── services/               # API service layer
│   ├── e2e/                        # Playwright E2E + snapshot tests
│   ├── next.config.ts
│   └── playwright.config.ts
│
├── docs/
│   └── ARCHITECTURE.md
│
├── .github/
│   └── workflows/
│       ├── ci.yml                  # Backend CI
│       ├── e2e.yml                 # Playwright E2E + visual regression
│       ├── deploy.yml              # Docker build (main branch)
│       └── devsecops.yml           # Security scans
│
├── docker-compose.yml              # CPU + GPU + debug profiles + log rotation
├── Dockerfile                      # Multi-stage backend build
├── frontend/Dockerfile             # Multi-stage frontend build (nginx)
└── .env.example

🚀 Getting Started

Prerequisites


1. Clone the Repository

git clone https://github.com/param20h/PDF-Assistant-RAG.git
cd PDF-Assistant-RAG

2. Configure Environment

cp .env.example .env

Edit .env:

SECRET_KEY=your-strong-random-secret
DATABASE_URL=postgresql://pdf_rag_user:pdf_rag_pass@localhost:5432/pdf_rag
HF_TOKEN=hf_your_huggingface_token_here
UPLOAD_DIR=./data/uploads
CHROMA_PERSIST_DIR=./data/chroma_db
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1

Get your free HuggingFace token at huggingface.co/settings/tokens

Email Verification Setup (optional)

FRONTEND_URL=http://localhost:3000
MAIL_USERNAME=your_smtp_username
MAIL_PASSWORD=your_smtp_or_gmail_app_password
MAIL_FROM=your_sender_email@example.com
MAIL_SERVER=smtp.gmail.com
MAIL_PORT=587
MAIL_STARTTLS=True
MAIL_SSL_TLS=False

Without SMTP settings, registration returns a local verification link so contributors can test without email credentials.

3. Run with Docker (recommended)

# CPU-only (no GPU needed)
docker compose --profile cpu up --build

# GPU-accelerated (requires NVIDIA Container Toolkit)
docker compose --profile gpu up --build

# Also start pgAdmin at http://localhost:5050
docker compose --profile cpu --profile debug up --build

| Service | URL |
| --- | --- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:7860 |
| API Docs | http://localhost:7860/docs |
| pgAdmin | http://localhost:5050 (debug profile) |

4. Run Locally (without Docker)

# Backend
cd backend
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 7860

# Celery worker (separate terminal)
celery -A app.celery_app.celery_app worker --loglevel=info

# Frontend (separate terminal)
cd frontend
npm install
npm run dev

5. Set up crawl4ai (URL Upload Feature, optional)

crawl4ai-setup

🧠 RAG Pipeline

                    ┌─────────────────────────────────────────────┐
                    │         PDF / DOCX / TXT / MD Upload        │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │   PyMuPDF / pdfplumber / python-docx Parser │
                    │   + Image caption extraction                │
                    │   + PDF URL/link annotation extraction      │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │      Recursive Character Text Splitter      │
                    │   chunk_size=1000  |  overlap=200           │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │  all-MiniLM-L6-v2  (local embeddings)       │
                    │  384-dim · Redis + LRU cache (24h TTL)      │
                    └──────────────┬──────────────────────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    ▼                             ▼
         ┌──────────────────┐         ┌─────────────────────┐
         │ ChromaDB vectors │         │  BM25 keyword index │
         │ (per-user coll.) │         │  (per-document .pkl)│
         └──────────────────┘         └─────────────────────┘

                          ── At Query Time ──

  User Question ──▶ Embed (cached) ──▶ Vector Search (Top-K=20)
                         │
                         ├──▶ BM25 Search (Top-K=20)
                         │
                         ▼
              Reciprocal Rank Fusion (RRF, k=60)
                         │
                         ▼
          BGE-Reranker-v2-m3 Cross-Encoder (Top-K=8)
                         │
                         ▼
        Prompt Assembly (system + context + question)
                         │
                         ▼
        Qwen2.5-72B-Instruct (HF Inference API)
                         │
                         ▼
        Streamed SSE tokens ──▶ Frontend ChatPanel
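
The fusion step in the diagram can be sketched in a few lines. Reciprocal Rank Fusion scores each chunk as the sum of 1 / (k + rank) over every ranked list it appears in, with k=60 as stated above. The function name `reciprocal_rank_fusion` and the chunk ids are hypothetical; this is an illustration of the technique, not the code in `retriever.py`.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one scored ranking.

    Ranks are 1-based, so the best item in a list contributes 1 / (k + 1).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)

    # Highest score first; ties are broken by chunk id to keep output stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    vector_hits = ["c1", "c2", "c3"]  # hypothetical vector-search order
    bm25_hits = ["c3", "c1", "c4"]    # hypothetical BM25 order
    for chunk_id, score in reciprocal_rank_fusion([vector_hits, bm25_hits]):
        print(chunk_id, round(score, 5))
```

Because only ranks are used, the vector similarity scores and BM25 scores never need to be normalised onto a common scale. In the pipeline above, the fused list would then be passed to the cross-encoder, which keeps the top 8.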

📡 API Reference

| Method | Endpoint | Auth | Description |
| --- | --- | --- | --- |
| POST | /api/v1/auth/register | ❌ | Create a new user account |
| POST | /api/v1/auth/login | ❌ | Login and receive JWT tokens |
| GET | /api/v1/auth/me | ✅ | Get current user profile |
| POST | /api/v1/documents/upload | ✅ | Upload PDF/DOCX/TXT and enqueue ingestion (202) |
| POST | /api/v1/documents/urlupload | ✅ | Crawl a URL and ingest as document |
| GET | /api/v1/documents/ | ✅ | List documents (pagination + ?q= name filter) |
| GET | /api/v1/documents/{id} | ✅ | Get document metadata (incl. extracted URLs) |
| GET | /api/v1/documents/{id}/status | ✅ | Poll ingestion progress |
| DELETE | /api/v1/documents/{id} | ✅ | Soft-delete document |
| POST | /api/v1/chat/ask/stream | ✅ | Ask a question (SSE streaming) |
| GET | /api/v1/chat/history/{doc_id} | ✅ | Get chat history for a document |
| DELETE | /api/v1/chat/history/{doc_id} | ✅ | Clear chat history |
| GET | /api/v1/chat/export/{doc_id} | ✅ | Export transcript as MD / TXT / PDF |
| GET | /api/v1/chat/sessions | ✅ | List chat sessions |
| POST | /api/v1/chat/sessions | ✅ | Create chat session |
| GET | /api/v1/health/status | ❌ | Deep health check (DB, Redis, Celery, ChromaDB) |
| GET | /api/health | ❌ | Basic liveness check |

Full interactive docs at /docs (Swagger UI) when running locally.
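
For clients that consume the streaming endpoint outside the browser, the sketch below shows how a Server-Sent Events body can be split into events. It only covers the generic SSE framing (`data:` lines, blank-line separators, `:` comments). The payload format of `/api/v1/chat/ask/stream` is not documented here, so the sample lines are made up and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event.

    `lines` can be any iterable of decoded text lines, for example the
    line iterator of a streaming HTTP response.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event that was never terminated by a blank line is dropped,
    # which matches how browsers treat a truncated stream.


if __name__ == "__main__":
    sample_stream = ["data: Hello\n", "\n", ": keep-alive\n", "data: world\n", "\n"]
    print(list(iter_sse_data(sample_stream)))
```

Check the Swagger UI at /docs for the actual request body and event payload before relying on any particular field names.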


📦 Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| SECRET_KEY | ✅ | - | JWT signing secret. Generate: python -c "import secrets; print(secrets.token_urlsafe(32))" |
| HF_TOKEN | ✅ | - | HuggingFace API token for LLM inference |
| DATABASE_URL | ❌ | sqlite:///./data/app.db | SQLAlchemy connection string (SQLite or PostgreSQL) |
| CELERY_BROKER_URL | ❌ | redis://localhost:6379/0 | Redis broker for Celery |
| CELERY_RESULT_BACKEND | ❌ | redis://localhost:6379/1 | Redis backend for Celery results |
| REDIS_URL | ❌ | - | Redis URL for response + embedding cache |
| UPLOAD_DIR | ❌ | ./data/uploads | File storage directory |
| CHROMA_PERSIST_DIR | ❌ | ./data/chroma_db | ChromaDB persistence directory |
| EMBEDDING_MODEL | ❌ | sentence-transformers/all-MiniLM-L6-v2 | Local embedding model |
| EMBEDDING_CACHE_TTL | ❌ | 86400 | Embedding cache TTL in seconds (24h) |
| LLM_MODEL | ❌ | Qwen/Qwen2.5-72B-Instruct | HuggingFace model for answer generation |
| LLM_TEMPERATURE | ❌ | 0.3 | LLM sampling temperature |
| RERANKER_MODEL | ❌ | BAAI/bge-reranker-v2-m3 | Cross-encoder reranker model |
| USE_HYBRID_SEARCH | ❌ | True | Enable BM25 + vector hybrid search |
| RRF_K | ❌ | 60 | RRF smoothing constant |
| CHUNK_SIZE | ❌ | 1000 | Characters per document chunk |
| CHUNK_OVERLAP | ❌ | 200 | Overlap between consecutive chunks |
| TOP_K_RETRIEVAL | ❌ | 20 | Candidates retrieved from vector store |
| TOP_K_RERANK | ❌ | 8 | Final chunks after reranking |
| VISION_PROVIDER | ❌ | - | Set to openai to use GPT-4o-mini for image captions |
| OPENAI_API_KEY | ❌ | - | Required when VISION_PROVIDER=openai |
| ENVIRONMENT | ❌ | development | Set to production to lock CORS |
| FRONTEND_URL | ❌ | http://localhost:3000 | Public frontend URL for OAuth + email links |
| NEXT_PUBLIC_API_URL | ❌ | http://localhost:7860 | Backend URL injected at frontend build time |
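
As a quick way to sanity-check a `.env` before starting the stack, the sketch below reads a subset of these variables with the documented defaults and fails fast when a required one is missing. The project itself loads configuration through Pydantic settings in `config.py`; this is a standard-library stand-in, and the `RagSettings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RagSettings:
    """Subset of the documented variables, typed for convenience."""
    secret_key: str
    hf_token: str
    database_url: str
    chunk_size: int
    chunk_overlap: int
    top_k_retrieval: int
    top_k_rerank: int
    rrf_k: int
    use_hybrid_search: bool
    llm_temperature: float


def load_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Build settings from a mapping, applying the defaults listed in the table."""
    missing = [name for name in ("SECRET_KEY", "HF_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    truthy = {"1", "true", "yes", "on"}
    return RagSettings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env.get("DATABASE_URL", "sqlite:///./data/app.db"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k_retrieval=int(env.get("TOP_K_RETRIEVAL", "20")),
        top_k_rerank=int(env.get("TOP_K_RERANK", "8")),
        rrf_k=int(env.get("RRF_K", "60")),
        use_hybrid_search=env.get("USE_HYBRID_SEARCH", "True").strip().lower() in truthy,
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.3")),
    )


if __name__ == "__main__":
    demo_env = {"SECRET_KEY": "change-me", "HF_TOKEN": "hf_example", "CHUNK_SIZE": "800"}
    print(load_settings(demo_env))
```

Passing the environment as a plain mapping keeps the loader easy to test without touching the real process environment.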

📜 Scripts

Backend (backend/)

| Command | Description |
| --- | --- |
| uvicorn app.main:app --reload | Start FastAPI with hot reload |
| celery -A app.celery_app.celery_app worker --loglevel=info | Start Celery worker |
| python migrate_add_extracted_urls.py | Run URL extraction column migration |
| python scripts/run_ragas_eval.py --user-id <id> | Run RAGAS evaluation (vector vs GraphRAG) |

Frontend (frontend/)

| Command | Description |
| --- | --- |
| npm run dev | Start Next.js dev server |
| npm run build | Production build |
| npm run test | Run Vitest unit tests |
| npm run test:e2e | Run Playwright E2E tests |
| npx playwright test e2e/snapshots.spec.ts --update-snapshots | Regenerate visual regression baselines |

Docker

| Command | Description |
| --- | --- |
| docker compose --profile cpu up --build | Full stack, CPU only |
| docker compose --profile gpu up --build | Full stack, GPU accelerated |
| docker compose --profile debug up | Also start pgAdmin at http://localhost:5050 |
| docker compose down | Stop all containers |

GPU profile requires NVIDIA Container Toolkit.


🌐 Deployment

HuggingFace Spaces

  1. Fork this repo and create a new Space at huggingface.co/new-space (SDK: Docker)
  2. Set Space secrets: HF_TOKEN, SECRET_KEY, DATABASE_URL
  3. Push to the hf remote:
git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
git push hf main

Self-Hosted / VPS

docker compose --profile cpu up -d --build
# App at http://your-server:7860
# Frontend at http://your-server:3000

🀝 Contributing

This project is participating in GirlScript Summer of Code! We welcome contributors of all skill levels.

Branch Strategy:

| Branch | Purpose |
| --- | --- |
| main | Production, deployed to HuggingFace (admin only) |
| dev | All contributor PRs target here |
| feature/* / fix/* / docs/* | Your working branches |

# Always branch from dev
git checkout -b feature/my-feature upstream/dev

Quick links:


📄 License

Distributed under the MIT License. See LICENSE for more information.



Built with 💙 by the open-source community

If you found this project helpful, please give it a ⭐. It helps contributors discover it!


Stack


⬆ Back to top
