ChatWithPDF is an intelligent question-answering system that uses Retrieval-Augmented Generation (RAG) to provide precise answers from your PDF documents. Upload documents and ask questions about their content in natural language.
It follows a RAG architecture: text is extracted from the PDF, split into chunks, and stored as vectors. For each question, the most relevant chunks are retrieved and passed as context to a GPT-2 chatbot, which generates the answer.
- PDF Text Extraction: Accurate text extraction from PDF documents using PyMuPDF
- Semantic Search: Finds relevant content using sentence transformers and cosine similarity
- Intelligent Q&A: Answers questions using GPT-2 with context from your documents
- Persistent Memory: ChromaDB vector database maintains knowledge between sessions
- Advanced NLP: Lemmatization, query expansion, and contextual understanding
- Interactive CLI: Easy-to-use command-line interface
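
The chunk-and-retrieve idea behind the features above can be sketched in plain Python. This is a minimal illustration, not the project's code: `chunk_text`, `embed`, and `retrieve` are hypothetical names, the chunk sizes are arbitrary, and a simple word-count vector stands in for the sentence-transformer embeddings so the example runs without any model download.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for a sentence embedding: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the question."""
    q = embed(question)
    scored = [(cosine_similarity(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In a real pipeline the `embed` step would be replaced by a dense embedding model and the linear scan in `retrieve` by a vector database query; the ranking logic stays the same.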
Clone & Setup

    python -m venv docchat_env
    source docchat_env/bin/activate
    git clone <repository-url>
    cd document-chat
    pip install -r requirements.txt

See the requirements file for details on the dependencies.
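
After installing, a quick check that the main libraries are importable can save a confusing first run. The sketch below is an optional helper, not part of the project: `missing_modules` is a hypothetical name, and the module list is an assumption based on the libraries this README mentions (PyMuPDF is imported as `fitz`).

```python
import importlib.util

# Import names assumed from the libraries this README lists
# (PyMuPDF is imported as "fitz").
REQUIRED_MODULES = ["fitz", "chromadb", "sentence_transformers", "transformers", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```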
📖 Uploaded: api_documentation.pdf

Q: How do I authenticate with the API?
A: Authentication requires an API key sent in the Authorization header as "Bearer YOUR_API_KEY". Rate limits apply to all authenticated requests.

Q: What error codes might I encounter?
A: Common error codes include 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), and 429 (Too Many Requests).

or

📖 Uploaded: harry_potter.pdf

Q: Who is Voldemort?
A: Voldemort, also known as He-Who-Must-Not-Be-Named, is the dark wizard who murdered Harry Potter's parents and seeks to conquer the wizarding world.

Q: What are Hermione's main traits?
A: Hermione Granger is known for her intelligence, dedication to learning, and loyalty to her friends Harry and Ron.
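
In a RAG setup, answers like these come from placing retrieved passages ahead of the question in the prompt given to the language model. The sketch below shows one way to do that. The `build_prompt` helper and its template are hypothetical (the project's actual prompt format is not documented here), and the character budget is a crude stand-in for respecting GPT-2's 1024-token context window.

```python
def build_prompt(question, passages, max_context_chars=2000):
    """Assemble a prompt: retrieved passages first, then the question.

    The template and the character budget are illustrative assumptions.
    """
    context_parts = []
    used = 0
    for passage in passages:
        passage = passage.strip()
        if not passage:
            continue  # skip empty passages
        if used + len(passage) > max_context_chars:
            break  # stop before the context grows past the budget
        context_parts.append(passage)
        used += len(passage)
    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {question.strip()}\nAnswer:"
```

Passages should be passed in order of relevance, so that the budget cuts off the least relevant ones first.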
| Component | Model/Technology | Library Version | Size | Purpose |
|---|---|---|---|---|
| Embedding Model | all-MiniLM-L6-v2 (sentence-transformers) | 2.2.2 | 90MB | Text vectorization |
| Language Model | GPT-2 (transformers) | 4.30.0 | 548MB | Response generation |
| Text Processing | NLTK WordNet | 3.8.1 | 35MB | Lemmatization |
| Vector Database | ChromaDB | 0.4.15 | Varies | Semantic search |
| PDF Processing | PyMuPDF | 1.23.8 | 15MB | Text extraction |
| Metric | Value | Notes |
|---|---|---|
| Vector Dimension | 384 | all-MiniLM-L6-v2 output |
| Index Type | HNSW | Hierarchical Navigable Small World |
| Distance Metric | Cosine | Similarity range: -1.0 to 1.0; cosine distance (1 - similarity) range: 0.0 to 2.0 |
| Indexing Speed | ~1000 vectors/sec | CPU-bound |
| Query Speed | ~5ms/query | With index |
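
To make the cosine metric concrete, here is a small self-contained sketch of cosine similarity and the corresponding cosine distance on plain Python lists. The function names are illustrative. For general vectors the similarity lies between -1 and 1, so a distance defined as `1 - similarity` lies between 0 and 2.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance form of the same metric: 0 for identical directions, 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)
```

Because only direction matters, scaling a vector does not change its similarity to any other vector, which is why embeddings of different magnitudes can still be compared directly.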
NOTE: The current version does not support OCR, so it can only process PDFs that contain extractable text; scanned or image-only pages are not read.
Thank you! This project is open source under the MIT License.
