🚀 A Retrieval-Augmented Generation (RAG) system that answers natural-language questions about Bengali PDF documents, including scanned files processed with OCR.
The pipeline combines OCR, semantic embeddings, hybrid retrieval, and LLM generation: it extracts text from each PDF, indexes it, and generates answers from the retrieved passages.
It uses:
- 🔍 Semantic Search (Vector Embeddings)
- 📄 Keyword Search (BM25)
- 🤖 LLM (Google Gemini) for answer generation
- 📥 Supports scanned and text-based PDFs
- 🔤 OCR extraction using Tesseract
- 🧠 Multilingual embeddings (LaBSE)
- 🔎 Hybrid search:
  - Vector similarity (ChromaDB)
  - BM25 keyword ranking
- 💬 Natural language question answering
- 🌐 Works with Bengali + English queries
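
The list above names two retrievers but does not say how their results are merged. One common option is reciprocal rank fusion (RRF). The sketch below assumes that choice; the function name, the constant `k=60`, and the chunk ids are illustrative and not taken from the notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical outputs of the two retrievers (best match first).
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, and because only ranks are used, the vector distances and BM25 scores never need to be put on a common scale.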
- Python
- pdf2image
- pytesseract (OCR)
- sentence-transformers (LaBSE)
- ChromaDB (Vector DB)
- rank-bm25
- LangChain
- Google Gemini API
flowchart LR
A["PDF Input"] --> B["OCR"]
B --> C["Text Cleaning"]
C --> D["Chunking"]
D --> E["Embeddings - LaBSE"]
E --> F["ChromaDB"]
E --> G["BM25"]
F --> H["Hybrid Retrieval"]
G --> H
H --> I["Gemini LLM"]
I --> J["Answer"]
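
The "Text Cleaning" stage in the flowchart is not described further. As a rough illustration, a cleaning step for OCR output might look like the sketch below. The function name and every rule in it are assumptions, not the notebook's actual logic.

```python
import re
import unicodedata


def clean_ocr_text(raw):
    """Normalize OCR output before chunking (illustrative rules only)."""
    # NFC gives characters a consistent composed form where one exists.
    text = unicodedata.normalize("NFC", raw)
    # Drop zero-width space and BOM; ZWJ/ZWNJ are kept because Bengali
    # conjuncts can depend on them.
    text = text.replace("\u200b", "").replace("\ufeff", "")
    lines = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Skip lines with no letters or digits (page rules, stray marks).
        if any(ch.isalnum() for ch in line):
            lines.append(line)
    return "\n".join(lines)


sample = "অনুপম\u200b  একজন   চরিত্র\n---\n\n  Aparichita \ufeff"
print(clean_ocr_text(sample))
```

Keeping the rules conservative matters for Bengali: stripping all zero-width characters, for example, could change how some conjuncts render.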
- RagPipeLine/
  - 📓 RagPipelineChromadb.ipynb: Main implementation (Notebook)
  - 📁 Input PDF files
  - 📄 README.md: Project documentation
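👉 Open and run: RagPipelineChromadb.ipynb
- User Question: who is anupam?
- RAG System: Based on the provided context, Anupam is a character from the story 'Aparichita' ('অপরিচিতা গল্প'). The text also mentions Anupam's mother and uncle.
- User Question: অনুপমের ভাষায় কাকে সুপুরুষ বলা হয়েছে? (In Anupam's words, who is described as a handsome man?)
- RAG System: প্রদত্ত তথ্য অনুসারে, শম্ভুনাথকে সুপুরুষ বলা হয়েছে। (According to the provided information, Shambhunath is described as a handsome man.)
- User Question: কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে? (Who is referred to as Anupam's god of fortune?)
- RAG System: প্রদত্ত তথ্য অনুসারে, অনুপমের মামাকে 'ভাগ্য দেবতার প্রধান এজেন্ট' বলা হয়েছে। (According to the provided information, Anupam's maternal uncle is called the 'chief agent of the god of fortune'.)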
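
The answers above open with phrases like "Based on the provided context", which suggests the retrieved chunks are passed to the LLM inside a grounding prompt. The notebook's actual prompt is not shown here, so the template below is purely hypothetical, as are the function name and the placeholder chunks.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (hypothetical template)."""
    # Number each chunk so the model (and a reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Reply in the language of the question. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "who is anupam?",
    ["  placeholder chunk A ", "placeholder chunk B"],
)
print(prompt)
```

The resulting string would then be sent to the LLM. Asking the model to reply in the question's language is one simple way to support both Bengali and English queries with a single template.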
- OCR accuracy may vary for low-quality scans
- Bengali text preprocessing can introduce noise
- Retrieval quality depends on chunking strategy
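
To make the chunking point concrete, here is a minimal sketch of word-based chunking with overlap. The function name, the default sizes, and the choice to split on whitespace are assumptions; the notebook may chunk differently (for example by characters or sentences).

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(demo, chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks give the LLM more context per hit but dilute the embedding, while overlap reduces the chance that an answer is split across a chunk boundary. Both values are worth tuning against real questions.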
- Improve OCR accuracy with better preprocessing
- Add UI (Streamlit / Web App)
- Support multiple document uploads
- Implement evaluation metrics (accuracy, recall)
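
For the evaluation item, one possible starting point is recall@k over a small set of hand-labelled questions. The sketch below assumes each question has a known set of relevant chunk ids; the function names and the sample data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids found in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one per question."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical labelled questions: retrieved ranking and gold chunk ids.
runs = [
    (["c1", "c4", "c2"], {"c1", "c2"}),
    (["c5", "c6", "c7"], {"c7", "c8"}),
]
print(mean_recall_at_k(runs, k=2))
```

This measures retrieval only. Judging the generated answers themselves would need a separate check, such as comparing against reference answers.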
- 📖 Educational document analysis
- 🏛️ Digital archives search
- 📑 Automated document QA systems
- 🌍 Bengali NLP applications