A Retrieval-Augmented Generation (RAG) system for insurance documents. Users ask questions about insurance policies and get answers grounded in the content of the processed documents.
This project implements a complete RAG pipeline for insurance documents, specifically designed to:
- Process and extract text from insurance PDFs (policies, brochures, etc.)
- Create embeddings and vector indices for efficient retrieval
- Find relevant document sections for user queries
- Generate accurate answers using an LLM with the retrieved context
Key components:
- PDF processing with PyMuPDF (Fitz)
- Text chunking with intelligent overlap
- Embedding generation using Sentence Transformers
- Efficient vector search with FAISS
- OpenAI-powered answer generation with LangChain
- User-friendly Streamlit interface
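The "text chunking with overlap" item above can be illustrated with a minimal sketch. This is not the project's `text_splitter.py`; the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical, and the sketch splits on raw character offsets, whereas a real splitter may prefer sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows; neighbours share `overlap` characters.

    Hypothetical helper for illustration only (assumed name and defaults).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


# Each chunk starts 3 characters after the previous one and repeats 1 character.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap. Larger overlaps improve recall at the cost of more chunks to embed and index.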
Project structure:
insurance-rag-assistant/
├── data/
│   ├── raw/                  # Original PDFs
│   └── processed/            # Processed text chunks
├── src/
│   ├── data_processing/
│   │   ├── __init__.py
│   │   ├── pdf_loader.py     # PDF extraction functionality
│   │   └── text_splitter.py  # Text chunking logic
│   ├── indexing/
│   │   ├── __init__.py
│   │   ├── embeddings.py     # Embedding generation
│   │   └── vector_store.py   # Vector database operations
│   ├── retrieval/
│   │   ├── __init__.py
│   │   └── retriever.py      # Similarity search logic
│   ├── llm/
│   │   ├── __init__.py
│   │   └── llm_chain.py      # LLM handling and prompting
│   └── app/
│       ├── __init__.py
│       └── main.py           # Streamlit app
├── notebooks/
│   ├── data_exploration.ipynb
│   └── prototype.ipynb
├── tests/
│   └── test_pipeline.py
├── requirements.txt
├── README.md
└── .env                      # For API keys
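The `.env` file in the tree above holds API keys. As a rough sketch of how such a file could be read at startup using only the standard library: the helper name `load_env_file` is hypothetical, and the variable name `OPENAI_API_KEY` used in the note below is an assumption, since this README does not say which names the project expects. The project itself may use a dedicated package for this.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read KEY=VALUE lines from a file into os.environ (hypothetical helper).

    Values already present in the environment are not overwritten.
    Returns the key/value pairs that were parsed from the file.
    """
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # a missing file is not an error; nothing to load

    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # drop optional quotes
        if key:
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded
```

Assuming the file contains a line such as `OPENAI_API_KEY=...`, calling `load_env_file()` once before the LLM client is created makes the key available through `os.environ`. Keep `.env` out of version control so keys are not committed.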
Installation:
- Clone this repository:
git clone https://github.com/yourusername/insurance-rag-assistant.git
cd insurance-rag-assistant
- Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt