A Streamlit application that allows you to upload PDF files, process them using LangChain and ChromaDB, and chat with them using OpenAI's GPT models. The app provides source citations showing which page each answer came from.
- 📄 PDF Upload: Upload and process PDF files
- 🔍 Document Chunking: Automatically splits PDFs into manageable chunks
- 💾 Vector Storage: Stores embeddings in ChromaDB for efficient retrieval
- 💬 Chat Interface: Interactive chat interface to ask questions about your PDFs
- 📑 Source Citation: Shows which page and section each answer came from
- 🧠 Memory: Maintains conversation context across multiple questions
- Python 3.11
- OpenAI API key
-
Clone this repository or download the files
-
Install the required packages:
pip install -r requirements.txt- Set up your OpenAI API key:
Create a .env file in the project directory:
OPENAI_API_KEY=your_openai_api_key_here
Or set it as an environment variable:
# On Windows (PowerShell)
$env:OPENAI_API_KEY="your_openai_api_key_here"
# On Linux/Mac
export OPENAI_API_KEY="your_openai_api_key_here"- Run the Streamlit app:
streamlit run app.py-
Open your browser and navigate to the URL shown (typically
http://localhost:8501) -
Upload a PDF file using the sidebar
-
Click "Process PDF" to process the document
-
Start asking questions in the chat interface!
- PDF Processing: When you upload a PDF, it's loaded and split into chunks using LangChain's
RecursiveCharacterTextSplitter - Embedding Generation: Each chunk is embedded using OpenAI's embeddings model
- Vector Storage: Embeddings are stored in ChromaDB for fast similarity search
- Question Answering: When you ask a question:
- The question is embedded and used to find the most relevant chunks
- The relevant chunks are passed to GPT-3.5-turbo along with the question
- The model generates an answer based on the retrieved context
- Source information (page numbers and snippets) is extracted and displayed
.
├── app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── README.md # This file
├── .env # Environment variables (create this)
└── chroma_db/ # ChromaDB database (created automatically)
- The app uses GPT-3.5-turbo for chat. You can modify the model in
app.pyif needed - ChromaDB data is persisted in the
chroma_db/directory - Each PDF processing creates chunks that are stored and can be queried
- The chat maintains context within a session
- OpenAI API Key Error: Make sure you've set the
OPENAI_API_KEYin your.envfile or environment variables - PDF Processing Error: Ensure the PDF file is not corrupted or password-protected
- Memory Issues: For very large PDFs, you may need to adjust the chunk size in
app.py
See DEPLOY.md for detailed instructions on how to deploy this project to GitHub.
Quick steps:
- Install Git from https://git-scm.com/downloads
- Initialize repository:
git init - Add files:
git add . - Commit:
git commit -m "Initial commit" - Create repository on GitHub
- Push:
git push -u origin main
Important: Never commit your .env file with your API key! The .gitignore file is already configured to exclude it.
MIT License