A Multi-Source AI Research Assistant — Drop in PDFs, web URLs, or plain text, then ask questions and get cited answers.
Like having a research assistant who has read all your documents.
graph LR
A[PDF / URL / Text] --> B[data_loader.py]
B --> C[chunking.py]
C --> D[embedding_pipeline.py]
D --> E[(ChromaDB)]
F[User Query] --> G[search.py]
G --> E
G --> H[generator.py]
H --> I[Cited Answer]
J[app.py — Streamlit] --> B
J --> F
- Multi-format ingestion — PDFs, web URLs, and plain text in one interface
- Inline citations — Every answer references the exact source chunks used
- Persistent vector store — Documents survive app restarts (ChromaDB)
- Streaming responses — Real-time token generation via Groq API
- MMR search — Maximal Marginal Relevance for diverse, non-redundant results
- Document summarization — One-click summary across all loaded sources
- Chat history — Conversational context maintained across questions
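
As a rough illustration of the MMR search feature listed above, here is a minimal pure-Python sketch of Maximal Marginal Relevance. It is not taken from `search.py`; the names `cosine`, `mmr_select`, and `lambda_mult` are hypothetical, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, chunk_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k chunks chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize chunks similar to ones already picked.
    (Hypothetical helper, not the project's actual API.)
    """
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    selected = []
    candidates = list(range(len(chunk_vecs)))

    def score(i):
        # Redundancy is the highest similarity to any already-selected chunk.
        redundancy = max(
            (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: chunks 0 and 1 are near-duplicates, chunk 2 covers a different angle.
query = [1.0, 0.5]
chunks = [[1.0, 0.12], [1.0, 0.10], [0.2, 1.0]]

print(mmr_select(query, chunks, k=2, lambda_mult=1.0))  # [0, 1]: relevance only
print(mmr_select(query, chunks, k=2, lambda_mult=0.5))  # [0, 2]: duplicate skipped
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, so the two near-duplicate chunks are both returned. Lowering it trades some relevance for coverage, which is the "diverse, non-redundant" behavior described above.
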
| Component | Technology |
|---|---|
| Document Loading | PyMuPDF, BeautifulSoup |
| Chunking | LangChain Text Splitters |
| Embeddings | Sentence Transformers (all-MiniLM-L6-v2) |
| Vector Store | ChromaDB (persistent) |
| LLM | Groq API (Llama 3.3 70B) |
| UI | Streamlit |
| Containerization | Docker |
# 1. Clone
git clone https://github.com/yourusername/ResearchMind.git
cd ResearchMind
# 2. Create virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set API key
cp .env.example .env
# Edit .env and add your Groq API key (free at https://console.groq.com)
# 5. Run
streamlit run app.py
Open http://localhost:8501 in your browser.
docker-compose up --build
ResearchMind/
├── app.py # Streamlit UI
├── config.py # Centralized configuration
├── src/
│ ├── data_loader.py # PDF, URL, text ingestion
│ ├── chunking.py # Fixed-size & semantic splitting
│ ├── embedding_pipeline.py # Sentence Transformers + ChromaDB
│ ├── search.py # Similarity + MMR retrieval
│ └── generator.py # Groq LLM streaming + citations
├── .streamlit/
│ ├── config.toml # Streamlit server config
│ └── theme.toml # Dark theme settings
├── Dockerfile # Container image
├── docker-compose.yml # Local Docker setup
├── render.yaml # Render deployment blueprint
├── requirements.txt # Python dependencies
└── .env.example # API key template
- Ingest: Upload PDFs, paste URLs, or type text. Each source is parsed with metadata (page numbers, titles, URLs).
- Chunk: Documents are split into overlapping chunks (1000 chars, 200 overlap) to fit embedding context windows.
- Embed: Chunks are encoded with all-MiniLM-L6-v2 and stored in ChromaDB.
- Retrieve: User queries are embedded and matched against stored chunks using cosine similarity or MMR.
- Generate: Top-K chunks are injected into a prompt sent to Groq's Llama 3.3 70B, which generates a cited answer.
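
The Chunk and Generate steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `chunking.py` or `generator.py`: the project uses LangChain text splitters, whereas `chunk_text` below is a simple fixed-size stand-in, and the prompt wording and the `source`/`text` dictionary keys in `build_cited_prompt` are hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks sharing `overlap` characters.

    Hypothetical stand-in for a LangChain text splitter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def build_cited_prompt(question, retrieved):
    """Number each retrieved chunk so the model can cite it inline as [1], [2], ...

    `retrieved` is a list of dicts with assumed keys "source" and "text".
    """
    context = "\n\n".join(
        f"[{i}] (source: {item['source']})\n{item['text']}"
        for i, item in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources inline as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# A 2500-character document yields chunks starting at offsets 0, 800, and 1600.
document = "".join(chr(97 + i % 26) for i in range(2500))
pieces = chunk_text(document)
print([len(p) for p in pieces])  # [1000, 1000, 900]

prompt = build_cited_prompt(
    "What does the document say?",
    [
        {"source": "paper.pdf, p. 3", "text": pieces[0][:40]},
        {"source": "https://example.com", "text": pieces[1][:40]},
    ],
)
print(prompt)
```

With a 1000-character window and 200 characters of overlap, each chunk advances by 800 characters, so a sentence that straddles a boundary still appears whole in at least one chunk. Numbering the sources in the prompt is one common way to let the answer point back to specific chunks.
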
MIT