RudreshTrivedi013/ResearchMind

🧠 ResearchMind

A Multi-Source AI Research Assistant — Drop in PDFs, web URLs, or plain text, then ask questions and get cited answers.

Like having a research assistant who has read all your documents.


🏗️ Architecture

graph LR
    A[PDF / URL / Text] --> B[data_loader.py]
    B --> C[chunking.py]
    C --> D[embedding_pipeline.py]
    D --> E[(ChromaDB)]
    F[User Query] --> G[search.py]
    G --> E
    G --> H[generator.py]
    H --> I[Cited Answer]
    J[app.py — Streamlit] --> B
    J --> F

✨ Features

  • Multi-format ingestion — PDFs, web URLs, and plain text in one interface
  • Inline citations — Every answer references the exact source chunks used
  • Persistent vector store — Documents survive app restarts (ChromaDB)
  • Streaming responses — Real-time token generation via Groq API
  • MMR search — Maximal Marginal Relevance for diverse, non-redundant results
  • Document summarization — One-click summary across all loaded sources
  • Chat history — Conversational context maintained across questions
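
To illustrate the MMR search feature listed above, here is a minimal pure-Python sketch of Maximal Marginal Relevance. It is not the code in search.py; the function names (`cosine`, `mmr`) and the `lambda_mult` parameter are assumptions made for this example. Each step picks the candidate that best balances similarity to the query against similarity to chunks already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k vectors chosen by Maximal Marginal Relevance.

    lambda_mult (hypothetical name) trades relevance against diversity:
    1.0 ranks purely by relevance, lower values penalize redundancy more.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy 2-D embeddings: vectors 0 and 1 are near-duplicates, vector 2 differs.
query = [0.7071, 0.7071]
docs = [[0.7660, 0.6428], [0.7880, 0.6157], [0.5000, 0.8660]]
print(mmr(query, docs, k=2, lambda_mult=0.5))
print(mmr(query, docs, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, so the two near-duplicate vectors are both returned; lowering it lets the less similar but still relevant vector replace the duplicate.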

🛠️ Tech Stack

| Component        | Technology                                |
|------------------|-------------------------------------------|
| Document Loading | PyMuPDF, BeautifulSoup                    |
| Chunking         | LangChain Text Splitters                  |
| Embeddings       | Sentence Transformers (all-MiniLM-L6-v2)  |
| Vector Store     | ChromaDB (persistent)                     |
| LLM              | Groq API (Llama 3.3 70B)                  |
| UI               | Streamlit                                 |
| Containerization | Docker                                    |

🚀 Quick Start

Local Development

# 1. Clone
git clone https://github.com/RudreshTrivedi013/ResearchMind.git
cd ResearchMind

# 2. Create virtual environment
python -m venv .venv
.venv\Scripts\activate     # Windows
# source .venv/bin/activate  # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set API key
cp .env.example .env
# Edit .env and add your Groq API key (free at https://console.groq.com)

# 5. Run
streamlit run app.py

Open http://localhost:8501 in your browser.

Docker

docker-compose up --build

📁 Project Structure

ResearchMind/
├── app.py                    # Streamlit UI
├── config.py                 # Centralized configuration
├── src/
│   ├── data_loader.py        # PDF, URL, text ingestion
│   ├── chunking.py           # Fixed-size & semantic splitting
│   ├── embedding_pipeline.py # Sentence Transformers + ChromaDB
│   ├── search.py             # Similarity + MMR retrieval
│   └── generator.py          # Groq LLM streaming + citations
├── .streamlit/
│   ├── config.toml           # Streamlit server config
│   └── theme.toml            # Dark theme settings
├── Dockerfile                # Container image
├── docker-compose.yml        # Local Docker setup
├── render.yaml               # Render deployment blueprint
├── requirements.txt          # Python dependencies
└── .env.example              # API key template

📖 How It Works

  1. Ingest — Upload PDFs, paste URLs, or type text. Each source is parsed with metadata (page numbers, titles, URLs).
  2. Chunk — Documents are split into overlapping chunks (1000 chars, 200 overlap) to fit embedding context windows.
  3. Embed — Chunks are encoded with all-MiniLM-L6-v2 and stored in ChromaDB.
  4. Retrieve — User queries are embedded and matched against stored chunks using cosine similarity or MMR.
  5. Generate — Top-K chunks are injected into a prompt sent to Groq's Llama 3.3 70B, which generates a cited answer.
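
The chunking and generation steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in chunking.py or generator.py: the project uses LangChain text splitters, which prefer to break on separators such as paragraphs, whereas this sketch uses a simple sliding character window. The names `chunk_text` and `build_prompt`, the prompt wording, and the sample source labels are all hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def build_prompt(question, retrieved):
    """Number each retrieved (source_label, text) pair so the model can cite it.

    The instruction wording here is an assumption, not the project's prompt.
    """
    context = "\n\n".join(
        f"[{n}] ({source}) {text}"
        for n, (source, text) in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them as [n].\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


document = "x" * 2500
pieces = chunk_text(document)
print([len(p) for p in pieces])

prompt = build_prompt(
    "What chunk size is used?",
    [("notes.txt", "Chunks are 1000 characters long."),
     ("paper.pdf, p. 3", "Adjacent chunks overlap by 200 characters.")],
)
print(prompt)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.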

📄 License

MIT
