rag-starter-template

A minimal, production-ready starter template for Retrieval-Augmented Generation (RAG) using Python, OpenAI embeddings, ChromaDB, and OpenAI responses.

🚀 Features

🧠 Core RAG Pipeline

Multi-format document loading (.txt, .md, .pdf)
Text chunking with configurable size and overlap
Embedding generation using OpenAI
Vector storage and retrieval with ChromaDB
Answer generation using retrieved context
Source tracking and citation support
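
To make "configurable size and overlap" concrete, here is a minimal sketch of sliding-window chunking. It is not the code in app/chunkers/simple_chunker.py; the function name `chunk_text` and the choice of characters as the unit of size are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 250, overlap: int = 40) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares `overlap` characters with the previous one.
    Assumption: sizes are measured in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

A larger overlap keeps sentences that straddle a boundary retrievable from either side, at the cost of storing and embedding more chunks.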

⚡ Performance & Storage

Persistent local vector storage (ChromaDB)
Skips re-indexing if data already exists
Rebuild support via CLI flag
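
The skip-or-rebuild decision can be sketched as below. This is an illustration under assumptions, not the template's actual logic: `ensure_index` and `build_index` are hypothetical names, and the collection is anything with a `count()` method (ChromaDB collections provide one).

```python
from typing import Callable, Protocol


class CountableCollection(Protocol):
    """Anything that can report how many items it stores."""

    def count(self) -> int: ...


def ensure_index(
    collection: CountableCollection,
    build_index: Callable[[], None],
    rebuild: bool = False,
) -> bool:
    """Run build_index() only when needed; return True if indexing ran.

    Indexing runs when the caller asks for a rebuild or when the
    collection is still empty (for example, on the first run).
    Assumption: build_index() clears stale data itself on a rebuild.
    """
    if rebuild or collection.count() == 0:
        build_index()
        return True
    return False
```

With ChromaDB, the collection would typically come from `chromadb.PersistentClient(path="chroma_db").get_or_create_collection(name)`, which persists data across runs.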

🔍 Retrieval Enhancements

Metadata filtering by source file
Duplicate control (limit chunks per source)
Custom collection names for multiple indexes
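
The two retrieval controls above can be sketched as follows. The helper names and the `"source"` metadata key are assumptions for illustration; the template may store metadata under a different key.

```python
from collections import defaultdict
from typing import Optional


def source_filter(source: Optional[str]) -> Optional[dict]:
    """Build a metadata filter that restricts results to one source file.

    Assumption: each chunk was stored with a "source" metadata field.
    """
    return {"source": source} if source else None


def limit_per_source(hits: list[dict], max_per_source: int) -> list[dict]:
    """Keep at most max_per_source hits per source, preserving rank order."""
    kept = []
    seen = defaultdict(int)  # source name -> number of hits kept so far
    for hit in hits:
        src = hit["source"]
        if seen[src] < max_per_source:
            kept.append(hit)
            seen[src] += 1
    return kept
```

With ChromaDB, a dictionary like the one returned by `source_filter` can be passed as the `where` argument of `collection.query(...)`. Capping hits per source keeps one long document from crowding every other file out of the context.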

🛠️ Usability

Command-line configuration
Centralized configuration file
Markdown output (outputs/result.md)
Graceful handling of no-result scenarios

📊 Observability & Quality

Structured logging
Evaluation harness
PASS / FAIL / CHECK reporting

📂 Supported Input Types

.txt
.md
.pdf
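
One way to dispatch on file extension is sketched below. It is not the code in app/loaders/; the function names are hypothetical, and the use of `pypdf` for PDF text extraction is an assumption, since this README does not name the PDF library.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".txt", ".md", ".pdf"}


def load_document(path: Path) -> str:
    """Return the text content of one supported file."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        # Assumption: pypdf is installed. Imported lazily so that
        # text-only workflows do not need it.
        from pypdf import PdfReader

        reader = PdfReader(str(path))
        # extract_text() can yield nothing for image-only (scanned) pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported file type: {suffix}")


def load_folder(folder: Path) -> dict[str, str]:
    """Load every supported, non-empty file in folder, keyed by file name."""
    docs = {}
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            text = load_document(path)
            if text.strip():  # skip empty documents
                docs[path.name] = text
    return docs
```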

⚙️ Command-Line Options

Run:

python main.py --rebuild --chunk-size 250 --overlap 40 --top-k 4 "What is RAG?"

Available flags

Option	Description
`--rebuild`	Rebuild the local ChromaDB index
`--chunk-size`	Control chunk size
`--overlap`	Control chunk overlap
`--top-k`	Number of chunks to retrieve
`--source`	Restrict retrieval to a specific file
`--collection-name`	Use a custom ChromaDB collection
`--max-per-source`	Limit chunks per source file
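
For readers who want to see how such flags are commonly wired up, here is a sketch using `argparse`. The flag names come from the table above; the argument types, the `None` defaults, and the positional `question` argument are assumptions, not a copy of main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the documented flags (types are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Ask a question against the indexed documents."
    )
    # The question is collected as a list of words so that it works
    # both quoted ("What is RAG?") and unquoted (What is RAG?).
    parser.add_argument("question", nargs="*", help="Question to answer")
    parser.add_argument("--rebuild", action="store_true",
                        help="Rebuild the local ChromaDB index")
    # None means "not given on the command line", so that config
    # defaults can be applied afterwards.
    parser.add_argument("--chunk-size", type=int, default=None)
    parser.add_argument("--overlap", type=int, default=None)
    parser.add_argument("--top-k", type=int, default=None)
    parser.add_argument("--source", default=None,
                        help="Restrict retrieval to a specific file")
    parser.add_argument("--collection-name", default=None)
    parser.add_argument("--max-per-source", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(" ".join(args.question), args)
```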

Examples

python main.py --collection-name policy_docs --rebuild "What is RAG?"
python main.py --max-per-source 2 "What are the common steps in a RAG pipeline?"

🛡️ Error Handling

Handles common failure scenarios gracefully:

Missing API key
Empty data/ folder
Unreadable or empty documents
PDF extraction failures
No relevant retrieval results

Additional behavior:

Logs warnings and errors using Python logging
Returns a friendly fallback answer if no results are found
Still generates outputs/result.md even on failure cases
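
The behaviors listed above can be sketched roughly as follows. This is an illustration only: the function names, the fallback wording, the Markdown layout, and the `OPENAI_API_KEY` variable name are assumptions, not taken from the template's code.

```python
import logging
import os
from pathlib import Path

logger = logging.getLogger("rag")

# Assumption: the real fallback wording may differ.
FALLBACK_ANSWER = "I could not find relevant information in the indexed documents."


def require_api_key(env=None) -> str:
    """Return the API key or fail early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        logger.error("OPENAI_API_KEY is not set")
        raise RuntimeError("Missing OPENAI_API_KEY; add it to your .env file.")
    return key


def write_result(
    question: str,
    answer: str,
    sources: list[str],
    output_path: Path = Path("outputs/result.md"),
) -> Path:
    """Write the result file, substituting a fallback answer when empty."""
    if not answer:
        logger.warning("No relevant results for question: %s", question)
        answer = FALLBACK_ANSWER
    output_path.parent.mkdir(parents=True, exist_ok=True)
    lines = ["# Question", "", question, "", "# Answer", "", answer, "", "# Sources", ""]
    lines += [f"- {s}" for s in sources] or ["- (none)"]
    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return output_path
```

Writing the file even when retrieval comes back empty means downstream tooling always finds an output to read.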

🏗️ Project Structure

main.py                        # Main RAG pipeline
app/
  loaders/
    text_loader.py            # Load txt/md files
    pdf_loader.py             # Load PDF files
  chunkers/
    simple_chunker.py         # Text chunking
  embeddings/
    openai_embedder.py        # Embedding generation
  vectorstores/
    chroma_store.py           # Vector DB operations
  retrieval/
    qa.py                     # Answer generation

config/
  settings.py                 # Central configuration

evaluation/
  questions.json              # Test questions
  run_eval.py                 # Evaluation runner

data/                         # Input documents
outputs/                      # Generated results

⚡ Setup

Create and activate a virtual environment
Install dependencies:

pip install -r requirements.txt

Create .env from .env.example
Add your OpenAI API key
Add documents to data/
Run:

python main.py

🔄 Rebuilding the Index

If you change documents:

python main.py --rebuild

Or:

python main.py --rebuild "What is chunking?"

🧠 Evaluation Harness

Run:

python -m evaluation.run_eval

Output:

outputs/evaluation_results.md

Includes:

Question
Expected sources
Retrieved sources
Matched sources
Generated answer
PASS / FAIL / CHECK status
Summary counts
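
This README does not spell out how a status is assigned, so the sketch below shows one plausible rule rather than what evaluation/run_eval.py does. The assumed rule: PASS when every expected source was retrieved, FAIL when none was, and CHECK for a partial match or when there is nothing to compare against. The function names are hypothetical.

```python
from collections import Counter


def score_retrieval(expected: list[str], retrieved: list[str]) -> dict:
    """Compare expected and retrieved sources and assign a status.

    Assumed rule (not confirmed by the template):
      PASS  - every expected source was retrieved
      CHECK - some were retrieved, or no expectation was given
      FAIL  - none of the expected sources was retrieved
    """
    expected_set = set(expected)
    matched = sorted(expected_set & set(retrieved))
    if not expected_set:
        status = "CHECK"  # nothing to compare against; review manually
    elif len(matched) == len(expected_set):
        status = "PASS"
    elif matched:
        status = "CHECK"  # partial match
    else:
        status = "FAIL"
    return {"matched": matched, "status": status}


def summarize(statuses: list[str]) -> dict:
    """Count how many questions ended in each status."""
    counts = Counter(statuses)
    return {name: counts.get(name, 0) for name in ("PASS", "FAIL", "CHECK")}
```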

📌 Notes

PDF extraction works best for text-based PDFs
Scanned PDFs may require OCR
ChromaDB data is stored in chroma_db/ (ignored by Git)
First run builds the index, later runs reuse it
CLI arguments override config defaults
Config values are stored in config/settings.py
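
The override rule (CLI arguments win over config defaults) can be sketched like this. The `Settings` fields and their default values are placeholders chosen for illustration; the real defaults live in config/settings.py and may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    # Placeholder defaults, not the template's actual values.
    chunk_size: int = 500
    overlap: int = 50
    top_k: int = 4
    collection_name: str = "documents"


def apply_overrides(settings: Settings, **cli_values) -> Settings:
    """Return a copy of settings with every non-None CLI value applied.

    A value of None means the flag was not given, so the config
    default is kept.
    """
    overrides = {key: value for key, value in cli_values.items() if value is not None}
    return replace(settings, **overrides)
```

For example, `apply_overrides(Settings(), chunk_size=250, overlap=None)` changes only `chunk_size` and leaves `overlap` at its configured default.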

🧭 Roadmap

Near-term

Improve evaluation scoring
Enhance configuration flexibility

Future

Support additional file types (.docx, .csv, .json)
Add hybrid search (keyword + vector)
Add MCP server integration
Support multiple vector stores (FAISS, Pinecone, etc.)

⭐ If you found this useful

Give the repo a star ⭐ and feel free to fork or extend it.
