Add RAG service with FastAPI, RAGEngine, config, and deps #3

Open
NaveenBuidl wants to merge 1 commit into main from codex/build-minimal-rag-app-with-fastapi-ee2e7h

Conversation

@NaveenBuidl
Owner

Motivation

  • Provide a small Retrieval-Augmented Generation (RAG) service to index a PDF/text corpus and serve queries with optional Groq-powered generation.
  • Centralize runtime configuration so that settings are loaded from config.yaml and credentials (like GROQ_API_KEY) are loaded from environment variables via .env.
  • Add required dependencies and a lightweight API so the RAG functionality can be run as a service.

Description

  • Add app/config.py to load settings from config.yaml and GROQ_API_KEY from environment via python-dotenv into a Settings dataclass.
  • Implement app/rag.py containing RAGEngine, which extracts text from PDFs with PyMuPDF, chunks documents, indexes the chunks through a chromadb persistent client with a SentenceTransformer embedding function, and performs similarity queries.
  • Add app/main.py exposing a FastAPI app with a startup ingest() call and a /query POST endpoint that returns retrieved chunks, sources, and an optional Groq-generated answer when GROQ_API_KEY is set.
  • Include config.yaml defaults, .env.example with GROQ_API_KEY placeholder, and requirements.txt listing runtime dependencies (fastapi, uvicorn, pymupdf, chromadb, sentence-transformers, groq, python-dotenv, pyyaml).
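
The description mentions chunking but does not say how RAGEngine splits documents. As a rough illustration only, one common approach is fixed-size character windows with overlap; the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical and may not match the PR's code or config keys.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: the PR does not show its actual chunking logic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that are only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some text twice.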

Testing

  • No automated tests were executed as part of this rollout.

Codex Task

Copilot AI review requested due to automatic review settings April 8, 2026 07:41
Copilot AI left a comment

Pull request overview

Adds a minimal FastAPI-based Retrieval-Augmented Generation (RAG) microservice that can ingest a local corpus into a persistent ChromaDB collection and serve similarity queries with optional Groq-based generation, with runtime configuration sourced from config.yaml and environment variables.

Changes:

  • Introduces RAGEngine for PDF/text ingestion, chunking, embedding, ChromaDB persistence, and query-time retrieval + optional Groq completion.
  • Adds configuration loading via YAML + .env and exposes a FastAPI app with startup ingestion and a /query endpoint.
  • Adds initial runtime dependency list, default config.yaml, and .env.example.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| requirements.txt | Adds runtime dependencies for FastAPI, ChromaDB, embeddings, PDF parsing, Groq, and config loading. |
| config.yaml | Provides default runtime configuration (corpus path, chunking, retrieval params, model, Chroma persistence). |
| app/config.py | Implements settings loading from YAML + .env into a Settings dataclass. |
| app/rag.py | Implements ingestion + retrieval and optional Groq answer generation. |
| app/main.py | Exposes the service via FastAPI startup ingestion and /query endpoint. |
| .env.example | Documents required Groq API key env var. |

Comment thread config.yaml
@@ -0,0 +1,8 @@
corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs"
Copilot AI Apr 8, 2026

corpus_path default is a machine-specific absolute Windows path, which will break on other environments and in CI/containers. Consider making the default a relative path (e.g., ./corpus) and/or sourcing it from an environment variable instead of hardcoding a local drive path.

Suggested change
- corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs"
+ corpus_path: "./corpus"

Copilot uses AI. Check for mistakes.
Comment thread config.yaml
retrieval_k: 4
model: "llama-3.1-8b-instant"
embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
chroma_path: ".chroma"
Copilot AI Apr 8, 2026

Default chroma_path is set to .chroma, but the repository’s ignore patterns currently exclude chroma_db/ rather than .chroma/. This makes it easy to accidentally commit the persistent vector DB; consider aligning the default path with the ignored directory name or updating ignore rules accordingly.

Suggested change
- chroma_path: ".chroma"
+ chroma_path: "chroma_db"

Comment thread app/config.py
Comment on lines +35 to +37

return Settings(
    corpus_path=cfg.get("corpus_path", "D:/Evalens/corpus/intercom_external/raw_pdfs"),
Copilot AI Apr 8, 2026

load_settings() falls back to a machine-specific absolute Windows corpus_path. If config.yaml is missing/misconfigured in another environment, the service will fail in a non-obvious way. Prefer a portable default (relative path) and/or allow overriding via an env var (e.g., CORPUS_PATH).

Suggested change
- return Settings(
-     corpus_path=cfg.get("corpus_path", "D:/Evalens/corpus/intercom_external/raw_pdfs"),
+ default_corpus_path = cfg_file.parent / "corpus"
+ corpus_path = os.getenv("CORPUS_PATH") or cfg.get("corpus_path") or str(default_corpus_path)
+ return Settings(
+     corpus_path=corpus_path,

Comment thread app/rag.py
Comment on lines +60 to +62
existing_count = self._collection.count()
if existing_count > 0:
    return {"indexed_files": 0, "indexed_chunks": existing_count, "skipped": True}
Copilot AI Apr 8, 2026

ingest() permanently skips indexing when the collection already has any documents (count() > 0). With a persistent Chroma DB, this means newly added/updated corpus files will never be indexed unless the DB is manually deleted. Consider tracking ingested sources (e.g., by file mtime/hash) and upserting new chunks, or providing an explicit force_reindex/reset option.
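
One way the hash-based tracking suggested above could look is sketched here, assuming a manifest that maps each file's corpus-relative path to the SHA-256 of its contents. The names `file_digest`, `files_to_index`, and `manifest` are hypothetical; how the manifest is persisted (for example as JSON next to the Chroma directory) is left open.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in blocks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def files_to_index(corpus_dir: Path, manifest: dict[str, str]) -> list[Path]:
    """Return files that are new or whose content differs from the manifest.

    `manifest` (hypothetical) maps corpus-relative POSIX paths to the digest
    recorded at the last successful ingest.
    """
    pending: list[Path] = []
    for path in sorted(corpus_dir.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(corpus_dir).as_posix()
        if manifest.get(key) != file_digest(path):
            pending.append(path)
    return pending
```

With something like this, ingest() could index only the returned files and then update the manifest, instead of returning early whenever count() > 0. Chunks belonging to a changed file would still need to be replaced or deleted before re-adding.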

Comment thread app/rag.py
Comment on lines +80 to +83
for idx, chunk in enumerate(chunks):
    ids.append(f"{source_path.stem}-{idx}")
    docs.append(chunk)
    metas.append({"source": str(source_path), "chunk_index": idx})
Copilot AI Apr 8, 2026

Chunk IDs are derived from source_path.stem and idx, which can collide when different files share the same stem (e.g., report.pdf and report.txt, or duplicates in different subfolders), causing Chroma add() failures or overwrites. Consider incorporating the full relative path (or a stable hash of it) and the suffix into the ID.
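
A minimal sketch of a collision-resistant ID, assuming the engine knows the corpus root directory. The helper name `chunk_id` and the `corpus_root` parameter are hypothetical, and the 12-character truncation of the hash is an arbitrary choice for readability.

```python
import hashlib
from pathlib import Path


def chunk_id(source_path: Path, corpus_root: Path, idx: int) -> str:
    """Build a chunk ID from a hash of the corpus-relative path plus the index.

    The relative path includes subfolders and the file suffix, so
    report.pdf, report.txt, and sub/report.pdf all hash differently.
    """
    rel = source_path.relative_to(corpus_root).as_posix()
    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{idx}"
```

Because the ID depends only on the relative path and the chunk index, it stays stable across machines and re-runs, which also makes upsert-style re-indexing simpler.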

Comment thread app/rag.py
for idx, chunk in enumerate(chunks):
    ids.append(f"{source_path.stem}-{idx}")
    docs.append(chunk)
    metas.append({"source": str(source_path), "chunk_index": idx})
Copilot AI Apr 8, 2026

source metadata stores the full server filesystem path and is later returned in the API response. This can leak internal directory structure to clients. Consider storing/returning only a safe identifier (e.g., basename, relative path within corpus, or a document ID) and keep absolute paths server-side only.

Suggested change
- metas.append({"source": str(source_path), "chunk_index": idx})
+ metas.append({"source": source_path.name, "chunk_index": idx})
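
The basename alone cannot tell apart files with the same name in different subfolders. As an alternative sketch of the "relative path within corpus" option mentioned in the comment, the hypothetical helper `public_source` below returns a corpus-relative path and falls back to the basename for files outside the corpus root.

```python
from pathlib import Path


def public_source(source_path: Path, corpus_root: Path) -> str:
    """Return a client-safe identifier for a source file.

    Uses the path relative to the corpus root, so no absolute server
    directories are exposed; falls back to the bare file name if the
    file does not live under corpus_root.
    """
    try:
        return source_path.relative_to(corpus_root).as_posix()
    except ValueError:
        return source_path.name
```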

Comment thread app/main.py
Comment on lines +19 to +21
@app.on_event("startup")
def startup_event() -> None:
    rag.ingest()
Copilot AI Apr 8, 2026

Running rag.ingest() synchronously during FastAPI startup can block the service from becoming ready for a long time (potentially causing health-check failures/timeouts) when the corpus is large. Consider moving ingestion to a background task, a separate admin endpoint/CLI, or making startup ingestion optional via configuration.
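
A framework-independent sketch of the background-task option follows, using only the standard library. The class name `BackgroundIngestor` and its attributes are hypothetical; the only assumption about the PR's code is that `rag.ingest` is a blocking callable that takes no arguments.

```python
import threading
from typing import Callable, Optional


class BackgroundIngestor:
    """Run a blocking ingest callable on a daemon thread and track its state."""

    def __init__(self, ingest: Callable[[], object]) -> None:
        self._ingest = ingest
        self._done = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.error: Optional[BaseException] = None

    def start(self) -> None:
        """Kick off ingestion without blocking the caller (e.g. a startup hook)."""
        if self._thread is not None:
            return  # already started; do not launch a second run
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        try:
            self._ingest()
        except Exception as exc:  # recorded so a health check can report it
            self.error = exc
        finally:
            self._done.set()

    @property
    def ready(self) -> bool:
        """True once ingestion has finished without raising."""
        return self._done.is_set() and self.error is None

    def wait(self, timeout: Optional[float] = None) -> bool:
        """Block until ingestion finishes; returns False on timeout."""
        return self._done.wait(timeout)
```

In app/main.py the startup hook could then call `start()` on an instance wrapping `rag.ingest`, and the /query handler could return a 503 while `ready` is false. That wiring is a suggestion, not something the PR contains.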

Comment thread app/main.py
Comment on lines +19 to +21
@app.on_event("startup")
def startup_event() -> None:
    rag.ingest()
Copilot AI Apr 8, 2026

Ingestion at startup can race in multi-worker deployments (e.g., uvicorn --workers N) where multiple processes call ingest() simultaneously against the same persistent Chroma path, potentially causing duplicate-ID errors or DB corruption. Consider ensuring single-worker ingestion, adding an inter-process lock, or using a dedicated one-off ingestion job.
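
As a sketch of the inter-process lock option, a lock file created with `O_CREAT | O_EXCL` lets exactly one worker win. The helper name `ingest_lock` and the lock file location are hypothetical. This simple form has a known limitation: if the holder is killed before cleanup, the stale file has to be removed manually.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def ingest_lock(lock_path: Path) -> Iterator[bool]:
    """Yield True if this process created the lock file, False if it already exists.

    O_CREAT | O_EXCL makes creation atomic: only one process can succeed.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # another worker holds the lock; caller should skip ingest
        return
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))  # record the owner for debugging
        yield True
    finally:
        os.unlink(lock_path)  # release the lock when the holder is done
```

A worker would wrap its startup call as `with ingest_lock(path) as acquired:` and run `rag.ingest()` only when `acquired` is true, so the other workers skip ingestion and serve from the shared collection.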

Comment thread app/main.py
Comment on lines +30 to +31
except Exception as e:  # minimal tracer-bullet error handling
    raise HTTPException(status_code=500, detail=str(e)) from e
Copilot AI Apr 8, 2026

Returning detail=str(e) for all unexpected exceptions can leak internal error messages, file paths, and implementation details to clients. Prefer returning a generic 500 message and logging the exception server-side (with a request/correlation ID if possible).
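
A minimal sketch of that pattern, kept independent of FastAPI so it runs on its own. The helper name `safe_error_detail`, the logger name `rag.api`, and the response field names are hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("rag.api")  # hypothetical logger name


def safe_error_detail(exc: Exception) -> dict:
    """Log the full exception server-side and return a generic client payload.

    The error_id lets an operator find the matching log entry without the
    response exposing the message, paths, or stack trace.
    """
    error_id = uuid.uuid4().hex
    logger.error("Unhandled error %s", error_id, exc_info=exc)
    return {"message": "Internal server error", "error_id": error_id}
```

In the /query handler, the except branch could then pass `safe_error_detail(e)` as the `detail` of the `HTTPException` instead of `str(e)`.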
