Add RAG service with FastAPI, RAGEngine, config, and deps#3
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a minimal FastAPI-based Retrieval-Augmented Generation (RAG) microservice that can ingest a local corpus into a persistent ChromaDB collection and serve similarity queries with optional Groq-based generation, with runtime configuration sourced from config.yaml and environment variables.
Changes:
- Introduces
RAGEnginefor PDF/text ingestion, chunking, embedding, ChromaDB persistence, and query-time retrieval + optional Groq completion. - Adds configuration loading via YAML +
.envand exposes a FastAPI app with startup ingestion and a/queryendpoint. - Adds initial runtime dependency list, default
config.yaml, and.env.example.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
requirements.txt |
Adds runtime dependencies for FastAPI, ChromaDB, embeddings, PDF parsing, Groq, and config loading. |
config.yaml |
Provides default runtime configuration (corpus path, chunking, retrieval params, model, Chroma persistence). |
app/config.py |
Implements settings loading from YAML + .env into a Settings dataclass. |
app/rag.py |
Implements ingestion + retrieval and optional Groq answer generation. |
app/main.py |
Exposes the service via FastAPI startup ingestion and /query endpoint. |
.env.example |
Documents required Groq API key env var. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,8 @@ | |||
| corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs" | |||
There was a problem hiding this comment.
corpus_path default is a machine-specific absolute Windows path, which will break on other environments and in CI/containers. Consider making the default a relative path (e.g., ./corpus) and/or sourcing it from an environment variable instead of hardcoding a local drive path.
| corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs" | |
| corpus_path: "./corpus" |
| retrieval_k: 4 | ||
| model: "llama-3.1-8b-instant" | ||
| embedding_model: "sentence-transformers/all-MiniLM-L6-v2" | ||
| chroma_path: ".chroma" |
There was a problem hiding this comment.
Default chroma_path is set to .chroma, but the repository’s ignore patterns currently exclude chroma_db/ rather than .chroma/. This makes it easy to accidentally commit the persistent vector DB; consider aligning the default path with the ignored directory name or updating ignore rules accordingly.
| chroma_path: ".chroma" | |
| chroma_path: "chroma_db" |
|
|
||
| return Settings( | ||
| corpus_path=cfg.get("corpus_path", "D:/Evalens/corpus/intercom_external/raw_pdfs"), |
There was a problem hiding this comment.
load_settings() falls back to a machine-specific absolute Windows corpus_path. If config.yaml is missing/misconfigured in another environment, the service will fail in a non-obvious way. Prefer a portable default (relative path) and/or allow overriding via an env var (e.g., CORPUS_PATH).
| return Settings( | |
| corpus_path=cfg.get("corpus_path", "D:/Evalens/corpus/intercom_external/raw_pdfs"), | |
| default_corpus_path = cfg_file.parent / "corpus" | |
| corpus_path = os.getenv("CORPUS_PATH") or cfg.get("corpus_path") or str(default_corpus_path) | |
| return Settings( | |
| corpus_path=corpus_path, |
| existing_count = self._collection.count() | ||
| if existing_count > 0: | ||
| return {"indexed_files": 0, "indexed_chunks": existing_count, "skipped": True} |
There was a problem hiding this comment.
ingest() permanently skips indexing when the collection already has any documents (count() > 0). With a persistent Chroma DB, this means newly added/updated corpus files will never be indexed unless the DB is manually deleted. Consider tracking ingested sources (e.g., by file mtime/hash) and upserting new chunks, or providing an explicit force_reindex/reset option.
| for idx, chunk in enumerate(chunks): | ||
| ids.append(f"{source_path.stem}-{idx}") | ||
| docs.append(chunk) | ||
| metas.append({"source": str(source_path), "chunk_index": idx}) |
There was a problem hiding this comment.
Chunk IDs are derived from source_path.stem and idx, which can collide when different files share the same stem (e.g., report.pdf and report.txt, or duplicates in different subfolders), causing Chroma add() failures or overwrites. Consider incorporating the full relative path (or a stable hash of it) and the suffix into the ID.
| for idx, chunk in enumerate(chunks): | ||
| ids.append(f"{source_path.stem}-{idx}") | ||
| docs.append(chunk) | ||
| metas.append({"source": str(source_path), "chunk_index": idx}) |
There was a problem hiding this comment.
source metadata stores the full server filesystem path and is later returned in the API response. This can leak internal directory structure to clients. Consider storing/returning only a safe identifier (e.g., basename, relative path within corpus, or a document ID) and keep absolute paths server-side only.
| metas.append({"source": str(source_path), "chunk_index": idx}) | |
| metas.append({"source": source_path.name, "chunk_index": idx}) |
| @app.on_event("startup") | ||
| def startup_event() -> None: | ||
| rag.ingest() |
There was a problem hiding this comment.
Running rag.ingest() synchronously during FastAPI startup can block the service from becoming ready for a long time (potentially causing health-check failures/timeouts) when the corpus is large. Consider moving ingestion to a background task, a separate admin endpoint/CLI, or making startup ingestion optional via configuration.
| @app.on_event("startup") | ||
| def startup_event() -> None: | ||
| rag.ingest() |
There was a problem hiding this comment.
Ingestion at startup can race in multi-worker deployments (e.g., uvicorn --workers N) where multiple processes call ingest() simultaneously against the same persistent Chroma path, potentially causing duplicate-ID errors or DB corruption. Consider ensuring single-worker ingestion, adding an inter-process lock, or using a dedicated one-off ingestion job.
| except Exception as e: # minimal tracer-bullet error handling | ||
| raise HTTPException(status_code=500, detail=str(e)) from e |
There was a problem hiding this comment.
Returning detail=str(e) for all unexpected exceptions can leak internal error messages, file paths, and implementation details to clients. Prefer returning a generic 500 message and logging the exception server-side (with a request/correlation ID if possible).
Motivation
GROQ_API_KEY) are loaded from.envandconfig.yaml.Description
app/config.pyto load settings fromconfig.yamlandGROQ_API_KEYfrom environment viapython-dotenvinto aSettingsdataclass.app/rag.pycontainingRAGEnginewhich useschromadbpersistent client, aSentenceTransformerembedding function, andPyMuPDFto extract text from PDFs, chunk documents, index chunks, and perform similarity queries.app/main.pyexposing a FastAPI app with a startupingest()call and a/queryPOST endpoint that returns retrieved chunks, sources, and an optional Groq-generated answer whenGROQ_API_KEYis set.config.yamldefaults,.env.examplewithGROQ_API_KEYplaceholder, andrequirements.txtlisting runtime dependencies (fastapi,uvicorn,pymupdf,chromadb,sentence-transformers,groq,python-dotenv,pyyaml).Testing
Codex Task