Add RAG service: PDF ingestion, Chroma DB indexing, and FastAPI query endpoint#2
Add RAG service: PDF ingestion, Chroma DB indexing, and FastAPI query endpoint#2NaveenBuidl wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a minimal RAG microservice that ingests a PDF corpus into ChromaDB and serves retrieval (and optional Groq-generated answers) via a FastAPI endpoint, configured through config.yaml + environment variables.
Changes:
- Introduces
RAGEnginefor PDF ingestion, chunking, Chroma indexing, retrieval, and optional Groq chat completions. - Adds a FastAPI app that ingests on startup and exposes a
/queryPOST endpoint. - Adds initial configuration (
config.yaml,.env.example) and runtime dependencies (requirements.txt).
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements.txt | Adds runtime dependencies for FastAPI, PDF parsing, embeddings, and Chroma storage. |
| config.yaml | Provides initial RAG settings (corpus path, chunking, retrieval, models, Chroma persistence). |
| app/config.py | Loads YAML config and reads GROQ_API_KEY from environment into a Settings dataclass. |
| app/rag.py | Implements ingestion/indexing and query + optional generation workflow. |
| app/main.py | Exposes FastAPI app with startup ingestion and /query endpoint. |
| .env.example | Documents GROQ_API_KEY environment variable. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,8 @@ | |||
| corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs" | |||
There was a problem hiding this comment.
config.yaml hard-codes an absolute Windows path for corpus_path. This makes the service non-portable and will fail on other machines/OSes; consider using a repo-relative default (e.g., ./corpus) and/or leaving this as a placeholder value documented in README/.env instead of committing a machine-specific path.
| corpus_path: "D:/Evalens/corpus/intercom_external/raw_pdfs" | |
| corpus_path: "./corpus" |
| return Settings( | ||
| corpus_path=cfg.get("corpus_path", "D:/Evalens/corpus/intercom_external/raw_pdfs"), | ||
| chunk_size=int(cfg.get("chunk_size", 1000)), | ||
| chunk_overlap=int(cfg.get("chunk_overlap", 150)), | ||
| retrieval_k=int(cfg.get("retrieval_k", 4)), | ||
| model=cfg.get("model", "llama-3.1-8b-instant"), | ||
| embedding_model=cfg.get("embedding_model", "sentence-transformers/all-MiniLM-L6-v2"), | ||
| chroma_path=cfg.get("chroma_path", ".chroma"), | ||
| collection_name=cfg.get("collection_name", "intercom_pdfs"), |
There was a problem hiding this comment.
load_settings() defaults corpus_path to an absolute Windows path. Even if config.yaml is changed later, this fallback will still break portability when the key is missing; prefer a relative default (or require corpus_path in config) and optionally support overriding via an environment variable.
| def ingest(self) -> dict[str, Any]: | ||
| corpus = Path(self.settings.corpus_path) | ||
| if not corpus.exists(): | ||
| raise FileNotFoundError(f"Corpus folder not found: {corpus}") | ||
|
|
||
| pdf_paths = sorted(corpus.glob("*.pdf")) | ||
| if not pdf_paths: | ||
| return {"indexed_files": 0, "indexed_chunks": 0} |
There was a problem hiding this comment.
ingest() raises FileNotFoundError when the corpus directory is missing, which will crash the FastAPI startup hook. The PR description mentions ingestion should skip when the corpus is missing; consider returning a structured "skipped" result (or logging and continuing) instead of raising here.
| @app.on_event("startup") | ||
| def startup_event() -> None: | ||
| rag.ingest() | ||
|
|
There was a problem hiding this comment.
startup_event() calls rag.ingest() without handling errors. As written, a missing corpus directory (or any ingestion error) will prevent the app from starting; consider catching expected exceptions (e.g., missing corpus) and logging them so /query can still operate (or at least fail with a clear startup message).
| except Exception as e: # minimal tracer-bullet error handling | ||
| raise HTTPException(status_code=500, detail=str(e)) from e |
There was a problem hiding this comment.
The blanket 500 handler returns detail=str(e) to clients. This can leak internal information such as local file paths and library error messages; prefer returning a generic message for 500s and logging the exception server-side.
| @app.on_event("startup") | ||
| def startup_event() -> None: | ||
| rag.ingest() |
There was a problem hiding this comment.
FastAPI's @app.on_event("startup") hook is deprecated in favor of the lifespan interface in newer FastAPI versions. Using lifespan avoids deprecation warnings and is the forward-compatible way to run ingestion on startup.
| ids: list[str] = [] | ||
| docs: list[str] = [] | ||
| metas: list[dict[str, Any]] = [] | ||
|
|
||
| for pdf_path in pdf_paths: | ||
| with fitz.open(pdf_path) as pdf_doc: | ||
| full_text = "\n".join(page.get_text("text") for page in pdf_doc) | ||
| chunks = self._chunk_text( | ||
| full_text, | ||
| self.settings.chunk_size, | ||
| self.settings.chunk_overlap, | ||
| ) | ||
| for idx, chunk in enumerate(chunks): | ||
| ids.append(f"{pdf_path.stem}-{idx}") | ||
| docs.append(chunk) | ||
| metas.append({"source": str(pdf_path), "chunk_index": idx}) | ||
|
|
||
| if ids: | ||
| self._collection.add(ids=ids, documents=docs, metadatas=metas) | ||
|
|
There was a problem hiding this comment.
ingest() accumulates all chunk ids/documents/metadata for the entire corpus in memory before calling collection.add(). For large PDF sets this can create high peak memory usage (full_text + chunks + lists). Consider adding in batches (e.g., per-PDF or fixed-size batches) to cap memory and make ingestion more robust.
| results = self._collection.query( | ||
| query_texts=[user_query], | ||
| n_results=self.settings.retrieval_k, | ||
| include=["documents", "metadatas", "distances"], |
There was a problem hiding this comment.
collection.query() requests "distances" in include=..., but the code never uses distances. Omitting it will reduce data transfer/processing, especially when retrieval_k is larger.
| include=["documents", "metadatas", "distances"], | |
| include=["documents", "metadatas"], |
Motivation
sentence-transformersfor embeddings with optional generation via thegroqAPI whenGROQ_API_KEYis present.config.yamland environment variable sample in.env.example.Description
app/config.pywhich defines aSettingsdataclass andload_settings()to readconfig.yamlandGROQ_API_KEYfrom the environment.app/rag.pycontainingRAGEnginewhich handles PDF ingestion (viapymupdf/fitz), text chunking, indexing into a Chroma collection, nearest-neighbor retrieval, and optional answer generation through thegroqclient.app/main.pywhich exposes a FastAPI app with startup ingestion and a/queryPOST endpoint that accepts aqueryand returns retrieval results and generated answers.config.yaml,.env.examplewithGROQ_API_KEY, andrequirements.txtlisting runtime dependencies includingfastapi,uvicorn,pymupdf,chromadb,sentence-transformers,groq,python-dotenv, andpyyaml.GROQ_API_KEYis not set.Testing
Codex Task