Agentic Retrieval-Augmented Generation (RAG) service built with FastAPI, LangChain/LangGraph, Azure OpenAI, and a hybrid Chroma + BM25 retrieval stack. It powers InternalIQ, an internal engineering knowledge assistant capable of indexing PDF corpora, retrieving grounded evidence, drafting answers, and verifying responses before returning them to end users.
Engineering organizations accumulate large amounts of PDF documentation (governance, operating procedures, systems knowledge). InternalIQ closes the discovery gap by:
- Converting PDFs to structured markdown via Docling, chunking semantically, and indexing into Chroma and BM25.
- Running a LangGraph-powered agent workflow that retrieves, checks relevance, drafts answers, and verifies grounding before responding.
- Providing a minimal FastAPI surface for ingestion (POST /documents) and querying (POST /query).
- Hybrid retrieval – HybridRetriever fuses dense embeddings (Chroma) with sparse BM25 scores for robust recall.
- Deterministic agent workflow – AgentWorkflow orchestrates retrieve → relevance gate → research → verification with retry loops.
- LLM-driven guardrails – Specialized agents (relevance, research, verification) run on Azure OpenAI deployments to ensure grounded, concise answers.
- Incremental indexing – DocumentIndexer leverages Docling loaders, semantic chunking, and an SHA-256 IndexCache to avoid re-embedding unchanged PDFs.
- Persistent vector store – VectorStore persists embeddings in ./chroma_db, enabling warm restarts and offline updates.
- Category-aware data lake – Documents live under data/<category>/... so teams can scope ingestion per domain (systems, operations, governance, etc.).
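As a rough illustration of the hybrid retrieval idea, the sketch below fuses dense and sparse scores by min-max normalizing each set and taking a weighted sum. It is not the HybridRetriever implementation: the function names, the `dense_weight` parameter, and the sample document IDs are all assumptions, and the real class may use a different fusion rule (for example, rank-based fusion).

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 0.5,  # hypothetical knob, not from the project
) -> List[Tuple[str, float]]:
    """Return (doc_id, fused_score) pairs, best first."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {}
    for doc_id in set(dense_n) | set(sparse_n):
        # A document missing from one retriever contributes 0 for that side.
        fused[doc_id] = (
            dense_weight * dense_n.get(doc_id, 0.0)
            + (1.0 - dense_weight) * sparse_n.get(doc_id, 0.0)
        )
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}     # made-up similarity scores
    sparse_scores = {"b": 12.0, "c": 3.0, "d": 6.0}   # made-up BM25 scores
    for doc_id, score in fuse_scores(dense_scores, sparse_scores):
        print(doc_id, round(score, 3))
```

Normalizing first matters because BM25 scores are unbounded while embedding similarities usually fall in a narrow range; without it, one retriever would dominate the sum.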
Data flow:
- PDFs dropped into data/ (or uploaded via /documents) are converted to markdown (loaders), chunked (chunking), and embedded.
- Hybrid retrieval fetches the most relevant documents for each query.
- The Relevance Agent (relevance_agent.py) checks whether the query is relevant to the retrieved documents and whether an answer can be generated from them; queries failing this gate short-circuit to save tokens.
- Research and Verification agents iterate to refine and ground the answer, or exhaust retry budget if the documents cannot support a response.
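The control flow described above can be sketched in plain Python, without LangGraph, to show the relevance gate and the retry loop. This is a simplified illustration under assumptions: `run_workflow`, `WorkflowResult`, and `max_attempts` are hypothetical names, and the four callables stand in for the retriever and the LLM-backed agents. Only the CAN_ANSWER and NO_MATCH labels come from this README.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WorkflowResult:
    relevance_label: str
    answer: str = ""
    verified: bool = False
    attempts: int = 0


def run_workflow(
    query: str,
    retrieve: Callable[[str], List[str]],
    check_relevance: Callable[[str, List[str]], str],
    research: Callable[[str, List[str]], str],
    verify: Callable[[str, List[str]], bool],
    max_attempts: int = 3,  # hypothetical retry budget
) -> WorkflowResult:
    docs = retrieve(query)
    label = check_relevance(query, docs)
    result = WorkflowResult(relevance_label=label)

    # Relevance gate: skip the research/verification calls entirely.
    if label == "NO_MATCH":
        return result

    # Draft, verify, and retry until grounded or the budget is exhausted.
    for attempt in range(1, max_attempts + 1):
        result.attempts = attempt
        result.answer = research(query, docs)
        result.verified = verify(result.answer, docs)
        if result.verified:
            break
    return result


if __name__ == "__main__":
    outcome = run_workflow(
        "What is the SEV escalation process?",
        retrieve=lambda q: ["[Document 1 | governance.pdf] ..."],
        check_relevance=lambda q, docs: "CAN_ANSWER",
        research=lambda q, docs: "stub answer",
        verify=lambda answer, docs: True,
    )
    print(outcome)
```

In the actual service these steps are nodes in a LangGraph graph; the loop here only mirrors the ordering and the short-circuit behavior.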
app/
  agents/          # LLM-powered relevance, research, verification agents
  ingestion/       # PDF loading, markdown chunking, index cache, document indexer
  orchestration/   # LangGraph AgentState + workflow definition
  retrieval/       # Hybrid retriever, Chroma vector store, BM25 index
  main.py          # FastAPI entrypoint exposing /query and /documents
chroma_db/         # Persisted Chroma collections
data/              # Domain-organized PDF corpus
requirements.txt   # Python dependency lock
- Python 3.11+ (tested on 3.11.x)
- Poetry or pip/venv (instructions below assume pip)
- Azure OpenAI resource with:
  - Chat completion deployment for the agents
  - Embedding deployment for Chroma ingestion
- Local install of poppler/libmagic if Docling requires native deps (platform specific)
# 1. Clone and enter the repo
git clone <repo-url>
cd backend
# 2. Create & activate a virtual environment
python -m venv .venv
source .venv/Scripts/activate # Git Bash on Windows; Linux/macOS: source .venv/bin/activate; PowerShell: .venv\Scripts\Activate.ps1
# 3. Install Python dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4. Create the runtime directories
mkdir -p data/engineering_governance data/engineering_operations
mkdir -p chroma_db .index_cache
# 5. Provide environment variables (see below)
cp .env.example .env # create and fill in values
Populate a .env file at repo root (parsed via app/config.py).
| Variable | Description |
|---|---|
| API_KEY | Azure OpenAI API key with access to both chat + embedding deployments. |
| AZURE_EMBEDDING_ENDPOINT | Base URL of the embedding Azure OpenAI resource (https://<resource>.openai.azure.com). |
| AZURE_EMBEDDING_DEPLOYMENT | Embedding deployment name used by Chroma (text-embedding-3-large, etc.). |
| LLM_ENDPOINT | Base URL for the chat completion Azure OpenAI resource. |
| LLM_DEPLOYMENT | Deployment name for the chat model powering all three agents. |
Tip: Keep embeddings and chat deployments in the same resource when possible to simplify networking and private link configuration.
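A fail-fast check for these variables might look like the sketch below. It is an assumption-laden illustration, not the contents of app/config.py (which is not shown here and may rely on a settings library instead): `Settings`, `load_settings`, and `REQUIRED_VARS` are hypothetical names, and the function only reads the process environment, so the .env file must already have been loaded by whatever mechanism the app uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the table above.
REQUIRED_VARS = (
    "API_KEY",
    "AZURE_EMBEDDING_ENDPOINT",
    "AZURE_EMBEDDING_DEPLOYMENT",
    "LLM_ENDPOINT",
    "LLM_DEPLOYMENT",
)


@dataclass(frozen=True)
class Settings:
    api_key: str
    azure_embedding_endpoint: str
    azure_embedding_deployment: str
    llm_endpoint: str
    llm_deployment: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, listing every missing variable."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("Chat deployment:", settings.llm_deployment)
    except RuntimeError as exc:
        print(exc)
```

Reporting all missing names at once avoids the fix-one-restart-repeat cycle when a fresh .env is incomplete.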
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
- POST /query – Accepts a JSON body like { "query": "How do we roll out change requests?" } and returns an object containing the drafted answer, verification verdict, context excerpts, and relevance label.
- POST /documents – Accepts multipart form-data with file=<PDF> and category=<folder> to index new material on the fly.
OpenAPI docs and an interactive Swagger UI live at http://localhost:8000/docs when the server is running.
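For callers that prefer Python over curl, a minimal standard-library client for POST /query could look like this. The helper names `build_query_request` and `ask`, the default base URL, and the timeout value are assumptions for illustration; only the endpoint path and the { "query": ... } body shape come from this README.

```python
import json
import urllib.request


def build_query_request(
    question: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST /query request."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(
    question: str, base_url: str = "http://localhost:8000", timeout: float = 60.0
) -> dict:
    """Send the question to a running server and return the decoded JSON response."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the API to be running locally.
    print(ask("How do we roll out change requests?"))
```

The generous timeout is deliberate: a query may trigger several LLM calls before the response is returned.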
- Place PDFs under data/<category>/your-doc.pdf.
- Start the API once; DocumentIndexer performs a recursive crawl and indexes anything not cached.
curl -X POST "http://localhost:8000/documents" \
  -F "file=@data/engineering_systems/system-architecture.pdf" \
  -F "category=engineering_systems"
- Non-PDF uploads receive a 400 error.
- Indexed docs are hashed and tracked via .index_cache/indexed_docs.json to avoid re-embedding duplicates.
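The hash-based skip logic can be sketched as follows. This is an illustration only: the real IndexCache may store different metadata, and the path-to-digest JSON layout plus the function names `file_sha256`, `needs_indexing`, and `mark_indexed` are assumptions.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the file in blocks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def _load_cache(cache_path: Path) -> dict:
    return json.loads(cache_path.read_text()) if cache_path.exists() else {}


def needs_indexing(pdf_path: Path, cache_path: Path) -> bool:
    """True when the PDF is new or its content changed since it was last indexed."""
    return _load_cache(cache_path).get(str(pdf_path)) != file_sha256(pdf_path)


def mark_indexed(pdf_path: Path, cache_path: Path) -> None:
    """Record the current digest after the PDF has been embedded."""
    cache = _load_cache(cache_path)
    cache[str(pdf_path)] = file_sha256(pdf_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cache, indent=2))


if __name__ == "__main__":
    cache_file = Path(".index_cache/indexed_docs.json")
    for pdf in Path("data").rglob("*.pdf"):
        if needs_indexing(pdf, cache_file):
            print("would index:", pdf)
```

Keying on content hashes rather than modification times means a re-downloaded but unchanged PDF is not re-embedded.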
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the SEV escalation process?"}'
Successful responses resemble:
{
  "answer": "SEV escalations follow a three-stage process...",
  "verification": {
    "Supported": "YES",
    "Unsupported Claims": [],
    "Contradictions": [],
    "Relevant": "YES"
  },
  "context_used": "[Document 1 | governance.pdf] ...",
  "relevance_label": "CAN_ANSWER"
}
If relevance fails (NO_MATCH), the workflow short-circuits to save tokens.
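A caller may want to decide programmatically whether to trust an answer. The sketch below checks the fields shown in the sample response. The `is_grounded` helper is hypothetical, and the exact shape of a short-circuited (NO_MATCH) response is an assumption, so missing fields are treated as "not grounded".

```python
from typing import Any, Dict


def is_grounded(response: Dict[str, Any]) -> bool:
    """True only when the answer passed both the relevance gate and verification."""
    if response.get("relevance_label") != "CAN_ANSWER":
        return False
    # A short-circuited response may omit "verification"; fall back to an empty dict.
    verification = response.get("verification") or {}
    return (
        verification.get("Supported") == "YES"
        and verification.get("Relevant") == "YES"
        and not verification.get("Unsupported Claims")
        and not verification.get("Contradictions")
    )


if __name__ == "__main__":
    sample = {
        "answer": "SEV escalations follow a three-stage process...",
        "verification": {
            "Supported": "YES",
            "Unsupported Claims": [],
            "Contradictions": [],
            "Relevant": "YES",
        },
        "context_used": "[Document 1 | governance.pdf] ...",
        "relevance_label": "CAN_ANSWER",
    }
    print(is_grounded(sample))
```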
- Hot reload – uvicorn app.main:app --reload during development.
- Formatting/Linting – Configure your preferred toolset (e.g., ruff or black).
- Tests – Add pytest suites under tests/ (not yet included) to cover ingestion, retrieval heuristics, and agent mocks.
- Observability – Logging is already structured; wire in OpenTelemetry exporters if you need tracing.
- Missing punkt tokenizer errors → let NLTK download automatically or pre-install via python -m nltk.downloader punkt punkt_tab.
- FileNotFoundError: data → ensure the data/ directory exists before starting the API.
- Empty answers → verify embeddings + chat deployments are correctly configured and accessible from your environment.
- Chroma schema issues → delete ./chroma_db to force a clean rebuild (will re-embed all content).
Need help or have ideas? File an issue or start a discussion – contributions are welcome!

