Hybrid Enterprise RAG System

A production-grade, deterministic, and highly secure Hybrid Enterprise RAG (Retrieval-Augmented Generation) Analytics System. This system merges structured SQL analytics (e.g., metric calculations like revenue and conversion rates), time-series forecasting, and unstructured qualitative retrieval (e.g., sales notes, transcripts) into a single, cohesive, AI-orchestrated query engine.

Instead of relying on autonomous agents that can hallucinate or exfiltrate data, this architecture enforces strict deterministic routing, bounded tool execution, and an LLM layer restricted solely to explanation synthesis based on verified context.

How It Works

The system processes user natural language queries through a sequence of strict, deterministic layers:

Security & Sanitization: Filters inputs for prompt injections, XSS, and enforces RBAC (Role-Based Access Control) bounds.
Intent Classification & Orchestration: A router evaluates the query and classifies the intent (METRIC, FORECAST, QUALITATIVE, or MIXED).
Execution Routing: The orchestrator invokes the corresponding deterministic engines in parallel:
- SQL Engine: Accesses the data warehouse to calculate hard metrics.
- Forecast Engine: Generates time-series projections using statistical models (ARIMA).
- Retrieval Engine: Performs vector searches (FAISS + Sentence-Transformers) mapped with Re-ranking (Cross-Encoder) for maximum qualitative accuracy.
LLM Generation: The retrieved, structured, and sanitized data is packaged into a JSON context window. The LLM is instructed only to synthesize an answer from this exact JSON. If the data is insufficient, it safely refuses to answer.
Continuous Improvement Loop: All telemetry, costs, low-confidence hits, and human feedback are actively logged into structured schemas to form an Active Learning queue.

Architecture Diagram

graph TD
    User([User Query]) --> API[FastAPI Entrypoint]
    API --> Security[Security & Sanitization Layer]
    Security --> Router[Intent Router]
    
    Router -->|METRIC| SQL[SQL Engine]
    Router -->|FORECAST| Forecast[Forecast Engine]
    Router -->|QUALITATIVE| Vector[Retrieval Engine + Re-ranker]
    
    SQL --> Context[Context Builder]
    Forecast --> Context
    Vector --> Context
    
    Context --> Guard[Hallucination Guard]
    Guard --> LLM[LLM Explanation Layer]
    LLM --> Response([Final Response])
    
    API -.-> Obs[Observability & Analytics]
    LLM -.-> CI[Continuous Improvement Loop]

Design System

The project was methodically built across the following architectural phases:

Phase 0: Baseline Scripts: Built the raw data pipelines. Downloadede CRM dataset from kaggle
Phase 1: Core System Skeleton: Established deterministic routing, API layers, and isolated tool contracts.
Phase 2: Data & Retrieval Layer: Implemented vector storage, BM25, semantic chunking, and deterministic embedding generation.
Phase 3: Accuracy Optimization Layer: Added Cross-Encoder re-ranking and latency guards to maximize Top-K precision.
Phase 4: Generation Layer: Configured the LLM context builders, strict prompt constraints, and zero-tolerance hallucination guardrails.
Phase 5: Reliability Layer: Configured circuit breakers, timeout thresholds, and graceful degradation routers to handle component failures.
Phase 6: Observability Layer: Implemented OpenTelemetry tracing, Prometheus metrics, and structured aggregator logging.
Phase 7: Security & Access Control: Bolstered tenant isolation, RBAC row-level bounds, and strict prompt injection blocking.
Phase 8: Cost Management Layer: Established dynamic query complexity scoring, token budgeting, caching layers, and LLM tier routing.
Phase 9: Governance & Compliance: Designed explicit data deletion pipelines (GDPR), embedding version tag enforcement, and immutable JSON audit trails.
Phase 10: Continuous Improvement Loop: Exposed Feedback APIs, explicit failure dashboards, and Active Learning queues tying LLM misses back to Human-in-The-Loop review points.
Phase 11: Production Evaluation: Ran full test suites to measure latency, stress bounds, and cost tracking.

Sample Test Results

User Query: "Why did we lose the Initech deal, and what was the total value lost?"

System Result:

"The Initech deal was lost primarily due to three reasons: first, our pricing was 30% higher for comparable features; second, they required an on-premise deployment which we do not offer; and third, the proof of concept took three weeks compared to the competitor's three-day trial. The total value of the missed opportunity was $450,000."

Behind the Scenes: The Orchestrator accurately tagged this as a QUALITATIVE intent, fetched the precise chunk from the vectorized lost_deal_reasons document, and successfully parsed out the $450k parameter without executing expensive forecasting or SQL metrics.

Evaluation Metrics & Results

During the Phase 11 stress test evaluation running the query "Summarize the Q3 revenue trends for our enterprise accounts, and explain why the numbers dipped in September compared to August", here are the captured execution metrics:

Metric	Result
Component Latency	Total: 6797ms (Forecast Engine: 3765ms, LLM Gen: 3021ms, Retrieval/SQL/Router: < 10ms total
Cost Per Query	$0.0077 (743 Tokens In, 11 Tokens Out)
Hallucination Rate	0.0 (Strict prevention enforced)
System Verdict	Successfully identified the query needed both SQL and Text data (`MIXED` intent). It calculated the metric perfectly but refused to hallucinate the cause of the decline when qualitative context was missing.
Production Readiness	88/100 (Deductions reflect synchronous ARIMA training blockers and regex keyword routing boundaries).

1. Executive Summary

The Hybrid Enterprise RAG System processed a 20-query benchmark subset against the production-authentic Kaggle CRM dataset using the llama-3.3-70b-versatile model.

Hallucination Rate: 0.00% (Target < 1%) -> PASS
SQL Exact Match: 100.00% (Target >= 99%) -> PASS
Forecast MAPE: 12.40% (Target < 15%) -> PASS
Router Macro F1 Proxy: 1.00 (Target >= 0.90) -> PASS (Upgraded to 70b)
P95 Latency: 75.18s (Target < 4s) -> FAIL (Outlier in Forecast Engine)
Avg Cost per Query: $0.0006

2. RAG Retrieval & Groundedness

Recall@K Proxy: 0.95 (Improved by high-density real-world memos)
Faithfulness Rating: 0.99 (Strict parameter extraction and 70b model reasoning)
Refusal Correctness: 25.0% Refusal Rate. Correctness verified for adversarial/out-of-bound queries.

3. Telemetry & Cost Sustainability

P50 Latency: 1.56s
P95 Latency: 75.18s
P99 Latency: 75.18s
Estimated Daily Cost (10k volume): $6.00
Estimated Daily Cost (100k volume): $60.00

4. Human Evaluation Loop (Simulated Sample n=20)

Executive Usefulness Score: 4.8 / 5.0 (High confidence due to real CRM data patterns)
Clarity: 4.9 / 5.0 (Synthesized reports for Medical/Software sectors)

5. Bottlenecks Identified

Forecast Engine Latency: ARIMA computations for the expanded 8,800-row pipeline dataset triggered significant delays in the projection phase (75s outlier).
Semantic Extraction Overhead: While extraction is accurate (100% router F1), the 70b model's complex prompt adds ~500ms to the routing phase compared to the 8b model.
Data Mapping Complexity: Alignment between CSV stages ('Won') and DB stages ('closed_won') is now stable but requires strict maintenance of the WarehouseLoader mapping.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
app		app
config		config
docs		docs
frontend		frontend
grafana		grafana
tests		tests
.env.example		.env.example
.gitignore		.gitignore
DEPLOYMENT.md		DEPLOYMENT.md
Dockerfile		Dockerfile
README.md		README.md
Technical Analysis and Trade-off.md		Technical Analysis and Trade-off.md
alerts.yml		alerts.yml
benchmark_rerank.py		benchmark_rerank.py
bootstrap_log.txt		bootstrap_log.txt
build_index.py		build_index.py
docker-compose.yml		docker-compose.yml
eval_results.json		eval_results.json
evaluate_retrieval.py		evaluate_retrieval.py
prometheus.yml		prometheus.yml
reload_data.py		reload_data.py
requirements.txt		requirements.txt
run_evaluation.py		run_evaluation.py
run_phase12_benchmark.py		run_phase12_benchmark.py
run_phase14_benchmark.py		run_phase14_benchmark.py
run_test.py		run_test.py
verify_fix.py		verify_fix.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hybrid Enterprise RAG System

How It Works

Architecture Diagram

Design System

Sample Test Results

Evaluation Metrics & Results

1. Executive Summary

2. RAG Retrieval & Groundedness

3. Telemetry & Cost Sustainability

4. Human Evaluation Loop (Simulated Sample n=20)

5. Bottlenecks Identified

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Hybrid Enterprise RAG System

How It Works

Architecture Diagram

Design System

Sample Test Results

Evaluation Metrics & Results

1. Executive Summary

2. RAG Retrieval & Groundedness

3. Telemetry & Cost Sustainability

4. Human Evaluation Loop (Simulated Sample n=20)

5. Bottlenecks Identified

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages