RAG evaluation harness with API endpoints, scenario datasets, and failure taxonomy for regression tracking.
Most RAG projects demo answers but do not provide repeatable evaluation evidence for retrieval and grounding quality.
- API: src/rag_eval_observatory/api/main.py
- Datasets: legal + support (src/rag_eval_observatory/datasets.py)
- Evaluation engine: src/rag_eval_observatory/evaluate.py
- Error taxonomy: src/rag_eval_observatory/taxonomy.py
- Persistence: SQLite run store (src/rag_eval_observatory/db.py)

See docs/ARCHITECTURE.md.
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python scripts/init_db.py
uvicorn rag_eval_observatory.api.main:app --reload --port 8800

Optional DB path override:
export RAG_EVAL_DB_PATH=data/runs.db

Endpoints:
- GET /health
- GET /version
- POST /v1/eval/run
- GET /v1/eval/{run_id}
- GET /v1/eval/summary

Response envelope:
{
"status": "ok",
"data": {},
"meta": {"model_version": "0.1.0", "latency_ms": 0},
"error": null
}

pytest

Benchmark artifacts:
- reports/benchmark.md
- reports/metrics.json
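
Client code will usually want to validate the response envelope shown earlier before touching `data`. The following is a minimal sketch, not the project's own client: the names `Envelope`, `parse_envelope`, and `unwrap` are hypothetical, and treating any status other than `"ok"` as a failure is an assumption, since only the `"ok"` case is documented here.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

# Keys taken from the documented response envelope.
REQUIRED_KEYS = {"status", "data", "meta", "error"}


@dataclass
class Envelope:
    status: str
    data: Any
    meta: dict
    error: Optional[Any]


def parse_envelope(raw: str) -> Envelope:
    """Decode a JSON response body and check that all envelope keys are present."""
    payload = json.loads(raw)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"envelope missing keys: {sorted(missing)}")
    return Envelope(
        status=payload["status"],
        data=payload["data"],
        meta=payload["meta"],
        error=payload["error"],
    )


def unwrap(envelope: Envelope) -> Any:
    """Return the payload, raising if the envelope signals a failure.

    Assumption: a failed call has status != "ok" or a non-null error.
    """
    if envelope.status != "ok" or envelope.error is not None:
        raise RuntimeError(f"request failed: {envelope.error!r}")
    return envelope.data


raw = '{"status": "ok", "data": {}, "meta": {"model_version": "0.1.0", "latency_ms": 0}, "error": null}'
envelope = parse_envelope(raw)
payload = unwrap(envelope)
```

Keeping the key check separate from the status check makes it easier to tell a malformed response from a well-formed error.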
Provides retrieval metrics (precision@k, recall@k, MRR) and answer relevance scoring, with failures sorted into explicit buckets.
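
For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch using their textbook definitions. It is not taken from `evaluate.py`; the function names and the document IDs are illustrative, and precision@k divides by `k` here (some implementations divide by the number of results actually returned).

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant or k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data: two queries, the second retrieves nothing relevant.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p_at_2 = precision_at_k(retrieved, relevant, k=2)
r_at_2 = recall_at_k(retrieved, relevant, k=2)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d9"], {"d1"})])
```
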
Limitations:
- Heuristic answer relevance scoring
- Limited built-in scenarios (legal/support)

Roadmap:
- Add persistence backend for run history
- Add LLM-judge optional evaluation mode
- Add CI regression thresholds against baseline metrics
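
The CI regression threshold item could work roughly as follows. This is a sketch of one possible design, not an existing feature: it assumes metrics are a flat mapping of name to score where higher is better, and `find_regressions`, the metric names, and the default tolerance are all hypothetical. The actual schema of reports/metrics.json is not specified here.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`.

    Assumes higher is better for every metric. Metrics missing from
    `current` are skipped rather than flagged (a design choice).
    """
    regressions: Dict[str, float] = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        drop = base_value - current[name]
        if drop > tolerance:
            # Round to hide floating-point noise in the reported drop.
            regressions[name] = round(drop, 6)
    return regressions


# Hypothetical metric names and values, for illustration only.
baseline = {"precision_at_5": 0.80, "mrr": 0.70}
current = {"precision_at_5": 0.75, "mrr": 0.70}
regressions = find_regressions(baseline, current)
```

A CI job could load both mappings with `json.load` and exit with a non-zero status whenever the returned dictionary is non-empty.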

Documentation:
- docs/ARCHITECTURE.md
- docs/CASE_STUDY.md
- docs/DEMO_SCRIPT_90S.md
- SECURITY.md