keremercin/rag-eval-observatory

rag-eval-observatory

RAG evaluation harness with API endpoints, scenario datasets, and failure taxonomy for regression tracking.

Problem

Most RAG projects demonstrate answers but provide no repeatable evidence of retrieval and grounding quality, so regressions go unnoticed between changes.

Architecture

  • API: src/rag_eval_observatory/api/main.py
  • Datasets: legal + support (src/rag_eval_observatory/datasets.py)
  • Evaluation engine: src/rag_eval_observatory/evaluate.py
  • Error taxonomy: src/rag_eval_observatory/taxonomy.py
  • Persistence: SQLite run store (src/rag_eval_observatory/db.py)

See docs/ARCHITECTURE.md.

Local Run

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python scripts/init_db.py
uvicorn rag_eval_observatory.api.main:app --reload --port 8800

Optional DB path override (export it before starting the server so the process picks it up):

export RAG_EVAL_DB_PATH=data/runs.db
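As a rough illustration of how such an override is typically consumed, the sketch below resolves the SQLite path from RAG_EVAL_DB_PATH and falls back to a default. The helper name resolve_db_path and the default value are assumptions made for this example; the actual logic in src/rag_eval_observatory/db.py may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical default, chosen only for this sketch; the project's real
# default path is not stated in this README.
DEFAULT_DB_PATH = "data/runs.db"


def resolve_db_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the run-store path, preferring RAG_EVAL_DB_PATH when it is set."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default.
    return Path(env.get("RAG_EVAL_DB_PATH") or DEFAULT_DB_PATH)
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.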

API Spec

  • GET /health
  • GET /version
  • POST /v1/eval/run
  • GET /v1/eval/{run_id}
  • GET /v1/eval/summary
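A minimal client sketch for the read-only endpoints, using only the standard library. The base URL assumes the server from Local Run is listening on port 8800; run_url and get_json are hypothetical helper names, and the POST /v1/eval/run request body is omitted because its schema is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the server was started with --port 8800 as in Local Run.
BASE_URL = "http://localhost:8800"


def run_url(run_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /v1/eval/{run_id}, percent-escaping the id."""
    return f"{base_url.rstrip('/')}/v1/eval/{quote(run_id, safe='')}"


def get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode its JSON body (requires a running server)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server:
#   health = get_json(f"{BASE_URL}/health")
#   run = get_json(run_url("some-run-id"))
```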

Response envelope:

{
  "status": "ok",
  "data": {},
  "meta": {"model_version": "0.1.0", "latency_ms": 0},
  "error": null
}
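One way a caller might consume this envelope is sketched below. It assumes that any status other than "ok", or a non-null error, signals failure; EvalApiError and unwrap are hypothetical names for this example and are not part of the project.

```python
import json
from typing import Any, Dict


class EvalApiError(RuntimeError):
    """Raised when an envelope reports a failure (hypothetical helper)."""


def unwrap(envelope: Dict[str, Any]) -> Any:
    """Return the 'data' payload, or raise if the envelope signals an error."""
    # Assumption: success means status == "ok" and error is null.
    if envelope.get("status") != "ok" or envelope.get("error") is not None:
        raise EvalApiError(str(envelope.get("error")))
    return envelope.get("data")


# The sample envelope from the API Spec, parsed from JSON.
sample = json.loads(
    '{"status": "ok", "data": {}, '
    '"meta": {"model_version": "0.1.0", "latency_ms": 0}, "error": null}'
)
payload = unwrap(sample)
```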

Evaluation

pytest

Benchmark artifacts:

  • reports/benchmark.md
  • reports/metrics.json

Results

Reports retrieval metrics (precision@k, recall@k, MRR) and an answer relevance score, with failures grouped into explicit buckets.
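For reference, the three retrieval metrics can be sketched with their standard textbook definitions. This is an illustration only, not the code in src/rag_eval_observatory/evaluate.py; the function names and the sample document ids are assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean over queries of 1 / rank of the first relevant result (0 if none)."""
    reciprocal_ranks = []
    for ranked, relevant in queries:
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0


# Made-up ids: the first relevant document appears at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p3 = precision_at_k(ranked, relevant, 3)   # 1 hit in the top 3
r3 = recall_at_k(ranked, relevant, 3)      # 1 of 2 relevant docs found
mrr = mean_reciprocal_rank([(ranked, relevant), (["d9"], {"d9"})])
```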

Limitations

  • Heuristic answer relevance scoring
  • Limited built-in scenarios (legal/support)

Roadmap

  • Add persistence backend for run history
  • Add an optional LLM-as-judge evaluation mode
  • Add CI regression thresholds against baseline metrics
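As a sketch of what the planned CI regression check could look like, the snippet below compares current metrics against a baseline with a fixed tolerance. It assumes higher is better for every metric and that both inputs are flat name-to-value mappings; the metric keys, the tolerance, and find_regressions are all hypothetical, and the schema of reports/metrics.json is not assumed.

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that dropped more than `tolerance` below the baseline.

    A metric missing from `current` is also reported, with None as its value.
    """
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, base_value in baseline.items():
        current_value = current.get(name)
        if current_value is None or current_value < base_value - tolerance:
            regressions[name] = (base_value, current_value)
    return regressions


# Hypothetical numbers, for illustration only.
baseline = {"precision_at_k": 0.80, "mrr": 0.70}
current = {"precision_at_k": 0.79, "mrr": 0.60}
failed = find_regressions(baseline, current)
```

A CI job could exit with a non-zero status whenever the returned mapping is non-empty.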

Docs

  • docs/ARCHITECTURE.md
  • docs/CASE_STUDY.md
  • docs/DEMO_SCRIPT_90S.md
  • SECURITY.md
