🛡️ LLM Regression Guard — End-to-End MLOps Pipeline for LLM Quality Assurance

An end-to-end MLOps project that combines LLM Evaluation, CI/CD Automation, and Production Observability to catch LLM quality regressions before they reach production.

This project is divided into five major stages:

Stage 1 → 📥 Dataset Builder (Schema + Category Validation)
Stage 2 → 🎯 Scorer (BERTScore + ROUGE-L + LLM Judge)
Stage 3 → 🚦 CI/CD Quality Gate (MLflow Baseline Comparison)
Stage 4 → 📊 Drift Dashboard (Streamlit + Slack Alerts)
Stage 5 → 🌓 Shadow Traffic Monitor (FastAPI + Celery + TimescaleDB)

📌 Problem Statement

This project is based on a real-world LLM-production scenario where:

Every change to a model, prompt template, or scoring configuration risks introducing silent quality regressions that escape standard unit tests.
Production LLM behaviour can drift slowly over weeks in ways no single CI run can detect.
Promoting a new model variant requires safe, evidence-driven decisions — not gut feeling.

The goal: build a fully automated MLOps system that scores every change against a golden dataset, blocks merges on regression, monitors live traffic, and surfaces drift early.

🎯 Project Objectives

🔹 Stage 1: Dataset Engineering

Validate prompts against a strict Pydantic schema
Enforce per-category minimum example counts
Produce a versioned golden dataset via DVC

🔹 Stage 2: Multi-Signal Scoring

Run inference via LiteLLM (provider-agnostic)
Compute deterministic metrics: BERTScore F1 + ROUGE-L
Score with an LLM-as-Judge rubric (factual_accuracy, instruction_following, safety)
Aggregate per category + overall and log to MLflow
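
To make the ROUGE-L signal concrete, here is a minimal, dependency-free sketch of an LCS-based F1. It assumes lowercase whitespace tokenisation, which is simpler than what the rouge_score package does, so its numbers will not match the project's exactly. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 over LCS length (assumption: whitespace tokens, no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the metric only rewards tokens that appear in the same order, a paraphrase with different wording scores low here, which is presumably why the pipeline pairs it with BERTScore.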

🔹 Stage 3: Automated Quality Gate

Compare every PR against the production-tagged baseline
Block merges on category drops > 0.05 or overall < 0.80
Post a Markdown scorecard back onto the PR

🔹 Stage 4: Drift Observability

Visualise score trends, category × version heatmaps, shadow comparison
Fire Slack alerts when rolling 3-run averages dip below threshold

🔹 Stage 5: Live Shadow Monitoring

Silently replay 5% of /chat traffic to a candidate model
Score both production and shadow responses asynchronously
Persist deltas to TimescaleDB for promotion decisions

🏗️ Project Structure

.github
- workflows
  - quality_gate.yml
app (Stage 5 — FastAPI + Celery shadow worker)
- __init__.py
- main.py
- router.py
- middleware.py
- tasks.py
- db.py
config (single source of truth)
- dataset_config.yaml
- scorer_config.yaml
- model_version.yaml
- shadow_config.yaml
dashboards (Stage 4 — Streamlit drift dashboard)
- __init__.py
- app.py
- charts.py
- alerts.py
data
- raw
  - prompts.jsonl
- eval
  - golden_dataset.jsonl
dataset (schema + dataset-level validation)
- __init__.py
- schema.py
- validator.py
docs
- images
mlruns (local MLflow tracking store)
scorer (Stage 2 — scoring pipeline)
- __init__.py
- run_scorer.py
- inference.py
- metrics.py
- judge.py
- aggregate.py
scores_output (scores.json artefact)
- .gitkeep
scripts (CI + dataset glue)
- __init__.py
- build_dataset.py
- check_regression.py
- post_pr_comment.py
tests (hermetic pytest suite)
- __init__.py
- conftest.py
- test_schema.py
- test_scorer.py
- test_regression_check.py
.env.example
.gitignore
LICENSE
README.md
docker-compose.yml
dockerfile
dvc.yaml
dvc.lock
requirements.txt
requirements-ci.txt

🧠 Pipeline Architecture

                            ┌─────────────────────────────┐
prompts.jsonl ──► Stage 1 ──┤  build_dataset (validate)   ├──► golden_dataset.jsonl
                            └─────────────────────────────┘
                                          │
                            ┌─────────────────────────────┐
                            │  Stage 2  run_scorer        │
                            │   inference (LiteLLM)       │
                            │   metrics  (BERTScore +     │
                            │             ROUGE-L)        │
                            │   judge    (LLM rubric)     │
                            └──────────────┬──────────────┘
                                           ▼
                              scores.json  +  MLflow run
                                           │
                            ┌─────────────────────────────┐
                            │  Stage 3  check_regression  ├──► PR pass / fail
                            └─────────────────────────────┘

                            ┌─────────────────────────────┐
                            │  Stage 4  Streamlit         │
              MLflow ◄──────┤   dashboard (drift,         │
              Timescale ◄───┤   shadow comparison,        │
                            │   Slack alerts)             │
                            └─────────────────────────────┘

                            ┌─────────────────────────────┐
            live /chat ────►│  Stage 5  FastAPI app       │
                            │   middleware → 5% shadow    │
                            │   Celery task → judge LLM   │
                            │   delta → TimescaleDB       │
                            └─────────────────────────────┘

🔄 The Five Pipeline Stages — Detailed

🔹 Stage 1: 📥 Dataset Builder

Load raw prompts from data/raw/prompts.jsonl.
Validate each entry against the GoldenExample Pydantic schema:
- prompt must exceed 10 characters
- reference_response must not be blank
- category must match one of the five Literal values
Enforce per-category minimum example counts from dataset_config.yaml.
Write validated entries to data/eval/golden_dataset.jsonl.
Exit with code 1 on any schema or count error — halting the DVC/CI pipeline.

👉 scripts/build_dataset.py, dataset/schema.py, dataset/validator.py
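
The validation rules above can be sketched without any dependencies. The project uses a Pydantic model; this version swaps in a plain dataclass so it runs on the standard library alone. The category names and the `check_rows` helper are hypothetical, since the five categories are not listed here.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

# Hypothetical category names: the README states there are five but does not list them.
CATEGORIES = {"reasoning", "summarisation", "coding", "safety", "factual_qa"}


@dataclass(frozen=True)
class GoldenExample:
    id: str
    prompt: str
    reference_response: str
    category: str
    source: str
    metadata: dict | None = None

    def __post_init__(self) -> None:
        # The three field rules described for the schema.
        if len(self.prompt) <= 10:
            raise ValueError(f"{self.id}: prompt must exceed 10 characters")
        if not self.reference_response.strip():
            raise ValueError(f"{self.id}: reference_response must not be blank")
        if self.category not in CATEGORIES:
            raise ValueError(f"{self.id}: unknown category {self.category!r}")


def check_rows(rows: list[dict], min_counts: dict[str, int]) -> list[str]:
    """Return error messages; an empty list means the dataset passes."""
    errors: list[str] = []
    counts: Counter = Counter()
    for row in rows:
        try:
            example = GoldenExample(**row)
        except (TypeError, ValueError) as exc:
            errors.append(str(exc))
            continue
        counts[example.category] += 1
    # Per-category minimums, as they would be read from dataset_config.yaml.
    for category, minimum in min_counts.items():
        if counts[category] < minimum:
            errors.append(f"{category}: {counts[category]} examples, need {minimum}")
    return errors
```

A build script following this shape would call `sys.exit(1)` whenever the returned list is non-empty, which is what halts the DVC/CI pipeline.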

🔹 Stage 2: 🎯 Scorer

For every example in the golden dataset:

Inference — call the LLM under test via LiteLLM 👉 scorer/inference.py
Deterministic metrics — BERTScore F1 + ROUGE-L (no LLM calls) 👉 scorer/metrics.py
Judge LLM rubric — structured scoring on factual accuracy, instruction following, safety 👉 scorer/judge.py
Aggregation — composite per example, average per category, weighted overall 👉 scorer/aggregate.py
Tracking — log every metric, param, and tag to MLflow
Artefact — write scores_output/scores.json for the next stage

👉 scorer/run_scorer.py
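
The aggregation step can be illustrated with a short sketch. The metric weights are the ones documented for scorer/aggregate.py; the data shapes, the function names, and the normalisation over the categories that are present are assumptions made for illustration.

```python
from collections import defaultdict

# Weights as documented for the composite score; the rest is illustrative.
METRIC_WEIGHTS = {"bertscore": 0.30, "rouge_l": 0.20, "judge_overall": 0.50}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of the three signals for a single example."""
    return sum(scores[name] * weight for name, weight in METRIC_WEIGHTS.items())


def aggregate(examples: list[dict], category_weights: dict[str, float]) -> dict:
    """Average composites per category, then take a weighted overall score."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(composite(ex["scores"]))
    per_category = {c: sum(v) / len(v) for c, v in by_category.items()}
    # Assumption: normalise by the weights of the categories actually present.
    total_weight = sum(category_weights[c] for c in per_category)
    overall = sum(per_category[c] * category_weights[c] for c in per_category) / total_weight
    return {"per_category": per_category, "overall": overall}
```

For example, an answer with BERTScore 0.9, ROUGE-L 0.5, and a judge score of 0.8 gets a composite of 0.27 + 0.10 + 0.40 = 0.77, so the judge contributes half of the final number.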

🔹 Stage 3: 🚦 CI/CD Quality Gate

Load the current run's scores.json.
Fetch the latest MLflow run tagged env=production (the baseline).
Compare every per-category score to the baseline.
Fail the PR (sys.exit(1)) if:
- Any category drops more than regression_alert_delta (0.05) below baseline, or
- The overall score falls below the absolute threshold (0.80).
Fall back to absolute-threshold checks when MLflow is unreachable or no baseline exists.
Post a Markdown scorecard back onto the PR via the GitHub Issues Comments API.

👉 scripts/check_regression.py, scripts/post_pr_comment.py
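
The gate decision can be sketched as a pure function. The three threshold values come from scorer_config.yaml as documented; the dictionary shapes, the function name, and the use of per_category_min in the no-baseline fallback are assumptions, not a description of the actual script.

```python
from __future__ import annotations

OVERALL_MIN = 0.80       # thresholds.overall
PER_CATEGORY_MIN = 0.72  # thresholds.per_category_min
REGRESSION_DELTA = 0.05  # thresholds.regression_alert_delta


def gate_failures(current: dict, baseline: dict | None) -> list[str]:
    """Return the reasons a run should fail; an empty list means it passes."""
    failures = []
    if current["overall"] < OVERALL_MIN:
        failures.append(f"overall {current['overall']:.3f} < {OVERALL_MIN:.2f}")
    for category, score in current["per_category"].items():
        if baseline is None:
            # Fallback path (assumed): no baseline, so use the absolute minimum.
            if score < PER_CATEGORY_MIN:
                failures.append(f"{category} {score:.3f} < {PER_CATEGORY_MIN:.2f}")
            continue
        base = baseline["per_category"].get(category)
        if base is not None and base - score > REGRESSION_DELTA:
            failures.append(f"{category} dropped {base - score:.3f} vs baseline")
    return failures
```

A CI entry point built on this would print the failures and call `sys.exit(1)` when the list is non-empty. Keeping the decision separate from MLflow access also makes every branch easy to cover in tests.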

🔹 Stage 4: 📊 Drift Dashboard

A Streamlit observability surface backed by MLflow + TimescaleDB:

Row 1 — Four summary metric cards (latest overall, total runs, model version, gate status)
Row 2 — Side-by-side: score time-series (with version-bump markers) + category × version heatmap
Row 3 — Production vs shadow rolling-average comparison
Sidebar — Manual alert trigger, threshold reference, MLflow UI link
Slack alerts — fire when rolling 3-run averages dip below threshold (rate-limited to one per hour per category)

👉 dashboards/app.py, dashboards/charts.py, dashboards/alerts.py
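
The rolling-average alert condition can be sketched as follows. The window of three runs is from the description above; the function names, the history structure, and the choice of 0.72 as the threshold in the usage note are assumptions.

```python
from __future__ import annotations


def rolling_mean(values: list[float], window: int = 3) -> float | None:
    """Mean of the most recent `window` values, or None if there are too few."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def categories_to_alert(
    history: dict[str, list[float]], threshold: float, window: int = 3
) -> list[str]:
    """Categories whose rolling mean has dipped below the threshold."""
    flagged = []
    for category, scores in sorted(history.items()):
        mean = rolling_mean(scores, window)
        if mean is not None and mean < threshold:
            flagged.append(category)
    return flagged
```

Called as `categories_to_alert(history, threshold=0.72)`, this flags only sustained dips. A single bad run does not fire an alert unless it pulls the three-run average under the threshold.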

🔹 Stage 5: 🌓 Shadow Traffic Monitor

A FastAPI service that silently measures quality on live traffic:

ShadowMiddleware — hashes every POST /chat request body and fires a Celery task for 5% of calls (scoring runs off the request path, so added latency is limited to the hash and the task enqueue)
Celery worker — runs both production and shadow models, scores with the judge LLM, computes the delta
TimescaleDB hypertable — stores time-series shadow results for fast time-range queries
Hash-based routing — same prompt always lands on the same model variant (deterministic, reproducible)

👉 app/main.py, app/middleware.py, app/tasks.py, app/db.py, app/router.py
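
The sampling decision can be sketched in a few lines. MD5 and the 5% rate are from the description above; the function names and the modulo-100 bucketing are assumptions about how the hash is turned into a percentage.

```python
import hashlib


def shadow_bucket(body: bytes) -> int:
    """Map a request body to a stable bucket in the range 0..99."""
    digest = hashlib.md5(body).hexdigest()
    return int(digest, 16) % 100


def should_shadow(body: bytes, shadow_pct: int = 5) -> bool:
    """True for roughly shadow_pct percent of distinct request bodies."""
    return shadow_bucket(body) < shadow_pct
```

Because the decision depends only on the body, the same prompt is either always shadowed or never shadowed, which makes results reproducible. The trade-off is that a very frequent prompt can skew the sample in either direction.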

📂 File-by-File Explanation

🔹 `app/` — Stage 5 (FastAPI + Celery)

main.py — FastAPI entry point + ShadowMiddleware(shadow_pct=5) registration
router.py — POST /chat (Pydantic ChatRequest → ChatResponse) + GET /health
middleware.py — ShadowMiddleware: MD5-hash routing, fire-and-forget Celery dispatch, body-stream reconstruction
tasks.py — score_shadow_async Celery task: scores both prod and shadow models, computes delta, persists to DB
db.py — psycopg2 + TimescaleDB hypertable for shadow_evals; prompts hashed to BIGINT for privacy

🔹 `scorer/` — Stage 2 (Scoring Pipeline)

run_scorer.py — End-to-end orchestrator; loads dataset, opens MLflow run, scores every example, writes scores.json
inference.py — LiteLLM wrapper; works with OpenAI, Anthropic, Mistral, Cohere, Ollama by changing only the model string
metrics.py — Deterministic BERTScore F1 + ROUGE-L (offline, reproducible)
judge.py — LLM-as-Judge rubric with JSON-fence stripping and overall-fallback computation
aggregate.py — Composite scoring: bertscore × 0.30 + rouge_l × 0.20 + judge_overall × 0.50 per example
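
The fence stripping and overall fallback described for judge.py could look roughly like this. The rubric field names are from the README; the regular expression, the function name, and the choice of a plain mean as the fallback are assumptions.

```python
import json
import re

RUBRIC_FIELDS = ("factual_accuracy", "instruction_following", "safety")
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
# Matches an opening fence with an optional language tag, or a closing fence.
_FENCE_RE = re.compile(rf"^{FENCE}[a-zA-Z]*\s*|\s*{FENCE}$")


def parse_judge_reply(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply, tolerating a surrounding Markdown fence."""
    text = _FENCE_RE.sub("", raw.strip()).strip()
    data = json.loads(text)
    scores = {field: float(data[field]) for field in RUBRIC_FIELDS}
    overall = data.get("overall")
    # Fallback (assumed): mean of the rubric fields when "overall" is missing.
    scores["overall"] = (
        float(overall) if overall is not None else sum(scores.values()) / len(RUBRIC_FIELDS)
    )
    return scores
```

A reply that is not valid JSON still raises `json.JSONDecodeError` here, so a caller would need to decide whether to retry or record a failed judgement.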

🔹 `scripts/` — CI + Dataset Glue

build_dataset.py — Stage 1 entry point
check_regression.py — Stage 3 entry point; the CI gate decision engine
post_pr_comment.py — Builds a Markdown scorecard and posts it directly to the PR
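
As an illustration of the scorecard step, the sketch below renders scores as a Markdown table. The layout, the column names, and the `build_scorecard` name are hypothetical; only the idea of a Markdown scorecard comes from the README.

```python
from __future__ import annotations


def build_scorecard(current: dict, baseline: dict | None, passed: bool) -> str:
    """Render per-category and overall scores as a Markdown table."""
    base_scores = dict((baseline or {}).get("per_category", {}))
    if baseline is not None:
        base_scores["overall"] = baseline["overall"]
    rows = sorted(current["per_category"].items()) + [("overall", current["overall"])]
    lines = [
        f"## LLM Regression Guard: {'PASS' if passed else 'FAIL'}",
        "",
        "| Category | Current | Baseline | Delta |",
        "| --- | ---: | ---: | ---: |",
    ]
    for name, score in rows:
        base = base_scores.get(name)
        if base is None:
            lines.append(f"| {name} | {score:.3f} | n/a | n/a |")
        else:
            lines.append(f"| {name} | {score:.3f} | {base:.3f} | {score - base:+.3f} |")
    return "\n".join(lines)
```

The resulting string can be sent as the `body` field of a JSON payload to GitHub's `POST /repos/{owner}/{repo}/issues/{issue_number}/comments` endpoint, which also accepts pull request numbers.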

🔹 `dashboards/` — Stage 4 (Streamlit + Plotly)

app.py — Streamlit entry point; loads MLflow + TimescaleDB into pandas DataFrames; renders all panels
charts.py — Three Plotly chart builders: score_timeseries, category_heatmap, shadow_comparison
alerts.py — check_and_alert + Slack webhook + cooldown-managed rate limiting
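
The cooldown-based rate limiting can be sketched with a small JSON state file. The one-hour window per category is from the Stage 4 description; the file layout (category name mapped to a Unix timestamp) and the `should_alert` signature are assumptions.

```python
from __future__ import annotations

import json
import time
from pathlib import Path

COOLDOWN_SECONDS = 3600  # one alert per hour per category


def should_alert(category: str, state_file: Path, now: float | None = None) -> bool:
    """Return True and record the send time unless the category is cooling down."""
    now = time.time() if now is None else now
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    last_sent = state.get(category)
    if last_sent is not None and now - last_sent < COOLDOWN_SECONDS:
        return False
    state[category] = now
    state_file.write_text(json.dumps(state))
    return True
```

Persisting the timestamps to disk matters because Streamlit reruns the script on every refresh, so in-memory state alone would not stop repeat alerts.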

🔹 `dataset/` — Schema + Validation

schema.py — GoldenExample Pydantic model with two field validators
validator.py — validate_dataset enforces per-category minimum counts

🔹 `tests/` — Hermetic Pytest Suite

conftest.py — Shared fixtures (in-memory or tmp_path-backed)
test_schema.py — 11 tests over GoldenExample — valid + invalid + boundary cases
test_scorer.py — TestComputeMetrics (real libs), TestAggregateScores (math), TestJudgeScore (mocked LiteLLM)
test_regression_check.py — Mocks MLflowClient; covers every CI gate decision path

🔹 Root Files

.env.example — template for OPENAI_API_KEY, MLFLOW_TRACKING_URI, SLACK_WEBHOOK_URL, REDIS_URL, TIMESCALE_URL
docker-compose.yml — six-service stack: app, worker, redis, mlflow, timescale, dashboard
dockerfile — python:3.11-slim shared image used by every Python service
dvc.yaml — Two-stage DVC DAG: build_dataset → run_scorer
requirements.txt — Full local dev dependencies (192 packages)
requirements-ci.txt — Slimmed CI dependencies (no Windows wheels)

⚙️ Configuration Files

🔸 `config/dataset_config.yaml`

Defines the five behavioural categories, per-category minimum example counts, and dataset-level weights.

🔸 `config/scorer_config.yaml`

The judge model, judge temperature, numerical thresholds, metric weights, MLflow experiment metadata, and per-category overall weights.

thresholds.overall = 0.80
thresholds.per_category_min = 0.72
thresholds.regression_alert_delta = 0.05
metric_weights = {bertscore: 0.30, rouge_l: 0.20, judge_score: 0.50}

🔸 `config/model_version.yaml`

The audit trail — current_model, baseline_model, prompt_template_version, system_prompt_hash, last_updated, updated_by. Changing this file re-triggers the CI workflow.

🔸 `config/shadow_config.yaml`

Shadow sampling percentage, routing strategy, promotion criteria, Celery queue name, result TTL.

📦 Datasets

🔹 Stage 1 — Golden Dataset

JSONL file with one validated GoldenExample per line
Schema fields:
- id
- prompt
- reference_response
- category
- source
- metadata (optional)

🔹 Stage 5 — Shadow Traffic Table (TimescaleDB `shadow_evals` hypertable)

ts (TIMESTAMPTZ — partition key)
prompt_hash (BIGINT)
prod_model
shadow_model
prod_score
shadow_score
delta
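
A possible shape for this table and for the prompt hashing is sketched below. Only the column names, the TIMESTAMPTZ and BIGINT types, and the hypertable on `ts` are from the README. The remaining column types, the SHA-256 choice, and the `prompt_to_bigint` helper are assumptions.

```python
import hashlib

# DDL a db module might execute once at startup (column types partly assumed).
SHADOW_EVALS_DDL = """
CREATE TABLE IF NOT EXISTS shadow_evals (
    ts            TIMESTAMPTZ      NOT NULL,
    prompt_hash   BIGINT           NOT NULL,
    prod_model    TEXT             NOT NULL,
    shadow_model  TEXT             NOT NULL,
    prod_score    DOUBLE PRECISION,
    shadow_score  DOUBLE PRECISION,
    delta         DOUBLE PRECISION
);
SELECT create_hypertable('shadow_evals', 'ts', if_not_exists => TRUE);
"""


def prompt_to_bigint(prompt: str) -> int:
    """Hash a prompt to a signed 64-bit integer that fits a PostgreSQL BIGINT."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

Storing only the hash keeps raw user text out of the metrics table while still allowing rows for the same prompt to be grouped.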

🛠️ Tools, Libraries & Skills Used

🤖 Core ML / LLM Stack

LiteLLM — unified provider-agnostic LLM interface
MLflow 2.11 — experiment tracking + production baseline tagging
BERTScore — semantic similarity via BERT embeddings
rouge_score — ROUGE-L (longest common subsequence)
PyTorch (CPU) — backbone for BERTScore embeddings
HuggingFace transformers + tokenizers

⚙️ MLOps / Infrastructure

DVC — versioned data + reproducible pipeline DAG
GitHub Actions — CI/CD orchestration
Docker + Docker Compose — reproducible local stack
TimescaleDB — time-series PostgreSQL extension
Redis — Celery message broker
Celery 5.3 — async task queue

🌐 Backend / Application

FastAPI 0.110 — the chat API service
Starlette / BaseHTTPMiddleware — ShadowMiddleware foundation
Pydantic 2.6 — schema validation
uvicorn — ASGI server
psycopg2-binary — PostgreSQL/TimescaleDB driver
python-dotenv / PyYAML

📈 Observability / UI

Streamlit 1.32 — drift dashboard UI
Plotly 5.20 — interactive charts
pandas — DataFrame backbone
Slack Webhooks — drift alerts

🧪 Testing

pytest 8.1
unittest.mock — mocks LiteLLM and MLflow client
pytest tmp_path — auto-cleaned temp files

🧠 Skills Demonstrated

MLOps Engineering
CI/CD Engineering with GitHub Actions
LLM Evaluation (deterministic + LLM-as-Judge)
Distributed Systems (Celery, async shadow scoring)
Data Engineering (DVC, schema-driven validation)
Observability (Streamlit + Plotly + TimescaleDB + Slack)
Production Backend (FastAPI + ASGI middleware + hypertable design)
Hermetic Testing (mocks, fixtures, separation of concerns)
Configuration Management (single-source-of-truth YAML)
Container Orchestration

🚀 How to Run

1️⃣ Clone Repository

git clone <repository-url>
cd "LLM Regressor MLOps"

2️⃣ Setup Environment

python -m venv venv
venv\Scripts\activate (Windows)
source venv/bin/activate (Linux/macOS)

3️⃣ Install Dependencies

pip install -r requirements.txt

4️⃣ Configure Environment Variables

copy .env.example .env (Windows)
cp .env.example .env (Linux/macOS)
Edit .env to fill in OPENAI_API_KEY, MLFLOW_TRACKING_URI, etc.

5️⃣ Build the Golden Dataset (Stage 1)

python scripts/build_dataset.py

6️⃣ Run the Scorer (Stage 2)

python scorer/run_scorer.py

7️⃣ Run the Regression Check (Stage 3)

python scripts/check_regression.py

8️⃣ Or Run the Full DAG via DVC

dvc repro

9️⃣ Launch the FastAPI App (Stage 5)

uvicorn app.main:app --reload --port 8000

🔟 Launch the Celery Worker

celery -A app.tasks worker --loglevel=info

1️⃣1️⃣ Launch the Streamlit Dashboard (Stage 4)

streamlit run dashboards/app.py

🏷️ Promoting a Run to Production

The regression check always compares against the latest MLflow run tagged env=production.

python -c "from mlflow.tracking import MlflowClient; MlflowClient().set_tag('<run_id>', 'env', 'production')"

All subsequent CI runs will diff against this baseline.

🧪 Testing

pytest tests/ -v

All test suites are hermetic — no network, no LLM calls, no MLflow server, no Redis, no TimescaleDB. LiteLLM and MLflowClient are mocked; file I/O uses pytest's tmp_path.
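
A hermetic test in this style might look like the sketch below. `call_judge` and `judge_safety` are hypothetical stand-ins for the real LiteLLM-backed functions. In the actual suite the patch target would be the LiteLLM call inside the scorer package rather than a function in the same file.

```python
from unittest.mock import Mock, patch


def call_judge(prompt: str) -> str:
    """Stand-in for the real LLM call; fails loudly if a test reaches it."""
    raise RuntimeError("network access is not allowed in tests")


def judge_safety(prompt: str) -> float:
    """Code under test: asks the judge and converts the reply to a float."""
    return float(call_judge(prompt))


def test_judge_safety_uses_mocked_llm() -> None:
    fake = Mock(return_value="0.9")
    # Swap the module-level function for the duration of the test only.
    with patch.dict(globals(), {"call_judge": fake}):
        assert judge_safety("Is this safe?") == 0.9
    fake.assert_called_once_with("Is this safe?")
```

Making the unpatched function raise means an accidental real call shows up as a test failure instead of a silent network request.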

🐳 Docker Deployment

docker compose up

| Service | Image / build | Port |
| --- | --- | --- |
| `app` | local `dockerfile` | `8000` |
| `worker` | local `dockerfile` | — |
| `redis` | `redis:7-alpine` | `6379` |
| `mlflow` | `ghcr.io/mlflow/mlflow:v2.11.0` | `5000` |
| `timescale` | `timescale/timescaledb:latest-pg15` | `5432` |
| `dashboard` | local `dockerfile` | `8501` |

🔁 CI / Quality Gate

The CI gate runs automatically on every push or PR that touches:

data/raw/prompts.jsonl
config/scorer_config.yaml
config/model_version.yaml
scorer/**
scripts/**

Required GitHub configuration:

Secret OPENAI_API_KEY
Variable MLFLOW_TRACKING_URI (optional)

A green run: [screenshot not included]

A blocked PR: [screenshot not included]

📈 Results

Every PR is automatically scored against a versioned golden dataset
Quality regressions are blocked at the PR level before reaching production
Per-category drift is visible on the Streamlit dashboard with version-bump markers
5% of live /chat traffic is silently evaluated for shadow-vs-production comparison
Slack alerts fire when rolling 3-run averages dip below threshold
All scoring runs are versioned in MLflow with full param + metric history

📊 Dashboard Previews

⚠️ Challenges

Handling non-deterministic LLM responses during scoring (mitigated via temperature=0.0 and judge fallback logic)
Reliable JSON parsing from judge LLM responses (solved via Markdown-fence stripping + overall-fallback)
Near-zero-latency shadow routing on production traffic (solved via hash-based async Celery dispatch + body-stream reconstruction)
MLflow connectivity edge cases in CI (solved via graceful fallback to local ./mlruns and absolute threshold checks)
Preventing Slack alert spam during dashboard refresh (solved via .alert_cooldown.json rate limiting)
Cross-platform dependency parity between Windows and Linux CI runners (solved via split requirements.txt / requirements-ci.txt)

🧠 Key Learnings

End-to-end MLOps pipeline design for LLM applications
Building an LLM-as-Judge evaluation system with structured rubrics
Combining deterministic and LLM-based metrics for robust scoring
Designing a CI/CD quality gate that blocks regressions at the PR level
Production observability with Streamlit, Plotly, MLflow, and TimescaleDB
Shadow traffic engineering with negligible added latency to live requests
Async task processing with Celery + Redis
Hermetic testing via mocks and pytest fixtures
Combining MLOps + Backend + Data Engineering + LLM Evaluation in one cohesive system

👤 Author

Prakhar Srivastava
Data Scientist | AI Engineer
