Skip to content

prakhar-189/LLM-Regression-Guard

Repository files navigation

πŸ›‘οΈ LLM Regression Guard β€” End-to-End MLOps Pipeline for LLM Quality Assurance

An end-to-end MLOps project that combines LLM Evaluation, CI/CD Automation, and Production Observability to catch LLM quality regressions before they reach production.

This project is divided into five major stages:

  • Stage 1 β†’ πŸ“₯ Dataset Builder (Schema + Category Validation)
  • Stage 2 β†’ 🎯 Scorer (BERTScore + ROUGE-L + LLM Judge)
  • Stage 3 β†’ 🚦 CI/CD Quality Gate (MLflow Baseline Comparison)
  • Stage 4 β†’ πŸ“Š Drift Dashboard (Streamlit + Slack Alerts)
  • Stage 5 β†’ πŸŒ“ Shadow Traffic Monitor (FastAPI + Celery + TimescaleDB)

CI Pipeline Workflow


πŸ“Œ Problem Statement

This project is based on a real-world LLM-production scenario where:

  • Every change to a model, prompt template, or scoring configuration risks introducing silent quality regressions that escape standard unit tests.
  • Production LLM behaviour can drift slowly over weeks in ways no single CI run can detect.
  • Promoting a new model variant requires safe, evidence-driven decisions β€” not gut feeling.

The goal: build a fully automated MLOps system that scores every change against a golden dataset, blocks merges on regression, monitors live traffic, and surfaces drift early.


🎯 Project Objectives

πŸ”Ή Stage 1: Dataset Engineering

  • Validate prompts against a strict Pydantic schema
  • Enforce per-category minimum example counts
  • Produce a versioned golden dataset via DVC

πŸ”Ή Stage 2: Multi-Signal Scoring

  • Run inference via LiteLLM (provider-agnostic)
  • Compute deterministic metrics: BERTScore F1 + ROUGE-L
  • Score with an LLM-as-Judge rubric (factual_accuracy, instruction_following, safety)
  • Aggregate per category + overall and log to MLflow

πŸ”Ή Stage 3: Automated Quality Gate

  • Compare every PR against the production-tagged baseline
  • Block merges on category drops > 0.05 or overall < 0.80
  • Post a Markdown scorecard back onto the PR

πŸ”Ή Stage 4: Drift Observability

  • Visualise score trends, category Γ— version heatmaps, shadow comparison
  • Fire Slack alerts when rolling 3-run averages dip below threshold

πŸ”Ή Stage 5: Live Shadow Monitoring

  • Silently replay 5% of /chat traffic to a candidate model
  • Score both production and shadow responses asynchronously
  • Persist deltas to TimescaleDB for promotion decisions

πŸ—οΈ Project Structure

  • .github
    • workflows
      • quality_gate.yml
  • app (Stage 5 β€” FastAPI + Celery shadow worker)
    • init.py
    • main.py
    • router.py
    • middleware.py
    • tasks.py
    • db.py
  • config (single source of truth)
    • dataset_config.yaml
    • scorer_config.yaml
    • model_version.yaml
    • shadow_config.yaml
  • dashboards (Stage 4 β€” Streamlit drift dashboard)
    • init.py
    • app.py
    • charts.py
    • alerts.py
  • data
    • raw
      • prompts.jsonl
    • eval
      • golden_dataset.jsonl
  • dataset (schema + dataset-level validation)
    • init.py
    • schema.py
    • validator.py
  • docs
    • images
  • mlruns (local MLflow tracking store)
  • scorer (Stage 2 β€” scoring pipeline)
    • init.py
    • run_scorer.py
    • inference.py
    • metrics.py
    • judge.py
    • aggregate.py
  • scores_output (scores.json artefact)
    • .gitkeep
  • scripts (CI + dataset glue)
    • init.py
    • build_dataset.py
    • check_regression.py
    • post_pr_comment.py
  • tests (hermetic pytest suite)
    • init.py
    • conftest.py
    • test_schema.py
    • test_scorer.py
    • test_regression_check.py
  • .env.example
  • .gitignore
  • LICENSE
  • README.md
  • docker-compose.yml
  • dockerfile
  • dvc.yaml
  • dvc.lock
  • requirements.txt
  • requirements-ci.txt

🧠 Pipeline Architecture

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
prompts.jsonl ──► Stage 1 ───  build_dataset (validate)   β”œβ”€β”€β–Ί golden_dataset.jsonl
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                          β”‚
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚  Stage 2  run_scorer        β”‚
                            β”‚   inference (LiteLLM)       β”‚
                            β”‚   metrics  (BERTScore +     β”‚
                            β”‚             ROUGE-L)        β”‚
                            β”‚   judge    (LLM rubric)     β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β–Ό
                              scores.json  +  MLflow run
                                           β”‚
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚  Stage 3  check_regression  β”œβ”€β”€β–Ί PR pass / fail
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚  Stage 4  Streamlit         β”‚
              MLflow ◄───────   dashboard (drift,         β”‚
              Timescale ◄────   shadow comparison,        β”‚
                            β”‚   Slack alerts)             β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            live /chat ────►│  Stage 5  FastAPI app       β”‚
                            β”‚   middleware β†’ 5% shadow    β”‚
                            β”‚   Celery task β†’ judge LLM   β”‚
                            β”‚   delta β†’ TimescaleDB       β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”„ The Five Pipeline Stages β€” Detailed

πŸ”Ή Stage 1: πŸ“₯ Dataset Builder

  1. Load raw prompts from data/raw/prompts.jsonl.
  2. Validate each entry against the GoldenExample Pydantic schema:
    • prompt must exceed 10 characters
    • reference_response must not be blank
    • category must match one of the five Literal values
  3. Enforce per-category minimum example counts from dataset_config.yaml.
  4. Write validated entries to data/eval/golden_dataset.jsonl.
  5. Exit with code 1 on any schema or count error β€” halting the DVC/CI pipeline.

πŸ‘‰ scripts/build_dataset.py, dataset/schema.py, dataset/validator.py

πŸ”Ή Stage 2: 🎯 Scorer

For every example in the golden dataset:

  1. Inference β€” call the LLM under test via LiteLLM πŸ‘‰ scorer/inference.py
  2. Deterministic metrics β€” BERTScore F1 + ROUGE-L (no LLM calls) πŸ‘‰ scorer/metrics.py
  3. Judge LLM rubric β€” structured scoring on factual accuracy, instruction following, safety πŸ‘‰ scorer/judge.py
  4. Aggregation β€” composite per example, average per category, weighted overall πŸ‘‰ scorer/aggregate.py
  5. Tracking β€” log every metric, param, and tag to MLflow
  6. Artefact β€” write scores_output/scores.json for the next stage

πŸ‘‰ scorer/run_scorer.py

πŸ”Ή Stage 3: 🚦 CI/CD Quality Gate

  1. Load the current run's scores.json.
  2. Fetch the latest MLflow run tagged env=production (the baseline).
  3. Compare every per-category score to the baseline.
  4. Fail the PR (sys.exit(1)) if:
    • Any category drops more than regression_alert_delta (0.05) below baseline, or
    • The overall score falls below the absolute threshold (0.80).
  5. Fallback to absolute-threshold checks when MLflow is unreachable or no baseline exists.
  6. Post a Markdown scorecard back onto the PR via the GitHub Issues Comments API.

πŸ‘‰ scripts/check_regression.py, scripts/post_pr_comment.py

πŸ”Ή Stage 4: πŸ“Š Drift Dashboard

A Streamlit observability surface backed by MLflow + TimescaleDB:

  1. Row 1 β€” Four summary metric cards (latest overall, total runs, model version, gate status)
  2. Row 2 β€” Side-by-side: score time-series (with version-bump markers) + category Γ— version heatmap
  3. Row 3 β€” Production vs shadow rolling-average comparison
  4. Sidebar β€” Manual alert trigger, threshold reference, MLflow UI link
  5. Slack alerts β€” fire when rolling 3-run averages dip below threshold (rate-limited to one per hour per category)

πŸ‘‰ dashboards/app.py, dashboards/charts.py, dashboards/alerts.py

πŸ”Ή Stage 5: πŸŒ“ Shadow Traffic Monitor

A FastAPI service that silently measures quality on live traffic:

  1. ShadowMiddleware β€” hashes every POST /chat request body, fires a Celery task for 5% of calls (zero added latency)
  2. Celery worker β€” runs both production and shadow models, scores with the judge LLM, computes the delta
  3. TimescaleDB hypertable β€” stores time-series shadow results for fast time-range queries
  4. Hash-based routing β€” same prompt always lands on the same model variant (deterministic, reproducible)

πŸ‘‰ app/main.py, app/middleware.py, app/tasks.py, app/db.py, app/router.py


πŸ“‚ File-by-File Explanation

πŸ”Ή app/ β€” Stage 5 (FastAPI + Celery)

  • main.py β€” FastAPI entry point + ShadowMiddleware(shadow_pct=5) registration
  • router.py β€” POST /chat (Pydantic ChatRequest β†’ ChatResponse) + GET /health
  • middleware.py β€” ShadowMiddleware: MD5-hash routing, fire-and-forget Celery dispatch, body-stream reconstruction
  • tasks.py β€” score_shadow_async Celery task: scores both prod and shadow models, computes delta, persists to DB
  • db.py β€” psycopg2 + TimescaleDB hypertable for shadow_evals; prompts hashed to BIGINT for privacy

πŸ”Ή scorer/ β€” Stage 2 (Scoring Pipeline)

  • run_scorer.py β€” End-to-end orchestrator; loads dataset, opens MLflow run, scores every example, writes scores.json
  • inference.py β€” LiteLLM wrapper; works with OpenAI, Anthropic, Mistral, Cohere, Ollama by changing only the model string
  • metrics.py β€” Deterministic BERTScore F1 + ROUGE-L (offline, reproducible)
  • judge.py β€” LLM-as-Judge rubric with JSON-fence stripping and overall-fallback computation
  • aggregate.py β€” Composite scoring: bertscore Γ— 0.30 + rouge_l Γ— 0.20 + judge_overall Γ— 0.50 per example

πŸ”Ή scripts/ β€” CI + Dataset Glue

  • build_dataset.py β€” Stage 1 entry point
  • check_regression.py β€” Stage 3 entry point; the CI gate decision engine
  • post_pr_comment.py β€” Builds a Markdown scorecard and posts it directly to the PR

πŸ”Ή dashboards/ β€” Stage 4 (Streamlit + Plotly)

  • app.py β€” Streamlit entry point; loads MLflow + TimescaleDB into pandas DataFrames; renders all panels
  • charts.py β€” Three Plotly chart builders: score_timeseries, category_heatmap, shadow_comparison
  • alerts.py β€” check_and_alert + Slack webhook + cooldown-managed rate limiting

πŸ”Ή dataset/ β€” Schema + Validation

  • schema.py β€” GoldenExample Pydantic model with two field validators
  • validator.py β€” validate_dataset enforces per-category minimum counts

πŸ”Ή tests/ β€” Hermetic Pytest Suite

  • conftest.py β€” Shared fixtures (in-memory or tmp_path-backed)
  • test_schema.py β€” 11 tests over GoldenExample β€” valid + invalid + boundary cases
  • test_scorer.py β€” TestComputeMetrics (real libs), TestAggregateScores (math), TestJudgeScore (mocked LiteLLM)
  • test_regression_check.py β€” Mocks MLflowClient; covers every CI gate decision path

πŸ”Ή Root Files

  • .env.example β€” template for OPENAI_API_KEY, MLFLOW_TRACKING_URI, SLACK_WEBHOOK_URL, REDIS_URL, TIMESCALE_URL
  • docker-compose.yml β€” six-service stack: app, worker, redis, mlflow, timescale, dashboard
  • dockerfile β€” python:3.11-slim shared image used by every Python service
  • dvc.yaml β€” Two-stage DVC DAG: build_dataset β†’ run_scorer
  • requirements.txt β€” Full local dev dependencies (192 packages)
  • requirements-ci.txt β€” Slimmed CI dependencies (no Windows wheels)

βš™οΈ Configuration Files

πŸ”Έ config/dataset_config.yaml

Defines the five behavioural categories, per-category minimum example counts, and dataset-level weights.

πŸ”Έ config/scorer_config.yaml

The judge model, judge temperature, numerical thresholds, metric weights, MLflow experiment metadata, and per-category overall weights.

  • thresholds.overall = 0.80
  • thresholds.per_category_min = 0.72
  • thresholds.regression_alert_delta = 0.05
  • metric_weights = {bertscore: 0.30, rouge_l: 0.20, judge_score: 0.50}

πŸ”Έ config/model_version.yaml

The audit trail β€” current_model, baseline_model, prompt_template_version, system_prompt_hash, last_updated, updated_by. Changing this file re-triggers the CI workflow.

πŸ”Έ config/shadow_config.yaml

Shadow sampling percentage, routing strategy, promotion criteria, Celery queue name, result TTL.


πŸ“¦ Datasets

πŸ”Ή Stage 1 β€” Golden Dataset

  • JSONL file with one validated GoldenExample per line
  • Schema fields:
    • id
    • prompt
    • reference_response
    • category
    • source
    • metadata (optional)

πŸ”Ή Stage 5 β€” Shadow Traffic Table (TimescaleDB shadow_evals hypertable)

  • ts (TIMESTAMPTZ β€” partition key)
  • prompt_hash (BIGINT)
  • prod_model
  • shadow_model
  • prod_score
  • shadow_score
  • delta

πŸ› οΈ Tools, Libraries & Skills Used

πŸ€– Core ML / LLM Stack

  • LiteLLM β€” unified provider-agnostic LLM interface
  • MLflow 2.11 β€” experiment tracking + production baseline tagging
  • BERTScore β€” semantic similarity via BERT embeddings
  • rouge_score β€” ROUGE-L (longest common subsequence)
  • PyTorch (CPU) β€” backbone for BERTScore embeddings
  • HuggingFace transformers + tokenizers

βš™οΈ MLOps / Infrastructure

  • DVC β€” versioned data + reproducible pipeline DAG
  • GitHub Actions β€” CI/CD orchestration
  • Docker + Docker Compose β€” reproducible local stack
  • TimescaleDB β€” time-series PostgreSQL extension
  • Redis β€” Celery message broker
  • Celery 5.3 β€” async task queue

🌐 Backend / Application

  • FastAPI 0.110 β€” the chat API service
  • Starlette / BaseHTTPMiddleware β€” ShadowMiddleware foundation
  • Pydantic 2.6 β€” schema validation
  • uvicorn β€” ASGI server
  • psycopg2-binary β€” PostgreSQL/TimescaleDB driver
  • python-dotenv / PyYAML

πŸ“ˆ Observability / UI

  • Streamlit 1.32 β€” drift dashboard UI
  • Plotly 5.20 β€” interactive charts
  • pandas β€” DataFrame backbone
  • Slack Webhooks β€” drift alerts

πŸ§ͺ Testing

  • pytest 8.1
  • unittest.mock β€” mocks LiteLLM and MLflow client
  • pytest tmp_path β€” auto-cleaned temp files

🧠 Skills Demonstrated

  • MLOps Engineering
  • CI/CD Engineering with GitHub Actions
  • LLM Evaluation (deterministic + LLM-as-Judge)
  • Distributed Systems (Celery, async shadow scoring)
  • Data Engineering (DVC, schema-driven validation)
  • Observability (Streamlit + Plotly + TimescaleDB + Slack)
  • Production Backend (FastAPI + ASGI middleware + hypertable design)
  • Hermetic Testing (mocks, fixtures, separation of concerns)
  • Configuration Management (single-source-of-truth YAML)
  • Container Orchestration

πŸš€ How to Run

1️⃣ Clone Repository

  • git clone
  • cd "LLM Regressor MLOps"

2️⃣ Setup Environment

  • python -m venv venv
  • venv\Scripts\activate

3️⃣ Install Dependencies

  • pip install -r requirements.txt

4️⃣ Configure Environment Variables

  • copy .env.example .env
  • Edit .env to fill in OPENAI_API_KEY, MLFLOW_TRACKING_URI, etc.

5️⃣ Build the Golden Dataset (Stage 1)

  • python scripts/build_dataset.py

6️⃣ Run the Scorer (Stage 2)

  • python scorer/run_scorer.py

7️⃣ Run the Regression Check (Stage 3)

  • python scripts/check_regression.py

8️⃣ Or Run the Full DAG via DVC

  • dvc repro

9️⃣ Launch the FastAPI App (Stage 5)

  • uvicorn app.main:app --reload --port 8000

πŸ”Ÿ Launch the Celery Worker

  • celery -A app.tasks worker --loglevel=info

1️⃣1️⃣ Launch the Streamlit Dashboard (Stage 4)

  • streamlit run dashboards/app.py

🏷️ Promoting a Run to Production

The regression check always compares against the latest MLflow run tagged env=production.

  • mlflow runs set-tag <run_id> env production

All subsequent CI runs will diff against this baseline.

MLflow run history


πŸ§ͺ Testing

  • pytest tests/ -v

All test suites are hermetic β€” no network, no LLM calls, no MLflow server, no Redis, no TimescaleDB. LiteLLM and MLflowClient are mocked; file I/O uses pytest's tmp_path.


🐳 Docker Deployment

  • docker compose up
Service Image / build Port
app local dockerfile 8000
worker local dockerfile β€”
redis redis:7-alpine 6379
mlflow ghcr.io/mlflow/mlflow:v2.11.0 5000
timescale timescale/timescaledb:latest-pg15 5432
dashboard local dockerfile 8501

πŸ” CI / Quality Gate

The CI gate runs automatically on every push or PR that touches:

  • data/raw/prompts.jsonl
  • config/scorer_config.yaml
  • config/model_version.yaml
  • scorer/**
  • scripts/**

Required GitHub configuration:

  • Secret OPENAI_API_KEY
  • Variable MLFLOW_TRACKING_URI (optional)

A green run:

CI gate passed

A blocked PR:

CI gate failed


πŸ“ˆ Results

  • Every PR is automatically scored against a versioned golden dataset
  • Quality regressions are blocked at the PR level before reaching production
  • Per-category drift is visible on the Streamlit dashboard with version-bump markers
  • 5% of live /chat traffic is silently evaluated for shadow-vs-production comparison
  • Slack alerts fire when rolling 3-run averages dip below threshold
  • All scoring runs are versioned in MLflow with full param + metric history

πŸ“Š Dashboard Previews

Dashboard overview Dashboard charts
Dashboard shadow traffic End-to-end workflow

⚠️ Challenges

  • Handling non-deterministic LLM responses during scoring (solved via temperature=0.0 and judge fallback logic)
  • Reliable JSON parsing from judge LLM responses (solved via Markdown-fence stripping + overall-fallback)
  • Zero-latency shadow routing on production traffic (solved via hash-based async Celery dispatch + body-stream reconstruction)
  • MLflow connectivity edge cases in CI (solved via graceful fallback to local ./mlruns and absolute threshold checks)
  • Preventing Slack alert spam during dashboard refresh (solved via .alert_cooldown.json rate limiting)
  • Cross-platform dependency parity (Windows vs Linux CI runners) β€” solved via split requirements.txt / requirements-ci.txt

🧠 Key Learnings

  • End-to-end MLOps pipeline design for LLM applications
  • Building an LLM-as-Judge evaluation system with structured rubrics
  • Combining deterministic and LLM-based metrics for robust scoring
  • Designing a CI/CD quality gate that blocks regressions at the PR level
  • Production observability with Streamlit, Plotly, MLflow, and TimescaleDB
  • Shadow traffic engineering with zero added latency to live requests
  • Async task processing with Celery + Redis
  • Hermetic testing via mocks and pytest fixtures
  • Combining MLOps + Backend + Data Engineering + LLM Evaluation in one cohesive system

πŸ‘€ Author

  • Prakhar Srivastava
  • Data Scientist | AI Engineer | Data Scientist

About

End-to-end MLOps pipeline that catches LLM quality regressions before production. Every PR is scored against a versioned golden dataset using BERTScore + ROUGE-L + an LLM-as-Judge rubric, compared to the MLflow production baseline, and shadowed against 5% of live traffic. FastAPI + Celery + TimescaleDB + Streamlit + DVC + GitHub Actions.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages