An end-to-end MLOps project that combines LLM Evaluation, CI/CD Automation, and Production Observability to catch LLM quality regressions before they reach production.
This project is divided into five major stages:
- Stage 1 β π₯ Dataset Builder (Schema + Category Validation)
- Stage 2 β π― Scorer (BERTScore + ROUGE-L + LLM Judge)
- Stage 3 β π¦ CI/CD Quality Gate (MLflow Baseline Comparison)
- Stage 4 β π Drift Dashboard (Streamlit + Slack Alerts)
- Stage 5 β π Shadow Traffic Monitor (FastAPI + Celery + TimescaleDB)
This project is based on a real-world LLM-production scenario where:
- Every change to a model, prompt template, or scoring configuration risks introducing silent quality regressions that escape standard unit tests.
- Production LLM behaviour can drift slowly over weeks in ways no single CI run can detect.
- Promoting a new model variant requires safe, evidence-driven decisions β not gut feeling.
The goal: build a fully automated MLOps system that scores every change against a golden dataset, blocks merges on regression, monitors live traffic, and surfaces drift early.
- Validate prompts against a strict Pydantic schema
- Enforce per-category minimum example counts
- Produce a versioned golden dataset via DVC
- Run inference via LiteLLM (provider-agnostic)
- Compute deterministic metrics: BERTScore F1 + ROUGE-L
- Score with an LLM-as-Judge rubric (factual_accuracy, instruction_following, safety)
- Aggregate per category + overall and log to MLflow
- Compare every PR against the production-tagged baseline
- Block merges on category drops > 0.05 or overall < 0.80
- Post a Markdown scorecard back onto the PR
- Visualise score trends, category Γ version heatmaps, shadow comparison
- Fire Slack alerts when rolling 3-run averages dip below threshold
- Silently replay 5% of
/chattraffic to a candidate model - Score both production and shadow responses asynchronously
- Persist deltas to TimescaleDB for promotion decisions
- .github
- workflows
- quality_gate.yml
- workflows
- app (Stage 5 β FastAPI + Celery shadow worker)
- init.py
- main.py
- router.py
- middleware.py
- tasks.py
- db.py
- config (single source of truth)
- dataset_config.yaml
- scorer_config.yaml
- model_version.yaml
- shadow_config.yaml
- dashboards (Stage 4 β Streamlit drift dashboard)
- init.py
- app.py
- charts.py
- alerts.py
- data
- raw
- prompts.jsonl
- eval
- golden_dataset.jsonl
- raw
- dataset (schema + dataset-level validation)
- init.py
- schema.py
- validator.py
- docs
- images
- mlruns (local MLflow tracking store)
- scorer (Stage 2 β scoring pipeline)
- init.py
- run_scorer.py
- inference.py
- metrics.py
- judge.py
- aggregate.py
- scores_output (scores.json artefact)
- .gitkeep
- scripts (CI + dataset glue)
- init.py
- build_dataset.py
- check_regression.py
- post_pr_comment.py
- tests (hermetic pytest suite)
- init.py
- conftest.py
- test_schema.py
- test_scorer.py
- test_regression_check.py
- .env.example
- .gitignore
- LICENSE
- README.md
- docker-compose.yml
- dockerfile
- dvc.yaml
- dvc.lock
- requirements.txt
- requirements-ci.txt
βββββββββββββββββββββββββββββββ
prompts.jsonl βββΊ Stage 1 βββ€ build_dataset (validate) ββββΊ golden_dataset.jsonl
βββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββ
β Stage 2 run_scorer β
β inference (LiteLLM) β
β metrics (BERTScore + β
β ROUGE-L) β
β judge (LLM rubric) β
ββββββββββββββββ¬βββββββββββββββ
βΌ
scores.json + MLflow run
β
βββββββββββββββββββββββββββββββ
β Stage 3 check_regression ββββΊ PR pass / fail
βββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββ
β Stage 4 Streamlit β
MLflow ββββββββ€ dashboard (drift, β
Timescale βββββ€ shadow comparison, β
β Slack alerts) β
βββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββ
live /chat βββββΊβ Stage 5 FastAPI app β
β middleware β 5% shadow β
β Celery task β judge LLM β
β delta β TimescaleDB β
βββββββββββββββββββββββββββββββ
- Load raw prompts from
data/raw/prompts.jsonl. - Validate each entry against the
GoldenExamplePydantic schema:promptmust exceed 10 charactersreference_responsemust not be blankcategorymust match one of the five Literal values
- Enforce per-category minimum example counts from
dataset_config.yaml. - Write validated entries to
data/eval/golden_dataset.jsonl. - Exit with code 1 on any schema or count error β halting the DVC/CI pipeline.
π scripts/build_dataset.py, dataset/schema.py, dataset/validator.py
For every example in the golden dataset:
- Inference β call the LLM under test via LiteLLM π
scorer/inference.py - Deterministic metrics β BERTScore F1 + ROUGE-L (no LLM calls) π
scorer/metrics.py - Judge LLM rubric β structured scoring on factual accuracy, instruction following, safety π
scorer/judge.py - Aggregation β composite per example, average per category, weighted overall π
scorer/aggregate.py - Tracking β log every metric, param, and tag to MLflow
- Artefact β write
scores_output/scores.jsonfor the next stage
π scorer/run_scorer.py
- Load the current run's
scores.json. - Fetch the latest MLflow run tagged
env=production(the baseline). - Compare every per-category score to the baseline.
- Fail the PR (
sys.exit(1)) if:- Any category drops more than
regression_alert_delta(0.05) below baseline, or - The overall score falls below the absolute threshold (0.80).
- Any category drops more than
- Fallback to absolute-threshold checks when MLflow is unreachable or no baseline exists.
- Post a Markdown scorecard back onto the PR via the GitHub Issues Comments API.
π scripts/check_regression.py, scripts/post_pr_comment.py
A Streamlit observability surface backed by MLflow + TimescaleDB:
- Row 1 β Four summary metric cards (latest overall, total runs, model version, gate status)
- Row 2 β Side-by-side: score time-series (with version-bump markers) + category Γ version heatmap
- Row 3 β Production vs shadow rolling-average comparison
- Sidebar β Manual alert trigger, threshold reference, MLflow UI link
- Slack alerts β fire when rolling 3-run averages dip below threshold (rate-limited to one per hour per category)
π dashboards/app.py, dashboards/charts.py, dashboards/alerts.py
A FastAPI service that silently measures quality on live traffic:
- ShadowMiddleware β hashes every
POST /chatrequest body, fires a Celery task for 5% of calls (zero added latency) - Celery worker β runs both production and shadow models, scores with the judge LLM, computes the delta
- TimescaleDB hypertable β stores time-series shadow results for fast time-range queries
- Hash-based routing β same prompt always lands on the same model variant (deterministic, reproducible)
π app/main.py, app/middleware.py, app/tasks.py, app/db.py, app/router.py
main.pyβ FastAPI entry point +ShadowMiddleware(shadow_pct=5)registrationrouter.pyβPOST /chat(PydanticChatRequestβChatResponse) +GET /healthmiddleware.pyβShadowMiddleware: MD5-hash routing, fire-and-forget Celery dispatch, body-stream reconstructiontasks.pyβscore_shadow_asyncCelery task: scores both prod and shadow models, computes delta, persists to DBdb.pyβ psycopg2 + TimescaleDB hypertable forshadow_evals; prompts hashed to BIGINT for privacy
run_scorer.pyβ End-to-end orchestrator; loads dataset, opens MLflow run, scores every example, writesscores.jsoninference.pyβ LiteLLM wrapper; works with OpenAI, Anthropic, Mistral, Cohere, Ollama by changing only the model stringmetrics.pyβ Deterministic BERTScore F1 + ROUGE-L (offline, reproducible)judge.pyβ LLM-as-Judge rubric with JSON-fence stripping and overall-fallback computationaggregate.pyβ Composite scoring:bertscore Γ 0.30 + rouge_l Γ 0.20 + judge_overall Γ 0.50per example
build_dataset.pyβ Stage 1 entry pointcheck_regression.pyβ Stage 3 entry point; the CI gate decision enginepost_pr_comment.pyβ Builds a Markdown scorecard and posts it directly to the PR
app.pyβ Streamlit entry point; loads MLflow + TimescaleDB into pandas DataFrames; renders all panelscharts.pyβ Three Plotly chart builders:score_timeseries,category_heatmap,shadow_comparisonalerts.pyβcheck_and_alert+ Slack webhook + cooldown-managed rate limiting
schema.pyβGoldenExamplePydantic model with two field validatorsvalidator.pyβvalidate_datasetenforces per-category minimum counts
conftest.pyβ Shared fixtures (in-memory ortmp_path-backed)test_schema.pyβ 11 tests overGoldenExampleβ valid + invalid + boundary casestest_scorer.pyβTestComputeMetrics(real libs),TestAggregateScores(math),TestJudgeScore(mocked LiteLLM)test_regression_check.pyβ MocksMLflowClient; covers every CI gate decision path
.env.exampleβ template forOPENAI_API_KEY,MLFLOW_TRACKING_URI,SLACK_WEBHOOK_URL,REDIS_URL,TIMESCALE_URLdocker-compose.ymlβ six-service stack: app, worker, redis, mlflow, timescale, dashboarddockerfileβpython:3.11-slimshared image used by every Python servicedvc.yamlβ Two-stage DVC DAG:build_datasetβrun_scorerrequirements.txtβ Full local dev dependencies (192 packages)requirements-ci.txtβ Slimmed CI dependencies (no Windows wheels)
Defines the five behavioural categories, per-category minimum example counts, and dataset-level weights.
The judge model, judge temperature, numerical thresholds, metric weights, MLflow experiment metadata, and per-category overall weights.
thresholds.overall = 0.80thresholds.per_category_min = 0.72thresholds.regression_alert_delta = 0.05metric_weights = {bertscore: 0.30, rouge_l: 0.20, judge_score: 0.50}
The audit trail β current_model, baseline_model, prompt_template_version, system_prompt_hash, last_updated, updated_by. Changing this file re-triggers the CI workflow.
Shadow sampling percentage, routing strategy, promotion criteria, Celery queue name, result TTL.
- JSONL file with one validated
GoldenExampleper line - Schema fields:
- id
- prompt
- reference_response
- category
- source
- metadata (optional)
- ts (TIMESTAMPTZ β partition key)
- prompt_hash (BIGINT)
- prod_model
- shadow_model
- prod_score
- shadow_score
- delta
- LiteLLM β unified provider-agnostic LLM interface
- MLflow 2.11 β experiment tracking + production baseline tagging
- BERTScore β semantic similarity via BERT embeddings
- rouge_score β ROUGE-L (longest common subsequence)
- PyTorch (CPU) β backbone for BERTScore embeddings
- HuggingFace transformers + tokenizers
- DVC β versioned data + reproducible pipeline DAG
- GitHub Actions β CI/CD orchestration
- Docker + Docker Compose β reproducible local stack
- TimescaleDB β time-series PostgreSQL extension
- Redis β Celery message broker
- Celery 5.3 β async task queue
- FastAPI 0.110 β the chat API service
- Starlette / BaseHTTPMiddleware β ShadowMiddleware foundation
- Pydantic 2.6 β schema validation
- uvicorn β ASGI server
- psycopg2-binary β PostgreSQL/TimescaleDB driver
- python-dotenv / PyYAML
- Streamlit 1.32 β drift dashboard UI
- Plotly 5.20 β interactive charts
- pandas β DataFrame backbone
- Slack Webhooks β drift alerts
- pytest 8.1
- unittest.mock β mocks LiteLLM and MLflow client
- pytest tmp_path β auto-cleaned temp files
- MLOps Engineering
- CI/CD Engineering with GitHub Actions
- LLM Evaluation (deterministic + LLM-as-Judge)
- Distributed Systems (Celery, async shadow scoring)
- Data Engineering (DVC, schema-driven validation)
- Observability (Streamlit + Plotly + TimescaleDB + Slack)
- Production Backend (FastAPI + ASGI middleware + hypertable design)
- Hermetic Testing (mocks, fixtures, separation of concerns)
- Configuration Management (single-source-of-truth YAML)
- Container Orchestration
- git clone
- cd "LLM Regressor MLOps"
- python -m venv venv
- venv\Scripts\activate
- pip install -r requirements.txt
- copy .env.example .env
- Edit
.envto fill inOPENAI_API_KEY,MLFLOW_TRACKING_URI, etc.
- python scripts/build_dataset.py
- python scorer/run_scorer.py
- python scripts/check_regression.py
- dvc repro
- uvicorn app.main:app --reload --port 8000
- celery -A app.tasks worker --loglevel=info
- streamlit run dashboards/app.py
The regression check always compares against the latest MLflow run tagged env=production.
- mlflow runs set-tag <run_id> env production
All subsequent CI runs will diff against this baseline.
- pytest tests/ -v
All test suites are hermetic β no network, no LLM calls, no MLflow server, no Redis, no TimescaleDB. LiteLLM and MLflowClient are mocked; file I/O uses pytest's tmp_path.
- docker compose up
| Service | Image / build | Port |
|---|---|---|
app |
local dockerfile |
8000 |
worker |
local dockerfile |
β |
redis |
redis:7-alpine |
6379 |
mlflow |
ghcr.io/mlflow/mlflow:v2.11.0 |
5000 |
timescale |
timescale/timescaledb:latest-pg15 |
5432 |
dashboard |
local dockerfile |
8501 |
The CI gate runs automatically on every push or PR that touches:
data/raw/prompts.jsonlconfig/scorer_config.yamlconfig/model_version.yamlscorer/**scripts/**
Required GitHub configuration:
- Secret
OPENAI_API_KEY - Variable
MLFLOW_TRACKING_URI(optional)
A green run:
A blocked PR:
- Every PR is automatically scored against a versioned golden dataset
- Quality regressions are blocked at the PR level before reaching production
- Per-category drift is visible on the Streamlit dashboard with version-bump markers
- 5% of live
/chattraffic is silently evaluated for shadow-vs-production comparison - Slack alerts fire when rolling 3-run averages dip below threshold
- All scoring runs are versioned in MLflow with full param + metric history
![]() |
![]() |
![]() |
![]() |
- Handling non-deterministic LLM responses during scoring (solved via temperature=0.0 and judge fallback logic)
- Reliable JSON parsing from judge LLM responses (solved via Markdown-fence stripping + overall-fallback)
- Zero-latency shadow routing on production traffic (solved via hash-based async Celery dispatch + body-stream reconstruction)
- MLflow connectivity edge cases in CI (solved via graceful fallback to local
./mlrunsand absolute threshold checks) - Preventing Slack alert spam during dashboard refresh (solved via
.alert_cooldown.jsonrate limiting) - Cross-platform dependency parity (Windows vs Linux CI runners) β solved via split
requirements.txt/requirements-ci.txt
- End-to-end MLOps pipeline design for LLM applications
- Building an LLM-as-Judge evaluation system with structured rubrics
- Combining deterministic and LLM-based metrics for robust scoring
- Designing a CI/CD quality gate that blocks regressions at the PR level
- Production observability with Streamlit, Plotly, MLflow, and TimescaleDB
- Shadow traffic engineering with zero added latency to live requests
- Async task processing with Celery + Redis
- Hermetic testing via mocks and pytest fixtures
- Combining MLOps + Backend + Data Engineering + LLM Evaluation in one cohesive system
- Prakhar Srivastava
- Data Scientist | AI Engineer | Data Scientist







