Production-simulated credit decision-ops platform — champion/challenger governance, Optuna HPO, Platt calibration, PSI drift alerting, fairness audit, FastAPI serving.
Credit default models fail silently. PR AUC looks fine in training, the model is miscalibrated at the cutoff that matters, and nobody checks disparate impact until a regulator does.
A model with PR AUC 0.26 can produce systematically wrong approval rates if its probability outputs are not calibrated at the decision thresholds. Drift compounds this: a 14-day population shift of −0.18 in EXT_SOURCE_2 and a ×1.22 increase in credit_income_ratio pushes PSI to 0.2358, past the 0.20 alert threshold, while the model keeps scoring without warning. Standard notebook workflows catch none of this. RiskFrame is built to catch all of it.
RiskFrame runs XGBoost and LightGBM head-to-head with a 5-gate promotion framework. Neither model advances without clearing every gate on held-out test data.
```bash
python -m src.training.train \
  --data_dir data/home-credit-default-risk \
  --artifact_dir artifacts/xgb_v1 \
  --config configs/training_config.json
```

Pipeline: 7-table load → bureau/prev_app/installments aggregation to SK_ID_CURR grain → 183-feature ABTBuilder → stratified 60/20/20 split → ColumnTransformer fit on train only → RandomizedSearchCV on XGBClassifier (20 iterations, 3-fold CV) → CalibratedClassifierCV(method='sigmoid', cv='prefit') fit on the val set → sklearn.Pipeline serialized to model.joblib.
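For orientation, the skeleton of that flow as a minimal, hedged sketch: the synthetic frame, column lists, and parameter grid below are illustrative stand-ins for the 183-feature ABT and configs/training_config.json; only the structure (train-only preprocessing, prefit Platt calibration on val, one serialized Pipeline) mirrors the real trainer.

```python
# Minimal sketch of the training flow; data, columns, and grid are illustrative.
import joblib
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Stand-in for the 183-feature ABT at SK_ID_CURR grain
rng = np.random.default_rng(42)
n = 3000
df = pd.DataFrame({
    "EXT_SOURCE_2": rng.uniform(0, 1, n),
    "credit_income_ratio": rng.uniform(0.5, 5.0, n),
    "NAME_INCOME_TYPE": rng.choice(["Working", "Pensioner"], n),
    "TARGET": rng.binomial(1, 0.08, n),
})
X, y = df.drop(columns="TARGET"), df["TARGET"]
X_train, y_train = X.iloc[:1800], y.iloc[:1800]      # 60% train
X_val, y_val = X.iloc[1800:2400], y.iloc[1800:2400]  # 20% val (20% test held out)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["EXT_SOURCE_2", "credit_income_ratio"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["NAME_INCOME_TYPE"]),
])

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={
        "max_depth": [3, 4, 5, 6],
        "learning_rate": [0.01, 0.03, 0.1],
        "n_estimators": [200, 400, 800],
    },
    n_iter=20, cv=3, scoring="average_precision", random_state=42,
)

# Preprocessing and the hyperparameter search see train data only
search.fit(pre.fit_transform(X_train), y_train)

# Platt (sigmoid) calibration fit on the held-out val set, never on train
calibrated = CalibratedClassifierCV(search.best_estimator_,
                                    method="sigmoid", cv="prefit")
calibrated.fit(pre.transform(X_val), y_val)

# A single serialized Pipeline is what batch and API scoring both load
joblib.dump(Pipeline([("pre", pre), ("clf", calibrated)]), "model.joblib")
```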
| Metric | Value |
|---|---|
| PR AUC | 0.2611 |
| ROC AUC | 0.7663 |
| ECE | 0.0046 |
| Gate decision | DEPLOYED |
```bash
python -m src.training.challenger.train_challenger
```

Same split, separate ColumnTransformer, RandomizedSearchCV on LightGBM, Platt calibration. Evaluated head-to-head across 8 metrics.
| Metric | XGBoost Champion | LightGBM Challenger |
|---|---|---|
| PR AUC | 0.2611 | 0.2609 |
| ROC AUC | 0.7663 | 0.7649 |
| ECE | 0.0046 | higher |
| Gate 1 (PR AUC delta ≥ 0.001) | — | FAIL → HOLD |
DeLong test on the AUC difference: z ≈ 0.08, two-sided p ≈ 0.94, not statistically significant. Performance is equivalent; the champion is retained.
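For reference, a minimal sketch of the DeLong machinery behind that comparison, in the structural-components form (O(m·n) memory; the project's actual implementation may differ):

```python
import numpy as np
from scipy.stats import norm

def auc_and_components(y_true, scores):
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Heaviside kernel: 1 if the positive outscores the negative, 0.5 on ties
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for the difference of two correlated ROC AUCs."""
    auc_a, v10_a, v01_a = auc_and_components(y_true, scores_a)
    auc_b, v10_b, v01_b = auc_and_components(y_true, scores_b)
    s10, s01 = np.cov(v10_a, v10_b), np.cov(v01_a, v01_b)
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / len(v10_a)
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / len(v01_a))
    z = (auc_a - auc_b) / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))

# Illustrative usage (array names assumed):
# z, p = delong_test(y_test.to_numpy(), proba_champion, proba_challenger)
```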
```bash
python -m src.training.optuna_hpo --n_trials 50 --seed 42
```

A 50-trial TPE Bayesian search over 9 XGBoost hyperparameters. xgb_v2 achieves PR AUC 0.2654, a better discriminator, but ECE regresses from 0.0046 to 0.0243: Platt sigmoid calibration becomes numerically unstable when the internal XGB probability distribution shifts, making xgb_v2 a worse policy instrument despite the stronger AUC. A sketch of the search loop follows the table below.
| Metric | xgb_v1 Champion | xgb_v2 Optuna |
|---|---|---|
| PR AUC | 0.2611 | 0.2654 |
| ROC AUC | 0.7663 | 0.7692 |
| ECE | 0.0046 | 0.0243 |
| Review rate collapse | — | −14.4pp |
| Gate decision | DEPLOYED | HOLD |
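A hedged sketch of that search loop with Optuna's TPE sampler, reusing the pre / X_train / X_val names from the training sketch above; the nine-dimensional space below is an illustrative guess, not the exact space recorded in optuna_hpo_results.json:

```python
import optuna
from sklearn.metrics import average_precision_score
from xgboost import XGBClassifier

Xtr, Xva = pre.fit_transform(X_train), pre.transform(X_val)

def objective(trial):
    params = {
        "max_depth":        trial.suggest_int("max_depth", 3, 10),
        "learning_rate":    trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators":     trial.suggest_int("n_estimators", 200, 1500),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "subsample":        trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma":            trial.suggest_float("gamma", 1e-8, 10.0, log=True),
        "reg_alpha":        trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda":       trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    model.fit(Xtr, y_train)
    # Optimize PR AUC on the val split; ECE is checked separately at gate time
    return average_precision_score(y_val, model.predict_proba(Xva)[:, 1])

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
```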
The policy engine converts probability scores to decisions at thresholds. An uncalibrated model with excellent AUC can still produce the wrong APPROVE/REVIEW/REJECT split if probability outputs are systematically off.
CalibratedClassifierCV(method='sigmoid', cv='prefit') is fit on the val set after RandomizedSearchCV completes — never on training data. ECE 0.0046 confirms the champion is well-calibrated at the decision cutoffs. This is why xgb_v2 (ECE 0.0243) is held: better discrimination is not worth calibration regression when the policy engine depends on the probability.
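For reference, a minimal ECE implementation with ten equal-width bins (the bin scheme is an assumption; calibration_report.json may bin differently):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |mean predicted probability - observed default rate| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            # bin weight * calibration gap within the bin
            ece += in_bin.mean() * abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
    return ece
```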
The drift monitor computes Population Stability Index (PSI) and per-feature KS statistics across 183 features at each batch run.
| Day | PSI | Status |
|---|---|---|
| 1–3 | ~0.03 | Nominal |
| 7 | 0.158 | WARN |
| 14 | 0.2358 | ALERT |
Day 14 drift is synthetic: EXT_SOURCE_2 shifted −0.18, credit_income_ratio scaled ×1.22. The drift monitor fires correctly. drift_fire_test.py asserts PSI > 0.20 on this population — it is part of the 22/22 test suite.
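Per feature, PSI compares baseline and current bin fractions: PSI = Σ_b (a_b − e_b)·ln(a_b / e_b). A minimal sketch, assuming continuous features and decile bins taken from the baseline:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (expected) sample and a current (actual) sample."""
    # Bin edges come from the baseline distribution; open-ended outer bins
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # guard ln(0)
    return float(np.sum((a - e) * np.log(a / e)))
```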
fairness_report.json applies two threshold-gated metrics by gender:
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Disparate Impact (F/M approval rate ratio) | 1.059 | 0.80–1.25 | No violation |
| Equal Opportunity gap (TPR parity) | ~2.8pp | < 5pp | No violation |
CODE_GENDER is excluded from adverse action reason codes (ECOA-compliant). SHAP rank of CODE_GENDER_F is #10 (mean |SHAP| = 0.0848 vs EXT_SOURCE_2 = 0.3470).
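How the two gates reduce to code, as a hedged sketch: CODE_GENDER and TARGET follow the Home Credit schema (TARGET = 0 means repaid), while the decision column and the TPR convention (approval rate among actual non-defaulters) are assumptions.

```python
import pandas as pd

def fairness_gates(df: pd.DataFrame) -> dict:
    """df columns assumed: CODE_GENDER ('F'/'M'), decision, TARGET (0 = repaid)."""
    approved = df["decision"] == "APPROVE"

    # Disparate impact: F/M approval-rate ratio, gated to the 0.80-1.25 band
    rate = approved.groupby(df["CODE_GENDER"]).mean()
    di = rate["F"] / rate["M"]

    # Equal opportunity: approval-rate (TPR) gap among actual non-defaulters
    good = df[df["TARGET"] == 0]
    tpr = (good["decision"] == "APPROVE").groupby(good["CODE_GENDER"]).mean()
    eo_gap = abs(tpr["F"] - tpr["M"])

    return {"disparate_impact": di, "di_ok": 0.80 <= di <= 1.25,
            "eo_gap_pp": 100 * eo_gap, "eo_ok": eo_gap < 0.05}
```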
```text
score < 0.06          → APPROVE
0.06 ≤ score < 0.28   → REVIEW  (routed to human queue)
score ≥ 0.28          → REJECT
```
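Transcribed into a decision function (thresholds from the policy block above; the real engine also enforces the versioned REVIEW capacity):

```python
def decide(score: float, approve_lt: float = 0.06, reject_ge: float = 0.28) -> str:
    """Map a calibrated default probability to an APPROVE/REVIEW/REJECT decision."""
    if score < approve_lt:
        return "APPROVE"
    if score >= reject_ge:
        return "REJECT"
    return "REVIEW"  # routed to the human queue
```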
Policy is versioned separately from the model. v1.0 → v1.1 on Day 12 tightened capacity from 30% to 15%, recorded in policy_change_log.json with rationale and authorized_by.
```bash
# Score an applicant
curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "SK_ID_CURR": 123456,
    "EXT_SOURCE_2": 0.6,
    "AMT_CREDIT": 200000,
    "AMT_INCOME_TOTAL": 100000,
    "DAYS_BIRTH": -12000,
    "DAYS_EMPLOYED": -2000,
    "CODE_GENDER": "M",
    "NAME_INCOME_TYPE": "Working"
  }'
```

Endpoints: /score, /explain (SHAP), /batch, /drift, /policy, /registry
Training-serving parity: the sklearn.Pipeline is the same object loaded by batch_scorer.py and serving/app.py. parity_check.py asserts batch scorer == API scorer within 1e-6.
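A hedged sketch in the spirit of parity_check.py; the import path, artifact location, and the response key are assumptions about the repo layout:

```python
import joblib
import numpy as np
import pandas as pd
from fastapi.testclient import TestClient

from serving.app import app  # assumed import path

# Same serialized Pipeline the batch scorer loads
pipeline = joblib.load("artifacts/xgb_v1/model.joblib")
applicant = {
    "SK_ID_CURR": 123456, "EXT_SOURCE_2": 0.6, "AMT_CREDIT": 200000,
    "AMT_INCOME_TOTAL": 100000, "DAYS_BIRTH": -12000, "DAYS_EMPLOYED": -2000,
    "CODE_GENDER": "M", "NAME_INCOME_TYPE": "Working",
}

batch_score = pipeline.predict_proba(pd.DataFrame([applicant]))[0, 1]
resp = TestClient(app).post("/score", json=applicant)
api_score = resp.json()["score"]  # "score" response key assumed
assert np.isclose(batch_score, api_score, atol=1e-6)
```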
```bash
docker-compose up -d
curl http://localhost:8000/health
```

```bash
python seed_demo.py          # Generate all artifacts
python show_demo_report.py   # Terminal evidence summary
```

| Day | Event |
|---|---|
| 1–3 | Clean batch runs, nominal PSI ~0.03 |
| 4 | Malformed batch: 47 null SK_ID_CURR + 12 DAYS_EMPLOYED=+5200 → rejected_rows.csv (see the sketch after this table) |
| 7 | Population shift, PSI 0.158 WARN |
| 10 | LightGBM challenger registered, shadow scoring begins |
| 12 | Policy v1.0→v1.1: capacity 30%→15%, thresholds tightened |
| 14 | Synthetic drift injection → PSI 0.2358 ALERT |
| 15 | Fairness report generated |
| 21 | 200 synthetic review outcomes logged with override reasons |
| 25 | Challenger comparison: 8-metric head-to-head + 5 promotion gates |
| 30 | Delayed label validation: bad-rate-by-bucket vs. predicted |
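The Day 4 quarantine rule from the timeline, sketched; the rule set and output schema are illustrative:

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Quarantine malformed rows before scoring; return only the clean rows."""
    # Null join keys and implausible positive tenure (e.g. +5200) are rejected
    bad = df["SK_ID_CURR"].isna() | (df["DAYS_EMPLOYED"] > 0)
    df[bad].to_csv("rejected_rows.csv", index=False)  # audit trail
    return df[~bad]
```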
```bash
python tests/pipeline_integrity_test.py   # 10 integrity tests
python tests/golden_scenario_suite.py     # 12 deterministic golden scenarios
python tests/drift_fire_test.py           # PSI > 0.20 on synthetic drift
python tests/parity_check.py              # Training-serving parity
```

All green: 22/22 tests passing.
```bash
pip install -r requirements.txt
python -m src.training.train
python seed_demo.py
uvicorn serving.app:app --host 0.0.0.0 --port 8000
open dashboard.html
```

| Artifact | Proves |
|---|---|
| calibration_report.json | Brier, ECE, MCE: calibration is measured |
| drift_report.json | PSI > 0.20 alert fires on drifted population |
| challenger_comparison_report.json | 8-metric head-to-head |
| challenger_promotion_decision.json | All 5 gates documented |
| fairness_report.json | Disparate impact + SHAP rank |
| policy_change_log.json | Day 12 threshold change with rationale |
| optuna_hpo_results.json | 50-trial search, ECE regression documented |
| batch_scoring_runs.csv | 30-run operational history |
What this is: Solo-built, non-production, production-simulated credit decisioning platform on public Home Credit data.
What is real: Feature pipeline, model training, calibration evaluation, policy engine, batch/online scoring, drift computation, policy logs, review logs, fairness computation, delayed label check, Docker serving.
What is simulated: Applicant records (public dataset), operational lifecycle events (scripted), human review decisions (synthetic, labeled as such).
What is not claimed: Production deployment, regulatory approval, MRM validation, real customer data.
| Document | Location |
|---|---|
| PRD v2.3 | docs/prd/RiskFrame_PRD_v2.3.pdf |
| Interview Defense | docs/defense/RiskFrame_Interview_Defense_v2.pdf |
| Model Card | MODEL_CARD.md |
| API curl proof | docs/api_curl_proof.md |
| Docker run proof | docs/docker_run_proof.md |
Full design rationale, architecture decisions, and expected interview questions with answers:
docs/defense/RiskFrame_Interview_Defense_v2.pdf
Covers: champion/challenger framework, Optuna HPO ECE regression, Platt calibration rationale, PSI drift alerting, fairness audit methodology, 5-gate promotion framework, and production failure modes.
This project is part of a portfolio targeting Applied LLM Systems Engineer roles.
- NexusSupply — Supplier Risk Intelligence Platform (LangGraph + FinBERT + XGBoost + Instructor + NetworkX)
- LendFlow — AI-powered loan underwriting pipeline (LangGraph + RAG + FOIR rules engine)
- AgentReliabilityLab — Cyber threat triage agent (LangGraph + hybrid RAG + HITL + RAGAS eval)
- RiskFrame Platform — ML model lifecycle (XGBoost + LightGBM champion/challenger, Optuna HPO, drift monitoring)
- DevPulse Platform — Version-safe RAG migration intelligence (LLM-Last principle, conflict detection)
- PulseRank Platform — Marketplace ranking with IPS debiasing (position bias correction, delayed attribution)
- MetaSignal Platform — Experimentation intelligence (CUPED + guardrail-first + A/A calibration)