# RiskFrame v2.3


Production-simulated credit decision-ops platform — champion/challenger governance, Optuna HPO, Platt calibration, PSI drift alerting, fairness audit, FastAPI serving.


## Architecture

*RiskFrame architecture diagram*


## Sample Output

*Model scorecard*


## The Problem

Credit default models fail silently. PR AUC looks fine in training, the model is miscalibrated at the cutoff that matters, and nobody checks disparate impact until a regulator does.

A model with PR AUC 0.26 can produce systematically wrong approval rates if its probability outputs are not calibrated against the decision thresholds. Drift compounds this: a 14-day population shift of −0.18 in `EXT_SOURCE_2` plus a ×1.22 increase in `credit_income_ratio` push PSI to 0.2358, past the 0.20 alert threshold, while the model keeps scoring without warning. Standard notebook workflows catch none of this. RiskFrame is built to catch all of it.


## Champion / Challenger Framework

RiskFrame runs XGBoost and LightGBM head-to-head under a 5-gate promotion framework. No model is promoted without clearing every gate on held-out test data.

### Champion – XGBoost v1

```bash
python -m src.training.train \
  --data_dir data/home-credit-default-risk \
  --artifact_dir artifacts/xgb_v1 \
  --config configs/training_config.json
```

Pipeline: 7-table load → bureau/prev_app/installments aggregation to `SK_ID_CURR` grain → 183-feature `ABTBuilder` → stratified 60/20/20 split → `ColumnTransformer` fit on train only → `RandomizedSearchCV` over `XGBClassifier` (20 iterations, 3-fold CV) → `CalibratedClassifierCV(method='sigmoid', cv='prefit')` fit on the val set → `sklearn.Pipeline` serialized to `model.joblib`.
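
A condensed sketch of that train → calibrate → serialize tail. The column lists, parameter grid, and pre-made `X_train`/`X_val` splits are illustrative stand-ins, not the repo's actual configuration:

```python
# Sketch only: column lists, parameter grid, and X_train/X_val/y_* splits
# are placeholders; the real pipeline runs the 183-feature ABTBuilder first.
import joblib
from scipy.stats import randint, uniform
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["EXT_SOURCE_2", "credit_income_ratio"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["NAME_INCOME_TYPE"]),
])

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={"max_depth": randint(3, 10),
                         "learning_rate": uniform(0.01, 0.29)},
    n_iter=20, cv=3, scoring="average_precision", random_state=42,
)
search.fit(preprocess.fit_transform(X_train), y_train)  # fit on train only

# Platt scaling on the held-out val split, never on training data
calibrated = CalibratedClassifierCV(search.best_estimator_,
                                    method="sigmoid", cv="prefit")
calibrated.fit(preprocess.transform(X_val), y_val)

joblib.dump(Pipeline([("prep", preprocess), ("clf", calibrated)]),
            "artifacts/xgb_v1/model.joblib")
```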

| Metric        | Value    |
|---------------|----------|
| PR AUC        | 0.2611   |
| ROC AUC       | 0.7663   |
| ECE           | 0.0046   |
| Gate decision | DEPLOYED |

### Challenger – LightGBM v1

```bash
python -m src.training.challenger.train_challenger
```

Same split, separate ColumnTransformer, RandomizedSearchCV on LightGBM, Platt calibration. Evaluated head-to-head across 8 metrics.

| Metric                         | XGBoost Champion | LightGBM Challenger |
|--------------------------------|------------------|---------------------|
| PR AUC                         | 0.2611           | 0.2609              |
| ROC AUC                        | 0.7663           | 0.7649              |
| ECE                            | 0.0046           | higher              |
| Gate 1 (PR AUC delta ≥ 0.001)  |                  | FAIL → HOLD         |

DeLong test on the AUC difference: z ≈ 0.08, p ≈ 0.07, not statistically significant. Performance is equivalent, so the champion is retained.
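
A minimal sketch of Gate 1 (the function name is illustrative; only the ≥ 0.001 delta threshold and the PR AUC values come from the table above):

```python
def gate_1_pr_auc(champion_pr_auc: float, challenger_pr_auc: float,
                  min_delta: float = 0.001) -> bool:
    """Gate 1: challenger must beat the champion's PR AUC by at least min_delta."""
    return challenger_pr_auc - champion_pr_auc >= min_delta

# LightGBM v1 vs. XGBoost v1: 0.2609 - 0.2611 = -0.0002 < 0.001 → FAIL → HOLD
assert not gate_1_pr_auc(0.2611, 0.2609)
```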


## Model Scorecard

*Model scorecard*


## Optuna HPO Design

```bash
python -m src.training.optuna_hpo --n_trials 50 --seed 42
```

A 50-trial TPE (Bayesian) search over 9 XGBoost hyperparameters. xgb_v2 achieves PR AUC 0.2654, a better discriminator, but ECE regresses from 0.0046 to 0.0243: Platt sigmoid calibration becomes numerically unstable when the underlying XGB probability distribution shifts, making xgb_v2 a worse policy instrument despite the stronger AUC.
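
A minimal sketch of the study setup; the search space shown is an illustrative subset of the 9-parameter space, not the repo's actual one:

```python
import optuna
from sklearn.metrics import average_precision_score
from xgboost import XGBClassifier

def objective(trial: optuna.Trial) -> float:
    # Illustrative subset of the 9-parameter space
    params = {
        "max_depth":     trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample":     trial.suggest_float("subsample", 0.5, 1.0),
        "n_estimators":  trial.suggest_int("n_estimators", 200, 1000),
    }
    model = XGBClassifier(**params, random_state=42).fit(X_train, y_train)
    # Optimize PR AUC on the validation split
    return average_precision_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
```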

| Metric               | xgb_v1 Champion | xgb_v2 Optuna |
|----------------------|-----------------|---------------|
| PR AUC               | 0.2611          | 0.2654        |
| ROC AUC              | 0.7663          | 0.7692        |
| ECE                  | 0.0046          | 0.0243        |
| Review rate collapse |                 | −14.4pp       |
| Gate decision        | DEPLOYED        | HOLD          |

### Platt Calibration Rationale

The policy engine converts probability scores to decisions at thresholds. An uncalibrated model with excellent AUC can still produce the wrong APPROVE/REVIEW/REJECT split if probability outputs are systematically off.

`CalibratedClassifierCV(method='sigmoid', cv='prefit')` is fit on the val set after `RandomizedSearchCV` completes, never on training data. ECE 0.0046 confirms the champion is well-calibrated at the decision cutoffs. This is why xgb_v2 (ECE 0.0243) is held: better discrimination is not worth a calibration regression when the policy engine depends on the probabilities themselves.
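
For reference, a common way to compute ECE for a binary scorer, as a sketch; the bin count and binning scheme in `calibration_report.json` may differ:

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE: bin-weighted mean gap between observed default rate and mean score."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    edges[-1] += 1e-9                      # include scores exactly at 1.0
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)
```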


## PSI Drift Alerting

The drift monitor computes Population Stability Index (PSI) and per-feature KS statistics across 183 features at each batch run.

| Day | PSI    | Status  |
|-----|--------|---------|
| 1–3 | ~0.03  | Nominal |
| 7   | 0.158  | WARN    |
| 14  | 0.2358 | ALERT   |

Day 14 drift is synthetic: `EXT_SOURCE_2` shifted by −0.18 and `credit_income_ratio` scaled by ×1.22. The drift monitor fires correctly; `drift_fire_test.py` asserts PSI > 0.20 on this population and is part of the 22/22 test suite.
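
A minimal sketch of per-feature PSI; the decile binning on the baseline and the 1e-6 floor are assumptions, not necessarily the repo's exact scheme:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index of `current` against `baseline` deciles."""
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range drift
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)   # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))
```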


## Fairness Audit

`fairness_report.json` reports two threshold-gated metrics, split by gender:

| Metric                                     | Value  | Threshold | Status       |
|--------------------------------------------|--------|-----------|--------------|
| Disparate Impact (F/M approval rate ratio) | 1.059  | 0.80–1.25 | No violation |
| Equal Opportunity gap (TPR parity)         | ~2.8pp | < 5pp     | No violation |

`CODE_GENDER` is excluded from adverse action reason codes (ECOA-compliant). The SHAP rank of `CODE_GENDER_F` is #10 (mean |SHAP| = 0.0848, vs. 0.3470 for `EXT_SOURCE_2`).
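
A sketch of the disparate impact check; the `decision` column name is hypothetical, while `CODE_GENDER` and the 0.80–1.25 band come from the table above:

```python
import pandas as pd

def disparate_impact(df: pd.DataFrame) -> tuple[float, bool]:
    """F/M approval-rate ratio; violation if outside the 0.80–1.25 band."""
    rate_f = (df.loc[df["CODE_GENDER"] == "F", "decision"] == "APPROVE").mean()
    rate_m = (df.loc[df["CODE_GENDER"] == "M", "decision"] == "APPROVE").mean()
    ratio = rate_f / rate_m
    return ratio, 0.80 <= ratio <= 1.25
```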


## Decision Policy

```text
score < 0.06            →  APPROVE
0.06 ≤ score < 0.28     →  REVIEW   (routed to human queue)
score ≥ 0.28            →  REJECT
```

The policy is versioned separately from the model. The v1.0 → v1.1 change on Day 12 tightened review capacity from 30% to 15%, recorded in `policy_change_log.json` with a rationale and an `authorized_by` field.
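
Sketched as a pure function of the calibrated score (a hypothetical reduction; the repo presumably reads these thresholds from the versioned policy config rather than hard-coding them):

```python
def decide(score: float) -> str:
    """Map a calibrated default probability to a decision per the table above."""
    if score < 0.06:
        return "APPROVE"
    if score < 0.28:
        return "REVIEW"   # routed to the human review queue
    return "REJECT"
```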


## FastAPI Serving

```bash
# Score an applicant
curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "SK_ID_CURR": 123456,
    "EXT_SOURCE_2": 0.6,
    "AMT_CREDIT": 200000,
    "AMT_INCOME_TOTAL": 100000,
    "DAYS_BIRTH": -12000,
    "DAYS_EMPLOYED": -2000,
    "CODE_GENDER": "M",
    "NAME_INCOME_TYPE": "Working"
  }'
```

Endpoints: `/score`, `/explain` (SHAP), `/batch`, `/drift`, `/policy`, `/registry`

Training–serving parity: the serialized `sklearn.Pipeline` is the same object loaded by `batch_scorer.py` and `serving/app.py`; `parity_check.py` asserts batch scorer == API scorer within 1e-6.
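
A minimal sketch of what the serving side plausibly looks like under that constraint; the request schema is abridged and the response shape is hypothetical, while the real `serving/app.py` accepts the full applicant payload:

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Same serialized pipeline object that batch_scorer.py loads
pipeline = joblib.load("artifacts/xgb_v1/model.joblib")

class Applicant(BaseModel):   # abridged: the real schema has more fields
    SK_ID_CURR: int
    EXT_SOURCE_2: float
    AMT_CREDIT: float
    AMT_INCOME_TOTAL: float
    CODE_GENDER: str

@app.post("/score")
def score(applicant: Applicant) -> dict:
    features = pd.DataFrame([applicant.model_dump()])
    prob = float(pipeline.predict_proba(features)[:, 1][0])
    return {"SK_ID_CURR": applicant.SK_ID_CURR, "default_probability": prob}
```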

```bash
docker-compose up -d
curl http://localhost:8000/health
```

## 30-Day Operational Lifecycle

```bash
python seed_demo.py        # Generate all artifacts
python show_demo_report.py # Terminal evidence summary
```

| Day | Event |
|-----|-------|
| 1–3 | Clean batch runs, nominal PSI ~0.03 |
| 4   | Malformed batch: 47 null `SK_ID_CURR` + 12 `DAYS_EMPLOYED=+5200` → `rejected_rows.csv` |
| 7   | Population shift, PSI 0.158 WARN |
| 10  | LightGBM challenger registered, shadow scoring begins |
| 12  | Policy v1.0 → v1.1: capacity 30% → 15%, thresholds tightened |
| 14  | Synthetic drift injection → PSI 0.2358 ALERT |
| 15  | Fairness report generated |
| 21  | 200 synthetic review outcomes logged with override reasons |
| 25  | Challenger comparison: 8-metric head-to-head + 5 promotion gates |
| 30  | Delayed label validation: bad-rate-by-bucket vs. predicted |

## Tests

```bash
python tests/pipeline_integrity_test.py   # 10 integrity tests
python tests/golden_scenario_suite.py     # 12 deterministic golden scenarios
python tests/drift_fire_test.py           # PSI > 0.20 on synthetic drift
python tests/parity_check.py              # Training-serving parity
```

All green: 22/22 tests passing


## Quick Start

```bash
pip install -r requirements.txt
python -m src.training.train
python seed_demo.py
uvicorn serving.app:app --host 0.0.0.0 --port 8000
open dashboard.html
```

## Key Evidence Artifacts

| Artifact | Proves |
|----------|--------|
| `calibration_report.json` | Brier, ECE, MCE: calibration is measured |
| `drift_report.json` | PSI > 0.20 alert fires on the drifted population |
| `challenger_comparison_report.json` | 8-metric head-to-head |
| `challenger_promotion_decision.json` | All 5 gates documented |
| `fairness_report.json` | Disparate impact + SHAP rank |
| `policy_change_log.json` | Day 12 threshold change with rationale |
| `optuna_hpo_results.json` | 50-trial search, ECE regression documented |
| `batch_scoring_runs.csv` | 30-run operational history |

## Honest Positioning

**What this is:** a solo-built, non-production, production-simulated credit decisioning platform on public Home Credit data.

**What is real:** feature pipeline, model training, calibration evaluation, policy engine, batch/online scoring, drift computation, policy logs, review logs, fairness computation, delayed label check, Docker serving.

**What is simulated:** applicant records (public dataset), operational lifecycle events (scripted), human review decisions (synthetic, labeled as such).

**What is not claimed:** production deployment, regulatory approval, MRM validation, real customer data.


## Documentation

| Document | Location |
|----------|----------|
| PRD v2.3 | `docs/prd/RiskFrame_PRD_v2.3.pdf` |
| Interview Defense | `docs/defense/RiskFrame_Interview_Defense_v2.pdf` |
| Model Card | `MODEL_CARD.md` |
| API curl proof | `docs/api_curl_proof.md` |
| Docker run proof | `docs/docker_run_proof.md` |

## Interview Defense

Full design rationale, architecture decisions, and expected interview questions with answers:

`docs/defense/RiskFrame_Interview_Defense_v2.pdf`

Covers: champion/challenger framework, Optuna HPO ECE regression, Platt calibration rationale, PSI drift alerting, fairness audit methodology, 5-gate promotion framework, and production failure modes.


## Part of Applied LLM Systems Portfolio

This project is part of a portfolio targeting Applied LLM Systems Engineer roles.

- **NexusSupply** – Supplier Risk Intelligence Platform (LangGraph + FinBERT + XGBoost + Instructor + NetworkX)
- **LendFlow** – AI-powered loan underwriting pipeline (LangGraph + RAG + FOIR rules engine)
- **AgentReliabilityLab** – Cyber threat triage agent (LangGraph + hybrid RAG + HITL + RAGAS eval)
- **RiskFrame Platform** – ML model lifecycle (XGBoost + LightGBM champion/challenger, Optuna HPO, drift monitoring)
- **DevPulse Platform** – Version-safe RAG migration intelligence (LLM-Last principle, conflict detection)
- **PulseRank Platform** – Marketplace ranking with IPS debiasing (position bias correction, delayed attribution)
- **MetaSignal Platform** – Experimentation intelligence (CUPED + guardrail-first + A/A calibration)
