An end-to-end Probability of Default (PD) modelling pipeline for a retail/SME loan portfolio, built around a calibrated XGBoost classifier. The pipeline covers the full credit-risk workflow: class-imbalance handling, hyperparameter tuning, probability calibration, point-in-time and through-the-cycle PD via Monte Carlo simulation, Expected Credit Loss (ECL) computation, and a complete model-validation / monitoring suite with explainability.
The core routine (run_monte_carlo_pd) implements the following stages:
- Target construction: a binary Default_Flag is derived from loan status.
- Feature engineering: 12 numeric features (Loan_Amount, Interest_Rate, CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M, Business_vintage, DPD_30, DPD_90, Risk_Bucket) and 3 categorical features (Industry, State, Loan_Type), one-hot encoded.
- Imbalance correction: ADASYN oversampling runs inside the training pipeline, so synthetic minority (default) samples are generated from training data only and do not leak into the test fold.
- Model + calibration: XGBClassifier wrapped in CalibratedClassifierCV (sigmoid / Platt scaling, 5-fold) to produce well-calibrated probabilities.
- Hyperparameter tuning: RandomizedSearchCV (20 iterations, 5-fold, ROC-AUC scoring) over n_estimators, max_depth, learning_rate, gamma, and subsample.
- Macro overlay: point-in-time probabilities are scaled by macro factors to produce a baseline (Jan 2026) and a stressed (Jan 2027) view.
- Monte Carlo simulation: per-loan PD is estimated by simulation over each horizon.
- ECL: ECL = PD × LGD × EAD, with LGD from recovery rate (default 0.6) and a simplifying EAD assumption; TTC_PD is set to the stressed-horizon PD.
- Explainability: SHAP (TreeExplainer) values and a SHAP summary plot.
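
The macro overlay, simulation and ECL stages listed above can be sketched in plain Python. This is a minimal illustration and not the code in PD_V6.py: the helper names (`apply_macro_overlay`, `simulate_pd`, `expected_credit_loss`), the macro factors 1.0 and 1.3, the Bernoulli simulation scheme, and the choices LGD = 1 - recovery rate and EAD = outstanding exposure are all assumptions made for the example. The value 0.6 is read here as the default recovery rate.

```python
import random

def apply_macro_overlay(pit_pd, macro_factor):
    """Scale a point-in-time PD by a macro factor, clipped to [0, 1]."""
    return min(max(pit_pd * macro_factor, 0.0), 1.0)

def simulate_pd(pd_value, n_sims=10_000, seed=0):
    """Estimate PD as the fraction of simulated Bernoulli trials that default."""
    rng = random.Random(seed)
    defaults = sum(1 for _ in range(n_sims) if rng.random() < pd_value)
    return defaults / n_sims

def expected_credit_loss(pd_value, ead, recovery_rate=0.6):
    """ECL = PD x LGD x EAD, with LGD assumed to be 1 - recovery rate."""
    lgd = 1.0 - recovery_rate
    return pd_value * lgd * ead

# Hypothetical loan: calibrated PD of 4%, exposure of 500,000.
pit_pd, ead = 0.04, 500_000
for label, factor in [("baseline", 1.0), ("stressed", 1.3)]:
    horizon_pd = simulate_pd(apply_macro_overlay(pit_pd, factor))
    ecl = expected_credit_loss(horizon_pd, ead)
    print(label, round(horizon_pd, 4), round(ecl, 2))
```

Clipping after the overlay keeps a stressed PD a valid probability even when the macro factor is large.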
validate_pd_model_and_save_to_excel produces a full discrimination, calibration and
stability report:
| Area | Metrics |
|---|---|
| Discrimination | AUC, Gini, KS statistic |
| Stability | PSI (decile distribution) |
| Association | Pearson correlation |
| Classification | Precision, Recall, F1 |
| Loss | Average LGD, EAD, ECL (both horizons) |
| Explainability | SHAP feature importance |
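
For readers unfamiliar with the discrimination and stability metrics in the table, the sketch below shows one common way to compute AUC, Gini, KS and PSI using only the standard library. The function names are hypothetical and the formulas are the textbook definitions; the script's own implementation may differ, for example in how it bins scores into deciles for PSI.

```python
import math

def _split(y_true, scores):
    """Separate scores of defaulters (1) and non-defaulters (0)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return pos, neg

def auc(y_true, scores):
    """Probability that a defaulter scores higher than a non-defaulter (ties count half)."""
    pos, neg = _split(y_true, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2 * auc(y_true, scores) - 1

def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = _split(y_true, scores)
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)

def psi(expected_shares, actual_shares, floor=1e-6):
    """PSI over pre-binned population shares: sum of (a - e) * ln(a / e)."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p), gini(y, p), ks_statistic(y, p))
```

The pairwise AUC is quadratic in the number of loans, which is fine for illustration; a rank-based formula is preferable on a full portfolio.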

The following plots are written to plots/:

| Plot | File |
|---|---|
| ROC curve | plots/ROC_curve.png |
| KS plot | plots/KS_plot.png |
| PSI decile distribution | plots/PSI_plot.png |
| Pearson correlation | plots/Pearson_plot.png |
| Precision / Recall / F1 | plots/PRF_plot.png |
| SHAP summary | plots/SHAP_summary.png |
.
├── src/
│ └── PD_V6.py # Full PD / ECL / validation pipeline
├── plots/ # Diagnostic and explainability plots
├── requirements.txt
├── LICENSE
└── README.md
PD_V6.py reads an Excel portfolio file (e.g. FINAL_PORTFOLIO_2003_3D_WithDates.xlsx,
not included). Required columns include: Default_Status, Loan_Amount, Interest_Rate,
CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M,
Business_vintage, DPD_30, DPD_90, Risk_Bucket (A1/A2/B1/B2), Industry, State,
Loan_Type. Recovery_Rate is optional.
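
Before running the pipeline it can help to check that the portfolio file has the expected columns. The helper below is a hypothetical sketch (`missing_columns` is not part of PD_V6.py) and treats the columns named above as the full required set, which is an assumption since the list is introduced with "include".

```python
# Columns named in this README as required; Recovery_Rate is optional.
REQUIRED_COLUMNS = [
    "Default_Status", "Loan_Amount", "Interest_Rate", "CIBIL_Score",
    "Time_in_months", "Age", "Vintage", "ADB", "Enquiries_L3M",
    "Business_vintage", "DPD_30", "DPD_90", "Risk_Bucket",
    "Industry", "State", "Loan_Type",
]
OPTIONAL_COLUMNS = ["Recovery_Rate"]

def missing_columns(columns):
    """Return the required column names absent from `columns`, in README order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Example with a header row that lacks ADB and State.
header = [c for c in REQUIRED_COLUMNS if c not in ("ADB", "State")]
print(missing_columns(header))
```

With pandas, the same check could be run as `missing_columns(df.columns)` after loading the workbook.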
pip install -r requirements.txt
# Place your portfolio Excel file alongside src/ and update the path in the __main__ block,
# then:
python src/PD_V6.py

The script writes a results workbook and a ValidationResults.xlsx report with all metrics
and embedded plots. Adjust the output paths in the __main__ block to suit your environment.
Python · XGBoost · scikit-learn · imbalanced-learn (ADASYN) · SHAP · NumPy · pandas · matplotlib · XlsxWriter
MIT — see LICENSE.