Probability of Default (PD) Model — XGBoost Credit Risk Pipeline

An end-to-end Probability of Default (PD) modelling pipeline for a retail/SME loan portfolio, built around a calibrated XGBoost classifier. The pipeline covers the full credit-risk workflow: class-imbalance handling, hyperparameter tuning, probability calibration, point-in-time and through-the-cycle PD via Monte Carlo simulation, Expected Credit Loss (ECL) computation, and a complete model-validation / monitoring suite with explainability.

Methodology

The core routine (run_monte_carlo_pd) implements the following stages:

1. Target construction: a binary Default_Flag is derived from loan status.
2. Feature engineering: 12 numeric features (Loan_Amount, Interest_Rate, CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M, Business_vintage, DPD_30, DPD_90, Risk_Bucket) and 3 categorical features (Industry, State, Loan_Type), which are one-hot encoded.
3. Imbalance correction: ADASYN oversampling runs inside the training pipeline, so synthetic minority (default) samples are generated from training data only and do not leak into the test fold.
4. Model + calibration: XGBClassifier wrapped in CalibratedClassifierCV (sigmoid / Platt scaling, 5-fold) to produce well-calibrated probabilities.
5. Hyperparameter tuning: RandomizedSearchCV (20 iterations, 5-fold, ROC-AUC scoring) over n_estimators, max_depth, learning_rate, gamma, and subsample.
6. Macro overlay: point-in-time probabilities are scaled by macro factors to produce a baseline view (Jan 2026) and a stressed view (Jan 2027).
7. Monte Carlo simulation: per-loan PD is estimated by simulation over each horizon.
8. ECL: ECL = PD × LGD × EAD, with LGD derived from the recovery rate (default 0.6) and a simplifying EAD assumption; TTC_PD is set to the stressed-horizon PD.
9. Explainability: SHAP (TreeExplainer) values and a SHAP summary plot.
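
The macro overlay, Monte Carlo and ECL stages can be illustrated with a small NumPy sketch. It is not the code in PD_V6.py: the function names, the macro factors (1.0 and 1.5), the Bernoulli-draw simulation, LGD = 1 - recovery rate, and EAD = Loan_Amount are all assumptions made for illustration.

```python
import numpy as np

def simulate_pd(pit_pd, macro_factor, n_sims=10_000, seed=0):
    """Estimate per-loan PD by simulation (hypothetical helper).

    Assumption: the point-in-time PD is multiplied by a macro factor,
    clipped to [0, 1], and defaults are drawn as Bernoulli trials.
    """
    rng = np.random.default_rng(seed)
    scaled = np.clip(np.asarray(pit_pd, dtype=float) * macro_factor, 0.0, 1.0)
    # One row per simulation, one column per loan.
    defaults = rng.random((n_sims, scaled.size)) < scaled
    # The simulated PD is the share of runs in which the loan defaulted.
    return defaults.mean(axis=0)

def expected_credit_loss(pd_, ead, recovery_rate=0.6):
    """ECL = PD x LGD x EAD, assuming LGD = 1 - recovery rate."""
    lgd = 1.0 - np.asarray(recovery_rate, dtype=float)
    return np.asarray(pd_, dtype=float) * lgd * np.asarray(ead, dtype=float)

# Illustrative inputs only; real values come from the calibrated model.
pit_pd = np.array([0.02, 0.10, 0.35])
loan_amount = np.array([100_000.0, 250_000.0, 50_000.0])  # EAD assumed = Loan_Amount

pd_base = simulate_pd(pit_pd, macro_factor=1.0)    # baseline horizon
pd_stress = simulate_pd(pit_pd, macro_factor=1.5)  # stressed horizon
ecl_base = expected_credit_loss(pd_base, loan_amount)
ecl_stress = expected_credit_loss(pd_stress, loan_amount)
```

With enough simulations the simulated PD converges to the scaled probability, so the simulation mainly matters when the overlay itself is stochastic; here the factor is held fixed to keep the sketch short.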

Model validation & monitoring

validate_pd_model_and_save_to_excel produces a full discrimination, calibration and stability report:

| Area | Metrics |
| --- | --- |
| Discrimination | AUC, Gini, KS statistic |
| Stability | PSI (decile distribution) |
| Association | Pearson correlation |
| Classification | Precision, Recall, F1 |
| Loss | Average LGD, EAD, ECL (both horizons) |
| Explainability | SHAP feature importance |
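
For reference, the three discrimination metrics are closely related: Gini = 2 × AUC - 1, and KS is the largest gap between the cumulative capture rates of defaulters and non-defaulters. The sketch below computes all three with NumPy; the function name auc_gini_ks is hypothetical and the code is not taken from validate_pd_model_and_save_to_excel.

```python
import numpy as np

def auc_gini_ks(y_true, y_score):
    """Return (AUC, Gini, KS) for binary labels and predicted PDs."""
    y = np.asarray(y_true, dtype=int)
    s = np.asarray(y_score, dtype=float)
    order = np.argsort(-s, kind="mergesort")  # highest predicted PD first
    y, s = y[order], s[order]
    # Evaluate cumulative rates only at the end of each block of tied scores.
    last = np.r_[np.flatnonzero(np.diff(s)), y.size - 1]
    tpr = np.r_[0.0, np.cumsum(y)[last] / y.sum()]            # defaulters captured
    fpr = np.r_[0.0, np.cumsum(1 - y)[last] / (1 - y).sum()]  # non-defaulters captured
    # Area under the ROC curve by the trapezoid rule.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    gini = 2.0 * auc - 1.0
    ks = float(np.max(np.abs(tpr - fpr)))
    return auc, gini, ks

auc, gini, ks = auc_gini_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```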

Diagnostic plots

| Plot | File |
| --- | --- |
| ROC curve | `plots/ROC_curve.png` |
| KS plot | `plots/KS_plot.png` |
| PSI decile distribution | `plots/PSI_plot.png` |
| Pearson correlation | `plots/Pearson_plot.png` |
| Precision / Recall / F1 | `plots/PRF_plot.png` |
| SHAP summary | `plots/SHAP_summary.png` |
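
The PSI decile plot compares the share of the population falling in each score decile between a reference sample and a current sample. A minimal sketch of that calculation follows, assuming decile edges are taken from the reference sample and empty bins are floored at a small epsilon; psi_by_decile is a hypothetical name and the samples are synthetic.

```python
import numpy as np

def psi_by_decile(expected, actual, n_bins=10, eps=1e-6):
    """Return (PSI, expected shares, actual shares) over score deciles."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges are the deciles of the reference (expected) scores.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    inner = edges[1:-1]  # outer bins are open-ended, so no score is dropped
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_share = np.bincount(e_idx, minlength=n_bins) / expected.size
    a_share = np.bincount(a_idx, minlength=n_bins) / actual.size
    # Floor the shares to avoid log(0) when a bin is empty.
    e_share = np.clip(e_share, eps, None)
    a_share = np.clip(a_share, eps, None)
    psi = float(np.sum((a_share - e_share) * np.log(a_share / e_share)))
    return psi, e_share, a_share

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 8, size=5000)  # synthetic development-sample PDs
current_scores = rng.beta(3, 8, size=5000)    # synthetic, shifted population

psi_same, _, _ = psi_by_decile(reference_scores, reference_scores)
psi_shift, ref_share, cur_share = psi_by_decile(reference_scores, current_scores)
```

The two share arrays are what a decile bar chart would display side by side; the PSI value summarises their divergence in a single number.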

Repository structure

.
├── src/
│   └── PD_V6.py          # Full PD / ECL / validation pipeline
├── plots/                # Diagnostic and explainability plots
├── requirements.txt
├── LICENSE
└── README.md

Expected data schema

PD_V6.py reads an Excel portfolio file (e.g. FINAL_PORTFOLIO_2003_3D_WithDates.xlsx, not included). Required columns include: Default_Status, Loan_Amount, Interest_Rate, CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M, Business_vintage, DPD_30, DPD_90, Risk_Bucket (A1/A2/B1/B2), Industry, State, Loan_Type. Recovery_Rate is optional.
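
Before running the pipeline it can help to check the portfolio file against this schema. The sketch below is a hypothetical pre-flight check, not part of PD_V6.py; it assumes the list above is the complete set of required columns, which the README does not guarantee.

```python
# Column names copied from the schema above; the helper itself is illustrative.
REQUIRED_COLUMNS = [
    "Default_Status", "Loan_Amount", "Interest_Rate", "CIBIL_Score",
    "Time_in_months", "Age", "Vintage", "ADB", "Enquiries_L3M",
    "Business_vintage", "DPD_30", "DPD_90", "Risk_Bucket",
    "Industry", "State", "Loan_Type",
]
OPTIONAL_COLUMNS = ["Recovery_Rate"]
ALLOWED_RISK_BUCKETS = {"A1", "A2", "B1", "B2"}

def missing_columns(columns):
    """Return the required columns absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

def invalid_risk_buckets(values):
    """Return the distinct Risk_Bucket values outside A1/A2/B1/B2, sorted."""
    return sorted({str(v) for v in values} - ALLOWED_RISK_BUCKETS)

# Example with a header row that lacks ADB.
header = [name for name in REQUIRED_COLUMNS if name != "ADB"]
missing = missing_columns(header)
bad_buckets = invalid_risk_buckets(["A1", "B2", "C1", "A1"])
```

With pandas, the same check could be applied to `df.columns` and `df["Risk_Bucket"]` after loading the workbook with `pd.read_excel`.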

Running

pip install -r requirements.txt
# Place your portfolio Excel file alongside src/ and update the path in the __main__ block,
# then:
python src/PD_V6.py

The script writes a results workbook and a ValidationResults.xlsx report with all metrics and embedded plots. Adjust the output paths in the __main__ block to suit your environment.

Tech stack

Python · XGBoost · scikit-learn · imbalanced-learn (ADASYN) · SHAP · NumPy · pandas · matplotlib · XlsxWriter

License

MIT — see LICENSE.
