Digantdc/pd-credit-risk-model

Probability of Default (PD) Model — XGBoost Credit Risk Pipeline

An end-to-end Probability of Default (PD) modelling pipeline for a retail/SME loan portfolio, built around a calibrated XGBoost classifier. The pipeline covers the full credit-risk workflow: class-imbalance handling, hyperparameter tuning, probability calibration, point-in-time and through-the-cycle PD via Monte Carlo simulation, Expected Credit Loss (ECL) computation, and a complete model-validation / monitoring suite with explainability.

Methodology

The core routine (run_monte_carlo_pd) implements the following stages:

  1. Target construction: a binary Default_Flag is derived from loan status.
  2. Feature engineering: 12 numeric features (Loan_Amount, Interest_Rate, CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M, Business_vintage, DPD_30, DPD_90, Risk_Bucket) and 3 categorical features (Industry, State, Loan_Type), one-hot encoded.
  3. Imbalance correction: ADASYN oversampling inside the training pipeline, so the minority (default) class is learned without synthetic samples leaking into the test fold.
  4. Model + calibration: XGBClassifier wrapped in CalibratedClassifierCV (sigmoid / Platt scaling, 5-fold) to produce well-calibrated probabilities.
  5. Hyperparameter tuning: RandomizedSearchCV (20 iterations, 5-fold, ROC-AUC scoring) over n_estimators, max_depth, learning_rate, gamma, and subsample.
  6. Macro overlay: point-in-time probabilities are scaled by macro factors to produce a baseline (Jan 2026) and a stressed (Jan 2027) view.
  7. Monte Carlo simulation: per-loan PD is estimated by simulation over each horizon.
  8. ECL: ECL = PD × LGD × EAD, with LGD derived from the recovery rate (default 0.6) and a simplifying EAD assumption; TTC_PD is set to the stressed-horizon PD.
  9. Explainability: SHAP (TreeExplainer) values and a SHAP summary plot.
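
Stages 6 to 8 can be illustrated with a minimal, standard-library sketch. It is not the code in PD_V6.py: the function names, the macro factor values, the Bernoulli-draw simulation scheme, and the relation LGD = 1 - recovery rate are all assumptions made for illustration.

```python
import random


def simulate_pd(pit_pd, macro_factor, n_sims=10_000, seed=0):
    """Estimate a loan's PD by simulating default / no-default draws.

    Assumption: the macro overlay is a simple multiplicative scaling of the
    point-in-time probability, capped to the range [0, 1].
    """
    p = min(max(pit_pd * macro_factor, 0.0), 1.0)
    rng = random.Random(seed)
    # Each simulation is one Bernoulli trial: the loan defaults with probability p.
    defaults = sum(rng.random() < p for _ in range(n_sims))
    return defaults / n_sims


def expected_credit_loss(pd_value, recovery_rate, ead):
    """ECL = PD x LGD x EAD, assuming LGD = 1 - recovery_rate."""
    lgd = 1.0 - recovery_rate
    return pd_value * lgd * ead


if __name__ == "__main__":
    pit_pd = 0.08          # hypothetical calibrated model output for one loan
    recovery_rate = 0.6    # the default recovery rate quoted above
    ead = 250_000.0        # hypothetical exposure at default

    # Hypothetical macro factors for the baseline and stressed horizons.
    for label, factor in [("baseline", 1.0), ("stressed", 1.3)]:
        pd_h = simulate_pd(pit_pd, factor, seed=42)
        ecl = expected_credit_loss(pd_h, recovery_rate, ead)
        print(f"{label}: PD={pd_h:.4f}, ECL={ecl:,.2f}")
```

With enough simulations the estimate converges to the scaled probability itself, so the simulation mainly adds sampling noise in this simplified form; a fuller implementation would typically also randomise the macro factors or the horizon.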

Model validation & monitoring

validate_pd_model_and_save_to_excel produces a full discrimination, calibration and stability report:

| Area | Metrics |
| --- | --- |
| Discrimination | AUC, Gini, KS statistic |
| Stability | PSI (decile distribution) |
| Association | Pearson correlation |
| Classification | Precision, Recall, F1 |
| Loss | Average LGD, EAD, ECL (both horizons) |
| Explainability | SHAP feature importance |
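
For readers unfamiliar with the discrimination and stability metrics above, the following is a small pure-Python sketch of their standard definitions. The function names are hypothetical and this is not the implementation in validate_pd_model_and_save_to_excel; in particular, psi here takes pre-computed bin shares (for example decile shares) rather than raw scores.

```python
import math


def _split(y_true, scores):
    """Separate scores of defaulters (label 1) and non-defaulters (label 0)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return pos, neg


def auc(y_true, scores):
    """ROC-AUC: chance that a random defaulter scores above a random
    non-defaulter, with ties counted as half."""
    pos, neg = _split(y_true, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient, defined as 2 * AUC - 1."""
    return 2.0 * auc(y_true, scores) - 1.0


def ks_statistic(y_true, scores):
    """Maximum gap between the cumulative score distributions of the two classes."""
    pos, neg = _split(y_true, scores)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def psi(expected_share, actual_share, eps=1e-6):
    """Population Stability Index over matching bins of population shares.

    eps guards against log(0) when a bin is empty.
    """
    total = 0.0
    for e, a in zip(expected_share, actual_share):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, s), gini(y, s), ks_statistic(y, s))
    print(psi([0.5, 0.5], [0.6, 0.4]))
```

The pairwise AUC is quadratic in the number of loans, which is fine for illustration; on a real portfolio a library routine such as scikit-learn's roc_auc_score would be the usual choice.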

Diagnostic plots

| Plot | File |
| --- | --- |
| ROC curve | plots/ROC_curve.png |
| KS plot | plots/KS_plot.png |
| PSI decile distribution | plots/PSI_plot.png |
| Pearson correlation | plots/Pearson_plot.png |
| Precision / Recall / F1 | plots/PRF_plot.png |
| SHAP summary | plots/SHAP_summary.png |

Repository structure

.
├── src/
│   └── PD_V6.py          # Full PD / ECL / validation pipeline
├── plots/                # Diagnostic and explainability plots
├── requirements.txt
├── LICENSE
└── README.md

Expected data schema

PD_V6.py reads an Excel portfolio file (e.g. FINAL_PORTFOLIO_2003_3D_WithDates.xlsx, not included). Required columns include: Default_Status, Loan_Amount, Interest_Rate, CIBIL_Score, Time_in_months, Age, Vintage, ADB, Enquiries_L3M, Business_vintage, DPD_30, DPD_90, Risk_Bucket (A1/A2/B1/B2), Industry, State, Loan_Type. Recovery_Rate is optional.
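
Before running the pipeline it can help to check that a portfolio file carries the columns listed above. The helper below is a hypothetical sketch, not part of PD_V6.py, and it assumes the list in this README is the complete set of required columns, which the source does not guarantee.

```python
# Column names copied from the schema description above.
REQUIRED_COLUMNS = [
    "Default_Status", "Loan_Amount", "Interest_Rate", "CIBIL_Score",
    "Time_in_months", "Age", "Vintage", "ADB", "Enquiries_L3M",
    "Business_vintage", "DPD_30", "DPD_90", "Risk_Bucket",
    "Industry", "State", "Loan_Type",
]
OPTIONAL_COLUMNS = ["Recovery_Rate"]


def missing_columns(columns):
    """Return the required column names absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    # With pandas this would typically be called as missing_columns(df.columns).
    header = ["Loan_Amount", "Interest_Rate", "CIBIL_Score", "Recovery_Rate"]
    gaps = missing_columns(header)
    if gaps:
        print("Missing required columns:", ", ".join(gaps))
```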

Running

pip install -r requirements.txt
# Place your portfolio Excel file alongside src/ and update the path in the __main__ block,
# then:
python src/PD_V6.py

The script writes a results workbook and a ValidationResults.xlsx report with all metrics and embedded plots. Adjust the output paths in the __main__ block to suit your environment.

Tech stack

Python · XGBoost · scikit-learn · imbalanced-learn (ADASYN) · SHAP · NumPy · pandas · matplotlib · XlsxWriter

License

MIT — see LICENSE.

About

Calibrated XGBoost Probability-of-Default model for a loan portfolio — ADASYN imbalance handling, Monte Carlo PD/ECL, SHAP explainability, and a full validation & monitoring suite (AUC, Gini, KS, PSI).
