An R-based analysis project for predicting credit default using exploratory data analysis, logistic regression, and decision trees. Developed as part of the MSc Data Science Intelligence Data Analysis module.
The dataset consists of 14 variables describing loan default events. The response variable is default (encoded as 1/0), with approximately 20% of customers defaulting. The goal is to identify which covariates affect default status and build classifiers for loan rejection decisions.
- What are the most important covariates that play a significant role in defaulting?
- Which classifier is more suitable for our final recommendation?
- How can different objectives lead to different models?
```r
install.packages(c("caret", "ggplot2", "gridExtra", "PerformanceAnalytics",
                   "Information", "pROC", "rpart", "rpart.plot"))
```

- Dataset: `intelligence_dataset.csv`
- Target variable: `default` (yes/no)
Update the file path in the script to match your local setup:
```r
data <- read.csv("path/to/intelligence_dataset.csv")
```

- Correlation: Income is highly correlated with valuation and monthly expenses → PCA used for dimensionality reduction
- PCA: The first two principal components explain ~80% of the variance; PC1 (income, monthly expenses, valuation), PC2 (liabilities)
- Categorical variables: Barplots of conditional probabilities for employment, dependants, property, fraud, repossess, bankruptcy, l.insurance, customer, prop.owner, purpose
- Chi-square tests: Property, Repossess, Bankruptcy, and Prop.owner show discriminatory power; Fraud is a perfect separator (no-conviction → 0 defaults)
- Continuous variables: Histograms (Sturges' rule) and boxplots for income, liabilities, valuation, month.expenses
- Weight of Evidence (WOE) & Information Value (IV): Property "none" → positive WOE (don't default), "flat" → negative (default); lower income → default
- Backward stepwise selection (AIC and BIC) and StepCV (10-fold CV)
- Train/test split: 75/25
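
The PCA step and the 75/25 split described above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the project's script: the data frame `sim`, its coefficients, and the use of base `sample()` for the split are assumptions (the project may well use `caret` for partitioning).

```r
# Simulated stand-in for the four continuous covariates (NOT the real data).
set.seed(1)
n <- 400
income <- rnorm(n, mean = 3000, sd = 800)
sim <- data.frame(
  income         = income,
  valuation      = 50 * income + rnorm(n, sd = 15000),
  month.expenses = 0.4 * income + rnorm(n, sd = 150),
  liabilities    = rnorm(n, mean = 10000, sd = 4000)
)

# PCA on centred and scaled variables.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the first two component scores as model covariates.
scores <- as.data.frame(pca$x[, 1:2])

# 75/25 train/test split on row indices.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- scores[train_idx, ]
test  <- scores[-train_idx, ]
```

Inspecting `cum_var[2]` shows how much variance the first two components retain, and `pca$rotation` shows which original variables drive each component.
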
Variable selection (Table 2)
| Method | Selected variables |
|---|---|
| Backward AIC | Dependants, Property, PC1, PC2 |
| Backward BIC | Property, PC1, PC2 |
| StepCV | Property, PC1 |
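
As a rough illustration of how the AIC and BIC rows can differ, backward elimination can be run with base R's `step()`, where the penalty `k = 2` gives AIC and `k = log(n)` gives BIC. The simulated data, the `noise` covariate, and the coefficients below are hypothetical; the StepCV procedure is not reproduced here.

```r
# Simulated data with two informative covariates and one pure-noise covariate.
set.seed(2)
n <- 500
sim <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n), noise = rnorm(n))
sim$default <- rbinom(n, 1, plogis(-1.5 + 1.2 * sim$PC1 + 0.5 * sim$PC2))

# Full logistic regression model as the starting point.
full <- glm(default ~ PC1 + PC2 + noise, data = sim, family = binomial)

# Backward elimination: k = 2 is the AIC penalty, k = log(n) the BIC penalty.
fit_aic <- step(full, direction = "backward", k = 2, trace = 0)
fit_bic <- step(full, direction = "backward", k = log(nrow(sim)), trace = 0)

formula(fit_aic)
formula(fit_bic)
```

Because BIC penalises each extra parameter more heavily than AIC once n exceeds about 8, it tends to return the smaller model, which is consistent with the table above.
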
Objective: Limit False Negatives → optimize Sensitivity (target ≥85%) with threshold 0.22
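
One way to arrive at such a threshold is to scan a grid of cutoffs and keep the largest one whose sensitivity still meets the target. The sketch below assumes this grid-search approach; the helper names `sensitivity_at` and `pick_threshold` and the toy vectors are hypothetical, and the project may have chosen 0.22 differently (for example from the ROC curve).

```r
# Sensitivity = TP / (TP + FN) for a given probability cutoff.
sensitivity_at <- function(prob, actual, threshold) {
  predicted <- as.integer(prob > threshold)
  sum(predicted == 1 & actual == 1) / sum(actual == 1)
}

# Largest cutoff on a grid whose sensitivity still meets the target.
pick_threshold <- function(prob, actual, target = 0.85,
                           grid = seq(0.01, 0.99, by = 0.01)) {
  sens <- vapply(grid, function(t) sensitivity_at(prob, actual, t), numeric(1))
  ok <- grid[sens >= target]
  if (length(ok) == 0) NA_real_ else max(ok)
}

# Toy predicted probabilities and true labels (1 = default).
prob   <- c(0.05, 0.10, 0.15, 0.255, 0.30, 0.40, 0.60, 0.80, 0.90, 0.95)
actual <- c(0,    0,    0,    1,     0,    0,    1,    1,    1,    1)

thr <- pick_threshold(prob, actual, target = 0.85)
```

In practice the threshold should be tuned on training or validation predictions, then evaluated once on the held-out test set.
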
- Information gain and entropy for splits
- Complexity parameter (cp) = 0.0083 after 1-SD rule pruning
- 5 splits, 6 terminal nodes; PC1 as root split
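
The tree-growing and pruning steps above can be sketched with `rpart` as follows. The data are simulated, and the pruning rule shown uses the cross-validated error plus one `xstd` from `rpart`'s cp table, which is assumed here to correspond to the 1-SD rule mentioned above.

```r
library(rpart)

# Simulated data: default depends mainly on PC1 (hypothetical coefficients).
set.seed(3)
n <- 600
sim <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
sim$default <- factor(rbinom(n, 1, plogis(-1.5 + 1.5 * sim$PC1)),
                      levels = c(0, 1), labels = c("no", "yes"))

# Grow a deliberately large tree using entropy-based (information) splits.
tree <- rpart(default ~ PC1 + PC2, data = sim, method = "class",
              parms = list(split = "information"),
              control = rpart.control(cp = 0.001, xval = 10))

# cptable columns: CP, nsplit, rel error, xerror (CV error), xstd.
cp_tab <- tree$cptable
best   <- which.min(cp_tab[, "xerror"])
cutoff <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]

# Simplest tree whose CV error is within one xstd of the minimum.
chosen <- min(which(cp_tab[, "xerror"] <= cutoff))
pruned <- prune(tree, cp = cp_tab[chosen, "CP"])
```

`printcp(tree)` prints the same table, and `rpart.plot::rpart.plot(pruned)` draws the pruned tree.
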
| Model | AUC | Accuracy | F1 | Sensitivity | Threshold |
|---|---|---|---|---|---|
| Logistic (with objective) | 91% | 84% | 69% | 85.6% | 0.22 |
| Decision Tree (with objective) | 84% | 77% | 62% | 85.6% | 0.22 |
| Logistic (no objective) | 91% | 87% | 66% | 59% | 0.5 |
| Decision Tree (no objective) | 84% | 87% | 65% | 59% | 0.5 |
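
To make the columns of this table concrete, the sketch below computes accuracy, sensitivity, and F1 at two thresholds, plus a threshold-independent AUC via the rank-sum identity. The function names and toy vectors are hypothetical, and the rank-based AUC is a stand-in for whatever `pROC` call the project uses.

```r
# Accuracy, sensitivity and F1 from predicted probabilities at a cutoff.
classification_metrics <- function(prob, actual, threshold) {
  predicted <- as.integer(prob > threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = sensitivity,
    f1          = 2 * precision * sensitivity / (precision + sensitivity))
}

# AUC via the rank-sum (Mann-Whitney) identity; independent of any cutoff.
auc_rank <- function(prob, actual) {
  r  <- rank(prob)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy predicted probabilities and true labels (1 = default).
prob   <- c(0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90)
actual <- c(0,    0,    1,    0,    0,    1,    1,    1)

m_low  <- classification_metrics(prob, actual, 0.22)
m_half <- classification_metrics(prob, actual, 0.50)
auc    <- auc_rank(prob, actual)
```

On this toy example the lower threshold raises sensitivity from 0.75 to 1 while AUC stays at 0.875, mirroring how the "with objective" rows trade other metrics for sensitivity without changing AUC.
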
Logistic Regression (Backward selection with BIC) is the recommended model. It achieves:
- AUC 91.4%
- Sensitivity 85.6% at threshold 0.22 (target ≥85%)
- Covariates: Property, PC1, PC2
A customer is classified as a defaulter when the estimated probability of default exceeds 0.22. This strategy prioritizes correctly identifying potential defaulters over granting more loans.
Bar charts for categorical variables: count, relfreq, or conditional probability.
Conditional histograms and boxplots of continuous variables by default status.
- Correlation plot and PCA biplot
- Barplots for categorical predictors vs. default
- Histograms and boxplots for continuous variables
- WOE/IV summary tables and plots
- CV error and AUC plots
- ROC curves
- Confusion matrices
Paschalis Itsios (35193390)
MSc Data Science
Academic use only — for coursework submission.