Intelligence Data Analysis — Credit Default Prediction

An R-based analysis project for predicting credit default using exploratory data analysis, logistic regression, and decision trees. Developed as part of the MSc Data Science Intelligence Data Analysis module.

Overview

The dataset consists of 14 variables describing loan default events. The response variable is default (encoded as 1/0), with approximately 20% of customers defaulting. The goal is to identify which covariates affect default status and build classifiers for loan rejection decisions.

Research Questions

What are the most important covariates that play a significant role in defaulting?
Which classifier is more suitable for our final recommendation?
How can different objectives lead to different models?

Requirements

R Packages

install.packages(c("caret", "ggplot2", "gridExtra", "PerformanceAnalytics", 
                   "Information", "pROC", "rpart", "rpart.plot"))

Data

Dataset: intelligence_dataset.csv
Target variable: default (yes/no)

Data Path

Update the file path in the script to match your local setup:

data <- read.csv("path/to/intelligence_dataset.csv")
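The Overview describes the response as 1/0 while the Data section lists it as yes/no, so it may be worth checking the encoding right after loading. The following is a minimal sketch, assuming the response column is named `default`; the toy CSV and the helper `to_binary` are made up for illustration and stand in for the real file and script.

```r
# Toy stand-in for intelligence_dataset.csv (values are illustrative only).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("income,default",
             "1200,no", "800,yes", "1500,no", "950,no", "700,yes"),
           toy_path)

data <- read.csv(toy_path, stringsAsFactors = FALSE)

# Hypothetical helper: normalise the response to 0/1 whether it
# arrives as yes/no or as 1/0.
to_binary <- function(x) {
  as.integer(tolower(as.character(x)) %in% c("yes", "1"))
}
data$default <- to_binary(data$default)

# Class balance: counts per class and the share of defaulters.
default_rate <- mean(data$default)
print(table(data$default))
print(default_rate)
```

On the real dataset, `default_rate` should come out near the roughly 20% stated in the Overview.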

Project Structure

1. Exploratory Data Analysis

Correlation: Income is highly correlated with valuation and monthly expenses → PCA used for dimensionality reduction
PCA: The first two principal components explain ~80% of the variance; PC1 loads on income, monthly expenses, and valuation, PC2 on liabilities
Categorical variables: Barplots of conditional probabilities for employment, dependants, property, fraud, repossess, bankruptcy, l.insurance, customer, prop.owner, purpose
Chi-square tests: Property, Repossess, Bankruptcy, and Prop.owner show discriminatory power; Fraud is a perfect separator (no-conviction → 0 defaults)
Continuous variables: Histograms (Sturges' rule) and boxplots for income, liabilities, valuation, month.expenses
Weight of Evidence (WOE) & Information Value (IV): Property "none" has positive WOE (associated with not defaulting), "flat" has negative WOE (associated with defaulting); lower income is associated with default
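To make the WOE/IV reading above concrete, here is a small hand computation in base R. It is a sketch under stated assumptions, not the project's code: the data are made up, `woe_table` is a hypothetical helper, and the sign convention (log of the non-defaulter share over the defaulter share, so positive WOE points towards not defaulting) is chosen to match the interpretation given above. Other tools may use the opposite sign.

```r
# Toy data (made up for illustration): property type and default flag.
property <- c(rep("none", 40), rep("flat", 30), rep("house", 30))
default  <- c(rep(0, 36), rep(1, 4),    # none:   4 of 40 default
              rep(0, 18), rep(1, 12),   # flat:  12 of 30 default
              rep(0, 26), rep(1, 4))    # house:  4 of 30 default

# Hypothetical helper: WOE and IV contribution per level of x.
# Levels with a zero count in either class would give infinite WOE
# and need binning or smoothing first.
woe_table <- function(x, y) {
  counts <- table(x, y)                       # rows: levels, cols: "0" / "1"
  good <- counts[, "0"] / sum(counts[, "0"])  # share of non-defaulters per level
  bad  <- counts[, "1"] / sum(counts[, "1"])  # share of defaulters per level
  woe  <- log(good / bad)
  data.frame(level = rownames(counts), good = as.numeric(good),
             bad = as.numeric(bad), woe = as.numeric(woe),
             iv = as.numeric((good - bad) * woe))
}

res <- woe_table(property, default)
print(res)

# Information Value of the variable: sum of the per-level contributions.
total_iv <- sum(res$iv)
print(total_iv)
```

In this toy example "none" gets a positive WOE and "flat" a negative one, mirroring the direction reported above.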

2. Modelling

Logistic Regression

Backward stepwise selection (AIC and BIC) and StepCV (10-fold CV)
Train/test split: 75/25
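The steps above (principal components as predictors, a 75/25 split, backward elimination under AIC and BIC) could look roughly like the sketch below. Everything here is an assumption for illustration: the data are simulated, the coefficients are arbitrary, and the cross-validated StepCV variant is not shown. Base R's `step()` uses the penalty `k = 2` for AIC and `k = log(n)` for BIC.

```r
set.seed(1)
n <- 400
income <- rnorm(n, 3000, 800)

# Simulated data: valuation and monthly expenses are built to correlate with income.
dat <- data.frame(
  income         = income,
  valuation      = 50 * income + rnorm(n, 0, 20000),
  month.expenses = 0.4 * income + rnorm(n, 0, 150),
  liabilities    = rnorm(n, 5000, 2000),
  property       = factor(sample(c("none", "flat", "house"), n, replace = TRUE))
)
# Simulated response: lower income raises the default probability.
dat$default <- rbinom(n, 1, plogis(-1.5 - 1.2 * as.numeric(scale(income))))

# PCA on the standardised continuous variables; keep the first two scores.
pca <- prcomp(dat[, c("income", "valuation", "month.expenses", "liabilities")],
              center = TRUE, scale. = TRUE)
dat$PC1 <- pca$x[, 1]
dat$PC2 <- pca$x[, 2]

# 75/25 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Full model, then backward elimination under AIC (k = 2) and BIC (k = log n).
full    <- glm(default ~ property + PC1 + PC2, data = train, family = binomial)
fit_aic <- step(full, direction = "backward", k = 2, trace = 0)
fit_bic <- step(full, direction = "backward", k = log(nrow(train)), trace = 0)

print(formula(fit_aic))
print(formula(fit_bic))
```

Fitting the PCA before the split, as done here, is a simplification; fitting it on the training rows only and projecting the test rows would avoid leaking test information into the components.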

Variable selection (Table 2)

Method	Selected variables
Backward AIC	Dependants, Property, PC1, PC2
Backward BIC	Property, PC1, PC2
StepCV	Property, PC1

Objective: Limit False Negatives → optimize Sensitivity (target ≥85%) with threshold 0.22
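One way to arrive at a threshold that meets a sensitivity target is to scan a grid of cut-offs and keep the largest one that still satisfies the target, since a larger cut-off rejects fewer applicants. The sketch below illustrates that idea only; the probabilities and labels are made up, `sensitivity_at` is a hypothetical helper, and the source does not state that the 0.22 threshold was found this way.

```r
# Made-up predicted probabilities and true labels (1 = default).
prob  <- c(0.055, 0.105, 0.155, 0.205, 0.255, 0.305, 0.405, 0.555, 0.705, 0.905)
truth <- c(0,     0,     1,     0,     1,     0,     1,     1,     1,     1)

# Hypothetical helper: share of true defaulters flagged at a given threshold.
sensitivity_at <- function(threshold, prob, truth) {
  pred <- as.integer(prob > threshold)
  sum(pred == 1 & truth == 1) / sum(truth == 1)
}

# Scan a grid and keep the largest threshold that still meets the target.
grid   <- seq(0.01, 0.99, by = 0.01)
sens   <- sapply(grid, sensitivity_at, prob = prob, truth = truth)
target <- 0.85
chosen <- max(grid[sens >= target])

print(chosen)
print(sensitivity_at(chosen, prob, truth))
```

In practice the scan would be run on held-out or cross-validated predictions, so that the chosen threshold is not tuned to the training data.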

Decision Trees

Information gain and entropy for splits
Complexity parameter (cp) = 0.0083 after pruning with the 1-SE (one-standard-error) rule
5 splits, 6 terminal nodes; PC1 as root split
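The tree-growing and pruning procedure listed above can be sketched with `rpart` as follows. This is an illustration on simulated data, not the project's script: the variable effects are arbitrary, and the pruning step implements the common one-standard-error reading of the rule (pick the simplest tree whose cross-validated error is within one `xstd` of the minimum), which is assumed to be what the rule above refers to.

```r
library(rpart)

set.seed(42)
n <- 500
toy <- data.frame(
  PC1      = rnorm(n),
  PC2      = rnorm(n),
  property = factor(sample(c("none", "flat", "house"), n, replace = TRUE))
)
# Simulated response: low PC1 raises the default probability.
toy$default <- factor(rbinom(n, 1, plogis(-1.5 - 1.5 * toy$PC1)),
                      levels = c(0, 1), labels = c("no", "yes"))

# Grow a deliberately large tree; split = "information" selects entropy-based splits.
tree <- rpart(default ~ PC1 + PC2 + property, data = toy, method = "class",
              parms = list(split = "information"),
              control = rpart.control(cp = 0.001, xval = 10))

# Pruning: among the rows of the cp table, take the smallest tree whose
# cross-validated error is within one xstd of the minimum.
cpt    <- tree$cptable
best   <- which.min(cpt[, "xerror"])
cutoff <- cpt[best, "xerror"] + cpt[best, "xstd"]
chosen <- min(which(cpt[, "xerror"] <= cutoff))
pruned <- prune(tree, cp = cpt[chosen, "CP"])

printcp(pruned)
```

`rpart.plot::rpart.plot(pruned)` from the package list above could then be used to draw the pruned tree.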

3. Model Evaluation

Model	AUC	Accuracy	F1	Sensitivity	Threshold
Logistic (with objective)	91%	84%	69%	85.6%	0.22
Decision Tree (with objective)	84%	77%	62%	85.6%	0.22
Logistic (no objective)	91%	87%	66%	59%	0.5
Decision Tree (no objective)	84%	87%	65%	59%	0.5
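For reference, the metrics in the table can be computed from predicted probabilities and true labels as in the sketch below. The data are made up and the helper names `classification_metrics` and `auc_rank` are hypothetical; the project itself lists `caret` and `pROC` for this purpose. The AUC here uses the rank (Mann-Whitney) formula, which does not depend on the threshold, consistent with the table showing the same AUC for both thresholds of a given model.

```r
# Made-up labels and scores for illustration (1 = default).
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.6, 0.3, 0.1, 0.4, 0.2, 0.2, 0.1, 0.05, 0.02)

# Hypothetical helper: accuracy, sensitivity and F1 at a given threshold.
classification_metrics <- function(truth, prob, threshold) {
  pred <- as.integer(prob > threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = sensitivity,
    f1          = 2 * precision * sensitivity / (precision + sensitivity))
}

# Hypothetical helper: AUC via the rank formula; ties get average ranks.
auc_rank <- function(truth, prob) {
  r  <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

m_default <- classification_metrics(truth, prob, threshold = 0.5)
m_lowered <- classification_metrics(truth, prob, threshold = 0.22)
print(rbind(threshold_0.50 = m_default, threshold_0.22 = m_lowered))
print(auc_rank(truth, prob))
```

Lowering the threshold raises sensitivity in this toy example as well, which is the trade-off the "with objective" rows reflect.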

Final Recommendation

Logistic Regression (Backward selection with BIC) is the recommended model. It achieves:

AUC 91.4%
Sensitivity 85.6% at threshold 0.22 (target ≥85%)
Covariates: Property, PC1, PC2

A customer is classified as a defaulter when the estimated probability of default exceeds 0.22. This strategy prioritizes correctly identifying potential defaulters over granting more loans.

Custom Functions

`cc_barplot(Data, x, y, freq)`

Bar charts for categorical variables: count, relfreq, or conditional probability.
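The body of `cc_barplot` is not reproduced in this README. As a rough illustration of the conditional-probability view only, the sketch below computes P(default | category) with base R; the data are made up and `cond_prob_table` is a hypothetical helper, not the project's function.

```r
# Toy data (made up): employment status and default flag.
employment <- c("employed", "employed", "employed", "employed",
                "self", "self", "unemployed", "unemployed")
default    <- c(0, 0, 0, 1, 0, 1, 1, 1)

# Hypothetical helper: P(default = k | category), i.e. row-wise
# proportions of the contingency table.
cond_prob_table <- function(x, y) prop.table(table(x, y), margin = 1)

cp <- cond_prob_table(employment, default)
print(round(cp, 2))

# Draw the default rate per category when running interactively.
if (interactive()) {
  barplot(cp[, "1"], ylim = c(0, 1),
          ylab = "P(default | category)", main = "Default rate by employment")
}
```

Counts and relative frequencies would come from `table(x, y)` and `prop.table(table(x, y))` respectively.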

`cc_hist()` and `cc_boxplot()`

Conditional histograms and boxplots of continuous variables by default status.

Key Outputs

Correlation plot and PCA biplot
Barplots for categorical predictors vs. default
Histograms and boxplots for continuous variables
WOE/IV summary tables and plots
CV error and AUC plots
ROC curves
Confusion matrices

Author

Paschalis Itsios (35193390)
MSc Data Science

References

Intelligent Information and Database Systems: 9th Asian Conference, ACIIDS 2017, Kanazawa, Japan, April 3–5, 2017, Proceedings, Part II.
Grus, J. (2015). Data Science from Scratch (First Edition). O'Reilly Media, Inc.

License

Academic use only — for coursework submission.

