itsiospaschalis/CreditScoreAnalysis

Intelligence Data Analysis — Credit Default Prediction

An R-based analysis project for predicting credit default using exploratory data analysis, logistic regression, and decision trees. Developed for the Intelligence Data Analysis module of the MSc Data Science programme.

Overview

The dataset consists of 14 variables describing loan default events. The response variable is default (encoded as 1/0), with approximately 20% of customers defaulting. The goal is to identify which covariates affect default status and build classifiers for loan rejection decisions.

Research Questions

  1. Which covariates play the most significant role in default?
  2. Which classifier is more suitable for our final recommendation?
  3. How can different objectives lead to different models?

Requirements

R Packages

install.packages(c("caret", "ggplot2", "gridExtra", "PerformanceAnalytics", 
                   "Information", "pROC", "rpart", "rpart.plot"))

Data

  • Dataset: intelligence_dataset.csv
  • Target variable: default (yes/no)

Data Path

Update the file path in the script to match your local setup:

data <- read.csv("path/to/intelligence_dataset.csv")

Project Structure

1. Exploratory Data Analysis

  • Correlation: Income is highly correlated with valuation and monthly expenses → PCA used for dimensionality reduction
  • PCA: The first two principal components explain ~80% of the variance; PC1 is driven by income, monthly expenses, and valuation, PC2 by liabilities
  • Categorical variables: Barplots of conditional probabilities for employment, dependants, property, fraud, repossess, bankruptcy, l.insurance, customer, prop.owner, purpose
  • Chi-square tests: Property, Repossess, Bankruptcy, and Prop.owner show discriminatory power; Fraud is a perfect separator (no-conviction → 0 defaults)
  • Continuous variables: Histograms (bin count from Sturges' rule) and boxplots for income, liabilities, valuation, month.expenses
  • Weight of Evidence (WOE) & Information Value (IV): Property "none" → positive WOE (don't default), "flat" → negative (default); lower income → default
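
The PCA step above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's script: the column names follow the variables listed in this README, but every value is simulated, so the variance shares will differ from the ~80% reported for the real dataset.

```r
# Synthetic stand-in for the four continuous covariates (values are made up).
set.seed(1)
n <- 500
income <- rnorm(n, mean = 3000, sd = 800)
sim <- data.frame(
  income         = income,
  valuation      = 50 * income + rnorm(n, sd = 15000),
  month.expenses = 0.4 * income + rnorm(n, sd = 150),
  liabilities    = rnorm(n, mean = 10000, sd = 4000)
)

# Centre and scale so that variables on different scales contribute equally.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(cum_explained, 3))

# Loadings show which original variables drive each component.
print(round(pca$rotation[, 1:2], 2))

# Component scores that can replace the correlated inputs in a model.
scores <- as.data.frame(pca$x[, 1:2])
```

Using the scores instead of the raw variables avoids feeding strongly collinear predictors into the logistic regression.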

2. Modelling

Logistic Regression

  • Backward stepwise selection (AIC and BIC) and StepCV (10-fold CV)
  • Train/test split: 75/25

Variable selection (Table 2)

| Method       | Selected variables             |
|--------------|--------------------------------|
| Backward AIC | Dependants, Property, PC1, PC2 |
| Backward BIC | Property, PC1, PC2             |
| StepCV       | Property, PC1                  |
Objective: limit false negatives by optimizing sensitivity (target ≥85%), which leads to a classification threshold of 0.22.
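
One way to turn a sensitivity target into a threshold is a grid search over candidate cut-offs. The sketch below assumes synthetic probabilities and labels, and the helper name sensitivity_at is hypothetical, so the chosen value will not equal the 0.22 found on the real data.

```r
# Synthetic predicted probabilities and true labels (1 = default); made up.
set.seed(7)
n <- 1000
y <- rbinom(n, 1, 0.2)
p_hat <- plogis(-2 + 2.5 * y + rnorm(n))

# Sensitivity = share of actual defaulters flagged at a given threshold.
sensitivity_at <- function(threshold, prob, truth) {
  mean(prob[truth == 1] > threshold)
}

grid <- seq(0.01, 0.99, by = 0.01)
sens <- vapply(grid, sensitivity_at, numeric(1), prob = p_hat, truth = y)

# Among thresholds that meet the target, the largest one flags the fewest
# customers, which keeps false positives as low as the constraint allows.
target <- 0.85
chosen <- max(grid[sens >= target])
print(chosen)
```

Sensitivity can only fall as the threshold rises, so the largest qualifying threshold sits right at the boundary of the constraint.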

Decision Trees

  • Information gain and entropy for splits
  • Complexity parameter (cp) = 0.0083 after pruning with the 1-SE (one-standard-error) rule
  • 5 splits, 6 terminal nodes; PC1 as root split
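
A tree with entropy-based splits and one-standard-error pruning can be sketched with rpart as below. The data are synthetic and the control settings (cp = 0.001, 10-fold cross-validation) are assumptions for illustration, so the resulting cp and tree size will differ from the values listed above.

```r
library(rpart)

# Synthetic data; PC1/PC2 names follow the README, values are made up.
set.seed(123)
n <- 1000
sim <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
p_default <- plogis(-1.5 + 1.5 * sim$PC1 + 0.5 * sim$PC2)
sim$default <- factor(ifelse(rbinom(n, 1, p_default) == 1, "yes", "no"))

# Grow a deliberately large tree using entropy (information gain) splits.
tree <- rpart(default ~ PC1 + PC2, data = sim, method = "class",
              parms = list(split = "information"),
              control = rpart.control(cp = 0.001, xval = 10))

# One-standard-error rule: take the simplest tree whose cross-validated
# error is within one standard error of the minimum.
cpt <- tree$cptable
best <- which.min(cpt[, "xerror"])
cutoff <- cpt[best, "xerror"] + cpt[best, "xstd"]
pick <- which(cpt[, "xerror"] <= cutoff)[1]  # rows run from small to large trees
cp_1se <- cpt[pick, "CP"]

pruned <- prune(tree, cp = cp_1se)
print(cp_1se)
```

The rule trades a small, statistically insignificant loss in cross-validated error for a smaller and more interpretable tree.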

3. Model Evaluation

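| Model                          | AUC | Accuracy | F1  | Sensitivity | Threshold |
|--------------------------------|-----|----------|-----|-------------|-----------|
| Logistic (with objective)      | 91% | 84%      | 69% | 85.6%       | 0.22      |
| Decision Tree (with objective) | 84% | 77%      | 62% | 85.6%       | 0.22      |
| Logistic (no objective)        | 91% | 87%      | 66% | 59%         | 0.5       |
| Decision Tree (no objective)   | 84% | 87%      | 65% | 59%         | 0.5       |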
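
The metrics in the table can be computed from predicted probabilities with a few lines of base R. The sketch below uses synthetic probabilities and labels, and the helper names evaluate and auc are hypothetical, so its numbers are illustrative only and are not the project's results.

```r
# Synthetic labels and predicted probabilities (1 = default); values made up.
set.seed(99)
n <- 1000
y <- rbinom(n, 1, 0.2)
p_hat <- plogis(-2 + 2.5 * y + rnorm(n))

# Threshold-dependent metrics from the confusion-matrix counts.
evaluate <- function(prob, truth, threshold) {
  pred <- as.integer(prob > threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  precision <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  c(accuracy = (tp + tn) / length(truth),
    sensitivity = sensitivity,
    f1 = 2 * precision * sensitivity / (precision + sensitivity))
}

# AUC via the rank (Mann-Whitney) formulation; it does not use a threshold.
auc <- function(prob, truth) {
  r <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

print(round(rbind(t_0.22 = evaluate(p_hat, y, 0.22),
                  t_0.50 = evaluate(p_hat, y, 0.50)), 3))
print(round(auc(p_hat, y), 3))
```

AUC is a property of the ranking of the scores, which is why it is the same for the "with objective" and "no objective" rows of each model, while accuracy, F1, and sensitivity shift with the threshold.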

Final Recommendation

Logistic Regression (Backward selection with BIC) is the recommended model. It achieves:

  • AUC 91.4%
  • Sensitivity of 85.6% (target ≥85%) at threshold 0.22
  • Covariates: Property, PC1, PC2

A customer is classified as a defaulter when the estimated probability of default exceeds 0.22. This strategy prioritizes correctly identifying potential defaulters over granting more loans.


Custom Functions

cc_barplot(Data, x, y, freq)

Bar charts for categorical variables showing counts, relative frequencies (relfreq), or conditional probabilities.
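
The three quantities such a chart can display are sketched below in base R. The helper cc_table is hypothetical and is not the repository's cc_barplot; the option name "condprob" is an assumption, and the data are synthetic.

```r
# Synthetic data; column names follow the README, values are made up.
set.seed(5)
n <- 400
sim <- data.frame(
  property = factor(sample(c("none", "flat", "house"), n, replace = TRUE)),
  default  = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
)

# Hypothetical helper (not the repository's cc_barplot): returns the table
# that a bar chart of the chosen type would display.
cc_table <- function(data, x, y, freq = c("count", "relfreq", "condprob")) {
  freq <- match.arg(freq)
  counts <- table(data[[x]], data[[y]], dnn = c(x, y))
  switch(freq,
         count    = counts,
         relfreq  = prop.table(counts),              # shares of all rows
         condprob = prop.table(counts, margin = 1))  # P(y | x): rows sum to 1
}

cond <- cc_table(sim, "property", "default", "condprob")
print(round(cond, 3))
```

Passing the transposed result to barplot(t(cond), beside = TRUE) would draw one group of bars per level of x, which makes differences in default rates across levels easy to compare.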

cc_hist() and cc_boxplot()

Conditional histograms and boxplots of continuous variables by default status.


Key Outputs

  • Correlation plot and PCA biplot
  • Barplots for categorical predictors vs. default
  • Histograms and boxplots for continuous variables
  • WOE/IV summary tables and plots
  • CV error and AUC plots
  • ROC curves
  • Confusion matrices

Author

Paschalis Itsios (35193390)
MSc Data Science

References

  1. Intelligent Information and Database Systems: 9th Asian Conference, ACIIDS 2017, Kanazawa, Japan, April 3–5, 2017, Proceedings, Part II.
  2. Grus, J. (2015). Data Science from Scratch (First Edition). O'Reilly Media, Inc.

License

Academic use only — for coursework submission.
