Loan-Risk-Predicting

Loan risk prediction is a critical process in the financial industry, designed to evaluate the probability that a borrower will default on a loan. This involves a series of steps that leverage data analysis and machine learning to provide accurate risk assessments.

Technologies Used

• Python • Pandas • NumPy • Scikit-learn • Matplotlib • Seaborn

Data Exploration

The dataset was provided in JSON format and included various financial features. The data exploration process involved understanding the distribution of features, handling missing values, and preparing the data for modeling.

Data Visualizations

Age Distribution: The age distribution of clients is visualized to understand the demographic spread.
Income Distribution: Visualizing the income distribution helps in identifying potential outliers and understanding the overall wealth distribution among clients.
Risk Distribution: Understanding the distribution of risk flags in the dataset.

Data Exploration Insights

Missing Values: The dataset contained missing values in both numeric and categorical columns. Missing values were handled by filling numeric columns with their median values and categorical columns with their mode.
Outliers: Identified and handled potential outliers in numeric columns to ensure model robustness.
Feature Relationships: Explored relationships between features using correlation matrices and scatter plots.

Feature Engineering

Feature engineering involved creating new features and encoding categorical variables. Significant features such as age, income, and experience were included.

Steps

Handling Missing Values: Numeric columns were filled with median values, and categorical columns were filled with mode values.
Categorical Encoding: Converted categorical variables into dummy variables using one-hot encoding.

Model Building and Evaluation

A Random Forest Classifier was used to build the prediction model. Hyperparameter tuning was performed using GridSearchCV to optimize the model's performance.

Hyperparameter Tuning

The parameter grid for GridSearchCV included:

• n_estimators: [100, 200, 300] 
• max_depth: [None, 10, 20 , 30] 
• min_samples_split: [2, 5, 10] 
• min_samples_leaf: [1, 2, 4]

The best parameters found were:

• max_depth: None 
• min_samples_leaf: 1 
• min_samples_split: 10 
• n_estimators: 200

Model Performance

The model was evaluated on both training and testing data to ensure it generalizes well.

Training Performance

• Accuracy: 0.9999603174603174 
• Classification Report: 
            precision    recall  f1-score   support 
 
           0       1.00      1.00      1.00    176857 
           1       1.00      1.00      1.00     24743 
 
    accuracy                           1.00    201600 
   macro avg       1.00      1.00      1.00    201600 
weighted avg       1.00      1.00      1.00    201600

Testing Performance

• Testing Accuracy: 0.9014880952380953 
• Testing Classification Report: 
               precision    recall  f1-score   support 
 
           0       0.93      0.96      0.94     44147 
           1       0.64      0.46      0.54      6253 
 
    accuracy                           0.90     50400 
   macro avg       0.78      0.71      0.74     50400 
weighted avg       0.89      0.90      0.89     50400 

• Confusion Matrix: 
[[42538  1609] 
 [ 3356  2897]]

Results and Insights

Final Model Performance

• Final Testing Accuracy: 0.9045039682539683 
• Final Classification Report: 
 
           precision    recall  f1-score   support 
 
       0       0.92      0.97      0.95     44147 
       1       0.68      0.43      0.53      6253 
 
accuracy                           0.90     50400 
macro avg      0.80      0.70      0.74     50400  
weighted avg   0.89      0.90      0.89     50400

Main Deciding Factors

The key features influencing the risk prediction include:

Employment Status: Stable employment status correlates with lower risk.
Income: Higher income levels generally indicate lower risk
Age Group: Certain age groups may exhibit different risk profiles.

Conclusion

The Random Forest model effectively predicts loan risk, achieving a testing accuracy of approximately 90%. While the model performs exceptionally well on training data, careful tuning and validation are crucial to prevent overfitting and ensure reliable predictions on new data. The main deciding factors such as credit score and employment status provide valuable insights into the risk assessment process.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Loan_risk.ipynb		Loan_risk.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Loan-Risk-Predicting

Table of Contents

Technologies Used

Data Exploration

Data Visualizations

Data Exploration Insights

Feature Engineering

Steps

Model Building and Evaluation

Hyperparameter Tuning

Model Performance

Training Performance

Testing Performance

Results and Insights

Final Model Performance

Main Deciding Factors

Conclusion

Data Visualization and Exploration

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Loan-Risk-Predicting

Table of Contents

Technologies Used

Data Exploration

Data Visualizations

Data Exploration Insights

Feature Engineering

Steps

Model Building and Evaluation

Hyperparameter Tuning

Model Performance

Training Performance

Testing Performance

Results and Insights

Final Model Performance

Main Deciding Factors

Conclusion

Data Visualization and Exploration

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages