This project implements a real-world loan approval prediction system using Machine Learning.
The goal is to predict whether a loan application will be approved or rejected based on applicant details such as income, education, credit history, and property area.
Unlike toy ML projects, this solution focuses on building a production-style pipeline, handling unseen test data, and avoiding common pitfalls such as data leakage and inconsistent encoding.
Loan approval is a high-stakes decision problem commonly faced by banks and financial institutions.
In real deployments, models must:
- Handle missing and inconsistent data
- Work on unseen applicants (no labels available)
- Apply exactly the same preprocessing logic used during training
- Fail safely without breaking on new inputs
This project was designed to simulate that real-world deployment scenario.
- Handling missing values using appropriate strategies
- Distinguishing between categorical vs numerical features
- Preventing data leakage between training and test data
- Label Encoding for binary / ordinal categories
- One-Hot Encoding for nominal features
- Feature scaling using StandardScaler
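The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the column names (`ApplicantIncome`, `Education`, `Property_Area`) and the median/mode fill strategies are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for the training data; column names are assumptions.
train = pd.DataFrame({
    "ApplicantIncome": [5000.0, None, 3000.0, 4000.0, 6000.0],
    "Education": ["Graduate", "Not Graduate", None, "Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Urban", None, "Semiurban"],
})

# 1. Impute BEFORE encoding, so missing categories are filled with real
#    category labels rather than with encoded numbers.
train["ApplicantIncome"] = train["ApplicantIncome"].fillna(
    train["ApplicantIncome"].median()
)
for col in ["Education", "Property_Area"]:
    train[col] = train[col].fillna(train[col].mode()[0])

# 2. Label-encode the binary category and keep the fitted encoder for reuse.
education_encoder = LabelEncoder()
train["Education"] = education_encoder.fit_transform(train["Education"])

# 3. One-hot encode the nominal category.
train = pd.get_dummies(train, columns=["Property_Area"], dtype=int)

# 4. Scale the numerical feature and keep the fitted scaler for reuse.
scaler = StandardScaler()
train[["ApplicantIncome"]] = scaler.fit_transform(train[["ApplicantIncome"]])

print(train)
```

Keeping `education_encoder` and `scaler` around after fitting is what allows the same transformations to be applied later to unseen applicants.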
Trained and evaluated multiple models:
- Logistic Regression (final selected model)
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
Model selection was based on:
- Accuracy
- Precision / Recall
- Confusion Matrix analysis
- Real-world interpretability
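A comparison along these lines might look like the sketch below. It uses synthetic data from `make_classification` in place of the loan dataset, and the hyperparameters (`max_iter`, `n_neighbors`, the 75/25 split) are illustrative assumptions rather than the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    results[name] = {
        "accuracy": accuracy_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred),
        "recall": recall_score(y_val, y_pred),
        "confusion_matrix": confusion_matrix(y_val, y_pred),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}")
    print(results[name]["confusion_matrix"])
```

Interpretability is not captured by these metrics; it is a separate judgment, for example that Logistic Regression coefficients can be inspected feature by feature.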
- Reused trained encoders and scalers correctly
- Handled unseen test data safely
- Ensured feature alignment between train and test datasets
- Generated batch predictions without ground truth (deployment scenario)
This is a step many beginner projects skip, but it is essential in real ML systems.
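One way to implement the inference steps listed above is sketched here. The toy data, the column names, and the use of `DataFrame.reindex` for feature alignment are assumptions made for illustration; the notebook may align features differently.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data; column names and values are assumptions.
train = pd.DataFrame({
    "ApplicantIncome": [5000.0, 3000.0, 4000.0, 6000.0, 2500.0, 7000.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Rural", "Urban"],
    "Loan_Status": [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({
    "ApplicantIncome": [4500.0, 2000.0],
    "Property_Area": ["Rural", "Urban"],  # no "Semiurban" in this batch
})

# --- Training: fit the scaler and model, and record the feature order ---
X_train = pd.get_dummies(
    train.drop(columns="Loan_Status"), columns=["Property_Area"], dtype=int
)
feature_columns = list(X_train.columns)
scaler = StandardScaler().fit(X_train[["ApplicantIncome"]])
X_train[["ApplicantIncome"]] = scaler.transform(X_train[["ApplicantIncome"]])
model = LogisticRegression().fit(X_train, train["Loan_Status"])

# --- Inference: transform only, never refit on the test data ---
X_test = pd.get_dummies(test, columns=["Property_Area"], dtype=int)
# Align to the training columns: add missing dummies as 0, drop extras,
# and restore the training column order.
X_test = X_test.reindex(columns=feature_columns, fill_value=0)
X_test[["ApplicantIncome"]] = scaler.transform(X_test[["ApplicantIncome"]])

predictions = model.predict(X_test)
print(predictions)
```

Without the `reindex` step, the test batch here would be missing the `Property_Area_Semiurban` column and the model would reject the input.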
Loan-Approval-Risk-Prediction/
│
├── loan_approval_risk_prediction.ipynb # Complete end-to-end ML pipeline
├── train_data.csv # Training dataset (with target variable)
├── test_data.csv # Unseen test dataset (no target variable)
├── loan_approval_predictions.csv # Model predictions on test data
└── README.md # Project documentation
The project uses two datasets:
- train_data.csv contains historical loan application data
- Includes the target variable Loan_Status (1 → Loan Approved, 0 → Loan Rejected)
- Used for:
  - Data preprocessing
  - Model training
  - Model evaluation
- test_data.csv contains new, unseen loan applications
- Does not include Loan_Status
- Used to simulate a real-world deployment scenario
- Final predictions are generated for this dataset
Both datasets are included in this repository so that any recruiter, reviewer, or developer can run the notebook end-to-end without additional downloads.
- Clone the repository
- Ensure the following files are in the same directory:
  - loan_approval_risk_prediction.ipynb
  - train_data.csv
  - test_data.csv
- Open the notebook and run all cells top to bottom
- The final output file loan_approval_predictions.csv will be generated automatically
The notebook writes the model's predictions to a file: loan_approval_predictions.csv
This file contains loan approval decisions for new applicants, in the same form a backend ML service would emit batch predictions in a real system.
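Writing such a file could look like the sketch below. The column layout (`Applicant_ID`, `Loan_Status`) and the identifier values are hypothetical, since the actual columns of the output file are not documented here, and the hard-coded predictions stand in for the output of `model.predict`. The file is written to a temporary directory so the example has no side effects on the repository.

```python
import os
import tempfile

import pandas as pd

# Hypothetical applicant identifiers and stand-in predictions.
applicant_ids = ["A001", "A002", "A003"]
predictions = [1, 0, 1]  # 1 = approved, 0 = rejected

output = pd.DataFrame({
    "Applicant_ID": applicant_ids,
    "Loan_Status": predictions,
})

# index=False keeps the pandas row index out of the CSV.
out_path = os.path.join(tempfile.gettempdir(), "loan_approval_predictions.csv")
output.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips cleanly.
round_trip = pd.read_csv(out_path)
print(round_trip)
```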
- Accuracy: ~86%
- Strong recall for approved loans
- Balanced performance across classes
- Logistic Regression chosen for stability and interpretability
This project explicitly avoids:
- Refitting encoders on test data
- Using encoded values to fill missing categorical features
- Feature order mismatch during inference
- Scaling test data incorrectly
- Crashing on unseen inputs
These issues are very common in ML projects, but were carefully handled here.
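To make the first and last pitfalls concrete, the sketch below shows what goes wrong when a `LabelEncoder` is refitted on test data, and how an unseen category surfaces as an error that has to be handled. The category values, including the unseen `"Postgraduate"`, are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

train_education = ["Graduate", "Not Graduate", "Graduate"]
test_education = ["Not Graduate", "Not Graduate"]  # batch with one category only

# Correct: reuse the encoder fitted on the training data.
fitted = LabelEncoder().fit(train_education)
correct = fitted.transform(test_education)

# Pitfall: refitting on the test batch builds a new mapping, so the same
# label now gets a different code than the model saw during training.
refit = LabelEncoder().fit_transform(test_education)

print("reused encoder:", correct.tolist())  # "Not Graduate" -> 1
print("refit encoder: ", refit.tolist())    # "Not Graduate" -> 0

# An unseen label makes the fitted encoder raise ValueError, so inference
# code needs an explicit fallback instead of letting the pipeline crash.
try:
    fitted.transform(["Postgraduate"])
    unseen_handled_silently = True
except ValueError:
    unseen_handled_silently = False

print("unseen label accepted silently:", unseen_handled_silently)
```

The fallback itself is a design decision, for example mapping unknown values to the most frequent training category or rejecting the record for manual review.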
This project emphasizes how Machine Learning is actually used in practice, not just how models are trained in tutorials.
It demonstrates a strong foundation in data preprocessing, model evaluation, and production-style inference.