A machine learning model to detect fraudulent credit card transactions, focusing on handling imbalanced data.
This project focuses on building a machine learning model to detect fraudulent credit card transactions. The primary challenge is the highly imbalanced nature of the dataset, where fraudulent transactions account for a very small fraction (0.17%) of the total. The goal is to develop a reliable classification model that can effectively identify fraud while minimizing false positives to ensure a good customer experience.
The dataset used is a public dataset from Kaggle containing credit card transactions made over a period of two days. It consists of 284,807 transactions, of which only 492 are fraudulent. Features V1 through V28 are the result of a PCA transformation to protect user privacy.
Link to Original Kaggle Dataset
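To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also shows what accuracy a trivial classifier would reach by labeling every transaction as legitimate; the variable names are illustrative and not taken from the project code.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# Share of transactions that are fraudulent.
fraud_rate = fraud_transactions / total_transactions

# A classifier that always predicts "legitimate" is correct on every
# non-fraud transaction, so its accuracy is simply 1 - fraud_rate.
trivial_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'legitimate': {trivial_accuracy:.4%}")
```

Such a classifier catches no fraud at all yet scores above 99.8% accuracy, which is why accuracy alone says little on this dataset.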
The project followed a structured machine learning workflow:
- Data Exploration (EDA): Initial analysis confirmed the extreme class imbalance and identified that the `Time` and `Amount` features required scaling.
- Preprocessing: Applied `StandardScaler` from Scikit-learn to the `Time` and `Amount` columns to standardize their scales.
- Model Training & Comparison:
  - Baseline Model (Logistic Regression): A simple model was first trained to establish a performance baseline. It achieved high recall (92%) but very poor precision (6%), making it impractical due to a high number of false alarms.
  - Advanced Model (Random Forest): A Random Forest Classifier was then trained. This model demonstrated a much better balance between precision and recall.
- Evaluation: The key challenge was selecting the right evaluation metric. Instead of relying on accuracy, the models were evaluated based on their Precision, Recall, and F1-Score, especially for the minority (fraud) class.
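The workflow above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it runs on small synthetic data instead of the Kaggle file, uses only two stand-in columns (`V1`, `V2`) for the PCA features, and the split ratio, random seeds, hyperparameters, and `class_weight="balanced"` setting are all assumptions, since the write-up does not state how each model was configured.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset), with about 1% positives
# so the example runs quickly while staying imbalanced.
rng = np.random.default_rng(0)
n = 20_000
y = (rng.random(n) < 0.01).astype(int)
X = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, n),      # seconds across two days
    "Amount": rng.exponential(80.0, n),
    "V1": rng.normal(0, 1, n) - 3.0 * y,     # stand-ins for PCA features
    "V2": rng.normal(0, 1, n) + 2.0 * y,
})

# Stratify so both splits keep the same fraud share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

def make_scaler():
    # Standardize only Time and Amount; leave the other columns untouched.
    return ColumnTransformer(
        [("scale", StandardScaler(), ["Time", "Amount"])],
        remainder="passthrough",
    )

# Assumed configurations; the original settings are not documented.
models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in models.items():
    pipeline = make_pipeline(make_scaler(), estimator)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    # Metrics are reported for the positive (fraud) class only.
    results[name] = {
        "precision": precision_score(y_test, predictions, zero_division=0),
        "recall": recall_score(y_test, predictions, zero_division=0),
        "f1": f1_score(y_test, predictions, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into preprocessing. The numbers printed for the synthetic data will not match the results reported in this README.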
The final Random Forest model achieved:
- Precision: 96%
- Recall: 76%
This is a deliberate precision-recall trade-off. The model misses roughly a quarter of fraudulent transactions (76% recall), but the transactions it does flag are very likely to be fraudulent (96% precision).
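For a single number that summarizes each trade-off, the F1-score can be derived from the precision and recall figures reported in this README. The helper below is an illustrative sketch; `f1_from` is a hypothetical name, and the inputs are the rounded percentages quoted above, so the results are approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded figures reported for the fraud class.
baseline_f1 = f1_from(precision=0.06, recall=0.92)       # Logistic Regression
random_forest_f1 = f1_from(precision=0.96, recall=0.76)  # Random Forest

print(f"Logistic Regression F1: {baseline_f1:.2f}")
print(f"Random Forest F1:       {random_forest_f1:.2f}")
```

Because F1 is a harmonic mean, the baseline's very low precision drags its score down despite the higher recall.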
From a business perspective, this trade-off suits a fraud investigation team: the list of alerts is manageable and mostly genuine, so far less time is spent on false alarms and fewer legitimate customer transactions are blocked.
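One way to quantify the alert workload is to ask how many false alarms an investigator sees for each genuine fraud flagged, which follows directly from precision as (1 - precision) / precision. The sketch below applies this to the rounded precision values reported above; `false_alarms_per_true_alert` is a hypothetical helper name.

```python
def false_alarms_per_true_alert(precision: float) -> float:
    """Expected false positives for every true positive among flagged items."""
    return (1 - precision) / precision

baseline_ratio = false_alarms_per_true_alert(0.06)       # Logistic Regression
random_forest_ratio = false_alarms_per_true_alert(0.96)  # Random Forest

print(f"Logistic Regression: {baseline_ratio:.1f} false alarms per genuine fraud flagged")
print(f"Random Forest:       {random_forest_ratio:.2f} false alarms per genuine fraud flagged")
```

At 6% precision, investigators would review more than fifteen false alarms for each real case, compared with well under one at 96% precision.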
- Python 3
- Pandas & NumPy: For data manipulation and analysis.
- Matplotlib & Seaborn: For data visualization.
- Scikit-learn: For preprocessing, model training (Logistic Regression, Random Forest), and evaluation.