A multi-tool data analytics capstone project that investigates credit card transaction fraud using MySQL, Python (EDA), Microsoft Excel, and Tableau. The project covers the full analytics pipeline — from raw data ingestion and SQL querying to exploratory analysis in Python and an interactive Tableau dashboard.
Financial-Fraud-Analysis/
│
├── Financial Fraud.sql # MySQL script: DB setup, data import & SQL analysis
├── EDA & Python.ipynb # Jupyter Notebook: EDA & visualizations in Python
├── Excel Tasks.xlsb # Excel-based analysis and tasks
├── Financial Fraud Dashboard.twb # Tableau workbook for the interactive dashboard
├── Problem Statement.pdf # Project brief and objectives
├── requirements.txt # Python dependencies
├── Dataset/ # Raw dataset files
├── SQL Outputs/ # Screenshots / exports of SQL query results
└── LICENSE # MIT License
The dataset contains credit card transaction records with both legitimate and fraudulent entries. Key columns include:
| Column | Description |
|---|---|
TRANS_DATE_TRANS_TIME |
Date and time of the transaction |
CC_NUM |
Credit card number (anonymized) |
MERCHANT |
Merchant name |
CATEGORY |
Transaction category (e.g., grocery, travel) |
AMT |
Transaction amount |
FIRST_NAME / LAST_NAME |
Cardholder name |
GENDER |
Cardholder gender |
CITY / STATE / ZIP |
Cardholder location |
LAT / LONGITUDE |
Cardholder's geographic coordinates |
CITY_POP |
Population of cardholder's city |
JOB |
Cardholder's occupation |
DOB |
Date of birth |
TRANS_NUM |
Unique transaction identifier |
MERCH_LAT / MERCH_LONG |
Merchant's geographic coordinates |
IS_FRAUD |
Fraud label — 1 = Fraudulent, 0 = Legitimate |
A separate Location Data table (Location Data.csv) is used for geographic joins in SQL.
| Tool | Purpose |
|---|---|
| MySQL | Database setup, data import, SQL-based analysis |
| Python (Jupyter Notebook) | Exploratory Data Analysis & visualizations |
| Microsoft Excel | Supplementary analysis and task documentation |
| Tableau | Interactive fraud dashboard |
| pandas, matplotlib, seaborn, scikit-learn | Python libraries for analysis & ML |
Database Setup:
- Created
FINANCEdatabase with aCC_DATAtable mirroring the raw CSV schema. - Loaded the dataset using
LOAD DATA LOCAL INFILE. - Created a separate
LOCATION_DATAtable for geographic coordinates, joined toCC_DATAviaCC_NUM.
SQL Queries Performed:
| Analysis | Description |
|---|---|
| Total Transactions | Count of all records in the dataset |
| Top 10 Merchants | Most frequent merchants by transaction count |
| Avg. Transaction Amount by Category | Average spend per transaction category |
| Fraud Count & Percentage | Total fraudulent transactions and their share of all transactions |
| Transaction Geolocation | Joins CC_DATA with LOCATION_DATA to map each transaction to lat/long |
| City with Highest Population | Identifies the most populous city in the dataset |
| Transaction Date Range | Earliest and latest transaction timestamps |
| Total Transaction Value | Sum of all transaction amounts |
| Transactions by Category | Count of transactions per spending category |
| Avg. Amount by Gender | Average transaction amount split by cardholder gender |
| Transactions by Day of Week | Average transaction amount grouped by weekday |
The Jupyter Notebook covers end-to-end exploratory data analysis using pandas, matplotlib, seaborn, and scikit-learn:
- Data Loading & Inspection — shape, dtypes, null checks, descriptive statistics
- Fraud Distribution — class imbalance visualization (fraudulent vs. legitimate)
- Transaction Amount Analysis — distribution plots, outlier detection
- Category-wise Analysis — fraud rates per spending category
- Time-based Analysis — fraud patterns by hour, day, and month
- Geographic Analysis — mapping transaction and fraud locations
- Correlation Analysis — heatmaps to identify feature relationships
- Feature Engineering — extracting time-based features from transaction timestamps
Supplementary analysis covering pivot tables, summaries, and structured task outputs aligned with the project's problem statement.
An interactive Tableau dashboard presenting:
- 🗺️ Geographic fraud map — fraud hotspots by location
- 📊 Fraud by category — which transaction types are most vulnerable
- 👤 Demographic breakdown — fraud patterns by gender and age group
- 📅 Time trends — fraud activity over time
- 💳 Top fraudulent merchants — most targeted merchant names
Open
Financial Fraud Dashboard.twbin Tableau Desktop to explore the visuals.
- MySQL Server (v8.0+) with
local_infileenabled - Python 3.8+ with Jupyter Notebook
- Tableau Desktop (for
.twbfile) - Microsoft Excel (for
.xlsbfile)
1. Clone the repository
git clone https://github.com/PrakharSri18-data/Financial-Fraud-Analysis.git
cd Financial-Fraud-Analysis2. Set up the database
- Locate the dataset files inside the
Dataset/folder - Open
Financial Fraud.sqlin MySQL Workbench - Update the
LOAD DATA LOCAL INFILEpath to match your local file location - Run the full script
3. Run the Python notebook
pip install -r requirements.txt
jupyter notebook "EDA & Python.ipynb"4. Open the Tableau dashboard
- Launch Tableau Desktop
- Open
Financial Fraud Dashboard.twb - Re-connect the data source if prompted
pandas==3.0.2
numpy==2.4.4
matplotlib==3.10.8
seaborn==0.13.2
scikit-learn==1.8.0
scipy==1.17.1
Full list available in
requirements.txt
- The dataset is heavily imbalanced — fraudulent transactions make up only a small percentage of all records, highlighting the real-world challenge of fraud detection.
- Certain transaction categories show disproportionately high fraud rates relative to their volume.
- Geographic clustering reveals specific regions with elevated fraud activity.
- Time-of-day patterns suggest fraudulent transactions tend to spike during off-peak hours.
- Gender and city population show measurable differences in average transaction amounts.
This project is licensed under the MIT License © 2026 Prakhar Srivastava.
Prakhar Srivastava
Data Analyst, Data Scientist & AI Engineer | Machine Learning, Deep Learning, Generative AI, Prompt Engineering & Agentic AI