A machine learning project for detecting and classifying spam messages using Python, built with a Jupyter Notebook for interactive exploration and model development.
This project implements an SMS/email spam detection system using machine learning. It covers data exploration, feature engineering, model training, evaluation, and visualization.
- Data Analysis: Exploratory data analysis (EDA) of spam and ham messages
- Text Preprocessing: Tokenization, cleaning, and vectorization of text data
- Multiple Models: Implementation and comparison of various classification algorithms
- Visualization: Detailed visualizations of data distributions and model performance
- Model Evaluation: Confusion matrices, ROC curves, and side-by-side performance comparisons
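To make the text preprocessing feature concrete, here is a minimal, dependency-free sketch of tokenization, cleaning, and bag-of-words vectorization. It is an illustration only: the function names (`tokenize`, `build_vocabulary`, `vectorize`) and the sample messages are hypothetical, and the notebook may well use scikit-learn's vectorizers (for example `CountVectorizer`) instead.

```python
import re
from collections import Counter


def tokenize(message: str) -> list[str]:
    """Lowercase the message and keep runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def build_vocabulary(messages: list[str]) -> dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(message: str, vocabulary: dict[str, int]) -> list[int]:
    """Return a bag-of-words count vector; unknown tokens are ignored."""
    counts = Counter(tokenize(message))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Hypothetical example messages, not taken from spam.csv.
messages = ["WIN a FREE prize now!!!", "Are we still on for lunch?"]
vocabulary = build_vocabulary(messages)
vectors = [vectorize(msg, vocabulary) for msg in messages]
```

Each message becomes a fixed-length list of token counts, which is the kind of numeric input a classifier expects.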
email_spam_classifier/
├── sms-spam-detection.ipynb # Main Jupyter notebook with full analysis and modeling
├── spam.csv # Dataset containing SMS messages labeled as spam/ham
├── requirements.txt # Python dependencies
├── pyproject.toml # Project configuration
├── hello.py # Basic project entry point
└── README.md # This file
The project uses the spam.csv file containing SMS messages with labels:
- Ham: Legitimate messages
- Spam: Unwanted/spam messages
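A quick way to check the ham/spam balance is to count the label values. The sketch below uses only the standard library and an in-memory stand-in for the file; the column names `label` and `text` are assumptions, so adjust them (and the file encoding) to match the actual spam.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for spam.csv; the real file's column names
# and encoding may differ.
SAMPLE_CSV = """label,text
ham,See you at the meeting tomorrow
spam,Congratulations! You won a free ticket
ham,Can you call me back later?
"""


def count_labels(csv_file, label_column="label"):
    """Count how many rows carry each label value."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


label_counts = count_labels(io.StringIO(SAMPLE_CSV))
total = sum(label_counts.values())
spam_share = label_counts["spam"] / total  # fraction of messages labeled spam
```

With the real dataset, pass an open file handle instead of the `io.StringIO` object.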
- Python 3.10 or higher
- pip or conda for package management
- Clone or download this repository
- Install dependencies: pip install -r requirements.txt
- Launch Jupyter Notebook: jupyter notebook
- Open sms-spam-detection.ipynb and run the cells
Core dependencies:
- notebook - Jupyter notebook support
- pandas - Data manipulation
- scikit-learn - Machine learning algorithms
- matplotlib/seaborn - Data visualization
- numpy - Numerical computing
See requirements.txt for the complete list.
The sms-spam-detection.ipynb notebook includes:
- Data Loading & Exploration - Load and examine the dataset structure
- Exploratory Data Analysis - Distribution analysis, text statistics
- Data Preprocessing - Cleaning, tokenization, and vectorization
- Model Training - Multiple classifier implementations
- Model Evaluation - Performance metrics and comparisons
- Visualization - Confusion matrices, ROC curves, feature importance
- Predictions - Making predictions on new messages
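This list does not name the specific classifiers the notebook trains. As one illustration of the training and prediction steps, the following is a from-scratch multinomial Naive Bayes sketch with add-one smoothing. It is not the notebook's implementation; the function names and training messages are hypothetical, and in practice a scikit-learn estimator would normally fill this role.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(messages, labels):
    """Collect per-class message counts and per-class token counts."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        token_counts[label].update(tokenize(text))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return class_counts, token_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-posterior score."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_tokens = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip tokens never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (token_counts[label][token] + 1) / (
                total_tokens + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, not taken from spam.csv.
train_messages = [
    "win a free prize now",
    "free cash claim your prize",
    "are we meeting for lunch today",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)
prediction = predict("claim your free prize", model)
```

Working in log space keeps the product of many small probabilities from underflowing.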
Run the Jupyter notebook cells sequentially to:
- Load and explore the spam dataset
- Preprocess text data
- Train multiple classification models
- Evaluate and compare model performance
- Generate visualizations and insights
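When comparing models, it helps to know how the headline metrics follow from the confusion matrix. The sketch below computes the four confusion counts and precision, recall, and F1 by hand, treating spam as the positive class. The helper names and the label lists are hypothetical; scikit-learn's `confusion_matrix` in `sklearn.metrics` provides the same counts for real model output.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # ham message flagged as spam
        elif actual == positive:
            fn += 1  # spam message that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for six messages.
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

For spam filtering, false positives (legitimate messages flagged as spam) are often the costlier error, so precision deserves attention alongside accuracy.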
Project settings are defined in pyproject.toml:
- Project name: email-spam-classifier
- Version: 0.1.0
- Python requirement: >=3.10
This project is provided as-is for educational purposes.
Feel free to fork, modify, and improve this project for your learning purposes.
Note: This is a learning/development project. The notebook contains experimental code and iterative model development.