A machine learning-based spam classification application that uses the Flask web framework and a trained model to classify messages as spam or ham (not spam).
git clone https://github.com/arunmm8335/Spam-Classifier.git
cd Spam-Classifier
python -m venv .venv
On Windows:
.venv\Scripts\activate
On Mac/Linux:
source .venv/bin/activate
pip install -r requirements.txt
This installs the required Python libraries, including Flask, pandas, scikit-learn, and nltk.
When you run the app for the first time, the NLTK library will attempt to download the stopwords dataset. Ensure you're connected to the internet for this step.
Make sure the spam.csv file is placed inside the data/ directory. The dataset can be downloaded from Kaggle's Spam SMS Dataset.
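A minimal sketch of loading the dataset with pandas. It assumes the Kaggle CSV stores the label in a column named v1 and the message text in v2, and that the file is Latin-1 encoded; those details, the ham=0/spam=1 mapping, and the helper name load_sms are assumptions for illustration, not something confirmed by this repository.

```python
import io

import pandas as pd


def load_sms(source):
    """Load the SMS dataset from a file path or file-like object.

    Assumes the label column is 'v1' and the text column is 'v2'.
    """
    df = pd.read_csv(source, encoding="latin-1")
    # Keep only the two columns of interest and give them readable names
    df = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})
    # Map text labels to integers (assumed convention: ham -> 0, spam -> 1)
    df["label"] = df["label"].map({"ham": 0, "spam": 1})
    return df


# In-memory stand-in for data/spam.csv so the sketch runs without the file
sample = io.StringIO("v1,v2\nham,See you at lunch\nspam,WIN a free prize now\n")
df = load_sms(sample)
print(df)
```

With the real file in place, the call would be load_sms("data/spam.csv").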
The model training process is handled by spam_classifier.py, which uses scikit-learn's machine learning algorithms. If you want to retrain or experiment with different algorithms, run:
python spam_classifier.py
This will:
- Load and preprocess the dataset.
- Train a spam classification model.
- Save the trained model as spam_classifier.pkl inside the models/ directory.
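The save step can be sketched roughly as follows. This assumes the vectorizer and classifier are bundled into one scikit-learn pipeline and written with pickle; the actual script may store them differently. The helpers save_model and load_model and the toy training data are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def save_model(model, path):
    """Pickle the model, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Toy data, for illustration only
texts = ["win a free prize now", "free cash claim now",
         "see you at lunch", "call me when you are home"]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# A temporary directory stands in for the project root
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "spam_classifier.pkl"
    save_model(model, model_path)
    restored = load_model(model_path)

# The restored pipeline should predict exactly like the original
same_predictions = list(restored.predict(texts)) == list(model.predict(texts))
print(same_predictions)
```

Pickling the whole pipeline keeps the fitted vocabulary together with the classifier, so the web app would only need to load one file. Only unpickle files you trust, since pickle can execute arbitrary code on load.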
After training, run the Flask application to deploy the spam classifier as a web service:
python app/app.py
The app will start a local server, usually accessible at http://127.0.0.1:5000.
- Open http://127.0.0.1:5000 in your browser.
- Enter a message in the text field and click "Check" to classify it as either Spam or Ham (Not Spam).
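For orientation, here is a rough sketch of how a Flask endpoint for this kind of check might look. The /predict route, the message form field, and the keyword-based classify stand-in are all assumptions made for illustration; the real app.py renders an HTML template and calls the trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(message):
    """Stand-in for the trained model (hypothetical keyword rule)."""
    return "Spam" if "free" in message.lower() else "Ham"


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted text from the form and return the label as JSON
    message = request.form.get("message", "")
    return jsonify({"message": message, "prediction": classify(message)})


# Exercise the route with Flask's test client, without starting a server
client = app.test_client()
response = client.post("/predict", data={"message": "Claim your FREE prize"})
result = response.get_json()
print(result)
```

To serve it locally one would call app.run(), which by default listens on 127.0.0.1 port 5000.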
Spam-Classifier/
├── app/                  # Contains Flask web application files
│   ├── app.py            # Main Flask application
│   └── templates/        # HTML templates (e.g., index.html)
├── data/                 # Directory for storing the dataset (spam.csv)
├── models/               # Directory for saving the trained model (spam_classifier.pkl)
├── notebooks/            # Jupyter notebooks for model experimentation
├── spam_classifier.py    # Script for training the model
└── requirements.txt      # List of required Python libraries
The model uses Logistic Regression (or another chosen classifier) trained on features extracted from SMS text messages. Text features are created using TF-IDF Vectorization, a method of transforming text into numerical vectors that reflect word importance.
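To make "word importance" concrete, the sketch below computes TF-IDF weights by hand in plain Python. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is the default weighting of scikit-learn's TfidfVectorizer; whether this project keeps those defaults is an assumption. Whitespace tokenisation and the three toy documents are simplifications for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = term count * (ln((1 + n) / (1 + df)) + 1), L2-normalised per document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = ["free prize now", "free lunch today", "see you today"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document, "prize" occurs in only one document while "free" occurs in two, so "prize" receives the larger weight.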
- Data Loading – The dataset is loaded into a pandas DataFrame.
- Preprocessing – The text data is cleaned (removing stopwords, punctuation, and converting to lowercase).
- Feature Extraction – TF-IDF Vectorizer converts text into numerical features.
- Model Training – A Logistic Regression model (or other classifier) is trained.
- Model Evaluation – Performance is assessed using accuracy, precision, recall, and F1-score.
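The steps above can be sketched end to end on toy data. This is not the project's spam_classifier.py: the small STOPWORDS set stands in for NLTK's English stopword list, the eight messages are invented, and the split parameters are arbitrary choices for illustration.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny stand-in for NLTK's English stopword list (assumption)
STOPWORDS = {"a", "at", "the", "you", "your", "to", "is", "for", "me", "are"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Toy messages, for illustration only (1 = spam, 0 = ham)
messages = [
    "WIN a FREE prize now!!!", "Claim your free cash today",
    "Urgent: free entry to win a prize", "Free ringtone, text WIN to claim",
    "See you at lunch", "Are you home yet?",
    "Call me after the meeting", "Thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Preprocessing, then a stratified train/test split
cleaned = [clean_text(m) for m in messages]
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, random_state=42, stratify=labels
)

# Feature extraction: fit the vocabulary on training text only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Model training and evaluation
model = LogisticRegression()
model.fit(X_train_vec, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_vec))
print(f"accuracy on toy split: {accuracy:.2f}")
```

Fitting the vectorizer on the training split only, then reusing it on the test split, avoids leaking test vocabulary into training.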
The trained model is evaluated using key metrics:
- Accuracy – Fraction of all messages classified correctly.
- Precision – Fraction of messages predicted as spam that are actually spam.
- Recall – Fraction of actual spam messages that are correctly identified.
- F1-Score – Harmonic mean of precision and recall.
Accuracy: 0.9668

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
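As a worked illustration of how the four metrics relate, the sketch below computes them from confusion-matrix counts for the positive (spam) class. The counts are hypothetical and chosen for easy arithmetic; they are not taken from the report above, and binary_metrics is an illustrative helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 3 spam caught, 1 ham flagged, 1 spam missed, 5 ham passed
metrics = binary_metrics(tp=3, fp=1, fn=1, tn=5)
print(metrics)
```

A model can score high accuracy while still missing spam, which is why recall for the spam class is worth reading separately from the overall figure.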