📩 SMS Spam Classification using Machine Learning

🚀 Internship Project | NLP & Text Classification

📌 Project Overview

This project implements an SMS Spam Detection System using
Natural Language Processing (NLP) and a Multinomial Naive Bayes classifier.

The model classifies SMS messages into:

✅ HAM – Legitimate messages
🚫 SPAM – Promotional or unwanted messages

The system is trained on the SMS Spam Collection dataset and reaches 96.41% accuracy with a precision of 1.0000 on the held-out test set.

📂 Dataset Information

Dataset Name: SMS Spam Collection
Total Messages: 5572

Class Distribution

HAM: 4825 messages (86.59%)
SPAM: 747 messages (13.41%)

Columns:

label → ham / spam
message → raw SMS text
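
A minimal sketch of loading the data into the two columns above. It assumes the file is tab-separated with no header row, which this README does not state, and it uses two made-up rows in place of the real file so the snippet runs on its own.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the ASSUMED "label<TAB>message" layout (not real dataset rows).
sample = io.StringIO("ham\tSee you at lunch\nspam\tWin a free prize now\n")

# For the real data, pass the path "SMSSpamCollection.txt" instead of `sample`.
# QUOTE_NONE keeps quotation marks inside messages from being parsed as CSV quoting.
df = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["label", "message"],
    quoting=csv.QUOTE_NONE,
)

# Share of each class, comparable to the percentages under Class Distribution.
print(df["label"].value_counts(normalize=True))
```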

🧰 Technologies Used

Python 3.x
Pandas
NLTK
Scikit-learn
Matplotlib

🧹 Text Preprocessing

Each SMS message undergoes the following preprocessing steps:

Convert text to lowercase
Remove punctuation
Remove digits
Remove English stopwords (NLTK)
Store cleaned text in a new column

Example

"Free entry in 2 a wkly comp!!!"
↓
"free entry wkly comp"
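
The preprocessing steps can be sketched as one small function. This is an illustration, not the code from model.py: the name `clean_text` is hypothetical, and the short `STOPWORDS` set is a stand-in for the NLTK English stopword list so the snippet runs without any downloads.

```python
import string

# Small illustrative subset (assumption). The project uses the full NLTK list,
# i.e. nltk.corpus.stopwords.words("english").
STOPWORDS = frozenset({"a", "an", "the", "in", "on", "to", "is", "and", "of"})

# Translation table that deletes every punctuation character and digit.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = text.lower().translate(_STRIP)
    return " ".join(word for word in text.split() if word not in stopwords)


print(clean_text("Free entry in 2 a wkly comp!!!"))  # free entry wkly comp
```

With pandas, the cleaned column could then be created with something like `df["clean_message"] = df["message"].apply(clean_text)`, where the column name is also an assumption.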

🔤 Feature Extraction (TF-IDF)

Technique Used: TF-IDF Vectorization

Why TF-IDF?

Converts text into numerical vectors
Gives higher weight to words that are frequent in a message but rare across the dataset
Reduces the impact of words that appear in most messages

TF-IDF Shapes

Training Set: (4457, 7431)
Testing Set: (1115, 7431)
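
Both sets share the same number of columns because the vocabulary is learned from the training messages only and then reused for the test messages. A small sketch with scikit-learn's `TfidfVectorizer` on a made-up corpus (the default vectorizer settings are an assumption, not confirmed from model.py):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus of already-cleaned messages.
train_texts = ["free entry wkly comp", "hey tmr meet bugis", "free prize call now"]
test_texts = ["free call tmr unknownword"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_texts)        # reuses them; unseen words are ignored

# One row per message, one column per vocabulary word.
print(X_train.shape, X_test.shape)  # (3, 11) (1, 11)
```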

🤖 Machine Learning Model

Model: Multinomial Naive Bayes

Why Naive Bayes?

Works well on sparse, high-dimensional word-frequency features, which makes it a strong baseline for text classification
Fast and memory efficient
Probabilistic interpretation
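
To illustrate the probabilistic interpretation, here is a toy example with scikit-learn's `MultinomialNB`. The two-word count matrix and labels are made up, and the default smoothing (`alpha=1.0`) is an assumption about the project's settings.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (made up): columns are the words ["free", "meet"].
X = np.array([[2, 0], [1, 0], [0, 2], [0, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB()  # default Laplace smoothing, alpha=1.0
model.fit(X, y)

# Class probabilities for a new message that contains "free" once.
proba = model.predict_proba(np.array([[1, 0]]))[0]
spam_idx = list(model.classes_).index("spam")
print(f"P(spam) = {proba[spam_idx]:.2f}")  # P(spam) = 0.80
```

With smoothing, "free" has probability 4/5 under spam and 1/5 under ham, and the class priors are equal, so the posterior for spam is 0.8.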

✂️ Train–Test Split

Training Data: 80%
Testing Data: 20%
Stratified split to preserve class balance
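
A sketch of the split with scikit-learn's `train_test_split`, using placeholder messages with the class counts listed under Dataset Information. The `random_state` value is an assumption; the README does not give one.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the class counts from the dataset: 4825 ham, 747 spam.
labels = ["ham"] * 4825 + ["spam"] * 747
messages = [f"message {i}" for i in range(len(labels))]

X_train, X_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.2,     # 20% of the messages go to the test set
    stratify=labels,   # keep the ham/spam ratio the same in both parts
    random_state=42,   # assumed seed, only for reproducibility of this sketch
)

print(len(X_train), len(X_test))  # 4457 1115
print(f"spam share in test: {y_test.count('spam') / len(y_test):.4f}")
```

The resulting sizes match the row counts of the TF-IDF matrices above.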

📊 Model Evaluation

Performance Metrics

Metric	Score
Accuracy	0.9641
Precision	1.0000
Recall	~0.73
F1 Score	0.8450

Interpretation

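🔥 Perfect precision → No HAM message in the test set was classified as SPAM
✅ Moderate recall → Roughly three in four SPAM messages are detected; the rest pass through as HAM
⚖️ F1 score → Lower than precision because it is pulled down by the missed SPAM messages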
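
The four metrics can be computed with `sklearn.metrics`. The labels below are made up to mirror the pattern above (no false alarms, some missed spam); they are not the project's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 spam and 6 ham messages; one spam message is missed.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham"]

accuracy = accuracy_score(y_true, y_pred)
# pos_label tells scikit-learn which class counts as the positive one.
precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")
f1 = f1_score(y_true, y_pred, pos_label="spam")

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")
```

On this toy data precision is 1.0 because nothing was wrongly flagged as spam, while recall is 0.75 because one of the four spam messages was missed.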

📨 Sample Predictions

The model displays 5 random test predictions, showing:

Original message
Actual label
Predicted label

Example:

Message : hey tmr meet bugis
Actual Label : HAM
Predicted Label : HAM

📈 Data Visualizations

The project generates and saves the following plots:

📌 Class Distribution Plot

Shows HAM vs SPAM message counts
Saved as:
results/class_distribution.png
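
A rough sketch of how such a plot could be produced with Matplotlib, using the class counts from the Dataset Information section. The colors, figure size, and labels are assumptions, not taken from model.py.

```python
import os

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

counts = {"HAM": 4825, "SPAM": 747}  # class counts from the dataset section

os.makedirs("results", exist_ok=True)
out_path = os.path.join("results", "class_distribution.png")

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(list(counts.keys()), list(counts.values()), color=["tab:green", "tab:red"])
ax.set_xlabel("Class")
ax.set_ylabel("Number of messages")
ax.set_title("Class Distribution")
fig.tight_layout()
fig.savefig(out_path)
plt.close(fig)
```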

📌 Top Spam Indicator Words

Displays the most influential words for SPAM classification
Extracted from Naive Bayes log probabilities
Saved as:
results/top_spam_words.png

▶️ How to Run the Project

1️⃣ Install dependencies

pip install pandas nltk scikit-learn matplotlib

2️⃣ Download NLTK stopwords (one-time)

import nltk
nltk.download("stopwords")

3️⃣ Run the model

python model.py

📌 Key Highlights

Complete Machine Learning pipeline
Real-world NLP dataset
Clean and effective text preprocessing
Strong evaluation metrics (Accuracy, Precision, Recall, F1-score)
Clear and meaningful visualizations
Well-structured, readable, and commented code

🧠 Future Improvements

Add confusion matrix visualization
Experiment with Logistic Regression and SVM
Perform hyperparameter tuning for better performance
Deploy the model using Flask or Streamlit

👨‍💻 Author

Arnab Datta
Internship Project – Machine Learning & NLP
