Arnab500th/Spam-sms-Classifier-

📩 SMS Spam Classification using Machine Learning

🚀 Internship Project | NLP & Text Classification


📌 Project Overview

This project implements an SMS Spam Detection System using
Natural Language Processing (NLP) and a Multinomial Naive Bayes classifier.

The model classifies SMS messages into:

  • ✅ HAM – Legitimate messages
  • 🚫 SPAM – Promotional or unwanted messages

The system is trained on the SMS Spam Collection dataset and reaches 96.41% accuracy with a precision of 1.00 on the held-out test set.


📂 Dataset Information

  • Dataset Name: SMS Spam Collection
  • Total Messages: 5572

Class Distribution

  • HAM: 4825 messages (86.59%)
  • SPAM: 747 messages (13.41%)

Columns:

  • label → ham / spam
  • message → raw SMS text
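
As a quick sanity check on the class distribution above, the percentages can be recomputed from the reported counts. This is a minimal sketch using only the standard library; the variable names are illustrative and not taken from the project's code.

```python
from collections import Counter

# Class counts as reported in the dataset summary above.
counts = Counter({"ham": 4825, "spam": 747})
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
print(shares)
```

The roughly 87/13 imbalance is why accuracy alone is not enough to judge the classifier, and why precision and recall are reported as well.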

🧰 Technologies Used

  • Python 3.x
  • Pandas
  • NLTK
  • Scikit-learn
  • Matplotlib

🧹 Text Preprocessing

Each SMS message undergoes the following preprocessing steps:

  • Convert text to lowercase
  • Remove punctuation
  • Remove digits
  • Remove English stopwords (NLTK)
  • Store cleaned text in a new column

Example

"Free entry in 2 a wkly comp!!!"
↓
"free entry wkly comp"
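
The preprocessing steps listed above can be sketched as a single function. This is an illustration, not the project's actual code: the name `clean_text` is hypothetical, and the small `STOPWORDS` set is a stand-in for NLTK's English stopword list so that the sketch runs without downloading the corpus.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stopword list.
# The project itself uses the NLTK stopwords corpus.
STOPWORDS = {"in", "a", "the", "to", "is", "and", "of", "for", "you"}

# Translation table that deletes punctuation and digits in one pass.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(message: str) -> str:
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = message.lower().translate(_DROP)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_text("Free entry in 2 a wkly comp!!!")
print(cleaned)  # free entry wkly comp
```

With pandas, a function like this would typically be applied to the `message` column and the result stored in a new column.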


🔤 Feature Extraction (TF-IDF)

Technique Used: TF-IDF Vectorization

Why TF-IDF?

  • Converts text into numerical vectors
  • Highlights important words
  • Reduces impact of very common words

TF-IDF Shapes

  • Training Set: (4457, 7431)
  • Testing Set: (1115, 7431)
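
Both shapes above share the same number of columns (7431) because the vocabulary is learned from the training text and then reused for the test text. The sketch below shows that pattern with scikit-learn's `TfidfVectorizer` on a few toy messages; the data and the default vectorizer settings are assumptions, not the project's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy cleaned messages standing in for the real training and test splits.
train_texts = ["free entry wkly comp", "hey tmr meet bugis", "free prize claim"]
test_texts = ["claim free entry", "meet tmr"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only ...
X_train = vectorizer.fit_transform(train_texts)
# ... then reuse them for the test text so both matrices share columns.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Fitting on the training split only keeps information about the test messages out of the vocabulary and IDF weights.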

🤖 Machine Learning Model

  • Model: Multinomial Naive Bayes

Why Naive Bayes?

  • A strong, widely used baseline for text classification on sparse word features
  • Fast and memory efficient
  • Probabilistic interpretation
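
To make the model choice concrete, here is a minimal sketch that chains TF-IDF and Multinomial Naive Bayes with scikit-learn's `make_pipeline`. The four training messages are invented for illustration, and the pipeline layout is an assumption; the project may wire the vectorizer and classifier together differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data, not taken from the SMS Spam Collection.
texts = [
    "free entry wkly comp win prize",
    "claim free prize now",
    "hey tmr meet bugis",
    "ok see you at lunch",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Hard label plus per-class probabilities for a new message.
pred = model.predict(["free prize"])[0]
proba = model.predict_proba(["free prize"])[0]
print(pred, {c: round(float(p), 3) for c, p in zip(model.classes_, proba)})
```

The per-class probabilities are what the "probabilistic interpretation" bullet refers to: they can be thresholded to trade precision against recall.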

✂️ Train–Test Split

  • Training Data: 80%
  • Testing Data: 20%
  • Stratified split to preserve the HAM/SPAM ratio in both sets
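
The split sizes can be reproduced from the label counts alone. The sketch below uses scikit-learn's `train_test_split` with `stratify`; the `random_state=42` value and the use of index lists in place of the real messages are assumptions made for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; indices stand in for messages.
labels = ["ham"] * 4825 + ["spam"] * 747
indices = list(range(len(labels)))

train_idx, test_idx, y_train, y_test = train_test_split(
    indices,
    labels,
    test_size=0.2,      # 20% held out for testing
    stratify=labels,    # keep the HAM/SPAM ratio in both splits
    random_state=42,    # assumption: any fixed seed gives the same sizes
)

print(len(y_train), len(y_test))
print(Counter(y_test))
```

The resulting sizes, 4457 and 1115, match the row counts of the TF-IDF matrices.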

📊 Model Evaluation

Performance Metrics

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 0.9641 |
| Precision | 1.0000 |
| Recall    | ~0.75  |
| F1 Score  | 0.8450 |

Interpretation

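  • 🔥 Perfect precision on the test set → no HAM message was classified as SPAM
  • ✅ Moderate recall → roughly three in four SPAM messages are detected; the rest pass through as HAM
  • ⚖️ F1 score of 0.8450 → pulled below precision by the missed SPAM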
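
Because F1 is the harmonic mean of precision and recall, any two of the three values determine the third. The sketch below solves for recall from the reported precision and F1; the helper names are hypothetical. Assuming those two figures are exact to four decimals, the implied recall is about 0.73, slightly below the rounded ~0.75 quoted above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


precision, f1 = 1.0000, 0.8450  # values reported in the metrics above
recall = recall_from_f1(precision, f1)

print(round(recall, 4))
print(round(f1_from(precision, recall), 4))
```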

📨 Sample Predictions

The model displays 5 random test predictions, showing:

  • Original message
  • Actual label
  • Predicted label

Example:

Message : hey tmr meet bugis
Actual Label : HAM
Predicted Label : HAM


📈 Data Visualizations

The project generates and saves the following plots:

📌 Class Distribution Plot

  • Shows HAM vs SPAM message counts
  • Saved as:
    results/class_distribution.png

📌 Top Spam Indicator Words

  • Displays most influential words for SPAM classification
  • Extracted from Naive Bayes log probabilities
  • Saved as:
    results/top_spam_words.png

▢️ How to Run the Project

1️⃣ Install dependencies

pip install -r requirements.txt

2️⃣ Download NLTK stopwords (one-time)

import nltk
nltk.download("stopwords")

3️⃣ Run the model

python model.py

📌 Key Highlights

  • Complete Machine Learning pipeline
  • Real-world NLP dataset
  • Clean and effective text preprocessing
  • Strong evaluation metrics (Accuracy, Precision, Recall, F1-score)
  • Clear and meaningful visualizations
  • Well-structured, readable, and commented code

🧠 Future Improvements

  • Add confusion matrix visualization
  • Experiment with Logistic Regression and SVM
  • Perform hyperparameter tuning for better performance
  • Deploy the model using Flask or Streamlit

👨‍💻 Author

Arnab Datta
Internship Project – Machine Learning & NLP

About

This project was developed as part of an internship program to apply Python, Machine Learning, and NLP concepts by building an SMS Spam Classification system that distinguishes between spam and legitimate messages.
