Internship Project | NLP & Text Classification
This project implements an SMS Spam Detection System using
Natural Language Processing (NLP) and a Multinomial Naive Bayes classifier.
The model classifies SMS messages into:
- HAM: Legitimate messages
- SPAM: Promotional or unwanted messages
The system is trained on the SMS Spam Collection dataset and reaches 96.41% accuracy with a precision of 1.00 on the held-out test set.
- Dataset Name: SMS Spam Collection
- Total Messages: 5572
- HAM: 4825 messages (86.59%)
- SPAM: 747 messages (13.41%)
Columns:
- label: ham / spam
- message: raw SMS text
- Python 3.x
- Pandas
- NLTK
- Scikit-learn
- Matplotlib
Each SMS message undergoes the following preprocessing steps:
- Convert text to lowercase
- Remove punctuation
- Remove digits
- Remove English stopwords (NLTK)
- Store cleaned text in a new column
"Free entry in 2 a wkly comp!!!" → "free entry wkly comp"
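The preprocessing steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: `clean_text` and `STOPWORDS` are hypothetical names, and the small hard-coded stopword set stands in for NLTK's full English list so the sketch runs without a download.

```python
import string

# Assumption: a tiny illustrative stopword set so the sketch runs offline.
# The project itself uses NLTK's English list, i.e.
# set(nltk.corpus.stopwords.words("english")).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "you"}

# Translation table that deletes every punctuation character and digit.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(text: str) -> str:
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = text.lower().translate(_STRIP)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


print(clean_text("Free entry in 2 a wkly comp!!!"))  # free entry wkly comp
```

With pandas, the same function could fill the new column through something like `df["clean_message"] = df["message"].apply(clean_text)`, where `clean_message` is an assumed column name.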
Technique Used: TF-IDF Vectorization
Why TF-IDF?
- Converts text into numerical vectors
- Highlights important words
- Reduces impact of very common words
- Training Set: (4457, 7431), i.e. 4457 messages × 7431 TF-IDF features
- Testing Set: (1115, 7431), i.e. 1115 messages × the same 7431 features
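To make the "reduces impact of very common words" point concrete, the weighting can be written out by hand. The sketch below assumes the smoothed IDF and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default. The three messages are hypothetical, and the project most likely calls the library rather than code like this.

```python
import math
from collections import Counter

# Hypothetical cleaned messages, for illustration only.
docs = ["free entry wkly comp", "free call now", "call later"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of messages each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    # Smoothed inverse document frequency (scikit-learn's default formula).
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0


def tfidf(tokens):
    # Term count times IDF, then scaled so the vector has unit L2 norm.
    raw = {word: count * idf(word) for word, count in Counter(tokens).items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:6s} {weight:.3f}")
```

In the first message, "free" also occurs in another message, so it receives a lower weight than "entry", "wkly", and "comp", which are unique to that message.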
- Model: Multinomial Naive Bayes
- Works well for text classification on word-frequency features such as TF-IDF
- Fast and memory efficient
- Probabilistic interpretation
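A minimal sketch of how the vectorizer and classifier fit together, assuming scikit-learn's `TfidfVectorizer` and `MultinomialNB` chained with `make_pipeline`. The training messages are hypothetical, and the project's `model.py` may wire the steps differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned messages, for illustration only.
train_texts = [
    "free entry wkly comp", "win cash prize", "free prize claim",
    "hey tmr meet bugis", "call later", "lunch tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorize and classify in one estimator.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_texts = ["free cash prize", "meet lunch tomorrow"]
predictions = model.predict(new_texts)
# One row per message; columns follow the order of model.classes_.
probabilities = model.predict_proba(new_texts)

for text, label, probs in zip(new_texts, predictions, probabilities):
    print(text, "->", label, dict(zip(model.classes_, probs.round(3))))
```

The `predict_proba` output is what the "probabilistic interpretation" refers to: each prediction comes with a per-class probability, not just a label.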
- Training Data: 80%
- Testing Data: 20%
- Stratified split to preserve the HAM/SPAM ratio in both sets
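The split can be illustrated with scikit-learn's `train_test_split` and its `stratify` argument. This is an assumption about how the project performs it. Only the label counts from the dataset summary are used here, and `random_state=42` is an arbitrary seed.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Label counts taken from the dataset summary; message texts are omitted.
labels = ["ham"] * 4825 + ["spam"] * 747

# random_state=42 is an arbitrary choice, not necessarily the project's seed.
train_labels, test_labels = train_test_split(
    labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_labels), len(test_labels))  # 4457 1115

# Stratification keeps the SPAM share close to the overall 13.41%.
spam_share = Counter(test_labels)["spam"] / len(test_labels)
print(f"SPAM share in test set: {spam_share:.2%}")
```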
| Metric | Score |
|---|---|
| Accuracy | 0.9641 |
| Precision | 1.0000 |
| Recall | ~0.75 |
| F1 Score | 0.8450 |
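- Very high precision: no HAM message in the test set was flagged as SPAM
- Moderate recall: roughly three in four SPAM messages are detected, so some spam is still missed
- The F1 score of 0.8450 reflects the gap between perfect precision and lower recall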
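For reference, the four metrics follow directly from confusion-matrix counts. In the sketch below, `classification_metrics` is a hypothetical helper and the counts are hypothetical as well. They were chosen only because they reproduce the reported accuracy, precision, and F1 for a 1115-message test set, and are not taken from the project's output.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 with SPAM as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts consistent with the reported scores (not project output):
# 109 SPAM caught, 40 SPAM missed, 0 HAM flagged, 966 HAM passed.
metrics = classification_metrics(tp=109, fp=0, fn=40, tn=966)
for name, value in metrics.items():
    print(f"{name:9s} {value:.4f}")
```

With these counts the recall comes out near 0.73, which is in line with the approximate value in the table.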
The model displays 5 random test predictions, showing:
- Original message
- Actual label
- Predicted label
Example:
- Message: hey tmr meet bugis
- Actual Label: HAM
- Predicted Label: HAM
The project generates and saves the following plots:
- Class distribution: shows HAM vs SPAM message counts
- Saved as: results/class_distribution.png
- Top SPAM words: displays the most influential words for SPAM classification
- Extracted from Naive Bayes log probabilities
- Saved as: results/top_spam_words.png
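1. Install dependencies (assuming they are listed in requirements.txt): pip install -r requirements.txt
2. Download the NLTK stopwords from a Python session: import nltk; nltk.download("stopwords")
3. Run the model: python model.py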
- Complete Machine Learning pipeline
- Real-world NLP dataset
- Clean and effective text preprocessing
- Strong evaluation metrics (Accuracy, Precision, Recall, F1-score)
- Clear and meaningful visualizations
- Well-structured, readable, and commented code
- Add confusion matrix visualization
- Experiment with Logistic Regression and SVM
- Perform hyperparameter tuning for better performance
- Deploy the model using Flask or Streamlit
Arnab Datta
Internship Project: Machine Learning & NLP