This project demonstrates how to build an email spam detection system using Natural Language Processing (NLP) and machine learning techniques. The goal is to automatically identify and filter out unwanted spam emails, improving user experience and security. The project covers the following components:
- Data: A dataset of emails labeled as spam or ham (non-spam).
- Preprocessing: Steps to clean and prepare the email data for modeling.
- Modeling: Machine learning algorithms used to classify emails.
- Evaluation: Assessing the performance of the models.
- Pipeline: A complete pipeline from data preprocessing to prediction.
- Testing: Examples of how the model performs on new, unseen emails.
The dataset contains two main columns:
- Category: Indicates whether the email is spam or ham.
- Message: The actual content of the email.
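
As a rough illustration of the two-column layout, the sketch below parses a few rows and maps the label to a number. The sample rows, the `load_messages` helper, and the 1 = spam / 0 = ham encoding are assumptions made for this example, not part of the project's dataset or code.

```python
import csv
import io

# Hypothetical sample rows in the Category/Message layout described above.
SAMPLE_CSV = """Category,Message
ham,"Are we still meeting for lunch tomorrow?"
spam,"WINNER!! Claim your free prize now"
ham,"Please find the report attached."
"""

def load_messages(handle):
    """Read (label, text) pairs, encoding spam as 1 and ham as 0."""
    rows = []
    for record in csv.DictReader(handle):
        label = record["Category"].strip().lower()
        if label not in ("spam", "ham"):
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((1 if label == "spam" else 0, record["Message"]))
    return rows

rows = load_messages(io.StringIO(SAMPLE_CSV))
spam_count = sum(label for label, _ in rows)
print(f"{len(rows)} messages, {spam_count} spam, {len(rows) - spam_count} ham")
# 3 messages, 1 spam, 2 ham
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.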
Effective preprocessing is essential for building a reliable spam detection model. The preprocessing steps include:
- Lowercasing
- Removing punctuation
- Removing non-alphabetic characters
- Tokenization
- Removing stop words
- Lemmatization
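
The steps above can be sketched as a single function. This is a minimal version that uses only the standard library: the stop-word list is a tiny illustrative set, and `TOY_LEMMAS` is a hypothetical lookup table standing in for a real lemmatizer (for example NLTK's `WordNetLemmatizer`), so the project's actual preprocessing may differ.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "for", "your", "you", "and", "of", "now"}

# Assumption: a toy lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"prizes": "prize", "winning": "win", "won": "win", "calls": "call"}

def preprocess(message):
    """Turn one raw message into a list of cleaned tokens."""
    text = message.lower()                               # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                # drop punctuation, digits, other non-letters (ASCII only)
    tokens = text.split()                                # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [TOY_LEMMAS.get(t, t) for t in tokens]        # lemmatization (toy lookup)

print(preprocess("WINNER!! You won 2 prizes, claim them now."))
# ['winner', 'win', 'prize', 'claim', 'them']
```

Replacing non-letters with a space rather than deleting them keeps words such as "prizes,claim" from being fused into one token.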
The preprocessed messages are converted into numeric feature vectors with TF-IDF (Term Frequency-Inverse Document Frequency), which weights a term by how often it appears in a message and how rare it is across the dataset. Because the spam and ham classes are imbalanced, oversampling is applied to balance them, and the model is then trained on the vectorized, balanced data.
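
To make the vectorization and balancing steps concrete, here is a dependency-free sketch. It assumes one common TF-IDF variant (term frequency times `log(N / df)`) and plain random oversampling; the project may use library implementations with different weighting, or another oversampling method such as SMOTE. The helper names and toy documents are made up for the example.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, dense TF-IDF rows)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows

def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_x.extend(items + extra)
        out_y.extend([y] * target)
    return out_x, out_y

# Toy token lists (hypothetical): one spam message and three ham messages.
docs = [["free", "prize", "claim"], ["meeting", "lunch"], ["report", "attached"], ["lunch", "report"]]
labels = [1, 0, 0, 0]

vocab, X = tfidf_vectors(docs)
X_bal, y_bal = random_oversample(X, labels)
print(len(X), "rows before,", len(X_bal), "rows after balancing")
```

Oversampling is best applied to the training split only; duplicating rows before the train/test split would let copies of the same message appear on both sides and inflate the evaluation scores.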
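The model's performance is evaluated using accuracy, precision, recall, and F1-score. Because the classes are imbalanced, precision and recall on the spam class are more informative than accuracy alone.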
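
The four metrics can be computed directly from the confusion counts, as in the sketch below. The label vectors are invented for illustration and are not results from this project; spam is assumed to be encoded as 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive (spam) class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.75, 'precision': 0.667, 'recall': 0.667, 'f1': 0.667}
```

Here precision answers "of the messages flagged as spam, how many really were?", while recall answers "of the real spam, how much was caught?".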
The trained model can be deployed as a web application with a framework such as Streamlit, so that users can paste in an email message and see whether it is classified as spam or ham.
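
A Streamlit front end for this might look roughly like the sketch below. It assumes the trained model follows the scikit-learn convention of a `predict` method that takes a list of raw texts and returns 1 for spam and 0 for ham; the function names, the widget labels, and the `spam_pipeline.joblib` path are hypothetical.

```python
def predict_label(model, message):
    """Return 'spam' or 'ham' for one message.

    Assumption: model.predict takes a list of raw texts and returns
    one label per text, with 1 meaning spam and 0 meaning ham.
    """
    prediction = model.predict([message])[0]
    return "spam" if prediction == 1 else "ham"

def run_app(model):
    """Render a one-page Streamlit UI around the given model."""
    # Imported here so predict_label can be used and tested without Streamlit.
    import streamlit as st

    st.title("Email Spam Detector")
    message = st.text_area("Paste an email message")
    if st.button("Classify"):
        if not message.strip():
            st.warning("Please enter a message first.")
        elif predict_label(model, message) == "spam":
            st.error("This message looks like spam.")
        else:
            st.success("This message looks like ham (not spam).")

# In app.py, after loading the trained pipeline (the path is hypothetical):
#   import joblib
#   run_app(joblib.load("spam_pipeline.joblib"))
# Then launch with: streamlit run app.py
```

Keeping the prediction logic in `predict_label` separates it from the UI code, so it can be checked with a stub model before the app is wired up.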