This project demonstrates text classification using the 20 Newsgroups dataset. The dataset contains approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. The aim is to classify these documents into their respective categories using the Multinomial Naive Bayes model. The project follows a structured approach that includes data loading and exploration, data cleaning and preprocessing, feature extraction using TF-IDF vectorization, and model training and evaluation.
The 20 Newsgroups dataset is fetched using sklearn.datasets.fetch_20newsgroups and includes the following:
- Text Data: The content of the newsgroup posts.
- Target: Category labels for the posts.
- Target Names: Names of the categories.
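A minimal sketch of loading the data, assuming `subset='all'`; the notebook may fetch a different split:

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch all posts; downloaded on first call, then cached locally.
# subset='all' is an assumption; the project may use the default train split.
newsgroups = fetch_20newsgroups(subset='all')

print(len(newsgroups.data))         # number of posts (~18,846 for subset='all')
print(newsgroups.target[:5])        # integer category labels
print(newsgroups.target_names[:3])  # human-readable category names
```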
- Python
- Pandas
- NumPy
- NLTK
- scikit-learn
- Seaborn
- Matplotlib
- Jupyter Notebook
text-classification/
│
├── data/
│ └── 20_newsgroups_dataset.csv
│
├── notebook/
│ └── text_classification_project.ipynb
│
├── results/
│ ├── distribution_of_categories.png
│ └── classification_report.txt
│
└── requirements.txt

- Data Collection: The project utilises the 20 Newsgroups dataset, fetched with the fetch_20newsgroups function from sklearn.datasets. The data comprises posts from 20 different newsgroups, categorised into various topics.
- Data Preprocessing (a code sketch follows):
- Loading the Data: The dataset is fetched and loaded using the fetch_20newsgroups function from sklearn.datasets.
- Creating DataFrame: The data and target variables are extracted and combined into a Pandas DataFrame. The target variable is also mapped to category names for better interpretability.
- Saving Dataset: The DataFrame is saved as a CSV file for easy access and future use.
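A sketch of the DataFrame construction and CSV export described above; the column names `text`, `target`, and `category` are assumptions, and the output path is taken from the project structure:

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all')

# Combine the raw text and integer labels into one DataFrame
df = pd.DataFrame({'text': newsgroups.data, 'target': newsgroups.target})

# Map integer labels to category names for better interpretability
df['category'] = df['target'].map(lambda i: newsgroups.target_names[i])

# Save for easy access and future use
df.to_csv('data/20_newsgroups_dataset.csv', index=False)
```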
- Exploratory Data Analysis (EDA):
- Checking Data Types: The data types of the DataFrame columns are checked to ensure correctness.
- Inspecting Missing Values: The dataset is inspected for missing values; none are present.
- Descriptive Statistics: Basic descriptive statistics are computed to understand the dataset.
- Data Distribution Visualisation: The distribution of categories in the dataset is visualised (see the sketch below).
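The EDA steps might look like the following, continuing from the `df` built above; the plot styling is illustrative, not necessarily the notebook's:

```python
import matplotlib.pyplot as plt
import seaborn as sns

print(df.dtypes)                   # check column data types
print(df.isnull().sum())           # confirm there are no missing values
print(df['category'].describe())   # basic descriptive statistics

# Visualise how many posts fall into each category
plt.figure(figsize=(10, 6))
sns.countplot(y='category', data=df, order=df['category'].value_counts().index)
plt.title('Distribution of Categories')
plt.tight_layout()
plt.savefig('results/distribution_of_categories.png')  # assumes results/ exists
```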
- Text Preprocessing and Cleaning (a code sketch follows):
- Removing Stopwords: Stopwords, which carry little meaningful information, are removed using NLTK's list of English stopwords.
- Text Cleaning: A function is defined to clean the text data by removing special characters and numbers and converting the text to lowercase.
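A sketch of the cleaning step described above; the function name and the exact regular expression are assumptions:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of NLTK's stopword lists
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip special characters and numbers, drop stopwords."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # keep letters and whitespace only
    return ' '.join(w for w in text.split() if w not in stop_words)

df['clean_text'] = df['text'].apply(clean_text)
```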
- Feature Extraction (a code sketch follows):
- TF-IDF Vectorization: The cleaned text data is transformed into numerical features using the TF-IDF vectoriser (TfidfVectorizer from sklearn.feature_extraction.text).
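In code, this step could look like the sketch below; the `max_features=5000` vocabulary cap is an illustrative assumption, not necessarily the notebook's setting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # vocabulary cap is an assumption
X = vectorizer.fit_transform(df['clean_text'])   # sparse document-term matrix
y = df['target']
```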
- Model Building and Training (a code sketch follows):
- Train-Test Split: The dataset is split into training and testing sets using an 80-20 split.
- Naive Bayes Classifier: A Multinomial Naive Bayes classifier (MultinomialNB from sklearn.naive_bayes) is trained on the training set.
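A sketch of the split and training step; the fixed random seed is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 80-20 split as described above; random_state=42 is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = MultinomialNB()
model.fit(X_train, y_train)
```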
- Model Evaluation (a code sketch follows):
- Accuracy Score: The accuracy of the model is evaluated on the test set.
- Classification Report: A classification report is generated to provide detailed performance metrics for each category.
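The evaluation step might be implemented along these lines, continuing from the sketches above; the output path is taken from the project structure:

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

# Per-class precision, recall, and F1, saved alongside the other results
report = classification_report(y_test, y_pred, target_names=newsgroups.target_names)
with open('results/classification_report.txt', 'w') as f:
    f.write(report)
```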
- Clone the repository:
git clone https://github.com/ellahu1003/text-classification-project.git
cd text-classification-project
- Install the required packages:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook notebook/text_classification_project.ipynb
The requirements.txt file lists all the Python packages required to run the project. Install these dependencies to avoid compatibility issues.
- Accuracy score: 0.88
- The visual distribution of categories is available in distribution_of_categories.png
- Detailed information about the model's performance on each class is available in classification_report.txt
The model performs well overall with a high accuracy of 88%. However, there are some classes where performance could be improved, particularly class 15, which has very low recall. This suggests that the model struggles to identify this class effectively.