A comprehensive text preprocessing and feature extraction project for sentiment analysis of Amazon fine food reviews using Natural Language Processing techniques.
This project performs sentiment analysis on the Amazon Fine Food Reviews dataset, implementing various text preprocessing techniques and feature extraction methods including Bag of Words (BoW), N-grams, and TF-IDF vectorization.
The project uses the Amazon Fine Food Reviews database containing customer reviews with the following key features:
- ProductId: Unique identifier for products
- UserId: Unique identifier for users
- Score: Rating given by users (1-5 scale)
- Text: Review text content
- Summary: Brief review summary
- Time: Timestamp of the review
Due to the large file size, the dataset is not included in this repository. Please download it from Kaggle:
Kaggle Dataset: Amazon Fine Food Reviews
The dataset contains the following files:
/kaggle/input/amazon-fine-food-reviews/hashes.txt
/kaggle/input/amazon-fine-food-reviews/Reviews.csv
/kaggle/input/amazon-fine-food-reviews/database.sqlite
- Reviews.csv: Main dataset file (required)
- database.sqlite: SQLite database format (optional)
- hashes.txt: File hashes for verification
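After downloading, place Reviews.csv in the project root directory. The usage steps in this README read the data through a SQLite connection, so keep database.sqlite there as well if you follow them.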
- Score Filtering: Excludes neutral reviews (score = 3) to focus on clearly positive/negative sentiment
- Binary Classification: Converts scores to binary labels (positive: 4-5, negative: 1-2)
- Deduplication: Removes duplicate reviews based on UserId, ProfileName, Time, and Text
- Data Validation: Filters out invalid entries where helpfulness numerator exceeds denominator
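The four filtering steps above (score filtering, binary labels, deduplication, helpfulness validation) could be sketched with pandas roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code. The `HelpfulnessNumerator` and `HelpfulnessDenominator` column names are assumed from the Kaggle schema, and the `label` column name and sample rows are hypothetical.

```python
import pandas as pd

# Tiny made-up sample; the real data comes from the Kaggle download.
reviews = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u4"],
    "ProfileName": ["Ann", "Ann", "Bob", "Cy", "Di"],
    "Time": [100, 100, 200, 300, 400],
    "Text": ["Great tea", "Great tea", "It was okay", "Stale chips", "Love it"],
    "Score": [5, 5, 3, 1, 4],
    "HelpfulnessNumerator": [1, 1, 0, 2, 3],
    "HelpfulnessDenominator": [1, 1, 0, 2, 1],
})

# Score filtering: drop neutral reviews (Score == 3).
filtered = reviews[reviews["Score"] != 3].copy()

# Binary classification: 4-5 -> 1 (positive), 1-2 -> 0 (negative).
filtered["label"] = (filtered["Score"] > 3).astype(int)

# Deduplication: keep the first row per (UserId, ProfileName, Time, Text).
filtered = filtered.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

# Data validation: the helpfulness numerator cannot exceed the denominator.
filtered = filtered[
    filtered["HelpfulnessNumerator"] <= filtered["HelpfulnessDenominator"]
]

print(filtered[["UserId", "Score", "label"]])
```

In this sample the duplicate `u1` row, the neutral `u2` row, and the invalid `u4` row are dropped, leaving one positive and one negative review.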
- HTML Tag Removal: Strips HTML tags from review text
- Punctuation Cleaning: Removes special characters and punctuation
- Stop Words Removal: Filters common English stop words
- Stemming: Applies Porter Stemmer to reduce words to root forms
- Text Normalization: Converts to lowercase and filters short words
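The text cleaning steps above can be chained into a single function. The sketch below is one possible ordering, not necessarily the notebook's. The `clean_review` name, the `min_length=3` threshold, and the short hard-coded stop word set are assumptions made so the example runs without downloading NLTK data; the project itself uses the NLTK English stop word list.

```python
import re
from nltk.stem.porter import PorterStemmer

# Small stand-in list so the sketch runs without nltk.download('stopwords').
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "was", "of", "to", "i"}

stemmer = PorterStemmer()

def clean_review(text, min_length=3):
    """Return a cleaned, stemmed version of one review (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^a-zA-Z]+", " ", text)    # drop punctuation, digits, symbols
    tokens = text.lower().split()              # lowercase and split on whitespace
    kept = [
        stemmer.stem(tok)                      # Porter stemming
        for tok in tokens
        if len(tok) >= min_length and tok not in STOP_WORDS
    ]
    return " ".join(kept)

print(clean_review("I <b>loved</b> this tea!<br />Tasting notes: amazing."))
```

Stop words are removed before stemming here so that the stop word lookup sees the original word forms.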
- Creates sparse matrix representation of text data
- Vocabulary size: 115,281 unique terms
- Binary occurrence counting for each document
- Unigrams: Single word features
- Bigrams: Two-word combinations
- Combined (1,2)-grams: Both unigrams and bigrams
- Feature space: 2,910,192 total features
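The three n-gram settings above map onto the `ngram_range` parameter of `CountVectorizer`. The following sketch, on two made-up documents, shows how the vocabulary grows when bigrams are added; the `vocab_sizes` dictionary is only there for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews.
docs = ["not good tea", "good tea"]

vocab_sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(docs)
    vocab_sizes[ngram_range] = len(vec.vocabulary_)
    print(ngram_range, sorted(vec.vocabulary_))
```

Bigrams such as "not good" keep negation that unigrams lose, at the cost of a much larger feature space, which is consistent with the jump from 115,281 to 2,910,192 features reported here.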
- Term Frequency-Inverse Document Frequency weighting
- Reduces impact of common words across corpus
- Highlights distinctive terms for each document
- Same feature space as n-grams (2,910,192 features)
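To make the weighting concrete, the sketch below fits a `TfidfVectorizer` with `ngram_range=(1, 2)` on three made-up documents and checks one IDF value by hand. With scikit-learn's default settings (`smooth_idf=True`), the IDF of a term t is ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. Whether the notebook uses these defaults is an assumption; `idf_of` is a hypothetical helper.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned reviews.
docs = ["good tea good price", "good chip", "stale chip"]

tfidf = TfidfVectorizer(ngram_range=(1, 2))
X = tfidf.fit_transform(docs)   # rows are L2-normalized by default

def idf_of(term):
    """Look up the fitted IDF weight of one term (hypothetical helper)."""
    return tfidf.idf_[tfidf.vocabulary_[term]]

# "tea" occurs in 1 of 3 documents, "good" in 2 of 3.
n_docs = len(docs)
expected_idf_tea = math.log((1 + n_docs) / (1 + 1)) + 1

print("idf(good) =", round(idf_of("good"), 3))
print("idf(tea)  =", round(idf_of("tea"), 3))
print("expected idf(tea) =", round(expected_idf_tea, 3))
```

The term that appears in more documents ("good") receives the lower IDF, which is how common words are down-weighted.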
- Original Dataset: more than 500K reviews
- After Preprocessing: 364,171 reviews (69.26% retention)
- Class Distribution:
  - Positive Reviews: 307,061 (84.3%)
  - Negative Reviews: 57,110 (15.7%)
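The percentages above can be re-derived from the raw counts with a few lines of arithmetic. The sketch also back-computes the row count implied by the 69.26% retention figure; which count the notebook used as that base is not stated here, so the implied value is only an estimate.

```python
positive = 307_061
negative = 57_110
total = positive + negative

positive_share = 100 * positive / total
negative_share = 100 * negative / total
print(f"total={total:,} positive={positive_share:.1f}% negative={negative_share:.1f}%")

# Positive reviews outnumber negative ones by this factor.
imbalance_ratio = positive / negative
print(f"imbalance ratio: {imbalance_ratio:.2f} to 1")

# Row count implied by "69.26% retention" (an estimate, see note above).
implied_base = total / 0.6926
print(f"implied base: about {implied_base:,.0f} reviews")
```

The imbalance of roughly 5 to 1 matters for any later classifier, since plain accuracy would look high even for a model that always predicts "positive".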
# Core Libraries
import pandas as pd
import numpy as np
import sqlite3
# NLP Libraries
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
# ML Libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, roc_curve, auc
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
- Data Loading: SQLite database connection and filtering
- Preprocessing: Text cleaning and normalization
- Feature Engineering: Multiple vectorization techniques
- Analysis: Most frequent terms and TF-IDF scoring
pip install pandas numpy scikit-learn nltk matplotlib seaborn
# Download NLTK data
import nltk
nltk.download('stopwords')
- Place the database.sqlite file in your working directory
- Update the database path in the connection string
- Execute the Jupyter notebook cells sequentially
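The connection step mentioned above might look like the sketch below. So that it runs without the real file, it connects to an in-memory database and creates a small stand-in table; with the actual dataset you would set `DB_PATH` to the location of database.sqlite and skip the `CREATE TABLE` and `INSERT` statements. The `DB_PATH` name, the `Reviews` table name, and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical path; replace ":memory:" with the location of database.sqlite.
DB_PATH = ":memory:"

con = sqlite3.connect(DB_PATH)

# Stand-in table so the sketch runs without the real file.
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Great tea"), (2, 3, "It was okay"), (3, 1, "Stale chips")],
)

# Load only non-neutral reviews, mirroring the score filtering step.
rows = con.execute(
    "SELECT Id, Score, Text FROM Reviews WHERE Score != 3"
).fetchall()
con.close()

print(rows)
```

To get a DataFrame instead of a list of tuples, the same query string can be passed to `pandas.read_sql_query` together with the open connection.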
Positive Reviews: like, taste, good, flavor, love, great, product
Negative Reviews: taste, like, product, flavor, would, try, use
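Per-class term lists like these can be produced by counting tokens separately for each label. The sketch below uses `collections.Counter` on made-up cleaned reviews; `top_terms` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

# Made-up cleaned reviews, grouped by label.
positive_reviews = ["love tast great flavor", "great tast good product"]
negative_reviews = ["tast stale would not buy", "product tast bad"]

def top_terms(cleaned_reviews, k=3):
    """Return the k most frequent tokens with their counts (hypothetical helper)."""
    counts = Counter(tok for review in cleaned_reviews for tok in review.split())
    return counts.most_common(k)

print("positive:", top_terms(positive_reviews))
print("negative:", top_terms(negative_reviews))
```

Terms that rank high in both classes, as "taste" and "like" do in the lists above, say little about sentiment on their own, which is one motivation for TF-IDF weighting.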
- Effective removal of HTML tags and special characters
- Successful stemming reduces vocabulary size
- TF-IDF weighting reveals document-specific important terms
amazon-food-reviews-analysis/
├── Amazon_Fine_Food_(1).ipynb # Main analysis notebook
├── database.sqlite # Dataset File
├── README.md # This file
└── requirements.txt # Dependencies
- Machine learning model implementation for sentiment prediction
- Advanced preprocessing techniques (lemmatization, named entity recognition)
- Deep learning approaches (LSTM, BERT)
- Cross-validation and model evaluation metrics
- Visualization of sentiment trends over time
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is open source and available for educational purposes.
- Amazon for providing the Fine Food Reviews dataset
- NLTK team for natural language processing tools
- Scikit-learn contributors for machine learning utilities
Note: This is an educational project demonstrating text preprocessing and feature extraction techniques for sentiment analysis. The dataset path needs to be updated based on your local setup.