Amazon Fine Food Reviews Analysis

A comprehensive text preprocessing and feature extraction project for sentiment analysis of Amazon fine food reviews using Natural Language Processing techniques.

Overview

This project prepares the Amazon Fine Food Reviews dataset for sentiment analysis, implementing text preprocessing techniques and feature extraction methods including Bag of Words (BoW), N-grams, and TF-IDF vectorization. Training a sentiment classifier is left as a future enhancement.

Dataset

The project uses the Amazon Fine Food Reviews database containing customer reviews with the following key features:

ProductId: Unique identifier for products
UserId: Unique identifier for users
Score: Rating given by users (1-5 scale)
Text: Review text content
Summary: Brief review summary
Time: Timestamp of the review

Dataset Download

Due to the large file size, the dataset is not included in this repository. Please download it from Kaggle:

Kaggle Dataset: Amazon Fine Food Reviews

The dataset contains the following files:

/kaggle/input/amazon-fine-food-reviews/hashes.txt
/kaggle/input/amazon-fine-food-reviews/Reviews.csv
/kaggle/input/amazon-fine-food-reviews/database.sqlite

Reviews.csv - Main dataset file in CSV format
database.sqlite - The same reviews as a SQLite database (the file the notebook reads)
hashes.txt - File hashes for verification

After downloading, place database.sqlite in the project root directory.

Features

Data Preprocessing

Score Filtering: Excludes neutral reviews (score = 3) to focus on clearly positive/negative sentiment
Binary Classification: Converts scores to binary labels (positive: 4-5, negative: 1-2)
Deduplication: Removes duplicate reviews based on UserId, ProfileName, Time, and Text
Data Validation: Filters out invalid entries where helpfulness numerator exceeds denominator
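
The four steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's exact code: the `Label` column and the `preprocess_reviews` helper are hypothetical names, and the helpfulness column names are assumed to match the Kaggle schema (`HelpfulnessNumerator`, `HelpfulnessDenominator`).

```python
import pandas as pd

def preprocess_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four data-preprocessing steps to a reviews DataFrame."""
    # Score filtering: drop neutral reviews (score = 3).
    df = df[df["Score"] != 3].copy()
    # Binary labels: 1 for scores 4-5, 0 for scores 1-2 ("Label" is a hypothetical name).
    df["Label"] = (df["Score"] > 3).astype(int)
    # Deduplication: keep the first row per (UserId, ProfileName, Time, Text).
    df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
    # Validation: the helpfulness numerator cannot exceed the denominator.
    df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]
    return df.reset_index(drop=True)

# Tiny made-up sample: one duplicate, one neutral review, one invalid row.
sample = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u4"],
    "ProfileName": ["A", "A", "B", "C", "D"],
    "Score": [5, 5, 3, 1, 2],
    "Time": [100, 100, 101, 102, 103],
    "Text": ["Great tea", "Great tea", "It is okay", "Stale chips", "Too salty"],
    "HelpfulnessNumerator": [1, 1, 0, 3, 0],
    "HelpfulnessDenominator": [1, 1, 0, 2, 1],
})
cleaned = preprocess_reviews(sample)
print(cleaned[["UserId", "Score", "Label"]])
```

Only the rows for `u1` (positive) and `u4` (negative) survive in this sample: the duplicate, the neutral review, and the row with an impossible helpfulness ratio are dropped.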

Text Preprocessing

HTML Tag Removal: Strips HTML tags from review text
Punctuation Cleaning: Removes special characters and punctuation
Stop Words Removal: Filters common English stop words
Stemming: Applies Porter Stemmer to reduce words to root forms
Text Normalization: Converts to lowercase and filters short words
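
A cleaning function covering these steps might look like the sketch below. It rests on a few assumptions: the stop-word set is a small hard-coded stand-in so the example runs without downloads (the project uses NLTK's English list), the `min_len` threshold for "short words" is a guess, and `clean_review` is a hypothetical helper name.

```python
import re
from nltk.stem.porter import PorterStemmer

# Stand-in stop-word list (assumption); swap in
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to", "i", "was", "but"}

stemmer = PorterStemmer()

def clean_review(text: str, min_len: int = 3) -> str:
    """Return a lowercase, stemmed, stop-word-free version of a review."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # replace punctuation and digits with spaces
    tokens = text.lower().split()
    kept = [
        stemmer.stem(tok)
        for tok in tokens
        if tok not in STOP_WORDS and len(tok) >= min_len  # drop stop words and short words
    ]
    return " ".join(kept)

cleaned_text = clean_review("This tea is <br />AMAZING!!! I loved the flavor, but it was pricey.")
print(cleaned_text)
```

Stop words are removed before stemming here so that the lookup runs against unmodified words; the notebook may order these steps differently.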

Feature Extraction Methods

1. Bag of Words (BoW)

Creates sparse matrix representation of text data
Vocabulary size: 115,281 unique terms
Binary occurrence counting for each document

2. N-grams Analysis

Unigrams: Single word features
Bigrams: Two-word combinations
Combined (1,2)-grams: Both unigrams and bigrams
Feature space: 2,910,192 total features
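
The three settings differ only in the `ngram_range` argument of `CountVectorizer`. The sketch below uses a two-document toy corpus (an assumption for illustration) to show how the vocabulary grows once bigrams are added.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good", "good taste"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)   # single words only
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)    # two-word sequences only
combined = CountVectorizer(ngram_range=(1, 2)).fit(docs)   # both together

for name, vec in [("unigrams", unigrams), ("bigrams", bigrams), ("combined", combined)]:
    print(name, sorted(vec.vocabulary_))
```

Bigrams keep local context that unigrams lose (for example, "not good" versus "good"), at the cost of a much larger feature space.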

3. TF-IDF Vectorization

Term Frequency-Inverse Document Frequency weighting
Reduces impact of common words across corpus
Highlights distinctive terms for each document
Same feature space as n-grams (2,910,192 features)

Key Statistics

Original Dataset: 568,454 reviews
After Preprocessing: 364,171 reviews (69.26% retention, measured against the 525,814 reviews left after score filtering)
Class Distribution:
- Positive Reviews: 307,061 (84.3%)
- Negative Reviews: 57,110 (15.7%)
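
The class percentages follow directly from the two counts, as this short check shows. The imbalance ratio is derived here and is not a figure reported by the notebook.

```python
positive, negative = 307_061, 57_110

total = positive + negative
pos_share = 100 * positive / total
neg_share = 100 * negative / total
imbalance = positive / negative   # positive reviews per negative review

print(f"{total:,} reviews: {pos_share:.1f}% positive, {neg_share:.1f}% negative")
print(f"about {imbalance:.1f} positive reviews per negative review")
```

With roughly five positive reviews for every negative one, any classifier built on these features would need to account for the class imbalance.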

Technical Implementation

Libraries Used

# Core Libraries
import pandas as pd
import numpy as np
import sqlite3

# NLP Libraries
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# ML Libraries  
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Data Flow

Data Loading: SQLite database connection and filtering
Preprocessing: Text cleaning and normalization
Feature Engineering: Multiple vectorization techniques
Analysis: Most frequent terms and TF-IDF scoring

Installation & Usage

Prerequisites

pip install pandas numpy scikit-learn nltk matplotlib seaborn

Setup

# Download NLTK data
import nltk
nltk.download('stopwords')

Running the Analysis

Place the database.sqlite file in your working directory
Update the database path in the connection string
Execute the Jupyter notebook cells sequentially
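
For orientation, loading the data might look like the sketch below. It assumes the reviews live in a table named `Reviews` (as in the Kaggle database) and that neutral reviews are filtered in SQL; `load_reviews` is a hypothetical helper. The demo builds a throwaway database so the example runs without the real file.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

def load_reviews(db_path: str) -> pd.DataFrame:
    """Read all non-neutral reviews from the SQLite database at db_path."""
    with closing(sqlite3.connect(db_path)) as con:
        return pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)

# Demo with a tiny stand-in database; point db_path at database.sqlite for real use.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
with closing(sqlite3.connect(demo_path)) as con:
    con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
    con.executemany(
        "INSERT INTO Reviews VALUES (?, ?, ?)",
        [(1, 5, "Great"), (2, 3, "Okay"), (3, 1, "Bad")],
    )
    con.commit()

reviews = load_reviews(demo_path)
print(reviews)
```

To run against the real dataset, call `load_reviews("database.sqlite")` with the path adjusted to your local setup.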

Key Findings

Most Common Terms

Positive Reviews: like, taste, good, flavor, love, great, product
Negative Reviews: taste, like, product, flavor, would, try, use
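
Lists like these can be produced by counting tokens per class. The sketch below uses `collections.Counter` on made-up cleaned reviews; `top_terms` is a hypothetical helper, and the notebook may compute its frequencies differently.

```python
from collections import Counter

def top_terms(cleaned_texts, n=3):
    """Return the n most frequent whitespace-separated tokens across the texts."""
    counts = Counter(word for text in cleaned_texts for word in text.split())
    return counts.most_common(n)

# Toy stand-ins for cleaned positive reviews.
positive_texts = ["love tast great", "great flavor love", "great product"]

top_positive = top_terms(positive_texts, n=2)
print(top_positive)   # [('great', 3), ('love', 2)]
```

Running the same function separately on positive and negative reviews gives one ranked list per class.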

Text Processing Results

Effective removal of HTML tags and special characters
Successful stemming reduces vocabulary size
TF-IDF weighting reveals document-specific important terms

Project Structure

amazon-food-reviews-analysis/
├── Amazon_Fine_Food_(1).ipynb    # Main analysis notebook
├── database.sqlite               # Dataset file (download separately from Kaggle)
├── README.md                    # This file
└── requirements.txt             # Dependencies

Future Enhancements

Machine learning model implementation for sentiment prediction
Advanced preprocessing techniques (lemmatization, named entity recognition)
Deep learning approaches (LSTM, BERT)
Cross-validation and model evaluation metrics
Visualization of sentiment trends over time

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is open source and available for educational purposes.

Acknowledgments

Amazon for providing the Fine Food Reviews dataset
NLTK team for natural language processing tools
Scikit-learn contributors for machine learning utilities

Note: This is an educational project demonstrating text preprocessing and feature extraction techniques for sentiment analysis. The dataset path needs to be updated based on your local setup.
