Nithya162/Amazon-Fine-Food

Amazon Fine Food Reviews Analysis

A comprehensive text preprocessing and feature extraction project for sentiment analysis of Amazon fine food reviews using Natural Language Processing techniques.

Overview

This project prepares the Amazon Fine Food Reviews dataset for sentiment analysis. It implements text preprocessing techniques and feature extraction methods including Bag of Words (BoW), N-grams, and TF-IDF vectorization.

Dataset

The project uses the Amazon Fine Food Reviews database containing customer reviews with the following key features:

  • ProductId: Unique identifier for products
  • UserId: Unique identifier for users
  • Score: Rating given by users (1-5 scale)
  • Text: Review text content
  • Summary: Brief review summary
  • Time: Timestamp of the review

Dataset Download

Due to the large file size, the dataset is not included in this repository. Please download it from Kaggle:

Kaggle Dataset: Amazon Fine Food Reviews

The dataset contains the following files:

/kaggle/input/amazon-fine-food-reviews/hashes.txt
/kaggle/input/amazon-fine-food-reviews/Reviews.csv
/kaggle/input/amazon-fine-food-reviews/database.sqlite
  • Reviews.csv - Main dataset file (required)
  • database.sqlite - SQLite database format (optional)
  • hashes.txt - File hashes for verification

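After downloading, place Reviews.csv in the project root directory. The steps under Running the Analysis read from database.sqlite, so download that file as well if you follow them as written.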

Features

Data Preprocessing

  • Score Filtering: Excludes neutral reviews (score = 3) to focus on clearly positive/negative sentiment
  • Binary Classification: Converts scores to binary labels (positive: 4-5, negative: 1-2)
  • Deduplication: Removes duplicate reviews based on UserId, ProfileName, Time, and Text
  • Data Validation: Filters out invalid entries where helpfulness numerator exceeds denominator
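
The four steps above can be sketched with pandas on a toy table. This is a minimal illustration, not the notebook's actual code: the rows are made up, the `HelpfulnessNumerator` and `HelpfulnessDenominator` column names and the `Sentiment` label column are assumptions, and the order of the steps is one reasonable choice.

```python
import pandas as pd

# Toy rows standing in for the real reviews table (all values are made up).
reviews = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u4"],
    "ProfileName": ["Ann", "Ann", "Bob", "Cy", "Di"],
    "Time": [100, 100, 200, 300, 400],
    "Text": ["Great tea", "Great tea", "Stale chips", "It was okay", "Love it"],
    "Score": [5, 5, 1, 3, 4],
    "HelpfulnessNumerator": [1, 1, 0, 2, 3],    # assumed column name
    "HelpfulnessDenominator": [1, 1, 2, 2, 2],  # assumed column name
})

# 1. Score filtering: drop neutral reviews (score = 3).
filtered = reviews[reviews["Score"] != 3].copy()

# 2. Binary classification: 4-5 -> "positive", 1-2 -> "negative".
filtered["Sentiment"] = filtered["Score"].map(
    lambda score: "positive" if score > 3 else "negative"
)

# 3. Deduplication on the four columns named above, keeping the first copy.
filtered = filtered.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

# 4. Data validation: the numerator may not exceed the denominator.
filtered = filtered[
    filtered["HelpfulnessNumerator"] <= filtered["HelpfulnessDenominator"]
]

print(filtered[["UserId", "Score", "Sentiment"]])
```

In this toy table the neutral review from `u3`, the duplicate from `u1`, and the invalid row from `u4` are all dropped, leaving one positive and one negative review.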

Text Preprocessing

  • HTML Tag Removal: Strips HTML tags from review text
  • Punctuation Cleaning: Removes special characters and punctuation
  • Stop Words Removal: Filters common English stop words
  • Stemming: Applies Porter Stemmer to reduce words to root forms
  • Text Normalization: Converts to lowercase and filters short words
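
A cleaning function along these lines might look like the sketch below. The function name `clean_review`, the regular expressions, the minimum word length of 3, and the tiny inline stop-word set are all assumptions made for illustration. The project itself uses NLTK's English stop-word list, which requires `nltk.download('stopwords')`.

```python
import re
from nltk.stem.porter import PorterStemmer

# Small illustrative stop-word set (assumption); swap in
# set(nltk.corpus.stopwords.words("english")) for the full list.
STOP_WORDS = {"the", "this", "is", "a", "and", "i", "it", "of", "to", "was"}

_stemmer = PorterStemmer()


def clean_review(text, min_len=3):
    """Return a cleaned, stemmed, space-joined version of a review."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # drop punctuation, digits, symbols
    tokens = text.lower().split()             # lowercase and tokenize
    kept = [
        tok for tok in tokens
        if len(tok) >= min_len and tok not in STOP_WORDS  # short/stop words
    ]
    return " ".join(_stemmer.stem(tok) for tok in kept)   # Porter stemming


cleaned = clean_review("I <b>LOVED</b> this coffee!<br />It was great.")
print(cleaned)
```

Stop words are removed before stemming here so that the lookup matches the unstemmed forms in the list. Applying the steps in a different order can change which tokens survive.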

Feature Extraction Methods

1. Bag of Words (BoW)

  • Creates sparse matrix representation of text data
  • Vocabulary size: 115,281 unique terms
  • Binary occurrence counting for each document
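
As a rough sketch of how a BoW matrix is built with scikit-learn's `CountVectorizer`, the toy corpus below (made up for illustration) is vectorized twice. By default the vectorizer stores term counts. Passing `binary=True` records only presence or absence, which is one way to obtain binary occurrence features. Which setting the notebook uses is not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up); the real input would be the cleaned review texts.
corpus = [
    "good tea good price",
    "bad tea",
    "good coffee",
]

# Default behaviour: each cell holds how often the term occurs in the document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)   # SciPy sparse matrix

# binary=True: each cell is 1 if the term occurs at all, else 0.
binary_vec = CountVectorizer(binary=True)
binary = binary_vec.fit_transform(corpus)

# Vocabulary terms in column order.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
good_col = count_vec.vocabulary_["good"]

print(counts.shape)                          # (documents, unique terms)
print(vocab)
print(counts[0, good_col], binary[0, good_col])
```

The matrix stays sparse because each review contains only a small fraction of the vocabulary, which is what keeps a vocabulary of over 100K terms manageable in memory.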

2. N-grams Analysis

  • Unigrams: Single word features
  • Bigrams: Two-word combinations
  • Combined (1,2)-grams: Both unigrams and bigrams
  • Feature space: 2,910,192 total features
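
The jump in feature count from unigrams to combined (1,2)-grams can be seen on a small scale with `CountVectorizer`'s `ngram_range` parameter. The two sentences below are made up, and the sketch only illustrates the mechanism, not the notebook's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up) in which word order carries the sentiment.
corpus = ["not good at all", "really good taste"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)   # single words
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(corpus)    # word pairs
combined = CountVectorizer(ngram_range=(1, 2)).fit(corpus)   # both

n_uni = len(unigrams.vocabulary_)
n_bi = len(bigrams.vocabulary_)
n_combined = len(combined.vocabulary_)

print(n_uni, n_bi, n_combined)
print(sorted(bigrams.vocabulary_))
```

Bigrams such as "not good" keep negation attached to the word it modifies, which a pure unigram model loses. The cost is a much larger and sparser feature space.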

3. TF-IDF Vectorization

  • Term Frequency-Inverse Document Frequency weighting
  • Reduces impact of common words across corpus
  • Highlights distinctive terms for each document
  • Same feature space as n-grams (2,910,192 features)
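
To illustrate how TF-IDF down-weights terms that appear in every document, the sketch below fits scikit-learn's `TfidfVectorizer` with default settings on a made-up corpus. With those defaults, IDF is smoothed and each document vector is scaled to unit L2 norm. Whether the notebook keeps the defaults is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "tea" is in every document, "great" in only one.
corpus = [
    "tea tastes great",
    "tea tastes stale",
    "tea arrived late",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
weights = tfidf.fit_transform(corpus)

# Map each term to its learned inverse document frequency.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
idf = dict(zip(terms, tfidf.idf_))

# Each document row is L2-normalized by default.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

for term in ("tea", "tastes", "great"):
    print(term, round(float(idf[term]), 3))
```

The term found in all three documents receives the lowest IDF, and the term found in only one receives the highest, so distinctive words dominate each document's vector.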

Key Statistics

  • Original Dataset: more than 500K reviews
  • After Preprocessing: 364,171 reviews (69.26% retention)
  • Class Distribution:
    • Positive Reviews: 307,061 (84.3%)
    • Negative Reviews: 57,110 (15.7%)
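
The figures above can be cross-checked with a few lines of arithmetic. The counts are taken from the list; the "implied baseline" is only an inference from the stated 69.26% retention, not a number reported by the project.

```python
# Counts reported under Key Statistics.
positive = 307_061
negative = 57_110
retention = 0.6926          # stated retention after preprocessing

total = positive + negative
positive_share = positive / total
negative_share = negative / total

# Row count the retention figure would be measured against (an inference).
implied_baseline = round(total / retention)

print(total)
print(f"{positive_share:.1%} positive, {negative_share:.1%} negative")
print(implied_baseline)
```

The two class counts sum to the reported 364,171 reviews, and the shares match the listed percentages. The roughly 5:1 imbalance is worth keeping in mind when choosing evaluation metrics for any later classifier.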

Technical Implementation

Libraries Used

# Core Libraries
import pandas as pd
import numpy as np
import sqlite3

# NLP Libraries
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# ML Libraries  
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Data Flow

  1. Data Loading: SQLite database connection and filtering
  2. Preprocessing: Text cleaning and normalization
  3. Feature Engineering: Multiple vectorization techniques
  4. Analysis: Most frequent terms and TF-IDF scoring
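
Step 1 of this flow might be implemented roughly as follows. To keep the sketch self-contained it builds a tiny in-memory SQLite database instead of opening database.sqlite. The table name `Reviews`, the reduced column set, and the sample rows are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for database.sqlite; with the real file this would be
# sqlite3.connect("database.sqlite") using your local path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Great tea"), (2, 3, "It was okay"), (3, 1, "Stale chips")],
)
con.commit()

# Load and filter in one query: neutral reviews (Score = 3) never reach pandas.
df = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)
con.close()

print(df)
```

Filtering inside the SQL query avoids loading rows that would be discarded immediately, which matters more on the full dataset than on this toy table.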

Installation & Usage

Prerequisites

pip install pandas numpy scikit-learn nltk matplotlib seaborn

Setup

# Download NLTK data
import nltk
nltk.download('stopwords')

Running the Analysis

  1. Place the database.sqlite file in your working directory
  2. Update the database path in the connection string
  3. Execute the Jupyter notebook cells sequentially

Key Findings

Most Common Terms

  • Positive Reviews: like, taste, good, flavor, love, great, product
  • Negative Reviews: taste, like, product, flavor, would, try, use
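
Per-class term frequencies like these can be computed with a simple counter over the cleaned reviews. The sketch below uses made-up, already-cleaned texts and a hypothetical helper named `top_terms`; it shows the general approach rather than the notebook's code.

```python
from collections import Counter

# Toy (cleaned_text, label) pairs; values are made up for illustration.
labelled = [
    ("love great flavor", "positive"),
    ("great taste love", "positive"),
    ("bad taste", "negative"),
    ("bad product bad taste", "negative"),
]


def top_terms(docs, label, n=3):
    """Return the n most frequent whitespace-separated terms for one class."""
    counts = Counter()
    for text, doc_label in docs:
        if doc_label == label:
            counts.update(text.split())
    return counts.most_common(n)


top_negative = top_terms(labelled, "negative", n=2)
top_positive = top_terms(labelled, "positive", n=2)
print(top_positive)
print(top_negative)
```

Raw frequency lists for the two classes tend to overlap heavily, as the lists above show with "taste", "like", and "flavor". This overlap is one motivation for TF-IDF weighting.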

Text Processing Results

  • Effective removal of HTML tags and special characters
  • Successful stemming reduces vocabulary size
  • TF-IDF weighting reveals document-specific important terms

Project Structure

amazon-food-reviews-analysis/
├── Amazon_Fine_Food_(1).ipynb    # Main analysis notebook
├── database.sqlite               # Dataset file (download separately from Kaggle)
├── README.md                    # This file
└── requirements.txt             # Dependencies

Future Enhancements

  • Machine learning model implementation for sentiment prediction
  • Advanced preprocessing techniques (lemmatization, named entity recognition)
  • Deep learning approaches (LSTM, BERT)
  • Cross-validation and model evaluation metrics
  • Visualization of sentiment trends over time

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is open source and available for educational viewing.

Acknowledgments

  • Amazon for providing the Fine Food Reviews dataset
  • NLTK team for natural language processing tools
  • Scikit-learn contributors for machine learning utilities

Note: This is an educational project demonstrating text preprocessing and feature extraction techniques for sentiment analysis. The dataset path needs to be updated based on your local setup.
