Project Title - Aspect-based Sentiment Analysis on Hotel Reviews

UE23CS352A - Machine Learning - Mini Project

Authors:

AKHILESH ANIL [PES1UG23CS045]
AMOGH SHETTY [PES1UG23CS060]

5th Semester, 'A' Section, B.Tech. Computer Science & Engineering, PES University

Project Overview

This project builds an aspect-based sentiment analysis classifier for the TripAdvisor Hotel Reviews dataset.
The pipeline extracts and cleans review text, builds n-gram bag-of-words / TF / TF-IDF features (with optional TruncatedSVD), compares class-balancing strategies, trains per-aspect models using MultinomialNB, Linear SVM, and Linear Regression, and reports star-rating accuracy and polarity accuracy.
The primary goal of the project is to show empirically that sentiment analysis based on aspect ratings performs better than sentiment analysis based on overall ratings.

Our implementation is inspired by the work of:

Yangyang Yu (Stanford) - Click here for the paper
Hongning Wang et al. - Latent Aspect Rating Analysis without Aspect Keyword Supervision [The 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011] - Click here for datasets

Repository Structure

├── data/
│   ├── raw/
│   └── cleaned/
├── notebooks/
│   ├── Preprocess_Extract.ipynb
│   ├── TextPreprocess_DataPrep.ipynb
│   ├── Class_Balance_Training.ipynb
│   ├── Train_PerLabel_Models.ipynb
│   └── Final_Evaluation_Suite.ipynb
├── artifacts/
│   ├── vectorizer.joblib
│   ├── svd_transform.joblib
│   ├── models/
│   │   ├── svc_aspect_<aspectname>.joblib
│   │   ├── mnb_aspect_<aspectname>.joblib
│   │   └── lr_aspect_<aspectname>.joblib
│   └── results/
│       ├── metrics_summary.csv
│       ├── confusion_matrix_<aspect>.png
│       └── feature_comparison_plots.png
├── requirements.txt
└── README.md

Quick reproducible steps to run the repository

Create virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt

# In Anaconda
conda create -n <Your Virtual Environment name>
conda activate <Your Virtual Environment name>
conda install ipykernel jupyterlab # Install required dependencies 
pip install -r requirements.txt # Assuming you have installed 'pip' during conda setup

Open and run these notebooks in the given order:
1. Preprocess_Extract.ipynb - Data extraction, cleaning, integration and DataFrame creation.
2. TextPreprocess_DataPrep.ipynb - Text cleaning; saves vectorizers and transformed data with joblib.
3. Train_PerLabel_Models.ipynb - Train per-aspect models with no class balancing and save them for reproducibility
4. Class_Balance_Training.ipynb - Run balancing experiments and extract results
5. Final_Evaluation_Suite.ipynb - Final evaluation framework that compares class-balancing methods and ML algorithms and, primarily, tests whether aspect-specific predictors are better than overall predictors.
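
The README does not say which balancing strategies Class_Balance_Training.ipynb compares. As an illustration only, here is a minimal sketch of one common strategy, random oversampling, with star ratings as class labels. The function name `random_oversample` and the toy reviews are hypothetical and not taken from the notebooks.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw (with replacement) the samples missing to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Made-up reviews: 5-star reviews dominate in this toy example.
texts = ["great stay", "loved it", "perfect", "superb", "dirty room", "average"]
labels = [5, 5, 5, 5, 1, 3]
bal_texts, bal_labels = random_oversample(texts, labels)
balance_counts = Counter(bal_labels)
print(balance_counts)
```

Balancing of this kind is normally applied to the training split only, so that the evaluation data keeps its original class distribution.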

Note

Kindly pay attention to the file paths used to load the datasets and the saved ML models across the above notebooks. Since we use multiple variants of the datasets, and save checkpoints of the ML models created for each variant (for reproducibility), you must change these file paths to match your system.

For instance, whenever you find the below path name in any file/cell: D:\PES University\5th Semester\\Machine Learning\\Aspect-Sentiment, replace it with the path where you have set up the repository locally on your system.
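
One way to avoid editing the path in every cell is to define the project root once and derive all other paths from it. The sketch below is a suggestion, not what the notebooks currently do: the environment variable `ASPECT_SENTIMENT_ROOT`, the helper `model_path`, and the aspect name `service` are hypothetical, while the folder and file-name patterns follow the repository structure listed above.

```python
import os
from pathlib import Path

# Hypothetical environment variable; falls back to the current directory.
PROJECT_ROOT = Path(os.environ.get("ASPECT_SENTIMENT_ROOT", ".")).resolve()

DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_CLEANED = PROJECT_ROOT / "data" / "cleaned"
MODELS_DIR = PROJECT_ROOT / "artifacts" / "models"


def model_path(prefix, aspect):
    """Build a model path following the <prefix>_aspect_<aspectname>.joblib
    pattern from the repository structure (prefix: svc, mnb or lr)."""
    return MODELS_DIR / f"{prefix}_aspect_{aspect}.joblib"


print(model_path("svc", "service"))
```

Because `pathlib` builds the separators itself, the same cell works on Windows, Linux and macOS.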

Expected resources as output

Cleaned_FinalMerged_reviews_ratings.csv produced by the extraction notebook.
Trained models (saved with joblib) for each aspect.
Final evaluation tables & plots (star-accuracy / polarity-accuracy per aspect, class-balance comparison, feature selection comparison). We're primarily answering: Are aspect-specific predictors better than overall predictors for hotel reviews?

A review of the final outputs

Figure 1 - Star-rating accuracy vs Dataset size

Interpretation

Accuracy increases monotonically with dataset size.
General trend: MNB > SVM > LR at every size.
Naive Bayes captures coarse star-rating signal effectively even on small data.
SVM improves steadily with size but converges slightly below MNB.
Linear Regression underperforms because it models continuous ratings and loses categorical discriminative power.

Overall justification: The results suggest that our preprocessing and n-gram feature engineering methods are stable (no sign of overfitting at the 40 % dataset size).
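
The Linear Regression point above can be made concrete with a small example. The README does not state how the continuous regression outputs are turned into star classes; the sketch below assumes rounding to the nearest integer and clipping to the 1-5 range. The helper `to_star` and the numbers are made up for illustration.

```python
import math


def to_star(prediction, lo=1, hi=5):
    """Map a continuous regression output to a star class by rounding
    to the nearest integer and clipping to the valid range (assumed 1-5)."""
    return max(lo, min(hi, math.floor(prediction + 0.5)))


# Made-up regression outputs for reviews whose true ratings are 5, 1, 4, 2.
true_stars = [5, 1, 4, 2]
raw_outputs = [4.4, 1.6, 3.5, 2.4]

pred_stars = [to_star(p) for p in raw_outputs]
star_hits = sum(t == p for t, p in zip(true_stars, pred_stars))
print(pred_stars, star_hits / len(true_stars))
```

In this toy case every output is within 0.6 stars of the truth, yet the two extreme ratings are scored as misses, which is the kind of loss a classifier trained directly on the star classes does not incur.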

Figure 2 - Polarity accuracy vs Dataset size

Interpretation

Polarity accuracy (positive/neutral/negative) is consistently higher (0.76 to 0.82) than fine-grained star-rating accuracy.
The curves flatten beyond 25 % of the data, which suggests that this subset already contains enough samples to learn a clear sentiment separation.
All algorithms approach parity, since predicting polarity is a simpler task that is closer to linearly separable than star-rating prediction.
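
The gap between the two metrics follows from how they are computed. The sketch below is a minimal illustration with made-up ratings; the star-to-polarity thresholds (1-2 negative, 3 neutral, 4-5 positive) are an assumption and are not confirmed by the README.

```python
def polarity(star):
    """Collapse a 1-5 star rating into a polarity class.
    The thresholds are an assumption, not taken from the notebooks."""
    if star <= 2:
        return "negative"
    if star == 3:
        return "neutral"
    return "positive"


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up true and predicted star ratings.
y_true = [5, 4, 1, 3, 2, 5]
y_pred = [4, 4, 2, 3, 1, 5]

star_acc = accuracy(y_true, y_pred)
pol_acc = accuracy([polarity(s) for s in y_true],
                   [polarity(s) for s in y_pred])
print(f"star accuracy: {star_acc:.2f}, polarity accuracy: {pol_acc:.2f}")
```

When polarity is derived from the predicted star in this way, an exact star match always implies a polarity match, so polarity accuracy can never fall below star accuracy on the same predictions.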

Figure 3 - Feature Configuration Comparisons

Interpretation

Config	Star accuracy	Polarity accuracy	Observation
Unigram (Count)	0.57	0.80	Good baseline
Unigram (TF-IDF)	0.61	0.82	Best overall
Char 5-gram (Count/TF-IDF)	0.60	0.81	Robust to typos, slightly below word features
Word 2/3-gram (Count/TF-IDF)	0.58 – 0.59	0.81	Phrase info adds noise on small corpora
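
To make the feature configurations in this table concrete, here is a small standard-library sketch of word n-grams, character 5-grams and a plain TF-IDF weighting. It is not the project's vectorizer: the function names are hypothetical, the reviews are made up, and the idf used is the unsmoothed log(N / df), whereas library implementations often use smoothed and normalized variants.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Word n-grams from whitespace tokens, e.g. n=2 gives bigrams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=5):
    """Overlapping character n-grams, spaces included."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_terms):
    """One {term: weight} dict per document, using raw term counts
    and idf = log(N / df) (an unsmoothed textbook variant)."""
    n_docs = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(terms).items()}
        for terms in docs_terms
    ]


# Made-up reviews; the second one contains a typo ("staf").
reviews = ["helpful staff", "helpful staf", "central location"]
unigram_weights = tfidf([word_ngrams(r, 1) for r in reviews])
shared_5grams = set(char_ngrams(reviews[0])) & set(char_ngrams(reviews[1]))
print(unigram_weights[0])
print(sorted(shared_5grams))
```

The misspelled review shares no "staff" unigram with the correct one but still shares most of its character 5-grams, which is the sense in which character features are robust to typos.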

Figure 4 - Overall vs Aspect-Specific Predictor

Interpretation

Aspect	Aspect-Specific Acc	Overall-Trained Acc (on Aspect)	Winner
Value	0.48	0.48	Tie
Rooms	0.49	0.51	Overall
Service	0.58	0.57	Aspect
Location	0.56	0.48	Aspect
Cleanliness	0.56	0.52	Aspect
Front Desk	0.53	0.49	Aspect
Business Service	0.41	0.42	Overall

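Service, Location, Cleanliness, and Front Desk benefit from aspect-specific models, as these aspects have strong lexical cues (“helpful staff”, “spotless room”, “central location”). The gain is largest for Location (0.56 vs 0.48) and smallest for Service (0.58 vs 0.57).
Rooms, Value, and Business Service are harder because their vocabulary overlaps with other aspects, so the overall model matches (Value) or slightly beats (Rooms, Business Service) the aspect-specific one.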

Final Verdict

The experiments provide evidence that aspect-specific training gives better results than overall training on a majority of the aspects (4 of 7, with one tie and two aspects where the overall model is slightly ahead). This suggests that training the predictor with aspect-based objectives lets it learn aspect-specific features.
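
The tally behind this verdict can be recomputed from the Figure 4 table. The short script below copies the accuracies from that table; the helper `winner` and the mean-gain figure are illustrative additions and not part of the evaluation notebook.

```python
# Accuracies copied from the Figure 4 table: (aspect-specific, overall-trained).
results = {
    "Value": (0.48, 0.48),
    "Rooms": (0.49, 0.51),
    "Service": (0.58, 0.57),
    "Location": (0.56, 0.48),
    "Cleanliness": (0.56, 0.52),
    "Front Desk": (0.53, 0.49),
    "Business Service": (0.41, 0.42),
}


def winner(aspect_acc, overall_acc):
    """Label which predictor has the higher accuracy for one aspect."""
    if aspect_acc > overall_acc:
        return "Aspect"
    if aspect_acc < overall_acc:
        return "Overall"
    return "Tie"


tally = {"Aspect": 0, "Overall": 0, "Tie": 0}
for aspect, (aspect_acc, overall_acc) in results.items():
    tally[winner(aspect_acc, overall_acc)] += 1

# Average accuracy difference (aspect-specific minus overall) across aspects.
mean_gain = sum(a - o for a, o in results.values()) / len(results)
print(tally, round(mean_gain, 3))
```

The script counts wins per aspect and also reports the average accuracy difference, which gives a sense of how large the advantage is beyond the win count.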
