akhianil604/Aspect-Sentiment

UE23CS352A - Machine Learning - Mini Project

Project Title - Aspect-based Sentiment Analysis on Hotel Reviews

Authors:

  1. AKHILESH ANIL [PES1UG23CS045]
  2. AMOGH SHETTY [PES1UG23CS060]

5th Semester, 'A' Section, B.Tech. Computer Science & Engineering, PES University

Project Overview

  1. This project builds an Aspect-based Sentiment Analysis classifier for the TripAdvisor Hotel Reviews dataset.
  2. The pipeline extracts and cleans review text, builds n-gram bag-of-words / TF / TF-IDF features (with optional TruncatedSVD), compares class-balancing strategies, trains per-aspect models using MultinomialNB, Linear SVM and Linear Regression, and reports star-rating accuracy and polarity accuracy.
  3. The primary goal of our project is to show computationally that sentiment analysis based on aspect ratings performs better than overall sentiment analysis.
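The per-aspect idea in the overview can be sketched as follows. This is a minimal illustration assuming scikit-learn is installed, not the notebooks' actual code: the toy reviews, the ratings and the choice of a single shared TF-IDF vectorizer are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy reviews with made-up per-aspect star ratings (1-5); not project data.
reviews = [
    "helpful staff and spotless room",
    "rude staff and dirty room",
    "central location but dirty bathroom",
    "friendly service, far from the centre",
]
ratings = {
    "Service": [5, 1, 3, 5],
    "Cleanliness": [5, 1, 1, 3],
}

# Fit the text features once, then train one independent classifier per aspect.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
models = {aspect: LinearSVC().fit(X, stars) for aspect, stars in ratings.items()}

# Each aspect model predicts its own star rating for an unseen review.
X_new = vectorizer.transform(["spotless room, helpful staff"])
predictions = {aspect: int(m.predict(X_new)[0]) for aspect, m in models.items()}
print(predictions)
```

Swapping `LinearSVC` for `MultinomialNB` would give the Naive Bayes variant on the same features.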

Our implementation is inspired by the work of:

  1. Yangyang Yu (Stanford) - Click here for the paper
  2. Hongning Wang et al. - Latent Aspect Rating Analysis without Aspect Keyword Supervision [The 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011] - Click here for datasets

Repository Structure

├── data/
│   ├── raw/                        
│   ├── cleaned/                    
├── notebooks/
│   ├── Preprocess_Extract.ipynb
│   ├── TextPreprocess_DataPrep.ipynb
│   ├── Class_Balance_Training.ipynb
│   ├── Train_PerLabel_Models.ipynb
│   └── Final_Evaluation_Suite.ipynb
├── artifacts/
│   ├── vectorizer.joblib
│   ├── svd_transform.joblib
│   ├── models/
│   │   ├── svc_aspect_<aspectname>.joblib
│   │   ├── mnb_aspect_<aspectname>.joblib
│   │   └── lr_aspect_<aspectname>.joblib
│   └── results/
│       ├── metrics_summary.csv
│       ├── confusion_matrix_<aspect>.png
│       └── feature_comparison_plots.png
├── requirements.txt
└── README.md

Quick reproducible steps to run the repository

  1. Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt

# In Anaconda
conda create -n <Your Virtual Environment name>
conda activate <Your Virtual Environment name>
conda install ipykernel jupyterlab # Install required dependencies 
pip install -r requirements.txt # Assuming you have installed 'pip' during conda setup
  2. Open and run these notebooks in the given order:
    1. Preprocess_Extract.ipynb - Data extraction, cleaning, integration and DataFrame creation.
    2. TextPreprocess_DataPrep.ipynb - Text cleaning; saves the vectorizers and transformed data with joblib.
    3. Train_PerLabel_Models.ipynb - Trains per-aspect models with no class balancing and saves them for reproducibility.
    4. Class_Balance_Training.ipynb - Runs the class-balancing experiments and extracts the results.
    5. Final_Evaluation_Suite.ipynb - Final evaluation framework: compares class-balancing methods and ML algorithms and, primarily, tests whether aspect-specific predictors are better than overall predictors.
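
Saving and reloading artifacts with joblib, as the notebooks do, might look like the sketch below. The file names follow the repository structure; the toy data and the temporary directory (used so the example does not write into the repository) are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy reviews and made-up Service ratings; not project data.
reviews = [
    "helpful staff",
    "rude staff",
    "friendly and helpful service",
    "slow and rude service",
]
service_stars = [5, 1, 5, 1]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(reviews), service_stars)

# Persist both the fitted vectorizer and the model, then load them back.
with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    joblib.dump(vectorizer, artifacts / "vectorizer.joblib")
    joblib.dump(model, artifacts / "svc_aspect_service.joblib")
    loaded_vectorizer = joblib.load(artifacts / "vectorizer.joblib")
    loaded_model = joblib.load(artifacts / "svc_aspect_service.joblib")

# The reloaded pair should reproduce the original predictions.
new_review = ["helpful staff"]
original = model.predict(vectorizer.transform(new_review))
reloaded = loaded_model.predict(loaded_vectorizer.transform(new_review))
same = bool((original == reloaded).all())
print(same)
```

The vectorizer must be saved together with the models, because a model is only valid for the feature space it was trained on.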

Note

Pay attention to the file paths used to load the datasets and the saved ML models across the notebooks above. Since we use multiple variants of the datasets, and saved checkpoints of the ML models for each variant (for reproducibility), you must change these paths to match your system.

For instance, wherever you find the path below in any file/cell: D:\PES University\5th Semester\\Machine Learning\\Aspect-Sentiment, replace it with the path where you have set up the repository locally.
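
One way to avoid editing the path in every cell is to define the repository root once and build all other paths from it. The sketch below is only a suggestion: the environment variable `ASPECT_SENTIMENT_ROOT` and the helper `project_path` are hypothetical names that the notebooks do not define.

```python
import os
from pathlib import Path

# ASPECT_SENTIMENT_ROOT is a hypothetical environment variable; if it is not
# set, the current working directory is assumed to be the repository root.
PROJECT_ROOT = Path(os.environ.get("ASPECT_SENTIMENT_ROOT", Path.cwd()))


def project_path(*parts: str) -> Path:
    """Build a path below the repository root in an OS-independent way."""
    return PROJECT_ROOT.joinpath(*parts)


# Directory and file names taken from the repository structure above.
cleaned_dir = project_path("data", "cleaned")
vectorizer_file = project_path("artifacts", "vectorizer.joblib")
print(cleaned_dir)
print(vectorizer_file)
```

With this pattern only one line has to change per machine, and `pathlib` takes care of the separator differences between Windows and Unix-like systems.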

Expected resources as output

  1. Cleaned_FinalMerged_reviews_ratings.csv produced by the extraction notebook.
  2. Trained models (saved with joblib) for each aspect.
  3. Final evaluation tables & plots (star-accuracy / polarity-accuracy per aspect, class-balance comparison, feature selection comparison). We're primarily answering: Are aspect-specific predictors better than overall predictors for hotel reviews?

A review of the final outputs

Figure 1 - Star-rating accuracy vs dataset size

Interpretation

  1. Accuracy increases monotonically with dataset size.
  2. General trend: MNB > SVM > LR at every size.
  3. Naive Bayes captures coarse star-rating signal effectively even on small data.
  4. SVM improves steadily with size but converges slightly below MNB.
  5. Linear Regression underperforms because it treats star ratings as a continuous target and loses categorical discriminative power.

Overall justification: The results suggest that our preprocessing and n-gram feature engineering methods are stable, with no sign of overfitting at 40 % of the data.

Figure 2 - Polarity accuracy vs dataset size

Interpretation

  1. Polarity accuracy (positive/neutral/negative) is consistently higher (0.76 to 0.82) than fine-grained star-rating accuracy.
  2. The curves flatten beyond 25 % of the data, which suggests that this fraction already provides enough samples to learn a clear sentiment separation.

All algorithms approach parity, since text polarity is a simpler and more nearly linearly separable task.
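
The difference between the two metrics can be illustrated with a small sketch. The cut-offs used here (1-2 stars negative, 3 neutral, 4-5 positive) are an assumption for the example; the README does not state the exact mapping used in the notebooks.

```python
def to_polarity(stars: int) -> str:
    """Collapse a 1-5 star rating into a polarity class (assumed cut-offs)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def accuracy(y_true, y_pred) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up true and predicted star ratings for five reviews.
y_true = [5, 4, 3, 1, 2]
y_pred = [4, 4, 3, 2, 5]

star_acc = accuracy(y_true, y_pred)
polarity_acc = accuracy(
    [to_polarity(s) for s in y_true],
    [to_polarity(s) for s in y_pred],
)
print(star_acc, polarity_acc)
```

Near misses such as predicting 4 stars for a 5-star review count as errors for star-rating accuracy but as correct for polarity accuracy, which is why the second metric is higher.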

Figure 3 - Feature Configuration Comparisons

Interpretation

| Config | Star Acc | Polarity Acc | Observation |
| --- | --- | --- | --- |
| Unigram (Count) | 0.57 | 0.80 | Good baseline |
| Unigram (TF-IDF) | 0.61 | 0.82 | Best overall |
| Char 5-gram (Count/TF-IDF) | 0.60 | 0.81 | Robust to typos, slightly below word features |
| Word 2/3-gram (Count/TF-IDF) | 0.58-0.59 | 0.81 | Phrase info adds noise on small corpora |
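
The feature configurations compared above could be expressed with scikit-learn vectorizers roughly as follows. This is a sketch under assumptions: it reads "Word 2/3-gram" as `ngram_range=(2, 3)` and leaves every other parameter at its default, which may differ from the notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One vectorizer per configuration; parameter choices are assumptions.
configs = {
    "Unigram (Count)": CountVectorizer(ngram_range=(1, 1)),
    "Unigram (TF-IDF)": TfidfVectorizer(ngram_range=(1, 1)),
    "Char 5-gram (Count)": CountVectorizer(analyzer="char", ngram_range=(5, 5)),
    "Char 5-gram (TF-IDF)": TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
    "Word 2/3-gram (Count)": CountVectorizer(ngram_range=(2, 3)),
    "Word 2/3-gram (TF-IDF)": TfidfVectorizer(ngram_range=(2, 3)),
}

# Two toy documents, only to show how the feature space changes per config.
docs = ["helpful staff and spotless room", "central location, helpful staff"]
shapes = {name: vec.fit_transform(docs).shape for name, vec in configs.items()}
for name, shape in shapes.items():
    print(f"{name}: {shape[1]} features")
```

Count and TF-IDF variants of the same n-gram setting share a vocabulary and differ only in how each term is weighted.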

Figure 4 - Overall vs Aspect-Specific Predictor

Interpretation

| Aspect | Aspect-Specific Acc | Overall-Trained on Aspect | Winner |
| --- | --- | --- | --- |
| Value | 0.48 | 0.48 | Tie |
| Rooms | 0.49 | 0.51 | Overall |
| Service | 0.58 | 0.57 | Aspect |
| Location | 0.56 | 0.48 | Aspect |
| Cleanliness | 0.56 | 0.52 | Aspect |
| Front Desk | 0.53 | 0.49 | Aspect |
| Business Service | 0.41 | 0.42 | Overall |

  1. Service, Location, Cleanliness and Front Desk benefit from aspect-specific models (Service only marginally, 0.58 vs 0.57), because these aspects have strong lexical cues (“helpful staff”, “spotless room”, “central location”).
  2. Rooms, Value and Business Service are harder because their vocabulary overlaps with other aspects, so the overall models generalize as well or better.
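
The winner column can be re-derived from the reported accuracies with a short tally. The numbers are copied from the table above; the helper `winner` is a hypothetical name used only for this illustration.

```python
from collections import Counter

# (aspect-specific accuracy, overall-trained accuracy) as reported above.
results = {
    "Value": (0.48, 0.48),
    "Rooms": (0.49, 0.51),
    "Service": (0.58, 0.57),
    "Location": (0.56, 0.48),
    "Cleanliness": (0.56, 0.52),
    "Front Desk": (0.53, 0.49),
    "Business Service": (0.41, 0.42),
}


def winner(aspect_acc: float, overall_acc: float, tol: float = 1e-9) -> str:
    """Name the better predictor, treating near-equal accuracies as a tie."""
    if abs(aspect_acc - overall_acc) <= tol:
        return "Tie"
    return "Aspect" if aspect_acc > overall_acc else "Overall"


winners = {aspect: winner(a, o) for aspect, (a, o) in results.items()}
tally = Counter(winners.values())
print(dict(tally))
```

The tally gives four wins for the aspect-specific models, two for the overall model and one tie.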

Final Verdict

We're satisfied that there is sufficient experimental evidence that aspect-specific training gives better results than overall training for a majority of the aspects (four of seven, with one tie). This suggests that training the predictor with aspect-based objectives allowed it to learn aspect-specific features.
