Skip to content

SaneethSunkari/google-playstore-product-analytics

Repository files navigation

Google Play Store Product Analytics

From messy app metadata and review text to product insights, SQL analysis, a stakeholder-ready dashboard, and a decision memo.

Python SQL scikit-learn Dashboard Apps Categories Avg Rating

Open Dashboard Read Memo View SQL


Dashboard Preview

TL;DR

This project asks a product question that matters:

What drives app quality perception, and where do star ratings fail to reflect what users are actually saying?

I cleaned raw Google Play Store app and review data, built a reusable SQLite layer, wrote production-style SQL, ran segmentation and sentiment analysis in Python, used a lightweight ML model for interpretation, and packaged the outputs into a dashboard and memo.

At A Glance

Area What This Repo Delivers
Business focus Product quality perception, sentiment gaps, update cadence, install-tier performance
Data App metadata + review sentiment from the Google Play Store dataset
Analytics stack Python, pandas, SQL, SQLite, seaborn, scikit-learn
Outputs Executed notebooks, SQL queries, processed data, dashboard, memo, figures
Audience Hiring managers, recruiters, PMs, analytics interviewers

Headline Results

Finding Result Why It Matters
Update cadence Apps updated in the last 3 months averaged 4.25 stars vs 4.05 for apps untouched for 2+ years Freshness is a strong quality signal
Scale vs quality Apps in the 10M+ install tier averaged 4.32 stars vs 4.10 for apps below 100K installs Large products tend to sustain better rating health
Ratings vs sentiment Game showed the biggest mismatch: 4.40 average rating but only 0.058 average sentiment Ratings alone can hide product friction
Portfolio health Only 30.5% of apps landed in Star Apps; 14.8% landed in Declining Apps Portfolio segmentation helps prioritize intervention

Why This Project Stands Out

  • It goes end to end: raw data, cleaning, SQL, Python, ML interpretation, dashboard, memo, and repo packaging.
  • It is framed around a real product question instead of generic exploratory analysis.
  • It keeps the ML honest: the model supports interpretation instead of pretending to be the whole project.
  • It finishes with decision-ready outputs, not just notebooks.

Project Evidence

  • Changelog - how the project evolved from cleaning to SQL, modeling, dashboarding, and portfolio packaging
  • Roadmap - what would turn this static case study into a stronger analytics product
  • Validation and boundaries - what the project proves and what it does not claim
  • Product memo - stakeholder-ready recommendations and limitations

Workflow

flowchart LR
    A["Raw App Data"] --> B["Cleaning + Standardization"]
    C["Raw Review Data"] --> B
    B --> D["SQLite Warehouse"]
    D --> E["SQL Analysis"]
    D --> F["Python Analysis"]
    F --> G["Segmentation + Sentiment + Modeling"]
    E --> H["Dashboard Assets"]
    G --> H
    G --> I["Product Memo"]
Loading

What I Built

Data Engineering

  • cleaned raw fields like Installs, Size, Price, and Reviews
  • removed the corrupted source row and dropped unusable ratings
  • standardized categorical fields like Type and Content Rating
  • deduplicated app-level records and stored clean tables in SQLite
  • generated reusable processed datasets in data/processed/

SQL Analysis

The SQL layer answers the core descriptive questions:

  • Which categories perform best on rating and review volume?
  • How does quality shift by install tier?
  • How do free and paid apps compare within category?
  • Which apps dominate review volume inside each category?
  • How closely do app ratings align with review sentiment?

Core query files:

Python Analysis

Python was used to go beyond descriptive SQL:

  • RFM-style segmentation: recency, review volume, and installs
  • Sentiment analysis: polarity distribution, category-level negatives, review text patterns
  • Recency analysis: how update freshness relates to ratings and installs
  • Model interpretation: a RandomForestClassifier predicting whether an app clears the 4.0+ rating threshold

Decision Support Outputs

Dashboard + Memo

Deliverable Purpose
Local Dashboard Interactive summary for category intelligence, sentiment, and app health
Product Memo Executive summary with findings, recommendations, and limitations
Dashboard Preview Repo-friendly static image for GitHub

Machine Learning, Kept Honest

The ML component is intentionally lightweight. The goal was not to build a production model. The goal was to understand which features were most associated with apps clearing a 4.0 rating threshold.

Target

  • 1 = app rating >= 4.0
  • 0 = app rating < 4.0

Features

  • category
  • installs
  • reviews
  • size
  • type
  • content rating
  • average sentiment

Reported metrics

  • Precision: 0.844
  • Recall: 0.793
  • F1: 0.818

Takeaways

  • review volume carried the most predictive weight
  • app size and category context also mattered
  • sentiment helped, but it worked best as part of the broader feature set

Repo Tour

Path What You Will Find
data/processed/ Cleaned datasets, SQL outputs, modeling summaries, and final metrics
notebooks/ Executed notebooks for cleaning, SQL, Python analysis, and modeling
sql/ Reusable SQL query files
figures/ Charts used across the dashboard and memo
dashboard/ Local HTML dashboard and preview assets
memo/ Product memo in Markdown and PDF
scripts/ Rebuild pipeline and notebook generator

Quick Start

1. Install dependencies

python3 -m pip install -r requirements.txt

2. Add raw data

Place these files in data/raw/:

  • googleplaystore.csv
  • googleplaystore_user_reviews.csv

Dataset source: Kaggle - Google Play Store Apps

3. Run the full build

python3 scripts/project_pipeline.py

4. Regenerate notebooks if needed

python3 scripts/generate_notebooks.py

Deliverables

Limitations
  • This is observational data, so the project highlights patterns rather than proving causality.
  • Installs and reviews are proxies for engagement, not true event-level behavior.
  • Review sentiment is useful, but some apps have richer review coverage than others.
  • The repo ships with a polished local dashboard instead of a public Tableau link.
Next Steps
  • add event-level retention and funnel data
  • compare pre/post update behavior instead of using freshness as a proxy
  • track sentiment and ratings over time instead of treating them as static summaries
  • layer in monetization or revenue data for stronger business prioritization

About

End-to-end product analytics case study using Google Play metadata, review sentiment, SQL, ML interpretation, dashboarding, and a decision memo.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors