Google Play Store Product Analytics

From messy app metadata and review text to product insights, SQL analysis, a stakeholder-ready dashboard, and a decision memo.

TL;DR

This project asks a product question that matters:

What drives app quality perception, and where do star ratings fail to reflect what users are actually saying?

I cleaned raw Google Play Store app and review data, built a reusable SQLite layer, wrote production-style SQL, ran segmentation and sentiment analysis in Python, used a lightweight ML model for interpretation, and packaged the outputs into a dashboard and memo.

At A Glance

Area	What This Repo Delivers
Business focus	Product quality perception, sentiment gaps, update cadence, install-tier performance
Data	App metadata + review sentiment from the Google Play Store dataset
Analytics stack	Python, pandas, SQL, SQLite, seaborn, scikit-learn
Outputs	Executed notebooks, SQL queries, processed data, dashboard, memo, figures
Audience	Hiring managers, recruiters, PMs, analytics interviewers

Headline Results

Finding	Result	Why It Matters
Update cadence	Apps updated in the last 3 months averaged 4.25 stars vs 4.05 for apps untouched for 2+ years	Freshness is a strong quality signal
Scale vs quality	Apps in the 10M+ install tier averaged 4.32 stars vs 4.10 for apps below 100K installs	Large products tend to sustain better rating health
Ratings vs sentiment	Game showed the biggest mismatch: 4.40 average rating but only 0.058 average sentiment	Ratings alone can hide product friction
Portfolio health	Only 30.5% of apps landed in `Star Apps`; 14.8% landed in `Declining Apps`	Portfolio segmentation helps prioritize intervention

Why This Project Stands Out

It goes end to end: raw data, cleaning, SQL, Python, ML interpretation, dashboard, memo, and repo packaging.
It is framed around a real product question instead of generic exploratory analysis.
It keeps the ML honest: the model supports interpretation instead of pretending to be the whole project.
It finishes with decision-ready outputs, not just notebooks.

Project Evidence

Changelog - how the project evolved from cleaning to SQL, modeling, dashboarding, and portfolio packaging
Roadmap - what would turn this static case study into a stronger analytics product
Validation and boundaries - what the project proves and what it does not claim
Product memo - stakeholder-ready recommendations and limitations

Workflow

flowchart LR
    A["Raw App Data"] --> B["Cleaning + Standardization"]
    C["Raw Review Data"] --> B
    B --> D["SQLite Warehouse"]
    D --> E["SQL Analysis"]
    D --> F["Python Analysis"]
    F --> G["Segmentation + Sentiment + Modeling"]
    E --> H["Dashboard Assets"]
    G --> H
    G --> I["Product Memo"]

What I Built

Data Engineering

cleaned raw fields like Installs, Size, Price, and Reviews
removed the corrupted source row and dropped unusable ratings
standardized categorical fields like Type and Content Rating
deduplicated app-level records and stored clean tables in SQLite
generated reusable processed datasets in data/processed/

SQL Analysis

The SQL layer answers the core descriptive questions:

Which categories perform best on rating and review volume?
How does quality shift by install tier?
How do free and paid apps compare within category?
Which apps dominate review volume inside each category?
How closely do app ratings align with review sentiment?

Core query files:

Python Analysis

Python was used to go beyond descriptive SQL:

RFM-style segmentation: recency, review volume, and installs
Sentiment analysis: polarity distribution, category-level negatives, review text patterns
Recency analysis: how update freshness relates to ratings and installs
Model interpretation: a RandomForestClassifier predicting whether an app clears the 4.0+ rating threshold

Decision Support Outputs

interactive local dashboard: dashboard/local_dashboard.html
product memo: memo/product_memo.md
PDF memo: memo/product_memo.pdf
repo preview image: repo_preview.png

Dashboard + Memo

Deliverable	Purpose
Local Dashboard	Interactive summary for category intelligence, sentiment, and app health
Product Memo	Executive summary with findings, recommendations, and limitations
Dashboard Preview	Repo-friendly static image for GitHub

Machine Learning, Kept Honest

The ML component is intentionally lightweight. The goal was not to build a production model. The goal was to understand which features were most associated with apps clearing a 4.0 rating threshold.

Target

1 = app rating >= 4.0
0 = app rating < 4.0

Features

category
installs
reviews
size
type
content rating
average sentiment

Reported metrics

Precision: 0.844
Recall: 0.793
F1: 0.818

Takeaways

review volume carried the most predictive weight
app size and category context also mattered
sentiment helped, but it worked best as part of the broader feature set

Repo Tour

Path	What You Will Find
data/processed/	Cleaned datasets, SQL outputs, modeling summaries, and final metrics
notebooks/	Executed notebooks for cleaning, SQL, Python analysis, and modeling
sql/	Reusable SQL query files
figures/	Charts used across the dashboard and memo
dashboard/	Local HTML dashboard and preview assets
memo/	Product memo in Markdown and PDF
scripts/	Rebuild pipeline and notebook generator

Quick Start

1. Install dependencies

python3 -m pip install -r requirements.txt

2. Add raw data

Place these files in data/raw/:

googleplaystore.csv
googleplaystore_user_reviews.csv

Dataset source: Kaggle - Google Play Store Apps

3. Run the full build

python3 scripts/project_pipeline.py

4. Regenerate notebooks if needed

python3 scripts/generate_notebooks.py

Deliverables

cleaned app dataset: data/processed/apps_cleaned.csv
cleaned review dataset: data/processed/reviews_cleaned.csv
SQLite database: data/processed/google_playstore.db
project metrics: data/processed/project_metrics.json
executed notebooks: notebooks/
chart exports: figures/
memo outputs: memo/

Limitations

This is observational data, so the project highlights patterns rather than proving causality.
Installs and reviews are proxies for engagement, not true event-level behavior.
Review sentiment is useful, but some apps have richer review coverage than others.
The repo ships with a polished local dashboard instead of a public Tableau link.

Next Steps

add event-level retention and funnel data
compare pre/post update behavior instead of using freshness as a proxy
track sentiment and ratings over time instead of treating them as static summaries
layer in monetization or revenue data for stronger business prioritization

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Google Play Store Product Analytics

TL;DR

At A Glance

Headline Results

Why This Project Stands Out

Project Evidence

Workflow

What I Built

Data Engineering

SQL Analysis

Python Analysis

Decision Support Outputs

Dashboard + Memo

Machine Learning, Kept Honest

Repo Tour

Quick Start

1. Install dependencies

2. Add raw data

3. Run the full build

4. Regenerate notebooks if needed

Deliverables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
dashboard		dashboard
data		data
docs		docs
figures		figures
memo		memo
notebooks		notebooks
scripts		scripts
sql		sql
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
README.md		README.md
ROADMAP.md		ROADMAP.md
repo_preview.png		repo_preview.png
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Google Play Store Product Analytics

TL;DR

At A Glance

Headline Results

Why This Project Stands Out

Project Evidence

Workflow

What I Built

Data Engineering

SQL Analysis

Python Analysis

Decision Support Outputs

Dashboard + Memo

Machine Learning, Kept Honest

Repo Tour

Quick Start

1. Install dependencies

2. Add raw data

3. Run the full build

4. Regenerate notebooks if needed

Deliverables

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages