doyancha/Youtube_Data_Analysis

LinkedIn GitHub Email Live App

Streamlit Python Pandas Plotly Seaborn


⚡ THE CORE QUESTION

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║        How does YouTube content convert VISIBILITY into ENGAGEMENT       ║
║                         and AUDIENCE RESPONSE?                           ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝
  

Executive Summary

This project transforms large-scale YouTube trending and comment data into a polished Streamlit analytics experience focused on one core question:

How does YouTube content convert visibility into engagement and audience response?

The app combines engagement metrics, creator/category comparisons, correlation analysis, sentiment evaluation, and word cloud exploration to provide both strategic and exploratory insight across the YouTube ecosystem.


📌 Table of Contents

| # | Section |
|---|---|
| 01 | 🎯 Project Intention & Goals |
| 02 | 📊 Dataset at a Glance |
| 03 | 🔬 Analysis Modules |
| 04 | 💡 Key Findings |
| 05 | 🧠 Technical Pipeline |
| 06 | 📁 Repository Structure |
| 07 | 🚀 Run Locally |
| 08 | 📸 Dashboard Preview |
| 09 | 👤 About the Author |
| 10 | 📚 Suggested Resources |

🎯 Project Intention & Goals

This project was built to go beyond surface-level metrics, treating YouTube's trending data not as a curiosity but as a mirror into how digital content earns and sustains public attention.

🔭 What This Project Sets Out To Do

  • 📌 Decode engagement dynamics: understand whether a video truly earns its views or just stumbles into them
  • 📌 Profile creator momentum: identify channels that trend repeatedly versus those with one-off spikes
  • 📌 Measure category efficiency: determine which content categories convert visibility into actual engagement most effectively
  • 📌 Analyze audience sentiment: go beyond likes/dislikes to understand the emotional tone of comment sections at scale
  • 📌 Explore language signals: investigate whether structural cues in video titles (like punctuation) correlate with stronger audience reactions
  • 📌 Build a production-grade dashboard: not just a notebook, but a fully deployed, stakeholder-ready analytics application
  • 📌 Demonstrate end-to-end data storytelling: from raw CSV ingestion to interactive visual narrative

🧩 Business Questions Addressed

  • Which content categories drive the strongest like rates?
  • Which creators sustain repeated trending presence over time?
  • How strongly are reach and approval correlated, and where do they diverge?
  • What does comment sentiment reveal about audience reception beyond raw counts?
  • Do title punctuation patterns influence engagement metrics?
---

📊 Dataset at a Glance

| 📹 Metric | 📈 Value |
|---|---|
| Trending Video Records | 679,050 |
| Unique Channels | 48,183 |
| Comment Records | 691,400 |
| Content Categories | 17 |
| Total Views | 833.11 Billion |
| Total Likes | 23.46 Billion |
| Total Comments | 2.63 Billion |
| Views ↔ Likes Correlation | 0.78 |
| Countries Covered | Multiple (US, UK, CA, RU & more) |

🗂️ Data Sources Used

| File | Purpose |
|---|---|
| full_df.csv | Primary enriched video dataset (views, likes, dislikes, rates, punc_count) |
| comments_data.csv | Comment-level text for sentiment & word cloud analysis |
| *_category_id.json | Maps numeric category IDs to human-readable names |
| data/youtube_videos.parquet | Deployment-optimized video data (compressed) |
| data/comments_sentiment.parquet | Pre-scored comments for fast cloud loading |

πŸ”¬ Analysis Modules

The notebook is organized into 11 distinct analytical sections, each building on the previous:


🧹 Module 1: Data Ingestion & Cleaning

Foundation before insight.

  • Multi-country CSV loading with encoding handling (utf-8, latin-1, ISO-8859-1; in Python, latin-1 and ISO-8859-1 name the same codec)
  • Null detection and removal via dropna()
  • Deduplication across merged country datasets
  • Export pipeline: .csv → .json → MySQL DB via SQLAlchemy
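
The notebook's loader is not reproduced in this README. As a rough sketch of the encoding-handling idea, the snippet below tries each candidate encoding in turn and keeps the first one that decodes the file cleanly. The helper name `read_csv_rows` and the use of the standard-library `csv` module (instead of pandas) are assumptions made for illustration.

```python
import csv


def read_csv_rows(path, encodings=("utf-8", "latin-1")):
    """Return (rows, encoding) using the first encoding that decodes the file.

    Hypothetical helper. latin-1 maps every byte to a character, so it never
    raises UnicodeDecodeError and acts as a last-resort fallback.
    """
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as fh:
                # list() forces the whole file to be decoded inside the try block
                return list(csv.DictReader(fh)), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")
```

Because latin-1 accepts any byte sequence, it belongs at the end of the list; putting it first would hide genuine utf-8 files behind garbled characters.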
---

💬 Module 2: Sentiment Analysis

What does the audience actually feel?

  • NLTK VADER (Valence Aware Dictionary and sEntiment Reasoner) for comment scoring
  • Compound score range: −1.0 (most negative) → +1.0 (most positive)
  • Classification thresholds:
    • Positive: score ≥ 0.5
    • Negative: score ≤ −0.5
  • Applied to 691,400+ comment records
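from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon') once

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    # Return VADER's compound score, a float in [-1.0, 1.0]
    score = analyzer.polarity_scores(str(text))
    return score['compound']

df['sentiment_score'] = df['comment_text'].apply(get_sentiment)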
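
The thresholds listed above can be applied to the compound score with a small labelling function. This is only a sketch: the name `label_sentiment` is hypothetical, and the "neutral" bucket for in-between scores is an assumption, since this README defines only the positive and negative cut-offs.

```python
def label_sentiment(score, pos_threshold=0.5, neg_threshold=-0.5):
    """Map a VADER compound score to a label.

    Assumption: scores strictly between the two thresholds are called
    "neutral"; the README does not name that bucket.
    """
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


print([label_sentiment(s) for s in (0.8, 0.5, 0.1, -0.5, -0.9)])
```

With a scored DataFrame, the same function could be applied column-wise, for example `df['sentiment_score'].apply(label_sentiment)`.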


☁️ Module 3 β€” Word Cloud Generation

What language defines positive vs. negative engagement?

  • Separated corpora: positive comments vs. negative comments
  • Stop-word filtering using wordcloud.STOPWORDS
  • Side-by-side 18×7 matplotlib visualization
  • Reveals dominant vocabulary patterns in audience reactions
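
A word cloud is a picture of a word-frequency table, so the counting step can be illustrated without any plotting. The sketch below uses a tiny stand-in stop-word set and made-up comments; the notebook itself filters with wordcloud.STOPWORDS, and the helper name `top_terms` is hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stop-word set (assumption); the notebook uses wordcloud.STOPWORDS.
STOPWORDS = {"the", "a", "is", "this", "and", "i", "it", "so"}


def top_terms(comments, n=5, stopwords=STOPWORDS):
    """Count the most frequent non-stop-words across a list of comments."""
    counts = Counter()
    for text in comments:
        words = re.findall(r"[a-z']+", str(text).lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Made-up example corpora, split by sentiment label
positive = ["I love this song", "love the video, so good"]
negative = ["this is terrible", "terrible audio and bad editing"]
print(top_terms(positive))
print(top_terms(negative))
```

Keeping the two corpora separate before counting is what lets the positive and negative clouds show different vocabularies.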
---

😄 Module 4: Emoji Frequency Analysis

The unspoken language of YouTube comments.

  • Emoji extraction using the emoji library's emoji_list() function
  • Top-10 most-used emojis ranked by frequency
  • Interactive Plotly bar chart for visual exploration
  • Provides cultural and emotional context beyond text
---

🗺️ Module 5: Multi-Country Data Merge

Scale before depth.

  • Iterates all *videos.csv files from multiple country datasets
  • Unified full_df DataFrame with consistent schema
  • Shape validation post-merge
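
The merge step can be sketched with pandas as follows. The function name `merge_country_files` is hypothetical, and the sketch assumes every country file shares the same columns; the notebook's actual loop may differ.

```python
from pathlib import Path

import pandas as pd


def merge_country_files(folder):
    """Concatenate every *videos.csv in `folder` and drop exact duplicate rows."""
    paths = sorted(Path(folder).glob("*videos.csv"))
    frames = [pd.read_csv(path) for path in paths]
    # pd.concat raises ValueError if no files matched, which surfaces a wrong path early
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates(ignore_index=True)
```

After the call, printing `merged.shape` is the quick post-merge check: the row count should be no larger than the sum of the per-country row counts.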
---

🏷️ Module 6: Category Enrichment

Numbers mean nothing without context.

  • Parsed _category_id.json to build cat_dict = {id: name}
  • Mapped numeric IDs to readable labels (e.g., 10 → "Music")
  • Seaborn strip plot of likes distribution per category
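
Building the lookup can be sketched in a few lines. The JSON layout used here (an `items` list whose entries carry `id` and `snippet.title`) is an assumption about the category files, and `build_cat_dict` is a hypothetical helper name.

```python
import json


def build_cat_dict(raw_json):
    """Build {category_id: name} from a category JSON string.

    Assumed layout (not shown in this README):
    {"items": [{"id": "10", "snippet": {"title": "Music"}}, ...]}
    """
    payload = json.loads(raw_json)
    return {int(item["id"]): item["snippet"]["title"] for item in payload["items"]}


sample = (
    '{"items": ['
    '{"id": "10", "snippet": {"title": "Music"}}, '
    '{"id": "26", "snippet": {"title": "Howto & Style"}}]}'
)
cat_dict = build_cat_dict(sample)
print(cat_dict[10])  # Music
```

Once the dictionary exists, a pandas `Series.map(cat_dict)` call on the numeric ID column would attach the readable names; the exact column name in full_df is not stated here.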
---

📐 Module 7: Engagement Rate Engineering

Raw counts lie. Rates tell the truth.

Three derived metrics computed as a percentage of views:

Metric Formulas

like_rate          = (likes / views) × 100
dislike_rate       = (dislikes / views) × 100
comment_count_rate = (comment_count / views) × 100
  

  • 1×3 subplot grid showing rate distributions per category
  • Reveals hidden performers vs. inflated-view content
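
In pandas, the three formulas translate directly into column arithmetic. The sketch below runs on a made-up two-row frame; treating zero-view rows as missing is an assumption added here to avoid division by zero, not something the README specifies.

```python
import pandas as pd


def add_engagement_rates(df):
    """Return a copy of `df` with like, dislike and comment rates as % of views."""
    out = df.copy()
    # Assumption: zero-view rows become NaN so the division does not yield inf
    views = out["views"].where(out["views"] > 0)
    out["like_rate"] = out["likes"] / views * 100
    out["dislike_rate"] = out["dislikes"] / views * 100
    out["comment_count_rate"] = out["comment_count"] / views * 100
    return out


# Made-up rows: one normal video and one with zero views
demo = pd.DataFrame(
    {"views": [1000, 0], "likes": [50, 3], "dislikes": [5, 0], "comment_count": [20, 1]}
)
rates = add_engagement_rates(demo)
print(rates[["like_rate", "dislike_rate", "comment_count_rate"]])
```

Rates make a 1,000-view video with 50 likes (5%) directly comparable to a 10-million-view video, which raw counts cannot do.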
---

🔗 Module 8: Correlation Analysis

How tightly do these signals move together?

  • Pearson correlation matrix: views, likes, dislikes
  • Seaborn regplot for views vs likes with regression line
  • Annotated heatmap for full numeric intuition
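
As a minimal illustration of the correlation matrix, the snippet below computes Pearson coefficients on a small made-up frame. The numbers are toy values chosen for the example and are unrelated to the 0.78 reported for the real dataset.

```python
import pandas as pd

# Toy values (assumption), not taken from the real dataset
demo = pd.DataFrame(
    {
        "views": [100, 200, 300, 400],
        "likes": [10, 22, 28, 41],
        "dislikes": [4, 3, 2, 1],
    }
)

# Pearson correlation matrix over the three count columns
corr = demo[["views", "likes", "dislikes"]].corr(method="pearson")
print(corr.round(2))
```

A matrix like `corr` is what an annotated heatmap displays, for example via `seaborn.heatmap(corr, annot=True)`.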

---

📺 Module 9: Channel Trending Analysis

Who dominates the trending tab, and how often?

  • value_counts() on channel_title across all records
  • Top 20 most-trending channels ranked
  • Interactive Plotly bar chart with color gradient by count
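
The ranking step amounts to one chained pandas call. The sketch below uses a toy frame with made-up channel names standing in for full_df.

```python
import pandas as pd

# Toy stand-in for full_df; the channel names are made up
demo = pd.DataFrame({"channel_title": ["A", "B", "A", "C", "A", "B"]})

# Count trending records per channel, most frequent first, and keep the top 20
top_channels = demo["channel_title"].value_counts().head(20)
print(top_channels)
```

Each row in the trending data is one appearance, so the count measures how often a channel trended rather than how many distinct videos it published.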

---

✏️ Module 10: Punctuation Analysis

Does expressive formatting drive stronger reactions?

import string

def punctuation_count(text):
    # Count the characters in the title that belong to string.punctuation
    return len([char for char in str(text) if char in string.punctuation])

full_df['punc_count'] = full_df['title'].apply(punctuation_count)


  • 2×2 subplot grid: punc_count vs views, likes, dislikes, comment_count
  • Treated as a behavioral signal, not a primary driver
---

💾 Module 11: Export & Deployment Packaging

From notebook to production.

  • build_deployment_data.py script packages raw data into compressed Parquet files
  • Eliminates need to upload 600MB+ CSV to cloud
  • Streamlit app detects data/*.parquet and loads from there automatically
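
The detection behaviour can be pictured as a simple preference order: use the Parquet file when it exists, otherwise fall back to the CSV. The helper below is a hypothetical sketch; the real logic in app.py is not shown in this README and may differ.

```python
from pathlib import Path


def pick_video_source(root="."):
    """Prefer the packaged Parquet file; fall back to the raw CSV.

    Hypothetical helper illustrating the preference order only.
    """
    root = Path(root)
    parquet = root / "data" / "youtube_videos.parquet"
    csv_file = root / "full_df.csv"
    if parquet.exists():
        return parquet
    if csv_file.exists():
        return csv_file
    raise FileNotFoundError("no video dataset found; see 'Run Locally'")
```

The returned path could then be handed to `pandas.read_parquet` or `pandas.read_csv` depending on its suffix.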
---

💡 Key Findings

Here's what the data actually says, straight from the analysis.

🔑 Finding 1: Views & Likes Move Together, But Not Perfectly

The correlation between views and likes is 0.78, which is strong but not deterministic. Some videos accumulate massive view counts without proportional likes, suggesting passive viewership. The best-performing content earns active approval, not just passive impressions.

🔑 Finding 2: Category Efficiency Matters More Than Scale

Howto & Style leads all categories in average like rate. Content that teaches tends to generate more deliberate, appreciative engagement, even when raw view counts are modest compared to Entertainment or Music.

🔑 Finding 3: Trending Success Is Not Random

The channel with the highest recurring trending count was The Late Show with Stephen Colbert, appearing 710 times in the dataset. This points to durable creator-level momentum rather than viral luck, and suggests that a consistent content cadence compounds into staying power on the trending tab.

🔑 Finding 4: Creator Brand Drives Raw Reach

NickyJamTV led in total accumulated views. At the top of the distribution, raw reach appears to be driven mainly by creator brand strength rather than by category or title optimization.

🔑 Finding 5: Sentiment Adds a Layer Beyond Raw Metrics

Positive comments are dominated by words of admiration and entertainment. Negative comments cluster around criticism and controversy. High like counts alone can mask a divided comment section; sentiment analysis is the corrective lens.

🔑 Finding 6: Title Punctuation as a Behavioral Signal

Videos with moderate punctuation in titles show marginally higher engagement across all metrics. However, this is a supporting signal: expressive formatting may correlate with a certain content personality type rather than directly causing engagement.

---

🧠 Technical Pipeline

Raw CSVs (Multi-Country)
        │
        ▼
   Data Cleaning ──────────────────────────────────────────────┐
   (null removal, dedup, encoding handling)                    │
        │                                                      │
        ▼                                                      ▼
Feature Engineering                                    Comment Processing
(like_rate, dislike_rate,                           (VADER sentiment scoring,
 comment_count_rate, punc_count,                     word cloud generation,
 category_name mapping)                              emoji frequency analysis)
        │                                                      │
        └───────────────────┬──────────────────────────────────┘
                            │
                            ▼
                   Analysis & Visualization
              (Seaborn, Matplotlib, Plotly)
                            │
                            ▼
                  Export & Packaging
             (CSV, JSON, MySQL, Parquet)
                            │
                            ▼
               Streamlit Dashboard (6 Pages)
              youtube-data-analysis.streamlit.app
  

🛠️ Tech Stack

| Layer | Tools |
|---|---|
| Data Wrangling | pandas, numpy |
| NLP / Sentiment | nltk (VADER), wordcloud |
| Emoji Analysis | emoji, collections.Counter |
| Visualization | matplotlib, seaborn, plotly.express |
| Dashboard | streamlit |
| Database | sqlalchemy, pymysql (MySQL) |
| Deployment Data | pyarrow (Parquet) |
| String Processing | string, os |

📁 Repository Structure

Youtube_Data_Analysis/
│
├── 📓 youtube_data_analysis.ipynb   ← Main analysis notebook (225 cells)
├── 🖥️ app.py                        ← Streamlit dashboard (6 pages)
├── ⚙️ build_deployment_data.py      ← Parquet packaging script
├── 📄 comments_data.csv             ← Comment records (691K rows)
├── 📄 full_df.csv                   ← Full cleaned and enriched dataset
├── 📄 requirements.txt              ← Python dependencies
│
├── 📂 data/
│   ├── youtube_videos.parquet       ← Compressed video dataset
│   └── comments_sentiment.parquet   ← Pre-scored comment dataset
│
├── 📂 assets/
│   └── img-1.png  → img-6.png       ← Dashboard screenshots
│
└── 🔗 Resources Link               ← Additional databases, full_df.csv (Google Drive link)



⚠️ Note: full_df.csv is not committed to this repo (too large for GitHub). Generate it by running the notebook, or build from the packaged Parquet files.


🚀 Run Locally

Step 1: Clone the Repository

git clone https://github.com/doyancha/Youtube_Data_Analysis.git
cd Youtube_Data_Analysis

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Prepare the Data

Option A: If you already have full_df.csv, place it in the project root.

Option B: Regenerate it from the notebook. Run all cells in youtube_data_analysis.ipynb, then export:

full_df.to_csv("full_df.csv", index=False)

Option C: Build compressed Parquet files for cloud deployment:

python build_deployment_data.py

Step 4: Launch the Dashboard

streamlit run app.py

The app will open at http://localhost:8501 🎉

⚡ Requirements

streamlit==1.53.1
pandas==2.3.3
numpy==2.4.1
matplotlib==3.10.8
seaborn==0.13.2
plotly==6.5.2
nltk==3.9.2
wordcloud==1.9.6
pyarrow==23.0.0
---

Dashboard Preview

The repository includes exported dashboard screenshots in the assets folder so visitors can preview the interface before running the app locally.


The dashboard uses a dark neon-glassmorphism design language built to feel like a modern BI product, not a notebook export.

| Page | Description |
|---|---|
| 🏠 Overview | Executive KPIs, views/likes scatter, trending channel bar chart, correlation heatmap |
| 📈 Content Patterns | Category like rates, reach-to-engagement behavior, punctuation impact |
| 🔍 Channel & Category Explorer | Filterable KPI strip, creator comparisons, filtered scatter |
| 💬 Audience Sentiment | Sentiment mix donut, score distribution, punctuation vs. reaction |
| ☁️ Word Clouds | Positive/negative word clouds, most frequent terms |
| 🏁 Final Takeaway | Strategic recommendation cards for creators and analysts |

👤 About the Author

╔═══════════════════════════════════════════════════════════╗
║                                                           ║
║                 MIR SHAHADUT HOSSAIN                      ║
║            Data Analyst | Streamlit Developer             ║
║                                                           ║
║      Turning raw data into decisions, one dashboard       ║
║                       at a time.                          ║
║                                                           ║
╚═══════════════════════════════════════════════════════════╝
  

LinkedIn GitHub Email Live App


📚 Suggested Resources

📖 Documentation & Libraries Used

| Resource | Link |
|---|---|
| Streamlit Docs | docs.streamlit.io |
| Pandas Documentation | pandas.pydata.org/docs |
| Plotly Express | plotly.com/python/plotly-express |
| NLTK VADER Sentiment | nltk.org/api/nltk.sentiment.vader |
| WordCloud Library | amueller.github.io/word_cloud |
| Seaborn Gallery | seaborn.pydata.org/examples |
| emoji (PyPI) | pypi.org/project/emoji |
| SQLAlchemy | docs.sqlalchemy.org |

📂 Datasets

| Resource | Link |
|---|---|
| Kaggle: YouTube Trending Videos | kaggle.com/datasets/datasnaek/youtube-new |
| Kaggle: YouTube Comments | kaggle.com/datasets/datasnaek/youtube |

🎓 Learning References

| Topic | Link |
|---|---|
| Sentiment Analysis with VADER | medium.com (VADER guide) |
| Python Encoding Guide | docs.python.org/encoding |
| Streamlit App Deployment | docs.streamlit.io/streamlit-community-cloud |
| Parquet with Pandas | pandas.pydata.org/parquet |
| Plotly for Dashboards | plotly.com/python/getting-started |
| Data Storytelling Principles | storytellingwithdata.com |

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Built with ❤️ by Mir Shahadut Hossain | 2025-2026
  Data Analyst · Python Developer · Streamlit Builder
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  

About

An interactive YouTube analytics solution built with Streamlit to evaluate content reach, engagement efficiency, creator performance, audience sentiment, and text-based comment patterns using large-scale trending video data.
