GitHub - doyancha/Youtube_Data_Analysis: An interactive YouTube analytics solution built with Streamlit to evaluate content reach, engagement efficiency, creator performance, audience sentiment, and text-based comment patterns using large-scale trending video data.

⚡ THE CORE QUESTION

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║        How does YouTube content convert VISIBILITY into ENGAGEMENT       ║
║                         and AUDIENCE RESPONSE?                           ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Executive Summary

This project transforms large-scale YouTube trending and comment data into a polished Streamlit analytics experience focused on one core question:

How does YouTube content convert visibility into engagement and audience response?

The app combines engagement metrics, creator/category comparisons, correlation analysis, sentiment evaluation, and word cloud exploration to provide both strategic and exploratory insight across the YouTube ecosystem.

📌 Table of Contents

#	Section
01	🎯 Project Intention & Goals
02	📊 Dataset at a Glance
03	🔬 Analysis Modules
04	💡 Key Findings
05	🧠 Technical Pipeline
06	📁 Repository Structure
07	🚀 Run Locally
08	📸 Dashboard Preview
09	👤 About the Author
10	📚 Suggested Resources

🎯 Project Intention & Goals

This project was built to go beyond surface-level metrics — treating YouTube's trending data not as a curiosity, but as a mirror into how digital content earns and sustains public attention.

🔭 What This Project Sets Out To Do

📌 Decode engagement dynamics: understand whether a video truly earns its views or just stumbles into them
📌 Profile creator momentum: identify channels that trend repeatedly versus those with one-off spikes
📌 Measure category efficiency: determine which content categories convert visibility into actual engagement most effectively
📌 Analyze audience sentiment: go beyond likes/dislikes to understand the emotional tone of comment sections at scale
📌 Explore language signals: investigate whether structural cues in video titles (like punctuation) correlate with stronger audience reactions
📌 Build a production-grade dashboard: not just a notebook, but a fully deployed, stakeholder-ready analytics application
📌 Demonstrate end-to-end data storytelling: from raw CSV ingestion to interactive visual narrative

🧩 Business Questions Addressed

Which content categories drive the strongest like rates?
Which creators sustain repeated trending presence over time?
How strongly are reach and approval correlated, and where do they diverge?
What does comment sentiment reveal about audience reception beyond raw counts?
Do title punctuation patterns influence engagement metrics?

---

📊 Dataset at a Glance

📹 Metric	📈 Value
Trending Video Records	679,050
Unique Channels	48,183
Comment Records	691,400
Content Categories	17
Total Views	833.11 Billion
Total Likes	23.46 Billion
Total Comments	2.63 Billion
Views ↔ Likes Correlation	0.78
Countries Covered	Multiple (US, UK, CA, RU & more)

🗂️ Data Sources Used

File	Purpose
`full_df.csv`	Primary enriched video dataset (views, likes, dislikes, rates, punc_count)
`comments_data.csv`	Comment-level text for sentiment & word cloud analysis
`*_category_id.json`	Maps numeric category IDs to human-readable names
`data/youtube_videos.parquet`	Deployment-optimized video data (compressed)
`data/comments_sentiment.parquet`	Pre-scored comments for fast cloud loading

🔬 Analysis Modules

The notebook is organized into 11 distinct analytical sections, each building on the previous:

🧹 Module 1 — Data Ingestion & Cleaning

Foundation before insight.

Multi-country CSV loading with encoding handling (utf-8, latin-1, ISO-8859-1)
Null detection and removal via dropna()
Deduplication across merged country datasets
Export pipeline: .csv → .json → MySQL DB via SQLAlchemy
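
The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `load_country_csv` and `clean` and the encoding order are assumptions. Note that in Python `latin-1` and `ISO-8859-1` are aliases of the same codec, and that codec can decode any byte sequence, so it works as a last-resort fallback.

```python
import pandas as pd

# Hypothetical helper: try each encoding in turn until one decodes the file.
ENCODINGS = ("utf-8", "latin-1")  # latin-1 is the same codec as ISO-8859-1

def load_country_csv(path, encodings=ENCODINGS):
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error

def clean(df):
    # Drop rows with missing values, then drop exact duplicate rows.
    return df.dropna().drop_duplicates().reset_index(drop=True)
```

Trying `utf-8` first matters: because `latin-1` never fails, placing it first would silently mis-decode valid UTF-8 files.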

---

💬 Module 2 — Sentiment Analysis

What does the audience actually feel?

NLTK VADER (Valence Aware Dictionary and sEntiment Reasoner) for comment scoring
Compound score range: −1.0 (most negative) → +1.0 (most positive)
Classification thresholds:
- Positive: score ≥ 0.5
- Negative: score ≤ −0.5
Applied to 691,400+ comment records

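from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()  # needs nltk.download('vader_lexicon') once

def get_sentiment(text):
    # Return the VADER compound score in [-1.0, 1.0]
    score = analyzer.polarity_scores(str(text))
    return score['compound']
df['sentiment_score'] = df['comment_text'].apply(get_sentiment)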
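
The classification thresholds listed above can be applied to the compound score with a small helper. This is a sketch under assumptions: the function name `label_sentiment` is hypothetical, and the "neutral" label for scores between the two thresholds is an assumption, since only the positive and negative classes are named here.

```python
def label_sentiment(score, pos_threshold=0.5, neg_threshold=-0.5):
    """Map a VADER compound score to a label using the thresholds above."""
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"  # assumption: the README does not name this middle class

labels = [label_sentiment(s) for s in (0.82, -0.61, 0.10)]
```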

☁️ Module 3 — Word Cloud Generation

What language defines positive vs. negative engagement?

Separated corpora: positive comments vs. negative comments
Stop-word filtering using wordcloud.STOPWORDS
Side-by-side 18×7 matplotlib visualization
Reveals dominant vocabulary patterns in audience reactions
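
As a rough sketch of the idea behind the two corpora, the term counts that a word cloud visualizes can be computed with the standard library alone. The tiny `STOP_WORDS` set, the sample comments, and the function name `term_frequencies` are illustrative assumptions; the notebook itself relies on `wordcloud.STOPWORDS`.

```python
import re
from collections import Counter

# Tiny stand-in stop-word set; the notebook uses wordcloud.STOPWORDS instead.
STOP_WORDS = {"the", "a", "is", "this", "so", "and", "i"}

def term_frequencies(comments, stop_words=STOP_WORDS):
    counts = Counter()
    for text in comments:
        # Lowercase and keep alphabetic tokens (apostrophes allowed).
        words = re.findall(r"[a-z']+", str(text).lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

positive = ["This is so good", "Good song, love this"]
negative = ["This is the worst video"]
pos_freq = term_frequencies(positive)
neg_freq = term_frequencies(negative)
```

A frequency mapping like `pos_freq` could then be handed to `WordCloud().generate_from_frequencies(...)` to render the image.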

---

😄 Module 4 — Emoji Frequency Analysis

The unspoken language of YouTube comments.

Emoji extraction using the emoji library's emoji_list() function
Top-10 most-used emojis ranked by frequency
Interactive Plotly bar chart for visual exploration
Provides cultural and emotional context beyond text

---

🗺️ Module 5 — Multi-Country Data Merge

Scale before depth.

Iterates all *videos.csv files from multiple country datasets
Unified full_df DataFrame with consistent schema
Shape validation post-merge

---

🏷️ Module 6 — Category Enrichment

Numbers mean nothing without context.

Parsed _category_id.json to build cat_dict = {id: name}
Mapped numeric IDs to readable labels (e.g., 10 → "Music")
Seaborn strip plot of likes distribution per category
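
A minimal sketch of the ID-to-name mapping is shown below. The JSON layout (`items` → `id` and `snippet.title`) is an assumption based on how the Kaggle category files are commonly structured, and `build_cat_dict` is a hypothetical helper name; the "Unknown" fallback for unmapped IDs is likewise an assumption.

```python
import json

# Assumed layout of the *_category_id.json files (not shown in this README).
raw = (
    '{"items": ['
    '{"id": "10", "snippet": {"title": "Music"}}, '
    '{"id": "26", "snippet": {"title": "Howto & Style"}}'
    ']}'
)

def build_cat_dict(payload):
    # IDs are stored as strings in the JSON; convert to int to match the CSV column.
    return {int(item["id"]): item["snippet"]["title"] for item in payload["items"]}

cat_dict = build_cat_dict(json.loads(raw))
category_ids = [10, 26, 99]
category_names = [cat_dict.get(cid, "Unknown") for cid in category_ids]
```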

---

📐 Module 7 — Engagement Rate Engineering

Raw counts lie. Rates tell the truth.

Three derived metrics computed as a percentage of views:

Metric Formulas

like_rate          = (likes / views) × 100
dislike_rate       = (dislikes / views) × 100
comment_count_rate = (comment_count / views) × 100

1×3 subplot grid showing rate distributions per category
Reveals hidden performers vs. inflated-view content
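
The three formulas above translate directly to pandas. The sketch below assumes the column names `views`, `likes`, `dislikes`, and `comment_count`; the zero-view handling is an assumption added for safety and is not described in the notebook summary.

```python
import numpy as np
import pandas as pd

def add_engagement_rates(df):
    out = df.copy()
    # Treat zero views as missing so the division yields NaN instead of inf.
    views = out["views"].replace(0, np.nan)
    out["like_rate"] = out["likes"] / views * 100
    out["dislike_rate"] = out["dislikes"] / views * 100
    out["comment_count_rate"] = out["comment_count"] / views * 100
    return out

# Illustrative rows, not taken from the dataset.
sample = pd.DataFrame({
    "views": [1000, 0],
    "likes": [50, 3],
    "dislikes": [5, 0],
    "comment_count": [10, 1],
})
rates = add_engagement_rates(sample)
```

For the first row, 50 likes on 1,000 views gives a like rate of 5%.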

---

🔗 Module 8 — Correlation Analysis

How tightly do these signals move together?

Pearson correlation matrix: views, likes, dislikes
Seaborn regplot for views vs likes with regression line
Annotated heatmap for full numeric intuition
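
A Pearson correlation matrix over these columns can be computed with `DataFrame.corr`. The values below are made-up sample rows for illustration only; they are not the dataset and will not reproduce the reported 0.78.

```python
import pandas as pd

# Illustrative numbers, not taken from the dataset.
sample = pd.DataFrame({
    "views": [100, 200, 300, 400],
    "likes": [10, 25, 28, 45],
    "dislikes": [1, 3, 2, 5],
})

# Pairwise Pearson correlation between the three numeric columns.
corr = sample[["views", "likes", "dislikes"]].corr(method="pearson")
views_likes_r = corr.loc["views", "likes"]
```

The resulting `corr` frame is the kind of input an annotated heatmap such as `seaborn.heatmap(corr, annot=True)` expects.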

---

📺 Module 9 — Channel Trending Analysis

Who dominates the trending tab — and how often?

value_counts() on channel_title across all records
Top 20 most-trending channels ranked
Interactive Plotly bar chart with color gradient by count
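
The ranking step might look like the sketch below. The sample channel names and the output column name `trending_count` are assumptions for illustration; the notebook ranks the top 20, while this toy example keeps the top 2.

```python
import pandas as pd

# Illustrative rows, not taken from the dataset.
sample = pd.DataFrame({"channel_title": ["A", "B", "A", "C", "A", "B"]})

# Count appearances per channel, keep the most frequent, and
# reshape into a two-column frame that is convenient for plotting.
top_channels = (
    sample["channel_title"]
    .value_counts()
    .head(2)
    .rename_axis("channel_title")
    .reset_index(name="trending_count")
)
```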

---

✏️ Module 10 — Punctuation Analysis

Does expressive formatting drive stronger reactions?

import string

def punctuation_count(text):
    # Count ASCII punctuation characters in a title
    return len([char for char in text if char in string.punctuation])
full_df['punc_count'] = full_df['title'].apply(punctuation_count)

2×2 subplot grid: punc_count vs views, likes, dislikes, comment_count
Treated as a behavioral signal, not a primary driver

---

💾 Module 11 — Export & Deployment Packaging

From notebook to production.

build_deployment_data.py script packages raw data into compressed Parquet files
Eliminates need to upload 600MB+ CSV to cloud
Streamlit app detects data/*.parquet and loads from there automatically
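
The detection logic could be sketched as below. This is an assumption about how such a check might work, not the code in app.py: the function name `pick_video_source` and the fallback to full_df.csv are hypothetical, while the file paths are the ones listed in this README.

```python
from pathlib import Path

def pick_video_source(root="."):
    """Return the preferred data file: packaged Parquet first, then the raw CSV."""
    root = Path(root)
    parquet = root / "data" / "youtube_videos.parquet"
    csv = root / "full_df.csv"
    if parquet.exists():
        return parquet
    if csv.exists():
        return csv
    raise FileNotFoundError("No video data found in data/ or the project root.")
```

Checking for Parquet first keeps the deployed app fast while still letting a local checkout run from the raw CSV.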

---

💡 Key Findings

Here's what the data actually says — straight from the analysis.

🔑 Finding 1 — Views & Likes Move Together, But Not Perfectly

The correlation between views and likes is 0.78, which is strong but not deterministic. Some videos accumulate massive view counts without proportional likes, suggesting passive viewership. The best-performing content earns active approval, not just passive impressions.

🔑 Finding 2 — Category Efficiency Matters More Than Scale

Howto & Style leads all categories in average Like Rate. This means content that teaches tends to generate more deliberate, appreciative engagement even when raw view counts are modest compared to Entertainment or Music.

🔑 Finding 3 — Trending Success Is Not Random

The channel with the highest recurring trending count was The Late Show with Stephen Colbert, appearing 710 times in the dataset. This points to durable creator-level momentum, not viral luck. Consistent content cadence compounds into algorithmic staying power.

🔑 Finding 4 — Creator Brand Drives Raw Reach

NickyJamTV led in total accumulated views. At the top of the distribution, raw reach is overwhelmingly driven by creator brand strength rather than category or title optimization.

🔑 Finding 5 — Sentiment Adds a Layer Beyond Raw Metrics

Positive comments are dominated by words of admiration and entertainment. Negative comments cluster around criticism and controversy. High like counts alone can mask a divided comment section; sentiment analysis is the corrective lens.

🔑 Finding 6 — Title Punctuation as a Behavioral Signal

Videos with moderate punctuation in titles show marginally higher engagement across all metrics. However, this is only a supporting signal: expressive formatting may correlate with a certain content personality type rather than directly causing engagement.

---

🧠 Technical Pipeline

Raw CSVs (Multi-Country)
        │
        ▼
   Data Cleaning ──────────────────────────────────────────────┐
   (null removal, dedup, encoding handling)                    │
        │                                                      │
        ▼                                                      ▼
Feature Engineering                                    Comment Processing
(like_rate, dislike_rate,                           (VADER sentiment scoring,
 comment_count_rate, punc_count,                     word cloud generation,
 category_name mapping)                              emoji frequency analysis)
        │                                                      │
        └───────────────────┬──────────────────────────────────┘
                            │
                            ▼
                   Analysis & Visualization
              (Seaborn, Matplotlib, Plotly)
                            │
                            ▼
                  Export & Packaging
             (CSV, JSON, MySQL, Parquet)
                            │
                            ▼
               Streamlit Dashboard (6 Pages)
              youtube-data-analysis.streamlit.app

🛠️ Tech Stack

Layer	Tools
Data Wrangling	`pandas`, `numpy`
NLP / Sentiment	`nltk` (VADER), `wordcloud`
Emoji Analysis	`emoji`, `collections.Counter`
Visualization	`matplotlib`, `seaborn`, `plotly.express`
Dashboard	`streamlit`
Database	`sqlalchemy`, `pymysql` (MySQL)
Deployment Data	`pyarrow` (Parquet)
String Processing	`string`, `os`

📁 Repository Structure

Youtube_Data_Analysis/
│
├── 📓 youtube_data_analysis.ipynb   ← Main analysis notebook (225 cells)
├── 🖥️ app.py                        ← Streamlit dashboard (6 pages)
├── ⚙️ build_deployment_data.py      ← Parquet packaging script
├── 📄 comments_data.csv             ← Comment records (691K rows)
├── 📄 full_df.csv                   ← Full cleaned and enriched dataset
├── 📄 requirements.txt              ← Python dependencies
│
├── 📂 data/
│   ├── youtube_videos.parquet       ← Compressed video dataset
│   └── comments_sentiment.parquet   ← Pre-scored comment dataset
│
├── 📂 assets/
│   ├── img-1.png  → img-6.png       ← Dashboard screenshots
│
└── 🔗 Resources Link               ← Additional Databases, full_df.csv (google drive link)

⚠️ Note: full_df.csv is not committed to this repo (too large for GitHub). Generate it by running the notebook, or build from the packaged Parquet files.

🚀 Run Locally

Step 1 — Clone the Repository

git clone https://github.com/doyancha/Youtube_Data_Analysis.git
cd Youtube_Data_Analysis

Step 2 — Install Dependencies

pip install -r requirements.txt

Step 3 — Prepare the Data

Option A: If you already have full_df.csv, place it in the project root.

Option B: Regenerate it from the notebook. Run all cells in youtube_data_analysis.ipynb, then export:

full_df.to_csv("full_df.csv", index=False)

Option C: Build compressed Parquet files for cloud deployment:

python build_deployment_data.py

Step 4 — Launch the Dashboard

streamlit run app.py

The app will open at http://localhost:8501 🎉

⚡ Requirements

streamlit==1.53.1
pandas==2.3.3
numpy==2.4.1
matplotlib==3.10.8
seaborn==0.13.2
plotly==6.5.2
nltk==3.9.2
wordcloud==1.9.6
pyarrow==23.0.0

---

📸 Dashboard Preview

The repository includes exported dashboard screenshots in the assets folder so visitors can preview the interface before running the app locally.

The dashboard uses a dark neon-glassmorphism design language built to feel like a modern BI product, not a notebook export.

Page	Description
🏠 Overview	Executive KPIs, views/likes scatter, trending channel bar chart, correlation heatmap
📈 Content Patterns	Category like rates, reach-to-engagement behavior, punctuation impact
🔍 Channel & Category Explorer	Filterable KPI strip, creator comparisons, filtered scatter
💬 Audience Sentiment	Sentiment mix donut, score distribution, punctuation vs. reaction
☁️ Word Clouds	Positive/negative word clouds, most frequent terms
🏁 Final Takeaway	Strategic recommendation cards for creators and analysts

👤 About the Author

╔═══════════════════════════════════════════════════════════╗
║                                                           ║
║                 MIR SHAHADUT HOSSAIN                      ║
║            Data Analyst | Streamlit Developer             ║
║                                                           ║
║      Turning raw data into decisions, one dashboard       ║
║                       at a time.                          ║
║                                                           ║
╚═══════════════════════════════════════════════════════════╝

📚 Suggested Resources

📖 Documentation & Libraries Used

Resource	Link
Streamlit Docs	docs.streamlit.io
Pandas Documentation	pandas.pydata.org/docs
Plotly Express	plotly.com/python/plotly-express
NLTK VADER Sentiment	nltk.org/api/nltk.sentiment.vader
WordCloud Library	amueller.github.io/word_cloud
Seaborn Gallery	seaborn.pydata.org/examples
emoji (PyPI)	pypi.org/project/emoji
SQLAlchemy	docs.sqlalchemy.org

📂 Datasets

Resource	Link
Kaggle: YouTube Trending Videos	kaggle.com/datasets/datasnaek/youtube-new
Kaggle: YouTube Comments	kaggle.com/datasets/datasnaek/youtube

🎓 Learning References

Topic	Link
Sentiment Analysis with VADER	medium.com — VADER guide
Python Encoding Guide	docs.python.org/encoding
Streamlit App Deployment	docs.streamlit.io/streamlit-community-cloud
Parquet with Pandas	pandas.pydata.org/parquet
Plotly for Dashboards	plotly.com/python/getting-started
Data Storytelling Principles	storytellingwithdata.com

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Built with ❤️ by Mir Shahadut Hossain | 2025-2026
  Data Analyst · Python Developer · Streamlit Builder
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
