How does YouTube content convert VISIBILITY into ENGAGEMENT and AUDIENCE RESPONSE?
This project transforms large-scale YouTube trending and comment data into a polished Streamlit analytics experience focused on one core question:
How does YouTube content convert visibility into engagement and audience response?
The app combines engagement metrics, creator/category comparisons, correlation analysis, sentiment evaluation, and word cloud exploration to provide both strategic and exploratory insight across the YouTube ecosystem.
This project was built to go beyond surface-level metrics, treating YouTube's trending data not as a curiosity but as a mirror into how digital content earns and sustains public attention.
- Decode engagement dynamics: understand whether a video truly earns its views or just stumbles into them
- Profile creator momentum: identify channels that trend repeatedly versus those with one-off spikes
- Measure category efficiency: determine which content categories convert visibility into actual engagement most effectively
- Analyze audience sentiment: go beyond likes/dislikes to understand the emotional tone of comment sections at scale
- Explore language signals: investigate whether structural cues in video titles (like punctuation) correlate with stronger audience reactions
- Build a production-grade dashboard: not just a notebook, but a fully deployed, stakeholder-ready analytics application
- Demonstrate end-to-end data storytelling: from raw CSV ingestion to interactive visual narrative
- Which content categories drive the strongest like rates?
- Which creators sustain repeated trending presence over time?
- How strongly are reach and approval correlated, and where do they diverge?
- What does comment sentiment reveal about audience reception beyond raw counts?
- Do title punctuation patterns influence engagement metrics?
| Metric | Value |
|---|---|
| Trending Video Records | 679,050 |
| Unique Channels | 48,183 |
| Comment Records | 691,400 |
| Content Categories | 17 |
| Total Views | 833.11 Billion |
| Total Likes | 23.46 Billion |
| Total Comments | 2.63 Billion |
| Views vs. Likes Correlation | 0.78 |
| Countries Covered | Multiple (US, UK, CA, RU & more) |
| File | Purpose |
|---|---|
| `full_df.csv` | Primary enriched video dataset (views, likes, dislikes, rates, punc_count) |
| `comments_data.csv` | Comment-level text for sentiment & word cloud analysis |
| `*_category_id.json` | Maps numeric category IDs to human-readable names |
| `data/youtube_videos.parquet` | Deployment-optimized video data (compressed) |
| `data/comments_sentiment.parquet` | Pre-scored comments for fast cloud loading |
The notebook is organized into 11 distinct analytical sections, each building on the previous:
Foundation before insight.
- Multi-country CSV loading with encoding handling (`utf-8`, `latin-1`, `ISO-8859-1`)
- Null detection and removal via `dropna()`
- Deduplication across merged country datasets
- Export pipeline: `.csv` → `.json` → MySQL DB via SQLAlchemy
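The loading and cleaning steps can be sketched without any dependencies. This is a minimal illustration, not the notebook's code: the notebook presumably uses pandas, and `read_rows`, `clean_rows`, and the `key_fields` parameter are hypothetical names introduced here.

```python
import csv

# Encodings tried in order. latin-1 decodes any byte sequence, so it must come
# after utf-8. In Python, ISO-8859-1 is an alias of latin-1, so the third entry
# is never actually reached.
ENCODINGS = ("utf-8", "latin-1", "ISO-8859-1")


def read_rows(path, encodings=ENCODINGS):
    """Return (rows, encoding_used) for a CSV file, trying each encoding in turn."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as fh:
                return list(csv.DictReader(fh)), enc
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error or ValueError("no encodings were given")


def clean_rows(rows, key_fields):
    """Drop rows with empty values, then drop duplicates on key_fields."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value in ("", None) for value in row.values()):
            continue  # rough analogue of dropna()
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned
```

With pandas, the same fallback would wrap `pd.read_csv(path, encoding=enc)` in the `try` block, followed by `dropna()` and `drop_duplicates()`.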
What does the audience actually feel?
- NLTK VADER (Valence Aware Dictionary and sEntiment Reasoner) for comment scoring
- Compound score range: `-1.0` (most negative) to `+1.0` (most positive)
- Classification thresholds:
  - Positive: `score ≥ 0.5`
  - Negative: `score ≤ -0.5`
- Applied to 691,400+ comment records
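from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()  # needs nltk.download("vader_lexicon") once

def get_sentiment(text):
    # Return VADER's compound score, a float in [-1.0, 1.0]
    score = analyzer.polarity_scores(str(text))
    return score['compound']

df['sentiment_score'] = df['comment_text'].apply(get_sentiment)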
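The thresholds listed earlier turn a compound score into a label. A minimal sketch follows; `label_sentiment` is a hypothetical helper, and the "neutral" bucket is an assumption, since the source only defines the positive and negative cutoffs.

```python
POSITIVE_THRESHOLD = 0.5   # score >= 0.5 counts as positive
NEGATIVE_THRESHOLD = -0.5  # score <= -0.5 counts as negative


def label_sentiment(compound):
    """Map a VADER compound score in [-1.0, 1.0] to a label."""
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"  # assumption: the source does not name this bucket


scores = [0.84, 0.5, 0.12, -0.5, -0.91]
print([label_sentiment(s) for s in scores])
```

On a DataFrame, this would typically be applied with `df['sentiment_score'].apply(label_sentiment)`.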
What language defines positive vs. negative engagement?
- Separated corpora: positive comments vs. negative comments
- Stop-word filtering using `wordcloud.STOPWORDS`
- Side-by-side 18×7 matplotlib visualization
- Reveals dominant vocabulary patterns in audience reactions
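A word cloud is a visual rendering of term frequencies, so the underlying counting step can be sketched on its own. This is a simplified stand-in, not the notebook's code: the tiny `STOPWORDS` set here replaces `wordcloud.STOPWORDS`, and `top_terms` and the sample comments are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the notebook uses wordcloud.STOPWORDS instead.
STOPWORDS = {"a", "an", "and", "i", "is", "it", "of", "so", "the", "this", "to", "was"}


def top_terms(comments, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-word tokens across comments."""
    counts = Counter()
    for text in comments:
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Hypothetical comments, already split by sentiment label.
positive = ["This is so funny", "funny and amazing", "amazing video"]
negative = ["this was boring", "boring and fake"]
print(top_terms(positive, n=2))
print(top_terms(negative, n=2))
```

Keeping the two corpora separate before counting is what lets the clouds contrast positive and negative vocabulary.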
The unspoken language of YouTube comments.
- Emoji extraction using the `emoji` library's `emoji_list()` function
- Top-10 most-used emojis ranked by frequency
- Interactive Plotly bar chart for visual exploration
- Provides cultural and emotional context beyond text
Scale before depth.
- Iterates all `*videos.csv` files from multiple country datasets
- Unified `full_df` DataFrame with consistent schema
- Shape validation post-merge
Numbers mean nothing without context.
- Parsed `_category_id.json` to build `cat_dict = {id: name}`
- Mapped numeric IDs to readable labels (e.g., `10` → `"Music"`)
- Seaborn strip plot of `likes` distribution per category
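The ID-to-name mapping can be sketched as follows. The JSON layout is an assumption based on the YouTube Data API `videoCategories` response format; `build_cat_dict`, `SAMPLE`, and the "Unknown" placeholder are hypothetical.

```python
import json

# Assumed file layout (YouTube Data API videoCategories response):
# {"items": [{"id": "10", "snippet": {"title": "Music"}}, ...]}
SAMPLE = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "26", "snippet": {"title": "Howto & Style"}}
]}
"""


def build_cat_dict(raw_json):
    """Build {category_id: category_name} from a *_category_id.json payload."""
    payload = json.loads(raw_json)
    return {int(item["id"]): item["snippet"]["title"] for item in payload["items"]}


cat_dict = build_cat_dict(SAMPLE)
rows = [{"category_id": 10}, {"category_id": 26}, {"category_id": 99}]
for row in rows:
    # Unknown IDs get a placeholder rather than raising KeyError.
    row["category_name"] = cat_dict.get(row["category_id"], "Unknown")
print(rows)
```

With pandas, the equivalent step would be `full_df['category_id'].map(cat_dict)`, which leaves unmapped IDs as NaN.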
Raw counts lie. Rates tell the truth.
Three derived metrics computed as a percentage of views:
like_rate = (likes / views) × 100
dislike_rate = (dislikes / views) × 100
comment_count_rate = (comment_count / views) × 100
- 1×3 subplot grid showing rate distributions per category
- Reveals hidden performers vs. inflated-view content
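A small worked example shows why rates can reorder videos that raw counts rank the other way. This is a sketch with made-up numbers; `engagement_rates` is a hypothetical helper, and returning `None` for zero-view rows is an assumption about how such rows should be handled.

```python
def engagement_rates(views, likes, dislikes, comment_count):
    """Return like, dislike, and comment rates as percentages of views."""
    if views <= 0:
        # Assumption: treat zero-view rows as missing instead of dividing by zero.
        return {"like_rate": None, "dislike_rate": None, "comment_count_rate": None}
    return {
        "like_rate": likes / views * 100,
        "dislike_rate": dislikes / views * 100,
        "comment_count_rate": comment_count / views * 100,
    }


# Two hypothetical videos: huge reach vs. modest reach with stronger approval.
viral = engagement_rates(views=2_000_000, likes=20_000, dislikes=4_000, comment_count=2_000)
niche = engagement_rates(views=50_000, likes=4_000, dislikes=50, comment_count=500)
print(viral["like_rate"], niche["like_rate"])
```

The viral video has five times as many likes, yet the niche video's like rate is several times higher, which is the "hidden performer" pattern the rate plots are meant to surface.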
How tightly do these signals move together?
- Pearson correlation matrix: views, likes, dislikes
- Seaborn regplot for views vs. likes with regression line
- Annotated heatmap for full numeric intuition
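For intuition about what each cell of the matrix measures, here is a dependency-free sketch of the Pearson coefficient. In the notebook this is presumably computed with pandas (`DataFrame.corr()` uses Pearson by default); the `pearson` helper and the sample values below are hypothetical and not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not taken from the dataset.
views = [1_000, 5_000, 20_000, 80_000, 300_000]
likes = [90, 300, 700, 6_000, 9_000]
print(round(pearson(views, likes), 2))
```

A value near 1 means likes rise roughly linearly with views; the gap below 1 is where videos collect views without proportional approval.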
Who dominates the trending tab, and how often?
- `value_counts()` on `channel_title` across all records
- Top 20 most-trending channels ranked
- Interactive Plotly bar chart with color gradient by count
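The ranking step amounts to counting records per channel. A minimal sketch with `collections.Counter` as a stand-in for pandas `value_counts()`; the function names and channel names are hypothetical.

```python
from collections import Counter


def trending_counts(channel_titles, top_n=20):
    """Rank channels by how many trending records they appear in."""
    return Counter(channel_titles).most_common(top_n)


def split_repeat_and_one_off(channel_titles):
    """Separate channels that trend repeatedly from those seen exactly once."""
    counts = Counter(channel_titles)
    repeat = {name for name, n in counts.items() if n > 1}
    one_off = {name for name, n in counts.items() if n == 1}
    return repeat, one_off


# Hypothetical channel names, one entry per trending record.
records = ["Channel A", "Channel B", "Channel A", "Channel C", "Channel A", "Channel B"]
print(trending_counts(records, top_n=2))
print(split_repeat_and_one_off(records))
```

The second helper reflects the project's goal of telling sustained momentum apart from one-off spikes.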
Does expressive formatting drive stronger reactions?
import string

def punctuation_count(text):
    # Count characters found in string.punctuation (ASCII punctuation only)
    return sum(char in string.punctuation for char in str(text))

full_df['punc_count'] = full_df['title'].apply(punctuation_count)
- 2×2 subplot grid: `punc_count` vs. `views`, `likes`, `dislikes`, `comment_count`
- Treated as a behavioral signal, not a primary driver
From notebook to production.
- `build_deployment_data.py` script packages raw data into compressed Parquet files
- Eliminates need to upload 600MB+ CSV to cloud
- Streamlit app detects `data/*.parquet` and loads from there automatically
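The detection logic might look like the sketch below. It is not taken from `app.py`: `pick_video_source` is a hypothetical helper, and falling back to `full_df.csv` when no Parquet file exists is an assumption.

```python
from pathlib import Path


def pick_video_source(root="."):
    """Return the video data file the app should load, preferring Parquet."""
    root = Path(root)
    parquet = root / "data" / "youtube_videos.parquet"
    csv_file = root / "full_df.csv"
    if parquet.is_file():
        return parquet
    if csv_file.is_file():
        return csv_file  # assumption: local runs may fall back to the raw CSV
    raise FileNotFoundError(
        "no video data found: run build_deployment_data.py or add full_df.csv"
    )
```

The returned path would then be passed to `pd.read_parquet` or `pd.read_csv` depending on its suffix.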
Here's what the data actually says, straight from the analysis.
The correlation between views and likes is 0.78, which is strong but not deterministic. Some videos accumulate massive view counts without proportional likes, suggesting passive viewership. The best-performing content earns active approval, not just passive impressions.
Howto & Style leads all categories in average like rate. This suggests that content that teaches tends to generate more deliberate, appreciative engagement, even when raw view counts are modest compared to Entertainment or Music.
The channel with the highest recurring trending count was The Late Show with Stephen Colbert, which appears 710 times in the dataset. This points to durable creator-level momentum, not viral luck. Consistent content cadence compounds into algorithmic staying power.
NickyJamTV led in total accumulated views. At the top of the distribution, raw reach is overwhelmingly driven by creator brand strength rather than category or title optimization.
Positive comments are dominated by words of admiration and entertainment. Negative comments cluster around criticism and controversy. High like counts alone can mask a divided comment section; sentiment analysis is the corrective lens.
Videos with moderate punctuation in titles show marginally higher engagement across all metrics. However, this is a supporting signal: expressive formatting may correlate with a certain content personality type rather than directly causing engagement.
Raw CSVs (Multi-Country)
        │
        ▼
Data Cleaning ──────────────────────────────────┐
(null removal, dedup, encoding handling)        │
        │                                       │
        ▼                                       ▼
Feature Engineering                  Comment Processing
(like_rate, dislike_rate,            (VADER sentiment scoring,
 comment_count_rate, punc_count,      word cloud generation,
 category_name mapping)               emoji frequency analysis)
        │                                       │
        └───────────────────┬───────────────────┘
                            │
                            ▼
                Analysis & Visualization
                (Seaborn, Matplotlib, Plotly)
                            │
                            ▼
                Export & Packaging
                (CSV, JSON, MySQL, Parquet)
                            │
                            ▼
                Streamlit Dashboard (6 Pages)
                youtube-data-analysis.streamlit.app
| Layer | Tools |
|---|---|
| Data Wrangling | pandas, numpy |
| NLP / Sentiment | nltk (VADER), wordcloud |
| Emoji Analysis | emoji, collections.Counter |
| Visualization | matplotlib, seaborn, plotly.express |
| Dashboard | streamlit |
| Database | sqlalchemy, pymysql (MySQL) |
| Deployment Data | pyarrow (Parquet) |
| String Processing | string, os |
Youtube_Data_Analysis/
│
├── youtube_data_analysis.ipynb      ← Main analysis notebook (225 cells)
├── app.py                           ← Streamlit dashboard (6 pages)
├── build_deployment_data.py         ← Parquet packaging script
├── comments_data.csv                ← Comment records (691K rows)
├── full_df.csv                      ← Full cleaned and enriched dataset
├── requirements.txt                 ← Python dependencies
│
├── data/
│   ├── youtube_videos.parquet       ← Compressed video dataset
│   └── comments_sentiment.parquet   ← Pre-scored comment dataset
│
├── assets/
│   └── img-1.png to img-6.png       ← Dashboard screenshots
│
└── Resources Link                   ← Additional databases, full_df.csv (Google Drive link)
⚠️ Note: `full_df.csv` is not committed to this repo (too large for GitHub). Generate it by running the notebook, or build from the packaged Parquet files.
git clone https://github.com/doyancha/Youtube_Data_Analysis.git
cd Youtube_Data_Analysis

pip install -r requirements.txt

Option A: You already have full_df.csv. Place it in the project root.
Option B: Regenerate from the notebook. Run all cells in youtube_data_analysis.ipynb, then export:
full_df.to_csv("full_df.csv", index=False)
Option C: Build compressed Parquet files for cloud deployment:
python build_deployment_data.py

streamlit run app.py
The app will open at http://localhost:8501
streamlit==1.53.1
pandas==2.3.3
numpy==2.4.1
matplotlib==3.10.8
seaborn==0.13.2
plotly==6.5.2
nltk==3.9.2
wordcloud==1.9.6
pyarrow==23.0.0

The repository includes exported dashboard screenshots in the assets folder so visitors can preview the interface before running the app locally.
The dashboard uses a dark neon-glassmorphism design language built to feel like a modern BI product, not a notebook export.
| Page | Description |
|---|---|
| Overview | Executive KPIs, views/likes scatter, trending channel bar chart, correlation heatmap |
| Content Patterns | Category like rates, reach-to-engagement behavior, punctuation impact |
| Channel & Category Explorer | Filterable KPI strip, creator comparisons, filtered scatter |
| Audience Sentiment | Sentiment mix donut, score distribution, punctuation vs. reaction |
| Word Clouds | Positive/negative word clouds, most frequent terms |
| Final Takeaway | Strategic recommendation cards for creators and analysts |
MIR SHAHADUT HOSSAIN
Data Analyst | Streamlit Developer
Turning raw data into decisions, one dashboard at a time.
| Resource | Link |
|---|---|
| Streamlit Docs | docs.streamlit.io |
| Pandas Documentation | pandas.pydata.org/docs |
| Plotly Express | plotly.com/python/plotly-express |
| NLTK VADER Sentiment | nltk.org/api/nltk.sentiment.vader |
| WordCloud Library | amueller.github.io/word_cloud |
| Seaborn Gallery | seaborn.pydata.org/examples |
| emoji (PyPI) | pypi.org/project/emoji |
| SQLAlchemy | docs.sqlalchemy.org |
| Resource | Link |
|---|---|
| Kaggle: YouTube Trending Videos | kaggle.com/datasets/datasnaek/youtube-new |
| Kaggle: YouTube Comments | kaggle.com/datasets/datasnaek/youtube |
| Topic | Link |
|---|---|
| Sentiment Analysis with VADER | medium.com (VADER guide) |
| Python Encoding Guide | docs.python.org/encoding |
| Streamlit App Deployment | docs.streamlit.io/streamlit-community-cloud |
| Parquet with Pandas | pandas.pydata.org/parquet |
| Plotly for Dashboards | plotly.com/python/getting-started |
| Data Storytelling Principles | storytellingwithdata.com |
Built with ❤️ by Mir Shahadut Hossain | 2025-2026
Data Analyst · Python Developer · Streamlit Builder





