An end-to-end Music Analytics Platform that analyzes 15,000 tracks across 12 genres and 79 artists, combining exploratory data analysis (EDA), K-Means music archetype clustering, and Gradient Boosting popularity prediction.
| Metric | Value |
|---|---|
| 🎵 Total Tracks | 15,000 |
| 🎤 Artists | 79 |
| 🎶 Genres | 12 |
| 📡 Total Streams | 457 Billion |
| 🤖 Music Archetypes (K-Means clusters) | 6 |
| 📈 Popularity Prediction R² | 0.597 |
| 🎛️ Audio Features Analyzed | 8 |
spotify-music-intelligence/
├── 📁 data/
│   └── spotify_tracks.csv                      # 15K tracks with 22 features
├── 📁 src/
│   ├── generate_data.py                        # Realistic music dataset generator
│   ├── eda_viz.py                              # 4 professional EDA dashboards
│   └── ml_pipeline.py                          # K-Means + GBM + PCA pipeline
├── 📁 outputs/
│   ├── 📁 figures/
│   │   ├── 01_music_intelligence_dashboard.png
│   │   ├── 02_audio_features_deepdive.png
│   │   ├── 03_popularity_intelligence.png
│   │   ├── 04_genre_evolution.png
│   │   └── 05_ml_clustering_dashboard.png
│   └── 📁 models/
│       ├── popularity_gbm.pkl
│       ├── kmeans_archetypes.pkl
│       └── scaler.pkl
├── requirements.txt
└── README.md
git clone https://github.com/Munishx01/spotify-music-intelligence.git
cd spotify-music-intelligence
pip install -r requirements.txt
python src/generate_data.py # Generate dataset
python src/eda_viz.py # Run EDA visualizations
python src/ml_pipeline.py # Train ML models

K-Means clusters the 15,000 tracks into 6 distinct music archetypes based on 8 audio features, with k chosen by the Elbow Method.
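
The Elbow Method mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/ml_pipeline.py`: the helper name `elbow_inertias`, the random seed, the range of k, and the synthetic demo array are all assumptions, and standardizing the features first is assumed because Tempo and Loudness sit on different scales than the 0 to 1 features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_inertias(features, k_values, seed=42):
    """Fit K-Means once per k and return {k: inertia} on standardized features.

    Hypothetical helper for illustration; not taken from the project source.
    """
    scaled = StandardScaler().fit_transform(features)
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(scaled)
        # inertia_ is the within-cluster sum of squared distances
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic stand-in for the 8 audio features (NOT the project's dataset)
rng = np.random.default_rng(0)
demo = rng.random((300, 8))

inertias = elbow_inertias(demo, range(2, 9))
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against k and looking for the point where the curve flattens gives the "elbow". On the real dataset the README reports that this selection led to k = 6.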
| Archetype | Description | Key Features |
|---|---|---|
| 🔥 Club Bangers | High energy, danceable | Energy>0.75, Dance>0.72 |
| 🎸 Dark Intensity | Aggressive, intense | Energy>0.80, Valence<0.45 |
| 🎻 Acoustic Soul | Organic, unplugged | Acousticness>0.60 |
| ☀️ Feel Good Vibes | Positive, relaxed | Valence>0.65, Energy<0.55 |
| 🎤 Rhythm & Flow | Groove-focused | Dance>0.70, Tempo>120 |
| 🌙 Mellow Groove | Mid-tempo, calm | Balanced features |
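
One way to read the table above is as a set of rules for naming a cluster from its centroid. The sketch below is an assumption about how such naming could work, not the project's actual logic: the function name, the dictionary keys, and especially the order in which the overlapping thresholds are checked are all hypothetical.

```python
def label_archetype(centroid):
    """Name a cluster from its centroid using the thresholds in the table.

    Hypothetical helper: the check order is an assumption, chosen so the
    stricter energy rule is tested before the broader ones.
    """
    if centroid["energy"] > 0.80 and centroid["valence"] < 0.45:
        return "Dark Intensity"
    if centroid["energy"] > 0.75 and centroid["danceability"] > 0.72:
        return "Club Bangers"
    if centroid["acousticness"] > 0.60:
        return "Acoustic Soul"
    if centroid["valence"] > 0.65 and centroid["energy"] < 0.55:
        return "Feel Good Vibes"
    if centroid["danceability"] > 0.70 and centroid["tempo"] > 120:
        return "Rhythm & Flow"
    # No distinctive feature stands out: treat as the balanced archetype
    return "Mellow Groove"


# Illustrative centroid values, not taken from the trained model
example = {
    "energy": 0.85,
    "danceability": 0.80,
    "valence": 0.60,
    "acousticness": 0.10,
    "tempo": 128,
}
print(label_archetype(example))  # → Club Bangers
```

Because several thresholds overlap (for example, high energy appears in two archetypes), a different check order would label some centroids differently.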
| Model | R² Score | MAE |
|---|---|---|
| Linear Regression | 0.439 | 10.99 |
| Random Forest | 0.583 | 9.34 |
| Gradient Boosting | 0.597 | 9.18 |
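
For readers unfamiliar with the two metrics in this table, here is a small dependency-free sketch of how R² and MAE are computed. The function names and the sample values are illustrative assumptions, not results from the project.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up popularity scores, for illustration only
actual = [40, 55, 70, 85]
predicted = [45, 50, 75, 80]

print(mean_abs_error(actual, predicted))        # → 5.0
print(round(r_squared(actual, predicted), 3))   # → 0.911
```

MAE is expressed in the same units as the popularity score, so the Gradient Boosting row means predictions are off by about 9.18 points on average, while R² of 0.597 means the model explains roughly 60% of the variance in popularity.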
PCA reduces 8 audio dimensions to 2 components for cluster visualization, explaining ~68% of variance.
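
The projection and the "explained variance" figure can be illustrated with a short NumPy sketch. This is an assumption-laden example rather than the pipeline's code: `pca_2d` is a hypothetical helper, the features are standardized first (assumed), and the input is synthetic, so its variance ratio will not match the ~68% reported for the real data.

```python
import numpy as np


def pca_2d(features):
    """Project standardized features onto the first 2 principal components.

    Returns (coords, ratio): the 2-D coordinates and the share of total
    variance carried by each of the two components.
    Assumes no feature column is constant (its std would be zero).
    """
    X = np.asarray(features, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes; squared singular values are
    # proportional to the variance along each axis.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    variance_share = singular_values**2 / np.sum(singular_values**2)
    coords = X @ Vt[:2].T
    return coords, variance_share[:2]


# Synthetic stand-in for the 8 audio features (NOT the project's dataset)
rng = np.random.default_rng(0)
demo = rng.random((200, 8))

coords, ratio = pca_2d(demo)
print(coords.shape)   # → (200, 2)
print(ratio.sum())    # combined share of variance kept by the 2 components
```

The two columns of `coords` are what a cluster scatter plot would use as x and y axes, with points colored by archetype.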
| Finding | Impact |
|---|---|
| Danceability is the #1 driver of popularity | High |
| EDM has highest energy (0.88 avg) of all genres | Medium |
| K-Pop leads in danceability + valence combo | Medium |
| Classical streams spike in Oct–Dec (holiday season) | Medium |
| Hip-Hop dominates streams despite 15% genre share | High |
| Music is trending louder & more danceable year-over-year | High |
| Club Bangers archetype has 2.3× more streams than Acoustic Soul | High |
| Feature | Range | Description |
|---|---|---|
| Danceability | 0.0–1.0 | How suitable for dancing |
| Energy | 0.0–1.0 | Intensity and activity level |
| Valence | 0.0–1.0 | Musical positivity |
| Acousticness | 0.0–1.0 | Acoustic vs electronic |
| Speechiness | 0.0–1.0 | Presence of spoken words |
| Liveness | 0.0–1.0 | Live audience detection |
| Tempo | 50–210 BPM | Track speed |
| Loudness | -20 to -3 dB | Overall loudness |
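
The ranges in this table can double as a simple sanity check on a track record. The sketch below assumes lowercase column names such as `danceability` and `tempo`; the actual column names in `spotify_tracks.csv` are not shown here, so treat both the keys and the helper name `out_of_range` as hypothetical.

```python
# Documented (low, high) bounds per feature, copied from the table above.
# Key names are assumed; adjust them to match the real CSV columns.
FEATURE_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "tempo": (50.0, 210.0),      # BPM
    "loudness": (-20.0, -3.0),   # dB
}


def out_of_range(track):
    """Return names of features that are missing or outside their range."""
    problems = []
    for name, (low, high) in FEATURE_RANGES.items():
        value = track.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    return problems


# Illustrative record with an implausible tempo
track = {
    "danceability": 0.71, "energy": 0.82, "valence": 0.55,
    "acousticness": 0.12, "speechiness": 0.06, "liveness": 0.18,
    "tempo": 250.0, "loudness": -5.4,
}
print(out_of_range(track))  # → ['tempo']
```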
Python 3.10, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
K-Means Clustering, PCA, Gradient Boosting, Random Forest, EDA
Munish Kumar — Data Analyst | Python | SQL | Machine Learning
📧 mk611453@gmail.com | 📍 Palampur, Himachal Pradesh
"Music is data. Data tells stories. Let the music speak." 🎵




