SoundGraph is a data-driven music discovery engine that builds knowledge graphs from SoundCloud metadata to uncover hidden relationships between tracks, artists, and users.
SoundGraph now supports two modes of operation:
Personal Graph Mode builds your own music discovery graph on demand, with no PostgreSQL required:
- Start from any SoundCloud track
- Expand through related tracks via playlists
- Cache everything locally in SQLite
- Get instant recommendations
- Visualize your personal music network
Quick Start:
# Build a personal graph from a track
make build_graph TRACK_URL="https://soundcloud.com/artist/track"
# Deeper exploration
make build_graph_deep TRACK_URL="https://soundcloud.com/artist/track"
# With visualization
make build_graph_viz TRACK_URL="https://soundcloud.com/artist/track"
Bulk Collection Mode is the traditional workflow for building large-scale databases:
- Bulk genre-based collection
- PostgreSQL storage
- Materialized views for co-occurrence
- Production-ready for large datasets
SoundGraph goes beyond SoundCloud's built-in recommendations by creating a comprehensive knowledge graph that reveals:
- Track Relationships: Which songs appear together in playlists (co-occurrence analysis)
- Artist Connections: How artists are linked through collaborations, shared playlists, and fan overlap
- User Similarity: Find users with similar music taste based on their likes and playlists
- Deep Discovery: Get recommendations based on complex relationship patterns, not just individual track similarity
SoundCloud's algorithm is a "black box" - you can't see WHY you got a recommendation. SoundGraph creates a transparent, queryable music knowledge graph where you can:
- Input a track and see exactly WHY certain tracks are related
- Find the "missing links" between two different songs/artists
- Discover music through community behavior patterns (what do people who like X also like?)
- Build custom recommendation models on top of rich relational data
- Data Collection: Fetches public metadata from SoundCloud (tracks, playlists, users, interactions)
- Relationship Extraction: Builds a graph where edges represent relationships:
  - Track ↔ Track (co-occurrence in playlists)
  - User → Track (likes, reposts)
  - User ↔ User (similar taste patterns)
  - Artist ↔ Artist (collaboration networks)
- Query Engine: Provides APIs to query relationships and find recommendations
- ML Ready: Exports graph data for training recommendation models
Instead of just analyzing individual track features, SoundGraph looks at behavioral patterns:
- If tracks A and B appear in many playlists together → they're related
- If users who like track X also like track Y → similarity signal
- If user P and user Q have 70% playlist overlap → similar taste
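The first signal above can be sketched in a few lines of Python. This is a minimal illustration, not SoundGraph's own implementation: the function name `cooccurrence_counts` and the toy playlist data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(playlists):
    """Count how many playlists contain each unordered pair of tracks.

    `playlists` maps a playlist name to a list of track ids (hypothetical shape).
    """
    counts = Counter()
    for track_ids in playlists.values():
        # Deduplicate within a playlist and sort so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(set(track_ids)), 2):
            counts[(a, b)] += 1
    return counts


# Toy data, invented for illustration only.
playlists = {
    "study": ["A", "B", "C"],
    "night": ["A", "B"],
    "gym": ["C", "D"],
}
counts = cooccurrence_counts(playlists)
print(counts[("A", "B")])  # 2
```

Pairs with a higher count share more playlists, which is the edge weight the graph uses to call two tracks related.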
The easiest way to start exploring music relationships:
- Python 3.11+
- SoundCloud API access (OAuth token)
- No database required!
git clone https://github.com/your-username/soundgraph.git
cd soundgraph
# Setup environment
conda create -y -n sgr python=3.11
conda activate sgr
pip install -r requirements.txt
pip install -e .
Create .env file:
# SoundCloud API (only these are required for personal graphs)
SOUNDCLOUD_ACCESS_TOKEN=your_oauth_token_here
SOUNDCLOUD_CLIENT_ID=your_client_id_here
# 1. Find a track you like on SoundCloud
# Example: https://soundcloud.com/chillhop/floating-away
# 2. Build a personal graph from it
make build_graph TRACK_URL="https://soundcloud.com/chillhop/floating-away"
# This will:
# - Fetch the track and artist info
# - Explore the artist's playlists
# - Find related tracks via co-occurrence
# - Cache everything locally (data/cache/tracks.db)
# - Build a NetworkX graph
# - Give you recommendations!
After running the command, you'll see:
- Track Statistics: How many tracks were collected
- Playlist Coverage: How many playlists were analyzed
- Recommendations: Top 5 related tracks based on graph structure
- Neighbors: Direct relationships to your seed track
- Graph Export: JSON file for further analysis
# Deeper exploration (2 hops instead of 1)
make build_graph_deep TRACK_URL="https://soundcloud.com/artist/track"
# With visualization
make build_graph_viz TRACK_URL="https://soundcloud.com/artist/track"
# Custom parameters
TRACK_URL="https://soundcloud.com/artist/track" \
DEPTH=3 \
MAX_TRACKS=2000 \
VISUALIZE=true \
python scripts/build_personal_graph.py
All outputs are stored locally:
data/
├── cache/
│   └── tracks.db          # SQLite cache (reusable across sessions)
└── graphs/
    ├── graph_xxx.json     # NetworkX graph export
    └── graph_xxx.png      # Visualization (if VISUALIZE=true)
- No Database Setup: Just an API token and you're ready
- Smart Caching: Re-running doesn't re-fetch data
- Personalized: Each user builds their own graph
- Scalable: Start small, expand as needed
- Transparent: See exactly why tracks are related
- Visual: Export graphs for visualization
For production use cases requiring PostgreSQL and large-scale data collection:
- Python 3.11+
- PostgreSQL (local installation)
- SoundCloud API access (OAuth token preferred)
git clone https://github.com/your-username/soundgraph.git
cd soundgraph
# Setup environment
conda create -y -n sgr python=3.11
conda activate sgr
pip install -r requirements.txt
pip install -e .
Create .env file:
# SoundCloud API
SOUNDCLOUD_ACCESS_TOKEN=your_oauth_token_here
SOUNDCLOUD_CLIENT_ID=your_client_id_here
# PostgreSQL Database
PGHOST=localhost
PGPORT=5432
PGUSER=sgr
PGPASSWORD=your_password
PGDATABASE=sgr
# Sample data
SAMPLE_QUERY=lofi
# Ensure PostgreSQL is running
sudo systemctl start postgresql
# Create database and user (run once)
sudo -u postgres psql
CREATE DATABASE sgr;
CREATE USER sgr WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE sgr TO sgr;
\q
# Fetch tracks by search query
SAMPLE_QUERY="lofi hip hop" python scripts/ingest_sample.py
What this does: Searches SoundCloud for tracks matching your query and saves raw JSON data to data/raw/tracks_search_*.jsonl
Expected output: Found 50 tracks for query 'lofi hip hop', saved to data/raw/tracks_search_lofi_hip_hop_20250909.jsonl
# Process raw JSON into structured data
python -m sgr.clean.clean_tracks
What this does: Converts raw JSON to structured parquet files with engagement scores, normalized tags, and clean metadata.
Expected output: wrote data/staging/tracks_search_lofi_hip_hop_20250909.parquet 50
# Create schema and load tracks + artists
python -m sgr.db.load_tracks
What this does: Creates database tables and loads artists + tracks with proper relationships.
Expected output: Database tables created, artists and tracks inserted.
# Deep-dive into a specific track's ecosystem
TRACK_URL="https://soundcloud.com/artist/track-name" python scripts/resolve_and_crawl.py
What this does: Takes a track URL and crawls the artist's entire playlist ecosystem, collecting:
- All playlists by that artist
- All tracks in those playlists
- User interactions (likes, if available)
Expected output:
INFO: resolve: https://soundcloud.com/artist/track-name
INFO: track_id=123456 by user_id=789
INFO: playlists fetched: 15
INFO: playlist track entries: 342
SUCCESS: crawl complete
# Clean and load playlist data
python -m sgr.clean.clean_playlists
python -m sgr.db.load_playlists
What this does: Normalizes playlist data and loads users, playlists, and playlist_tracks relationships.
# Create advanced schema with materialized views
python scripts/create_schema_extras.py
# Build track co-occurrence matrix
python scripts/refresh_cooccur.py
What this does: Creates a materialized view that calculates how often tracks appear together in playlists - the core of the knowledge graph.
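The query behind such a view is typically a self-join on the playlist membership table. The sketch below runs that idea against an in-memory SQLite database so it is self-contained; the `playlist_tracks` columns (`playlist_id`, `track_id`) are assumptions, while the output column names `track_id_a`, `track_id_b`, and `together` follow the validation query used elsewhere in this README. The actual SQL in `create_schema_extras.py` may differ.

```python
import sqlite3

# Self-join: pair every two distinct tracks that share a playlist.
# The a.track_id < b.track_id condition keeps one row per unordered pair.
COOCCUR_SQL = """
SELECT a.track_id AS track_id_a,
       b.track_id AS track_id_b,
       COUNT(DISTINCT a.playlist_id) AS together
FROM playlist_tracks AS a
JOIN playlist_tracks AS b
  ON a.playlist_id = b.playlist_id
 AND a.track_id < b.track_id
GROUP BY a.track_id, b.track_id
ORDER BY together DESC, track_id_a, track_id_b
"""


def cooccurrence_rows(rows):
    """Load (playlist_id, track_id) rows into SQLite and return pair counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE playlist_tracks (playlist_id INTEGER, track_id INTEGER)")
    conn.executemany("INSERT INTO playlist_tracks VALUES (?, ?)", rows)
    result = conn.execute(COOCCUR_SQL).fetchall()
    conn.close()
    return result


# Toy membership rows, invented for illustration.
sample = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 20), (3, 30)]
for row in cooccurrence_rows(sample):
    print(row)
```

In PostgreSQL, a SELECT of this shape could be wrapped in `CREATE MATERIALIZED VIEW track_cooccurrence AS ...` and refreshed later with `REFRESH MATERIALIZED VIEW track_cooccurrence`.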
# Analyze a track's complete relationship network
TRACK_URL="https://soundcloud.com/artist/track-name" python scripts/unveil.py
What this does: Shows the complete "relationship profile" of a track:
- Basic track info
- Playlists containing it
- Related tracks (by co-occurrence)
- Related tracks (by tag similarity)
- Artist connections
- Engagement patterns
Expected output:
=== TRACK SUMMARY ===
Track: "Chill Lo-Fi Beats" by LoFiArtist
Genre: Hip Hop, Plays: 50,431, Likes: 1,203
=== PLAYLISTS CONTAINING THIS TRACK ===
- "Study Vibes" (15 tracks)
- "Late Night Coding" (23 tracks)
=== RELATED TRACKS (CO-OCCURRENCE) ===
1. "Midnight Study Session" - appeared together 8 times
2. "Coffee Shop Ambience" - appeared together 6 times
=== RELATED TRACKS (TAG SIMILARITY) ===
1. "Dreamy Loops" - 85% tag overlap
2. "Focus Beats" - 72% tag overlap
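The README does not say how the tag overlap percentage is computed. One plausible reading is Jaccard similarity over normalized tag sets, sketched below under that assumption; the function names and the tag lists are hypothetical.

```python
def normalize_tags(tags):
    """Lowercase and strip tags, dropping empties, so 'LoFi ' matches 'lofi'."""
    return {t.strip().lower() for t in tags if t.strip()}


def tag_overlap(tags_a, tags_b):
    """Jaccard similarity: shared tags divided by all distinct tags."""
    a, b = normalize_tags(tags_a), normalize_tags(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_tags(seed_tags, candidates, top_k=5):
    """Rank candidate tracks (title -> tags) by overlap with the seed's tags."""
    scored = [(title, tag_overlap(seed_tags, tags)) for title, tags in candidates.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Invented tags; the titles are borrowed from the sample output above.
seed = ["lofi", "chill", "beats", "study"]
candidates = {
    "Dreamy Loops": ["LoFi", "chill", "beats", "dream"],
    "Focus Beats": ["beats", "focus"],
}
for title, score in rank_by_tags(seed, candidates):
    print(f"{title}: {score:.0%} tag overlap")
```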
| Script | Purpose | Input | Output | Validation |
|---|---|---|---|---|
| ingest_sample.py | Fetch SoundCloud data | Search query | Raw JSONL files | Check data/raw/ for new files |
| clean_tracks.py | Normalize track data | Raw JSONL | Structured parquet | Check data/staging/ for parquet files |
| load_tracks.py | Load to database | Parquet files | DB records | SELECT COUNT(*) FROM tracks; |
| resolve_and_crawl.py | Deep crawl track ecosystem | Track URL | Playlist/user data | Check for new user_*_playlists.jsonl files |
| clean_playlists.py | Normalize playlist data | Raw playlist JSONL | Structured parquet | Check data/staging/ for playlist parquets |
| load_playlists.py | Load playlists to DB | Playlist parquets | DB records | SELECT COUNT(*) FROM playlists; |
| create_schema_extras.py | Advanced DB schema | None | Enhanced tables | Check for track_cooccurrence view |
| refresh_cooccur.py | Update relationships | Existing data | Updated view | SELECT COUNT(*) FROM track_cooccurrence; |
| unveil.py | Query relationships | Track URL/ID | Relationship report | Visual relationship output |
# Check data pipeline health
make test # Run basic API tests
# Check database content
psql -h localhost -U sgr -d sgr -c "
SELECT
(SELECT COUNT(*) FROM artists) as artists,
(SELECT COUNT(*) FROM tracks) as tracks,
(SELECT COUNT(*) FROM playlists) as playlists,
(SELECT COUNT(*) FROM track_cooccurrence) as relationships;
"
# Validate knowledge graph
psql -h localhost -U sgr -d sgr -c "
SELECT track_id_a, track_id_b, together
FROM track_cooccurrence
ORDER BY together DESC
LIMIT 10;
"
Here's how to run the complete pipeline for a single track:
# 1. Collect general data about a genre
SAMPLE_QUERY="ambient electronic" python scripts/ingest_sample.py
python -m sgr.clean.clean_tracks
python -m sgr.db.load_tracks
# 2. Deep-dive into a specific track's ecosystem
TRACK_URL="https://soundcloud.com/ambient-artist/floating-dreams" python scripts/resolve_and_crawl.py
python -m sgr.clean.clean_playlists
python -m sgr.db.load_playlists
# 3. Build knowledge graph
python scripts/create_schema_extras.py
python scripts/refresh_cooccur.py
# 4. Query relationships
TRACK_URL="https://soundcloud.com/ambient-artist/floating-dreams" python scripts/unveil.py
Or use the automated pipeline:
make pipeline TRACK_URL="https://soundcloud.com/ambient-artist/floating-dreams"
User Input (Track URL)
        ↓
1. Resolve Track → SoundCloud API
        ↓
2. Smart Expansion
   ├─ Fetch artist's playlists
   ├─ Extract tracks from playlists
   ├─ Build co-occurrence relationships
   └─ BFS expansion (configurable depth)
        ↓
3. Local Cache (SQLite)
   ├─ Tracks table
   ├─ Playlists table
   ├─ Related tracks table
   └─ Fast retrieval
        ↓
4. Build Personal Graph (NetworkX)
   ├─ Track nodes
   ├─ Weighted edges
   ├─ Graph algorithms
   └─ Recommendations
        ↓
5. Export & Visualize
   ├─ JSON export
   ├─ PNG visualization
   └─ Query interface
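The BFS expansion step in the flow above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `smart_expansion.py`: `expand_tracks` and the `neighbors` callback are hypothetical names, and the `depth` and `max_tracks` parameters only mirror the DEPTH and MAX_TRACKS settings shown earlier.

```python
from collections import deque


def expand_tracks(seed, neighbors, depth=1, max_tracks=500):
    """Breadth-first expansion from a seed track, bounded by hops and size.

    `neighbors(track_id)` returns related track ids; in a real run it would
    hit the cache or the API. Returns a dict of track id -> hop distance.
    """
    seen = {seed: 0}
    queue = deque([seed])
    while queue and len(seen) < max_tracks:
        current = queue.popleft()
        if seen[current] >= depth:
            continue  # do not expand past the configured depth
        for nxt in neighbors(current):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
                if len(seen) >= max_tracks:
                    break
    return seen


# Toy relationship map, invented for illustration.
related = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}


def lookup(track_id):
    return related.get(track_id, [])


print(expand_tracks("seed", lookup, depth=1))  # {'seed': 0, 'a': 1, 'b': 1}
print(expand_tracks("seed", lookup, depth=2))
```

Bounding both depth and total size keeps a deep crawl from fanning out without limit when a seed sits in very large playlists.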
Search Query
    ↓
Bulk Collection → Raw JSONL
    ↓
Clean & Normalize → Parquet
    ↓
Load to PostgreSQL
    ↓
Materialized Views (co-occurrence)
    ↓
Query & Analysis
soundgraph/
├── scripts/
│   ├── build_personal_graph.py   # NEW: User-facing personal graph builder
│   ├── ingest_sample.py          # Legacy: Bulk collection
│   ├── resolve_and_crawl.py      # Legacy: Deep crawl
│   └── unveil.py                 # Legacy: Query tool
├── src/sgr/
│   ├── cache/                    # NEW: SQLite caching
│   │   ├── __init__.py
│   │   └── track_cache.py
│   ├── collectors/               # NEW: Smart expansion
│   │   ├── __init__.py
│   │   └── smart_expansion.py
│   ├── graph/                    # NEW: NetworkX graphs
│   │   ├── __init__.py
│   │   └── personal_graph.py
│   ├── clean/                    # Legacy: Data normalization
│   ├── db/                       # Legacy: PostgreSQL ops
│   └── io/                       # Shared: SoundCloud client
├── data/
│   ├── cache/                    # NEW: SQLite databases
│   ├── graphs/                   # NEW: Exported graphs
│   ├── raw/                      # Legacy: Raw JSON
│   └── staging/                  # Legacy: Parquet files
└── tests/
    └── test_user_driven_architecture.py  # NEW: Tests
| Feature | Personal Graph Mode | Bulk Collection Mode |
|---|---|---|
| Database Required | No (SQLite only) | Yes (PostgreSQL) |
| Setup Complexity | Low | Medium |
| Collection Speed | Fast (on-demand) | Slow (bulk) |
| Data Volume | Small-Medium (100s-1000s tracks) | Large (10,000s+ tracks) |
| Use Case | Personal exploration, quick iteration | Production, large-scale analysis |
| Recommendations | Graph-based | SQL-based |
| Caching | Automatic (SQLite) | Manual |
| Visualization | Built-in | |
| Best For | Individual users, experimentation | Researchers, production systems |
Recommendation: Start with Personal Graph Mode for exploration, then move to Bulk Collection Mode if you need large-scale production data.
soundgraph/
├── scripts/                      # Main orchestration scripts
│   ├── ingest_sample.py          # SoundCloud API data collection
│   ├── resolve_and_crawl.py      # Deep ecosystem crawling
│   ├── create_schema_extras.py   # Advanced database schema
│   ├── refresh_cooccur.py        # Knowledge graph updates
│   └── unveil.py                 # Relationship query engine
├── src/sgr/                      # Core library
│   ├── clean/                    # Data normalization
│   ├── db/                       # Database operations
│   └── io/                       # SoundCloud API client
├── sql/schema.sql                # Database schema
├── data/
│   ├── raw/                      # Raw JSON from API
│   └── staging/                  # Processed parquet files
└── configs/                      # Configuration files
- REST API for querying relationships
- Real-time recommendation endpoints
- Graph visualization endpoints
- Web interface for exploration
- Interactive graph visualization
- User recommendation interface
- Graph Neural Networks for recommendations
- Audio feature analysis integration
- Collaborative filtering enhancement
This project is designed for community collaboration:
- Data Scientists: Extend the knowledge graph algorithms
- Backend Developers: Build the API layer
- Frontend Developers: Create visualization interfaces
- ML Engineers: Develop recommendation models
See CONTRIBUTING.md for guidelines.
MIT License - Build amazing music discovery tools!
This roadmap layers a knowledge-graph-enhanced, multi-task recommender on top of SoundGraph, inspired by the MMSS_MKR framework (multi-task, multi-channel, multi-loss) and its KG construction workflow. The goal: keep SoundGraph's transparent co-occurrence engine while adding joint training with KG embeddings, cross-&-compression feature sharing, and clear evaluation/ablation.
- Add item-attribute triples beyond co-occurrence edges: Track -[has_genre]-> Genre, Track -[by_artist]-> Artist, Track -[in_playlist]-> Playlist, Artist -[collab_with]-> Artist.
  - Where available, include Track -[has_tag]-> Tag, Track -[released_on]-> Date, Track -[label]-> Label.
- Normalize and fuse entities (dedupe artists, playlists, tags) with an entity alignment pass; keep provenance for transparency.
- Export triples to data/graph/triples.parquet with schema: (head, relation, tail, weight, source).
Why: The paper's KG is built from structured and semi-structured or unstructured sources, then fused and cleaned into triples before embedding; mirroring that workflow broadens the signal available to the recommender.
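As a rough sketch of the triple schema above, the snippet below turns one track record into `(head, relation, tail, weight, source)` rows. The input field names (`id`, `artist_id`, `genre`, `tags`), the entity prefixes, and the default `source` label are all assumptions made for illustration; writing the rows to Parquet would additionally need pandas or pyarrow and is not shown.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str
    weight: float
    source: str  # provenance, kept so every edge stays explainable


def track_to_triples(track, source="soundcloud_api"):
    """Emit attribute triples for one track dict (hypothetical field names)."""
    tid = f"track:{track['id']}"
    triples = [Triple(tid, "by_artist", f"artist:{track['artist_id']}", 1.0, source)]
    if track.get("genre"):
        genre = track["genre"].strip().lower()
        triples.append(Triple(tid, "has_genre", f"genre:{genre}", 1.0, source))
    # Normalizing tags before emitting them is a cheap form of entity fusion.
    tags = {t.strip().lower() for t in track.get("tags", []) if t.strip()}
    for tag in sorted(tags):
        triples.append(Triple(tid, "has_tag", f"tag:{tag}", 1.0, source))
    return triples


example = {"id": 123456, "artist_id": 789, "genre": "Hip Hop", "tags": ["LoFi", "lofi ", "chill"]}
for triple in track_to_triples(example):
    print(triple)
```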
- Implement a KGE trainer supporting TransE / TransH / TransR backends (start with TransE).
- Inputs: triples.parquet; Outputs: embeddings/{entity}.npy, embeddings/{relation}.npy.
- Provide a CLI:
  python -m sgr.kge.train --triples data/graph/triples.parquet --model transe --dim 128 --epochs 50
- Ship a score function helper that supports multiple activations (sigmoid / tanh / softsign / softplus) for robust triple scoring, as in the paper's multi-calculation design.
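A score helper along these lines might look like the sketch below. It uses the standard TransE idea that a true triple should satisfy h + r ≈ t, so the negative L2 distance serves as the raw score. How the paper fuses the activations is not stated here, so averaging them is purely an assumption, as are the function names and the toy vectors.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; smaller means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def sigmoid(x):
    # Branch on sign to avoid overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "softsign": lambda x: x / (1.0 + abs(x)),
    "softplus": lambda x: math.log1p(math.exp(x)),  # fine here: input is <= 0
}


def fused_score(head, relation, tail, names=("sigmoid", "tanh", "softsign", "softplus")):
    """Average several activations of the raw TransE score (assumed fusion rule)."""
    raw = -transe_distance(head, relation, tail)
    return sum(ACTIVATIONS[name](raw) for name in names) / len(names)


# Toy 2-d embeddings, invented so that h + r equals the true tail.
h, r = [0.1, 0.2], [0.3, 0.1]
true_tail, corrupted_tail = [0.4, 0.3], [-0.5, 0.9]
print(fused_score(h, r, true_tail) > fused_score(h, r, corrupted_tail))  # True
```

Every activation used is monotonically increasing, so the fused score preserves the ordering of the raw scores; the fusion changes the scale and gradient shape rather than which triple ranks higher.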
- Create a joint trainer that optimizes:
  - Recommendation task (predict user → track interaction) using your co-occurrence features + track/artist embeddings,
  - KGE task (true vs. corrupted triples).
- Bridge both with a Cross & Compression Unit to share interaction features between the rec module and KG module. Expose depth L/H as hyperparams.
- CLI:
  python -m sgr.train.joint \
    --rec-dim 128 --kge-dim 128 --cross-depth 1 --epochs 200 \
    --lr-rec 2e-5 --lr-kge 2e-5 --l2 2e-5 --kge-interval 64
- Loss = L_rec + L_kge + λ‖w‖², with multi-activation fusion in the KGE score and multi-prediction fusion (dot-product then sigmoid/tanh/softsign) in the rec head, following the paper's recipe.
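For orientation, here is a dependency-free sketch of one cross and compress layer as it is commonly formulated in the MKR line of work: cross the item vector v with the entity vector e into C = v eᵀ, then project C and Cᵀ back down to d dimensions. Whether MMSS_MKR uses exactly this form is an assumption, and the names and toy weights are hypothetical. The regularized loss from the last bullet is included as a one-liner.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_compress(v, e, w_vv, w_ev, w_ve, w_ee, b_v, b_e):
    """One cross & compress layer on item vector v and entity vector e.

    Cross: C = v e^T (d x d). Compress: v' = C w_vv + C^T w_ev + b_v and
    e' = C w_ve + C^T w_ee + b_e. Since C w = v (e . w) and C^T w = e (v . w),
    the d x d matrix never has to be built.
    """
    v_next = [vi * dot(e, w_vv) + ei * dot(v, w_ev) + bi for vi, ei, bi in zip(v, e, b_v)]
    e_next = [vi * dot(e, w_ve) + ei * dot(v, w_ee) + bi for vi, ei, bi in zip(v, e, b_e)]
    return v_next, e_next


def joint_loss(l_rec, l_kge, weights, l2):
    """L_rec + L_kge + lambda * ||w||^2 over a flat list of weights."""
    return l_rec + l_kge + l2 * sum(w * w for w in weights)


# Toy values, invented for illustration.
v, e = [1.0, 2.0], [3.0, 4.0]
v_next, e_next = cross_compress(
    v, e,
    w_vv=[1.0, 0.0], w_ev=[0.0, 1.0],
    w_ve=[1.0, 1.0], w_ee=[0.5, 0.5],
    b_v=[0.0, 0.0], b_e=[0.0, 0.0],
)
print(v_next, e_next)  # [9.0, 14.0] [11.5, 20.0]
```

Stacking this layer `--cross-depth` times would let the recommendation and KG modules exchange features at each level.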
- Create a benchmark split (e.g., 60/20/20) over user-track interactions; report AUC and Accuracy.
- Add ablation flags: --no-cross (remove Cross&Compression), --single-activation (disable multi-activation scoring), --single-pred (disable multi-prediction head), --no-kge (rec only).
- CLI:
  python -m sgr.eval.run --metrics auc,acc
  python -m sgr.eval.ablate --no-cross
- Document gains vs. baselines similar to the paper's tables (AUC/ACC deltas).
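The split and the two metrics can be written without external dependencies, as in this sketch. The function names are hypothetical, the 0.5 accuracy threshold is an assumption, and the pairwise AUC is quadratic in the number of examples, so a real evaluation module would more likely use a rank-based or library implementation.

```python
import random


def split_interactions(interactions, seed=0):
    """Shuffle deterministically and split 60/20/20 into train/val/test."""
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.6)
    n_val = int(len(shuffled) * 0.2)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val], shuffled[n_train + n_val:]


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions, invented for illustration.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.4]
print(accuracy(labels, scores))  # 0.6
print(round(auc(labels, scores), 4))  # 0.6667
```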
- Add a KG build DAG mirroring the paper's stages:
  - Acquisition (SoundCloud metadata),
  - Extraction (entity/rel/tag parsing),
  - Fusion (entity alignment + dedupe),
  - Triples (graphization),
  - Cleaning (QC, missing/invalid fix).
- Surface a single command:
  make build_kg # runs ingest → clean → fuse → triples → validate
- Include a KG_VALIDATION.md with sanity checks (degree dist., top relations, duplicate rate).
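The sanity checks named above could start from a small report function like the one sketched here. The function name and report keys are hypothetical, and degree is counted per triple row, so duplicated triples inflate it until they are removed.

```python
from collections import Counter


def kg_sanity_report(triples):
    """Summarize triples; extra columns such as weight and source are ignored."""
    keys = [(h, r, t) for h, r, t, *_ in triples]
    degree = Counter()
    for head, _, tail in keys:
        degree[head] += 1
        degree[tail] += 1
    return {
        "triples": len(keys),
        "entities": len(degree),
        "top_relations": Counter(r for _, r, _ in keys).most_common(3),
        "max_degree": max(degree.values(), default=0),
        # Share of rows that repeat an earlier (head, relation, tail).
        "duplicate_rate": 1 - len(set(keys)) / len(keys) if keys else 0.0,
    }


# Toy triples with one deliberate duplicate, invented for illustration.
sample_triples = [
    ("track:1", "by_artist", "artist:9"),
    ("track:1", "has_tag", "tag:lofi"),
    ("track:2", "has_tag", "tag:lofi"),
    ("track:2", "has_tag", "tag:lofi"),
]
report = kg_sanity_report(sample_triples)
print(report)
```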
- Extend unveil.py to explain recommendations with:
  - Paths in the KG (e.g., TrackA → in_playlist → P → in_playlist → TrackB),
  - Contribution from co-occurrence vs. KGE proximity vs. tag overlap,
  - Confidence from the multi-prediction head.
- Provide an /explanations endpoint that returns evidence triples and normalized contributions.
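A path explanation of the kind described above can be found with a breadth-first search over the triples, treating edges as undirected so that two tracks in the same playlist connect through it. This is a sketch only: `explain_path`, the `max_hops` default, and the toy triples are assumptions, not part of `unveil.py`.

```python
from collections import deque


def explain_path(triples, start, goal, max_hops=4):
    """Return the shortest chain of evidence triples linking start to goal, or None."""
    adjacency = {}
    for head, relation, tail in triples:
        evidence = (head, relation, tail)
        adjacency.setdefault(head, []).append((tail, evidence))
        adjacency.setdefault(tail, []).append((head, evidence))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # stop extending paths that are already too long
        for neighbor, evidence in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [evidence]))
    return None


# Toy KG, invented for illustration.
kg = [
    ("TrackA", "in_playlist", "P"),
    ("TrackB", "in_playlist", "P"),
    ("TrackB", "by_artist", "ArtistX"),
]
for step in explain_path(kg, "TrackA", "TrackB"):
    print(step)
```

Returning the raw triples rather than just node names keeps the relation and direction of each hop, which is the evidence an explanations response would need.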
- The paper did not publish its KG; it shared methodology and used public datasets for evaluation. Follow that precedent: ship scripts/configs to rebuild the graph from public SoundCloud metadata you're permitted to use, and clearly document rate limits and ToS.
- src/sgr/kge/ (TransE/H/R + trainers + negative sampling + multi-activation scoring).
- src/sgr/model/cross_compress.py (Cross & Compression unit).
- src/sgr/train/joint.py (multi-task loop, alternating updates; --kge-interval).
- src/sgr/eval/ (AUC/ACC, ablations, hyperparam sweeps).
- docs/MMSS_MKR_README.md (math, symbols, and learning schedule overview; small diagrams of Fig.-style blocks).
- KG triples export + validation report
- TransE baseline + embeddings artifact
- Cross&Compression module integrated
- Joint training with multi-prediction & multi-activation scoring
- AUC/ACC metrics + ablation report
- /explanations API returning evidence paths
These steps keep SoundGraph's transparency while adopting the KG + multi-task learning techniques that improved accuracy in the paper, giving you both better recs and auditable reasons for each suggestion.