A from-scratch music discovery engine that finds songs by how they actually sound — using self-supervised audio embeddings and nearest-neighbor search — with an interactive, steerable personalization loop. Built to scratch a real itch: the kind of discovery streaming apps stopped giving me.
- Sonic similarity, not metadata. Every track is embedded into a vector by a music foundation model, so "similar" means it actually sounds alike — independent of genre tags, popularity, or who-else-listened-to-it. It surfaces the obscure deep cut that fits, not just the popular thing everyone already clicks.
- Multi-seed "throughline" search. Give it several songs you love and it finds the shared vibe (the centroid of their embeddings): the un-nameable thread between them. (Spotify once offered a feature like this, then removed it, and a lot of people missed it.)
- Steerable, transparent personalization. Thumb tracks up/down and the list re-ranks live against a model of your taste, with explicit knobs for pocket breadth and explore vs. exploit, plus a "why this track" signal for every result. Discovery you drive, not a black-box feed.
Screenshots: discovery results and live personalization.
iTunes 30s previews ──► decode (PyAV) ──► embed: MusicFM ──► L2-normalized vectors
│
results ◄── MMR diversity + dedup ◄── cosine NN / multi-seed centroid
The discovery engine builds a dense corpus from free iTunes previews, decodes each to a waveform, and embeds it with MusicFM (a self-supervised music foundation model) into a 1024-d vector. Search is cosine nearest-neighbor — single-seed, or the centroid of several seeds for the multi-seed throughline — with MMR for intra-list diversity and embedding-space de-duplication of re-releases/remasters.
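The search step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function names `l2_normalize` and `search` are hypothetical, and it assumes the corpus is an in-memory `(n_tracks, dim)` matrix of L2-normalized embeddings, so a dot product equals cosine similarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit length."""
    x = np.asarray(x, dtype=np.float32)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def search(corpus, seed_ids, k=10):
    """Return (index, cosine) pairs for the k tracks nearest to the seeds.

    corpus   : (n_tracks, dim) array of L2-normalized embeddings
    seed_ids : row indices of one or more seed tracks; several seeds
               are averaged into a centroid and re-normalized
    """
    seed_ids = list(seed_ids)
    query = l2_normalize(corpus[seed_ids].mean(axis=0))
    scores = corpus @ query          # cosine similarity for unit vectors
    scores[seed_ids] = -np.inf       # never return the seeds themselves
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]
```

With a single seed the centroid is just that seed's own vector, so the same function covers both search modes.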
The personalization layer is a stateless FastAPI service over the cached embeddings (CPU-only — no model at serve time): an anchor-set model of the user's taste (seeds + thumbed tracks, recency-decayed), soft-top-k relevance scoring, and MMR-driven exploration so the loop discovers instead of collapsing into an echo chamber.
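For readers unfamiliar with MMR (Maximal Marginal Relevance), here is a rough sketch of the greedy re-ranking it refers to. The function name `mmr_rerank`, the `lam` trade-off parameter, and its default value are assumptions for illustration; the project's own weighting may differ.

```python
import numpy as np

def mmr_rerank(candidates, relevance, k=10, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    candidates : (n, dim) L2-normalized candidate embeddings
    relevance  : (n,) relevance score per candidate (e.g. cosine to the query)
    lam        : 1.0 = pure relevance, 0.0 = pure diversity
    Returns the chosen candidate indices in pick order.
    """
    candidates = np.asarray(candidates, dtype=np.float32)
    relevance = np.asarray(relevance, dtype=np.float32)
    sim = candidates @ candidates.T              # pairwise cosine similarity
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        if selected:
            # Penalize each candidate by its closest already-picked track.
            redundancy = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining), dtype=np.float32)
        mmr = lam * relevance[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lowering `lam` trades relevance for variety, which is one plausible way to keep a feedback loop from collapsing onto near-duplicates of what was already liked.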
This began as a spike to test one question — does audio-embedding similarity actually feel right on real taste? — and the most valuable work turned out to be the decisions, documented as I went:
- Model selection: A/B-tested four embedding models (LAION-CLAP, MuQ, MERT, MusicFM) by ear on a real corpus. Music-specialist models clearly beat the generic audio-text model; I landed on MusicFM: self-hostable, 1024-d, and a match for the best by ear. → docs/model-selection.md
- License & data-provenance due diligence: most strong music models ship non-commercial weights, or train on non-commercial data, even when their code is permissive. I traced the weights and training-data licenses across the whole landscape so the constraints were explicit rather than assumed. → docs/licensing.md
- Product validation: before over-building, I ran market-demand and competitive analysis. The honest finding: demand is real but niche, the space is a graveyard for indie consumer apps, and the incumbent is moving into the exact wedge, so I deliberately scoped this as a research / portfolio project instead of chasing a commercial build. Knowing when not to build is part of the engineering. → docs/market-validation.md
The taste model is an anchor set (seeds + thumbed-up tracks, with recency decay), not a single drifting centroid; candidates are scored by soft-top-k similarity to that set, then MMR keeps the list diverse. Two knobs separate the axes that a single "more like this" slider conflates: pocket breadth (one tight pocket ↔ all your liked pockets) and explore/exploit (familiar ↔ novel). Thumbs-down hard-excludes the exact track.
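One way to read "soft-top-k similarity to an anchor set" is sketched below. Everything here is an assumption made for illustration: the helper names, the half-life form of the recency decay, the softmax temperature, and the choice to scale similarities by the recency weight. How the project's breadth and explore/exploit knobs map onto these parameters is not specified in this README.

```python
import numpy as np

def recency_weights(ages, half_life=20.0):
    """Weight 1.0 for a brand-new anchor, halving every `half_life` feedback events."""
    return 0.5 ** (np.asarray(ages, dtype=np.float32) / half_life)

def soft_top_k_scores(candidates, anchors, weights, k=3, temperature=0.1):
    """Softmax-weighted mean of each candidate's k best (recency-scaled) anchor similarities."""
    sims = (candidates @ anchors.T) * weights      # (n_candidates, n_anchors)
    k = min(k, sims.shape[1])
    top = np.sort(sims, axis=1)[:, -k:]            # k largest values per row
    logits = top / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return (w * top).sum(axis=1)

def rank(candidates, anchors, ages, disliked=(), **kw):
    """Rank candidate indices by taste score, dropping thumbed-down tracks."""
    scores = soft_top_k_scores(candidates, anchors, recency_weights(ages), **kw)
    scores[np.asarray(list(disliked), dtype=int)] = -np.inf   # hard exclusion
    order = np.argsort(-scores)
    return [int(i) for i in order if np.isfinite(scores[i])]
```

With `k=1` a candidate is scored by its single closest anchor; larger `k` rewards candidates that sit near several anchors at once. The ranked list would then go through MMR for diversity.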
The honest ceiling: feedback can only personalize within what the audio embedding expresses. If part of why you love a song isn't sonic, no amount of content-based feedback reaches it — that needs lyrics/metadata or collaborative filtering (which needs a crowd a solo project doesn't have). The spike made that limit visible, which is the point of a spike.
Python · NumPy · PyTorch (MusicFM inference) · FastAPI · vanilla JS · PyAV · iTunes Search API · per-model embedding cache + cosine NN over a ~16k-track corpus.
# setup (GPU box for embedding; the web app itself is CPU-only)
./setup.ps1
# build the corpus + embeddings
./run.ps1
# launch the interactive personalization app -> http://localhost:8000
./run_app.ps1

See docs/running.md for details (Windows/CUDA notes, model swap, knobs).
- A quick embedding-layer sweep (MusicFM layer 7 is the current default) and multi-window pooling per track.
- A larger, denser catalog and an ANN index (FAISS / pgvector) to scale past the in-memory matrix.
- A learned global re-ranking head on top of the frozen backbone — trainable once there's aggregate feedback — as the bridge between the content-only ceiling and real personalization.
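As a rough idea of what the multi-window pooling item could look like: slice each preview into overlapping windows, embed each one, then mean-pool and re-normalize. This is only a sketch under stated assumptions; `embed_fn` is a hypothetical stand-in for the model's forward pass, and mean pooling is just one of several reasonable choices.

```python
import numpy as np

def window_starts(n_samples, window, hop):
    """Start offsets of fixed-length windows that fit fully inside the clip."""
    if n_samples < window:
        return [0]                      # clip shorter than one window: use it whole
    return list(range(0, n_samples - window + 1, hop))

def pool_track(waveform, embed_fn, window, hop):
    """Embed each window, mean-pool, and L2-normalize into one track vector.

    embed_fn : hypothetical callable mapping a 1-D waveform slice to a 1-D embedding
    """
    starts = window_starts(len(waveform), window, hop)
    vecs = np.stack([embed_fn(waveform[s:s + window]) for s in starts])
    pooled = vecs.mean(axis=0)
    return pooled / max(float(np.linalg.norm(pooled)), 1e-12)
```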
A research / portfolio project. Uses third-party models under their respective licenses; not a commercial product. Built by Tim Song — my first end-to-end ML system, as part of moving from backend into AI engineering.

