Skip to content

zhuconv/chorus

Repository files navigation

Chorus

Tired of stacking conference lectures, TED talks, and deep 1-on-1 interviews into a Watch Later you'll never actually finish? Chorus is for you — drop in the URL, walk away with a 5-minute two-voice short where a host narrates and the speaker themselves quotes verbatim.

Or hand it a name — "Dale Schuurmans", "Mo Gawdat" — and Chorus searches YouTube, picks 3–5 talks, and stitches a chronological audio biography that tracks how the person's views evolved era by era. Either mode also gives you themed ~10-minute long chapters alongside the short.

Claude Opus 4.6 agents write it; a critic agent grep -Fs every quoted line back into the raw transcript and blocks the pipeline if a single word drifts. Kokoro 82M renders it locally in two voices. Every quote in the script links back to the exact second on YouTube. Runs on your Claude Pro/Max subscription — no ANTHROPIC_API_KEY needed.

License MIT Python 3.11+ Built with Claude TTS Kokoro 82M yt-dlp

Listen first

A 4.9-minute short the pipeline produced from a 64-minute RLC 2025 keynote by Dale Schuurmans, Language Models and Computation. No human edits inside the run.

demo.mp4

🎧 Download the MP3 · 📄 Read the annotated script

The .md is the human-readable script — every quote is a blockquote with a clickable YouTube timestamp. The MP3 is what Kokoro did with the underlying TTS-tagged source.

Why this is not another summarizer

1. A critic agent audits every quote. Every line tagged [DALE] (or whatever the subject's first name is) must be a byte-for-byte substring of the raw transcript. The transcript-critic agent runs grep -F against transcript.txt and blocks the pipeline if even one word drifts. Same idea cross-source for name mode via corpus-script-critic.

2. The host and the subject sound different. Kokoro renders narration in one voice and quotes in another, paired by subject gender. "What the host says about Dale" and "what Dale actually said" are audibly distinct.

3. Source-linked Markdown is the published artifact. Every script ships as script.md — each verbatim quote becomes a blockquote with a deep-linked YouTube timestamp. Click → land on the exact second. The raw [HOST]/[DALE] script.txt is the TTS source; the .md is what humans read.

> [**12:37**](https://www.youtube.com/watch?v=YnMqbpdHcaY&t=757s) · _ICAPS 2024 Keynote_ · 2024-07-02
>
> there was this sense in which this started to look to me at least like an actual computer where you you were you know providing a problem...

4. No fabricated motivation. When a thinker's view shifts but they never said why, the era-aggregator marks it honestly and the corpus-script-critic blocks any chapter that invents a reason.

5. Runs off your Claude subscription. No API key, no billing dashboard. The Agent SDK shells out to your local claude binary — see Subscription auth below for the one-line trick that guarantees this.

Two modes

URL mode — one video → short + chapters

uvx chorus "https://www.youtube.com/watch?v=yGLoWZP1MyA"

A transcript-investigator extracts (theme, thesis, verbatim_quote, timestamp) tuples. The transcript-critic greps each quote. The opinion-aggregator clusters into 3–5 themes. The script-writer drafts a ~5-minute short plus one ~10-minute chapter per theme. The script-critic verifies. chorus-annotate produces script.md sidecars. Kokoro renders.

Name mode — one name → chronological audio biography

uvx chorus "Dale Schuurmans" --max-videos 5

This is the more ambitious mode. Chorus:

  1. Searches YouTube (chorus-find) across 6 keyword templates — keynote, interview, podcast, talk, lecture, fireside — applies a 20-min duration filter, dedupes, enriches with upload dates → candidates.json.
  2. Filters to formal talks BY the subject (interview-finder) — drops commentary, clips, wrong-person matches → videos.json.
  3. Investigates every video in paralleltranscript-investigator + transcript-critic loops fan out via asyncio.gather.
  4. Builds a chronological era structure (era-aggregator) — clusters opinions into 2–4 eras, identifies transitions with before/after quotes, surfaces stable themes that span every era. Only agent that sees more than one talk at once. Refuses to invent why a view shifted.
  5. Writes a chronological script (corpus-script-writer) — 5-minute overview plus one ~10-minute chapter per era. Each era chapter opens with a transition passage from the previous era.
  6. Cross-source critic (corpus-script-critic) — verbatim-grep against every per-video transcript, and a hard fabricated_motivation check on every transition opening.
  7. Annotatesscript.md + sources.json for every script (deterministic Python via chorus-annotate).
  8. Renders — Kokoro emits one MP3 per script.

You get an overview MP3 plus one MP3 per era. Re-run any time — --skip-search reuses the curated videos.json; finished stages resume from disk.

Quickstart

Pick whichever install style matches how you work:

One-shot: uvx

brew install espeak-ng ffmpeg
npm install -g @anthropic-ai/claude-code && claude    # one-time login
uvx chorus "https://www.youtube.com/watch?v=<ID>"     # URL mode
uvx chorus "Dale Schuurmans"                          # name mode
uvx chorus-doctor                                     # self-check deps

Inside Claude Code: install as a plugin

/plugin install zhuconv/chorus
/chorus https://www.youtube.com/watch?v=<ID>
/chorus "Dale Schuurmans" --max-videos 5

Ships all 9 agents plus the /chorus skill. Requires the chorus Python package (uv tool install chorus or pip install chorus).

Traditional clone-and-uv

brew install espeak-ng ffmpeg
npm install -g @anthropic-ai/claude-code && claude
git clone https://github.com/zhuconv/chorus && cd chorus
uv sync
uv run chorus "https://www.youtube.com/watch?v=<ID>"
open output/<Subject>/<date>__<id>__<slug>/short/short.mp3

Pipeline

Chorus pipeline: YouTube URL → fetch → investigator → transcript-critic → aggregator → script-writer → script-critic → Kokoro TTS → MP3s. Critics can retry their upstream agent up to 2x.

Every box is one claude_agent_sdk.query() call. System prompts live in agents/<role>.md — no code edits required to tweak behavior. Critics can send work back upstream for one retry; after that the orchestrator surfaces the failure.

The 9 agents:

Agent Mode Job
transcript-investigator both Extract (theme, thesis, verbatim_quote, timestamp) tuples.
transcript-critic both grep -F every quote against the transcript. Blocks on drift.
opinion-aggregator URL Cluster into 3–5 themes; assign subject_gender.
script-writer URL Draft short + per-theme chapter scripts.
script-critic URL Verbatim fidelity, TTS style, word counts.
interview-finder name Filter raw search to formal talks BY the subject.
era-aggregator name Build 2–4 chronological eras + transitions. Never fabricates motivation.
corpus-script-writer name Chronological overview + per-era chapters with transition openings.
corpus-script-critic name Cross-source verbatim grep + fabricated_motivation block.

Case studies

Name mode — Dale Schuurmans, six years of changing his mind

A single command pulled in 3 talks spanning 2019–2025 and produced 4 MP3s totaling ~35 minutes:

File Era Source talk
short/overview.mp3 All three Chronological overview
long/era01_subbasement.mp3 2019 — Optimization in RL DLRLSS 2019
long/era02_llms_as_computers.mp3 2024 — LLMs as a new kind of computer ICAPS 2024 Keynote
long/era03_hard_limits.mp3 2025 — Computational impossibility results RLC 2025 keynote

🎧 Listen to the chronological overview · 📄 Read the annotated script (every quote has a clickable YouTube timestamp)

The era-aggregator identified one stable theme running through all three eras — "Think computationally, not statistically" — supported by verbatim quotes from each talk. That theme becomes the through-line of the overview script.

URL mode — same pipeline, two genres

Source Length Kind Result
Dale Schuurmans — RLC 2025 keynote, Language Models and Computation 64 min Academic lecture 5 MP3s, ~42 min total. Zero human edits inside the pipeline. Listen.
Mo Gawdat — interview on AI, UBI, and the job market 40 min Pop interview 5 MP3s, ~29 min total. Validated end-to-end.

Two genres, same pipeline, coherent output.

Output layout

URL mode

output/
└── Dale_Schuurmans/
    └── 2025-08-25__yGLoWZP1MyA__dale_schuurmans_language_models/
        ├── metadata.json
        ├── transcript.srt / transcript.txt
        ├── opinions.raw.json              # investigator
        ├── transcript_critic_report.json
        ├── opinions.json                  # aggregator — themes, subject_gender
        ├── script_critic_report.json
        ├── short/
        │   ├── script.txt                 # raw [HOST]/[DALE] tagged TTS source
        │   ├── script.md                  # human-readable, source-linked
        │   ├── sources.json
        │   └── short.mp3
        └── long/
            ├── ch01_<slug>_script.txt     # ~1500 words / ~10 min per chapter
            ├── ch01_<slug>.md             # source-linked sidecar
            ├── ch01_<slug>_sources.json
            ├── ch01_<slug>.mp3
            └── ...

Name mode

output/
└── Dale_Schuurmans/
    ├── _corpus/
    │   ├── candidates.json              # chorus-find raw search
    │   ├── videos.json                  # interview-finder verdict
    │   ├── opinions_index.json
    │   ├── evolution.json               # era-aggregator — eras, transitions, stable themes
    │   ├── script_critic_report.json
    │   └── <date>__<video_id>__<slug>/  # per-video artifacts
    ├── short/
    │   ├── script.txt
    │   ├── script.md                    # chronological overview, source-linked
    │   ├── sources.json
    │   └── overview.mp3
    └── long/
        ├── era01_<slug>_script.txt      # one chapter per era
        ├── era01_<slug>.md
        ├── era01_<slug>_sources.json
        ├── era01_<slug>.mp3
        └── ...

Source-linked Markdown sidecars

Every run produces a script.md next to script.txt plus sources.json. Each verbatim quote is a blockquote with a deep-linked YouTube timestamp:

> [**1:03:19**](https://www.youtube.com/watch?v=yGLoWZP1MyA&t=3799s) · _Dale Schuurmans, Language Models and Computation_ · 2025-08-25
>
> machine learning is awesome. Reinforcement learning even more so, but computer science matters. Especially when you're trying to train LLMs to to serve a whole range of problem instances. You are now confronted with the laws of computation.

Click the timestamp → YouTube jumps to that exact second. Each [HOST] paragraph carries the nearest preceding quote's video as a "near" attribution in sources.json, so downstream tools can highlight which talk a paraphrase passage draws from.

Run chorus-annotate on any existing run to regenerate sidecars without touching the agent pipeline.

Subscription auth

The Agent SDK shells out to your local claude binary, which inherits Pro/Max OAuth from claude login. To make sure that path is taken even if you have an ANTHROPIC_API_KEY exported in your shell, every query() call uses:

ClaudeAgentOptions(env={"ANTHROPIC_API_KEY": ""})

The SDK merges this dict last when building the subprocess env. The clobbered key is empty, so the claude CLI falls back to its OAuth-stored subscription auth. The orchestrator prints a note: at startup if a key was detected — so you know it was bypassed, not used.

Script format

[HOST] Narration spoken by the host voice.
[DALE] "A verbatim direct quote in quotation marks."
[HOST] More narration.
  • [HOST] — narrator voice. All framing and paraphrase.
  • [<FIRSTNAME>] — subject voice. Verbatim quotes only, wrapped in "…". Mo Gawdat → [MO]. Naval Ravikant → [NAVAL].
  • Blank line = paragraph break (longer audio pause).

This .txt is the TTS source Kokoro reads. Humans read the .md sidecar that chorus-annotate generates next to it.

Two Kokoro voices, paired by subject gender so the narrator and subject are always audibly different:

Subject gender Host voice Quote voice
Male af_heart am_puck
Female am_puck af_heart

Override with --host-voice / --quote-voice on chorus-tts.

Single-stage CLIs

Every stage is a standalone command:

uv run chorus-fetch <URL> -o output/                          # transcript only
uv run chorus-find "<name>" -o output/<Subject>/_corpus/      # YouTube search only
uv run chorus-tts path/to/script.txt -o out.mp3 \             # TTS from any tagged script
                  --subject-gender male
uv run chorus-annotate output/<Subject>/                      # regenerate .md sidecars
uv run chorus-doctor                                          # dependency self-check

Flags

uv run chorus <URL_or_name> [--model claude-opus-4-6]
                            # URL mode:
                            [--skip-fetch]         # reuse existing transcript
                            # name mode:
                            [--max-videos 5]       # cap on videos selected
                            [--min-duration 1200]  # min seconds per candidate
                            [--skip-search]        # reuse candidates.json + videos.json
                            # both:
                            [--skip-synth]         # stop before Kokoro

Setup in detail

Expand for prerequisites, Python, auth

System (macOS tested):

brew install espeak-ng ffmpeg

Claude Code CLI — the Agent SDK spawns this and inherits its login. No ANTHROPIC_API_KEY is used anywhere in the orchestrator. Even if you have one exported, the orchestrator clobbers it inside the subprocess env via ClaudeAgentOptions(env={"ANTHROPIC_API_KEY": ""}), so every agent turn bills against your Pro/Max subscription. A note: is printed at run start if a key is detected.

npm install -g @anthropic-ai/claude-code
claude        # one-time login; Pro/Max covers all agent calls

Python 3.11–3.12:

uv sync       # installs yt-dlp, kokoro, claude-agent-sdk, torch, etc.

First Kokoro run downloads ~330 MB of model weights from Hugging Face. After that it runs fully offline on CPU or Apple Silicon MPS.

Customizing the agents

Agent behavior lives in markdown under agents/ at the repo root (with a .claude/agents symlink so in-repo Claude Code sessions also see them). The YAML frontmatter names the role and declares allowed tools; the body is the system prompt. Next run picks up your changes — no code edit required.

Useful tweaks:

  • Loosen script-critic / corpus-script-critic acronym-expansion for technical audiences (they currently nag "AI" and "LLM" on every occurrence).
  • Change chapter-count target in opinion-aggregator (defaults to 3–5 based on content density).
  • Tighten transcript-investigator density targets for longer sources.
  • Change the search keyword menu in src/chorus/youtube_search/find.py::DEFAULT_QUERIES or raise DEFAULT_PER_QUERY if the finder needs a bigger pool.
  • Tune the era boundaries in era-aggregator.md — 2–4 eras is the default envelope; widen if you're processing someone with a 30-year corpus.

Known limitations

  • Name-mode runtime is long. A 5-video corpus takes ~20 min of yt-dlp work plus ~20–30 min of parallel agent wall-clock per video. Expect 1–2 hours end-to-end.
  • URL-mode runtime is also long. A 60-min source takes ~45–60 min of agent wall-clock plus ~6 min of Kokoro synth on Apple Silicon. The investigator alone runs 8–15 min because it grep-verifies every quote.
  • Critic warnings are noisy. Many warn-level issues (acronyms, homographs) don't actually degrade audio. Treat the verdict as advisory unless it's block.
  • YouTube search quality is coarse. For less-indexed academics or people with common names, the interview-finder agent excludes aggressively — you'll get 3 videos when you asked for 5. Pass a more specific name ("Dale Schuurmans University of Alberta") to tighten matches.
  • No motivation fabrication. The era-aggregator and corpus-script-critic jointly refuse to invent why a view changed when the subject never said. Honest, but the script will sometimes say "he has not said publicly why this shifted."
  • English only. Kokoro supports other languages but the voice conventions and script rules are tuned for English.
  • macOS-tested. Dependencies (espeak-ng, ffmpeg, torch) exist on Linux and WSL, but I haven't verified the full pipeline there.

Roadmap

  • --subject-name-hint / --institution flags to disambiguate same-name people at the finder stage.
  • Relax script-critic / corpus-script-critic acronym strictness automatically for technical audiences.
  • Optional --style preset for the script writers (editorial / narrative / lecture-notes).
  • Parallel chapter synthesis in chorus-tts (currently sequential per script).
  • Cached candidates.json warmup — fall back to a stale cache with a warning when --skip-search is set but no cache exists.
  • Publish to PyPI so uvx chorus works without cloning.

Repo map

chorus/
├── pyproject.toml                 # deps + 6 CLI entry points
├── .claude-plugin/plugin.json     # Claude Code plugin manifest
├── agents/                        # 9 markdown agent specs (editable)
│   ├── transcript-investigator.md
│   ├── transcript-critic.md
│   ├── opinion-aggregator.md
│   ├── script-writer.md
│   ├── script-critic.md
│   ├── interview-finder.md        # name mode: filter search to formal talks
│   ├── era-aggregator.md          # name mode: chronological clustering + transitions
│   ├── corpus-script-writer.md    # name mode: per-era chapters with transition openings
│   └── corpus-script-critic.md    # name mode: cross-source verbatim + fabrication checks
├── skills/chorus/SKILL.md         # /chorus slash-command spec
├── .claude/                       # symlinks to agents/ and skills/ for in-repo dev
├── src/chorus/
│   ├── orchestrator.py            # `chorus` CLI — URL + name mode driver
│   ├── youtube_fetcher/fetch.py   # `chorus-fetch` — yt-dlp transcript fetcher
│   ├── youtube_search/find.py     # `chorus-find` — yt-dlp name-based search
│   ├── kokoro_tts/synthesize.py   # `chorus-tts` — two-voice Kokoro wrapper
│   ├── annotate/markdown.py       # `chorus-annotate` — .md sidecar generator
│   └── doctor.py                  # `chorus-doctor` — dependency self-check
├── examples/                      # Dale Schuurmans short (mp3 + md + txt)
└── output/                        # per-video + per-subject artifacts (gitignored)

License

MIT for the Chorus code. Kokoro model weights are Apache-2.0. You are responsible for complying with YouTube's Terms of Service for any transcripts you fetch.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages