stdbKit

Webhook-first toolkit for indexing YouTube subtitle dialogue and searching by phrase.

stdbKit watches YouTube playlists, downloads or generates subtitles for each video, stores timed dialogue segments in SQLite, and exposes phrase search with deep links back to the exact moment in the video.

It is domain-agnostic: any YouTube content with spoken dialogue and captions (or Whisper transcription) can be indexed. Domain-specific metadata — film titles, course names, tags, UI — belongs in your application layer on top of stdbKit.

What it does

Capability	Description
Playlist sync	Poll watched playlists and detect added/removed videos
Subtitle fetch	Download manual or auto-generated captions via yt-dlp (VTT)
Subtitle generation	Optional local Whisper transcription when YouTube has no captions
Timed storage	Persist subtitle segments with start/end timestamps per video
Phrase search	Full-text search (SQLite FTS5) over indexed dialogue
Deep links	Search results include a YouTube URL with timestamp
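
To make the "Subtitle fetch" and "Timed storage" rows more concrete, here is a minimal sketch of turning WebVTT cues into timed segments. It is not stdbKit's actual parser: `Segment` and `parse_vtt` are hypothetical names, and the sketch ignores cue settings, inline styling tags, and the rolling duplicate lines that auto-generated captions often contain.

```python
import re
from dataclasses import dataclass

# Matches a WebVTT timing line such as "00:00:01.000 --> 00:00:03.500".
# The hours field is optional in WebVTT, so "00:01.000" is also accepted.
_TIMING = re.compile(
    r"^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


@dataclass
class Segment:  # hypothetical name, not stdbKit's model
    start_sec: float
    end_sec: float
    text: str


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(content: str) -> list[Segment]:
    """Collect cue timings and text; header, NOTE blocks, and cue IDs are skipped."""
    segments = []
    lines = content.splitlines()
    i = 0
    while i < len(lines):
        match = _TIMING.match(lines[i].strip())
        i += 1
        if not match:
            continue
        groups = match.groups()
        text_lines = []
        # Cue text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            text_lines.append(lines[i].strip())
            i += 1
        text = " ".join(text_lines)
        if text:
            segments.append(
                Segment(_to_seconds(*groups[:4]), _to_seconds(*groups[4:]), text)
            )
    return segments
```

Each resulting segment carries the start and end time needed to build a timestamped link later.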

How it works

YouTube does not send push notifications when a video is added to a playlist. Its PubSubHubbub feed only covers channel uploads and metadata updates, so playlist changes have to be detected by polling.

stdbKit separates concerns:

PlaylistRelay — the only component that polls YouTube (default: every 30s)
WebhookServer — receives POST /webhooks/playlist-video and indexes subtitles
SQLite + FTS5 — stores videos, segments, and searchable dialogue

Typical flow:

Register playlist → Relay detects new video → Webhook fires → Subtitles indexed → Phrase search
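
The "Relay detects new video" step boils down to comparing the video IDs seen on the previous poll with the IDs returned by the current one. The following is a minimal sketch of that comparison, not the relay's actual code; `PlaylistDiff` and `diff_playlist` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PlaylistDiff:  # hypothetical name
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)


def diff_playlist(known_ids: list[str], current_ids: list[str]) -> PlaylistDiff:
    """Compare stored video IDs with the latest poll result, keeping list order."""
    known, current = set(known_ids), set(current_ids)
    return PlaylistDiff(
        added=[vid for vid in current_ids if vid not in known],
        removed=[vid for vid in known_ids if vid not in current],
    )


diff = diff_playlist(["a", "b", "c"], ["b", "c", "d", "e"])
print(diff.added, diff.removed)
```

In this model, each added ID would trigger a POST to /webhooks/playlist-video and each removed ID a POST to /webhooks/playlist-removed.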

Requirements

Python 3.12+
yt-dlp on your PATH
ffmpeg on your PATH (only for generated subtitles)

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Optional: Whisper transcription when YouTube has no captions
pip install -e ".[transcribe]"

Quickstart

# Create database
stdbkit init data/stdbkit.db

# Register one or more YouTube playlists to watch
stdbkit watch add "https://www.youtube.com/playlist?list=PLxxxx" --db data/stdbkit.db

# Run relay + webhook server together
stdbkit daemon --db data/stdbkit.db --interval 30 --port 8787

When a new video appears in a watched playlist, the relay posts a webhook, subtitles are indexed, and the dialogue becomes searchable.

CLI

stdbkit init data/stdbkit.db
stdbkit watch add <playlist-url> --db data/stdbkit.db
stdbkit daemon --db data/stdbkit.db
stdbkit search "exact phrase from the video" --db data/stdbkit.db
stdbkit status --db data/stdbkit.db
stdbkit index-pending --db data/stdbkit.db
stdbkit index-pending --db data/stdbkit.db --generate-subs

Generated subtitles (Whisper)

When YouTube has no captions, stdbKit can download the audio and transcribe it locally with faster-whisper, a CTranslate2-based reimplementation of OpenAI Whisper that runs on CPU or GPU and supports multiple languages.

pip install -e ".[transcribe]"

# Enable for daemon / webhook server / re-indexing
stdbkit daemon --db data/stdbkit.db --generate-subs
stdbkit index-pending --db data/stdbkit.db --generate-subs

# Override model: --whisper-model medium

With --whisper-model auto (default), stdbKit picks:

Hardware	Model	Precision	Notes
NVIDIA GPU (e.g. RTX 3060 12 GB)	`large-v3`	`float16`	Best quality/speed balance on GPU
CPU only	`base`	`int8`	Fallback without CUDA

GPU transcription requires a CUDA-enabled ctranslate2 build (installed automatically with faster-whisper when drivers are present).
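
The selection rule in the table above can be sketched as a small function. This is an illustration only: `pick_whisper_settings` is a hypothetical name, CUDA detection is passed in as a flag rather than probed, and the assumption that an explicit model override still takes its precision from the hardware is ours, not something the README states.

```python
def pick_whisper_settings(
    requested: str = "auto", cuda_available: bool = False
) -> tuple[str, str]:
    """Return (model, compute_type) following the hardware table above."""
    if requested != "auto":
        model = requested  # explicit override, e.g. --whisper-model medium
    else:
        model = "large-v3" if cuda_available else "base"
    # Assumption: precision depends only on the hardware, not on the model.
    compute_type = "float16" if cuda_available else "int8"
    return model, compute_type


print(pick_whisper_settings())                      # CPU-only defaults
print(pick_whisper_settings(cuda_available=True))   # GPU defaults
```

The returned pair maps onto the model name and `compute_type` arguments that faster-whisper's `WhisperModel` accepts.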

Python API:

from stdbkit import StdbKit
kit = StdbKit("data/stdbkit.db", generate_subtitles=True)  # auto → large-v3 on GPU

Generated tracks are stored with subtitle_source="generated". YouTube manual/auto captions are always preferred when available.

Split processes

# Terminal 1
stdbkit serve --db data/stdbkit.db --port 8787

# Terminal 2
stdbkit relay --db data/stdbkit.db --callback http://127.0.0.1:8787 --interval 30

Python API

from stdbkit import StdbKit

kit = StdbKit("data/stdbkit.db")
kit.watch_playlist("https://www.youtube.com/playlist?list=PLxxxx")
kit.run_daemon(interval_sec=30, port=8787)

results = kit.search_phrase("exact phrase from the video")
for result in results:
    print(result.matched_text, result.youtube_url_with_timestamp)

See also examples/daemon.py and examples/minimal_search.py.

Webhook contract

POST /webhooks/playlist-video

{
  "playlist_id": "PLxxxx",
  "youtube_id": "abc123",
  "title": "Video title from YouTube",
  "position": 12,
  "url": "https://www.youtube.com/watch?v=abc123"
}

POST /webhooks/playlist-removed

{
  "playlist_id": "PLxxxx",
  "youtube_id": "abc123"
}

If a secret is configured, include it in the X-StdbKit-Secret header of every request.

External systems (Zapier, Google Apps Script, custom scripts) can call the same endpoints. The playlist must be registered with stdbkit watch add before webhooks for that playlist are accepted.

Video lifecycle

Status	Meaning
`discovered`	Webhook received, not indexed yet
`indexing`	Subtitles downloading or generating
`ready`	Searchable by phrase
`no_dialogue`	No speech detected (e.g. music-only); not searchable
`failed`	Subtitle fetch/generation failed; retry with `index-pending`
`removed_from_playlist`	Removed from a watched playlist
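
An application built on stdbKit may want to reason about these statuses as a small state machine. The sketch below uses the status strings from the table, but the transition map is an assumption inferred from the descriptions (for example, `failed` returning to `indexing` when `index-pending` retries). stdbKit itself may allow other transitions, and `VideoStatus` and `can_transition` are hypothetical names.

```python
from enum import Enum


class VideoStatus(str, Enum):  # hypothetical name; values match the table
    DISCOVERED = "discovered"
    INDEXING = "indexing"
    READY = "ready"
    NO_DIALOGUE = "no_dialogue"
    FAILED = "failed"
    REMOVED = "removed_from_playlist"


S = VideoStatus

# Assumed transitions, inferred from the status descriptions above.
ALLOWED: dict[VideoStatus, set[VideoStatus]] = {
    S.DISCOVERED: {S.INDEXING, S.REMOVED},
    S.INDEXING: {S.READY, S.NO_DIALOGUE, S.FAILED, S.REMOVED},
    S.FAILED: {S.INDEXING, S.REMOVED},  # retry via index-pending
    S.READY: {S.REMOVED},
    S.NO_DIALOGUE: {S.REMOVED},
    S.REMOVED: set(),
}


def can_transition(current: VideoStatus, target: VideoStatus) -> bool:
    """Return True if the assumed lifecycle permits moving to the target status."""
    return target in ALLOWED[current]
```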

Scope and limitations

stdbKit intentionally stays small. These are the current boundaries:

YouTube only — indexing uses yt-dlp against YouTube URLs/IDs
Playlist-driven ingestion — the built-in relay watches registered playlists; there is no channel-wide watcher or single-video CLI command
Subtitle dialogue search — FTS runs over timed caption text, not video titles, channels, or custom metadata (use the metadata JSON field and your own app for that)
Speech expected — videos without dialogue are marked no_dialogue and skipped from search results
Captions or Whisper — without YouTube subtitles and without --generate-subs, indexing fails
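
To illustrate what "FTS over timed caption text" means in practice, here is a minimal sketch using Python's built-in sqlite3 module with an FTS5 virtual table. The table and column names are hypothetical and stdbKit's real schema may differ. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table and load (youtube_id, start_sec, text) rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the caption text is indexed for search.
    conn.execute(
        "CREATE VIRTUAL TABLE segments "
        "USING fts5(youtube_id UNINDEXED, start_sec UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", rows)
    return conn


def search_phrase(conn, phrase):
    """Return (matched_text, timestamped_url) pairs for an exact phrase."""
    # Double quotes make FTS5 treat the input as one phrase;
    # embedded quotes are escaped by doubling them.
    query = '"' + phrase.replace('"', '""') + '"'
    cursor = conn.execute(
        "SELECT youtube_id, start_sec, text FROM segments "
        "WHERE segments MATCH ? ORDER BY rank",
        (query,),
    )
    return [
        (text, f"https://www.youtube.com/watch?v={vid}&t={int(float(start))}s")
        for vid, start, text in cursor
    ]


conn = build_index(
    [
        ("abc123", 12.5, "I have a bad feeling about this"),
        ("abc123", 40.0, "this feeling is bad"),
    ]
)
results = search_phrase(conn, "bad feeling")
print(results)
```

Only the first row matches, because a phrase query requires the words to appear adjacent and in order.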

Example use cases

These are illustrative patterns, not built-in modes:

Curated clip libraries — a playlist of short excerpts; search finds the line and jumps to the timestamp
Lecture or course archives — index a playlist of talks and search for a concept mentioned in the audio
Podcast / interview collections — same pipeline; Whisper fills in when auto-captions are missing
Internal tooling — run the daemon, expose search via your own UI or API using the Python library

In all cases the workflow is the same: register playlists → index subtitles → search phrases → open the timestamped YouTube link.

Building on top

stdbKit stores generic video fields (title, channel, duration_sec, thumbnail_url) and an extensible metadata JSON blob per video. Anything beyond subtitle search — custom taxonomies, auth, web UI, enrichment — lives in a separate application that installs stdbKit as a dependency.

Project layout

stdbkit/
  db/          SQLite schema + persistence
  relay/       Playlist polling + diff + webhook dispatch
  webhooks/    HTTP server + event handler
  subtitles/   yt-dlp fetch + VTT parsing + indexing
  search/      FTS phrase search
  kit.py       Public facade
  cli.py       Typer CLI
examples/      Minimal Python usage samples

Development

pytest

License

MIT
