Arzuparreta/stdbKit

stdbKit

Webhook-first toolkit for indexing YouTube subtitle dialogue and searching by phrase.

stdbKit watches YouTube playlists, downloads or generates subtitles for each video, stores timed dialogue segments in SQLite, and exposes phrase search with deep links back to the exact moment in the video.

It is domain-agnostic: any YouTube content with spoken dialogue and captions (or Whisper transcription) can be indexed. Domain-specific metadata — film titles, course names, tags, UI — belongs in your application layer on top of stdbKit.

What it does

Capability           Description
Playlist sync        Poll watched playlists and detect added/removed videos
Subtitle fetch       Download manual or auto-generated captions via yt-dlp (VTT)
Subtitle generation  Optional local Whisper transcription when YouTube has no captions
Timed storage        Persist subtitle segments with start/end timestamps per video
Phrase search        Full-text search (SQLite FTS5) over indexed dialogue
Deep links           Search results include a YouTube URL with timestamp
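
To make the timed storage, phrase search, and deep link capabilities concrete, here is a minimal sketch using only Python's built-in sqlite3 module. It is not stdbKit's actual schema: the table name, the column names, and the `&t=<seconds>s` link format are assumptions chosen for illustration.

```python
import sqlite3


def build_index(segments):
    """Create an in-memory FTS5 table and load (video_id, start_sec, text) rows.

    Table and column names are hypothetical, not stdbKit's real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE segments "
        "USING fts5(video_id UNINDEXED, start_sec UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", segments)
    return conn


def search_phrase(conn, phrase):
    """Return (text, deep_link) pairs for segments containing the exact phrase."""
    # Wrapping the input in double quotes makes FTS5 treat it as one phrase;
    # embedded double quotes are escaped by doubling them.
    query = '"' + phrase.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT video_id, start_sec, text FROM segments "
        "WHERE segments MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [
        # Assumed link format: whole seconds in the `t` query parameter.
        (text, f"https://www.youtube.com/watch?v={vid}&t={int(float(start))}s")
        for vid, start, text in rows
    ]
```

Called with two segments of the same video, `search_phrase(conn, "quick brown")` returns only the segment where the two words appear consecutively and in that order, together with a link to its start time.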

How it works

YouTube does not send push notifications when a video is added to a playlist; PubSubHubbub (WebSub) only covers channel uploads and metadata updates. Playlist changes therefore have to be detected by polling.

stdbKit separates concerns:

  1. PlaylistRelay — the only component that polls YouTube (default: every 30s)
  2. WebhookServer — receives POST /webhooks/playlist-video and indexes subtitles
  3. SQLite + FTS5 — stores videos, segments, and searchable dialogue

Typical flow:

Register playlist → Relay detects new video → Webhook fires → Subtitles indexed → Phrase search
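
The "relay detects new video" step comes down to comparing two snapshots of a playlist. The sketch below shows one way to compute that diff in plain Python. `PlaylistEvent` and `diff_playlist` are hypothetical names, not part of stdbKit's API, and the 0-based `position` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaylistEvent:
    kind: str                       # "added" or "removed"
    playlist_id: str
    youtube_id: str
    position: Optional[int] = None  # 0-based index in the playlist (assumed)


def diff_playlist(playlist_id, previous_ids, current_ids):
    """Compare two ordered snapshots of a playlist and return change events."""
    previous = set(previous_ids)
    current = set(current_ids)
    events = []
    # Videos present now but not in the last snapshot were added.
    for position, vid in enumerate(current_ids):
        if vid not in previous:
            events.append(PlaylistEvent("added", playlist_id, vid, position))
    # Videos in the last snapshot that are gone now were removed.
    for vid in previous_ids:
        if vid not in current:
            events.append(PlaylistEvent("removed", playlist_id, vid))
    return events
```

Each "added" event would then be posted to /webhooks/playlist-video and each "removed" event to /webhooks/playlist-removed. Keeping the diff a pure function makes it testable without touching YouTube.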

Requirements

  • Python 3.12+
  • yt-dlp on your PATH
  • ffmpeg on your PATH (only for generated subtitles)

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Optional: Whisper transcription when YouTube has no captions
pip install -e ".[transcribe]"

Quickstart

# Create database
stdbkit init data/stdbkit.db

# Register one or more YouTube playlists to watch
stdbkit watch add "https://www.youtube.com/playlist?list=PLxxxx" --db data/stdbkit.db

# Run relay + webhook server together
stdbkit daemon --db data/stdbkit.db --interval 30 --port 8787

When a new video appears in a watched playlist, the relay posts a webhook, subtitles are indexed, and the dialogue becomes searchable.
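
Indexing starts from the VTT file that yt-dlp downloads. As a rough illustration of turning that file into timed segments, the sketch below parses cue timings with a regular expression. `parse_vtt` is a hypothetical helper, not stdbKit's parser, and it makes no attempt to merge the rolling duplicate lines that YouTube auto-captions often contain.

```python
import re

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm"; the hours part is optional in WebVTT.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return a list of (start_sec, end_sec, text) tuples from WebVTT content."""
    segments = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if not match:
                continue
            groups = match.groups()
            start = _to_seconds(*groups[:4])
            end = _to_seconds(*groups[4:])
            # Drop inline tags such as <c> or <00:00:01.000>, then collapse whitespace.
            payload = " ".join(re.sub(r"<[^>]+>", "", l) for l in lines[i + 1:])
            payload = " ".join(payload.split())
            if payload:
                segments.append((start, end, payload))
            break
    return segments
```

The header block and any cue without text produce no segment, so the result contains only rows that are worth indexing.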

CLI

stdbkit init data/stdbkit.db
stdbkit watch add <playlist-url> --db data/stdbkit.db
stdbkit daemon --db data/stdbkit.db
stdbkit search "exact phrase from the video" --db data/stdbkit.db
stdbkit status --db data/stdbkit.db
stdbkit index-pending --db data/stdbkit.db
stdbkit index-pending --db data/stdbkit.db --generate-subs

Generated subtitles (Whisper)

When YouTube has no captions, stdbKit can download audio and transcribe locally with faster-whisper (OpenAI Whisper, CPU/GPU, multilingual).

pip install -e ".[transcribe]"

# Enable for daemon / webhook server / re-indexing
stdbkit daemon --db data/stdbkit.db --generate-subs
stdbkit index-pending --db data/stdbkit.db --generate-subs

# Override model: --whisper-model medium

With --whisper-model auto (default), stdbKit picks:

Hardware                          Model     Precision  Notes
NVIDIA GPU (e.g. RTX 3060 12 GB)  large-v3  float16    Best quality/speed balance on GPU
CPU only                          base      int8       Fallback without CUDA

GPU transcription requires a CUDA-enabled ctranslate2 (installed automatically with faster-whisper when drivers are present).
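
The auto selection can be pictured as a small decision function. This is a sketch of the rule as described here, not stdbKit's code: `pick_whisper_config` is a hypothetical name, and letting an explicit model keep the hardware-dependent precision is an assumption.

```python
def pick_whisper_config(requested="auto", cuda_available=False):
    """Return (model, precision) following the hardware rule described above."""
    if requested != "auto":
        # An explicit --whisper-model value overrides the automatic choice.
        model = requested
    else:
        model = "large-v3" if cuda_available else "base"
    # Assumption: precision depends only on the hardware, not on the model.
    precision = "float16" if cuda_available else "int8"
    return model, precision
```

One way to obtain `cuda_available` is `ctranslate2.get_cuda_device_count() > 0`, although stdbKit may detect the GPU differently.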

Python API:

from stdbkit import StdbKit
kit = StdbKit("data/stdbkit.db", generate_subtitles=True)  # auto → large-v3 on GPU

Generated tracks are stored with subtitle_source="generated". YouTube manual/auto captions are always preferred when available.

Split processes

# Terminal 1
stdbkit serve --db data/stdbkit.db --port 8787

# Terminal 2
stdbkit relay --db data/stdbkit.db --callback http://127.0.0.1:8787 --interval 30

Python API

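from stdbkit import StdbKit

kit = StdbKit("data/stdbkit.db")

# Register a playlist to watch
kit.watch_playlist("https://www.youtube.com/playlist?list=PLxxxx")

# Run relay + webhook server together (the equivalent of `stdbkit daemon`)
kit.run_daemon(interval_sec=30, port=8787)

# Phrase search over indexed dialogue; each result links to the matching moment
results = kit.search_phrase("exact phrase from the video")
for result in results:
    print(result.matched_text, result.youtube_url_with_timestamp)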

See also examples/daemon.py and examples/minimal_search.py.

Webhook contract

POST /webhooks/playlist-video

{
  "playlist_id": "PLxxxx",
  "youtube_id": "abc123",
  "title": "Video title from YouTube",
  "position": 12,
  "url": "https://www.youtube.com/watch?v=abc123"
}

POST /webhooks/playlist-removed

{
  "playlist_id": "PLxxxx",
  "youtube_id": "abc123"
}

If a secret is configured, send it in the X-StdbKit-Secret header.

External systems (Zapier, Google Apps Script, custom scripts) can call the same endpoints. The playlist must be registered with stdbkit watch add before webhooks for that playlist are accepted.
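
For a custom script, calling the endpoint could look roughly like the sketch below, which uses only the standard library. The function names are hypothetical, and sending the body as JSON with a Content-Type of application/json is an assumption based on the payload shown above.

```python
import json
import urllib.request


def build_webhook_request(base_url, event, secret=None):
    """Build a POST request for /webhooks/playlist-video without sending it."""
    headers = {"Content-Type": "application/json"}  # assumed content type
    if secret is not None:
        headers["X-StdbKit-Secret"] = secret
    return urllib.request.Request(
        base_url.rstrip("/") + "/webhooks/playlist-video",
        data=json.dumps(event).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


event = {
    "playlist_id": "PLxxxx",
    "youtube_id": "abc123",
    "title": "Video title from YouTube",
    "position": 12,
    "url": "https://www.youtube.com/watch?v=abc123",
}
request = build_webhook_request("http://127.0.0.1:8787", event, secret="change-me")
# send(request)  # uncomment once `stdbkit serve` is running on port 8787
```

Separating request construction from sending lets the payload and headers be inspected or tested without a running server.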

Video lifecycle

Status                 Meaning
discovered             Webhook received, not indexed yet
indexing               Subtitles downloading or generating
ready                  Searchable by phrase
no_dialogue            No speech detected (e.g. music-only); not searchable
failed                 Subtitle fetch/generation failed; retry with index-pending
removed_from_playlist  Removed from a watched playlist
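
The statuses above read naturally as a small state machine. The status values in the sketch below come from the list. The transitions are inferred from the descriptions and are an assumption, not a documented contract, and `VideoStatus` and `can_transition` are hypothetical names.

```python
from enum import Enum


class VideoStatus(str, Enum):
    DISCOVERED = "discovered"
    INDEXING = "indexing"
    READY = "ready"
    NO_DIALOGUE = "no_dialogue"
    FAILED = "failed"
    REMOVED = "removed_from_playlist"


# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    VideoStatus.DISCOVERED: {VideoStatus.INDEXING},
    VideoStatus.INDEXING: {
        VideoStatus.READY,
        VideoStatus.NO_DIALOGUE,
        VideoStatus.FAILED,
    },
    VideoStatus.FAILED: {VideoStatus.INDEXING},  # retry via index-pending
    VideoStatus.READY: set(),
    VideoStatus.NO_DIALOGUE: set(),
    VideoStatus.REMOVED: set(),
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is allowed in this sketch."""
    # A video can leave a watched playlist from any other state.
    if target is VideoStatus.REMOVED:
        return current is not VideoStatus.REMOVED
    return target in ALLOWED[current]
```

Because the enum subclasses `str`, a value read from the database such as "failed" converts directly with `VideoStatus("failed")`.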

Scope and limitations

stdbKit intentionally stays small. These are the current boundaries:

  • YouTube only — indexing uses yt-dlp against YouTube URLs/IDs
  • Playlist-driven ingestion — the built-in relay watches registered playlists; there is no channel-wide watcher or single-video CLI command
  • Subtitle dialogue search — FTS runs over timed caption text, not video titles, channels, or custom metadata (use the metadata JSON field and your own app for that)
  • Speech expected — videos without dialogue are marked no_dialogue and skipped from search results
  • Captions or Whisper — without YouTube subtitles and without --generate-subs, indexing fails

Example use cases

These are illustrative patterns, not built-in modes:

  • Curated clip libraries — a playlist of short excerpts; search finds the line and jumps to the timestamp
  • Lecture or course archives — index a playlist of talks and search for a concept mentioned in the audio
  • Podcast / interview collections — same pipeline; Whisper fills in when auto-captions are missing
  • Internal tooling — run the daemon, expose search via your own UI or API using the Python library

In all cases the workflow is the same: register playlists → index subtitles → search phrases → open the timestamped YouTube link.

Building on top

stdbKit stores generic video fields (title, channel, duration_sec, thumbnail_url) and an extensible metadata JSON blob per video. Anything beyond subtitle search — custom taxonomies, auth, web UI, enrichment — lives in a separate application that installs stdbKit as a dependency.

Project layout

stdbkit/
  db/          SQLite schema + persistence
  relay/       Playlist polling + diff + webhook dispatch
  webhooks/    HTTP server + event handler
  subtitles/   yt-dlp fetch + VTT parsing + indexing
  search/      FTS phrase search
  kit.py       Public facade
  cli.py       Typer CLI
examples/      Minimal Python usage samples

Development

pytest

License

MIT

About

Toolkit for indexing YouTube video subtitles with their timestamps and searching them by word or quote, in near real time: the relay polls each playlist every 30 s and notifies the project's own webhook service.
