Add transcribe + speaker diarization pipeline by everettVT · Pull Request #32 · Eventual-Inc/daft-examples

everettVT · 2026-04-16T19:46:43Z

Summary

Adds a four-phase audio pipeline: faster-whisper transcription → pyannote speaker diarization → timestamp-overlap merge → interactive HTML report
Phases run sequentially with deferred imports so Whisper and pyannote models don't compete for memory; intermediate parquets are cached across runs
HTML report features color-coded speakers, editable names (persisted in localStorage), and "Copy as Markdown" export
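
The timestamp-overlap merge named in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: the `Segment` and `Turn` types, the `assign_speakers` helper, and the `UNKNOWN` fallback label are all hypothetical names. Each transcript segment is given the speaker whose diarization turn overlaps it the longest.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of audio (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarized speaker turn (hypothetical shape)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker of the turn it overlaps most."""
    merged = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        merged.append(
            {"start": seg.start, "end": seg.end, "speaker": best_speaker, "text": seg.text}
        )
    return merged
```

The nested loop is quadratic in the number of segments and turns, which is usually acceptable for a single recording; a sorted two-pointer sweep would be the natural refinement for very long audio.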

Cleanup from prototype

Removed hardcoded local path (/Users/.../Desktop/...) — --source is now required
Fixed duplicate # ── Main ── section header
Added module docstring to diarize_schema.py
EnvironmentError → OSError per ruff UP024
Import ordering fixed for ruff I001
Registered in test suite (skipped in CI: requires model download + HF_TOKEN)
Updated CONTRIBUTING.md repo structure to include transcribe_diarize/

Test plan

Syntax verified: python -c "import ast; ast.parse(...)"
No hardcoded local paths: grep -r '/Users\|~/\|Desktop' pipelines/transcribe_diarize/ returns nothing
ruff check and ruff format --check pass (pre-commit hook verified)
Registered in tests/registry.py with skip="requires faster-whisper + pyannote model download"
Manual run: uv run pipelines/transcribe_diarize/transcribe_diarize.py --source <audio.m4a> (requires HF_TOKEN + ffmpeg)

Made with Cursor

Four-phase pipeline: faster-whisper transcription, pyannote diarization, timestamp-overlap merge, and interactive HTML report with speaker renaming.

Cleanup from prototype: removed hardcoded paths, fixed __main__ guard, added module docstrings, registered in test suite (skipped in CI).

Made-with: Cursor

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7135229ec

chatgpt-codex-connector · 2026-04-16T19:49:08Z

+    if not args.no_cache and os.path.exists(tx_parquet):
+        print("\n[2/5] Transcription (cached)")


Key phase caches by source input

This cache check reuses transcription.parquet solely based on file existence, so running the pipeline on a second audio file with the same --dest silently reuses stale results from the first file; the same pattern is repeated for diarization later in main. In practice this can corrupt experiments by producing a report for file B that actually contains file A’s transcript/labels unless users remember --no-cache every time, so cache keys need to include source identity (or validate cached metadata against args.source) before skipping phases.
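
One way to validate cached metadata against the source, as the comment suggests, is a small sidecar file recorded next to each cached artifact. This is a sketch under assumptions: the `.meta.json` sidecar, and the helper names `source_identity`, `cache_is_valid`, and `record_cache`, are hypothetical and not taken from the PR.

```python
import json
import os


def source_identity(source):
    """Describe the input file well enough to notice that it changed."""
    st = os.stat(source)
    return {
        "path": os.path.abspath(source),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
    }


def cache_is_valid(artifact_path, source):
    """True only if the artifact exists and was recorded for this exact source."""
    meta_path = artifact_path + ".meta.json"
    if not (os.path.exists(artifact_path) and os.path.exists(meta_path)):
        return False
    with open(meta_path, encoding="utf-8") as fh:
        return json.load(fh) == source_identity(source)


def record_cache(artifact_path, source):
    """Write the sidecar after a phase finishes producing artifact_path."""
    with open(artifact_path + ".meta.json", "w", encoding="utf-8") as fh:
        json.dump(source_identity(source), fh)
```

With helpers like these, the phase check would read `if not args.no_cache and cache_is_valid(tx_parquet, args.source):` instead of testing file existence alone.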

chatgpt-codex-connector · 2026-04-16T19:49:08Z

+    basename = os.path.splitext(os.path.basename(source))[0]
+    wav_path = os.path.join(dest_dir, f"{basename}.wav")


Disambiguate cached WAV filename per source

The WAV cache key is only the basename, so different inputs like /a/meeting.m4a and /b/meeting.mp3 both map to <dest>/meeting.wav; the second run will incorrectly reuse the first converted audio. Because this happens before phase cache checks, it can produce wrong transcriptions even when --no-cache is set, so the cached WAV path should incorporate a unique source fingerprint (e.g., full path hash or content hash) or be validated before reuse.
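
A minimal sketch of the full-path-hash option mentioned above, using only the standard library. The helper name `fingerprinted_wav_path` and the 12-character digest length are illustrative choices, not the PR's implementation.

```python
import hashlib
import os


def fingerprinted_wav_path(source, dest_dir):
    """Build a WAV cache path that is unique per absolute source path."""
    abs_source = os.path.abspath(source)
    # Hash the full path so /a/meeting.m4a and /b/meeting.mp3 no longer collide.
    digest = hashlib.sha256(abs_source.encode("utf-8")).hexdigest()[:12]
    basename = os.path.splitext(os.path.basename(abs_source))[0]
    return os.path.join(dest_dir, f"{basename}-{digest}.wav")
```

A path hash does not notice when the file at the same path is replaced with different audio; hashing the file contents instead would cover that case at the cost of reading the whole input.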

chatgpt-codex-connector · 2026-04-16T19:49:08Z

+        f'      <h3><strong class="speaker-label" data-speaker-id="{t["speaker"]}" '
+        f'style="color:{speaker_colors[t["speaker"]]}">{t["speaker"]}</strong> '
+        f"<code>[{format_timestamp(t['start'])} - {format_timestamp(t['end'])}]</code></h3>\n"
+        f"      <p>{t['text']}</p>\n"


HTML-escape transcript fields before rendering

Transcript text is inserted directly into HTML without escaping, so any recognized text containing characters such as < or & (or tag-like content) is interpreted as markup rather than plain text, which can break report rendering and allow script injection in the generated file. Apply HTML escaping to dynamic text and attribute values before substituting them into the template.
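
The escaping could look roughly like this, using the standard library's `html.escape`. The `render_turn` wrapper and the simple `format_timestamp` shown here are hypothetical stand-ins for the report code, whose real helpers are not visible in the diff excerpt.

```python
import html


def format_timestamp(seconds):
    """Stand-in formatter (MM:SS); the PR's own helper may differ."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_turn(turn, color):
    """Render one transcript turn with every dynamic value HTML-escaped."""
    # quote=True also escapes " and ', which matters inside attribute values.
    speaker = html.escape(str(turn["speaker"]), quote=True)
    text = html.escape(str(turn["text"]), quote=True)
    safe_color = html.escape(str(color), quote=True)
    start = format_timestamp(turn["start"])
    end = format_timestamp(turn["end"])
    return (
        f'      <h3><strong class="speaker-label" data-speaker-id="{speaker}" '
        f'style="color:{safe_color}">{speaker}</strong> '
        f"<code>[{start} - {end}]</code></h3>\n"
        f"      <p>{text}</p>\n"
    )
```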

- pipelines/catalog.py: shared catalog factory backed by PyIceberg SqlCatalog. Configure via DAFT_WAREHOUSE / DAFT_CATALOG_URI env vars. Local SQLite by default, trivially switchable to GCS BigLake or S3.
- transcribe_diarize: migrated from raw parquet files to catalog tables (transcribe_diarize.transcription, .diarization, .merged). Cache validation now checks source identity; switching audio files invalidates stale results. WAV filenames include a source-path fingerprint to prevent basename collisions. All transcript text is HTML-escaped before rendering in the report.
- pipelines/README.md: centralized documentation of catalog architecture, configuration, table namespaces, and a reference of all pipelines.

Made-with: Cursor
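
For orientation, an environment-driven catalog factory of the kind described might be shaped like the sketch below. Only the DAFT_WAREHOUSE and DAFT_CATALOG_URI variable names come from the commit message; the default directory, the catalog name "default", and the function names are assumptions, and the actual pipelines/catalog.py may differ.

```python
import os


def catalog_properties(env=None):
    """Resolve catalog settings from the environment, with local defaults."""
    env = os.environ if env is None else env
    # Hypothetical default location; the real module may choose another one.
    local_dir = os.path.abspath(".daft_catalog")
    return {
        "warehouse": env.get("DAFT_WAREHOUSE", f"file://{local_dir}"),
        "uri": env.get("DAFT_CATALOG_URI", f"sqlite:///{local_dir}/catalog.db"),
    }


def make_catalog(env=None):
    """Build a PyIceberg SqlCatalog from the resolved settings."""
    # Deferred import so that reading configuration does not require pyiceberg.
    from pyiceberg.catalog.sql import SqlCatalog

    # With the local defaults, the directory must exist before SQLite opens it.
    return SqlCatalog("default", **catalog_properties(env))
```

Pointing both variables at cloud locations (for example an object-store warehouse and a hosted SQL database) is what would switch the same code away from the local SQLite setup.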

everettVT wants to merge 2 commits into main from everettVT/add-transcribe-diarize-pipeline