Add transcribe + speaker diarization pipeline #32
Four-phase pipeline: faster-whisper transcription, pyannote diarization, timestamp-overlap merge, and interactive HTML report with speaker renaming.

Cleanup from prototype: removed hardcoded paths, fixed __main__ guard, added module docstrings, registered in test suite (skipped in CI).

Made-with: Cursor
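For readers unfamiliar with the timestamp-overlap merge phase, here is a minimal sketch of one common way to do it: each transcript segment takes the speaker whose diarization turns overlap it the most. The function names, the dict keys (`start`, `end`, `text`, `speaker`), and the `UNKNOWN` fallback are assumptions for illustration and are not taken from the PR's code.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list, turns: list, unknown: str = "UNKNOWN") -> list:
    """Label each transcript segment with the speaker that overlaps it most.

    `segments` are transcription rows, `turns` are diarization rows; both are
    hypothetical lists of dicts with `start` and `end` in seconds.
    """
    merged = []
    for seg in segments:
        totals = {}  # speaker -> total overlapping seconds with this segment
        for turn in turns:
            dur = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if dur > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + dur
        speaker = max(totals, key=totals.get) if totals else unknown
        merged.append({**seg, "speaker": speaker})
    return merged
```

Segments that overlap no diarization turn keep the fallback label rather than being dropped, so the report still shows their text.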
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7135229ec
    if not args.no_cache and os.path.exists(tx_parquet):
        print("\n[2/5] Transcription (cached)")
Key phase caches by source input
This cache check reuses transcription.parquet solely based on file existence, so running the pipeline on a second audio file with the same --dest silently reuses stale results from the first file; the same pattern is repeated for diarization later in main. In practice this can corrupt experiments by producing a report for file B that actually contains file A’s transcript/labels unless users remember --no-cache every time, so cache keys need to include source identity (or validate cached metadata against args.source) before skipping phases.
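One way to validate cached metadata against the source, sketched under assumptions: a sidecar JSON file next to each cached artifact records a fingerprint of the input, and a phase is skipped only when that fingerprint matches. The helper names, the `.meta.json` suffix, and the choice of path, size, and mtime as the fingerprint are all hypothetical, not the PR's implementation.

```python
import hashlib
import json
import os


def source_fingerprint(source: str) -> str:
    """Hash the absolute path, size, and mtime of the source file."""
    st = os.stat(source)
    key = f"{os.path.abspath(source)}|{st.st_size}|{st.st_mtime_ns}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def cache_is_valid(cache_path: str, source: str) -> bool:
    """Return True only if the cached artifact was produced from `source`."""
    meta_path = cache_path + ".meta.json"  # hypothetical sidecar location
    if not (os.path.exists(cache_path) and os.path.exists(meta_path)):
        return False
    try:
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return meta.get("source_fingerprint") == source_fingerprint(source)


def record_cache_source(cache_path: str, source: str) -> None:
    """Write the sidecar after a phase finishes writing `cache_path`."""
    meta = {
        "source": os.path.abspath(source),
        "source_fingerprint": source_fingerprint(source),
    }
    with open(cache_path + ".meta.json", "w", encoding="utf-8") as f:
        json.dump(meta, f)
```

With helpers like these, the check in main could become `if not args.no_cache and cache_is_valid(tx_parquet, args.source):`, with `record_cache_source` called once the phase has written its output.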
    basename = os.path.splitext(os.path.basename(source))[0]
    wav_path = os.path.join(dest_dir, f"{basename}.wav")
Disambiguate cached WAV filename per source
The WAV cache key is only the basename, so different inputs like /a/meeting.m4a and /b/meeting.mp3 both map to <dest>/meeting.wav; the second run will incorrectly reuse the first converted audio. Because this happens before phase cache checks, it can produce wrong transcriptions even when --no-cache is set, so the cached WAV path should incorporate a unique source fingerprint (e.g., full path hash or content hash) or be validated before reuse.
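A minimal sketch of the full-path-hash option, assuming a hypothetical helper `wav_cache_path` and a 12-character digest prefix (both illustrative choices):

```python
import hashlib
import os


def wav_cache_path(source: str, dest_dir: str) -> str:
    """Build a WAV path that is unique per source path, not just per basename."""
    basename = os.path.splitext(os.path.basename(source))[0]
    # Hash the absolute path so /a/meeting.m4a and /b/meeting.mp3 differ.
    digest = hashlib.sha256(os.path.abspath(source).encode("utf-8")).hexdigest()[:12]
    return os.path.join(dest_dir, f"{basename}-{digest}.wav")
```

A path hash does not notice when the file at the same path is replaced with different audio; hashing the file contents instead would cover that case at the cost of reading the whole file.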
    f' <h3><strong class="speaker-label" data-speaker-id="{t["speaker"]}" '
    f'style="color:{speaker_colors[t["speaker"]]}">{t["speaker"]}</strong> '
    f"<code>[{format_timestamp(t['start'])} - {format_timestamp(t['end'])}]</code></h3>\n"
    f" <p>{t['text']}</p>\n"
HTML-escape transcript fields before rendering
Transcript text is inserted directly into HTML without escaping, so any recognized text containing characters like < or & (or tag-like content) is interpreted as markup rather than plain text, which can break report rendering and allow script injection in the generated file. Apply HTML escaping to dynamic text/attribute values before substituting into the template.
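A sketch of the escaping using the standard library's `html.escape`, which also escapes quotes by default and so is safe for attribute values. The `render_turn` wrapper and the `format_timestamp` stand-in below are hypothetical; the pipeline has its own `format_timestamp`, whose output format is not shown in this PR.

```python
import html


def format_timestamp(seconds: float) -> str:
    """Stand-in for the pipeline's helper: render seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_turn(t: dict, speaker_colors: dict) -> str:
    """Render one speaker turn with every dynamic value HTML-escaped."""
    speaker = html.escape(str(t["speaker"]))  # used in an attribute and as text
    color = html.escape(str(speaker_colors[t["speaker"]]))
    text = html.escape(str(t["text"]))
    return (
        f'  <h3><strong class="speaker-label" data-speaker-id="{speaker}" '
        f'style="color:{color}">{speaker}</strong> '
        f"<code>[{format_timestamp(t['start'])} - {format_timestamp(t['end'])}]</code></h3>\n"
        f"  <p>{text}</p>\n"
    )
```

Escaping at render time keeps the stored transcript unmodified, so the same rows can still be exported to other formats without double-escaping.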
- pipelines/catalog.py: shared catalog factory backed by PyIceberg SqlCatalog. Configure via DAFT_WAREHOUSE / DAFT_CATALOG_URI env vars. Local SQLite by default, trivially switchable to GCS BigLake or S3.
- transcribe_diarize: migrated from raw parquet files to catalog tables (transcribe_diarize.transcription, .diarization, .merged). Cache validation now checks source identity: switching audio files invalidates stale results. WAV filenames include a source-path fingerprint to prevent basename collisions. All transcript text is HTML-escaped before rendering in the report.
- pipelines/README.md: centralized documentation of catalog architecture, configuration, table namespaces, and a reference of all pipelines.

Made-with: Cursor
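To illustrate the environment-driven configuration described for pipelines/catalog.py, here is a rough sketch of how the two variables might be resolved into catalog properties. Only the variable names DAFT_WAREHOUSE and DAFT_CATALOG_URI come from the commit message; the function name, the `warehouse` directory, and the `catalog.db` filename are assumed defaults, and the real factory may differ.

```python
import os


def catalog_properties(env=None) -> dict:
    """Resolve catalog settings from the environment, with local defaults.

    The default locations below are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    local_root = os.path.abspath("warehouse")
    return {
        # Where table data lives: a local directory unless overridden.
        "warehouse": env.get("DAFT_WAREHOUSE", f"file://{local_root}"),
        # Where catalog metadata lives: a local SQLite file unless overridden.
        "uri": env.get("DAFT_CATALOG_URI", f"sqlite:///{local_root}/catalog.db"),
    }
```

A dict of this shape could then be passed as keyword properties to PyIceberg's SqlCatalog, for example `SqlCatalog("default", **catalog_properties())`, assuming the factory uses the `uri` and `warehouse` property names.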
Summary
Cleanup from prototype
- Removed hardcoded path (`/Users/.../Desktop/...`); `--source` is now required
- `# ── Main ──` section header
- `diarize_schema.py`
- `EnvironmentError` → `OSError` per ruff UP024
- `CONTRIBUTING.md` repo structure to include `transcribe_diarize/`

Test plan
- `python -c "import ast; ast.parse(...)"`
- `grep -r '/Users\|~/\|Desktop' pipelines/transcribe_diarize/` returns nothing
- `ruff check` and `ruff format --check` pass (pre-commit hook verified)
- Registered in `tests/registry.py` with `skip="requires faster-whisper + pyannote model download"`
- `uv run pipelines/transcribe_diarize/transcribe_diarize.py --source <audio.m4a>` (requires HF_TOKEN + ffmpeg)

Made with Cursor