
Add transcribe + speaker diarization pipeline#32

Open
everettVT wants to merge 2 commits into main from everettVT/add-transcribe-diarize-pipeline

Conversation

@everettVT
Contributor

Summary

  • Adds a four-phase audio pipeline: faster-whisper transcription → pyannote speaker diarization → timestamp-overlap merge → interactive HTML report
  • Phases run sequentially with deferred imports so Whisper and pyannote models don't compete for memory; intermediate parquets are cached across runs
  • HTML report features color-coded speakers, editable names (persisted in localStorage), and "Copy as Markdown" export
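The timestamp-overlap merge in the third phase can be sketched as follows: each transcript segment gets the speaker whose diarization turn overlaps it the most. This is an illustrative sketch, not the PR's actual implementation; `merge_by_overlap` and the dict shapes are hypothetical names.

```python
def merge_by_overlap(segments, turns):
    """Assign each transcript segment the speaker with maximal time overlap.

    segments: [{'start', 'end', 'text'}] from transcription.
    turns:    [{'start', 'end', 'speaker'}] from diarization.
    """
    merged = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end];
            # negative means the intervals are disjoint.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        merged.append({**seg, "speaker": best_speaker})
    return merged
```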

Cleanup from prototype

  • Removed hardcoded local path (/Users/.../Desktop/...) — --source is now required
  • Fixed duplicate # ── Main ── section header
  • Added module docstring to diarize_schema.py
  • EnvironmentError → OSError per ruff UP024
  • Import ordering fixed for ruff I001
  • Registered in test suite (skipped in CI: requires model download + HF_TOKEN)
  • Updated CONTRIBUTING.md repo structure to include transcribe_diarize/

Test plan

  • Syntax verified: python -c "import ast; ast.parse(...)"
  • No hardcoded local paths: grep -r '/Users\|~/\|Desktop' pipelines/transcribe_diarize/ returns nothing
  • ruff check and ruff format --check pass (pre-commit hook verified)
  • Registered in tests/registry.py with skip="requires faster-whisper + pyannote model download"
  • Manual run: uv run pipelines/transcribe_diarize/transcribe_diarize.py --source <audio.m4a> (requires HF_TOKEN + ffmpeg)

Made with Cursor

Four-phase pipeline: faster-whisper transcription, pyannote diarization,
timestamp-overlap merge, and interactive HTML report with speaker renaming.

Cleanup from prototype: removed hardcoded paths, fixed __main__ guard,
added module docstrings, registered in test suite (skipped in CI).

Made-with: Cursor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7135229ec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +413 to +414
if not args.no_cache and os.path.exists(tx_parquet):
print("\n[2/5] Transcription (cached)")


P1 Badge Key phase caches by source input

This cache check reuses transcription.parquet based solely on file existence, so running the pipeline on a second audio file with the same --dest silently reuses stale results from the first file; the same pattern is repeated for diarization later in main. In practice this can corrupt experiments by producing a report for file B that actually contains file A’s transcript and labels, unless users remember --no-cache every time. Cache keys need to include source identity (or validate cached metadata against args.source) before skipping phases.

Useful? React with 👍 / 👎.
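One way to address this review point: write a sidecar metadata file next to each cached parquet and validate it against the current --source before reuse. The names here (`cache_valid`, `write_cache_meta`) are illustrative, not from the PR.

```python
import hashlib
import json
import os


def _source_key(source: str) -> str:
    # Fingerprint the absolute source path; a content hash would be stricter.
    return hashlib.sha256(os.path.abspath(source).encode()).hexdigest()[:16]


def write_cache_meta(parquet_path: str, source: str) -> None:
    # Record which source produced this cached artifact.
    with open(parquet_path + ".meta.json", "w") as f:
        json.dump({"source_key": _source_key(source)}, f)


def cache_valid(parquet_path: str, source: str) -> bool:
    # Reuse the cache only if it exists AND was produced from this source.
    meta_path = parquet_path + ".meta.json"
    if not (os.path.exists(parquet_path) and os.path.exists(meta_path)):
        return False
    with open(meta_path) as f:
        return json.load(f).get("source_key") == _source_key(source)
```

The existence check `os.path.exists(tx_parquet)` would then become `cache_valid(tx_parquet, args.source)`.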

Comment on lines +70 to +71
basename = os.path.splitext(os.path.basename(source))[0]
wav_path = os.path.join(dest_dir, f"{basename}.wav")


P1 Badge Disambiguate cached WAV filename per source

The WAV cache key is only the basename, so different inputs like /a/meeting.m4a and /b/meeting.mp3 both map to <dest>/meeting.wav, and the second run incorrectly reuses the first converted audio. Because this happens before the phase cache checks, it can produce wrong transcriptions even when --no-cache is set. The cached WAV path should incorporate a unique source fingerprint (e.g., a full-path hash or content hash) or be validated before reuse.

Useful? React with 👍 / 👎.
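A minimal sketch of the suggested fix: fold a short hash of the absolute source path into the cached WAV name, so /a/meeting.m4a and /b/meeting.mp3 no longer collide on <dest>/meeting.wav. The function name is illustrative.

```python
import hashlib
import os


def cached_wav_path(source: str, dest_dir: str) -> str:
    # Keep the readable basename, but append a path fingerprint so two
    # different sources with the same basename get distinct cache files.
    basename = os.path.splitext(os.path.basename(source))[0]
    fingerprint = hashlib.sha256(os.path.abspath(source).encode()).hexdigest()[:8]
    return os.path.join(dest_dir, f"{basename}.{fingerprint}.wav")
```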

f' <h3><strong class="speaker-label" data-speaker-id="{t["speaker"]}" '
f'style="color:{speaker_colors[t["speaker"]]}">{t["speaker"]}</strong> '
f"<code>[{format_timestamp(t['start'])} - {format_timestamp(t['end'])}]</code></h3>\n"
f" <p>{t['text']}</p>\n"


P2 Badge HTML-escape transcript fields before rendering

Transcript text is inserted directly into HTML without escaping, so any recognized text containing characters like <, >, or & (or tag-like content) is interpreted as markup rather than plain text. This can break report rendering and allow script injection in the generated file. Apply HTML escaping to dynamic text and attribute values before substituting into the template.

Useful? React with 👍 / 👎.
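Applying the suggested fix to the snippet above could look like this. `render_segment` is a hypothetical helper; `format_timestamp` and the segment dict shape mirror the PR's code but are passed in here so the sketch is self-contained.

```python
import html


def render_segment(t, speaker_colors, format_timestamp):
    # Escape every dynamic field before substitution: quote=True also escapes
    # quotes, which matters for the data-speaker-id attribute value.
    speaker = html.escape(t["speaker"], quote=True)
    text = html.escape(t["text"])
    return (
        f'<h3><strong class="speaker-label" data-speaker-id="{speaker}" '
        f'style="color:{speaker_colors[t["speaker"]]}">{speaker}</strong> '
        f"<code>[{format_timestamp(t['start'])} - {format_timestamp(t['end'])}]</code></h3>\n"
        f"<p>{text}</p>\n"
    )
```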

- pipelines/catalog.py: shared catalog factory backed by PyIceberg SqlCatalog.
  Configure via DAFT_WAREHOUSE / DAFT_CATALOG_URI env vars. Local SQLite by
  default, trivially switchable to GCS BigLake or S3.

- transcribe_diarize: migrated from raw parquet files to catalog tables
  (transcribe_diarize.transcription, .diarization, .merged). Cache validation
  now checks source identity — switching audio files invalidates stale results.
  WAV filenames include a source-path fingerprint to prevent basename collisions.
  All transcript text is HTML-escaped before rendering in the report.

- pipelines/README.md: centralized documentation of catalog architecture,
  configuration, table namespaces, and a reference of all pipelines.

Made-with: Cursor
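The env-var configuration described above could be resolved with a small helper like this. The default paths and the function name are illustrative assumptions, not the PR's actual code; the resulting dict is the kind of configuration you would hand to PyIceberg's SqlCatalog.

```python
import os


def catalog_config(env=os.environ) -> dict:
    # Local SQLite warehouse by default; override DAFT_WAREHOUSE with a
    # gs:// or s3:// location to target GCS BigLake or S3 instead.
    warehouse = env.get("DAFT_WAREHOUSE", "file://./warehouse")
    uri = env.get("DAFT_CATALOG_URI", "sqlite:///./warehouse/catalog.db")
    return {"warehouse": warehouse, "uri": uri}
```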