Multi-voice disambiguation + accurate per-segment spoken-at times #15

Merged: etiennechabert merged 1 commit into main from claude/multi-voice-disambiguation on Apr 21, 2026

@etiennechabert

Owner

Summary

Two related fixes validated against a live Meet call:

  1. Multi-voice disambiguation. When two or more distinct pyannote voices in one batch resolve to the same caption name (common when several people share a single Meet tile or someone is screen-sharing), we now label them `<Name> Speaker #1`, `<Name> Speaker #2`, … ordered chronologically by first appearance, instead of collapsing them all to one label. The one-voice case keeps the plain name.

  2. Accurate per-sentence `:SS` timestamps. Both partial and final WS payloads now carry `batch_end_ts_ms` + `audio_duration_secs`. The admin + viewer renderers compute each bullet's wall-clock time as `batch_end - (audio_duration - seg.start) * 1000` instead of anchoring on "when the viewer first received the partial," which was making every bullet of a multi-segment utterance display the same `:SS`. Viewer also now carries `start`/`end` through from translated segments (they were being dropped).
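
To make the timestamp arithmetic in item 2 concrete, here is a minimal Python sketch of the formula. The real renderers are frontend code that this PR description does not show, so the function name `spoken_at_ms` and the `legacy_anchor_ms` parameter are hypothetical. The old formula is also not shown here, so the fallback branch simply returns the legacy anchor unchanged as a stand-in.

```python
from typing import Optional


def spoken_at_ms(
    seg_start_secs: float,
    batch_end_ts_ms: Optional[int],
    audio_duration_secs: Optional[float],
    legacy_anchor_ms: int,
) -> int:
    """Estimate the wall-clock time (epoch ms) at which a segment was spoken.

    seg_start_secs is the segment's offset from the start of the batch audio.
    """
    if batch_end_ts_ms is None or audio_duration_secs is None:
        # Legacy payload without the batch anchor. The PR's old formula is
        # not shown, so this stand-in just returns the previous anchor.
        return legacy_anchor_ms

    # The batch audio ends at batch_end_ts_ms and lasts audio_duration_secs,
    # so a segment starting at seg_start_secs was spoken
    # (audio_duration_secs - seg_start_secs) seconds before the batch end.
    secs_before_end = audio_duration_secs - seg_start_secs
    return round(batch_end_ts_ms - secs_before_end * 1000)


# Two segments of the same 60-second batch now get distinct times.
batch_end = 1_000_060_000
first = spoken_at_ms(12.5, batch_end, 60.0, legacy_anchor_ms=0)
second = spoken_at_ms(40.0, batch_end, 60.0, legacy_anchor_ms=0)
```

With a shared batch anchor, segments that start 27.5 seconds apart in the audio end up 27,500 ms apart on the wall clock, which is what lets each bullet show its own `:SS`.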

Test plan

  • Validated live against a multi-voice Meet call: a single "Etienne Chabert's Presentation" tile split into `Speaker #1`, `Speaker #2`, `Speaker #3` across a 60-second batch. Starts spread chronologically.
  • Diagnostic admin + viewer(EN) + viewer(DE) socket clients confirm matching `utterance_id`, `batch_end_ts_ms`, `audio_duration_secs` across streams.
  • Falls back cleanly to old anchor when a legacy payload arrives without `batch_end_ts_ms`.
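
The labelling rule exercised by the first test-plan item can be sketched as follows. This is an illustration under stated assumptions, not the actual body of `resolve_speaker_identity()`: the helper name `disambiguate_labels`, the tuple input shape, the start times, and the "Alice" participant are all hypothetical.

```python
from typing import Dict, List, Tuple

# (pyannote voice ID, resolved caption name, first start time in seconds)
Voice = Tuple[str, str, float]


def disambiguate_labels(voices: List[Voice]) -> Dict[str, str]:
    """Map each voice ID to a display label.

    Voices that share a caption name are numbered by first appearance;
    a name with a single voice keeps the plain name.
    """
    by_name: Dict[str, List[Tuple[float, str]]] = {}
    for voice_id, name, first_start in voices:
        by_name.setdefault(name, []).append((first_start, voice_id))

    labels: Dict[str, str] = {}
    for name, group in by_name.items():
        if len(group) == 1:
            # One-voice case: keep the plain caption name.
            labels[group[0][1]] = name
            continue
        # Sort by first start time so numbering is chronological.
        for index, (_, voice_id) in enumerate(sorted(group), start=1):
            labels[voice_id] = f"{name} Speaker #{index}"
    return labels


# Three voices behind one shared tile, plus one ordinary participant.
tile = "Etienne Chabert's Presentation"
labels = disambiguate_labels([
    ("SPEAKER_02", tile, 31.0),
    ("SPEAKER_00", tile, 2.4),
    ("SPEAKER_01", tile, 14.0),
    ("SPEAKER_03", "Alice", 5.0),  # hypothetical single-voice participant
])
```

Numbering by first start time rather than by pyannote ID keeps the labels in the order in which a reader meets the voices in the transcript.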

🤖 Generated with Claude Code

- resolve_speaker_identity() now disambiguates when multiple pyannote
  IDs in one batch resolve to the same caption name (common when two
  people share a single Meet tile or one participant is
  screen-sharing). Distinct voices get "<Name> Speaker #1",
  "<Name> Speaker #2", … ordered chronologically by first appearance.
  One-voice case keeps the plain name.
- Both partial and final WS payloads now carry batch_end_ts_ms +
  audio_duration_secs. The frontend computes each sentence's true
  spoken-at timestamp as batch_end - (audio_duration - seg.start)*1000
  instead of anchoring on "when the viewer received the first partial",
  which had all bullets showing the same :SS for multi-segment
  utterances.
- Viewer new_translation handler now carries start/end through from
  translated segments (they were dropped, making every bullet display
  the utterance base time).
- Admin + viewer renderers use the batch anchor when present and fall
  back to the old formula for legacy payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@etiennechabert merged commit d1bb450 into main on Apr 21, 2026
2 of 6 checks passed
@etiennechabert deleted the claude/multi-voice-disambiguation branch on April 21, 2026 15:58