Multi-voice disambiguation + accurate per-segment spoken-at times by etiennechabert · Pull Request #15 · etiennechabert/polyglot

etiennechabert · 2026-04-21T15:55:46Z

Summary

Two related fixes validated against a live Meet call:

Multi-voice disambiguation. When two (or more) distinct pyannote voices in one batch resolve to the same caption name (common when several people share a single Meet tile or someone is screen-sharing), we now label them `<Name> Speaker #1`, `<Name> Speaker #2`, … ordered chronologically by first appearance, instead of collapsing them all to one label. The one-voice case keeps the plain name.
Accurate per-sentence `:SS` timestamps. Both partial and final WS payloads now carry `batch_end_ts_ms` + `audio_duration_secs`. The admin + viewer renderers compute each bullet's wall-clock time as `batch_end - (audio_duration - seg.start) * 1000` instead of anchoring on "when the viewer first received the partial," which was making every bullet of a multi-segment utterance display the same `:SS`. Viewer also now carries `start`/`end` through from translated segments (they were being dropped).
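The spoken-at formula above can be sketched as follows. This is a minimal Python illustration, not the code from this PR (the actual renderers live in the admin and viewer frontends). The function name `spoken_at_ms` and the `legacy_anchor_ms` parameter are hypothetical, and because the old anchor formula is not shown here, the fallback branch simply returns the caller-supplied anchor.

```python
from typing import Optional


def spoken_at_ms(
    seg_start_secs: float,
    batch_end_ts_ms: Optional[int],
    audio_duration_secs: Optional[float],
    legacy_anchor_ms: int,
) -> int:
    """Return the wall-clock time (ms since epoch) at which a segment began.

    seg_start_secs is the segment's offset from the start of the batch audio.
    legacy_anchor_ms is a hypothetical stand-in for the old per-utterance anchor.
    """
    if batch_end_ts_ms is None or audio_duration_secs is None:
        # Legacy payload without the new fields: keep the old anchor.
        return legacy_anchor_ms
    # The batch audio ended at batch_end_ts_ms and lasted audio_duration_secs,
    # so a segment starting at seg_start_secs was spoken
    # (audio_duration_secs - seg_start_secs) seconds before the batch ended.
    return round(batch_end_ts_ms - (audio_duration_secs - seg_start_secs) * 1000)
```

With a 60-second batch, a segment starting 12.5 s into the audio lands 47.5 s before `batch_end_ts_ms`, so segments of the same utterance get distinct `:SS` values instead of sharing one anchor.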

Test plan

Validated live against a multi-voice Meet call: a single "Etienne Chabert's Presentation" tile split into `Speaker #1`, `Speaker #2`, `Speaker #3` across a 60-second batch. Segment start times were spread chronologically.
Diagnostic admin + viewer(EN) + viewer(DE) socket clients confirm matching `utterance_id`, `batch_end_ts_ms`, `audio_duration_secs` across streams.
Falls back cleanly to old anchor when a legacy payload arrives without `batch_end_ts_ms`.
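To make the labelling rule exercised in the multi-voice test concrete, here is a minimal Python sketch under stated assumptions. It is not the body of `resolve_speaker_identity()`: the helper name `disambiguate_labels`, the input tuple shape, and the voice IDs in the usage example are all hypothetical. It only illustrates the rule described in this PR: voices sharing a caption name get `<Name> Speaker #N` ordered by first appearance, and a name with a single voice stays plain.

```python
from typing import Dict, List, Tuple


def disambiguate_labels(voices: List[Tuple[str, str, float]]) -> Dict[str, str]:
    """Map each voice ID to a display label.

    voices holds (voice_id, caption_name, first_start_secs) tuples, one per
    distinct voice in the batch. This input shape is an assumption.
    """
    # Group voices by the caption name they resolved to.
    by_name: Dict[str, List[Tuple[float, str]]] = {}
    for voice_id, name, first_start in voices:
        by_name.setdefault(name, []).append((first_start, voice_id))

    labels: Dict[str, str] = {}
    for name, group in by_name.items():
        if len(group) == 1:
            # One-voice case keeps the plain name.
            labels[group[0][1]] = name
            continue
        # Several voices share the name: number them by first appearance.
        for idx, (_, voice_id) in enumerate(sorted(group), start=1):
            labels[voice_id] = f"{name} Speaker #{idx}"
    return labels


example = disambiguate_labels([
    ("voice_c", "Etienne Chabert's Presentation", 31.0),
    ("voice_a", "Etienne Chabert's Presentation", 2.0),
    ("voice_b", "Alice", 10.0),
])
```

Here `voice_a` speaks first and becomes `Etienne Chabert's Presentation Speaker #1`, `voice_c` becomes `... Speaker #2`, and `voice_b` keeps the plain label `Alice`.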

🤖 Generated with Claude Code

- resolve_speaker_identity() now disambiguates when multiple pyannote IDs in one batch resolve to the same caption name (common when two people share a single Meet tile or one participant is screen-sharing). Distinct voices get "<Name> Speaker #1", "<Name> Speaker #2", … ordered chronologically by first appearance. One-voice case keeps the plain name.
- Both partial and final WS payloads now carry batch_end_ts_ms + audio_duration_secs. The frontend computes each sentence's true spoken-at timestamp as batch_end - (audio_duration - seg.start)*1000 instead of anchoring on "when the viewer received the first partial", which had all bullets showing the same :SS for multi-segment utterances.
- Viewer new_translation handler now carries start/end through from translated segments (they were dropped, making every bullet display the utterance base time).
- Admin + viewer renderers use the batch anchor when present and fall back to the old formula for legacy payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

etiennechabert merged commit d1bb450 into main Apr 21, 2026
2 of 6 checks passed

etiennechabert deleted the claude/multi-voice-disambiguation branch April 21, 2026 15:58

etiennechabert merged 1 commit into main from claude/multi-voice-disambiguation
