Multi-voice disambiguation + accurate per-segment spoken-at times #15
Merged
- resolve_speaker_identity() now disambiguates when multiple pyannote IDs in one batch resolve to the same caption name (common when two people share a single Meet tile or one participant is screen-sharing). Distinct voices get "<Name> Speaker #1", "<Name> Speaker #2", … ordered chronologically by first appearance. One-voice case keeps the plain name.
- Both partial and final WS payloads now carry batch_end_ts_ms + audio_duration_secs. The frontend computes each sentence's true spoken-at timestamp as batch_end - (audio_duration - seg.start)*1000 instead of anchoring on "when the viewer received the first partial", which had all bullets showing the same :SS for multi-segment utterances.
- Viewer new_translation handler now carries start/end through from translated segments (they were dropped, making every bullet display the utterance base time).
- Admin + viewer renderers use the batch anchor when present and fall back to the old formula for legacy payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Two related fixes validated against a live Meet call:
Multi-voice disambiguation. When two (or more) distinct pyannote voices in one batch resolve to the same caption name (common when several people share a single Meet tile or someone is screen-sharing), we now label them `<Name> Speaker #1`, `<Name> Speaker #2`, … ordered chronologically by first appearance, instead of collapsing them all to one label. The one-voice case keeps the plain name.
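For reviewers, here is a minimal sketch of the labelling rule described above. It is not the actual body of `resolve_speaker_identity()`: the helper name `disambiguate_labels` and both input dictionaries are hypothetical, and it assumes each voice ID's first-appearance time in the batch is already known.

```python
from collections import Counter


def disambiguate_labels(first_seen, resolved_names):
    """Map each voice ID to a display label (illustrative helper only).

    first_seen:     voice ID -> start time (seconds) of its first segment in the batch
    resolved_names: voice ID -> caption name that voice resolved to
    """
    # How many distinct voices landed on each caption name.
    voices_per_name = Counter(resolved_names.values())
    next_index = {}
    labels = {}
    # Walk voices by first appearance so the numbering is chronological.
    for voice_id in sorted(resolved_names, key=lambda v: first_seen[v]):
        name = resolved_names[voice_id]
        if voices_per_name[name] == 1:
            # One-voice case keeps the plain name.
            labels[voice_id] = name
        else:
            next_index[name] = next_index.get(name, 0) + 1
            labels[voice_id] = f"{name} Speaker #{next_index[name]}"
    return labels


labels = disambiguate_labels(
    first_seen={"SPEAKER_00": 4.2, "SPEAKER_01": 0.5, "SPEAKER_02": 1.0},
    resolved_names={"SPEAKER_00": "Alice", "SPEAKER_01": "Alice", "SPEAKER_02": "Bob"},
)
```

In this example the two voices that resolved to "Alice" are numbered by when they first spoke, while "Bob" keeps the plain name because only one voice maps to it.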
Accurate per-sentence `:SS` timestamps. Both partial and final WS payloads now carry `batch_end_ts_ms` + `audio_duration_secs`. The admin + viewer renderers compute each bullet's wall-clock time as `batch_end - (audio_duration - seg.start) * 1000` instead of anchoring on "when the viewer first received the partial," which was making every bullet of a multi-segment utterance display the same `:SS`. Viewer also now carries `start`/`end` through from translated segments (they were being dropped).
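The arithmetic behind the batch anchor can be sketched as follows. The real computation lives in the admin and viewer renderers; this Python version only illustrates the formula, and the function name `spoken_at_ms` and the sample values are hypothetical.

```python
def spoken_at_ms(batch_end_ts_ms, audio_duration_secs, seg_start_secs):
    """Wall-clock time (epoch ms) at which a segment started being spoken.

    The batch's audio ends at batch_end_ts_ms and lasts audio_duration_secs,
    so a segment starting seg_start_secs into the audio was spoken
    (audio_duration_secs - seg_start_secs) seconds before the batch ended.
    """
    return round(batch_end_ts_ms - (audio_duration_secs - seg_start_secs) * 1000)


# Hypothetical payload: a 10 s batch ending at a known epoch time, two segments.
batch_end_ts_ms = 1_700_000_010_000
audio_duration_secs = 10.0
first = spoken_at_ms(batch_end_ts_ms, audio_duration_secs, 2.0)
second = spoken_at_ms(batch_end_ts_ms, audio_duration_secs, 7.5)
```

The two segments land 5.5 s apart on the wall clock, so their bullets no longer share one `:SS` value. Payloads without `batch_end_ts_ms` would still need the legacy fallback mentioned in the commit message, which this sketch does not cover.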
Test plan
🤖 Generated with Claude Code