Google Meet speaker identity: bot join, audio capture, captions-driven speaker attribution #13
Merged
Phase 1 of Google Meet speaker-identity integration. Playwright bot that joins a Meet URL as an unauthenticated guest and waits to be admitted. Audio capture, DOM scraping for active speaker, and Polyglot WebSocket wiring come in subsequent phases. All Meet DOM selectors are centralized in selectors.js so future UI rotations are a one-file fix. https://claude.ai/code/session_019SWkcdJekyEmJqkwSPMbPH
Complete end-to-end pipeline: the bot joins a Meet call, streams audio to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels to real display names by overlapping diarization against a wall-clock speaker timeline built from Meet's live-captions DOM.

- Bot audio capture: RTCPeerConnection init-script taps all remote audio tracks into __pgStream; AudioWorklet resamples to 16 kHz PCM16 in 20 ms frames, base64-bridged to node and forwarded to Polyglot's /meet_bot namespace.
- Speaker detection via captions: enables Meet captions via toolbar button (keyboard fallback), observes the aria-label="Captions" region, and extracts speaker names from each caption block's .NWpY1d span. Falls back to legacy data-is-speaking / aria-label signals.
- Polyglot ingest: /meet_bot Socket.IO namespace rechunks 320-sample bot frames into CHUNK_SIZE batches, maintains a 500-entry speaker_timeline deque of closed intervals, tracks _active_speaker open intervals, and records the full meeting audio to transcripts/<name>.wav for offline retranscription.
- Phase 5 resolution: resolve_speaker_identity overlaps each pyannote segment's wall-clock range against the timeline (closed + still-open), picks the majority-vote name if it covers >=30% of the segment, and emits rename_speaker WebSocket events. Resolved names replace SPEAKER_XX labels directly in transcript segments before WS emit.
- Speaker-switch batching: when the bot reports a new active speaker, process_audio flushes the current batch so each turn transcribes as one unit (up to BOT_MAX_BATCH_SEC = 60s).
- Admin UI: bot status badge, roster panel, and retroactive rename_speaker handler that rewrites SPEAKER_XX labels in-place.
- Persistent chrome-profile for Google sign-in cookies, 15s Polyglot connection timeout with auto-reconnect, forced-click fallback for the join button under overlays.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
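The Phase 5 resolution step can be illustrated with a minimal sketch. It is not the code from this PR: the function name `resolve_name`, its signature, and the tuple layout of timeline entries are hypothetical, and "majority vote" is read here as "the name with the largest total overlap", which is an assumption. Only the 500-entry deque, the closed plus still-open intervals, and the 30% coverage threshold come from the commit message.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Hypothetical layout: (display_name, start_sec, end_sec) in wall-clock seconds.
Interval = Tuple[str, float, float]

MIN_COVERAGE = 0.30  # winning name must cover at least 30% of the segment


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def resolve_name(
    seg_start: float,
    seg_end: float,
    timeline: Deque[Interval],
    active_starts: Dict[str, float],
    now: float,
) -> Optional[str]:
    """Return the display name for one diarization segment, or None."""
    duration = seg_end - seg_start
    if duration <= 0:
        return None

    covered: Dict[str, float] = {}
    # Closed intervals: speakers who already stopped talking.
    for name, start, end in timeline:
        covered[name] = covered.get(name, 0.0) + overlap(seg_start, seg_end, start, end)
    # Still-open intervals: treat them as running until `now`.
    for name, start in active_starts.items():
        covered[name] = covered.get(name, 0.0) + overlap(seg_start, seg_end, start, now)

    if not covered:
        return None
    best_name, best_seconds = max(covered.items(), key=lambda item: item[1])
    return best_name if best_seconds / duration >= MIN_COVERAGE else None


# Usage: a bounded timeline, as in the 500-entry deque described above.
timeline: Deque[Interval] = deque([("Alice", 0.0, 4.0), ("Bob", 4.0, 5.0)], maxlen=500)
print(resolve_name(0.0, 5.0, timeline, {}, now=10.0))
```

A segment whose best candidate falls below the threshold returns `None`, so it would keep its SPEAKER_XX label until a later rename resolves it.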
- Admin "Start bot" input + button: paste a Meet ID (or full URL) and spawn the Node bot as a subprocess from Polyglot; Stop button kills it. Backend handlers normalize the URL and track one instance at a time.
- Buffer bar now displays seconds with a fixed 60 s cap (matches BOT_MAX_BATCH_SEC), computed as chunks * 1024 / 16000.
- Drop level-based silence detection in bot mode. The only flush triggers are (1) a NEW speaker starting and (2) the 60 s cap. A single speaker's natural pauses fire speaker_end/speaker_start toggles that we deliberately ignore, so their turn stays one batch. Mic-only mode keeps the original level-based silence heuristic as a fallback.
- Live "currently speaking" banner on admin + viewer, driven by a new active_speakers socket event broadcast whenever _active_speaker_starts changes.
- Viewer now renders the speaker name above each translated segment and handles rename_speaker retroactively, so late name resolutions update already-displayed rows in place.
- Admin UI fixes: removed the orphan header audio-visualizer strip (was rendering outside any container), restored the full AUDIO SIGNAL panel as the third column of the top row, forced minmax(0, 1fr) on the two system-stats grids so VRAM and System Resources keep equal widths, hid the SILENCE bar when the bot is connected since it no longer drives flushing there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
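The bot-mode flush policy and the buffer-bar arithmetic can be sketched as below. The class `BotBatcher` and its method names are hypothetical and are not taken from `process_audio`. Setting `CHUNK_SIZE = 1024` is an assumption inferred from the `chunks * 1024 / 16000` formula. The two flush triggers and the ignored same-speaker toggles follow the commit message.

```python
from typing import Optional

SAMPLE_RATE = 16_000       # Hz, from the 16 kHz PCM16 stream
CHUNK_SIZE = 1024          # samples per chunk (assumed from the buffer-bar formula)
BOT_MAX_BATCH_SEC = 60     # hard cap on one batch


def buffered_seconds(chunks: int) -> float:
    """Seconds of audio held in the batch: chunks * 1024 / 16000."""
    return chunks * CHUNK_SIZE / SAMPLE_RATE


class BotBatcher:
    """Tracks one batch; each method returns True when the batch should flush."""

    def __init__(self) -> None:
        self.chunks = 0
        self.speaker: Optional[str] = None

    def on_chunk(self) -> bool:
        # Trigger 2: the 60 s cap.
        self.chunks += 1
        if buffered_seconds(self.chunks) >= BOT_MAX_BATCH_SEC:
            self.chunks = 0
            return True
        return False

    def on_speaker_start(self, name: str) -> bool:
        # Same speaker resuming after a pause: ignored, the turn stays one batch.
        if name == self.speaker:
            return False
        had_previous = self.speaker is not None
        self.speaker = name
        # Trigger 1: a NEW speaker starts while audio is buffered.
        if had_previous and self.chunks > 0:
            self.chunks = 0
            return True
        return False

    def on_speaker_end(self, name: str) -> bool:
        # speaker_end never flushes in bot mode.
        return False


print(buffered_seconds(125))  # 125 * 1024 / 16000 = 8.0
```

With these constants the cap is reached on the 938th chunk, since 937 chunks hold just under 60 s of audio.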
Summary
End-to-end Google Meet integration for Polyglot: a Playwright bot joins a Meet call, streams audio to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels to real names by overlapping diarization against a wall-clock speaker timeline built from Meet's live-captions DOM.

What ships
- […] (span.NWpY1d inside the role="region" aria-label="Captions" container).
- […] (app.py): /meet_bot namespace receives audio + speaker events, saves the full meeting WAV to transcripts/<name>.wav for offline retranscription, maintains a 500-entry speaker_timeline deque + _active_speaker_starts dict.
- resolve_speaker_identity() overlaps each diarization segment's wall-clock range against the timeline, picks the majority-vote name if it covers ≥ 30 % of the segment, substitutes real names directly into segments before WS emit, and fires rename_speaker for retroactive updates.

Test plan
- [BOT] Speaking: <name> fires from caption mutations (validated: Sergey Berezhnoy, Giorgio Pessina, GIT)
- [BOT] Resolved SPEAKER_XX → <name> fires on transcription (15+ resolutions observed in one session, 0 errors)
- […] SPEAKER_XX (Sergey Berezhnoy: a lot, Uzer…)
- […] writeframes patches on every call)

🤖 Generated with Claude Code