Google Meet speaker identity: bot join, audio capture, captions-driven speaker attribution by etiennechabert · Pull Request #13 · etiennechabert/polyglot

etiennechabert · 2026-04-21T05:56:41Z

Summary

End-to-end Google Meet integration for Polyglot: a Playwright bot joins a Meet call, streams audio to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels to real names by overlapping diarization against a wall-clock speaker timeline built from Meet's live-captions DOM.

What ships

Bot (meet-bot/): anonymous-guest + signed-in join flows, persistent chrome-profile, RTCPeerConnection tap → 16 kHz mono PCM streamed via Socket.IO, captions-based speaker detection (scrapes span.NWpY1d inside the role="region" aria-label="Captions" container).
Polyglot (app.py): /meet_bot namespace receives audio + speaker events, saves the full meeting WAV to transcripts/<name>.wav for offline retranscription, maintains a 500-entry speaker_timeline deque + _active_speaker_starts dict.
Phase 5 resolution: resolve_speaker_identity() overlaps each diarization segment's wall-clock range against the timeline, picks the majority-vote name if it covers ≥ 30 % of the segment, substitutes real names directly into segments before WS emit, and fires rename_speaker for retroactive updates.
Speaker-turn batching: in bot mode, batches flush on new speaker or 60 s cap. Level-based silence detection is disabled (a speaker's natural pauses don't chop their turn). Mic-only mode keeps the original silence heuristic.
Admin UI: Start bot input (accepts full URL or just the meeting ID), roster panel, live active-speaker banner, retroactive rename handler, SILENCE bar hidden in bot mode, buffer shown in seconds against the 60 s cap.
Viewer UI: speaker names rendered above each translated segment, live active-speaker banner, retroactive rename handler.
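
The Phase 5 resolution step can be sketched roughly as follows. This is a minimal illustration, not the code from app.py: the `(name, start, end)` tuple layout, the `now` argument, and the use of overlap duration as the "vote" weight are all assumptions. Only the 30 % threshold, the 500-entry deque, and the closed + still-open interval sources come from the PR description.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple

# Assumed layout of one closed timeline entry: (name, start_ts, end_ts),
# all in wall-clock seconds.
Interval = Tuple[str, float, float]

MIN_COVERAGE = 0.30  # from the PR: the winner must cover >= 30 % of the segment


def resolve_speaker_identity(
    seg_start: float,
    seg_end: float,
    timeline: Deque[Interval],
    active_starts: Dict[str, float],
    now: float,
) -> Optional[str]:
    """Return the name with the largest overlap, or None if below the threshold."""
    duration = seg_end - seg_start
    if duration <= 0:
        return None

    overlap: Dict[str, float] = defaultdict(float)

    # Closed intervals already recorded in the timeline deque.
    for name, start, end in timeline:
        overlap[name] += max(0.0, min(seg_end, end) - max(seg_start, start))

    # Still-open intervals: treat `now` as their provisional end (assumption).
    for name, start in active_starts.items():
        overlap[name] += max(0.0, min(seg_end, now) - max(seg_start, start))

    if not overlap:
        return None
    best = max(overlap, key=overlap.get)
    return best if overlap[best] / duration >= MIN_COVERAGE else None


if __name__ == "__main__":
    tl: Deque[Interval] = deque(
        [("Alice", 100.0, 104.0), ("Bob", 104.0, 105.0)], maxlen=500
    )
    print(resolve_speaker_identity(100.0, 105.0, tl, {}, now=105.0))
```

Returning `None` below the threshold leaves the SPEAKER_XX label in place, so a later pass can still fire rename_speaker once more of the timeline is known.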

Test plan

Bot joins a real meeting (signed-in, 110 participants) and streams audio
[BOT] Speaking: <name> fires from caption mutations (validated: Sergey Berezhnoy, Giorgio Pessina, GIT)
[BOT] Resolved SPEAKER_XX → <name> fires on transcription (15+ resolutions observed in one session, 0 errors)
Transcript renders real names instead of SPEAKER_XX (Sergey Berezhnoy: a lot, Uzer…)
WAV file saved with valid header (writeframes patches on every call)
Admin Start/Stop bot spawns/terminates the Node subprocess
UI layout: 3-column top row, 2-column system-stats row, no orphan visualizers
Multilingual speakers — deferred; name extraction is language-agnostic but Meet's caption text may garble non-English
Continuous/streaming re-transcription every ~10 s — deferred as a follow-up feature
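
The WAV-header item above relies on the behaviour of Python's standard `wave` module, where `writeframes` rewrites the length fields of the header when the target is seekable. The sketch below shows that with an in-memory buffer. The helper names are hypothetical and the real recording code in app.py may be structured differently.

```python
import io
import wave

SAMPLE_RATE = 16_000  # from the PR: 16 kHz mono PCM16


def open_meeting_wav(target) -> wave.Wave_write:
    """Open a mono 16-bit WAV writer on a path or seekable file-like object."""
    writer = wave.open(target, "wb")
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(SAMPLE_RATE)
    return writer


def append_pcm(writer: wave.Wave_write, pcm16: bytes) -> None:
    """Append PCM16 bytes; writeframes (unlike writeframesraw) patches the header."""
    writer.writeframes(pcm16)


if __name__ == "__main__":
    buf = io.BytesIO()
    w = open_meeting_wav(buf)
    for _ in range(2):
        append_pcm(w, bytes(640))  # one 20 ms frame of silence (320 samples)
    # Snapshot taken before close(): the header already reflects both frames.
    with wave.open(io.BytesIO(buf.getvalue()), "rb") as r:
        print(r.getnframes())
    w.close()
```

With a file on disk, the bytes still pass through Python's and the OS's buffering, so this shows the header is kept consistent by the writer, not that every byte is durable at any instant.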

🤖 Generated with Claude Code

Phase 1 of Google Meet speaker-identity integration. Playwright bot that joins a Meet URL as an unauthenticated guest and waits to be admitted. Audio capture, DOM scraping for active speaker, and Polyglot WebSocket wiring come in subsequent phases. All Meet DOM selectors are centralized in selectors.js so future UI rotations are a one-file fix. https://claude.ai/code/session_019SWkcdJekyEmJqkwSPMbPH

Complete end-to-end pipeline: the bot joins a Meet call, streams audio to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels to real display names by overlapping diarization against a wall-clock speaker timeline built from Meet's live-captions DOM.

- Bot audio capture: RTCPeerConnection init-script taps all remote audio tracks into __pgStream; AudioWorklet resamples to 16 kHz PCM16 in 20 ms frames, base64-bridged to node and forwarded to Polyglot's /meet_bot namespace.
- Speaker detection via captions: enables Meet captions via toolbar button (keyboard fallback), observes the aria-label="Captions" region, and extracts speaker names from each caption block's .NWpY1d span. Falls back to legacy data-is-speaking / aria-label signals.
- Polyglot ingest: /meet_bot Socket.IO namespace rechunks 320-sample bot frames into CHUNK_SIZE batches, maintains a 500-entry speaker_timeline deque of closed intervals, tracks _active_speaker open intervals, and records the full meeting audio to transcripts/<name>.wav for offline retranscription.
- Phase 5 resolution: resolve_speaker_identity overlaps each pyannote segment's wall-clock range against the timeline (closed + still-open), picks the majority-vote name if it covers >=30% of the segment, and emits rename_speaker WebSocket events. Resolved names replace SPEAKER_XX labels directly in transcript segments before WS emit.
- Speaker-switch batching: when the bot reports a new active speaker, process_audio flushes the current batch so each turn transcribes as one unit (up to BOT_MAX_BATCH_SEC = 60s).
- Admin UI: bot status badge, roster panel, and retroactive rename_speaker handler that rewrites SPEAKER_XX labels in-place.
- Persistent chrome-profile for Google sign-in cookies, 15s Polyglot connection timeout with auto-reconnect, forced-click fallback for the join button under overlays.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
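
The rechunking done by the /meet_bot ingest could look something like the sketch below. The class name and byte-buffer approach are illustrative assumptions. CHUNK_SIZE = 1024 samples is inferred from the "chunks * 1024 / 16000" formula in the follow-up commit and is not stated directly for this step; the 320-sample (20 ms at 16 kHz) PCM16 frame size comes from the commit message.

```python
from typing import List

BOT_FRAME_SAMPLES = 320  # 20 ms at 16 kHz, per the commit message
CHUNK_SIZE = 1024        # assumed value, inferred from "chunks * 1024 / 16000"
BYTES_PER_SAMPLE = 2     # PCM16


class Rechunker:
    """Accumulate small bot frames and emit fixed-size chunks (hypothetical helper)."""

    def __init__(self, chunk_samples: int = CHUNK_SIZE) -> None:
        self._chunk_bytes = chunk_samples * BYTES_PER_SAMPLE
        self._pending = bytearray()

    def push(self, frame: bytes) -> List[bytes]:
        """Add one frame; return every complete chunk now available."""
        self._pending.extend(frame)
        chunks: List[bytes] = []
        while len(self._pending) >= self._chunk_bytes:
            chunks.append(bytes(self._pending[: self._chunk_bytes]))
            del self._pending[: self._chunk_bytes]
        return chunks

    @property
    def pending_samples(self) -> int:
        """Samples held back until the next chunk boundary."""
        return len(self._pending) // BYTES_PER_SAMPLE


if __name__ == "__main__":
    r = Rechunker()
    frame = bytes(BOT_FRAME_SAMPLES * BYTES_PER_SAMPLE)
    emitted = [len(r.push(frame)) for _ in range(4)]
    print(emitted, r.pending_samples)
```

Since 1024 is not a multiple of 320, a remainder always carries over between frames, which is why the buffer has to persist across Socket.IO events.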

- Admin "Start bot" input + button: paste a Meet ID (or full URL) and spawn the Node bot as a subprocess from Polyglot; Stop button kills it. Backend handlers normalize the URL and track one instance at a time.
- Buffer bar now displays seconds with a fixed 60 s cap (matches BOT_MAX_BATCH_SEC), computed as chunks * 1024 / 16000.
- Drop level-based silence detection in bot mode. The only flush triggers are (1) a NEW speaker starting and (2) the 60 s cap. A single speaker's natural pauses fire speaker_end/speaker_start toggles that we deliberately ignore, so their turn stays one batch. Mic-only mode keeps the original level-based silence heuristic as a fallback.
- Live "currently speaking" banner on admin + viewer, driven by a new active_speakers socket event broadcast whenever _active_speaker_starts changes.
- Viewer now renders the speaker name above each translated segment and handles rename_speaker retroactively, so late name resolutions update already-displayed rows in place.
- Admin UI fixes: removed the orphan header audio-visualizer strip (was rendering outside any container), restored the full AUDIO SIGNAL panel as the third column of the top row, forced minmax(0, 1fr) on the two system-stats grids so VRAM and System Resources keep equal widths, hid the SILENCE bar when the bot is connected since it no longer drives flushing there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
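
The two bot-mode flush triggers and the buffer-bar formula can be sketched as below. The function names and the way the current batch's speaker is tracked are assumptions for illustration; the constants and the "new speaker or 60 s cap" rule come from the commit message.

```python
from typing import Optional

SAMPLE_RATE = 16_000
CHUNK_SIZE = 1024          # samples per chunk, per "chunks * 1024 / 16000"
BOT_MAX_BATCH_SEC = 60.0   # cap named in the commit message


def buffered_seconds(chunks: int) -> float:
    """Seconds of audio in the batch, as shown on the admin buffer bar."""
    return chunks * CHUNK_SIZE / SAMPLE_RATE


def should_flush(
    batch_speaker: Optional[str],
    event_speaker: Optional[str],
    chunks: int,
) -> bool:
    """Decide whether to flush the current batch in bot mode.

    batch_speaker: who the buffered audio is attributed to (hypothetical state).
    event_speaker: name from a speaker_start event, or None when no event fired.
    """
    if chunks == 0:
        return False  # nothing buffered yet
    if buffered_seconds(chunks) >= BOT_MAX_BATCH_SEC:
        return True   # trigger 2: the 60 s cap
    # Trigger 1: a different speaker starts. The same speaker resuming after
    # a pause is ignored so the turn stays in one batch.
    return (
        batch_speaker is not None
        and event_speaker is not None
        and event_speaker != batch_speaker
    )


if __name__ == "__main__":
    print(buffered_seconds(100))
    print(should_flush("Alice", "Alice", 100), should_flush("Alice", "Bob", 100))
```

At 1024 samples per chunk the cap is reached on the 938th chunk (about 60.03 s), so a batch can overshoot 60 s by a fraction of one chunk.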

claude and others added 3 commits April 21, 2026 05:38

etiennechabert changed the title ~~Google Meet speaker identity — phase 1: bot join scaffold~~ Google Meet speaker identity: bot join, audio capture, captions-driven speaker attribution Apr 21, 2026

etiennechabert marked this pull request as ready for review April 21, 2026 10:07

etiennechabert merged commit 39e66a1 into main Apr 21, 2026
2 of 6 checks passed

etiennechabert deleted the claude/google-meet-speaker-id-vkz3L branch April 21, 2026 10:07
