Google Meet speaker identity: bot join, audio capture, captions-driven speaker attribution #13

Merged
etiennechabert merged 3 commits into main from claude/google-meet-speaker-id-vkz3L
Apr 21, 2026
@etiennechabert etiennechabert commented Apr 21, 2026

Summary

End-to-end Google Meet integration for Polyglot: a Playwright bot joins a Meet call, streams audio to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels to real names by overlapping diarization against a wall-clock speaker timeline built from Meet's live-captions DOM.

What ships

  • Bot (meet-bot/): anonymous-guest + signed-in join flows, persistent chrome-profile, RTCPeerConnection tap → 16 kHz mono PCM streamed via Socket.IO, captions-based speaker detection (scrapes span.NWpY1d inside the role="region" aria-label="Captions" container).
  • Polyglot (app.py): /meet_bot namespace receives audio + speaker events, saves the full meeting WAV to transcripts/<name>.wav for offline retranscription, maintains a 500-entry speaker_timeline deque + _active_speaker_starts dict.
  • Phase 5 resolution: resolve_speaker_identity() overlaps each diarization segment's wall-clock range against the timeline, picks the majority-vote name if it covers ≥ 30 % of the segment, substitutes real names directly into segments before WS emit, and fires rename_speaker for retroactive updates.
  • Speaker-turn batching: in bot mode, batches flush on new speaker or 60 s cap. Level-based silence detection is disabled (a speaker's natural pauses don't chop their turn). Mic-only mode keeps the original silence heuristic.
  • Admin UI: Start bot input (accepts full URL or just the meeting ID), roster panel, live active-speaker banner, retroactive rename handler, SILENCE bar hidden in bot mode, buffer shown in seconds against the 60 s cap.
  • Viewer UI: speaker names rendered above each translated segment, live active-speaker banner, retroactive rename handler.
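
The Phase 5 overlap rule can be sketched as follows. This is a minimal illustration, not the code in app.py: the function name `resolve_speaker`, the `(name, start, end)` tuple layout, and the `open_starts` and `now` parameters are assumptions made for the example, and "majority vote" is read here as "the name with the largest total overlap".

```python
from collections import Counter, deque
from typing import Deque, Dict, Optional, Tuple

# Hypothetical layout: (name, start, end) closed intervals in wall-clock seconds.
Interval = Tuple[str, float, float]


def resolve_speaker(
    seg_start: float,
    seg_end: float,
    timeline: Deque[Interval],
    open_starts: Dict[str, float],
    now: float,
    min_coverage: float = 0.30,
) -> Optional[str]:
    """Return the best-overlapping name, or None if it covers too little of the segment."""
    duration = seg_end - seg_start
    if duration <= 0:
        return None

    overlap: Counter = Counter()
    # Closed intervals already recorded in the timeline.
    for name, start, end in timeline:
        overlap[name] += max(0.0, min(seg_end, end) - max(seg_start, start))
    # Still-open intervals are treated as running until `now`.
    for name, start in open_starts.items():
        overlap[name] += max(0.0, min(seg_end, now) - max(seg_start, start))

    if not overlap:
        return None
    name, seconds = overlap.most_common(1)[0]
    return name if seconds / duration >= min_coverage else None


timeline: Deque[Interval] = deque(maxlen=500)  # 500-entry cap, as in the PR
timeline.append(("Alice", 100.0, 104.0))
timeline.append(("Bob", 104.0, 105.0))
print(resolve_speaker(100.5, 104.5, timeline, {}, now=110.0))  # Alice
```

When the function returns None, the segment would keep its SPEAKER_XX label until a later resolution fires rename_speaker.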

Test plan

  • Bot joins a real meeting (signed-in, 110 participants) and streams audio
  • [BOT] Speaking: <name> fires from caption mutations (validated: Sergey Berezhnoy, Giorgio Pessina, GIT)
  • [BOT] Resolved SPEAKER_XX → <name> fires on transcription (15+ resolutions observed in one session, 0 errors)
  • Transcript renders real names instead of SPEAKER_XX (Sergey Berezhnoy: a lot, Uzer…)
  • WAV file saved with a valid header (writeframes patches the header on every call)
  • Admin Start/Stop bot spawns/terminates the Node subprocess
  • UI layout: 3-column top row, 2-column system-stats row, no orphan visualizers
  • Multilingual speakers — deferred; name extraction is language-agnostic but Meet's caption text may garble non-English
  • Continuous/streaming re-transcription every ~10 s — deferred as a follow-up feature
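
The WAV-header item can be reproduced with Python's standard `wave` module. The sketch below is an illustration under assumptions (an in-memory `io.BytesIO` instead of transcripts/<name>.wav, and a hypothetical helper `frames_in_header`); it shows that the frame count in the header is already correct after each `writeframes` call, before the writer is closed.

```python
import io
import wave

SAMPLE_RATE = 16_000  # 16 kHz mono PCM16, as streamed by the bot


def frames_in_header(buf: io.BytesIO) -> int:
    """Parse a snapshot of the bytes written so far and return the header's frame count."""
    with wave.open(io.BytesIO(buf.getvalue()), "rb") as reader:
        return reader.getnframes()


buf = io.BytesIO()
writer = wave.open(buf, "wb")
writer.setnchannels(1)
writer.setsampwidth(2)
writer.setframerate(SAMPLE_RATE)

chunk = b"\x00\x00" * 320  # one 20 ms frame of silence
counts = []
for _ in range(3):
    # writeframes (unlike writeframesraw) keeps the header sizes in sync.
    writer.writeframes(chunk)
    counts.append(frames_in_header(buf))
writer.close()

print(counts)  # [320, 640, 960]
```

A recording written this way stays readable even if the process is killed mid-meeting, which is the property the test item checks.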

🤖 Generated with Claude Code

claude and others added 3 commits April 21, 2026 05:38
Phase 1 of Google Meet speaker-identity integration. Playwright bot that
joins a Meet URL as an unauthenticated guest and waits to be admitted.
Audio capture, DOM scraping for active speaker, and Polyglot WebSocket
wiring come in subsequent phases.

All Meet DOM selectors are centralized in selectors.js so future UI
rotations are a one-file fix.

https://claude.ai/code/session_019SWkcdJekyEmJqkwSPMbPH

Complete end-to-end pipeline: the bot joins a Meet call, streams audio
to Polyglot over Socket.IO, and resolves pyannote's SPEAKER_XX labels
to real display names by overlapping diarization against a wall-clock
speaker timeline built from Meet's live-captions DOM.

- Bot audio capture: RTCPeerConnection init-script taps all remote
  audio tracks into __pgStream; AudioWorklet resamples to 16 kHz PCM16
  in 20 ms frames, base64-bridged to node and forwarded to Polyglot's
  /meet_bot namespace.
- Speaker detection via captions: enables Meet captions via toolbar
  button (keyboard fallback), observes the aria-label="Captions" region,
  and extracts speaker names from each caption block's .NWpY1d span.
  Falls back to legacy data-is-speaking / aria-label signals.
- Polyglot ingest: /meet_bot Socket.IO namespace rechunks 320-sample
  bot frames into CHUNK_SIZE batches, maintains a 500-entry
  speaker_timeline deque of closed intervals, tracks _active_speaker
  open intervals, and records the full meeting audio to
  transcripts/<name>.wav for offline retranscription.
- Phase 5 resolution: resolve_speaker_identity overlaps each pyannote
  segment's wall-clock range against the timeline (closed + still-open),
  picks the majority-vote name if it covers >=30% of the segment, and
  emits rename_speaker WebSocket events. Resolved names replace
  SPEAKER_XX labels directly in transcript segments before WS emit.
- Speaker-switch batching: when the bot reports a new active speaker,
  process_audio flushes the current batch so each turn transcribes as
  one unit (up to BOT_MAX_BATCH_SEC = 60s).
- Admin UI: bot status badge, roster panel, and retroactive
  rename_speaker handler that rewrites SPEAKER_XX labels in-place.
- Persistent chrome-profile for Google sign-in cookies, 15s Polyglot
  connection timeout with auto-reconnect, forced-click fallback for
  the join button under overlays.
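
The rechunking step in the ingest bullet above can be sketched like this. It is an illustration only: the class name `Rechunker` is hypothetical, and `CHUNK_SIZE = 1024` samples is an assumption inferred from the buffer-bar formula (chunks * 1024 / 16000), not a value confirmed from app.py.

```python
from typing import List

CHUNK_SIZE = 1024      # samples per Polyglot chunk (assumed)
FRAME_SAMPLES = 320    # 20 ms at 16 kHz, as sent by the bot
BYTES_PER_SAMPLE = 2   # PCM16


class Rechunker:
    """Accumulate small PCM16 frames and emit fixed-size chunks."""

    def __init__(self, chunk_samples: int = CHUNK_SIZE) -> None:
        self._chunk_bytes = chunk_samples * BYTES_PER_SAMPLE
        self._pending = bytearray()

    def push(self, frame: bytes) -> List[bytes]:
        """Add one frame; return every complete chunk now available."""
        self._pending.extend(frame)
        chunks: List[bytes] = []
        while len(self._pending) >= self._chunk_bytes:
            chunks.append(bytes(self._pending[: self._chunk_bytes]))
            del self._pending[: self._chunk_bytes]
        return chunks

    @property
    def pending_samples(self) -> int:
        return len(self._pending) // BYTES_PER_SAMPLE


rechunker = Rechunker()
emitted: List[bytes] = []
for _ in range(7):  # 7 frames = 2240 samples
    emitted.extend(rechunker.push(bytes(FRAME_SAMPLES * BYTES_PER_SAMPLE)))

print(len(emitted), rechunker.pending_samples)  # 2 192
```

Because 320 does not divide 1024, leftover samples carry over between frames; the byte buffer keeps that remainder until the next chunk fills.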

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Admin "Start bot" input + button: paste a Meet ID (or full URL) and
  spawn the Node bot as a subprocess from Polyglot; Stop button kills
  it. Backend handlers normalize the URL and track one instance at a
  time.
- Buffer bar now displays seconds with a fixed 60 s cap (matches
  BOT_MAX_BATCH_SEC), computed as chunks * 1024 / 16000.
- Drop level-based silence detection in bot mode. The only flush
  triggers are (1) a NEW speaker starting and (2) the 60 s cap. A
  single speaker's natural pauses fire speaker_end/speaker_start
  toggles that we deliberately ignore, so their turn stays one batch.
  Mic-only mode keeps the original level-based silence heuristic as
  a fallback.
- Live "currently speaking" banner on admin + viewer, driven by a
  new active_speakers socket event broadcast whenever
  _active_speaker_starts changes.
- Viewer now renders the speaker name above each translated segment
  and handles rename_speaker retroactively, so late name resolutions
  update already-displayed rows in place.
- Admin UI fixes: removed the orphan header audio-visualizer strip
  (was rendering outside any container), restored the full AUDIO
  SIGNAL panel as the third column of the top row, forced
  minmax(0, 1fr) on the two system-stats grids so VRAM and System
  Resources keep equal widths, hid the SILENCE bar when the bot is
  connected since it no longer drives flushing there.
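
The bot-mode flush rules and the buffer-bar arithmetic described above can be sketched together. This is a hedged illustration, not the process_audio implementation: `TurnBatcher`, `on_speaker_start`, `on_chunk`, and `buffered_seconds` are hypothetical names, and speaker_end events are simply not handled, mirroring the "deliberately ignore" behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLE_RATE = 16_000
CHUNK_SIZE = 1024         # samples per chunk, from the buffer-bar formula
BOT_MAX_BATCH_SEC = 60


def buffered_seconds(chunks: int) -> float:
    """Buffer length in seconds: chunks * 1024 / 16000."""
    return chunks * CHUNK_SIZE / SAMPLE_RATE


@dataclass
class TurnBatcher:
    current_speaker: Optional[str] = None
    chunks: List[bytes] = field(default_factory=list)

    def on_speaker_start(self, name: str) -> Optional[List[bytes]]:
        """Flush only when a different speaker starts; same-speaker restarts are ignored."""
        if name == self.current_speaker:
            return None
        flushed = self._flush()
        self.current_speaker = name
        return flushed

    def on_chunk(self, chunk: bytes) -> Optional[List[bytes]]:
        """Buffer a chunk; flush once the batch reaches the 60 s cap."""
        self.chunks.append(chunk)
        if buffered_seconds(len(self.chunks)) >= BOT_MAX_BATCH_SEC:
            return self._flush()
        return None

    def _flush(self) -> Optional[List[bytes]]:
        batch, self.chunks = self.chunks, []
        return batch or None


batcher = TurnBatcher()
batcher.on_speaker_start("Alice")
for _ in range(3):
    batcher.on_chunk(b"\x00" * 2048)
same = batcher.on_speaker_start("Alice")   # pause then resume: no flush
switch = batcher.on_speaker_start("Bob")   # new speaker: flush Alice's turn
print(same, len(switch), buffered_seconds(125))  # None 3 8.0
```

With these constants the cap trips on the 938th chunk (about 60.03 s), since 937 chunks is still just under 60 s.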

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@etiennechabert etiennechabert changed the title from "Google Meet speaker identity — phase 1: bot join scaffold" to "Google Meet speaker identity: bot join, audio capture, captions-driven speaker attribution" Apr 21, 2026
@etiennechabert etiennechabert marked this pull request as ready for review April 21, 2026 10:07
@etiennechabert etiennechabert merged commit 39e66a1 into main Apr 21, 2026
2 of 6 checks passed
@etiennechabert etiennechabert deleted the claude/google-meet-speaker-id-vkz3L branch April 21, 2026 10:07
2 participants