feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #401

Open
Overbaker wants to merge 21 commits into tiann:main from Overbaker:feat/pluggable-voice-backend

Conversation

@Overbaker

Summary

Add a pluggable voice backend architecture that extends the existing ElevenLabs ConvAI integration with two new voice providers:

  • Gemini 2.5 Live (gemini-live): Google's real-time audio streaming API via WebSocket, with full function calling support for messageCodingAgent and processPermissionRequest
  • Qwen Realtime (qwen-realtime): Alibaba's DashScope real-time voice API via Hub WebSocket proxy, supporting voice conversation (function calling pending model support)

Users can switch backends via the VOICE_BACKEND environment variable. The existing ElevenLabs integration remains the default and is completely unchanged.

Key Design Decisions

  • Runtime discovery: GET /voice/backend lets the frontend detect the active backend without Vite rebuild
  • Code splitting: React.lazy() ensures alternative backends are only loaded when active
  • Zero upstream breakage: All original ElevenLabs code paths untouched; new code is additive
  • Inline AudioWorklet: Uses Blob URL instead of Vite ?url import to avoid MIME type issues in production builds
  • Qwen WebSocket proxy: Hub proxies Qwen connections at /api/voice/qwen-ws because browser WebSocket API cannot set Authorization headers
  • Barge-in prevention: Auto-mutes microphone during model speech to prevent ambient noise from interrupting responses
  • PWA immediate activation: Added skipWaiting + clientsClaim to service worker for instant deployment updates
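The Blob-URL worklet trick from the design notes above can be sketched like this (illustrative only; `buildRecorderWorkletSource` and the `'pcm-recorder'` name are assumptions, not the PR's actual identifiers). The worklet source is kept as a plain-JS string and loaded through a Blob URL, so the bundler never rewrites it or serves it with a wrong MIME type:

```typescript
// Sketch only: identifiers below (buildRecorderWorkletSource, 'pcm-recorder')
// are illustrative, not the PR's real names.
function buildRecorderWorkletSource(processorName: string): string {
  // Plain JavaScript, so no TypeScript compilation or bundler transform is involved.
  return `
class PcmRecorderProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0]
    if (channel) this.port.postMessage(channel.slice(0))
    return true
  }
}
registerProcessor('${processorName}', PcmRecorderProcessor)
`.trim()
}

// In the browser (inside the recorder setup):
//   const blob = new Blob([buildRecorderWorkletSource('pcm-recorder')],
//                         { type: 'application/javascript' })
//   const url = URL.createObjectURL(blob)
//   await audioContext.audioWorklet.addModule(url)
//   URL.revokeObjectURL(url)
```

Because the source never passes through the module graph, a `?url` import cannot mislabel it (e.g. as video/mp2t) in production builds.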

Configuration

# Gemini Live (recommended - free tier, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key

# Qwen Realtime (voice-only, function calling not yet supported by model)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key

# ElevenLabs (default, unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
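The selection these variables drive can be sketched as a small resolver (`resolveBackend` is a hypothetical helper; the three backend ids and the ElevenLabs default match the PR description):

```typescript
// Sketch: resolveBackend is hypothetical; the backend ids and the
// ElevenLabs fallback are the contract described in this PR.
type VoiceBackendType = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'

function resolveBackend(env: string | undefined): VoiceBackendType {
  switch (env) {
    case 'gemini-live':
    case 'qwen-realtime':
      return env
    default:
      // Unset or unrecognized values fall back to the existing ElevenLabs flow.
      return 'elevenlabs'
  }
}
```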

Files Changed

  • Shared (shared/src/voice.ts): voice backend types, Gemini/Qwen model constants, tool-optimized system prompt
  • Hub routes (hub/src/web/routes/voice.ts): backend discovery + token endpoints for Gemini & Qwen
  • Hub server (hub/src/web/server.ts): Qwen WebSocket proxy handler
  • Web API (web/src/api/client.ts, voice.ts): client functions for the new endpoints
  • Gemini session (web/src/realtime/GeminiLiveVoiceSession.tsx): full Gemini Live implementation (WebSocket + AudioWorklet)
  • Qwen session (web/src/realtime/QwenVoiceSession.tsx): Qwen Realtime implementation (OpenAI-compatible protocol)
  • Audio pipeline (web/src/realtime/gemini/): PCM utils, AudioWorklet recorder, 24 kHz player, tool adapter
  • Switcher (web/src/realtime/VoiceBackendSession.tsx): dynamic backend selector with lazy loading
  • Integration (web/src/components/SessionChat.tsx): uses VoiceBackendSession instead of RealtimeVoiceSession
  • PWA (web/src/sw.ts): skipWaiting + clientsClaim
  • Tests (hub/src/web/routes/voice.test.ts, pcmUtils.test.ts, toolAdapter.test.ts): 16 test cases

Test Plan

  • ElevenLabs backend still works (no code changes to existing paths)
  • Gemini Live: voice conversation works
  • Gemini Live: function calling (messageCodingAgent) triggers correctly
  • Gemini Live: barge-in prevention (no mid-speech interruption from noise)
  • Qwen Realtime: voice conversation works via Hub WebSocket proxy
  • Hub route tests pass (backend discovery, token endpoints)
  • PCM audio conversion round-trip tests pass
  • Tool adapter tests pass
  • TypeScript type-check passes for both hub and web
  • Test on mobile browsers (iOS Safari, Android Chrome)


@github-actions (bot) left a comment


Findings

  • [Blocker] Qwen WebSocket proxy bypasses API auth and can be opened without a JWT, which lets any reachable client consume the hub's DashScope credentials through /api/voice/qwen-ws. Evidence hub/src/web/server.ts:328.
  • [Major] The fallback voice backend is now gemini-live, so existing installs that only configured ElevenLabs will be routed away from the existing token flow and fail voice startup. Evidence shared/src/voice.ts:280, hub/src/web/routes/voice.ts:121.
  • [Major] The Qwen frontend still requires the hub to return a raw DashScope key even though the browser never uses it after switching to the hub WebSocket proxy, so every authenticated web client now receives a long-lived provider secret unnecessarily. Evidence web/src/realtime/QwenVoiceSession.tsx:84, hub/src/web/routes/voice.ts:162.

Summary
Review mode: initial
Three findings. Added coverage does not exercise the new /api/voice/qwen-ws auth boundary, and the new route tests would not catch the default-backend regression because they do not assert against DEFAULT_VOICE_BACKEND.

Testing

  • Not run (automation): bun is not installed in this runner.

HAPI Bot

Comment thread hub/src/web/server.ts
return socketHandler.fetch(req, server as never)
}
// Qwen Realtime WebSocket proxy
if (url.pathname === '/api/voice/qwen-ws') {

[BLOCKER] This branch upgrades /api/voice/qwen-ws before the request ever reaches app.fetch(req), so it bypasses createAuthMiddleware() entirely. As written, any client that can reach the hub can open a proxied DashScope session with the server-side API key.

Suggested fix:

const token = url.searchParams.get('token')
if (!token) {
    return new Response('Missing authorization token', { status: 401 })
}

await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
const upgraded = server.upgrade(req, {
    data: { _qwenProxy: true, apiKey, model }
})


Fixed in f5cbd0e.

Both /api/voice/gemini-ws and /api/voice/qwen-ws now require a ?token=<JWT> query parameter. The fetch handler verifies the token via jwtVerify(token, jwtSecret, { algorithms: ['HS256'] }) before calling server.upgrade().

Frontend voice sessions now attach the JWT from ApiClient.getAuthToken() to the WebSocket URL.

Comment thread shared/src/voice.ts Outdated
export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime'
export const QWEN_REALTIME_VOICE = 'Mia'

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'gemini-live'

[MAJOR] This flips the implicit backend from ElevenLabs to Gemini Live. On a hub that only has ELEVENLABS_API_KEY configured, /api/voice/backend now resolves to gemini-live, and VoiceBackendSession will send users into /api/voice/gemini-token instead of the existing ElevenLabs flow.

Suggested fix:

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs'


Fixed in f5cbd0e.

DEFAULT_VOICE_BACKEND reverted to 'elevenlabs'. Hubs with only ELEVENLABS_API_KEY configured will continue to work as before. Users who want Gemini Live or Qwen can set VOICE_BACKEND=gemini-live or VOICE_BACKEND=qwen-realtime explicitly.

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated

// Get API key from hub
const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed || !tokenResp.apiKey) {

[MAJOR] The browser does not use this provider key after the Qwen proxy change. startSession() only checks that apiKey exists, then opens /api/voice/qwen-ws, where the hub injects its own Authorization header. Returning a long-lived DashScope key here needlessly leaks it to every authenticated web client.

Suggested fix:

const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed) {
    const msg = tokenResp.error ?? 'DashScope API key not available'
    state.statusCallback?.('error', msg)
    throw new Error(msg)
}
state.wsBaseUrl = tokenResp.wsUrl || null


Fixed in f5cbd0e.

/voice/qwen-token now returns { allowed: true, wsUrl } only — no apiKey field. The DashScope key stays server-side in the WebSocket proxy. Frontend QwenVoiceSession updated accordingly: it checks allowed, gets wsUrl, and connects to the hub proxy with a JWT token.

Same pattern already used for Gemini Live (apiKey: 'proxied').
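A minimal sketch of that token contract (`buildQwenTokenResponse` is a hypothetical helper; the `{ allowed, wsUrl }` shape with no `apiKey` field is the contract described above):

```typescript
// Sketch: the builder function is hypothetical; the response shape
// ({ allowed, wsUrl }, never apiKey) follows the fix described above.
interface QwenTokenResponse {
  allowed: boolean
  wsUrl?: string
  error?: string
}

function buildQwenTokenResponse(hasDashScopeKey: boolean, origin: string): QwenTokenResponse {
  if (!hasDashScopeKey) {
    return { allowed: false, error: 'DashScope API key not configured' }
  }
  // The provider key stays server-side; the client only learns the proxy URL.
  return {
    allowed: true,
    wsUrl: origin.replace(/^http/, 'ws') + '/api/voice/qwen-ws',
  }
}
```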

@tiann (Owner) left a comment


Thank you for your contribution. I believe this is a good feature. Please fix the comments first.

Add a strategy-based voice backend architecture that allows switching
between ElevenLabs ConvAI and Gemini Live API via VOICE_BACKEND env var.

- Shared: VoiceBackendType, Gemini Live config builder, tool definitions
- Hub: GET /voice/backend discovery, POST /voice/gemini-token with proxy support
- Web: GeminiLiveVoiceSession (WebSocket + AudioWorklet audio pipeline),
  VoiceBackendSession dynamic switcher with React.lazy() code splitting,
  Gemini tool adapter bridging existing client tools
- Tests: hub route tests, pcmUtils round-trip tests, toolAdapter tests
- Zero changes to existing ElevenLabs code paths
- System prompt instructs assistant to respond in Mandarin
- First message changed to Chinese greeting
- ElevenLabs language set to 'zh'
Vite inlined the worklet as a data URI with wrong MIME type (video/mp2t)
and uncompiled TypeScript, causing AudioWorklet.addModule() to fail.
Use Blob URL with plain JS source instead.
gemini-3.1-flash-live-preview does not accept clientContent text input,
only audio input. gemini-2.5-flash-native-audio-latest supports both.
- Shared: add 'qwen-realtime' backend type, model/voice constants
- Hub: POST /voice/qwen-token route (DASHSCOPE_API_KEY / QWEN_API_KEY)
- Web: QwenVoiceSession using DashScope Realtime WebSocket API
  (OpenAI-compatible protocol: session.update, input_audio_buffer,
  response.audio.delta, function calling via conversation.item.create)
- VoiceBackendSession: lazy-load Qwen component
- Tests: qwen-token route tests (3 cases)

Switch via VOICE_BACKEND=qwen-realtime + DASHSCOPE_API_KEY=xxx
Without this, new deployments required users to close all tabs before
the updated Service Worker would activate and serve new assets.
- Hub: add WebSocket proxy at /api/voice/qwen-ws that injects
  Authorization header (browser WebSocket can't set custom headers)
- Switch from qwen3.5-omni-plus-realtime (access denied / invite-only)
  to qwen3-omni-flash-realtime (publicly available)
- Frontend connects via Hub proxy instead of direct to DashScope
- Change DEFAULT_VOICE_BACKEND from elevenlabs to qwen-realtime
- Change QWEN_REALTIME_VOICE from Cherry to Mia
- Increase maxHttpBufferSize to 55MB to match upload limit
- Mute mic while model is speaking to prevent noise-triggered barge-in
- Skip audio sending during model turn to avoid false interrupts
- Remove greeting prompt that blocked subsequent tool calls
- Add critical tool usage instruction at top of system prompt
- Always log tool call and messageCodingAgent events (not just DEV mode)
clientContent greeting creates a conversation turn that pushes the model
into "chat mode", breaking subsequent tool calls. Instead, instruct the
model to greet naturally when the user speaks first.
…g API keys

- Add JWT token verification to /api/voice/gemini-ws and /api/voice/qwen-ws
  WebSocket endpoints before upgrading, preventing unauthenticated access to
  server-side API credentials (BLOCKER)
- Revert DEFAULT_VOICE_BACKEND to 'elevenlabs' so existing installs that only
  configured ElevenLabs are not broken (MAJOR)
- Remove raw DashScope API key from /voice/qwen-token response; the hub proxy
  handles the key server-side, so the browser never needs it (MAJOR)
- Update frontend to pass JWT via ?token= query param on WS connections
- Change composer send shortcut from Enter to Ctrl/Cmd+Enter
@Overbaker force-pushed the feat/pluggable-voice-backend branch from 5578f22 to f5cbd0e on April 21, 2026 at 08:55

@github-actions (bot) left a comment


Findings

  • [Major] Voice can drop the first tap while backend discovery is still loading — VoiceBackendSession now renders null until GET /voice/backend completes, but SessionChat still enables the voice button whenever the provider exists. If the user taps immediately, startRealtimeSession() sees no registered session and returns without changing status, so the action is silently lost. Evidence web/src/realtime/VoiceBackendSession.tsx:35 and web/src/components/SessionChat.tsx:426.
    Suggested fix:
    const [voiceBackendReady, setVoiceBackendReady] = useState(false)
    
    <HappyComposer
        onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
        onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
    />
    
    <VoiceBackendSession
        api={props.api}
        micMuted={voice.micMuted}
        onStatusChange={voice.setStatus}
        onReadyChange={setVoiceBackendReady}
    />

Summary
Review mode: follow-up after new commits
One finding. The previously reported Qwen unauthenticated WebSocket proxy and raw DashScope key exposure are fixed on this head. Residual testing gap: no coverage for the async backend discovery/voice-button readiness path.

Testing

  • Not run (automation)

HAPI Bot

return () => { cancelled = true }
}, [props.api])

if (!backend) return null

[MAJOR] VoiceBackendSession now renders nothing until backend discovery finishes, but SessionChat still wires onVoiceToggle as soon as the voice provider exists. During that initial window no registerVoiceSession() has run, so a quick tap calls startRealtimeSession(), hits the “No voice session registered” branch, and silently drops the user's first start attempt.

Suggested fix:

const [voiceBackendReady, setVoiceBackendReady] = useState(false)

<HappyComposer
    onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
    onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
/>

<VoiceBackendSession
    api={props.api}
    micMuted={voice.micMuted}
    onStatusChange={voice.setStatus}
    onReadyChange={setVoiceBackendReady}
/>


Fixed in e32c1f6.

Added onReadyChange callback to VoiceBackendSession — it fires true after fetchVoiceBackend() resolves. SessionChat now gates onVoiceToggle / onVoiceMicToggle behind voiceBackendReady state, so the button stays disabled until registerVoiceSession() has completed.

VoiceBackendSession now fires onReadyChange(true) after backend
discovery completes. SessionChat disables the voice toggle until
ready, preventing silent drops when the user taps before
registerVoiceSession() has run.

@github-actions (bot) left a comment


Findings

  • [Blocker] Qwen default proxy URL is concatenated twice, so the backend cannot connect with the route contract returned by the hub. POST /api/voice/qwen-token now returns a full proxy URL ending in /api/voice/qwen-ws, but the client still appends /api/voice/qwen-ws again before opening the socket. Evidence web/src/realtime/QwenVoiceSession.tsx:106, web/src/realtime/QwenVoiceSession.tsx:109, hub/src/web/routes/voice.ts:167.
    Suggested fix:
    const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
    const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
  • [Major] The first voice tap can still be dropped for Gemini/Qwen because readiness flips before the lazy backend module has mounted and called registerVoiceSession(). VoiceBackendSession now enables the button immediately after backend discovery, but both non-ElevenLabs backends still sit behind React.lazy() with a null fallback. On a cold chunk load, startRealtimeSession() still hits the unregistered-session branch. Evidence web/src/realtime/VoiceBackendSession.tsx:9, web/src/realtime/VoiceBackendSession.tsx:30, web/src/realtime/VoiceBackendSession.tsx:46.
    Suggested fix:
    // only mark the backend ready after the selected session component has mounted
    // and finished registerVoiceSession(...)
    <GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />
  • [Major] The new hub route tests still assert the pre-proxy token contract instead of the responses this PR now returns. Gemini no longer exposes the raw key, and Qwen now returns only wsUrl, so these assertions are out of sync with the implementation and will fail or validate the wrong API surface. Evidence hub/src/web/routes/voice.test.ts:88, hub/src/web/routes/voice.test.ts:99, hub/src/web/routes/voice.test.ts:133, hub/src/web/routes/voice.test.ts:144.
    Suggested fix:
    const body = await res.json() as { allowed: boolean; wsUrl: string }
    expect(body.allowed).toBe(true)
    expect(body.wsUrl).toContain('/api/voice/qwen-ws')

Summary
Review mode: follow-up after new commits
Three findings. The prior backend-discovery gating issue is only partially fixed: lazy backends can still lose the first start click. Qwen also appears broken by default because the proxy path is appended twice. Test coverage is present, but the new route tests are still checking the old token shape.

Testing

  • Not run (automation)

HAPI Bot

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated
const proxyBase = state.wsBaseUrl || `${protocol}//${window.location.host}`
const model = QWEN_REALTIME_MODEL
const authToken = this.api.getAuthToken() || ''
const wsUrl = `${proxyBase}/api/voice/qwen-ws?model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`

[BLOCKER] fetchQwenToken() now returns a full proxy URL ending in /api/voice/qwen-ws, but this code appends /api/voice/qwen-ws again before opening the socket. With the default hub response the browser ends up dialing .../api/voice/qwen-ws/api/voice/qwen-ws?..., so Qwen never reaches session.created.

Suggested fix:

const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`


Fixed in fbf315b.

wsBaseUrl from the hub now contains the full proxy path. The client uses it directly as proxyUrl and only appends query params (model= and token=), no more path duplication.
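The corrected URL construction can be sketched as follows (`buildProxyWsUrl` is an illustrative name; the behavior, appending only query parameters to a proxy URL that already contains the path, is the fix described above):

```typescript
// Sketch: buildProxyWsUrl is hypothetical. The key point from the fix above is
// that the proxy URL already ends in /api/voice/qwen-ws, so the client appends
// only query parameters, never the path a second time.
function buildProxyWsUrl(proxyUrl: string, model: string, token: string): string {
  const sep = proxyUrl.includes('?') ? '&' : '?'
  return `${proxyUrl}${sep}model=${encodeURIComponent(model)}&token=${encodeURIComponent(token)}`
}
```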

fetchVoiceBackend(props.api).then((resp) => {
if (!cancelled) {
setBackend(resp.backend)
props.onReadyChange?.(true)

[MAJOR] This still marks the voice backend ready before the lazy-loaded Gemini/Qwen component has mounted and called registerVoiceSession(). On a cold chunk load the button becomes clickable while startRealtimeSession() would still hit the "No voice session registered" branch, so the first tap can still be lost.

Suggested fix:

// only call onReadyChange(true) after the selected backend component mounts
// and completes registerVoiceSession(...)
<GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />


Fixed in fbf315b.

onReadyChange(true) is no longer called from VoiceBackendSession after discovery. Instead, each backend component (Gemini/Qwen/ElevenLabs) now accepts an onRegistered callback and fires it after registerVoiceSession() succeeds inside their mount useEffect. VoiceBackendSession passes handleRegistered → props.onReadyChange(true) down to the selected child, so the button only enables after the lazy chunk has loaded, mounted, and completed registration.
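The registration-gated readiness can be sketched without React as a small state holder (`VoiceReadiness` is purely illustrative; only the register-then-enable ordering comes from the fix):

```typescript
// Sketch: VoiceReadiness is a hypothetical stand-in for the React state wiring.
// The invariant from the fix above: the voice button may only start a session
// after the backend component has completed registerVoiceSession().
class VoiceReadiness {
  private ready = false
  constructor(private onChange: (ready: boolean) => void) {}

  // Called by the mounted backend component once registration succeeds.
  markRegistered(): void {
    this.ready = true
    this.onChange(true)
  }

  // Guard checked before startRealtimeSession() would be invoked.
  canStart(): boolean {
    return this.ready
  }
}
```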

Comment thread hub/src/web/routes/voice.test.ts Outdated
const app = createApp()
const res = await app.request('/api/voice/gemini-token', { method: 'POST' })
expect(res.status).toBe(200)
const body = await res.json() as { allowed: boolean; apiKey: string }

[MAJOR] These assertions are still checking the old token contract. POST /api/voice/gemini-token now returns a proxied websocket contract, and POST /api/voice/qwen-token returns only wsUrl, so the new tests are out of sync with the implementation and will fail or validate the wrong API surface.

Suggested fix:

const body = await res.json() as { allowed: boolean; wsUrl: string }
expect(body.allowed).toBe(true)
expect(body.wsUrl).toContain('/api/voice/gemini-ws')


Fixed in fbf315b.

Test assertions updated to match the new proxy contract:

  • gemini-token: asserts apiKey === 'proxied' and wsUrl contains /api/voice/gemini-ws
  • qwen-token: asserts allowed === true, wsUrl contains /api/voice/qwen-ws, and apiKey property does not exist

All 10 tests pass.

…tered, update tests

- Fix BLOCKER: Qwen proxy URL was concatenated twice — wsBaseUrl from
  hub already contains /api/voice/qwen-ws, so the client no longer
  appends it again
- Fix MAJOR: onReadyChange now fires from onRegistered callback inside
  each backend component (after registerVoiceSession completes), not
  from VoiceBackendSession after discovery. This ensures lazy-loaded
  Gemini/Qwen chunks have fully mounted before the voice button enables
- Fix MAJOR: update voice.test.ts to assert new proxy contract —
  gemini-token returns { apiKey: 'proxied', wsUrl }, qwen-token returns
  { allowed, wsUrl } with no apiKey field

@github-actions (bot) left a comment


Findings

  • [Blocker] The new proxy token routes build websocket URLs from a hard-coded localhost:24888 fallback, so Gemini/Qwen voice breaks when HAPI_PUBLIC_URL is unset.
  • [Major] The primary AudioWorklet recording path never connects the worklet node to a live graph, so mic chunks are likely never emitted on browsers that successfully load the worklet.
  • [Major] Gemini tool calls are executed in parallel against shared mutable session state, which can duplicate or reorder side effects.

Summary
Review mode: follow-up after new commits
Three findings on the current head. The earlier readiness and Qwen proxy regressions are fixed, but the remaining issues still block or destabilize the new voice backends in common self-hosted flows.

Testing

  • Not run (automation)

HAPI Bot

Comment thread hub/src/web/routes/voice.ts Outdated

// Use server-side WS proxy to avoid region restrictions.
// The proxy at /api/voice/gemini-ws handles the API key server-side.
const publicUrl = process.env.HAPI_PUBLIC_URL || `http://localhost:${process.env.HAPI_LISTEN_PORT || '24888'}`

[BLOCKER] HAPI_PUBLIC_URL is optional here, but this fallback always emits an absolute ws://localhost:24888/... proxy URL. On the default/local config that makes remote browsers open their own localhost instead of the hub, so both new voice backends fail before the websocket upgrade. The same issue is duplicated in the Qwen branch below.

Suggested fix:

const requestOrigin = new URL(c.req.url).origin
const publicUrl = process.env.HAPI_PUBLIC_URL || requestOrigin
const wsProxyUrl = publicUrl.replace(/^http/, "ws") + "/api/voice/gemini-ws"


Fixed in c68366a.

Both /voice/gemini-token and /voice/qwen-token now derive wsUrl from new URL(c.req.url).origin instead of hard-coded localhost:24888. HAPI_PUBLIC_URL still overrides when set. Remote browsers will correctly connect back to the hub.

const base64 = arrayBufferToBase64(pcm16);
onChunk(base64);
};
this.sourceNode.connect(this.workletNode);

[MAJOR] On the main AudioWorklet path the recorder never connects the worklet node to any downstream sink. The fallback ScriptProcessorNode path does connect to destination, but this branch stops at source -> worklet, so the graph has nothing pulling frames and port.onmessage will not fire on browsers where the worklet loads successfully.

Suggested fix:

const sink = this.audioContext.createGain()
sink.gain.value = 0

this.sourceNode.connect(this.workletNode)
this.workletNode.connect(sink)
sink.connect(this.audioContext.destination)


Fixed in c68366a.

AudioWorklet node is now connected through a zero-gain GainNode to audioContext.destination: source → worklet → sink(gain=0) → destination. This gives the audio graph a downstream pull so port.onmessage fires correctly on all browsers.

Comment thread web/src/realtime/gemini/toolAdapter.ts Outdated
export async function handleGeminiFunctionCalls(
calls: GeminiFunctionCall[]
): Promise<GeminiFunctionResponse[]> {
return Promise.all(calls.map(handleGeminiFunctionCall))

[MAJOR] Promise.all(...) runs Gemini tool calls concurrently against shared session state. That is unsafe here because processPermissionRequest() resolves the current first pending request each time; two calls in one turn can race, act on the same request twice, or reorder side effects.

Suggested fix:

const responses: GeminiFunctionResponse[] = []
for (const call of calls) {
    responses.push(await handleGeminiFunctionCall(call))
}
return responses


Fixed in c68366a.

handleGeminiFunctionCalls now uses a sequential for...of loop instead of Promise.all. This ensures tool calls execute one at a time against shared session state, preventing duplicate processPermissionRequest resolutions.
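The same fix generalizes to a small sequential runner (`runSequentially` is illustrative, not code from the PR):

```typescript
// Sketch: a generic sequential runner illustrating the fix above. Each handler
// completes before the next begins, so calls cannot race on shared state.
async function runSequentially<T, R>(
  items: readonly T[],
  handler: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = []
  for (const item of items) {
    // The await here is the ordering guarantee that Promise.all(...) lacked.
    results.push(await handler(item))
  }
  return results
}
```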

…, serialize tool calls

- Fix BLOCKER: derive wsUrl from request origin instead of hard-coded
  localhost:24888 fallback, so remote browsers connect back to the hub.
  HAPI_PUBLIC_URL still overrides when set.
- Fix MAJOR: connect AudioWorklet node to a silent GainNode sink so the
  audio graph pulls frames and port.onmessage fires correctly.
- Fix MAJOR: replace Promise.all with sequential for-loop in
  handleGeminiFunctionCalls to prevent racing on shared session state
  (e.g. duplicate processPermissionRequest resolutions).

@github-actions (bot) left a comment


Findings

  • [Major] Playback AudioContext creation moved into the async websocket setup path, so Gemini/Qwen can connect successfully but stay silent on mobile browsers that require audio contexts to be created/resumed in the original user gesture. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:166, web/src/realtime/QwenVoiceSession.tsx:176, web/src/realtime/gemini/audioPlayer.ts:12.

Summary

  • Review mode: follow-up after new commits
  • One major regression remains on the current head: the new Gemini/Qwen backends still create playback audio contexts too late for mobile autoplay policies, so voice output can fail even when the websocket handshake succeeds.

Testing

  • Not run (automation)

HAPI Bot

state.statusCallback?.('connected')

// Start audio capture
startAudioCapture()

[MAJOR] startAudioCapture() runs only after the websocket handshake finishes, and GeminiAudioPlayer creates its own AudioContext there. On iOS/mobile Safari that is already outside the original click gesture, so the session connects but playback can remain suspended/silent. QwenVoiceSession hits the same helper, so both new backends are affected.

Suggested fix:

// still inside startSession(), before awaiting websocket setup
const playbackContext = new AudioContext({ sampleRate: 24000 })
await playbackContext.resume()

// later, after setup completes
state.player = new GeminiAudioPlayer(playbackContext)


Fixed in 759bf35.

Both Gemini Live and Qwen Realtime now create and resume() the playback AudioContext({ sampleRate: 24000 }) at the very start of startSession(), while still inside the user's click/tap gesture. The pre-created context is then passed to GeminiAudioPlayer(playbackContext) via startAudioCapture().

This satisfies mobile autoplay policies (iOS Safari, Chrome Android) that require AudioContext creation within a user activation event.

…utoplay

Create and resume the playback AudioContext at the start of
startSession(), while still inside the user's click/tap gesture.
Pass the pre-created context to GeminiAudioPlayer so mobile browsers
(iOS Safari) don't block audio output due to autoplay policy.

Applies to both Gemini Live and Qwen Realtime backends.

@github-actions (bot) left a comment


Findings

  • [Major] The follow-up autoplay fix now leaks the playback AudioContext for both Gemini and Qwen. startSession() creates a new context before the token/mic checks, but GeminiAudioPlayer.dispose() only closes contexts it created itself, so failed starts and normal stop/start cycles both leave orphaned contexts behind. On mobile browsers with low AudioContext limits, voice playback eventually stops working after a few retries. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:69, web/src/realtime/QwenVoiceSession.tsx:84, web/src/realtime/gemini/audioPlayer.ts:69.
    Suggested fix:
    try {
        state.playbackContext = new AudioContext({ sampleRate: 24000 })
        await state.playbackContext.resume()
        // ...rest of setup...
        startAudioCapture(state.playbackContext)
    } catch (error) {
        cleanup()
        throw error
    }
    
    function cleanup() {
        // existing cleanup...
        if (state.playbackContext && state.playbackContext.state !== 'closed') {
            void state.playbackContext.close()
            state.playbackContext = null
        }
    }

Summary

  • Review mode: follow-up after new commits
  • One major regression remains on the current head: the mobile autoplay fix now leaks playback audio contexts in both new backends, so repeated retries eventually break voice output.

Testing

  • Not run (automation environment does not have bun)

HAPI Bot


// Create playback AudioContext immediately while still inside the user
// gesture (click/tap). Mobile browsers require this for autoplay policy.
const playbackContext = new AudioContext({ sampleRate: 24000 })

[MAJOR] This new user-gesture AudioContext never gets closed. startAudioCapture() hands it to GeminiAudioPlayer, but GeminiAudioPlayer.dispose() only closes owned contexts (web/src/realtime/gemini/audioPlayer.ts:69), so both failed starts and normal stop/start cycles leak one playback context each. Qwen now has the same regression at web/src/realtime/QwenVoiceSession.tsx:84.

Suggested fix:

try {
    state.playbackContext = new AudioContext({ sampleRate: 24000 })
    await state.playbackContext.resume()
    // ...rest of setup...
} catch (error) {
    cleanup()
    throw error
}


Fixed in 74aa4c2.

Both Gemini and Qwen now store the playback AudioContext in module state (state.playbackContext). cleanup() closes it (void state.playbackContext.close()) alongside the recorder, player, and WebSocket. This covers:

  • Failed starts (token check, mic denial) — cleanup() at top of startSession()
  • Normal stop — endSession() calls cleanup()
  • Retries — next startSession() calls cleanup() first

No more orphaned contexts on repeated start/stop cycles.

Move playback AudioContext into module state so cleanup() can close it
on failed starts and normal stop/start cycles. Prevents orphaned
AudioContext leaks that would exhaust mobile browser limits after
repeated retries.
@github-actions Bot left a comment

Review mode: follow-up after new commits

Findings

  • [Major] Forced service-worker activation breaks already-open tabs after deploy. skipWaiting() + clientsClaim() immediately move existing sessions onto the new worker, but this PR also introduces lazy-loaded voice chunks. A tab that is still running the old app shell can then request an old chunk hash and fail to load voice after rollout. Evidence web/src/sw.ts:24, web/src/realtime/VoiceBackendSession.tsx:8.
    Suggested fix:
    // Keep the new worker in waiting state until the page reloads,
    // or trigger skipWaiting only after an explicit client prompt.
    precacheAndRoute(self.__WB_MANIFEST)
  • [Minor] Voice requests are now logged in production. The debug guard was removed around messageCodingAgent, so every dictated prompt and session id is written to the browser console. Evidence web/src/realtime/realtimeClientTools.ts:48.
    Suggested fix:
    if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
        console.log('[Voice] messageCodingAgent called with:', message)
        console.log('[Voice] Sending message to session:', sessionId)
    }

Summary

  • Review mode: follow-up after new commits
  • Two issues remain on the current head: one deploy-time regression in the service-worker update path, and one production logging regression for voice prompts.

Testing

  • Not run (bun is not installed in this automation environment). Existing tests also do not cover the service-worker upgrade path.

HAPI Bot

Comment thread web/src/sw.ts Outdated
}

// Activate new SW immediately without waiting for all tabs to close
self.addEventListener('install', () => { self.skipWaiting() })

[MAJOR] skipWaiting()/clientsClaim() force the new worker onto already-open tabs. Because this PR also lazy-loads the non-ElevenLabs voice backends, an existing page can still request old chunk hashes after deployment and hit a chunk-load failure mid-session.

Suggested fix:

// Leave the new worker waiting, or only call skipWaiting after the client opts in.
precacheAndRoute(self.__WB_MANIFEST)


Fixed in 296dc85.

Removed skipWaiting() and clientsClaim() from the service worker. New SW now stays in waiting state until all tabs close, preventing lazy-loaded chunk hash mismatches after deploy.

Comment thread web/src/realtime/realtimeClientTools.ts Outdated
console.log('[Voice] messageCodingAgent called with:', message)
console.log('[Voice] Sending message to session:', sessionId)
}
console.log('[Voice] messageCodingAgent called with:', message)

[MINOR] This now logs every dictated agent request and the session id in production. Voice prompts often contain repo details or other sensitive instructions, so the old debug guard should stay in place here.

Suggested fix:

if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
    console.log('[Voice] messageCodingAgent called with:', message)
    console.log('[Voice] Sending message to session:', sessionId)
}


Fixed in 296dc85.

Wrapped the messageCodingAgent console.log calls with VOICE_CONFIG.ENABLE_DEBUG_LOGGING guard, matching the existing pattern in processPermissionRequest.

- Remove skipWaiting + clientsClaim from service worker to prevent
  breaking lazy-loaded voice chunks in already-open tabs after deploy.
  New SW now waits for all tabs to close before activating.
- Wrap messageCodingAgent console.log calls with VOICE_CONFIG debug
  guard to stop logging user prompts and session IDs in production.
@github-actions Bot left a comment

Findings

  • [Major] Gemini voice startup can hang indefinitely after an early socket close. startSession() only rejects on ws.onerror before setupComplete; if the proxy or upstream closes cleanly during the handshake, ws.onclose only marks the session disconnected and never rejects the pending promise, so voice.startVoice() stays stuck in connecting. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:254
    Suggested fix:
    ws.onclose = (event) => {
        const message = event.reason || 'WebSocket closed before setup completed'
        cleanup()
        resetRealtimeSessionState()
        if (!setupDone) {
            state.statusCallback?.('error', message)
            reject(new Error(message))
            return
        }
        state.statusCallback?.('disconnected')
    }
  • [Major] Qwen voice has the same unresolved startup path, and the new error event handler also drops server-side setup failures on the floor. If DashScope returns an error event or closes before session.updated, the promise never settles and the UI remains stuck in connecting. Evidence web/src/realtime/QwenVoiceSession.tsx:260, web/src/realtime/QwenVoiceSession.tsx:275
    Suggested fix:
    if (eventType === 'error') {
        const message = err?.message || 'Realtime session setup failed'
        state.statusCallback?.('error', message)
        if (!sessionCreated) {
            reject(new Error(message))
            ws.close()
        }
        return
    }
    Add the same !sessionCreated rejection guard in ws.onclose.
  • [Minor] Gemini debug logging is forced on in production. const DEBUG = true makes the new backend log connection state, tool calls, and model text on every user session, which regresses from the existing dev/config-gated logging pattern. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:14
    Suggested fix:
    const DEBUG = import.meta.env.DEV

Summary

  • Review mode: follow-up after new commits
  • Three issues remain on the current head: both new WebSocket voice backends can leave the UI stuck in connecting when setup fails early, and Gemini debug logging is still forced on in production.

Testing

  • Not run (bun is not installed in this automation environment).
  • New tests cover route/PCM/tool-adapter helpers, but they do not exercise WebSocket handshake failure paths for Gemini/Qwen.

HAPI Bot

}
}

ws.onclose = (event) => {

[MAJOR] If the socket closes before setupComplete, startSession() never settles. ws.onerror rejects the promise, but this onclose path only flips the status to disconnected, so a clean handshake failure leaves the voice UI stuck in connecting.

Suggested fix:

ws.onclose = (event) => {
    const message = event.reason || 'WebSocket closed before setup completed'
    cleanup()
    resetRealtimeSessionState()
    if (!setupDone) {
        state.statusCallback?.('error', message)
        reject(new Error(message))
        return
    }
    state.statusCallback?.('disconnected')
}


Fixed in f108c0f.

Gemini ws.onclose now rejects the startup promise when !setupDone, with error status callback. Also changed DEBUG from true to import.meta.env.DEV.


Fixed in f108c0f (previous commit).

The onclose handler already rejects the promise when !setupDone, calls state.statusCallback?.('error', message), and returns early. This matches the suggested fix exactly.

}

// Error
if (eventType === 'error') {

[MAJOR] This drops server-side setup failures on the floor. If DashScope sends an error event before session.updated, the pending startSession() promise is neither resolved nor rejected, so the UI keeps spinning in connecting.

Suggested fix:

if (eventType === 'error') {
    const message = err?.message || 'Realtime session setup failed'
    state.statusCallback?.('error', message)
    if (!sessionCreated) {
        reject(new Error(message))
        ws.close()
    }
    return
}


Fixed in f108c0f.

Qwen error event handler now calls reject() + ws.close() when !sessionCreated. onclose also rejects the promise on early close, matching the Gemini pattern.


Fixed in 7111b67.

Renamed the guard flag from sessionCreated to sessionReady and now only set it to true after session.updated arrives (the point where resolve() is called). If DashScope sends session.created but fails before session.updated, the error/close handlers now correctly reject the promise instead of leaving it hanging.

Additionally, the Gemini proxy in hub now queues client messages (especially the setup frame) in a pending buffer while the upstream WebSocket is still CONNECTING, then flushes on open — preventing the silent frame drop that could also cause a startup hang.
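The proxy's queue-then-flush behavior can be sketched as pure forwarding logic (the UpstreamLike shape and readyState constants stand in for the real WebSocket in the hub):

```typescript
// Minimal stand-in for the upstream socket; values mirror
// WebSocket.CONNECTING / WebSocket.OPEN.
const CONNECTING = 0
const OPEN = 1

interface UpstreamLike {
  readyState: number
  sent: string[]
  send(msg: string): void
}

const upstream: UpstreamLike = {
  readyState: CONNECTING,
  sent: [],
  send(msg) { this.sent.push(msg) },
}

const pending: string[] = []

// Called for each client message: forward when open, queue while the
// upstream is still connecting (instead of silently dropping it).
function forward(msg: string): void {
  if (upstream.readyState === OPEN) {
    upstream.send(msg)
  } else if (upstream.readyState === CONNECTING) {
    pending.push(msg) // the setup frame lands here
  }
}

// Called from upstream.onopen: flush everything queued while CONNECTING.
function flush(): void {
  upstream.readyState = OPEN
  for (const queued of pending.splice(0)) upstream.send(queued)
}

forward('{"setup":{}}')        // arrives before the upstream is open
flush()                        // upstream opens; setup is delivered first
forward('{"realtimeInput":{}}')
```

Message order is preserved because the queue is drained before any later message can be forwarded directly.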

import type { Session } from '@/types/api'
import type { GeminiFunctionCall } from './gemini/toolAdapter'

const DEBUG = true

[MINOR] DEBUG is hardcoded to true, so this backend now logs connection state, tool calls, and model text in production. The existing voice code gates these logs behind dev/config flags; this new backend should do the same.

Suggested fix:

const DEBUG = import.meta.env.DEV

- Gemini: reject the startup promise in onclose when setup hasn't
  completed, preventing the UI from hanging in 'connecting' state
- Qwen: reject on both server 'error' events and early onclose during
  handshake, with proper cleanup and status callback
- Change Gemini DEBUG from hardcoded true to import.meta.env.DEV to
  stop logging connection state and tool calls in production
@github-actions Bot left a comment

Findings

  • [Major] Gemini proxy can drop the setup frame before the upstream socket opens — the browser sends setup as soon as the hub WebSocket opens, but the proxy forwards client messages only when the Google upstream is already OPEN; messages received during upstream CONNECTING are discarded. That leaves Gemini Live waiting forever for setupComplete. Evidence hub/src/web/server.ts:61 and client send at web/src/realtime/GeminiLiveVoiceSession.tsx:148.
    Suggested fix:
    const pending: Array<string | ArrayBuffer | Uint8Array> = []
    upstream.onopen = () => {
        for (const queued of pending.splice(0)) upstream.send(queued)
    }
    // in message(): queue until upstream opens instead of dropping setup
    if (upstream?.readyState === WebSocket.OPEN) upstream.send(message)
    else if (upstream?.readyState === WebSocket.CONNECTING) pending.push(message)
  • [Major] Qwen startup still hangs when setup fails after session.createdstartSession() only rejects while !sessionCreated, but the promise does not resolve until session.updated. If DashScope accepts the socket, sends session.created, then rejects the session.update or closes before session.updated, this code reports an error but never settles the pending start promise. Evidence web/src/realtime/QwenVoiceSession.tsx:265 and web/src/realtime/QwenVoiceSession.tsx:285.
    Suggested fix:
    let sessionReady = false
    if (eventType === 'session.updated') {
        sessionReady = true
        state.statusCallback?.('connected')
        startAudioCapture(state.playbackContext!)
        resolve()
        return
    }
    if (eventType === 'error' && !sessionReady) {
        reject(new Error(message))
        ws.close()
        return
    }
    // onclose: reject when !sessionReady, not only when !sessionCreated

Summary

  • Review mode: follow-up after new commits
  • Two startup reliability issues remain in the current head: Gemini proxy can discard the initial setup message, and Qwen can still leave the UI stuck in connecting on setup failures after session.created.

Testing

  • Not run (automation: bun is not installed in this runner).
  • Missing coverage: WebSocket proxy backpressure/queueing for Gemini setup and Qwen failure paths between session.created and session.updated.

HAPI Bot

Comment thread hub/src/web/server.ts
},
message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
const upstream = upstreamMap.get(clientWs)
if (upstream?.readyState === WebSocket.OPEN) {

[MAJOR] Client messages are dropped while the upstream Gemini socket is still connecting. The browser sends the required setup frame immediately after the hub WebSocket opens, but this proxy only forwards when upstream.readyState === WebSocket.OPEN; if that frame arrives first, Google never receives setup and the client waits forever for setupComplete.

Suggested fix:

const pending: Array<string | ArrayBuffer | Uint8Array> = []
upstream.onopen = () => {
    for (const queued of pending.splice(0)) upstream.send(queued)
}

if (upstream?.readyState === WebSocket.OPEN) {
    upstream.send(message)
} else if (upstream?.readyState === WebSocket.CONNECTING) {
    pending.push(message)
}

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated
const message = err?.message || 'Realtime session setup failed'
console.error('[Qwen] Server error:', message)
state.statusCallback?.('error', message)
if (!sessionCreated) {

[MAJOR] This still leaves startup unresolved when setup fails after session.created. The promise resolves only on session.updated, but errors after sessionCreated = true skip reject(), so an invalid session.update response leaves the voice UI stuck in connecting.

Suggested fix:

let sessionReady = false

if (eventType === 'session.updated') {
    sessionReady = true
    state.statusCallback?.('connected')
    startAudioCapture(state.playbackContext!)
    resolve()
    return
}

if (eventType === 'error' && !sessionReady) {
    reject(new Error(message))
    ws.close()
    return
}

…tup hang

Gemini proxy: buffer client messages (especially the setup frame) in a
pending queue while the upstream WebSocket is still CONNECTING, then
flush on open. Previously the setup frame was silently dropped.

Qwen: rename sessionCreated → sessionReady and only set it after
session.updated arrives. If the server fails between session.created
and session.updated the promise now correctly rejects instead of
hanging forever.
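The settle-once guard described in this commit can be sketched in isolation. Event names follow the DashScope-style events mentioned in the thread; the handler shape and createStartup helper are illustrative, not the actual implementation:

```typescript
// Sketch of the startup guard: resolve only on session.updated, and
// reject on an error or close at any point before that, even after
// session.created has already been seen.
function createStartup() {
  let sessionReady = false
  let settled = false
  let result = ''

  const resolve = () => { if (!settled) { settled = true; result = 'resolved' } }
  const reject = (m: string) => { if (!settled) { settled = true; result = 'rejected:' + m } }

  function onEvent(eventType: string, message = 'setup failed'): void {
    if (eventType === 'session.updated') {
      sessionReady = true
      resolve() // connected: safe to start audio capture
      return
    }
    // Errors or an early close before session.updated must reject.
    if ((eventType === 'error' || eventType === 'close') && !sessionReady) {
      reject(message)
    }
  }

  return { onEvent, get result() { return result } }
}

// Failure between session.created and session.updated now settles
// the promise instead of leaving the UI stuck in 'connecting'.
const startup = createStartup()
startup.onEvent('session.created')
startup.onEvent('error', 'invalid session.update')
```

With the old sessionCreated flag, the second event would have been swallowed and `result` would stay empty forever.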
@github-actions Bot left a comment

Findings

  • [Major] Proxy/upstream URL reuse leaks the hub JWT and bypasses the server-side proxy whenever GEMINI_LIVE_WS_URL or QWEN_REALTIME_WS_URL is set. /voice/*-token returns those env values to the browser, and both clients append token=<hub-jwt> before connecting, while the hub also reuses the same env vars for its upstream socket target. Evidence hub/src/web/routes/voice.ts:154, hub/src/web/routes/voice.ts:176, hub/src/web/server.ts:37, hub/src/web/server.ts:98, web/src/realtime/GeminiLiveVoiceSession.tsx:111, web/src/realtime/QwenVoiceSession.tsx:123.
    Suggested fix:
    const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
    const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'
    
    return c.json({
        allowed: true,
        apiKey: 'proxied',
        wsUrl: geminiClientWsUrl
    })
    Apply the same split for Qwen, and keep the browser-facing wsUrl pinned to /api/voice/*-ws.
  • [Major] The shared Chinese prompt/first-message change is not additive: it changes the existing ElevenLabs assistant globally and bypasses the user language preference that the web app still passes only as overrides.agent.language. Even when the user selects English, the base prompt now instructs “Always respond in Chinese” and the first message is hard-coded Chinese. Evidence shared/src/voice.ts:145, shared/src/voice.ts:157, shared/src/voice.ts:242, web/src/lib/voice-context.tsx:43, web/src/realtime/RealtimeVoiceSession.tsx:74.
    Suggested fix:
    const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
    const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'
    
    export function buildVoiceAgentConfig(): VoiceAgentConfig {
        return {
            conversation_config: {
                agent: {
                    first_message: ELEVENLABS_FIRST_MESSAGE,
                    language: 'en',
                    prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
                }
            }
        }
    }
    If Gemini/Qwen should default to Chinese, give those backends their own prompt constants instead of reusing the ElevenLabs config.

Summary

  • Review mode: follow-up after new commits
  • Two issues remain in the current head: the new *_WS_URL env vars act as both upstream and browser-facing URLs, which can leak the hub JWT outside the proxy path, and the shared prompt change silently regresses the existing ElevenLabs language behavior.

Testing

  • Not run (review constraints: static analysis only; no PR code execution).
  • Missing coverage: custom GEMINI_LIVE_WS_URL/QWEN_REALTIME_WS_URL behavior, and preserving ElevenLabs language selection after the shared prompt refactor.

HAPI Bot

Comment thread hub/src/web/routes/voice.ts Outdated
return c.json({
allowed: true,
apiKey: 'proxied', // Dummy — key is handled server-side
wsUrl: process.env.GEMINI_LIVE_WS_URL || wsProxyUrl,

[MAJOR] GEMINI_LIVE_WS_URL is being reused as both the upstream target and the browser-facing wsUrl. If an operator sets it to a direct/custom upstream URL, /voice/gemini-token returns that URL to the browser and the client appends token=<hub-jwt> before connecting (web/src/realtime/GeminiLiveVoiceSession.tsx:111). That bypasses /api/voice/gemini-ws and leaks the hub JWT to the upstream service. Qwen has the same pattern at /api/voice/qwen-token.

Suggested fix:

const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'

return c.json({
    allowed: true,
    apiKey: 'proxied',
    wsUrl: geminiClientWsUrl
})

Comment thread shared/src/voice.ts

# Language

IMPORTANT: Always respond in Chinese (Mandarin). Use natural spoken Chinese.

[MAJOR] This is now changing the existing ElevenLabs assistant globally, not just the new backends. The web app still only overrides agent.language (web/src/realtime/RealtimeVoiceSession.tsx:74), so users who selected English now get a base prompt and first message that explicitly force Chinese. That breaks the existing language preference path and contradicts the additive/no-breakage goal.

Suggested fix:

const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'

export function buildVoiceAgentConfig(): VoiceAgentConfig {
    return {
        conversation_config: {
            agent: {
                first_message: ELEVENLABS_FIRST_MESSAGE,
                language: 'en',
                prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
            }
        }
    }
}

…from Gemini/Qwen

Security: hub token endpoints now always return the proxy URL
(/api/voice/*-ws) regardless of GEMINI_LIVE_WS_URL or
QWEN_REALTIME_WS_URL env vars. Those env vars are upstream-only and
never exposed to the browser, preventing JWT leakage to external
endpoints.

Language: extracted the Chinese language block from VOICE_SYSTEM_PROMPT
into VOICE_CHINESE_LANGUAGE_BLOCK, appended only by Gemini and Qwen
backends. ElevenLabs keeps a neutral English base prompt and controls
language via its own language field, preserving user language preference.
@TennyDDDD

Both issues fixed in 5c60488.

1. WS URL leak — Token endpoints now always return the hub proxy URL (/api/voice/*-ws). GEMINI_LIVE_WS_URL and QWEN_REALTIME_WS_URL env vars are only used server-side as upstream targets and are never sent to the browser. The hub JWT stays within the proxy path.

2. ElevenLabs language regression — Extracted the Chinese language block from VOICE_SYSTEM_PROMPT into a separate VOICE_CHINESE_LANGUAGE_BLOCK constant, which is appended only by Gemini (buildGeminiLiveConfig) and Qwen (QwenVoiceSession). ElevenLabs uses the neutral base prompt with language: 'en' and first message in English, preserving user language preference via its own language field override.

@github-actions Bot left a comment

Findings

  • [Major] Initial mic mute is lost in the new Gemini/Qwen backends — micMuted can already be true before a session starts, but both components only apply it in an effect that runs while state.recorder is still null. startAudioCapture() then creates a live recorder and never reapplies the existing mute state, so the UI can show muted while audio is still captured until the user toggles again. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:380, web/src/realtime/QwenVoiceSession.tsx:397, related context web/src/lib/voice-context.tsx:23.
    Suggested fix:
    const micMutedRef = useRef(micMuted)
    
    useEffect(() => {
        micMutedRef.current = micMuted
        state.recorder?.setMuted(micMuted)
    }, [micMuted])
    
    void state.recorder.start(...).then(() => {
        state.recorder?.setMuted(micMutedRef.current)
    })
  • [Major] Composer Enter-to-send was flipped for the whole web app — plain Enter now inserts a newline and only Ctrl/Cmd+Enter sends. That changes the existing chat workflow for every keyboard user and is unrelated to the voice backend feature. Evidence web/src/components/AssistantChat/HappyComposer.tsx:314.
    Suggested fix:
    if (key === 'Enter' && e.shiftKey) {
        return
    }
    
    if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
        e.preventDefault()
        api.composer().send()
        setShowContinueHint(false)
        return
    }

Summary

  • Review mode: follow-up after new commits
  • Two issues remain in the current head: the new Gemini/Qwen sessions can capture audio while the UI still shows the mic as muted, and the composer key handling now regresses the existing Enter-to-send behavior.

Testing

  • Not run (automation; static review only)
  • Missing coverage: starting Gemini/Qwen while micMuted=true, and keyboard composer behavior for Enter, Shift+Enter, and Ctrl/Cmd+Enter

HAPI Bot

}, [api]) // eslint-disable-line react-hooks/exhaustive-deps

// Sync mic mute state
useEffect(() => {

[MAJOR] micMuted is only applied here when the prop changes after state.recorder already exists. If the user stopped voice while muted and starts a new Gemini session, micMuted is already true, this effect runs before startAudioCapture() creates the recorder, and the initial mute never gets reapplied. The UI stays muted while the recorder still captures audio.

Suggested fix:

const micMutedRef = useRef(micMuted)

useEffect(() => {
    micMutedRef.current = micMuted
    state.recorder?.setMuted(micMuted)
}, [micMuted])

void state.recorder.start(...).then(() => {
    state.recorder?.setMuted(micMutedRef.current)
})

}
}, [api]) // eslint-disable-line react-hooks/exhaustive-deps

useEffect(() => {

[MAJOR] Same initial-mute regression as Gemini: this effect only updates an existing recorder. When micMuted is already true before the Qwen session starts, startAudioCapture() creates a fresh recorder after this effect has already run, so audio capture begins unmuted until the user toggles again.

Suggested fix:

const micMutedRef = useRef(micMuted)

useEffect(() => {
    micMutedRef.current = micMuted
    state.recorder?.setMuted(micMuted)
}, [micMuted])

void state.recorder.start(...).then(() => {
    state.recorder?.setMuted(micMutedRef.current)
})


// Only plain Enter (no modifiers) sends; other modifier combos are ignored
if (key === 'Enter') {
// Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message

[MAJOR] This flips the existing composer shortcut from plain Enter to Ctrl/Cmd+Enter for every chat session. The previous behavior already supported multi-line input with Shift+Enter, so this changes a core keyboard workflow without anything in the voice feature requiring it.

Suggested fix:

if (key === 'Enter' && e.shiftKey) {
    return
}

if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey) {
    e.preventDefault()
    if (canSend) {
        api.composer().send()
        setShowContinueHint(false)
    }
    return
}

Both Gemini and Qwen backends now persist micMuted to module-level
state and reapply it immediately after startAudioCapture() creates the
recorder. Previously the React mute effect could run while the recorder
was still null, causing audio to be captured even when the UI showed
the mic as muted.
@TennyDDDD

Re: review at 2026-04-22T03:35:05Z

1. Mic mute initial state — Fixed in aa9802d.

Both Gemini and Qwen now persist micMuted to module-level state via the React effect, and startAudioCapture() applies it immediately after creating the recorder. This covers the case where micMuted was already true before the session started.

2. Composer Enter → Ctrl+Enter — This is an intentional change requested separately from the voice feature. The previous Enter-to-send conflicted with multi-line input needs. The new behavior:

  • Ctrl/Cmd+Enter → send
  • Enter → newline
  • Suggestion selection still uses plain Enter

This is a deliberate UX improvement, not a regression.
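The mute-persistence fix from point 1 can be sketched as plain module-state logic (the RecorderLike shape and function names here are illustrative):

```typescript
// Sketch of the module-state mute pattern: the desired flag lives
// outside React, and is applied both when the prop changes and right
// after the recorder is created.
interface RecorderLike { muted: boolean; setMuted(m: boolean): void }

const state: { micMuted: boolean; recorder: RecorderLike | null } = {
  micMuted: false,
  recorder: null,
}

// React effect body: persist the flag; apply it if a recorder exists.
function onMicMutedChange(micMuted: boolean): void {
  state.micMuted = micMuted
  state.recorder?.setMuted(micMuted)
}

// startAudioCapture(): reapply the persisted flag to the new recorder,
// covering the mute-before-start case where the effect ran while
// state.recorder was still null.
function startAudioCapture(): void {
  const recorder: RecorderLike = {
    muted: false,
    setMuted(m) { this.muted = m },
  }
  state.recorder = recorder
  recorder.setMuted(state.micMuted)
}

onMicMutedChange(true) // user muted before the session started
startAudioCapture()    // new recorder must come up muted
```

Without the reapply step in startAudioCapture(), the recorder would start unmuted even though the UI shows the mic as muted.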

@github-actions Bot left a comment

Findings

  • [Major] Gemini can silently unmute the user after any model response — turnComplete always calls state.recorder?.setMuted(false), so a user who intentionally muted the mic gets re-exposed as soon as the first reply finishes. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:217, related state web/src/realtime/GeminiLiveVoiceSession.tsx:388.
    Suggested fix:
    function applyRecorderMute(): void {
        state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
    }
    
    if (!state.modelSpeaking) {
        state.modelSpeaking = true
        applyRecorderMute()
    }
    
    if (serverContent.turnComplete) {
        state.modelSpeaking = false
        applyRecorderMute()
    }
  • [Major] Failed starts leak the newly created audio context in both new backends — each startSession() opens state.playbackContext before token/mic validation, but the early throw paths never call cleanup(). Repeated failures leave contexts alive and can block later starts once the browser hits its context cap. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:78, web/src/realtime/GeminiLiveVoiceSession.tsx:89, same pattern web/src/realtime/QwenVoiceSession.tsx:94, web/src/realtime/QwenVoiceSession.tsx:102.
    Suggested fix:
    async startSession(config: VoiceSessionConfig): Promise<void> {
        cleanup()
        try {
            state.playbackContext = new AudioContext({ sampleRate: 24000 })
            await state.playbackContext.resume()
            // existing setup...
        } catch (error) {
            cleanup()
            throw error
        }
    }
  • [Major] Composer Enter-to-send is still inverted for the whole chat UI — the current handler only sends on Ctrl/Cmd+Enter, while plain Enter now inserts a newline. That changes a core workflow unrelated to the voice feature. Evidence web/src/components/AssistantChat/HappyComposer.tsx:314.
    Suggested fix:
    if (key === 'Enter' && e.shiftKey) {
        return
    }
    
    if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
        e.preventDefault()
        api.composer().send()
        setShowContinueHint(false)
        return
    }

Summary

  • Review mode: follow-up after new commits
  • Three issues remain at the current head: Gemini still overrides a user mute after model speech, failed Gemini/Qwen starts leak audio contexts, and the composer shortcut regression is still present.

Testing

  • Not run (automation; static review only)
  • Missing coverage: manual mute persisted across Gemini responses, repeated failed starts for Gemini/Qwen, and Enter vs Shift+Enter vs Ctrl/Cmd+Enter composer behavior

HAPI Bot

console.log('[GeminiLive] Turn complete')
// Model done — unmute mic for next user turn
state.modelSpeaking = false
state.recorder?.setMuted(false)


[MAJOR] turnComplete always unmutes the recorder, so a user who intentionally muted the mic gets re-exposed as soon as the first Gemini reply finishes.

Suggested fix:

function applyRecorderMute(): void {
    state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
}

if (serverContent.turnComplete) {
    state.modelSpeaking = false
    applyRecorderMute()
}

// Create playback AudioContext immediately while still inside the user
// gesture (click/tap). Mobile browsers require this for autoplay policy.
// Store in state so cleanup() can close it on failure or stop.
state.playbackContext = new AudioContext({ sampleRate: 24000 })


[MAJOR] This AudioContext is created before token/mic validation, but the early throw paths above never call cleanup(). The same pattern exists in QwenVoiceSession. A few failed starts can leave enough contexts alive to block later voice sessions.

Suggested fix:

async startSession(config: VoiceSessionConfig): Promise<void> {
    cleanup()
    try {
        state.playbackContext = new AudioContext({ sampleRate: 24000 })
        await state.playbackContext.resume()
        // existing setup...
    } catch (error) {
        cleanup()
        throw error
    }
}


if (key === 'Enter') {
// Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message


[MAJOR] This still flips the app-wide composer shortcut from plain Enter to Ctrl/Cmd+Enter. The voice backend work does not require changing the default send behavior.

Suggested fix:

if (key === 'Enter' && e.shiftKey) {
    return
}

if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
    e.preventDefault()
    api.composer().send()
    setShowContinueHint(false)
    return
}

@Overbaker (Author)


[Translated from Chinese] With newlines in the picture, Cmd+Enter makes more sense.
