feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #401
Overbaker wants to merge 21 commits into tiann:main
Conversation
Findings
- [Blocker] Qwen WebSocket proxy bypasses API auth and can be opened without a JWT, which lets any reachable client consume the hub's DashScope credentials through /api/voice/qwen-ws. Evidence: hub/src/web/server.ts:328.
- [Major] The fallback voice backend is now gemini-live, so existing installs that only configured ElevenLabs will be routed away from the existing token flow and fail voice startup. Evidence: shared/src/voice.ts:280, hub/src/web/routes/voice.ts:121.
- [Major] The Qwen frontend still requires the hub to return a raw DashScope key even though the browser never uses it after switching to the hub WebSocket proxy, so every authenticated web client now receives a long-lived provider secret unnecessarily. Evidence: web/src/realtime/QwenVoiceSession.tsx:84, hub/src/web/routes/voice.ts:162.
Summary
Review mode: initial
Three findings. Added coverage does not exercise the new /api/voice/qwen-ws auth boundary, and the new route tests would not catch the default-backend regression because they do not assert against DEFAULT_VOICE_BACKEND.
Testing
- Not run (automation): bun is not installed in this runner.
HAPI Bot
return socketHandler.fetch(req, server as never)
}
// Qwen Realtime WebSocket proxy
if (url.pathname === '/api/voice/qwen-ws') {
[BLOCKER] This branch upgrades /api/voice/qwen-ws before the request ever reaches app.fetch(req), so it bypasses createAuthMiddleware() entirely. As written, any client that can reach the hub can open a proxied DashScope session with the server-side API key.
Suggested fix:
const token = url.searchParams.get('token')
if (!token) {
return new Response('Missing authorization token', { status: 401 })
}
await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
const upgraded = server.upgrade(req, {
data: { _qwenProxy: true, apiKey, model }
})
Fixed in f5cbd0e.
Both /api/voice/gemini-ws and /api/voice/qwen-ws now require a ?token=<JWT> query parameter. The fetch handler verifies the token via jwtVerify(token, jwtSecret, { algorithms: ['HS256'] }) before calling server.upgrade().
Frontend voice sessions now attach the JWT from ApiClient.getAuthToken() to the WebSocket URL.
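The upgrade gate described in this fix can be sketched as a small routing decision that runs before any WebSocket upgrade. This is an illustrative sketch, not the PR's code: gateVoiceUpgrade and the response bodies are hypothetical names, and real HS256 signature/expiry verification is delegated to jwtVerify in the actual handler — here only the structural shape of the decision is modeled.

```typescript
// Decide whether a voice WS request may be upgraded. Missing or malformed
// tokens are rejected with 401 before server.upgrade() would ever run.
type GateResult =
  | { action: 'upgrade'; token: string }
  | { action: 'reject'; status: number; body: string }

function gateVoiceUpgrade(requestUrl: string): GateResult {
  const url = new URL(requestUrl)
  if (url.pathname !== '/api/voice/gemini-ws' && url.pathname !== '/api/voice/qwen-ws') {
    return { action: 'reject', status: 404, body: 'Not found' }
  }
  const token = url.searchParams.get('token')
  if (!token) {
    return { action: 'reject', status: 401, body: 'Missing authorization token' }
  }
  // A JWT is three base64url segments; this structural check is illustrative.
  // The real handler verifies the signature via jwtVerify(token, secret,
  // { algorithms: ['HS256'] }) before upgrading.
  if (token.split('.').length !== 3) {
    return { action: 'reject', status: 401, body: 'Malformed token' }
  }
  return { action: 'upgrade', token }
}
```

The point of the gate is that rejection happens in the plain HTTP path, so an unauthenticated client never reaches the proxied DashScope/Gemini session.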
export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime'
export const QWEN_REALTIME_VOICE = 'Mia'

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'gemini-live'
[MAJOR] This flips the implicit backend from ElevenLabs to Gemini Live. On a hub that only has ELEVENLABS_API_KEY configured, /api/voice/backend now resolves to gemini-live, and VoiceBackendSession will send users into /api/voice/gemini-token instead of the existing ElevenLabs flow.
Suggested fix:
export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs'
Fixed in f5cbd0e.
DEFAULT_VOICE_BACKEND reverted to 'elevenlabs'. Hubs with only ELEVENLABS_API_KEY configured will continue to work as before. Users who want Gemini Live or Qwen can set VOICE_BACKEND=gemini-live or VOICE_BACKEND=qwen-realtime explicitly.
// Get API key from hub
const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed || !tokenResp.apiKey) {
[MAJOR] The browser does not use this provider key after the Qwen proxy change. startSession() only checks that apiKey exists, then opens /api/voice/qwen-ws, where the hub injects its own Authorization header. Returning a long-lived DashScope key here needlessly leaks it to every authenticated web client.
Suggested fix:
const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed) {
const msg = tokenResp.error ?? 'DashScope API key not available'
state.statusCallback?.('error', msg)
throw new Error(msg)
}
state.wsBaseUrl = tokenResp.wsUrl || null
Fixed in f5cbd0e.
/voice/qwen-token now returns { allowed: true, wsUrl } only — no apiKey field. The DashScope key stays server-side in the WebSocket proxy. Frontend QwenVoiceSession updated accordingly: it checks allowed, gets wsUrl, and connects to the hub proxy with a JWT token.
Same pattern already used for Gemini Live (apiKey: 'proxied').
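The reshaped token contract can be sketched as follows. This is a hedged illustration of the response shape described in the fix, not the hub's actual route code; buildQwenTokenResponse is a hypothetical helper, and the real route derives its origin from the request:

```typescript
// After the fix, /voice/qwen-token returns only a capability flag and the
// hub's own proxy URL — the DashScope key never leaves the server.
interface QwenTokenResponse {
  allowed: boolean
  wsUrl?: string
  error?: string
}

function buildQwenTokenResponse(apiKeyConfigured: boolean, origin: string): QwenTokenResponse {
  if (!apiKeyConfigured) {
    return { allowed: false, error: 'DashScope API key not available' }
  }
  // wsUrl points at the hub's WebSocket proxy, not at DashScope directly.
  return { allowed: true, wsUrl: origin.replace(/^http/, 'ws') + '/api/voice/qwen-ws' }
}
```

Because the response carries no apiKey field at all, a client that still expects one fails loudly at the type level rather than silently receiving a secret.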
tiann
left a comment
Thank you for your contribution. I believe this is a good feature. Please address the review comments first.
Add a strategy-based voice backend architecture that allows switching between ElevenLabs ConvAI and the Gemini Live API via the VOICE_BACKEND env var.
- Shared: VoiceBackendType, Gemini Live config builder, tool definitions
- Hub: GET /voice/backend discovery, POST /voice/gemini-token with proxy support
- Web: GeminiLiveVoiceSession (WebSocket + AudioWorklet audio pipeline), VoiceBackendSession dynamic switcher with React.lazy() code splitting, Gemini tool adapter bridging existing client tools
- Tests: hub route tests, pcmUtils round-trip tests, toolAdapter tests
- Zero changes to existing ElevenLabs code paths
- System prompt instructs the assistant to respond in Mandarin
- First message changed to a Chinese greeting
- ElevenLabs language set to 'zh'
Vite inlined the worklet as a data URI with wrong MIME type (video/mp2t) and uncompiled TypeScript, causing AudioWorklet.addModule() to fail. Use Blob URL with plain JS source instead.
gemini-3.1-flash-live-preview does not accept clientContent text input, only audio input. gemini-2.5-flash-native-audio-latest supports both.
- Shared: add 'qwen-realtime' backend type, model/voice constants
- Hub: POST /voice/qwen-token route (DASHSCOPE_API_KEY / QWEN_API_KEY)
- Web: QwenVoiceSession using the DashScope Realtime WebSocket API (OpenAI-compatible protocol: session.update, input_audio_buffer, response.audio.delta, function calling via conversation.item.create)
- VoiceBackendSession: lazy-load the Qwen component
- Tests: qwen-token route tests (3 cases)
Switch via VOICE_BACKEND=qwen-realtime + DASHSCOPE_API_KEY=xxx
Without this, new deployments required users to close all tabs before the updated Service Worker would activate and serve new assets.
- Hub: add a WebSocket proxy at /api/voice/qwen-ws that injects the Authorization header (browser WebSocket can't set custom headers)
- Switch from qwen3.5-omni-plus-realtime (access denied / invite-only) to qwen3-omni-flash-realtime (publicly available)
- Frontend connects via the Hub proxy instead of directly to DashScope
- Change DEFAULT_VOICE_BACKEND from elevenlabs to qwen-realtime
- Change QWEN_REALTIME_VOICE from Cherry to Mia
- Increase maxHttpBufferSize to 55MB to match the upload limit
- Mute mic while the model is speaking to prevent noise-triggered barge-in
- Skip audio sending during the model turn to avoid false interrupts
- Remove the greeting prompt that blocked subsequent tool calls
- Add a critical tool usage instruction at the top of the system prompt
- Always log tool call and messageCodingAgent events (not just in DEV mode)
clientContent greeting creates a conversation turn that pushes the model into "chat mode", breaking subsequent tool calls. Instead, instruct the model to greet naturally when the user speaks first.
…g API keys
- Add JWT token verification to the /api/voice/gemini-ws and /api/voice/qwen-ws WebSocket endpoints before upgrading, preventing unauthenticated access to server-side API credentials (BLOCKER)
- Revert DEFAULT_VOICE_BACKEND to 'elevenlabs' so existing installs that only configured ElevenLabs are not broken (MAJOR)
- Remove the raw DashScope API key from the /voice/qwen-token response; the hub proxy handles the key server-side, so the browser never needs it (MAJOR)
- Update the frontend to pass the JWT via a ?token= query param on WS connections
- Change the composer send shortcut from Enter to Ctrl/Cmd+Enter
Force-pushed from 5578f22 to f5cbd0e
Findings
- [Major] Voice can drop the first tap while backend discovery is still loading — VoiceBackendSession now renders null until GET /voice/backend completes, but SessionChat still enables the voice button whenever the provider exists. If the user taps immediately, startRealtimeSession() sees no registered session and returns without changing status, so the action is silently lost. Evidence: web/src/realtime/VoiceBackendSession.tsx:35 and web/src/components/SessionChat.tsx:426.
Suggested fix:
const [voiceBackendReady, setVoiceBackendReady] = useState(false)
<HappyComposer
  onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
  onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
/>
<VoiceBackendSession
  api={props.api}
  micMuted={voice.micMuted}
  onStatusChange={voice.setStatus}
  onReadyChange={setVoiceBackendReady}
/>
Summary
Review mode: follow-up after new commits
One finding. The previously reported Qwen unauthenticated WebSocket proxy and raw DashScope key exposure are fixed on this head. Residual testing gap: no coverage for the async backend discovery/voice-button readiness path.
Testing
- Not run (automation)
HAPI Bot
return () => { cancelled = true }
}, [props.api])

if (!backend) return null
[MAJOR] VoiceBackendSession now renders nothing until backend discovery finishes, but SessionChat still wires onVoiceToggle as soon as the voice provider exists. During that initial window no registerVoiceSession() has run, so a quick tap calls startRealtimeSession(), hits the “No voice session registered” branch, and silently drops the user's first start attempt.
Suggested fix:
const [voiceBackendReady, setVoiceBackendReady] = useState(false)
<HappyComposer
onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
/>
<VoiceBackendSession
api={props.api}
micMuted={voice.micMuted}
onStatusChange={voice.setStatus}
onReadyChange={setVoiceBackendReady}
/>
Fixed in e32c1f6.
Added onReadyChange callback to VoiceBackendSession — it fires true after fetchVoiceBackend() resolves. SessionChat now gates onVoiceToggle / onVoiceMicToggle behind voiceBackendReady state, so the button stays disabled until registerVoiceSession() has completed.
VoiceBackendSession now fires onReadyChange(true) after backend discovery completes. SessionChat disables the voice toggle until ready, preventing silent drops when the user taps before registerVoiceSession() has run.
Findings
- [Blocker] Qwen's default proxy URL is concatenated twice, so the backend cannot connect with the route contract returned by the hub. POST /api/voice/qwen-token now returns a full proxy URL ending in /api/voice/qwen-ws, but the client still appends /api/voice/qwen-ws again before opening the socket. Evidence: web/src/realtime/QwenVoiceSession.tsx:106, web/src/realtime/QwenVoiceSession.tsx:109, hub/src/web/routes/voice.ts:167.
Suggested fix:
const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
- [Major] The first voice tap can still be dropped for Gemini/Qwen because readiness flips before the lazy backend module has mounted and called registerVoiceSession(). VoiceBackendSession now enables the button immediately after backend discovery, but both non-ElevenLabs backends still sit behind React.lazy() with a null fallback. On a cold chunk load, startRealtimeSession() still hits the unregistered-session branch. Evidence: web/src/realtime/VoiceBackendSession.tsx:9, web/src/realtime/VoiceBackendSession.tsx:30, web/src/realtime/VoiceBackendSession.tsx:46.
Suggested fix:
// only mark the backend ready after the selected session component has mounted
// and finished registerVoiceSession(...)
<GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />
- [Major] The new hub route tests still assert the pre-proxy token contract instead of the responses this PR now returns. Gemini no longer exposes the raw key, and Qwen now returns only wsUrl, so these assertions are out of sync with the implementation and will fail or validate the wrong API surface. Evidence: hub/src/web/routes/voice.test.ts:88, hub/src/web/routes/voice.test.ts:99, hub/src/web/routes/voice.test.ts:133, hub/src/web/routes/voice.test.ts:144.
Suggested fix:
const body = await res.json() as { allowed: boolean; wsUrl: string }
expect(body.allowed).toBe(true)
expect(body.wsUrl).toContain('/api/voice/qwen-ws')
Summary
Review mode: follow-up after new commits
Three findings. The prior backend-discovery gating issue is only partially fixed: lazy backends can still lose the first start click. Qwen also appears broken by default because the proxy path is appended twice. Test coverage is present, but the new route tests are still checking the old token shape.
Testing
- Not run (automation)
HAPI Bot
const proxyBase = state.wsBaseUrl || `${protocol}//${window.location.host}`
const model = QWEN_REALTIME_MODEL
const authToken = this.api.getAuthToken() || ''
const wsUrl = `${proxyBase}/api/voice/qwen-ws?model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
[BLOCKER] fetchQwenToken() now returns a full proxy URL ending in /api/voice/qwen-ws, but this code appends /api/voice/qwen-ws again before opening the socket. With the default hub response the browser ends up dialing .../api/voice/qwen-ws/api/voice/qwen-ws?..., so Qwen never reaches session.created.
Suggested fix:
const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
Fixed in fbf315b.
wsBaseUrl from the hub now contains the full proxy path. The client uses it directly as proxyUrl and only appends query params (model= and token=), no more path duplication.
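The corrected URL assembly can be sketched as a small pure helper. buildQwenWsUrl is an illustrative name (not from the PR); the key property is that the hub-supplied base already contains the /api/voice/qwen-ws path, so the client only appends query parameters:

```typescript
// Append model and token query params to a proxy URL that already carries
// its full path, choosing '?' or '&' depending on whether a query exists.
function buildQwenWsUrl(proxyUrl: string, model: string, token: string): string {
  const sep = proxyUrl.includes('?') ? '&' : '?'
  return `${proxyUrl}${sep}model=${encodeURIComponent(model)}&token=${encodeURIComponent(token)}`
}
```

With this shape, a base of wss://hub/api/voice/qwen-ws can never be doubled into .../qwen-ws/api/voice/qwen-ws, because the client no longer knows the path at all.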
fetchVoiceBackend(props.api).then((resp) => {
  if (!cancelled) {
    setBackend(resp.backend)
    props.onReadyChange?.(true)
[MAJOR] This still marks the voice backend ready before the lazy-loaded Gemini/Qwen component has mounted and called registerVoiceSession(). On a cold chunk load the button becomes clickable while startRealtimeSession() would still hit the "No voice session registered" branch, so the first tap can still be lost.
Suggested fix:
// only call onReadyChange(true) after the selected backend component mounts
// and completes registerVoiceSession(...)
<GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />
Fixed in fbf315b.
onReadyChange(true) is no longer called from VoiceBackendSession after discovery. Instead, each backend component (Gemini/Qwen/ElevenLabs) now accepts an onRegistered callback and fires it after registerVoiceSession() succeeds inside their mount useEffect. VoiceBackendSession passes handleRegistered → props.onReadyChange(true) down to the selected child, so the button only enables after the lazy chunk has loaded, mounted, and completed registration.
const app = createApp()
const res = await app.request('/api/voice/gemini-token', { method: 'POST' })
expect(res.status).toBe(200)
const body = await res.json() as { allowed: boolean; apiKey: string }
[MAJOR] These assertions are still checking the old token contract. POST /api/voice/gemini-token now returns a proxied websocket contract, and POST /api/voice/qwen-token returns only wsUrl, so the new tests are out of sync with the implementation and will fail or validate the wrong API surface.
Suggested fix:
const body = await res.json() as { allowed: boolean; wsUrl: string }
expect(body.allowed).toBe(true)
expect(body.wsUrl).toContain('/api/voice/gemini-ws')
Fixed in fbf315b.
Test assertions updated to match the new proxy contract:
- gemini-token: asserts apiKey === 'proxied' and wsUrl contains /api/voice/gemini-ws
- qwen-token: asserts allowed === true, wsUrl contains /api/voice/qwen-ws, and the apiKey property does not exist
All 10 tests pass.
…tered, update tests
- Fix BLOCKER: Qwen proxy URL was concatenated twice — wsBaseUrl from
hub already contains /api/voice/qwen-ws, so the client no longer
appends it again
- Fix MAJOR: onReadyChange now fires from onRegistered callback inside
each backend component (after registerVoiceSession completes), not
from VoiceBackendSession after discovery. This ensures lazy-loaded
Gemini/Qwen chunks have fully mounted before the voice button enables
- Fix MAJOR: update voice.test.ts to assert new proxy contract —
gemini-token returns { apiKey: 'proxied', wsUrl }, qwen-token returns
{ allowed, wsUrl } with no apiKey field
Findings
- [Blocker] The new proxy token routes build websocket URLs from a hard-coded localhost:24888 fallback, so Gemini/Qwen voice breaks when HAPI_PUBLIC_URL is unset.
- [Major] The primary AudioWorklet recording path never connects the worklet node to a live graph, so mic chunks are likely never emitted on browsers that successfully load the worklet.
- [Major] Gemini tool calls are executed in parallel against shared mutable session state, which can duplicate or reorder side effects.
Summary
Review mode: follow-up after new commits
Three findings on the current head. The earlier readiness and Qwen proxy regressions are fixed, but the remaining issues still block or destabilize the new voice backends in common self-hosted flows.
Testing
- Not run (automation)
HAPI Bot
// Use server-side WS proxy to avoid region restrictions.
// The proxy at /api/voice/gemini-ws handles the API key server-side.
const publicUrl = process.env.HAPI_PUBLIC_URL || `http://localhost:${process.env.HAPI_LISTEN_PORT || '24888'}`
[BLOCKER] HAPI_PUBLIC_URL is optional here, but this fallback always emits an absolute ws://localhost:24888/... proxy URL. On the default/local config that makes remote browsers open their own localhost instead of the hub, so both new voice backends fail before the websocket upgrade. The same issue is duplicated in the Qwen branch below.
Suggested fix:
const requestOrigin = new URL(c.req.url).origin
const publicUrl = process.env.HAPI_PUBLIC_URL || requestOrigin
const wsProxyUrl = publicUrl.replace(/^http/, "ws") + "/api/voice/gemini-ws"
Fixed in c68366a.
Both /voice/gemini-token and /voice/qwen-token now derive wsUrl from new URL(c.req.url).origin instead of hard-coded localhost:24888. HAPI_PUBLIC_URL still overrides when set. Remote browsers will correctly connect back to the hub.
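The origin-derivation logic can be sketched as a pure helper. resolveWsProxyUrl is an illustrative name; in the real route handler the request URL would come from c.req.url and the override from HAPI_PUBLIC_URL:

```typescript
// Derive the websocket proxy URL from the incoming request's origin,
// falling back to an explicit public URL only when one is configured.
function resolveWsProxyUrl(requestUrl: string, path: string, publicUrl?: string): string {
  const base = publicUrl || new URL(requestUrl).origin
  // http -> ws, https -> wss
  return base.replace(/^http/, 'ws') + path
}
```

This way a remote browser that reached the hub at https://hub.example.com gets wss://hub.example.com/... back, instead of a ws://localhost:24888/... URL that points at the browser's own machine.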
const base64 = arrayBufferToBase64(pcm16);
onChunk(base64);
};
this.sourceNode.connect(this.workletNode);
[MAJOR] On the main AudioWorklet path the recorder never connects the worklet node to any downstream sink. The fallback ScriptProcessorNode path does connect to destination, but this branch stops at source -> worklet, so the graph has nothing pulling frames and port.onmessage will not fire on browsers where the worklet loads successfully.
Suggested fix:
const sink = this.audioContext.createGain()
sink.gain.value = 0
this.sourceNode.connect(this.workletNode)
this.workletNode.connect(sink)
sink.connect(this.audioContext.destination)
Fixed in c68366a.
AudioWorklet node is now connected through a zero-gain GainNode to audioContext.destination: source → worklet → sink(gain=0) → destination. This gives the audio graph a downstream pull so port.onmessage fires correctly on all browsers.
export async function handleGeminiFunctionCalls(
  calls: GeminiFunctionCall[]
): Promise<GeminiFunctionResponse[]> {
  return Promise.all(calls.map(handleGeminiFunctionCall))
[MAJOR] Promise.all(...) runs Gemini tool calls concurrently against shared session state. That is unsafe here because processPermissionRequest() resolves the current first pending request each time; two calls in one turn can race, act on the same request twice, or reorder side effects.
Suggested fix:
const responses: GeminiFunctionResponse[] = []
for (const call of calls) {
responses.push(await handleGeminiFunctionCall(call))
}
return responses
Fixed in c68366a.
handleGeminiFunctionCalls now uses a sequential for...of loop instead of Promise.all. This ensures tool calls execute one at a time against shared session state, preventing duplicate processPermissionRequest resolutions.
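The sequential-dispatch pattern adopted here can be sketched generically. The Call/Response types below are simplified stand-ins for the real GeminiFunctionCall/GeminiFunctionResponse shapes, and handleCallsSequentially is an illustrative name:

```typescript
// Run async handlers one at a time: each call fully completes (including its
// side effects on shared session state) before the next call starts.
type Call = { id: string }
type Response = { id: string }

async function handleCallsSequentially(
  calls: Call[],
  handler: (c: Call) => Promise<Response>
): Promise<Response[]> {
  const responses: Response[] = []
  for (const call of calls) {
    // Awaiting inside the loop serializes side effects across calls,
    // unlike Promise.all, which starts every handler concurrently.
    responses.push(await handler(call))
  }
  return responses
}
```

The trade-off is latency: N tool calls now take the sum of their durations instead of the max, which is acceptable here because correctness of the shared permission-request state matters more than turn speed.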
…, serialize tool calls
- Fix BLOCKER: derive wsUrl from the request origin instead of the hard-coded localhost:24888 fallback, so remote browsers connect back to the hub. HAPI_PUBLIC_URL still overrides when set.
- Fix MAJOR: connect the AudioWorklet node to a silent GainNode sink so the audio graph pulls frames and port.onmessage fires correctly.
- Fix MAJOR: replace Promise.all with a sequential for-loop in handleGeminiFunctionCalls to prevent racing on shared session state (e.g. duplicate processPermissionRequest resolutions).
Findings
- [Major] Playback AudioContext creation moved into the async websocket setup path, so Gemini/Qwen can connect successfully but stay silent on mobile browsers that require audio contexts to be created/resumed in the original user gesture. Evidence: web/src/realtime/GeminiLiveVoiceSession.tsx:166, web/src/realtime/QwenVoiceSession.tsx:176, web/src/realtime/gemini/audioPlayer.ts:12.
Summary
- Review mode: follow-up after new commits
- One major regression remains on the current head: the new Gemini/Qwen backends still create playback audio contexts too late for mobile autoplay policies, so voice output can fail even when the websocket handshake succeeds.
Testing
- Not run (automation)
HAPI Bot
state.statusCallback?.('connected')

// Start audio capture
startAudioCapture()
[MAJOR] startAudioCapture() runs only after the websocket handshake finishes, and GeminiAudioPlayer creates its own AudioContext there. On iOS/mobile Safari that is already outside the original click gesture, so the session connects but playback can remain suspended/silent. QwenVoiceSession hits the same helper, so both new backends are affected.
Suggested fix:
// still inside startSession(), before awaiting websocket setup
const playbackContext = new AudioContext({ sampleRate: 24000 })
await playbackContext.resume()
// later, after setup completes
state.player = new GeminiAudioPlayer(playbackContext)
Fixed in 759bf35.
Both Gemini Live and Qwen Realtime now create and resume() the playback AudioContext({ sampleRate: 24000 }) at the very start of startSession(), while still inside the user's click/tap gesture. The pre-created context is then passed to GeminiAudioPlayer(playbackContext) via startAudioCapture().
This satisfies mobile autoplay policies (iOS Safari, Chrome Android) that require AudioContext creation within a user activation event.
…utoplay
Create and resume the playback AudioContext at the start of startSession(), while still inside the user's click/tap gesture. Pass the pre-created context to GeminiAudioPlayer so mobile browsers (iOS Safari) don't block audio output due to autoplay policy. Applies to both Gemini Live and Qwen Realtime backends.
Findings
- [Major] The follow-up autoplay fix now leaks the playback AudioContext for both Gemini and Qwen. startSession() creates a new context before the token/mic checks, but GeminiAudioPlayer.dispose() only closes contexts it created itself, so failed starts and normal stop/start cycles both leave orphaned contexts behind. On mobile browsers with low AudioContext limits, voice playback eventually stops working after a few retries. Evidence: web/src/realtime/GeminiLiveVoiceSession.tsx:69, web/src/realtime/QwenVoiceSession.tsx:84, web/src/realtime/gemini/audioPlayer.ts:69.
Suggested fix:
try {
  state.playbackContext = new AudioContext({ sampleRate: 24000 })
  await state.playbackContext.resume()
  // ...rest of setup...
  startAudioCapture(state.playbackContext)
} catch (error) {
  cleanup()
  throw error
}
function cleanup() {
  // existing cleanup...
  if (state.playbackContext && state.playbackContext.state !== 'closed') {
    void state.playbackContext.close()
    state.playbackContext = null
  }
}
Summary
- Review mode: follow-up after new commits
- One major regression remains on the current head: the mobile autoplay fix now leaks playback audio contexts in both new backends, so repeated retries eventually break voice output.
Testing
- Not run (automation environment does not have bun)
HAPI Bot
// Create playback AudioContext immediately while still inside the user
// gesture (click/tap). Mobile browsers require this for autoplay policy.
const playbackContext = new AudioContext({ sampleRate: 24000 })
[MAJOR] This new user-gesture AudioContext never gets closed. startAudioCapture() hands it to GeminiAudioPlayer, but GeminiAudioPlayer.dispose() only closes owned contexts (web/src/realtime/gemini/audioPlayer.ts:69), so both failed starts and normal stop/start cycles leak one playback context each. Qwen now has the same regression at web/src/realtime/QwenVoiceSession.tsx:84.
Suggested fix:
try {
state.playbackContext = new AudioContext({ sampleRate: 24000 })
await state.playbackContext.resume()
// ...rest of setup...
} catch (error) {
cleanup()
throw error
}
Fixed in 74aa4c2.
Both Gemini and Qwen now store the playback AudioContext in module state (state.playbackContext). cleanup() closes it (void state.playbackContext.close()) alongside the recorder, player, and WebSocket. This covers:
- Failed starts (token check, mic denial) — cleanup() at the top of startSession()
- Normal stop — endSession() → cleanup()
- Retries — the next startSession() calls cleanup() first
No more orphaned contexts on repeated start/stop cycles.
Move playback AudioContext into module state so cleanup() can close it on failed starts and normal stop/start cycles. Prevents orphaned AudioContext leaks that would exhaust mobile browser limits after repeated retries.
Review mode: follow-up after new commits
Findings
- [Major] Forced service-worker activation breaks already-open tabs after deploy. skipWaiting() + clientsClaim() immediately move existing sessions onto the new worker, but this PR also introduces lazy-loaded voice chunks. A tab that is still running the old app shell can then request an old chunk hash and fail to load voice after rollout. Evidence: web/src/sw.ts:24, web/src/realtime/VoiceBackendSession.tsx:8.
Suggested fix:
// Keep the new worker in waiting state until the page reloads,
// or trigger skipWaiting only after an explicit client prompt.
precacheAndRoute(self.__WB_MANIFEST)
- [Minor] Voice requests are now logged in production. The debug guard was removed around messageCodingAgent, so every dictated prompt and session id is written to the browser console. Evidence: web/src/realtime/realtimeClientTools.ts:48.
Suggested fix:
if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
  console.log('[Voice] messageCodingAgent called with:', message)
  console.log('[Voice] Sending message to session:', sessionId)
}
Summary
- Review mode: follow-up after new commits
- Two issues remain on the current head: one deploy-time regression in the service-worker update path, and one production logging regression for voice prompts.
Testing
- Not run (bun is not installed in this automation environment). Existing tests also do not cover the service-worker upgrade path.
HAPI Bot
}

// Activate new SW immediately without waiting for all tabs to close
self.addEventListener('install', () => { self.skipWaiting() })
[MAJOR] skipWaiting()/clientsClaim() force the new worker onto already-open tabs. Because this PR also lazy-loads the non-ElevenLabs voice backends, an existing page can still request old chunk hashes after deployment and hit a chunk-load failure mid-session.
Suggested fix:
// Leave the new worker waiting, or only call skipWaiting after the client opts in.
precacheAndRoute(self.__WB_MANIFEST)
Fixed in 296dc85.
Removed skipWaiting() and clientsClaim() from the service worker. New SW now stays in waiting state until all tabs close, preventing lazy-loaded chunk hash mismatches after deploy.
console.log('[Voice] messageCodingAgent called with:', message)
console.log('[Voice] Sending message to session:', sessionId)
}
console.log('[Voice] messageCodingAgent called with:', message)
[MINOR] This now logs every dictated agent request and the session id in production. Voice prompts often contain repo details or other sensitive instructions, so the old debug guard should stay in place here.
Suggested fix:
if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
console.log('[Voice] messageCodingAgent called with:', message)
console.log('[Voice] Sending message to session:', sessionId)
}
Fixed in 296dc85.
Wrapped the messageCodingAgent console.log calls with VOICE_CONFIG.ENABLE_DEBUG_LOGGING guard, matching the existing pattern in processPermissionRequest.
- Remove skipWaiting + clientsClaim from the service worker to prevent breaking lazy-loaded voice chunks in already-open tabs after deploy. The new SW now waits for all tabs to close before activating.
- Wrap the messageCodingAgent console.log calls with the VOICE_CONFIG debug guard to stop logging user prompts and session IDs in production.
Findings
- [Major] Gemini voice startup can hang indefinitely after an early socket close. startSession() only rejects on ws.onerror before setupComplete; if the proxy or upstream closes cleanly during the handshake, ws.onclose only marks the session disconnected and never rejects the pending promise, so voice.startVoice() stays stuck in connecting. Evidence: web/src/realtime/GeminiLiveVoiceSession.tsx:254
Suggested fix:
ws.onclose = (event) => {
  const message = event.reason || 'WebSocket closed before setup completed'
  cleanup()
  resetRealtimeSessionState()
  if (!setupDone) {
    state.statusCallback?.('error', message)
    reject(new Error(message))
    return
  }
  state.statusCallback?.('disconnected')
}
- [Major] Qwen voice has the same unresolved startup path, and the new error event handler also drops server-side setup failures on the floor. If DashScope returns an error event or closes before session.updated, the promise never settles and the UI remains stuck in connecting. Evidence: web/src/realtime/QwenVoiceSession.tsx:260, web/src/realtime/QwenVoiceSession.tsx:275
Suggested fix: add the same !sessionCreated rejection guard in ws.onclose, plus:
if (eventType === 'error') {
  const message = err?.message || 'Realtime session setup failed'
  state.statusCallback?.('error', message)
  if (!sessionCreated) {
    reject(new Error(message))
    ws.close()
  }
  return
}
- [Minor] Gemini debug logging is forced on in production. const DEBUG = true makes the new backend log connection state, tool calls, and model text on every user session, which regresses from the existing dev/config-gated logging pattern. Evidence: web/src/realtime/GeminiLiveVoiceSession.tsx:14
Suggested fix: const DEBUG = import.meta.env.DEV
Summary
- Review mode: follow-up after new commits
- Three issues remain on the current head: both new WebSocket voice backends can leave the UI stuck in `connecting` when setup fails early, and Gemini debug logging is still forced on in production.
Testing
- Not run (`bun` is not installed in this automation environment).
- New tests cover route/PCM/tool-adapter helpers, but they do not exercise WebSocket handshake failure paths for Gemini/Qwen.
HAPI Bot
| }
| ws.onclose = (event) => {
[MAJOR] If the socket closes before setupComplete, startSession() never settles. ws.onerror rejects the promise, but this onclose path only flips the status to disconnected, so a clean handshake failure leaves the voice UI stuck in connecting.
Suggested fix:
ws.onclose = (event) => {
const message = event.reason || 'WebSocket closed before setup completed'
cleanup()
resetRealtimeSessionState()
if (!setupDone) {
state.statusCallback?.('error', message)
reject(new Error(message))
return
}
state.statusCallback?.('disconnected')
}
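The settle-once idea behind this fix can be reduced to a small state machine that is independent of the WebSocket API. This is a hypothetical sketch (`createHandshake` and its shape are illustrative names, not from the PR): the start outcome is recorded exactly once, and a close before setup counts as failure.

```typescript
// Hypothetical reduction of the settle-once fix: the start promise's outcome
// is recorded exactly once, and a close before setup counts as failure.
type HandshakeResult = { status: 'pending' | 'ok' | 'error'; message?: string }

function createHandshake() {
  const result: HandshakeResult = { status: 'pending' }
  let setupDone = false
  return {
    result,
    // Fires when the server acknowledges setup; marks the start successful.
    onSetupComplete() {
      setupDone = true
      if (result.status === 'pending') result.status = 'ok'
    },
    // A close before setup is a startup failure; after setup it is a normal
    // disconnect and must not flip an already-settled outcome.
    onClose(reason?: string) {
      if (!setupDone && result.status === 'pending') {
        result.status = 'error'
        result.message = reason || 'WebSocket closed before setup completed'
      }
    },
  }
}
```

Wiring `resolve`/`reject` of the real start promise to these two transitions gives exactly one settlement on every path.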
Fixed in f108c0f.
Gemini ws.onclose now rejects the startup promise when !setupDone, with error status callback. Also changed DEBUG from true to import.meta.env.DEV.
Fixed in f108c0f (previous commit).
The onclose handler already rejects the promise when !setupDone, calls state.statusCallback?.('error', message), and returns early. This matches the suggested fix exactly.
| // Error
| if (eventType === 'error') {
[MAJOR] This drops server-side setup failures on the floor. If DashScope sends an error event before session.updated, the pending startSession() promise is neither resolved nor rejected, so the UI keeps spinning in connecting.
Suggested fix:
if (eventType === 'error') {
const message = err?.message || 'Realtime session setup failed'
state.statusCallback?.('error', message)
if (!sessionCreated) {
reject(new Error(message))
ws.close()
}
return
}
Fixed in f108c0f.
Qwen error event handler now calls reject() + ws.close() when !sessionCreated. onclose also rejects the promise on early close, matching the Gemini pattern.
Fixed in 7111b67.
Renamed the guard flag from sessionCreated → sessionReady and now only set it to true after session.updated arrives (the point where resolve() is called). If DashScope sends session.created but fails before session.updated, the error/close handlers now correctly reject the promise instead of leaving it hanging.
Additionally, the Gemini proxy in hub now queues client messages (especially the setup frame) in a pending buffer while the upstream WebSocket is still CONNECTING, then flushes on open — preventing the silent frame drop that could also cause a startup hang.
| import type { Session } from '@/types/api'
| import type { GeminiFunctionCall } from './gemini/toolAdapter'
| const DEBUG = true
[MINOR] DEBUG is hardcoded to true, so this backend now logs connection state, tool calls, and model text in production. The existing voice code gates these logs behind dev/config flags; this new backend should do the same.
Suggested fix:
const DEBUG = import.meta.env.DEV

- Gemini: reject the startup promise in onclose when setup hasn't completed, preventing the UI from hanging in 'connecting' state
- Qwen: reject on both server 'error' events and early onclose during handshake, with proper cleanup and status callback
- Change Gemini DEBUG from hardcoded true to import.meta.env.DEV to stop logging connection state and tool calls in production
Findings
- [Major] Gemini proxy can drop the setup frame before the upstream socket opens — the browser sends `setup` as soon as the hub WebSocket opens, but the proxy forwards client messages only when the Google upstream is already `OPEN`; messages received during upstream `CONNECTING` are discarded. That leaves Gemini Live waiting forever for `setupComplete`. Evidence `hub/src/web/server.ts:61` and client send at `web/src/realtime/GeminiLiveVoiceSession.tsx:148`.
  Suggested fix:
  const pending: Array<string | ArrayBuffer | Uint8Array> = []
  upstream.onopen = () => {
    for (const queued of pending.splice(0)) upstream.send(queued)
  }
  // in message(): queue until upstream opens instead of dropping setup
  if (upstream?.readyState === WebSocket.OPEN) upstream.send(message)
  else if (upstream?.readyState === WebSocket.CONNECTING) pending.push(message)
- [Major] Qwen startup still hangs when setup fails after `session.created` — `startSession()` only rejects while `!sessionCreated`, but the promise does not resolve until `session.updated`. If DashScope accepts the socket, sends `session.created`, then rejects the `session.update` or closes before `session.updated`, this code reports an error but never settles the pending start promise. Evidence `web/src/realtime/QwenVoiceSession.tsx:265` and `web/src/realtime/QwenVoiceSession.tsx:285`.
  Suggested fix:
  let sessionReady = false
  if (eventType === 'session.updated') {
    sessionReady = true
    state.statusCallback?.('connected')
    startAudioCapture(state.playbackContext!)
    resolve()
    return
  }
  if (eventType === 'error' && !sessionReady) {
    reject(new Error(message))
    ws.close()
    return
  }
  // onclose: reject when !sessionReady, not only when !sessionCreated
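The buffering the first finding suggests can be sketched as a transport-agnostic forwarder. Names here (`createForwarder`, the numeric ready-state constants) are assumptions for illustration, not the PR's code: frames arriving while the upstream is CONNECTING are queued and flushed in order once it opens.

```typescript
// Transport-agnostic sketch of queue-while-connecting, flush-on-open.
const CONNECTING = 0
const OPEN = 1
const CLOSED = 3

function createForwarder(send: (msg: string) => void) {
  let readyState = CONNECTING
  const pending: string[] = []
  return {
    forward(msg: string) {
      if (readyState === OPEN) send(msg)
      else if (readyState === CONNECTING) pending.push(msg) // e.g. the setup frame
      // CLOSED: drop; the caller should surface an error separately
    },
    open() {
      readyState = OPEN
      for (const queued of pending.splice(0)) send(queued) // flush in arrival order
    },
    close() {
      readyState = CLOSED
    },
  }
}
```

The same ordering guarantee is what the real proxy needs: the setup frame queued first must reach the upstream before any audio frames queued after it.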
Summary
- Review mode: follow-up after new commits
- Two startup reliability issues remain in the current head: Gemini proxy can discard the initial setup message, and Qwen can still leave the UI stuck in `connecting` on setup failures after `session.created`.
Testing
- Not run (automation: `bun` is not installed in this runner).
- Missing coverage: WebSocket proxy backpressure/queueing for Gemini setup and Qwen failure paths between `session.created` and `session.updated`.
HAPI Bot
| },
| message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
|   const upstream = upstreamMap.get(clientWs)
|   if (upstream?.readyState === WebSocket.OPEN) {
[MAJOR] Client messages are dropped while the upstream Gemini socket is still connecting. The browser sends the required setup frame immediately after the hub WebSocket opens, but this proxy only forwards when upstream.readyState === WebSocket.OPEN; if that frame arrives first, Google never receives setup and the client waits forever for setupComplete.
Suggested fix:
const pending: Array<string | ArrayBuffer | Uint8Array> = []
upstream.onopen = () => {
for (const queued of pending.splice(0)) upstream.send(queued)
}
if (upstream?.readyState === WebSocket.OPEN) {
upstream.send(message)
} else if (upstream?.readyState === WebSocket.CONNECTING) {
pending.push(message)
}

| const message = err?.message || 'Realtime session setup failed'
| console.error('[Qwen] Server error:', message)
| state.statusCallback?.('error', message)
| if (!sessionCreated) {
[MAJOR] This still leaves startup unresolved when setup fails after session.created. The promise resolves only on session.updated, but errors after sessionCreated = true skip reject(), so an invalid session.update response leaves the voice UI stuck in connecting.
Suggested fix:
let sessionReady = false
if (eventType === 'session.updated') {
  sessionReady = true
  state.statusCallback?.('connected')
  startAudioCapture(state.playbackContext!)
  resolve()
  return
}
if (eventType === 'error' && !sessionReady) {
  reject(new Error(message))
  ws.close()
  return
}

…tup hang Gemini proxy: buffer client messages (especially the setup frame) in a pending queue while the upstream WebSocket is still CONNECTING, then flush on open. Previously the setup frame was silently dropped.

Qwen: rename sessionCreated → sessionReady and only set it after session.updated arrives. If the server fails between session.created and session.updated the promise now correctly rejects instead of hanging forever.
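The renamed guard behaves like a three-state machine. A minimal sketch (hypothetical helper, not the PR's code) shows why `session.created` alone must not mark the start ready:

```typescript
// Only session.updated marks the start ready; errors or closes before that
// fail it, and session.created alone settles nothing.
type StartState = 'pending' | 'ready' | 'failed'

function createStartGuard() {
  let current: StartState = 'pending'
  return {
    get state() { return current },
    onEvent(type: string) {
      if (current !== 'pending') return // first settlement wins
      if (type === 'session.updated') current = 'ready'
      else if (type === 'error') current = 'failed'
      // 'session.created' alone does not settle the start promise
    },
    onClose() {
      if (current === 'pending') current = 'failed'
    },
  }
}
```

Mapping `ready`/`failed` to `resolve()`/`reject()` of the real start promise covers every ordering of `session.created`, `session.updated`, `error`, and close.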
Findings
- [Major] Proxy/upstream URL reuse leaks the hub JWT and bypasses the server-side proxy whenever `GEMINI_LIVE_WS_URL` or `QWEN_REALTIME_WS_URL` is set. `/voice/*-token` returns those env values to the browser, and both clients append `token=<hub-jwt>` before connecting, while the hub also reuses the same env vars for its upstream socket target. Evidence `hub/src/web/routes/voice.ts:154`, `hub/src/web/routes/voice.ts:176`, `hub/src/web/server.ts:37`, `hub/src/web/server.ts:98`, `web/src/realtime/GeminiLiveVoiceSession.tsx:111`, `web/src/realtime/QwenVoiceSession.tsx:123`.
  Suggested fix:
  const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
  const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'
  return c.json({ allowed: true, apiKey: 'proxied', wsUrl: geminiClientWsUrl })
  Apply the same split for Qwen, and keep the browser-facing `wsUrl` pinned to `/api/voice/*-ws`.
- [Major] The shared Chinese prompt/first-message change is not additive: it changes the existing ElevenLabs assistant globally and bypasses the user language preference that the web app still passes only as `overrides.agent.language`. Even when the user selects English, the base prompt now instructs "Always respond in Chinese" and the first message is hard-coded Chinese. Evidence `shared/src/voice.ts:145`, `shared/src/voice.ts:157`, `shared/src/voice.ts:242`, `web/src/lib/voice-context.tsx:43`, `web/src/realtime/RealtimeVoiceSession.tsx:74`.
  Suggested fix:
  const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
  const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'
  export function buildVoiceAgentConfig(): VoiceAgentConfig {
    return {
      conversation_config: {
        agent: {
          first_message: ELEVENLABS_FIRST_MESSAGE,
          language: 'en',
          prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
        }
      }
    }
  }
  If Gemini/Qwen should default to Chinese, give those backends their own prompt constants instead of reusing the ElevenLabs config.
Summary
- Review mode: follow-up after new commits
- Two issues remain in the current head: the new `*_WS_URL` env vars act as both upstream and browser-facing URLs, which can leak the hub JWT outside the proxy path, and the shared prompt change silently regresses the existing ElevenLabs language behavior.
Testing
- Not run (review constraints: static analysis only; no PR code execution).
- Missing coverage: custom `GEMINI_LIVE_WS_URL`/`QWEN_REALTIME_WS_URL` behavior, and preserving ElevenLabs language selection after the shared prompt refactor.
HAPI Bot
| return c.json({
|   allowed: true,
|   apiKey: 'proxied', // Dummy — key is handled server-side
|   wsUrl: process.env.GEMINI_LIVE_WS_URL || wsProxyUrl,
[MAJOR] GEMINI_LIVE_WS_URL is being reused as both the upstream target and the browser-facing wsUrl. If an operator sets it to a direct/custom upstream URL, /voice/gemini-token returns that URL to the browser and the client appends token=<hub-jwt> before connecting (web/src/realtime/GeminiLiveVoiceSession.tsx:111). That bypasses /api/voice/gemini-ws and leaks the hub JWT to the upstream service. Qwen has the same pattern at /api/voice/qwen-token.
Suggested fix:
const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'
return c.json({
allowed: true,
apiKey: 'proxied',
wsUrl: geminiClientWsUrl
})

| # Language
| IMPORTANT: Always respond in Chinese (Mandarin). Use natural spoken Chinese.
[MAJOR] This is now changing the existing ElevenLabs assistant globally, not just the new backends. The web app still only overrides agent.language (web/src/realtime/RealtimeVoiceSession.tsx:74), so users who selected English now get a base prompt and first message that explicitly force Chinese. That breaks the existing language preference path and contradicts the additive/no-breakage goal.
Suggested fix:
const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'
export function buildVoiceAgentConfig(): VoiceAgentConfig {
return {
conversation_config: {
agent: {
first_message: ELEVENLABS_FIRST_MESSAGE,
language: 'en',
prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
}
}
}
}

…from Gemini/Qwen

Security: hub token endpoints now always return the proxy URL (/api/voice/*-ws) regardless of GEMINI_LIVE_WS_URL or QWEN_REALTIME_WS_URL env vars. Those env vars are upstream-only and never exposed to the browser, preventing JWT leakage to external endpoints.

Language: extracted the Chinese language block from VOICE_SYSTEM_PROMPT into VOICE_CHINESE_LANGUAGE_BLOCK, appended only by Gemini and Qwen backends. ElevenLabs keeps a neutral English base prompt and controls language via its own language field, preserving user language preference.
|
Both issues fixed in 5c60488.

1. WS URL leak — Token endpoints now always return the hub proxy URL (
2. ElevenLabs language regression — Extracted the Chinese language block from
Findings
- [Major] Initial mic mute is lost in the new Gemini/Qwen backends — `micMuted` can already be `true` before a session starts, but both components only apply it in an effect that runs while `state.recorder` is still `null`. `startAudioCapture()` then creates a live recorder and never reapplies the existing mute state, so the UI can show muted while audio is still captured until the user toggles again. Evidence `web/src/realtime/GeminiLiveVoiceSession.tsx:380`, `web/src/realtime/QwenVoiceSession.tsx:397`, related context `web/src/lib/voice-context.tsx:23`.
  Suggested fix:
  const micMutedRef = useRef(micMuted)
  useEffect(() => {
    micMutedRef.current = micMuted
    state.recorder?.setMuted(micMuted)
  }, [micMuted])
  void state.recorder.start(...).then(() => {
    state.recorder?.setMuted(micMutedRef.current)
  })
- [Major] Composer Enter-to-send was flipped for the whole web app — plain `Enter` now inserts a newline and only `Ctrl/Cmd+Enter` sends. That changes the existing chat workflow for every keyboard user and is unrelated to the voice backend feature. Evidence `web/src/components/AssistantChat/HappyComposer.tsx:314`.
  Suggested fix:
  if (key === 'Enter' && e.shiftKey) { return }
  if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
    e.preventDefault()
    api.composer().send()
    setShowContinueHint(false)
    return
  }
Summary
- Review mode: follow-up after new commits
- Two issues remain in the current head: the new Gemini/Qwen sessions can capture audio while the UI still shows the mic as muted, and the composer key handling now regresses the existing Enter-to-send behavior.
Testing
- Not run (automation; static review only)
- Missing coverage: starting Gemini/Qwen while `micMuted=true`, and keyboard composer behavior for `Enter`, `Shift+Enter`, and `Ctrl/Cmd+Enter`
HAPI Bot
| }, [api]) // eslint-disable-line react-hooks/exhaustive-deps
| // Sync mic mute state
| useEffect(() => {
[MAJOR] micMuted is only applied here when the prop changes after state.recorder already exists. If the user stopped voice while muted and starts a new Gemini session, micMuted is already true, this effect runs before startAudioCapture() creates the recorder, and the initial mute never gets reapplied. The UI stays muted while the recorder still captures audio.
Suggested fix:
const micMutedRef = useRef(micMuted)
useEffect(() => {
micMutedRef.current = micMuted
state.recorder?.setMuted(micMuted)
}, [micMuted])
void state.recorder.start(...).then(() => {
state.recorder?.setMuted(micMutedRef.current)
})

| }
| }, [api]) // eslint-disable-line react-hooks/exhaustive-deps
| useEffect(() => {
[MAJOR] Same initial-mute regression as Gemini: this effect only updates an existing recorder. When micMuted is already true before the Qwen session starts, startAudioCapture() creates a fresh recorder after this effect has already run, so audio capture begins unmuted until the user toggles again.
Suggested fix:
const micMutedRef = useRef(micMuted)
useEffect(() => {
micMutedRef.current = micMuted
state.recorder?.setMuted(micMuted)
}, [micMuted])
void state.recorder.start(...).then(() => {
state.recorder?.setMuted(micMutedRef.current)
})

| // Only plain Enter (no modifiers) sends; other modifier combos are ignored
| if (key === 'Enter') {
|   // Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message
[MAJOR] This flips the existing composer shortcut from plain Enter to Ctrl/Cmd+Enter for every chat session. The previous behavior already supported multi-line input with Shift+Enter, so this changes a core keyboard workflow without anything in the voice feature requiring it.
Suggested fix:
if (key === 'Enter' && e.shiftKey) {
return
}
if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey) {
e.preventDefault()
if (canSend) {
api.composer().send()
setShowContinueHint(false)
}
return
}

Both Gemini and Qwen backends now persist micMuted to module-level state and reapply it immediately after startAudioCapture() creates the recorder. Previously the React mute effect could run while the recorder was still null, causing audio to be captured even when the UI showed the mic as muted.
|
Re: review at 2026-04-22T03:35:05Z

1. Mic mute initial state — Fixed in aa9802d. Both Gemini and Qwen now persist
2. Composer Enter → Ctrl+Enter — This is an intentional change requested separately from the voice feature. The previous Enter-to-send conflicted with multi-line input needs. The new behavior:
This is a deliberate UX improvement, not a regression. |
Findings
- [Major] Gemini can silently unmute the user after any model response — `turnComplete` always calls `state.recorder?.setMuted(false)`, so a user who intentionally muted the mic gets re-exposed as soon as the first reply finishes. Evidence `web/src/realtime/GeminiLiveVoiceSession.tsx:217`, related state `web/src/realtime/GeminiLiveVoiceSession.tsx:388`.
  Suggested fix:
  function applyRecorderMute(): void {
    state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
  }
  if (!state.modelSpeaking) {
    state.modelSpeaking = true
    applyRecorderMute()
  }
  if (serverContent.turnComplete) {
    state.modelSpeaking = false
    applyRecorderMute()
  }
- [Major] Failed starts leak the newly created audio context in both new backends — each `startSession()` opens `state.playbackContext` before token/mic validation, but the early `throw` paths never call `cleanup()`. Repeated failures leave contexts alive and can block later starts once the browser hits its context cap. Evidence `web/src/realtime/GeminiLiveVoiceSession.tsx:78`, `web/src/realtime/GeminiLiveVoiceSession.tsx:89`, same pattern `web/src/realtime/QwenVoiceSession.tsx:94`, `web/src/realtime/QwenVoiceSession.tsx:102`.
  Suggested fix:
  async startSession(config: VoiceSessionConfig): Promise<void> {
    cleanup()
    try {
      state.playbackContext = new AudioContext({ sampleRate: 24000 })
      await state.playbackContext.resume()
      // existing setup...
    } catch (error) {
      cleanup()
      throw error
    }
  }
- [Major] Composer Enter-to-send is still inverted for the whole chat UI — the current handler only sends on `Ctrl/Cmd+Enter`, while plain `Enter` now inserts a newline. That changes a core workflow unrelated to the voice feature. Evidence `web/src/components/AssistantChat/HappyComposer.tsx:314`.
  Suggested fix:
  if (key === 'Enter' && e.shiftKey) { return }
  if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
    e.preventDefault()
    api.composer().send()
    setShowContinueHint(false)
    return
  }
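The mute rule in the first finding reduces to combining two independent flags. This illustrative sketch (names are assumptions, not the PR's code) shows why `turnComplete` must clear only the model flag and never the user's choice:

```typescript
// The recorder is muted when the user muted it OR the model is speaking;
// turnComplete clears only the model flag, preserving the user preference.
function createMuteState() {
  const s = { userMuted: false, modelSpeaking: false }
  const recorderMuted = () => s.userMuted || s.modelSpeaking
  return {
    setUserMuted(muted: boolean) { s.userMuted = muted; return recorderMuted() },
    modelStarted() { s.modelSpeaking = true; return recorderMuted() },
    turnComplete() { s.modelSpeaking = false; return recorderMuted() },
  }
}
```

Each transition returns the value the real code would pass to `recorder.setMuted()`, so the recorder state is always derived, never toggled directly.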
Summary
- Review mode: follow-up after new commits
- Three issues remain at the current head: Gemini still overrides a user mute after model speech, failed Gemini/Qwen starts leak audio contexts, and the composer shortcut regression is still present.
Testing
- Not run (automation; static review only)
- Missing coverage: manual mute persisted across Gemini responses, repeated failed starts for Gemini/Qwen, and `Enter` vs `Shift+Enter` vs `Ctrl/Cmd+Enter` composer behavior
HAPI Bot
| console.log('[GeminiLive] Turn complete')
| // Model done — unmute mic for next user turn
| state.modelSpeaking = false
| state.recorder?.setMuted(false)
[MAJOR] turnComplete always unmutes the recorder, so a user who intentionally muted the mic gets re-exposed as soon as the first Gemini reply finishes.
Suggested fix:
function applyRecorderMute(): void {
state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
}
if (serverContent.turnComplete) {
state.modelSpeaking = false
applyRecorderMute()
}

| // Create playback AudioContext immediately while still inside the user
| // gesture (click/tap). Mobile browsers require this for autoplay policy.
| // Store in state so cleanup() can close it on failure or stop.
| state.playbackContext = new AudioContext({ sampleRate: 24000 })
[MAJOR] This AudioContext is created before token/mic validation, but the early throw paths above never call cleanup(). The same pattern exists in QwenVoiceSession. A few failed starts can leave enough contexts alive to block later voice sessions.
Suggested fix:
async startSession(config: VoiceSessionConfig): Promise<void> {
cleanup()
try {
state.playbackContext = new AudioContext({ sampleRate: 24000 })
await state.playbackContext.resume()
// existing setup...
} catch (error) {
cleanup()
throw error
}
}
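The cleanup-on-failure shape in the suggested fix can be isolated from the audio APIs. In this sketch the `steps` array stands in for the real setup sequence (create AudioContext, fetch token, open mic); the names are illustrative, not from the PR:

```typescript
// Any throw during startup releases partial resources before rethrowing,
// so repeated failed starts cannot accumulate leaked contexts.
function startWithCleanup(steps: Array<() => void>, cleanup: () => void): boolean {
  try {
    for (const step of steps) step()
    return true
  } catch (error) {
    cleanup() // release the playback context and other partial resources
    throw error
  }
}
```

The key property is that `cleanup()` runs exactly once on the failure path and not at all on success, matching the try/catch wrapper the review proposes for `startSession()`.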
| // Only plain Enter (no modifiers) sends; other modifier combos are ignored
| if (key === 'Enter') {
|   // Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message
[MAJOR] This still flips the app-wide composer shortcut from plain Enter to Ctrl/Cmd+Enter. The voice backend work does not require changing the default send behavior.
Suggested fix:
if (key === 'Enter' && e.shiftKey) {
return
}
if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
e.preventDefault()
api.composer().send()
setShowContinueHint(false)
return
}
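The reviewer's suggested key handling can be expressed as a pure predicate, which also makes it easy to unit test. This is a hypothetical helper, not code from the PR: plain Enter sends, Shift+Enter inserts a newline, and other modifier combos are ignored.

```typescript
// Pure decision function for the composer key handler sketched above.
interface KeyEvent {
  key: string
  shiftKey: boolean
  ctrlKey: boolean
  altKey: boolean
  metaKey: boolean
}

function composerAction(e: KeyEvent): 'send' | 'newline' | 'none' {
  if (e.key !== 'Enter') return 'none'
  if (e.shiftKey) return 'newline' // let the textarea insert the newline
  if (e.ctrlKey || e.altKey || e.metaKey) return 'none'
  return 'send'
}
```

Keeping the decision separate from the side effects (`preventDefault`, `send()`) lets the three cases the review calls out be asserted directly.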
Summary
Add a pluggable voice backend architecture that extends the existing ElevenLabs ConvAI integration with two new voice providers:
- Gemini Live (`gemini-live`): Google's real-time audio streaming API via WebSocket, with full function calling support for `messageCodingAgent` and `processPermissionRequest`
- Qwen Realtime (`qwen-realtime`): Alibaba's DashScope real-time voice API via Hub WebSocket proxy, supporting voice conversation (function calling pending model support)

Users can switch backends via the `VOICE_BACKEND` environment variable. The existing ElevenLabs integration remains the default and is completely unchanged.

Key Design Decisions

- `GET /voice/backend` lets the frontend detect the active backend without Vite rebuild
- `React.lazy()` ensures alternative backends are only loaded when active
- `?url` import to avoid MIME type issues in production builds
- `/api/voice/qwen-ws` proxy because the browser WebSocket API cannot set `Authorization` headers
- `skipWaiting` + `clientsClaim` to service worker for instant deployment updates

Configuration

Files Changed

- shared/src/voice.ts
- hub/src/web/routes/voice.ts
- hub/src/web/server.ts
- web/src/api/client.ts, voice.ts
- web/src/realtime/GeminiLiveVoiceSession.tsx
- web/src/realtime/QwenVoiceSession.tsx
- web/src/realtime/gemini/
- web/src/realtime/VoiceBackendSession.tsx
- web/src/components/SessionChat.tsx — VoiceBackendSession instead of RealtimeVoiceSession
- web/src/sw.ts — skipWaiting + clientsClaim
- hub/src/web/routes/voice.test.ts, pcmUtils.test.ts, toolAdapter.test.ts

Test Plan