feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #401

Open
Overbaker wants to merge 21 commits into tiann:main from Overbaker:feat/pluggable-voice-backend

Conversation

@Overbaker

Summary

Add a pluggable voice backend architecture that extends the existing ElevenLabs ConvAI integration with two new voice providers:

  • Gemini 2.5 Live (gemini-live): Google's real-time audio streaming API via WebSocket, with full function calling support for messageCodingAgent and processPermissionRequest
  • Qwen Realtime (qwen-realtime): Alibaba's DashScope real-time voice API via Hub WebSocket proxy, supporting voice conversation (function calling pending model support)

Users can switch backends via the VOICE_BACKEND environment variable. The existing ElevenLabs integration remains the default and is completely unchanged.

Key Design Decisions

  • Runtime discovery: GET /voice/backend lets the frontend detect the active backend without Vite rebuild
  • Code splitting: React.lazy() ensures alternative backends are only loaded when active
  • Zero upstream breakage: All original ElevenLabs code paths untouched; new code is additive
  • Inline AudioWorklet: Uses Blob URL instead of Vite ?url import to avoid MIME type issues in production builds
  • Qwen WebSocket proxy: Hub proxies Qwen connections at /api/voice/qwen-ws because browser WebSocket API cannot set Authorization headers
  • Barge-in prevention: Auto-mutes microphone during model speech to prevent ambient noise from interrupting responses
  • PWA immediate activation: Added skipWaiting + clientsClaim to service worker for instant deployment updates
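The Blob-URL worklet trick from the design notes above can be sketched like this (illustrative only; `buildRecorderWorkletSource` and the `'pcm-recorder'` name are assumptions, not the PR's actual identifiers). The worklet source is kept as a plain-JS string and loaded through a Blob URL, so the bundler never rewrites it or serves it with a wrong MIME type:

```typescript
// Sketch only: identifiers below (buildRecorderWorkletSource, 'pcm-recorder')
// are illustrative, not the PR's real names.
function buildRecorderWorkletSource(processorName: string): string {
  // Plain JavaScript, so no TypeScript compilation or bundler transform is involved.
  return `
class PcmRecorderProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0]
    if (channel) this.port.postMessage(channel.slice(0))
    return true
  }
}
registerProcessor('${processorName}', PcmRecorderProcessor)
`.trim()
}

// In the browser (inside the recorder setup):
//   const blob = new Blob([buildRecorderWorkletSource('pcm-recorder')],
//                         { type: 'application/javascript' })
//   const url = URL.createObjectURL(blob)
//   await audioContext.audioWorklet.addModule(url)
//   URL.revokeObjectURL(url)
```

Because the source never passes through the module graph, a `?url` import cannot mislabel it (e.g. as video/mp2t) in production builds.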

Configuration

# Gemini Live (recommended - free tier, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key

# Qwen Realtime (voice-only, function calling not yet supported by model)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key

# ElevenLabs (default, unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
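The selection these variables drive can be sketched as a small resolver (`resolveBackend` is a hypothetical helper; the three backend ids and the ElevenLabs default match the PR description):

```typescript
// Sketch: resolveBackend is hypothetical; the backend ids and the
// ElevenLabs fallback are the contract described in this PR.
type VoiceBackendType = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'

function resolveBackend(env: string | undefined): VoiceBackendType {
  switch (env) {
    case 'gemini-live':
    case 'qwen-realtime':
      return env
    default:
      // Unset or unrecognized values fall back to the existing ElevenLabs flow.
      return 'elevenlabs'
  }
}
```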

Files Changed

  • Shared (shared/src/voice.ts): voice backend types, Gemini/Qwen model constants, tool-optimized system prompt
  • Hub routes (hub/src/web/routes/voice.ts): backend discovery + token endpoints for Gemini & Qwen
  • Hub server (hub/src/web/server.ts): Qwen WebSocket proxy handler
  • Web API (web/src/api/client.ts, voice.ts): client functions for the new endpoints
  • Gemini session (web/src/realtime/GeminiLiveVoiceSession.tsx): full Gemini Live implementation (WebSocket + AudioWorklet)
  • Qwen session (web/src/realtime/QwenVoiceSession.tsx): Qwen Realtime implementation (OpenAI-compatible protocol)
  • Audio pipeline (web/src/realtime/gemini/): PCM utils, AudioWorklet recorder, 24 kHz player, tool adapter
  • Switcher (web/src/realtime/VoiceBackendSession.tsx): dynamic backend selector with lazy loading
  • Integration (web/src/components/SessionChat.tsx): uses VoiceBackendSession instead of RealtimeVoiceSession
  • PWA (web/src/sw.ts): skipWaiting + clientsClaim
  • Tests (hub/src/web/routes/voice.test.ts, pcmUtils.test.ts, toolAdapter.test.ts): 16 test cases

Test Plan

  • ElevenLabs backend still works (no code changes to existing paths)
  • Gemini Live: voice conversation works
  • Gemini Live: function calling (messageCodingAgent) triggers correctly
  • Gemini Live: barge-in prevention (no mid-speech interruption from noise)
  • Qwen Realtime: voice conversation works via Hub WebSocket proxy
  • Hub route tests pass (backend discovery, token endpoints)
  • PCM audio conversion round-trip tests pass
  • Tool adapter tests pass
  • TypeScript type-check passes for both hub and web
  • Test on mobile browsers (iOS Safari, Android Chrome)


@github-actions (bot) left a comment


Findings

  • [Blocker] Qwen WebSocket proxy bypasses API auth and can be opened without a JWT, which lets any reachable client consume the hub's DashScope credentials through /api/voice/qwen-ws. Evidence hub/src/web/server.ts:328.
  • [Major] The fallback voice backend is now gemini-live, so existing installs that only configured ElevenLabs will be routed away from the existing token flow and fail voice startup. Evidence shared/src/voice.ts:280, hub/src/web/routes/voice.ts:121.
  • [Major] The Qwen frontend still requires the hub to return a raw DashScope key even though the browser never uses it after switching to the hub WebSocket proxy, so every authenticated web client now receives a long-lived provider secret unnecessarily. Evidence web/src/realtime/QwenVoiceSession.tsx:84, hub/src/web/routes/voice.ts:162.

Summary
Review mode: initial
Three findings. Added coverage does not exercise the new /api/voice/qwen-ws auth boundary, and the new route tests would not catch the default-backend regression because they do not assert against DEFAULT_VOICE_BACKEND.

Testing

  • Not run (automation): bun is not installed in this runner.

HAPI Bot

Comment thread hub/src/web/server.ts
return socketHandler.fetch(req, server as never)
}
// Qwen Realtime WebSocket proxy
if (url.pathname === '/api/voice/qwen-ws') {

[BLOCKER] This branch upgrades /api/voice/qwen-ws before the request ever reaches app.fetch(req), so it bypasses createAuthMiddleware() entirely. As written, any client that can reach the hub can open a proxied DashScope session with the server-side API key.

Suggested fix:

const token = url.searchParams.get('token')
if (!token) {
    return new Response('Missing authorization token', { status: 401 })
}

await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
const upgraded = server.upgrade(req, {
    data: { _qwenProxy: true, apiKey, model }
})


Fixed in f5cbd0e.

Both /api/voice/gemini-ws and /api/voice/qwen-ws now require a ?token=<JWT> query parameter. The fetch handler verifies the token via jwtVerify(token, jwtSecret, { algorithms: ['HS256'] }) before calling server.upgrade().

Frontend voice sessions now attach the JWT from ApiClient.getAuthToken() to the WebSocket URL.

Comment thread shared/src/voice.ts Outdated
export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime'
export const QWEN_REALTIME_VOICE = 'Mia'

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'gemini-live'

[MAJOR] This flips the implicit backend from ElevenLabs to Gemini Live. On a hub that only has ELEVENLABS_API_KEY configured, /api/voice/backend now resolves to gemini-live, and VoiceBackendSession will send users into /api/voice/gemini-token instead of the existing ElevenLabs flow.

Suggested fix:

export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs'


Fixed in f5cbd0e.

DEFAULT_VOICE_BACKEND reverted to 'elevenlabs'. Hubs with only ELEVENLABS_API_KEY configured will continue to work as before. Users who want Gemini Live or Qwen can set VOICE_BACKEND=gemini-live or VOICE_BACKEND=qwen-realtime explicitly.

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated

// Get API key from hub
const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed || !tokenResp.apiKey) {

[MAJOR] The browser does not use this provider key after the Qwen proxy change. startSession() only checks that apiKey exists, then opens /api/voice/qwen-ws, where the hub injects its own Authorization header. Returning a long-lived DashScope key here needlessly leaks it to every authenticated web client.

Suggested fix:

const tokenResp = await fetchQwenToken(this.api)
if (!tokenResp.allowed) {
    const msg = tokenResp.error ?? 'DashScope API key not available'
    state.statusCallback?.('error', msg)
    throw new Error(msg)
}
state.wsBaseUrl = tokenResp.wsUrl || null


Fixed in f5cbd0e.

/voice/qwen-token now returns { allowed: true, wsUrl } only — no apiKey field. The DashScope key stays server-side in the WebSocket proxy. Frontend QwenVoiceSession updated accordingly: it checks allowed, gets wsUrl, and connects to the hub proxy with a JWT token.

Same pattern already used for Gemini Live (apiKey: 'proxied').
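A minimal sketch of that token contract (`buildQwenTokenResponse` is a hypothetical helper; the `{ allowed, wsUrl }` shape with no `apiKey` field is the contract described above):

```typescript
// Sketch: the builder function is hypothetical; the response shape
// ({ allowed, wsUrl }, never apiKey) follows the fix described above.
interface QwenTokenResponse {
  allowed: boolean
  wsUrl?: string
  error?: string
}

function buildQwenTokenResponse(hasDashScopeKey: boolean, origin: string): QwenTokenResponse {
  if (!hasDashScopeKey) {
    return { allowed: false, error: 'DashScope API key not configured' }
  }
  // The provider key stays server-side; the client only learns the proxy URL.
  return {
    allowed: true,
    wsUrl: origin.replace(/^http/, 'ws') + '/api/voice/qwen-ws',
  }
}
```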

@tiann (Owner) left a comment


Thank you for your contribution. I believe this is a good feature. Please fix the comments first.

Add a strategy-based voice backend architecture that allows switching
between ElevenLabs ConvAI and Gemini Live API via VOICE_BACKEND env var.

- Shared: VoiceBackendType, Gemini Live config builder, tool definitions
- Hub: GET /voice/backend discovery, POST /voice/gemini-token with proxy support
- Web: GeminiLiveVoiceSession (WebSocket + AudioWorklet audio pipeline),
  VoiceBackendSession dynamic switcher with React.lazy() code splitting,
  Gemini tool adapter bridging existing client tools
- Tests: hub route tests, pcmUtils round-trip tests, toolAdapter tests
- Zero changes to existing ElevenLabs code paths
- System prompt instructs assistant to respond in Mandarin
- First message changed to Chinese greeting
- ElevenLabs language set to 'zh'
Vite inlined the worklet as a data URI with wrong MIME type (video/mp2t)
and uncompiled TypeScript, causing AudioWorklet.addModule() to fail.
Use Blob URL with plain JS source instead.
gemini-3.1-flash-live-preview does not accept clientContent text input,
only audio input. gemini-2.5-flash-native-audio-latest supports both.
- Shared: add 'qwen-realtime' backend type, model/voice constants
- Hub: POST /voice/qwen-token route (DASHSCOPE_API_KEY / QWEN_API_KEY)
- Web: QwenVoiceSession using DashScope Realtime WebSocket API
  (OpenAI-compatible protocol: session.update, input_audio_buffer,
  response.audio.delta, function calling via conversation.item.create)
- VoiceBackendSession: lazy-load Qwen component
- Tests: qwen-token route tests (3 cases)

Switch via VOICE_BACKEND=qwen-realtime + DASHSCOPE_API_KEY=xxx
Without this, new deployments required users to close all tabs before
the updated Service Worker would activate and serve new assets.
- Hub: add WebSocket proxy at /api/voice/qwen-ws that injects
  Authorization header (browser WebSocket can't set custom headers)
- Switch from qwen3.5-omni-plus-realtime (access denied / invite-only)
  to qwen3-omni-flash-realtime (publicly available)
- Frontend connects via Hub proxy instead of direct to DashScope
- Change DEFAULT_VOICE_BACKEND from elevenlabs to qwen-realtime
- Change QWEN_REALTIME_VOICE from Cherry to Mia
- Increase maxHttpBufferSize to 55MB to match upload limit
- Mute mic while model is speaking to prevent noise-triggered barge-in
- Skip audio sending during model turn to avoid false interrupts
- Remove greeting prompt that blocked subsequent tool calls
- Add critical tool usage instruction at top of system prompt
- Always log tool call and messageCodingAgent events (not just DEV mode)
clientContent greeting creates a conversation turn that pushes the model
into "chat mode", breaking subsequent tool calls. Instead, instruct the
model to greet naturally when the user speaks first.
…g API keys

- Add JWT token verification to /api/voice/gemini-ws and /api/voice/qwen-ws
  WebSocket endpoints before upgrading, preventing unauthenticated access to
  server-side API credentials (BLOCKER)
- Revert DEFAULT_VOICE_BACKEND to 'elevenlabs' so existing installs that only
  configured ElevenLabs are not broken (MAJOR)
- Remove raw DashScope API key from /voice/qwen-token response; the hub proxy
  handles the key server-side, so the browser never needs it (MAJOR)
- Update frontend to pass JWT via ?token= query param on WS connections
- Change composer send shortcut from Enter to Ctrl/Cmd+Enter
@Overbaker force-pushed the feat/pluggable-voice-backend branch from 5578f22 to f5cbd0e on April 21, 2026 at 08:55

@github-actions (bot) left a comment


Findings

  • [Major] Voice can drop the first tap while backend discovery is still loading — VoiceBackendSession now renders null until GET /voice/backend completes, but SessionChat still enables the voice button whenever the provider exists. If the user taps immediately, startRealtimeSession() sees no registered session and returns without changing status, so the action is silently lost. Evidence web/src/realtime/VoiceBackendSession.tsx:35 and web/src/components/SessionChat.tsx:426.
    Suggested fix:
    const [voiceBackendReady, setVoiceBackendReady] = useState(false)
    
    <HappyComposer
        onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
        onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
    />
    
    <VoiceBackendSession
        api={props.api}
        micMuted={voice.micMuted}
        onStatusChange={voice.setStatus}
        onReadyChange={setVoiceBackendReady}
    />

Summary
Review mode: follow-up after new commits
One finding. The previously reported Qwen unauthenticated WebSocket proxy and raw DashScope key exposure are fixed on this head. Residual testing gap: no coverage for the async backend discovery/voice-button readiness path.

Testing

  • Not run (automation)

HAPI Bot

return () => { cancelled = true }
}, [props.api])

if (!backend) return null

[MAJOR] VoiceBackendSession now renders nothing until backend discovery finishes, but SessionChat still wires onVoiceToggle as soon as the voice provider exists. During that initial window no registerVoiceSession() has run, so a quick tap calls startRealtimeSession(), hits the “No voice session registered” branch, and silently drops the user's first start attempt.

Suggested fix:

const [voiceBackendReady, setVoiceBackendReady] = useState(false)

<HappyComposer
    onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined}
    onVoiceMicToggle={voice && voiceBackendReady ? handleVoiceMicToggle : undefined}
/>

<VoiceBackendSession
    api={props.api}
    micMuted={voice.micMuted}
    onStatusChange={voice.setStatus}
    onReadyChange={setVoiceBackendReady}
/>


Fixed in e32c1f6.

Added onReadyChange callback to VoiceBackendSession — it fires true after fetchVoiceBackend() resolves. SessionChat now gates onVoiceToggle / onVoiceMicToggle behind voiceBackendReady state, so the button stays disabled until registerVoiceSession() has completed.

VoiceBackendSession now fires onReadyChange(true) after backend
discovery completes. SessionChat disables the voice toggle until
ready, preventing silent drops when the user taps before
registerVoiceSession() has run.

@github-actions (bot) left a comment


Findings

  • [Blocker] Qwen default proxy URL is concatenated twice, so the backend cannot connect with the route contract returned by the hub. POST /api/voice/qwen-token now returns a full proxy URL ending in /api/voice/qwen-ws, but the client still appends /api/voice/qwen-ws again before opening the socket. Evidence web/src/realtime/QwenVoiceSession.tsx:106, web/src/realtime/QwenVoiceSession.tsx:109, hub/src/web/routes/voice.ts:167.
    Suggested fix:
    const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
    const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
  • [Major] The first voice tap can still be dropped for Gemini/Qwen because readiness flips before the lazy backend module has mounted and called registerVoiceSession(). VoiceBackendSession now enables the button immediately after backend discovery, but both non-ElevenLabs backends still sit behind React.lazy() with a null fallback. On a cold chunk load, startRealtimeSession() still hits the unregistered-session branch. Evidence web/src/realtime/VoiceBackendSession.tsx:9, web/src/realtime/VoiceBackendSession.tsx:30, web/src/realtime/VoiceBackendSession.tsx:46.
    Suggested fix:
    // only mark the backend ready after the selected session component has mounted
    // and finished registerVoiceSession(...)
    <GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />
  • [Major] The new hub route tests still assert the pre-proxy token contract instead of the responses this PR now returns. Gemini no longer exposes the raw key, and Qwen now returns only wsUrl, so these assertions are out of sync with the implementation and will fail or validate the wrong API surface. Evidence hub/src/web/routes/voice.test.ts:88, hub/src/web/routes/voice.test.ts:99, hub/src/web/routes/voice.test.ts:133, hub/src/web/routes/voice.test.ts:144.
    Suggested fix:
    const body = await res.json() as { allowed: boolean; wsUrl: string }
    expect(body.allowed).toBe(true)
    expect(body.wsUrl).toContain('/api/voice/qwen-ws')

Summary
Review mode: follow-up after new commits
Three findings. The prior backend-discovery gating issue is only partially fixed: lazy backends can still lose the first start click. Qwen also appears broken by default because the proxy path is appended twice. Test coverage is present, but the new route tests are still checking the old token shape.

Testing

  • Not run (automation)

HAPI Bot

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated
const proxyBase = state.wsBaseUrl || `${protocol}//${window.location.host}`
const model = QWEN_REALTIME_MODEL
const authToken = this.api.getAuthToken() || ''
const wsUrl = `${proxyBase}/api/voice/qwen-ws?model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`

[BLOCKER] fetchQwenToken() now returns a full proxy URL ending in /api/voice/qwen-ws, but this code appends /api/voice/qwen-ws again before opening the socket. With the default hub response the browser ends up dialing .../api/voice/qwen-ws/api/voice/qwen-ws?..., so Qwen never reaches session.created.

Suggested fix:

const proxyUrl = state.wsBaseUrl ?? `${protocol}//${window.location.host}/api/voice/qwen-ws`
const wsUrl = `${proxyUrl}${proxyUrl.includes('?') ? '&' : '?'}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`


Fixed in fbf315b.

wsBaseUrl from the hub now contains the full proxy path. The client uses it directly as proxyUrl and only appends query params (model= and token=), no more path duplication.
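The corrected URL construction can be sketched as follows (`buildProxyWsUrl` is an illustrative name; the behavior, appending only query parameters to a proxy URL that already contains the path, is the fix described above):

```typescript
// Sketch: buildProxyWsUrl is hypothetical. The key point from the fix above is
// that the proxy URL already ends in /api/voice/qwen-ws, so the client appends
// only query parameters, never the path a second time.
function buildProxyWsUrl(proxyUrl: string, model: string, token: string): string {
  const sep = proxyUrl.includes('?') ? '&' : '?'
  return `${proxyUrl}${sep}model=${encodeURIComponent(model)}&token=${encodeURIComponent(token)}`
}
```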

fetchVoiceBackend(props.api).then((resp) => {
if (!cancelled) {
setBackend(resp.backend)
props.onReadyChange?.(true)

[MAJOR] This still marks the voice backend ready before the lazy-loaded Gemini/Qwen component has mounted and called registerVoiceSession(). On a cold chunk load the button becomes clickable while startRealtimeSession() would still hit the "No voice session registered" branch, so the first tap can still be lost.

Suggested fix:

// only call onReadyChange(true) after the selected backend component mounts
// and completes registerVoiceSession(...)
<GeminiLiveVoiceSession {...props} onRegistered={() => props.onReadyChange?.(true)} />


Fixed in fbf315b.

onReadyChange(true) is no longer called from VoiceBackendSession after discovery. Instead, each backend component (Gemini/Qwen/ElevenLabs) now accepts an onRegistered callback and fires it after registerVoiceSession() succeeds inside their mount useEffect. VoiceBackendSession passes handleRegistered → props.onReadyChange(true) down to the selected child, so the button only enables after the lazy chunk has loaded, mounted, and completed registration.
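The registration-gated readiness can be sketched without React as a small state holder (`VoiceReadiness` is purely illustrative; only the register-then-enable ordering comes from the fix):

```typescript
// Sketch: VoiceReadiness is a hypothetical stand-in for the React state wiring.
// The invariant from the fix above: the voice button may only start a session
// after the backend component has completed registerVoiceSession().
class VoiceReadiness {
  private ready = false
  constructor(private onChange: (ready: boolean) => void) {}

  // Called by the mounted backend component once registration succeeds.
  markRegistered(): void {
    this.ready = true
    this.onChange(true)
  }

  // Guard checked before startRealtimeSession() would be invoked.
  canStart(): boolean {
    return this.ready
  }
}
```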

Comment thread hub/src/web/routes/voice.test.ts Outdated
const app = createApp()
const res = await app.request('/api/voice/gemini-token', { method: 'POST' })
expect(res.status).toBe(200)
const body = await res.json() as { allowed: boolean; apiKey: string }

[MAJOR] These assertions are still checking the old token contract. POST /api/voice/gemini-token now returns a proxied websocket contract, and POST /api/voice/qwen-token returns only wsUrl, so the new tests are out of sync with the implementation and will fail or validate the wrong API surface.

Suggested fix:

const body = await res.json() as { allowed: boolean; wsUrl: string }
expect(body.allowed).toBe(true)
expect(body.wsUrl).toContain('/api/voice/gemini-ws')


Fixed in fbf315b.

Test assertions updated to match the new proxy contract:

  • gemini-token: asserts apiKey === 'proxied' and wsUrl contains /api/voice/gemini-ws
  • qwen-token: asserts allowed === true, wsUrl contains /api/voice/qwen-ws, and apiKey property does not exist

All 10 tests pass.

…tered, update tests

- Fix BLOCKER: Qwen proxy URL was concatenated twice — wsBaseUrl from
  hub already contains /api/voice/qwen-ws, so the client no longer
  appends it again
- Fix MAJOR: onReadyChange now fires from onRegistered callback inside
  each backend component (after registerVoiceSession completes), not
  from VoiceBackendSession after discovery. This ensures lazy-loaded
  Gemini/Qwen chunks have fully mounted before the voice button enables
- Fix MAJOR: update voice.test.ts to assert new proxy contract —
  gemini-token returns { apiKey: 'proxied', wsUrl }, qwen-token returns
  { allowed, wsUrl } with no apiKey field

@github-actions (bot) left a comment


Findings

  • [Blocker] The new proxy token routes build websocket URLs from a hard-coded localhost:24888 fallback, so Gemini/Qwen voice breaks when HAPI_PUBLIC_URL is unset.
  • [Major] The primary AudioWorklet recording path never connects the worklet node to a live graph, so mic chunks are likely never emitted on browsers that successfully load the worklet.
  • [Major] Gemini tool calls are executed in parallel against shared mutable session state, which can duplicate or reorder side effects.

Summary
Review mode: follow-up after new commits
Three findings on the current head. The earlier readiness and Qwen proxy regressions are fixed, but the remaining issues still block or destabilize the new voice backends in common self-hosted flows.

Testing

  • Not run (automation)

HAPI Bot

Comment thread hub/src/web/routes/voice.ts Outdated

// Use server-side WS proxy to avoid region restrictions.
// The proxy at /api/voice/gemini-ws handles the API key server-side.
const publicUrl = process.env.HAPI_PUBLIC_URL || `http://localhost:${process.env.HAPI_LISTEN_PORT || '24888'}`

[BLOCKER] HAPI_PUBLIC_URL is optional here, but this fallback always emits an absolute ws://localhost:24888/... proxy URL. On the default/local config that makes remote browsers open their own localhost instead of the hub, so both new voice backends fail before the websocket upgrade. The same issue is duplicated in the Qwen branch below.

Suggested fix:

const requestOrigin = new URL(c.req.url).origin
const publicUrl = process.env.HAPI_PUBLIC_URL || requestOrigin
const wsProxyUrl = publicUrl.replace(/^http/, "ws") + "/api/voice/gemini-ws"


Fixed in c68366a.

Both /voice/gemini-token and /voice/qwen-token now derive wsUrl from new URL(c.req.url).origin instead of hard-coded localhost:24888. HAPI_PUBLIC_URL still overrides when set. Remote browsers will correctly connect back to the hub.

const base64 = arrayBufferToBase64(pcm16);
onChunk(base64);
};
this.sourceNode.connect(this.workletNode);

[MAJOR] On the main AudioWorklet path the recorder never connects the worklet node to any downstream sink. The fallback ScriptProcessorNode path does connect to destination, but this branch stops at source -> worklet, so the graph has nothing pulling frames and port.onmessage will not fire on browsers where the worklet loads successfully.

Suggested fix:

const sink = this.audioContext.createGain()
sink.gain.value = 0

this.sourceNode.connect(this.workletNode)
this.workletNode.connect(sink)
sink.connect(this.audioContext.destination)


Fixed in c68366a.

AudioWorklet node is now connected through a zero-gain GainNode to audioContext.destination: source → worklet → sink(gain=0) → destination. This gives the audio graph a downstream pull so port.onmessage fires correctly on all browsers.

Comment thread web/src/realtime/gemini/toolAdapter.ts Outdated
export async function handleGeminiFunctionCalls(
calls: GeminiFunctionCall[]
): Promise<GeminiFunctionResponse[]> {
return Promise.all(calls.map(handleGeminiFunctionCall))

[MAJOR] Promise.all(...) runs Gemini tool calls concurrently against shared session state. That is unsafe here because processPermissionRequest() resolves the current first pending request each time; two calls in one turn can race, act on the same request twice, or reorder side effects.

Suggested fix:

const responses: GeminiFunctionResponse[] = []
for (const call of calls) {
    responses.push(await handleGeminiFunctionCall(call))
}
return responses


Fixed in c68366a.

handleGeminiFunctionCalls now uses a sequential for...of loop instead of Promise.all. This ensures tool calls execute one at a time against shared session state, preventing duplicate processPermissionRequest resolutions.
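The same fix generalizes to a small sequential runner (`runSequentially` is illustrative, not code from the PR):

```typescript
// Sketch: a generic sequential runner illustrating the fix above. Each handler
// completes before the next begins, so calls cannot race on shared state.
async function runSequentially<T, R>(
  items: readonly T[],
  handler: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = []
  for (const item of items) {
    // The await here is the ordering guarantee that Promise.all(...) lacked.
    results.push(await handler(item))
  }
  return results
}
```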

…, serialize tool calls

- Fix BLOCKER: derive wsUrl from request origin instead of hard-coded
  localhost:24888 fallback, so remote browsers connect back to the hub.
  HAPI_PUBLIC_URL still overrides when set.
- Fix MAJOR: connect AudioWorklet node to a silent GainNode sink so the
  audio graph pulls frames and port.onmessage fires correctly.
- Fix MAJOR: replace Promise.all with sequential for-loop in
  handleGeminiFunctionCalls to prevent racing on shared session state
  (e.g. duplicate processPermissionRequest resolutions).

@github-actions (bot) left a comment


Findings

  • [Major] Playback AudioContext creation moved into the async websocket setup path, so Gemini/Qwen can connect successfully but stay silent on mobile browsers that require audio contexts to be created/resumed in the original user gesture. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:166, web/src/realtime/QwenVoiceSession.tsx:176, web/src/realtime/gemini/audioPlayer.ts:12.

Summary

  • Review mode: follow-up after new commits
  • One major regression remains on the current head: the new Gemini/Qwen backends still create playback audio contexts too late for mobile autoplay policies, so voice output can fail even when the websocket handshake succeeds.

Testing

  • Not run (automation)

HAPI Bot

state.statusCallback?.('connected')

// Start audio capture
startAudioCapture()

[MAJOR] startAudioCapture() runs only after the websocket handshake finishes, and GeminiAudioPlayer creates its own AudioContext there. On iOS/mobile Safari that is already outside the original click gesture, so the session connects but playback can remain suspended/silent. QwenVoiceSession hits the same helper, so both new backends are affected.

Suggested fix:

// still inside startSession(), before awaiting websocket setup
const playbackContext = new AudioContext({ sampleRate: 24000 })
await playbackContext.resume()

// later, after setup completes
state.player = new GeminiAudioPlayer(playbackContext)


Fixed in 759bf35.

Both Gemini Live and Qwen Realtime now create and resume() the playback AudioContext({ sampleRate: 24000 }) at the very start of startSession(), while still inside the user's click/tap gesture. The pre-created context is then passed to GeminiAudioPlayer(playbackContext) via startAudioCapture().

This satisfies mobile autoplay policies (iOS Safari, Chrome Android) that require AudioContext creation within a user activation event.

…utoplay

Create and resume the playback AudioContext at the start of
startSession(), while still inside the user's click/tap gesture.
Pass the pre-created context to GeminiAudioPlayer so mobile browsers
(iOS Safari) don't block audio output due to autoplay policy.

Applies to both Gemini Live and Qwen Realtime backends.

@github-actions (bot) left a comment


Findings

  • [Major] The follow-up autoplay fix now leaks the playback AudioContext for both Gemini and Qwen. startSession() creates a new context before the token/mic checks, but GeminiAudioPlayer.dispose() only closes contexts it created itself, so failed starts and normal stop/start cycles both leave orphaned contexts behind. On mobile browsers with low AudioContext limits, voice playback eventually stops working after a few retries. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:69, web/src/realtime/QwenVoiceSession.tsx:84, web/src/realtime/gemini/audioPlayer.ts:69.
    Suggested fix:
    try {
        state.playbackContext = new AudioContext({ sampleRate: 24000 })
        await state.playbackContext.resume()
        // ...rest of setup...
        startAudioCapture(state.playbackContext)
    } catch (error) {
        cleanup()
        throw error
    }
    
    function cleanup() {
        // existing cleanup...
        if (state.playbackContext && state.playbackContext.state !== 'closed') {
            void state.playbackContext.close()
            state.playbackContext = null
        }
    }

Summary

  • Review mode: follow-up after new commits
  • One major regression remains on the current head: the mobile autoplay fix now leaks playback audio contexts in both new backends, so repeated retries eventually break voice output.

Testing

  • Not run (automation environment does not have bun)

HAPI Bot


// Create playback AudioContext immediately while still inside the user
// gesture (click/tap). Mobile browsers require this for autoplay policy.
const playbackContext = new AudioContext({ sampleRate: 24000 })

[MAJOR] This new user-gesture AudioContext never gets closed. startAudioCapture() hands it to GeminiAudioPlayer, but GeminiAudioPlayer.dispose() only closes owned contexts (web/src/realtime/gemini/audioPlayer.ts:69), so both failed starts and normal stop/start cycles leak one playback context each. Qwen now has the same regression at web/src/realtime/QwenVoiceSession.tsx:84.

Suggested fix:

try {
    state.playbackContext = new AudioContext({ sampleRate: 24000 })
    await state.playbackContext.resume()
    // ...rest of setup...
} catch (error) {
    cleanup()
    throw error
}


Fixed in 74aa4c2.

Both Gemini and Qwen now store the playback AudioContext in module state (state.playbackContext). cleanup() closes it (void state.playbackContext.close()) alongside the recorder, player, and WebSocket. This covers:

  • Failed starts (token check, mic denial) — cleanup() at top of startSession()
  • Normal stop — endSession() calls cleanup()
  • Retries — next startSession() calls cleanup() first

No more orphaned contexts on repeated start/stop cycles.

Move playback AudioContext into module state so cleanup() can close it
on failed starts and normal stop/start cycles. Prevents orphaned
AudioContext leaks that would exhaust mobile browser limits after
repeated retries.
@github-actions Bot left a comment

Review mode: follow-up after new commits

Findings

  • [Major] Forced service-worker activation breaks already-open tabs after deploy. skipWaiting() + clientsClaim() immediately move existing sessions onto the new worker, but this PR also introduces lazy-loaded voice chunks. A tab that is still running the old app shell can then request an old chunk hash and fail to load voice after rollout. Evidence web/src/sw.ts:24, web/src/realtime/VoiceBackendSession.tsx:8.
    Suggested fix:
    // Keep the new worker in waiting state until the page reloads,
    // or trigger skipWaiting only after an explicit client prompt.
    precacheAndRoute(self.__WB_MANIFEST)
  • [Minor] Voice requests are now logged in production. The debug guard was removed around messageCodingAgent, so every dictated prompt and session id is written to the browser console. Evidence web/src/realtime/realtimeClientTools.ts:48.
    Suggested fix:
    if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
        console.log('[Voice] messageCodingAgent called with:', message)
        console.log('[Voice] Sending message to session:', sessionId)
    }

Summary

  • Review mode: follow-up after new commits
  • Two issues remain on the current head: one deploy-time regression in the service-worker update path, and one production logging regression for voice prompts.

Testing

  • Not run (bun is not installed in this automation environment). Existing tests also do not cover the service-worker upgrade path.

HAPI Bot

Comment thread web/src/sw.ts Outdated
}

// Activate new SW immediately without waiting for all tabs to close
self.addEventListener('install', () => { self.skipWaiting() })

[MAJOR] skipWaiting()/clientsClaim() force the new worker onto already-open tabs. Because this PR also lazy-loads the non-ElevenLabs voice backends, an existing page can still request old chunk hashes after deployment and hit a chunk-load failure mid-session.

Suggested fix:

// Leave the new worker waiting, or only call skipWaiting after the client opts in.
precacheAndRoute(self.__WB_MANIFEST)


Fixed in 296dc85.

Removed skipWaiting() and clientsClaim() from the service worker. New SW now stays in waiting state until all tabs close, preventing lazy-loaded chunk hash mismatches after deploy.

Comment thread web/src/realtime/realtimeClientTools.ts Outdated
console.log('[Voice] messageCodingAgent called with:', message)
console.log('[Voice] Sending message to session:', sessionId)
}
console.log('[Voice] messageCodingAgent called with:', message)

[MINOR] This now logs every dictated agent request and the session id in production. Voice prompts often contain repo details or other sensitive instructions, so the old debug guard should stay in place here.

Suggested fix:

if (VOICE_CONFIG.ENABLE_DEBUG_LOGGING) {
    console.log('[Voice] messageCodingAgent called with:', message)
    console.log('[Voice] Sending message to session:', sessionId)
}


Fixed in 296dc85.

Wrapped the messageCodingAgent console.log calls with VOICE_CONFIG.ENABLE_DEBUG_LOGGING guard, matching the existing pattern in processPermissionRequest.

- Remove skipWaiting + clientsClaim from service worker to prevent
  breaking lazy-loaded voice chunks in already-open tabs after deploy.
  New SW now waits for all tabs to close before activating.
- Wrap messageCodingAgent console.log calls with VOICE_CONFIG debug
  guard to stop logging user prompts and session IDs in production.
@github-actions Bot left a comment

Findings

  • [Major] Gemini voice startup can hang indefinitely after an early socket close. startSession() only rejects on ws.onerror before setupComplete; if the proxy or upstream closes cleanly during the handshake, ws.onclose only marks the session disconnected and never rejects the pending promise, so voice.startVoice() stays stuck in connecting. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:254
    Suggested fix:
    ws.onclose = (event) => {
        const message = event.reason || 'WebSocket closed before setup completed'
        cleanup()
        resetRealtimeSessionState()
        if (!setupDone) {
            state.statusCallback?.('error', message)
            reject(new Error(message))
            return
        }
        state.statusCallback?.('disconnected')
    }
  • [Major] Qwen voice has the same unresolved startup path, and the new error event handler also drops server-side setup failures on the floor. If DashScope returns an error event or closes before session.updated, the promise never settles and the UI remains stuck in connecting. Evidence web/src/realtime/QwenVoiceSession.tsx:260, web/src/realtime/QwenVoiceSession.tsx:275
    Suggested fix:
    if (eventType === 'error') {
        const message = err?.message || 'Realtime session setup failed'
        state.statusCallback?.('error', message)
        if (!sessionCreated) {
            reject(new Error(message))
            ws.close()
        }
        return
    }
    Add the same !sessionCreated rejection guard in ws.onclose.
  • [Minor] Gemini debug logging is forced on in production. const DEBUG = true makes the new backend log connection state, tool calls, and model text on every user session, which regresses from the existing dev/config-gated logging pattern. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:14
    Suggested fix:
    const DEBUG = import.meta.env.DEV

Summary

  • Review mode: follow-up after new commits
  • Three issues remain on the current head: both new WebSocket voice backends can leave the UI stuck in connecting when setup fails early, and Gemini debug logging is still forced on in production.

Testing

  • Not run (bun is not installed in this automation environment).
  • New tests cover route/PCM/tool-adapter helpers, but they do not exercise WebSocket handshake failure paths for Gemini/Qwen.

HAPI Bot

}
}

ws.onclose = (event) => {

[MAJOR] If the socket closes before setupComplete, startSession() never settles. ws.onerror rejects the promise, but this onclose path only flips the status to disconnected, so a clean handshake failure leaves the voice UI stuck in connecting.

Suggested fix:

ws.onclose = (event) => {
    const message = event.reason || 'WebSocket closed before setup completed'
    cleanup()
    resetRealtimeSessionState()
    if (!setupDone) {
        state.statusCallback?.('error', message)
        reject(new Error(message))
        return
    }
    state.statusCallback?.('disconnected')
}


Fixed in f108c0f.

Gemini ws.onclose now rejects the startup promise when !setupDone, with error status callback. Also changed DEBUG from true to import.meta.env.DEV.


Fixed in f108c0f (previous commit).

The onclose handler already rejects the promise when !setupDone, calls state.statusCallback?.('error', message), and returns early. This matches the suggested fix exactly.

}

// Error
if (eventType === 'error') {

[MAJOR] This drops server-side setup failures on the floor. If DashScope sends an error event before session.updated, the pending startSession() promise is neither resolved nor rejected, so the UI keeps spinning in connecting.

Suggested fix:

if (eventType === 'error') {
    const message = err?.message || 'Realtime session setup failed'
    state.statusCallback?.('error', message)
    if (!sessionCreated) {
        reject(new Error(message))
        ws.close()
    }
    return
}


Fixed in f108c0f.

Qwen error event handler now calls reject() + ws.close() when !sessionCreated. onclose also rejects the promise on early close, matching the Gemini pattern.


Fixed in 7111b67.

Renamed the guard flag from sessionCreated to sessionReady and now only set it to true after session.updated arrives (the point where resolve() is called). If DashScope sends session.created but fails before session.updated, the error/close handlers now correctly reject the promise instead of leaving it hanging.

Additionally, the Gemini proxy in hub now queues client messages (especially the setup frame) in a pending buffer while the upstream WebSocket is still CONNECTING, then flushes on open — preventing the silent frame drop that could also cause a startup hang.
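The proxy's queue-then-flush behavior can be sketched as pure forwarding logic (the UpstreamLike shape and readyState constants stand in for the real WebSocket in the hub):

```typescript
// Minimal stand-in for the upstream socket; values mirror
// WebSocket.CONNECTING / WebSocket.OPEN.
const CONNECTING = 0
const OPEN = 1

interface UpstreamLike {
  readyState: number
  sent: string[]
  send(msg: string): void
}

const upstream: UpstreamLike = {
  readyState: CONNECTING,
  sent: [],
  send(msg) { this.sent.push(msg) },
}

const pending: string[] = []

// Called for each client message: forward when open, queue while the
// upstream is still connecting (instead of silently dropping it).
function forward(msg: string): void {
  if (upstream.readyState === OPEN) {
    upstream.send(msg)
  } else if (upstream.readyState === CONNECTING) {
    pending.push(msg) // the setup frame lands here
  }
}

// Called from upstream.onopen: flush everything queued while CONNECTING.
function flush(): void {
  upstream.readyState = OPEN
  for (const queued of pending.splice(0)) upstream.send(queued)
}

forward('{"setup":{}}')        // arrives before the upstream is open
flush()                        // upstream opens; setup is delivered first
forward('{"realtimeInput":{}}')
```

Message order is preserved because the queue is drained before any later message can be forwarded directly.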

import type { Session } from '@/types/api'
import type { GeminiFunctionCall } from './gemini/toolAdapter'

const DEBUG = true

[MINOR] DEBUG is hardcoded to true, so this backend now logs connection state, tool calls, and model text in production. The existing voice code gates these logs behind dev/config flags; this new backend should do the same.

Suggested fix:

const DEBUG = import.meta.env.DEV

- Gemini: reject the startup promise in onclose when setup hasn't
  completed, preventing the UI from hanging in 'connecting' state
- Qwen: reject on both server 'error' events and early onclose during
  handshake, with proper cleanup and status callback
- Change Gemini DEBUG from hardcoded true to import.meta.env.DEV to
  stop logging connection state and tool calls in production
@github-actions Bot left a comment

Findings

  • [Major] Gemini proxy can drop the setup frame before the upstream socket opens — the browser sends setup as soon as the hub WebSocket opens, but the proxy forwards client messages only when the Google upstream is already OPEN; messages received during upstream CONNECTING are discarded. That leaves Gemini Live waiting forever for setupComplete. Evidence hub/src/web/server.ts:61 and client send at web/src/realtime/GeminiLiveVoiceSession.tsx:148.
    Suggested fix:
    const pending: Array<string | ArrayBuffer | Uint8Array> = []
    upstream.onopen = () => {
        for (const queued of pending.splice(0)) upstream.send(queued)
    }
    // in message(): queue until upstream opens instead of dropping setup
    if (upstream?.readyState === WebSocket.OPEN) upstream.send(message)
    else if (upstream?.readyState === WebSocket.CONNECTING) pending.push(message)
  • [Major] Qwen startup still hangs when setup fails after session.createdstartSession() only rejects while !sessionCreated, but the promise does not resolve until session.updated. If DashScope accepts the socket, sends session.created, then rejects the session.update or closes before session.updated, this code reports an error but never settles the pending start promise. Evidence web/src/realtime/QwenVoiceSession.tsx:265 and web/src/realtime/QwenVoiceSession.tsx:285.
    Suggested fix:
    let sessionReady = false
    if (eventType === 'session.updated') {
        sessionReady = true
        state.statusCallback?.('connected')
        startAudioCapture(state.playbackContext!)
        resolve()
        return
    }
    if (eventType === 'error' && !sessionReady) {
        reject(new Error(message))
        ws.close()
        return
    }
    // onclose: reject when !sessionReady, not only when !sessionCreated

Summary

  • Review mode: follow-up after new commits
  • Two startup reliability issues remain in the current head: Gemini proxy can discard the initial setup message, and Qwen can still leave the UI stuck in connecting on setup failures after session.created.

Testing

  • Not run (automation: bun is not installed in this runner).
  • Missing coverage: WebSocket proxy backpressure/queueing for Gemini setup and Qwen failure paths between session.created and session.updated.

HAPI Bot

Comment thread hub/src/web/server.ts
},
message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
const upstream = upstreamMap.get(clientWs)
if (upstream?.readyState === WebSocket.OPEN) {

[MAJOR] Client messages are dropped while the upstream Gemini socket is still connecting. The browser sends the required setup frame immediately after the hub WebSocket opens, but this proxy only forwards when upstream.readyState === WebSocket.OPEN; if that frame arrives first, Google never receives setup and the client waits forever for setupComplete.

Suggested fix:

const pending: Array<string | ArrayBuffer | Uint8Array> = []
upstream.onopen = () => {
    for (const queued of pending.splice(0)) upstream.send(queued)
}

if (upstream?.readyState === WebSocket.OPEN) {
    upstream.send(message)
} else if (upstream?.readyState === WebSocket.CONNECTING) {
    pending.push(message)
}

Comment thread web/src/realtime/QwenVoiceSession.tsx Outdated
const message = err?.message || 'Realtime session setup failed'
console.error('[Qwen] Server error:', message)
state.statusCallback?.('error', message)
if (!sessionCreated) {

[MAJOR] This still leaves startup unresolved when setup fails after session.created. The promise resolves only on session.updated, but errors after sessionCreated = true skip reject(), so an invalid session.update response leaves the voice UI stuck in connecting.

Suggested fix:

let sessionReady = false

if (eventType === 'session.updated') {
    sessionReady = true
    state.statusCallback?.('connected')
    startAudioCapture(state.playbackContext!)
    resolve()
    return
}

if (eventType === 'error' && !sessionReady) {
    reject(new Error(message))
    ws.close()
    return
}

…tup hang

Gemini proxy: buffer client messages (especially the setup frame) in a
pending queue while the upstream WebSocket is still CONNECTING, then
flush on open. Previously the setup frame was silently dropped.

Qwen: rename sessionCreated → sessionReady and only set it after
session.updated arrives. If the server fails between session.created
and session.updated the promise now correctly rejects instead of
hanging forever.
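The settle-once guard described in this commit can be sketched in isolation. Event names follow the DashScope-style events mentioned in the thread; the handler shape and createStartup helper are illustrative, not the actual implementation:

```typescript
// Sketch of the startup guard: resolve only on session.updated, and
// reject on an error or close at any point before that, even after
// session.created has already been seen.
function createStartup() {
  let sessionReady = false
  let settled = false
  let result = ''

  const resolve = () => { if (!settled) { settled = true; result = 'resolved' } }
  const reject = (m: string) => { if (!settled) { settled = true; result = 'rejected:' + m } }

  function onEvent(eventType: string, message = 'setup failed'): void {
    if (eventType === 'session.updated') {
      sessionReady = true
      resolve() // connected: safe to start audio capture
      return
    }
    // Errors or an early close before session.updated must reject.
    if ((eventType === 'error' || eventType === 'close') && !sessionReady) {
      reject(message)
    }
  }

  return { onEvent, get result() { return result } }
}

// Failure between session.created and session.updated now settles
// the promise instead of leaving the UI stuck in 'connecting'.
const startup = createStartup()
startup.onEvent('session.created')
startup.onEvent('error', 'invalid session.update')
```

With the old sessionCreated flag, the second event would have been swallowed and `result` would stay empty forever.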
@github-actions Bot left a comment

Findings

  • [Major] Proxy/upstream URL reuse leaks the hub JWT and bypasses the server-side proxy whenever GEMINI_LIVE_WS_URL or QWEN_REALTIME_WS_URL is set. /voice/*-token returns those env values to the browser, and both clients append token=<hub-jwt> before connecting, while the hub also reuses the same env vars for its upstream socket target. Evidence hub/src/web/routes/voice.ts:154, hub/src/web/routes/voice.ts:176, hub/src/web/server.ts:37, hub/src/web/server.ts:98, web/src/realtime/GeminiLiveVoiceSession.tsx:111, web/src/realtime/QwenVoiceSession.tsx:123.
    Suggested fix:
    const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
    const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'
    
    return c.json({
        allowed: true,
        apiKey: 'proxied',
        wsUrl: geminiClientWsUrl
    })
    Apply the same split for Qwen, and keep the browser-facing wsUrl pinned to /api/voice/*-ws.
  • [Major] The shared Chinese prompt/first-message change is not additive: it changes the existing ElevenLabs assistant globally and bypasses the user language preference that the web app still passes only as overrides.agent.language. Even when the user selects English, the base prompt now instructs “Always respond in Chinese” and the first message is hard-coded Chinese. Evidence shared/src/voice.ts:145, shared/src/voice.ts:157, shared/src/voice.ts:242, web/src/lib/voice-context.tsx:43, web/src/realtime/RealtimeVoiceSession.tsx:74.
    Suggested fix:
    const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
    const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'
    
    export function buildVoiceAgentConfig(): VoiceAgentConfig {
        return {
            conversation_config: {
                agent: {
                    first_message: ELEVENLABS_FIRST_MESSAGE,
                    language: 'en',
                    prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
                }
            }
        }
    }
    If Gemini/Qwen should default to Chinese, give those backends their own prompt constants instead of reusing the ElevenLabs config.

Summary

  • Review mode: follow-up after new commits
  • Two issues remain in the current head: the new *_WS_URL env vars act as both upstream and browser-facing URLs, which can leak the hub JWT outside the proxy path, and the shared prompt change silently regresses the existing ElevenLabs language behavior.

Testing

  • Not run (review constraints: static analysis only; no PR code execution).
  • Missing coverage: custom GEMINI_LIVE_WS_URL/QWEN_REALTIME_WS_URL behavior, and preserving ElevenLabs language selection after the shared prompt refactor.

HAPI Bot

Comment thread hub/src/web/routes/voice.ts Outdated
return c.json({
allowed: true,
apiKey: 'proxied', // Dummy — key is handled server-side
wsUrl: process.env.GEMINI_LIVE_WS_URL || wsProxyUrl,

[MAJOR] GEMINI_LIVE_WS_URL is being reused as both the upstream target and the browser-facing wsUrl. If an operator sets it to a direct/custom upstream URL, /voice/gemini-token returns that URL to the browser and the client appends token=<hub-jwt> before connecting (web/src/realtime/GeminiLiveVoiceSession.tsx:111). That bypasses /api/voice/gemini-ws and leaks the hub JWT to the upstream service. Qwen has the same pattern at /api/voice/qwen-token.

Suggested fix:

const geminiUpstreamWsUrl = process.env.GEMINI_LIVE_UPSTREAM_WS_URL || GEMINI_WS_BASE
const geminiClientWsUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'

return c.json({
    allowed: true,
    apiKey: 'proxied',
    wsUrl: geminiClientWsUrl
})

Comment thread shared/src/voice.ts

# Language

IMPORTANT: Always respond in Chinese (Mandarin). Use natural spoken Chinese.

[MAJOR] This is now changing the existing ElevenLabs assistant globally, not just the new backends. The web app still only overrides agent.language (web/src/realtime/RealtimeVoiceSession.tsx:74), so users who selected English now get a base prompt and first message that explicitly force Chinese. That breaks the existing language preference path and contradicts the additive/no-breakage goal.

Suggested fix:

const ELEVENLABS_PROMPT = /* existing neutral/English prompt */
const ELEVENLABS_FIRST_MESSAGE = 'Hey! Hapi here.'

export function buildVoiceAgentConfig(): VoiceAgentConfig {
    return {
        conversation_config: {
            agent: {
                first_message: ELEVENLABS_FIRST_MESSAGE,
                language: 'en',
                prompt: { prompt: ELEVENLABS_PROMPT, /* ... */ }
            }
        }
    }
}

…from Gemini/Qwen

Security: hub token endpoints now always return the proxy URL
(/api/voice/*-ws) regardless of GEMINI_LIVE_WS_URL or
QWEN_REALTIME_WS_URL env vars. Those env vars are upstream-only and
never exposed to the browser, preventing JWT leakage to external
endpoints.

Language: extracted the Chinese language block from VOICE_SYSTEM_PROMPT
into VOICE_CHINESE_LANGUAGE_BLOCK, appended only by Gemini and Qwen
backends. ElevenLabs keeps a neutral English base prompt and controls
language via its own language field, preserving user language preference.
@TennyDDDD

Both issues fixed in 5c60488.

1. WS URL leak — Token endpoints now always return the hub proxy URL (/api/voice/*-ws). GEMINI_LIVE_WS_URL and QWEN_REALTIME_WS_URL env vars are only used server-side as upstream targets and are never sent to the browser. The hub JWT stays within the proxy path.

2. ElevenLabs language regression — Extracted the Chinese language block from VOICE_SYSTEM_PROMPT into a separate VOICE_CHINESE_LANGUAGE_BLOCK constant, which is appended only by Gemini (buildGeminiLiveConfig) and Qwen (QwenVoiceSession). ElevenLabs uses the neutral base prompt with language: 'en' and first message in English, preserving user language preference via its own language field override.

@github-actions Bot left a comment

Findings

  • [Major] Initial mic mute is lost in the new Gemini/Qwen backends — micMuted can already be true before a session starts, but both components only apply it in an effect that runs while state.recorder is still null. startAudioCapture() then creates a live recorder and never reapplies the existing mute state, so the UI can show muted while audio is still captured until the user toggles again. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:380, web/src/realtime/QwenVoiceSession.tsx:397, related context web/src/lib/voice-context.tsx:23.
    Suggested fix:
    const micMutedRef = useRef(micMuted)
    
    useEffect(() => {
        micMutedRef.current = micMuted
        state.recorder?.setMuted(micMuted)
    }, [micMuted])
    
    void state.recorder.start(...).then(() => {
        state.recorder?.setMuted(micMutedRef.current)
    })
  • [Major] Composer Enter-to-send was flipped for the whole web app — plain Enter now inserts a newline and only Ctrl/Cmd+Enter sends. That changes the existing chat workflow for every keyboard user and is unrelated to the voice backend feature. Evidence web/src/components/AssistantChat/HappyComposer.tsx:314.
    Suggested fix:
    if (key === 'Enter' && e.shiftKey) {
        return
    }
    
    if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
        e.preventDefault()
        api.composer().send()
        setShowContinueHint(false)
        return
    }

Summary

  • Review mode: follow-up after new commits
  • Two issues remain in the current head: the new Gemini/Qwen sessions can capture audio while the UI still shows the mic as muted, and the composer key handling now regresses the existing Enter-to-send behavior.

Testing

  • Not run (automation; static review only)
  • Missing coverage: starting Gemini/Qwen while micMuted=true, and keyboard composer behavior for Enter, Shift+Enter, and Ctrl/Cmd+Enter

HAPI Bot

}, [api]) // eslint-disable-line react-hooks/exhaustive-deps

// Sync mic mute state
useEffect(() => {

[MAJOR] micMuted is only applied here when the prop changes after state.recorder already exists. If the user stopped voice while muted and starts a new Gemini session, micMuted is already true, this effect runs before startAudioCapture() creates the recorder, and the initial mute never gets reapplied. The UI stays muted while the recorder still captures audio.

Suggested fix:

const micMutedRef = useRef(micMuted)

useEffect(() => {
    micMutedRef.current = micMuted
    state.recorder?.setMuted(micMuted)
}, [micMuted])

void state.recorder.start(...).then(() => {
    state.recorder?.setMuted(micMutedRef.current)
})

}
}, [api]) // eslint-disable-line react-hooks/exhaustive-deps

useEffect(() => {

[MAJOR] Same initial-mute regression as Gemini: this effect only updates an existing recorder. When micMuted is already true before the Qwen session starts, startAudioCapture() creates a fresh recorder after this effect has already run, so audio capture begins unmuted until the user toggles again.

Suggested fix:

const micMutedRef = useRef(micMuted)

useEffect(() => {
    micMutedRef.current = micMuted
    state.recorder?.setMuted(micMuted)
}, [micMuted])

void state.recorder.start(...).then(() => {
    state.recorder?.setMuted(micMutedRef.current)
})


// Only plain Enter (no modifiers) sends; other modifier combos are ignored
if (key === 'Enter') {
// Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message

[MAJOR] This flips the existing composer shortcut from plain Enter to Ctrl/Cmd+Enter for every chat session. The previous behavior already supported multi-line input with Shift+Enter, so this changes a core keyboard workflow without anything in the voice feature requiring it.

Suggested fix:

if (key === 'Enter' && e.shiftKey) {
    return
}

if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey) {
    e.preventDefault()
    if (canSend) {
        api.composer().send()
        setShowContinueHint(false)
    }
    return
}

Both Gemini and Qwen backends now persist micMuted to module-level
state and reapply it immediately after startAudioCapture() creates the
recorder. Previously the React mute effect could run while the recorder
was still null, causing audio to be captured even when the UI showed
the mic as muted.
@TennyDDDD

Re: review at 2026-04-22T03:35:05Z

1. Mic mute initial state — Fixed in aa9802d.

Both Gemini and Qwen now persist micMuted to module-level state via the React effect, and startAudioCapture() applies it immediately after creating the recorder. This covers the case where micMuted was already true before the session started.

2. Composer Enter → Ctrl+Enter — This is an intentional change requested separately from the voice feature. The previous Enter-to-send conflicted with multi-line input needs. The new behavior:

  • Ctrl/Cmd+Enter → send
  • Enter → newline
  • Suggestion selection still uses plain Enter

This is a deliberate UX improvement, not a regression.
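The mute-persistence fix from point 1 can be sketched as plain module-state logic (the RecorderLike shape and function names here are illustrative):

```typescript
// Sketch of the module-state mute pattern: the desired flag lives
// outside React, and is applied both when the prop changes and right
// after the recorder is created.
interface RecorderLike { muted: boolean; setMuted(m: boolean): void }

const state: { micMuted: boolean; recorder: RecorderLike | null } = {
  micMuted: false,
  recorder: null,
}

// React effect body: persist the flag; apply it if a recorder exists.
function onMicMutedChange(micMuted: boolean): void {
  state.micMuted = micMuted
  state.recorder?.setMuted(micMuted)
}

// startAudioCapture(): reapply the persisted flag to the new recorder,
// covering the mute-before-start case where the effect ran while
// state.recorder was still null.
function startAudioCapture(): void {
  const recorder: RecorderLike = {
    muted: false,
    setMuted(m) { this.muted = m },
  }
  state.recorder = recorder
  recorder.setMuted(state.micMuted)
}

onMicMutedChange(true) // user muted before the session started
startAudioCapture()    // new recorder must come up muted
```

Without the reapply step in startAudioCapture(), the recorder would start unmuted even though the UI shows the mic as muted.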

@github-actions Bot left a comment

Findings

  • [Major] Gemini can silently unmute the user after any model response — turnComplete always calls state.recorder?.setMuted(false), so a user who intentionally muted the mic gets re-exposed as soon as the first reply finishes. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:217, related state web/src/realtime/GeminiLiveVoiceSession.tsx:388.
    Suggested fix:
    function applyRecorderMute(): void {
        state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
    }
    
    if (!state.modelSpeaking) {
        state.modelSpeaking = true
        applyRecorderMute()
    }
    
    if (serverContent.turnComplete) {
        state.modelSpeaking = false
        applyRecorderMute()
    }
  • [Major] Failed starts leak the newly created audio context in both new backends — each startSession() opens state.playbackContext before token/mic validation, but the early throw paths never call cleanup(). Repeated failures leave contexts alive and can block later starts once the browser hits its context cap. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:78, web/src/realtime/GeminiLiveVoiceSession.tsx:89, same pattern web/src/realtime/QwenVoiceSession.tsx:94, web/src/realtime/QwenVoiceSession.tsx:102.
    Suggested fix:
    async startSession(config: VoiceSessionConfig): Promise<void> {
        cleanup()
        try {
            state.playbackContext = new AudioContext({ sampleRate: 24000 })
            await state.playbackContext.resume()
            // existing setup...
        } catch (error) {
            cleanup()
            throw error
        }
    }
  • [Major] Composer Enter-to-send is still inverted for the whole chat UI — the current handler only sends on Ctrl/Cmd+Enter, while plain Enter now inserts a newline. That changes a core workflow unrelated to the voice feature. Evidence web/src/components/AssistantChat/HappyComposer.tsx:314.
    Suggested fix:
    if (key === 'Enter' && e.shiftKey) {
        return
    }
    
    if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
        e.preventDefault()
        api.composer().send()
        setShowContinueHint(false)
        return
    }

Summary

  • Review mode: follow-up after new commits
  • Three issues remain at the current head: Gemini still overrides a user mute after model speech, failed Gemini/Qwen starts leak audio contexts, and the composer shortcut regression is still present.

Testing

  • Not run (automation; static review only)
  • Missing coverage: manual mute persisted across Gemini responses, repeated failed starts for Gemini/Qwen, and Enter vs Shift+Enter vs Ctrl/Cmd+Enter composer behavior

HAPI Bot

console.log('[GeminiLive] Turn complete')
// Model done — unmute mic for next user turn
state.modelSpeaking = false
state.recorder?.setMuted(false)


[MAJOR] turnComplete always unmutes the recorder, so a user who intentionally muted the mic gets re-exposed as soon as the first Gemini reply finishes.

Suggested fix:

function applyRecorderMute(): void {
    state.recorder?.setMuted(state.micMuted || state.modelSpeaking)
}

if (serverContent.turnComplete) {
    state.modelSpeaking = false
    applyRecorderMute()
}

// Create playback AudioContext immediately while still inside the user
// gesture (click/tap). Mobile browsers require this for autoplay policy.
// Store in state so cleanup() can close it on failure or stop.
state.playbackContext = new AudioContext({ sampleRate: 24000 })


[MAJOR] This AudioContext is created before token/mic validation, but the early throw paths above never call cleanup(). The same pattern exists in QwenVoiceSession. A few failed starts can leave enough contexts alive to block later voice sessions.

Suggested fix:

async startSession(config: VoiceSessionConfig): Promise<void> {
    cleanup()
    try {
        state.playbackContext = new AudioContext({ sampleRate: 24000 })
        await state.playbackContext.resume()
        // existing setup...
    } catch (error) {
        cleanup()
        throw error
    }
}


if (key === 'Enter') {
// Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message


[MAJOR] This still flips the app-wide composer shortcut from plain Enter to Ctrl/Cmd+Enter. The voice backend work does not require changing the default send behavior.

Suggested fix:

if (key === 'Enter' && e.shiftKey) {
    return
}

if (key === 'Enter' && !e.ctrlKey && !e.altKey && !e.metaKey && canSend) {
    e.preventDefault()
    api.composer().send()
    setShowContinueHint(false)
    return
}

@Overbaker (Author)


[Translated from Chinese] With newlines in the picture, Cmd+Enter makes more sense.
