diff --git a/.claude/plan/hapi-web-loading-fix.md b/.claude/plan/hapi-web-loading-fix.md
new file mode 100644
index 0000000000..9cf5da8939
--- /dev/null
+++ b/.claude/plan/hapi-web-loading-fix.md
@@ -0,0 +1,95 @@
+# 📋 Implementation Plan: Fix Hapi Web Loading Failure + Voice Backend
+
+## Diagnosis
+
+### Root-Cause Analysis
+
+| Problem | Root cause | Severity |
+|------|------|--------|
+| Web version fails to load | Hub process was never restarted, so it runs with stale env vars + the Service Worker serves cached old assets | Critical |
+| Changing the voice option breaks things | Hub does not hot-reload `~/.hapi/env` after edits; a restart is required | Critical |
+| Is the database split by version? | **There is only one database**, `~/.hapi/hapi.db`, with no dev/prod separation — ruled out | ✅ Ruled out |
+
+### Key Evidence
+
+1. **Hub process**: PID 44317, started **4/3 16:09**
+2. **env file**: last modified **4/5 06:13** (2 days after the Hub started)
+3. **Environment variables out of sync**:
+   - `~/.hapi/env` sets `VOICE_BACKEND=gemini-live`
+   - The running Hub actually returns `{"backend":"qwen-realtime"}` (the Hub process has no `VOICE_BACKEND` in its process.env, so it falls back to `DEFAULT_VOICE_BACKEND = 'qwen-realtime'`)
+4. **Web static files**: all assets return 200; HTML/JS/CSS are reachable
+5. **Database**: single SQLite file `~/.hapi/hapi.db`, schema v6, WAL mode healthy
+
+### Updated User Requirement
+
+The user explicitly wants **Gemini TTS**, so `VOICE_BACKEND` must be set to `gemini-live`.
+
+---
+
+## Task Type
+- [x] Backend (→ Hub restart + env fix)
+- [x] Frontend (→ Service Worker cleanup + verify the Gemini Live component works)
+
+## Technical Approach
+
+**Core fix**: restart the Hub process so it loads the latest `~/.hapi/env` environment variables.
+
+**Secondary fix**: clean stale build artifacts out of `web/dist` so the Service Worker cannot keep serving expired assets.
+
+---
+
+## Implementation Steps
+
+### Step 1: Verify and fix the env configuration
+- File: `/home/ubuntu/.hapi/env`
+- Ensure `VOICE_BACKEND=gemini-live` (the user wants Gemini TTS)
+- Ensure `GEMINI_API_KEY` is configured
+- Expected result: env file ready
+
+### Step 2: Clean the web build artifacts
+- Delete `/home/ubuntu/hapi/web/dist/` and rebuild
+- Command: `cd /home/ubuntu/hapi/web && rm -rf dist && bun run build`
+- Expected result: a clean `web/dist/` directory
+
+### Step 3: Restart the Hub process
+- Stop the current Hub (PID 44317)
+- Start the Hub again so it reads the latest env
+- Command: `hapi runner restart`, or manual kill + start
+- Expected result: Hub running with the new env
+
+### Step 4: Verify the fix
+- Call `GET /api/voice/backend` and confirm it returns `gemini-live`
+- Open `https://ccg.aimo3d.org/` and confirm the page loads normally
+- Test the Gemini Live voice feature
+- Expected result: web loads normally + voice backend is Gemini
+
+### Step 5: (Optional) Client-side Service Worker cleanup
+- If the user's browser still shows stale content:
+  - Clear the browser's Service Worker cache
+  - Or force-refresh (Ctrl+Shift+R)
+- `sw.ts` already uses `skipWaiting + clientsClaim`, so it should update automatically after the rebuild
+
+---
+
+## Key Files
+
+| File | Action | Notes |
+|------|------|------|
+| `~/.hapi/env` | Verify | VOICE_BACKEND=gemini-live |
+| `web/dist/` | Rebuild | Remove stale build artifacts |
+| Hub process (PID 44317) | Restart | Load the latest env |
+| `shared/src/voice.ts:272` | No change | DEFAULT_VOICE_BACKEND is only a fallback |
+| `hub/src/web/routes/voice.ts:122-128` | No change | Logic is correct; the env just has to take effect |
+| `~/.hapi/hapi.db` | None | The only database; no changes needed |
+
+## Risks and Mitigations
+
+| Risk | Mitigation |
+|------|----------|
+| Restarting the Hub interrupts active Claude sessions | Sessions can be resumed with `--resume` |
+| The Gemini API key may be invalid/expired | Step 4 verifies the token endpoint |
+| Browser SW cache not updated | skipWaiting mechanism + manual cache-clear instructions |
+
+## SESSION_ID (for /ccg:execute)
+- CODEX_SESSION: N/A (diagnostic task; not invoked)
+- GEMINI_SESSION: N/A (diagnostic task; not invoked)
diff --git a/.claude/team-plan/pluggable-voice-backend.md b/.claude/team-plan/pluggable-voice-backend.md
new file mode 100644
index 0000000000..52e203ab83
--- /dev/null
+++ b/.claude/team-plan/pluggable-voice-backend.md
@@ -0,0 +1,394 @@
+# Team Plan: Pluggable Voice Backend (ElevenLabs + Gemini Live)
+
+## Overview
+
+Refactor the Hapi voice assistant into a pluggable architecture (Strategy Pattern) supporting ElevenLabs ConvAI (default) and Gemini Live API backends, switchable via the `VOICE_BACKEND` env var. Minimize upstream file changes to reduce git pull conflicts.
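+
+As a quick orientation, here is a minimal sketch of the Strategy seam this plan builds on (editor's illustration — `makeElevenLabsSession`/`makeGeminiLiveSession` are hypothetical stand-ins for the concrete implementations; `fetchVoiceBackend` is the hub discovery call introduced in Tasks 2/3):
+
+```ts
+import type { ApiClient } from '@/api/client'
+
+// Mirrors the existing `VoiceSession` interface — the Strategy seam.
+interface VoiceSession {
+  startSession(config: { initialContext?: string }): Promise<void>
+  endSession(): Promise<void>
+  sendTextMessage(message: string): void
+  sendContextualUpdate(update: string): void
+}
+
+// Hypothetical factories for the two concrete strategies.
+declare const makeElevenLabsSession: (api: ApiClient) => VoiceSession
+declare const makeGeminiLiveSession: (api: ApiClient) => VoiceSession
+
+// Backend discovery happens at runtime via the hub, not at Vite build time.
+async function selectVoiceSession(api: ApiClient): Promise<VoiceSession> {
+  const { backend } = await api.fetchVoiceBackend() // GET /api/voice/backend
+  return backend === 'gemini-live' ? makeGeminiLiveSession(api) : makeElevenLabsSession(api)
+}
+```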
+ +## Codex Analysis Summary + +- Existing `VoiceSession` interface + `registerVoiceSession()` is already a Strategy injection point +- `VOICE_BACKEND` should be resolved at runtime via hub API (not Vite env), since web frontend has no runtime env mechanism +- `sendContextualUpdate` has no Gemini Live equivalent; must approximate via `send_realtime_input` for incremental updates, `send_client_content` for initial context seeding +- Ephemeral tokens use `v1alpha` endpoint; regular API key uses `v1beta` — hub must handle this divergence +- Tool calling in Gemini Live requires synchronous `sendToolResponse`; existing `processPermissionRequest` involves async network calls — keep responses short +- Hidden coupling: `VoiceSessionConfig.language` is typed as `ElevenLabsLanguage` (types.ts:1) +- Settings page language list is ElevenLabs-specific (functions named `getElevenLabsSupportedLanguages`) + +## Gemini Analysis Summary + +- Proposed transparent proxy component pattern: `RealtimeVoiceSession` becomes a switcher +- Audio pipeline: capture via `getUserMedia` + `AudioWorkletNode` for 16kHz downsampling → PCM16 → base64 → WebSocket; playback via `AudioContext(24000)` with scheduled buffer queue +- Tool adapter needed: `getFunctionDeclarations()` maps existing client tools to Gemini format, `handleToolCall()` bridges execution +- Client VAD + server VAD hybrid for barge-in: clear playback queue immediately on interruption +- Settings page needs conditional rendering based on active backend +- No changes needed to `SessionChat.tsx`, `ComposerButtons.tsx`, `HappyThread.tsx` — they only consume abstract `useVoice()` status + +## Functional Review Findings (v2) + +### Critical Gaps Fixed +- C1: AudioWorklet processor file was missing → added to Task 4 +- C2: Token expiry/reconnect not handled → added to Task 2 + Task 6 +- C3: Session switching routes tool calls to wrong session → auto-stop on session switch +- C4: Voice component lifecycle / unmount cleanup → unified dispose() path in Task 6 + +### High Gaps Fixed +- H1: Mobile AudioContext blocked by autoplay → AudioContext created in user gesture handler +- H2: GEMINI_API_KEY missing behavior undefined → explicit error contract +- H3: Tool calling multi-call/timeout → serial execution + per-call timeout +- H4: React Strict Mode double-mount → useEffect cleanup +- H5: Voice button available before backend loads → voiceReady state gating + +### Medium Gaps (addressed in Task 8/9) +- M1: Bundle size → React.lazy() dynamic import +- M2: Settings page not adapted → conditional rendering +- M3: No tests → added Task 8 +- M4: No docs update → added Task 9 + +## Technical Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Backend discovery | Hub runtime API (`GET /voice/backend`) | Web has no runtime env; avoids Vite rebuild to switch | +| Wrapper location | New `VoiceBackendSession.tsx` | Original `RealtimeVoiceSession.tsx` untouched = zero upstream conflict | +| Audio processing | Separate `gemini/` subdirectory | Isolate complexity; testable independently | +| Tool bridge | Adapter in `gemini/toolAdapter.ts` | Reuse existing `realtimeClientTools` without modification | +| Language type | Keep `ElevenLabsLanguage` for now | Gemini ignores language pref initially; refactor later to avoid upstream diff | +| Token flow | Hub creates ephemeral token for both backends | Never expose long-lived API keys to browser | +| Session switch | Auto-stop voice on session change | Prevents tool calls routing to wrong session | +| 
Gemini code loading | React.lazy() dynamic import | Zero bundle impact when using ElevenLabs | +| AudioContext creation | Synchronous in user gesture handler | Required for iOS/Android autoplay policy | + +## Task List + +### Task 1: Shared Voice Config Extension +- **Type**: Backend (shared) +- **File scope**: + - `shared/src/voice.ts` (modify — append new exports) +- **Dependencies**: None +- **Implementation steps**: + 1. Add `VoiceBackendType = 'elevenlabs' | 'gemini-live'` and `DEFAULT_VOICE_BACKEND = 'elevenlabs'` + 2. Add `GEMINI_LIVE_MODEL = 'gemini-3.1-flash-live-preview'` constant + 3. Extract `VOICE_TOOL_DEFINITIONS` from existing `VOICE_TOOLS` — neutral format, single source of truth + 4. Add `buildGeminiLiveFunctionDeclarations()` — converts `VOICE_TOOL_DEFINITIONS` to Gemini `{ name, description, parameters }` schema format + 5. Add `buildGeminiLiveConfig()` — returns `{ model, systemInstruction: VOICE_SYSTEM_PROMPT, tools: [{ functionDeclarations }], responseModalities: ['AUDIO'] }` for `ai.live.connect()` + 6. Keep `buildVoiceAgentConfig()` untouched for ElevenLabs +- **Acceptance**: Both config builders produce valid configs; existing ElevenLabs flow unaffected; `VOICE_TOOL_DEFINITIONS` is the single source for both backends + +### Task 2: Hub Backend Discovery + Token Route +- **Type**: Backend (hub) +- **File scope**: + - `hub/src/web/routes/voice.ts` (modify — add routes, refactor handler) + - `hub/package.json` (modify — add `@google/genai`) +- **Dependencies**: Task 1 +- **Implementation steps**: + 1. Add `resolveVoiceBackend()`: reads `VOICE_BACKEND` env, validates against `VoiceBackendType`, defaults to `elevenlabs` + 2. Add `GET /voice/backend` route: + - Success: `{ allowed: true, backend: VoiceBackendType }` + - Failure (missing key): `{ allowed: false, backend: VoiceBackendType, code: 'missing_elevenlabs_api_key' | 'missing_gemini_api_key', error: string }` + - Validates that the required API key exists for the configured backend + 3. Add `issueGeminiLiveToken()`: + - Read `GEMINI_API_KEY ?? GOOGLE_API_KEY`; if missing, return `{ allowed: false, code: 'missing_gemini_api_key' }` + - Use `@google/genai` SDK to create ephemeral token + - Return `{ allowed: true, backend: 'gemini-live', token, model: GEMINI_LIVE_MODEL, apiVersion: 'v1alpha', expiresAt: number }` + - Never cache Gemini tokens (they expire in ~60s) + 4. Refactor `POST /voice/token` handler: + - Branch on `resolveVoiceBackend()` — `elevenlabs` uses existing logic unchanged, `gemini-live` calls `issueGeminiLiveToken()` + - Discriminated union response type + 5. Error contract: all failure responses use `{ allowed: false, backend, code, error }` shape with appropriate HTTP status codes + 6. Add `@google/genai` to `hub/package.json` +- **Acceptance**: `GET /voice/backend` returns correct backend + allowed status; `POST /voice/token` returns valid token with `expiresAt` for Gemini; missing API key returns structured error; ElevenLabs path unchanged + +### Task 3: Web API Types + Client Functions +- **Type**: Frontend (web) +- **File scope**: + - `web/src/api/voice.ts` (modify — add types and fetch functions) + - `web/src/api/client.ts` (modify — add fetchVoiceBackend method) +- **Dependencies**: Task 2 +- **Implementation steps**: + 1. Add `VoiceBackendResponse` type: + ```ts + | { allowed: true; backend: VoiceBackendType } + | { allowed: false; backend: VoiceBackendType; code: string; error: string } + ``` + 2. 
Extend `VoiceTokenResponse` as discriminated union: + ```ts + | { allowed: true; backend: 'elevenlabs'; token: string; agentId: string } + | { allowed: true; backend: 'gemini-live'; token: string; model: string; apiVersion: string; expiresAt: number } + | { allowed: false; backend: string; code: string; error: string } + ``` + 3. Add `fetchVoiceBackend(api)` function with module-level cache (cache only successful responses; invalidate on error) + 4. Add `fetchVoiceBackend()` method to `ApiClient` class + 5. Update `fetchVoiceToken()` to handle union response +- **Acceptance**: Type-safe API calls for both backends; cached backend discovery; failed responses not cached + +### Task 4: Gemini Audio Pipeline +- **Type**: Frontend (web) +- **File scope** (all new files): + - `web/src/realtime/gemini/pcmUtils.ts` + - `web/src/realtime/gemini/pcm-recorder.worklet.ts` + - `web/src/realtime/gemini/audioRecorder.ts` + - `web/src/realtime/gemini/audioPlayer.ts` +- **Dependencies**: None (can parallel with Task 1-3) +- **Implementation steps**: + 1. `pcmUtils.ts`: Pure utility functions: + - `float32ToPcm16(samples: Float32Array): ArrayBuffer` + - `pcm16ToFloat32(buffer: ArrayBuffer): Float32Array` + - `arrayBufferToBase64(buffer: ArrayBuffer): string` + - `base64ToArrayBuffer(base64: string): ArrayBuffer` + 2. `pcm-recorder.worklet.ts`: AudioWorklet processor: + - Extends `AudioWorkletProcessor` + - `process()` method: accumulate Float32 samples into chunks (e.g., 4096 samples), post to main thread via `port.postMessage()` + - Register as `'pcm-recorder-processor'` + - Must be loadable via Vite: `import workletUrl from './pcm-recorder.worklet.ts?url'` + 3. `audioRecorder.ts`: class `GeminiAudioRecorder`: + - `start(onChunk: (base64Pcm: string) => void)`: + - `getUserMedia({ audio: { sampleRate: 16000, channelCount: 1 } })` + - Create `AudioContext({ sampleRate: 16000 })` + - `audioContext.audioWorklet.addModule(workletUrl)` + - Connect MediaStreamSource → AudioWorkletNode + - Worklet messages → `float32ToPcm16()` → `arrayBufferToBase64()` → `onChunk()` + - `stop()`: stop all tracks, disconnect nodes, close AudioContext + - `setMuted(muted: boolean)`: toggle `MediaStreamTrack.enabled` + - `dispose()`: idempotent full cleanup, safe to call multiple times + - Listen for `MediaStreamTrack.onended` (device unplugged) → invoke error callback + - **Fallback**: if `audioWorklet.addModule()` fails, fall back to `ScriptProcessorNode` (deprecated but wider support) + 4. 
`audioPlayer.ts`: class `GeminiAudioPlayer`: + - `constructor(audioContext?: AudioContext)`: use provided AudioContext or create new at 24kHz; maintain playback queue with scheduled end times + - `enqueue(base64Pcm: string)`: decode → create `AudioBufferSourceNode` → schedule at `max(audioContext.currentTime, lastEndTime)` → update `lastEndTime` + - `clearQueue()`: stop all scheduled sources immediately (for barge-in); reset `lastEndTime` + - `isPlaying(): boolean`: check if audio is currently being output + - `dispose()`: stop all, close AudioContext if we own it + - Handle Chrome tab backgrounding: detect `audioContext.state === 'suspended'` → attempt `resume()` → if blocked, notify via callback +- **Acceptance**: Recorder produces 16kHz PCM16 base64 chunks; Player plays 24kHz PCM16 smoothly without clicks; clearQueue stops immediately; device unplug detected; fallback for no-AudioWorklet browsers + +### Task 5: Gemini Tool Adapter +- **Type**: Frontend (web) +- **File scope** (new file): + - `web/src/realtime/gemini/toolAdapter.ts` +- **Dependencies**: Task 1 (for VOICE_TOOL_DEFINITIONS) +- **Implementation steps**: + 1. `getGeminiFunctionDeclarations()`: import `VOICE_TOOL_DEFINITIONS` from shared (single source of truth), map to Gemini schema format — no separate declaration, no schema drift risk + 2. `handleGeminiToolCalls(functionCalls, clientTools)`: + - Process calls **serially** (one at a time, in order) + - For each call: lookup function name in `realtimeClientTools`, execute with args, collect result + - **Preserve call IDs**: each `FunctionResponse` must include the matching `id` from the `FunctionCall` + - **Per-call timeout**: wrap each execution in a 30s timeout; return `'error (timeout)'` on expiry + - **Error isolation**: tool failure returns error string as response, never throws, never crashes session + - Return `FunctionResponse[]` array + 3. `validateToolArgs(name: string, args: unknown): boolean`: basic validation that required params exist +- **Acceptance**: Function declarations derived from single source; tool calls route correctly; call IDs preserved in responses; timeout works; errors don't crash session + +### Task 6: GeminiLiveVoiceSession Implementation +- **Type**: Frontend (web) +- **File scope** (new file): + - `web/src/realtime/GeminiLiveVoiceSession.tsx` + - `web/package.json` (modify — add `@google/genai`) +- **Dependencies**: Task 3, Task 4, Task 5 +- **Implementation steps**: + 1. 
Create `GeminiLiveVoiceSessionImpl` class implementing `VoiceSession` interface: + - **`startSession(config)`**: + - Fetch token from hub via `fetchVoiceToken(api)` + - Build config via `buildGeminiLiveConfig()` from shared + - Call `ai.live.connect({ model, config, callbacks })` with ephemeral token + - Start audio recorder → pipe chunks to live session via `sendRealtimeInput()` + - Seed initial context via `session.sendClientContent()` (one-time) + - Set status 'connected' + - **`endSession()`**: call `dispose()` (see below) + - **`sendTextMessage(message)`**: send as realtime text input to live session + - **`sendContextualUpdate(update)`**: send as realtime text input with `[CONTEXT UPDATE] ` prefix + - **`dispose(reason?: string)`**: single idempotent teardown path: + - Stop recorder (releases mic) + - Clear + dispose player + - Close live session WebSocket + - Reset all internal state + - Safe to call from any failure branch, unmount, session switch, or error + - **Reconnect logic**: + - On WebSocket close/error: if `reason !== 'user-initiated'`, attempt reconnect + - Fetch fresh token from hub (old one expired) + - Recreate live session with new token + - Reseed context via `sendClientContent()` + - Max 3 reconnect attempts with exponential backoff (1s, 3s, 9s) + - After 3 failures: set status 'error', show error in VoiceErrorBanner + 2. Create `GeminiLiveVoiceSession` React component: + - Props: same as `RealtimeVoiceSessionProps` + - **On mount**: instantiate impl, register via `registerVoiceSession()`, register session store + - **useEffect cleanup**: call `dispose('unmount')` — handles React Strict Mode double-mount correctly + - Handle `micMuted` prop: delegate to `recorder.setMuted()` — if recorder not yet started, store as pending state applied on recorder start + - Wire live session callbacks: + - `onopen` → status 'connected' + - `onclose` → attempt reconnect or status 'disconnected' + - `onerror` → status 'error' with message + - `onmessage`: dispatch by type: + - Audio data → `player.enqueue(base64)` + - Tool call → `toolAdapter.handleGeminiToolCalls()` → `session.sendToolResponse()` + - Text → log/ignore (voice session doesn't render text) + - **Barge-in**: when server signals user is speaking (or audio input detected while player active) → `player.clearQueue()` + - **AudioContext creation**: create AudioContext **synchronously in startSession**, which is called from user click handler → satisfies mobile autoplay policy + - Share AudioContext between recorder and player where sample rates allow (otherwise separate contexts) + - Render nothing (same as ElevenLabs version) + 3. Add `@google/genai` to `web/package.json` +- **Acceptance**: Full voice conversation works; tool calls execute correctly with preserved IDs; mic mute works (including pending state); barge-in clears playback; reconnect works on token expiry/WebSocket drop; dispose is idempotent; no resource leaks on unmount; works on mobile (AudioContext in gesture) + +### Task 7: Voice Backend Switcher + Integration +- **Type**: Frontend (web) +- **File scope**: + - `web/src/realtime/VoiceBackendSession.tsx` (new) + - `web/src/realtime/index.ts` (modify — add export) + - `web/src/components/SessionChat.tsx` (modify — change import + JSX, add auto-stop) + - `web/src/lib/voice-context.tsx` (modify — add voiceReady state) +- **Dependencies**: Task 6 +- **Implementation steps**: + 1. 
Create `VoiceBackendSession.tsx`:
+     - Props: same as `RealtimeVoiceSessionProps` + `api: ApiClient`
+     - On mount: call `fetchVoiceBackend(api)` (cached), store result in state
+     - Render:
+       - Loading (no backend yet): return null
+       - `backend === 'gemini-live'` → `React.lazy(() => import('./GeminiLiveVoiceSession'))` wrapped in `<Suspense>`
+       - Default → `<RealtimeVoiceSession ...props />`
+       - `allowed === false` → return null (voice not available)
+  2. Update `web/src/lib/voice-context.tsx`:
+     - Add `voiceReady: boolean` to context (default false)
+     - Set `voiceReady = true` after backend discovery completes with `allowed: true`
+     - Expose `voiceReady` in `useVoice()` return
+     - Voice button disabled until `voiceReady === true`
+  3. Update `web/src/components/SessionChat.tsx`:
+     - Change import: `RealtimeVoiceSession` → `VoiceBackendSession`
+     - Change JSX: `<RealtimeVoiceSession ...props />` → `<VoiceBackendSession api={api} ...props />`
diff --git a/hub/src/web/routes/voice.test.ts b/hub/src/web/routes/voice.test.ts
new file mode 100644
index 0000000000..f2eb2444ba
--- /dev/null
+++ b/hub/src/web/routes/voice.test.ts
@@ -0,0 +1,152 @@
+import { describe, test, expect, afterEach } from 'bun:test'
+import { Hono } from 'hono'
+import type { WebAppEnv } from '../middleware/auth'
+import { createVoiceRoutes } from './voice'
+
+function createApp() {
+  const app = new Hono<WebAppEnv>()
+  app.route('/api', createVoiceRoutes())
+  return app
+}
+
+describe('GET /api/voice/backend', () => {
+  const originalEnv = process.env.VOICE_BACKEND
+
+  afterEach(() => {
+    if (originalEnv === undefined) {
+      delete process.env.VOICE_BACKEND
+    } else {
+      process.env.VOICE_BACKEND = originalEnv
+    }
+  })
+
+  test('returns elevenlabs by default', async () => {
+    delete process.env.VOICE_BACKEND
+    const app = createApp()
+    const res = await app.request('/api/voice/backend')
+    expect(res.status).toBe(200)
+    const body = await res.json() as { backend: string }
+    expect(body.backend).toBe('elevenlabs')
+  })
+
+  test('returns gemini-live when configured', async () => {
+    process.env.VOICE_BACKEND = 'gemini-live'
+    const app = createApp()
+    const res = await app.request('/api/voice/backend')
+    expect(res.status).toBe(200)
+    const body = await res.json() as { backend: string }
+    expect(body.backend).toBe('gemini-live')
+  })
+
+  test('returns qwen-realtime when configured', async () => {
+    process.env.VOICE_BACKEND = 'qwen-realtime'
+    const app = createApp()
+    const res = await app.request('/api/voice/backend')
+    expect(res.status).toBe(200)
+    const body = await res.json() as { backend: string }
+    expect(body.backend).toBe('qwen-realtime')
+  })
+
+  test('falls back to elevenlabs for unknown values', async () => {
+    process.env.VOICE_BACKEND = 'unknown-backend'
+    const app = createApp()
+    const res = await app.request('/api/voice/backend')
+    expect(res.status).toBe(200)
+    const body = await res.json() as { backend: string }
+    expect(body.backend).toBe('elevenlabs')
+  })
+})
+
+describe('POST /api/voice/gemini-token', () => {
+  const origGemini = process.env.GEMINI_API_KEY
+  const origGoogle = process.env.GOOGLE_API_KEY
+
+  afterEach(() => {
+    if (origGemini === undefined) delete process.env.GEMINI_API_KEY
+    else process.env.GEMINI_API_KEY = origGemini
+    if (origGoogle === undefined) delete process.env.GOOGLE_API_KEY
+    else process.env.GOOGLE_API_KEY = origGoogle
+  })
+
+  test('returns 400 when no API key configured', async () => {
+    delete process.env.GEMINI_API_KEY
+    delete process.env.GOOGLE_API_KEY
+    const app = createApp()
+    const res = await
app.request('/api/voice/gemini-token', { method: 'POST' }) + expect(res.status).toBe(400) + const body = await res.json() as { allowed: boolean; error: string } + expect(body.allowed).toBe(false) + expect(body.error).toContain('not configured') + }) + + test('returns proxied wsUrl when GEMINI_API_KEY is set', async () => { + process.env.GEMINI_API_KEY = 'test-gemini-key' + delete process.env.GOOGLE_API_KEY + const app = createApp() + const res = await app.request('/api/voice/gemini-token', { method: 'POST' }) + expect(res.status).toBe(200) + const body = await res.json() as { allowed: boolean; apiKey: string; wsUrl: string } + expect(body.allowed).toBe(true) + expect(body.apiKey).toBe('proxied') + expect(body.wsUrl).toContain('/api/voice/gemini-ws') + }) + + test('falls back to GOOGLE_API_KEY', async () => { + delete process.env.GEMINI_API_KEY + process.env.GOOGLE_API_KEY = 'test-google-key' + const app = createApp() + const res = await app.request('/api/voice/gemini-token', { method: 'POST' }) + expect(res.status).toBe(200) + const body = await res.json() as { allowed: boolean; apiKey: string; wsUrl: string } + expect(body.allowed).toBe(true) + expect(body.apiKey).toBe('proxied') + expect(body.wsUrl).toContain('/api/voice/gemini-ws') + }) +}) + +describe('POST /api/voice/qwen-token', () => { + const origDash = process.env.DASHSCOPE_API_KEY + const origQwen = process.env.QWEN_API_KEY + + afterEach(() => { + if (origDash === undefined) delete process.env.DASHSCOPE_API_KEY + else process.env.DASHSCOPE_API_KEY = origDash + if (origQwen === undefined) delete process.env.QWEN_API_KEY + else process.env.QWEN_API_KEY = origQwen + }) + + test('returns 400 when no API key configured', async () => { + delete process.env.DASHSCOPE_API_KEY + delete process.env.QWEN_API_KEY + const app = createApp() + const res = await app.request('/api/voice/qwen-token', { method: 'POST' }) + expect(res.status).toBe(400) + const body = await res.json() as { allowed: boolean; error: string } + expect(body.allowed).toBe(false) + expect(body.error).toContain('not configured') + }) + + test('returns wsUrl when DASHSCOPE_API_KEY is set (no raw key exposed)', async () => { + process.env.DASHSCOPE_API_KEY = 'test-dash-key' + delete process.env.QWEN_API_KEY + const app = createApp() + const res = await app.request('/api/voice/qwen-token', { method: 'POST' }) + expect(res.status).toBe(200) + const body = await res.json() as { allowed: boolean; wsUrl: string } + expect(body.allowed).toBe(true) + expect(body.wsUrl).toContain('/api/voice/qwen-ws') + expect(body).not.toHaveProperty('apiKey') + }) + + test('falls back to QWEN_API_KEY', async () => { + delete process.env.DASHSCOPE_API_KEY + process.env.QWEN_API_KEY = 'test-qwen-key' + const app = createApp() + const res = await app.request('/api/voice/qwen-token', { method: 'POST' }) + expect(res.status).toBe(200) + const body = await res.json() as { allowed: boolean; wsUrl: string } + expect(body.allowed).toBe(true) + expect(body.wsUrl).toContain('/api/voice/qwen-ws') + expect(body).not.toHaveProperty('apiKey') + }) +}) diff --git a/hub/src/web/routes/voice.ts b/hub/src/web/routes/voice.ts index 1a55f83639..a0863fa45e 100644 --- a/hub/src/web/routes/voice.ts +++ b/hub/src/web/routes/voice.ts @@ -4,8 +4,10 @@ import type { WebAppEnv } from '../middleware/auth' import { ELEVENLABS_API_BASE, VOICE_AGENT_NAME, - buildVoiceAgentConfig + buildVoiceAgentConfig, + DEFAULT_VOICE_BACKEND } from '@hapi/protocol/voice' +import type { VoiceBackendType } from '@hapi/protocol/voice' const 
tokenRequestSchema = z.object({
   customAgentId: z.string().optional(),
@@ -116,6 +118,65 @@ async function getOrCreateAgentId(apiKey: string): Promise<string> {
 
 export function createVoiceRoutes(): Hono<WebAppEnv> {
   const app = new Hono<WebAppEnv>()
+
+  // Return the configured voice backend type
+  app.get('/voice/backend', (c) => {
+    const raw = process.env.VOICE_BACKEND
+    const backend: VoiceBackendType =
+      raw === 'gemini-live' ? 'gemini-live'
+      : raw === 'qwen-realtime' ? 'qwen-realtime'
+      : DEFAULT_VOICE_BACKEND
+    return c.json({ backend })
+  })
+
+  // Hand out Gemini connection info for Gemini Live voice sessions.
+  // Rather than minting ephemeral tokens, the hub keeps the real API key
+  // server-side behind a WS proxy; no key material ever reaches the browser.
+  app.post('/voice/gemini-token', async (c) => {
+    const apiKey = process.env.GEMINI_API_KEY || process.env.GOOGLE_API_KEY
+    if (!apiKey) {
+      return c.json({
+        allowed: false,
+        error: 'Gemini API key not configured (set GEMINI_API_KEY or GOOGLE_API_KEY)'
+      }, 400)
+    }
+
+    // Use server-side WS proxy to avoid region restrictions.
+    // The proxy at /api/voice/gemini-ws handles the API key server-side.
+    // Derive wsUrl from the request origin so remote browsers connect back to the hub,
+    // not to localhost. HAPI_PUBLIC_URL overrides when set (e.g. behind a reverse proxy).
+    const requestOrigin = new URL(c.req.url).origin
+    const publicUrl = process.env.HAPI_PUBLIC_URL || requestOrigin
+    const wsProxyUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/gemini-ws'
+
+    return c.json({
+      allowed: true,
+      apiKey: 'proxied', // Dummy — key is handled server-side
+      wsUrl: wsProxyUrl, // Always proxy — env WS URLs are upstream-only (server-side)
+      baseUrl: process.env.GEMINI_API_BASE || undefined
+    })
+  })
+
+  // Check Qwen (DashScope) availability for Qwen Realtime voice sessions
+  // The actual API key is never sent to the browser — it stays server-side in the WS proxy.
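+  // Response contract (illustrative): HTTP 400 with { allowed: false, error } when no
+  // key is configured; otherwise { allowed: true, wsUrl: 'wss://<hub-host>/api/voice/qwen-ws' }.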
+  app.post('/voice/qwen-token', async (c) => {
+    const apiKey = process.env.DASHSCOPE_API_KEY || process.env.QWEN_API_KEY
+    if (!apiKey) {
+      return c.json({
+        allowed: false,
+        error: 'DashScope API key not configured (set DASHSCOPE_API_KEY or QWEN_API_KEY)'
+      }, 400)
+    }
+
+    const requestOrigin = new URL(c.req.url).origin
+    const publicUrl = process.env.HAPI_PUBLIC_URL || requestOrigin
+    const wsProxyUrl = publicUrl.replace(/^http/, 'ws') + '/api/voice/qwen-ws'
+
+    return c.json({
+      allowed: true,
+      wsUrl: wsProxyUrl // Always proxy — env WS URLs are upstream-only (server-side)
+    })
+  })
+
   // Get ElevenLabs ConvAI conversation token
   app.post('/voice/token', async (c) => {
     const json = await c.req.json().catch(() => null)
diff --git a/hub/src/web/server.ts b/hub/src/web/server.ts
index b4dbf4eb5e..3272902be6 100644
--- a/hub/src/web/server.ts
+++ b/hub/src/web/server.ts
@@ -21,8 +21,122 @@ import { createPushRoutes } from './routes/push'
 import { createVoiceRoutes } from './routes/voice'
 import type { SSEManager } from '../sse/sseManager'
 import type { VisibilityTracker } from '../visibility/visibilityTracker'
-import type { Server as BunServer } from 'bun'
+import type { Server as BunServer, ServerWebSocket } from 'bun'
 import type { Server as SocketEngine } from '@socket.io/bun-engine'
+import { jwtVerify } from 'jose'
+
+// Gemini Live WebSocket proxy — relays browser WS to Google, bypassing region restrictions
+function createGeminiProxyWebSocketHandler() {
+  const GEMINI_WS_BASE = 'wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent'
+  const upstreamMap = new WeakMap<ServerWebSocket<unknown>, WebSocket>()
+  const pendingMap = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
+
+  return {
+    open(clientWs: ServerWebSocket<unknown>) {
+      const data = clientWs.data as { _geminiProxy: boolean; apiKey: string }
+      const upstreamUrl = `${process.env.GEMINI_LIVE_WS_URL || GEMINI_WS_BASE}?key=${encodeURIComponent(data.apiKey)}`
+      const pending: Array<string | ArrayBuffer | Uint8Array> = []
+      pendingMap.set(clientWs, pending)
+
+      const upstream = new WebSocket(upstreamUrl)
+      upstreamMap.set(clientWs, upstream)
+
+      upstream.onopen = () => {
+        // Flush any messages queued while upstream was connecting (e.g. setup frame)
+        for (const queued of pending.splice(0)) {
+          upstream.send(typeof queued === 'string' ? queued : queued)
+        }
+        pendingMap.delete(clientWs)
+      }
+      upstream.onmessage = (event) => {
+        try {
+          if (clientWs.readyState === 1) {
+            clientWs.send(typeof event.data === 'string' ? event.data : new Uint8Array(event.data as ArrayBuffer))
+          }
+        } catch { /* client gone */ }
+      }
+      upstream.onerror = () => {
+        pendingMap.delete(clientWs)
+        try { clientWs.close(1011, 'Upstream error') } catch { /* */ }
+      }
+      upstream.onclose = (event) => {
+        pendingMap.delete(clientWs)
+        try { clientWs.close(event.code, event.reason) } catch { /* */ }
+        upstreamMap.delete(clientWs)
+      }
+    },
+    message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
+      const upstream = upstreamMap.get(clientWs)
+      if (upstream?.readyState === WebSocket.OPEN) {
+        upstream.send(typeof message === 'string' ? message : message)
+      } else if (upstream?.readyState === WebSocket.CONNECTING) {
+        // Queue messages until upstream opens (critical for the setup frame)
+        const pending = pendingMap.get(clientWs)
+        if (pending) pending.push(message)
+      }
+    },
+    close(clientWs: ServerWebSocket<unknown>, code: number, reason: string) {
+      const upstream = upstreamMap.get(clientWs)
+      pendingMap.delete(clientWs)
+      if (upstream) {
+        try { upstream.close(code, reason) } catch { /* */ }
+        upstreamMap.delete(clientWs)
+      }
+    }
+  }
+}
+
+// Qwen Realtime WebSocket proxy — bridges browser (no custom headers) to DashScope (requires Authorization header)
+function createQwenProxyWebSocketHandler() {
+  const QWEN_WS_BASE = 'wss://dashscope.aliyuncs.com/api-ws/v1/realtime'
+  // Map browser WS → upstream WS
+  const upstreamMap = new WeakMap<ServerWebSocket<unknown>, WebSocket>()
+
+  return {
+    open(clientWs: ServerWebSocket<unknown>) {
+      const data = clientWs.data as { apiKey: string; model: string }
+      const upstreamUrl = `${process.env.QWEN_REALTIME_WS_URL || QWEN_WS_BASE}?model=${encodeURIComponent(data.model)}`
+
+      const upstream = new WebSocket(upstreamUrl, {
+        headers: { 'Authorization': `Bearer ${data.apiKey}` }
+      } as unknown as string[])
+
+      upstreamMap.set(clientWs, upstream)
+
+      upstream.onopen = () => {
+        // Connection ready — upstream will send session.created
+      }
+      upstream.onmessage = (event) => {
+        // Forward upstream → client
+        try {
+          if (clientWs.readyState === 1) {
+            clientWs.send(typeof event.data === 'string' ? event.data : new Uint8Array(event.data as ArrayBuffer))
+          }
+        } catch { /* client gone */ }
+      }
+      upstream.onerror = () => {
+        try { clientWs.close(1011, 'Upstream error') } catch { /* */ }
+      }
+      upstream.onclose = (event) => {
+        try { clientWs.close(event.code, event.reason) } catch { /* */ }
+        upstreamMap.delete(clientWs)
+      }
+    },
+    message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
+      const upstream = upstreamMap.get(clientWs)
+      if (upstream?.readyState === WebSocket.OPEN) {
+        upstream.send(typeof message === 'string' ? message : message)
+      }
+    },
+    close(clientWs: ServerWebSocket<unknown>, code: number, reason: string) {
+      const upstream = upstreamMap.get(clientWs)
+      if (upstream) {
+        try { upstream.close(code, reason) } catch { /* */ }
+        upstreamMap.delete(clientWs)
+      }
+    }
+  }
+}
 import type { WebSocketData } from '@socket.io/bun-engine'
 import { loadEmbeddedAssetMap, type EmbeddedWebAsset } from './embeddedAssets'
 import { isBunCompiled } from '../utils/bunCompiled'
@@ -230,16 +344,98 @@ export async function startWebServer(options: {
 
   const socketHandler = options.socketEngine.handler()
 
-  const server = Bun.serve({
+  // Wrap the socket.io websocket handler to also route the Gemini/Qwen voice proxies
+  const originalWsHandler = socketHandler.websocket
+  const geminiProxyHandler = createGeminiProxyWebSocketHandler()
+  const qwenProxyHandler = createQwenProxyWebSocketHandler()
+
+  // eslint-disable-next-line @typescript-eslint/no-explicit-any
+  const server = (Bun.serve as any)({
     hostname: configuration.listenHost,
     port: configuration.listenPort,
     idleTimeout: Math.max(30, socketHandler.idleTimeout),
     maxRequestBodySize: Math.max(socketHandler.maxRequestBodySize, 68 * 1024 * 1024),
-    websocket: socketHandler.websocket,
-    fetch: (req, server) => {
+    websocket: {
+      ...originalWsHandler,
+      open(ws: unknown) {
+        const wsAny = ws as ServerWebSocket<{ _qwenProxy?: boolean; _geminiProxy?: boolean }>
+        if (wsAny.data?._geminiProxy) {
+          geminiProxyHandler.open(wsAny)
+        } else if (wsAny.data?._qwenProxy) {
+          qwenProxyHandler.open(wsAny)
+        } else {
+          originalWsHandler.open?.(ws as never)
+        }
+      },
+      message(ws: unknown, message: unknown) {
+        const wsAny = ws as ServerWebSocket<{ _qwenProxy?: boolean; _geminiProxy?: boolean }>
+        if (wsAny.data?._geminiProxy) {
+          geminiProxyHandler.message(wsAny, message as string)
+        } else if (wsAny.data?._qwenProxy) {
+          qwenProxyHandler.message(wsAny, message as string)
+        } else {
+          originalWsHandler.message?.(ws as never, message as never)
+        }
+      },
+      close(ws: unknown, code: number, reason: string) {
+        const wsAny = ws as ServerWebSocket<{ _qwenProxy?: boolean; _geminiProxy?: boolean }>
+        if (wsAny.data?._geminiProxy) {
+          geminiProxyHandler.close(wsAny, code, reason)
+        } else if (wsAny.data?._qwenProxy) {
+          qwenProxyHandler.close(wsAny, code, reason)
+        } else {
+          originalWsHandler.close?.(ws as never, code as never, reason as never)
+        }
+      }
+    },
+    fetch: async (req: Request, server: { upgrade: (req: Request, opts?: unknown) => boolean }) => {
       const url = new URL(req.url)
       if (url.pathname.startsWith('/socket.io/')) {
-        return socketHandler.fetch(req, server)
+        return socketHandler.fetch(req, server as never)
+      }
+
+      // Voice WebSocket proxies — require JWT auth via query param
+      // (browser WebSocket API cannot set custom headers)
+      if (url.pathname === '/api/voice/gemini-ws' || url.pathname === '/api/voice/qwen-ws') {
+        const token = url.searchParams.get('token')
+        if (!token) {
+          return new Response('Missing authorization token', { status: 401 })
+        }
+        try {
+          await jwtVerify(token, options.jwtSecret, { algorithms: ['HS256'] })
+        } catch {
+          return new Response('Invalid token', { status: 401 })
+        }
+      }
+
+      // Gemini Live WebSocket proxy
+      if (url.pathname === '/api/voice/gemini-ws') {
+        const apiKey = process.env.GEMINI_API_KEY || process.env.GOOGLE_API_KEY
+        if (!apiKey) {
+          return new Response('Gemini API key not configured', { status: 400 })
+        }
+        const upgraded = (server as unknown as { upgrade: (req: Request, opts: unknown) => boolean }).upgrade(req, {
+          data: { _geminiProxy: true, apiKey }
+        })
+
if (!upgraded) { + return new Response('WebSocket upgrade failed', { status: 500 }) + } + return undefined as unknown as Response + } + // Qwen Realtime WebSocket proxy + if (url.pathname === '/api/voice/qwen-ws') { + const apiKey = process.env.DASHSCOPE_API_KEY || process.env.QWEN_API_KEY + const model = url.searchParams.get('model') || 'qwen3.5-omni-plus-realtime' + if (!apiKey) { + return new Response('DashScope API key not configured', { status: 400 }) + } + const upgraded = (server as unknown as { upgrade: (req: Request, opts: unknown) => boolean }).upgrade(req, { + data: { _qwenProxy: true, apiKey, model } + }) + if (!upgraded) { + return new Response('WebSocket upgrade failed', { status: 500 }) + } + return undefined as unknown as Response } return app.fetch(req) } diff --git a/shared/src/voice.ts b/shared/src/voice.ts index 6751f0eba4..2c2124f876 100644 --- a/shared/src/voice.ts +++ b/shared/src/voice.ts @@ -8,7 +8,11 @@ export const ELEVENLABS_API_BASE = 'https://api.elevenlabs.io/v1' export const VOICE_AGENT_NAME = 'Hapi Voice Assistant' -export const VOICE_SYSTEM_PROMPT = `# Identity +export const VOICE_SYSTEM_PROMPT = `# CRITICAL RULE - Tool Usage + +You MUST call the messageCodingAgent tool for ANY request related to coding, files, development, debugging, or tasks for the agent. Do NOT respond verbally to these requests — call the tool FIRST, then briefly confirm. This is your most important behavior. + +# Identity You are Hapi Voice Assistant. You bridge voice communication between users and their AI coding agents in the Hapi ecosystem. @@ -136,9 +140,28 @@ For builds, tests, or large file operations: - Treat garbled input as phonetic hints and ask for clarification - Correct yourself immediately if you realize you made an error - Keep conversations forward-moving with fresh insights -- Assume a technical software developer audience` +- Assume a technical software developer audience + +# First Interaction + +When the user speaks to you for the first time, begin your response with a brief greeting before addressing their request. If their first message is a coding request, greet briefly AND call the tool — do both.` + +/** + * Additional language block appended to VOICE_SYSTEM_PROMPT for Gemini/Qwen + * backends (which don't have a separate language field like ElevenLabs). + */ +export const VOICE_CHINESE_LANGUAGE_BLOCK = ` + +# Language -export const VOICE_FIRST_MESSAGE = "Hey! Hapi here." +IMPORTANT: Always respond in Chinese (Mandarin). Use natural spoken Chinese. +- Greet users in Chinese +- Summarize technical content in Chinese +- Use English only for proper nouns, tool names, and code identifiers +- Keep the same warm, concise conversational style in Chinese` + +/** ElevenLabs first message — language controlled by ElevenLabs language field */ +export const VOICE_FIRST_MESSAGE = "Hey! Hapi here — what can I help you with?" 
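+
+// Illustration: backends without a separate language field (Gemini Live, Qwen) append the
+// block to the system prompt when building instructions, e.g.:
+//   const instructions = VOICE_SYSTEM_PROMPT + VOICE_CHINESE_LANGUAGE_BLOCK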
export const VOICE_TOOLS = [ { @@ -255,3 +278,78 @@ export function buildVoiceAgentConfig(): VoiceAgentConfig { } } } + +export type VoiceBackendType = 'elevenlabs' | 'gemini-live' | 'qwen-realtime' + +export const QWEN_REALTIME_MODEL = 'qwen3-omni-flash-realtime' +export const QWEN_REALTIME_VOICE = 'Mia' + +export const DEFAULT_VOICE_BACKEND: VoiceBackendType = 'elevenlabs' + +export const GEMINI_LIVE_MODEL = 'gemini-2.5-flash-native-audio-latest' + +export interface VoiceToolDefinition { + name: string + description: string + parameters: { + type: 'object' + required: string[] + properties: Record + } +} + +type VoiceToolSource = Pick<(typeof VOICE_TOOLS)[number], 'name' | 'description' | 'parameters'> + +function cloneVoiceToolDefinition(tool: VoiceToolSource): VoiceToolDefinition { + const properties: VoiceToolDefinition['parameters']['properties'] = {} + + for (const [key, value] of Object.entries(tool.parameters.properties)) { + properties[key] = { + type: value.type, + description: value.description + } + } + + return { + name: tool.name, + description: tool.description, + parameters: { + type: 'object', + required: [...tool.parameters.required], + properties + } + } +} + +export const VOICE_TOOL_DEFINITIONS: VoiceToolDefinition[] = VOICE_TOOLS.map(cloneVoiceToolDefinition) + +export type GeminiLiveFunctionDeclaration = VoiceToolDefinition + +export interface GeminiLiveConfig { + model: string + systemInstruction: string + tools: Array<{ + functionDeclarations: GeminiLiveFunctionDeclaration[] + }> + responseModalities: ['AUDIO'] +} + +export function buildGeminiLiveFunctionDeclarations(): GeminiLiveFunctionDeclaration[] { + return VOICE_TOOLS.map(cloneVoiceToolDefinition) +} + +export function buildGeminiLiveConfig(): GeminiLiveConfig { + return { + model: GEMINI_LIVE_MODEL, + systemInstruction: VOICE_SYSTEM_PROMPT + VOICE_CHINESE_LANGUAGE_BLOCK, + tools: [ + { + functionDeclarations: buildGeminiLiveFunctionDeclarations() + } + ], + responseModalities: ['AUDIO'] + } +} diff --git a/web/src/api/client.ts b/web/src/api/client.ts index 7f1083c8af..ff76fe89a9 100644 --- a/web/src/api/client.ts +++ b/web/src/api/client.ts @@ -443,4 +443,37 @@ export class ApiClient { body: JSON.stringify(options || {}) }) } + + /** Return the current auth token (for WebSocket query-param auth). */ + getAuthToken(): string | null { + return this.getToken ? 
this.getToken() : this.token + } + + async fetchVoiceBackend(): Promise<{ backend: string }> { + return await this.request('/api/voice/backend') + } + + async fetchQwenToken(): Promise<{ + allowed: boolean + wsUrl?: string + error?: string + }> { + return await this.request('/api/voice/qwen-token', { + method: 'POST', + body: JSON.stringify({}) + }) + } + + async fetchGeminiToken(): Promise<{ + allowed: boolean + apiKey?: string + wsUrl?: string + baseUrl?: string + error?: string + }> { + return await this.request('/api/voice/gemini-token', { + method: 'POST', + body: JSON.stringify({}) + }) + } } diff --git a/web/src/api/voice.ts b/web/src/api/voice.ts index 66cee443f1..3105c4ab95 100644 --- a/web/src/api/voice.ts +++ b/web/src/api/voice.ts @@ -15,6 +15,7 @@ import { VOICE_AGENT_NAME, buildVoiceAgentConfig } from '@hapi/protocol/voice' +import type { VoiceBackendType } from '@hapi/protocol/voice' export interface VoiceTokenResponse { allowed: boolean @@ -160,3 +161,66 @@ export async function createOrUpdateHapiAgent(apiKey: string): Promise { + try { + return await api.fetchQwenToken() + } catch (error) { + return { + allowed: false, + error: error instanceof Error ? error.message : 'Network error' + } + } +} + +export interface VoiceBackendResponse { + backend: VoiceBackendType +} + +export interface GeminiTokenResponse { + allowed: boolean + apiKey?: string + wsUrl?: string + baseUrl?: string + error?: string +} + +/** + * Discover which voice backend the hub is configured to use. + */ +export async function fetchVoiceBackend(api: ApiClient): Promise { + try { + const result = await api.fetchVoiceBackend() + const backend = result.backend === 'gemini-live' ? 'gemini-live' + : result.backend === 'qwen-realtime' ? 'qwen-realtime' + : 'elevenlabs' + return { backend } as VoiceBackendResponse + } catch { + return { backend: 'elevenlabs' } + } +} + +/** + * Fetch a Gemini API key from the hub for Gemini Live voice sessions. + */ +export async function fetchGeminiToken(api: ApiClient): Promise { + try { + return await api.fetchGeminiToken() + } catch (error) { + return { + allowed: false, + error: error instanceof Error ? error.message : 'Network error' + } + } +} diff --git a/web/src/components/AssistantChat/HappyComposer.tsx b/web/src/components/AssistantChat/HappyComposer.tsx index 6d0e20d3d5..3168dfe03e 100644 --- a/web/src/components/AssistantChat/HappyComposer.tsx +++ b/web/src/components/AssistantChat/HappyComposer.tsx @@ -303,29 +303,29 @@ export function HappyComposer(props: { return } - // Shift+Enter inserts a newline (standard behavior) - if (key === 'Enter' && e.shiftKey) { - return // let default textarea behavior handle newline - } - // Enter with suggestions visible: select the suggestion - if (key === 'Enter' && suggestions.length > 0) { + if (key === 'Enter' && suggestions.length > 0 && !e.ctrlKey && !e.metaKey) { e.preventDefault() const indexToSelect = selectedIndex >= 0 ? 
selectedIndex : 0 handleSuggestionSelect(indexToSelect) return } - // Only plain Enter (no modifiers) sends; other modifier combos are ignored - if (key === 'Enter') { + // Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) sends the message + if (key === 'Enter' && (e.ctrlKey || e.metaKey)) { e.preventDefault() - if (!e.ctrlKey && !e.altKey && !e.metaKey && canSend) { + if (canSend) { api.composer().send() setShowContinueHint(false) } return } + // Plain Enter inserts a newline (default textarea behavior) + if (key === 'Enter') { + return + } + if (suggestions.length > 0) { if (key === 'ArrowUp') { e.preventDefault() diff --git a/web/src/components/SessionChat.tsx b/web/src/components/SessionChat.tsx index 2a60c62b29..40489d809a 100644 --- a/web/src/components/SessionChat.tsx +++ b/web/src/components/SessionChat.tsx @@ -27,7 +27,7 @@ import { TeamPanel } from '@/components/TeamPanel' import { usePlatform } from '@/hooks/usePlatform' import { useSessionActions } from '@/hooks/mutations/useSessionActions' import { useVoiceOptional } from '@/lib/voice-context' -import { RealtimeVoiceSession, registerSessionStore, registerVoiceHooksStore, voiceHooks } from '@/realtime' +import { VoiceBackendSession, registerSessionStore, registerVoiceHooksStore, voiceHooks } from '@/realtime' import { isRemoteTerminalSupported } from '@/utils/terminalSupport' export function SessionChat(props: { @@ -80,6 +80,7 @@ export function SessionChat(props: { // Voice assistant integration const voice = useVoiceOptional() + const [voiceBackendReady, setVoiceBackendReady] = useState(false) // Register session store for voice client tools useEffect(() => { @@ -423,18 +424,19 @@ export function SessionChat(props: { autocompleteSuggestions={props.autocompleteSuggestions} voiceStatus={voice?.status} voiceMicMuted={voice?.micMuted} - onVoiceToggle={voice ? handleVoiceToggle : undefined} - onVoiceMicToggle={voice ? handleVoiceMicToggle : undefined} + onVoiceToggle={voice && voiceBackendReady ? handleVoiceToggle : undefined} + onVoiceMicToggle={voice && voiceBackendReady ? 
handleVoiceMicToggle : undefined} /> - {/* Voice session component - renders nothing but initializes ElevenLabs */} + {/* Voice session component - renders nothing but initializes voice backend */} {voice && ( - )} diff --git a/web/src/realtime/GeminiLiveVoiceSession.tsx b/web/src/realtime/GeminiLiveVoiceSession.tsx new file mode 100644 index 0000000000..a0461e092f --- /dev/null +++ b/web/src/realtime/GeminiLiveVoiceSession.tsx @@ -0,0 +1,409 @@ +import { useEffect, useRef, useCallback } from 'react' +import { registerVoiceSession, resetRealtimeSessionState } from './RealtimeSession' +import { registerSessionStore } from './realtimeClientTools' +import { fetchGeminiToken } from '@/api/voice' +import { GeminiAudioRecorder } from './gemini/audioRecorder' +import { GeminiAudioPlayer } from './gemini/audioPlayer' +import { handleGeminiFunctionCalls } from './gemini/toolAdapter' +import { buildGeminiLiveConfig } from '@hapi/protocol/voice' +import type { VoiceSession, VoiceSessionConfig, StatusCallback } from './types' +import type { ApiClient } from '@/api/client' +import type { Session } from '@/types/api' +import type { GeminiFunctionCall } from './gemini/toolAdapter' + +const DEBUG = import.meta.env.DEV + +// Default Gemini Live WebSocket API endpoint (Google direct) +const DEFAULT_GEMINI_LIVE_WS_BASE = 'wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent' + +interface GeminiLiveState { + ws: WebSocket | null + recorder: GeminiAudioRecorder | null + player: GeminiAudioPlayer | null + playbackContext: AudioContext | null + statusCallback: StatusCallback | null + apiKey: string | null + wsBaseUrl: string | null + modelSpeaking: boolean + micMuted: boolean +} + +const state: GeminiLiveState = { + ws: null, + recorder: null, + player: null, + playbackContext: null, + statusCallback: null, + apiKey: null, + wsBaseUrl: null, + modelSpeaking: false, + micMuted: false +} + +function cleanup() { + if (state.recorder) { + state.recorder.dispose() + state.recorder = null + } + if (state.player) { + state.player.dispose() + state.player = null + } + if (state.playbackContext && state.playbackContext.state !== 'closed') { + void state.playbackContext.close() + } + state.playbackContext = null + if (state.ws) { + if (state.ws.readyState === WebSocket.OPEN || state.ws.readyState === WebSocket.CONNECTING) { + state.ws.close() + } + state.ws = null + } +} + +class GeminiLiveVoiceSessionImpl implements VoiceSession { + private api: ApiClient + + constructor(api: ApiClient) { + this.api = api + } + + async startSession(config: VoiceSessionConfig): Promise { + cleanup() + state.statusCallback?.('connecting') + + // Create playback AudioContext immediately while still inside the user + // gesture (click/tap). Mobile browsers require this for autoplay policy. + // Store in state so cleanup() can close it on failure or stop. + state.playbackContext = new AudioContext({ sampleRate: 24000 }) + await state.playbackContext.resume() + + // Get API key from hub + console.log('[GeminiLive] Fetching token...') + const tokenResp = await fetchGeminiToken(this.api) + console.log('[GeminiLive] Token response:', { allowed: tokenResp.allowed, hasKey: !!tokenResp.apiKey, error: tokenResp.error }) + if (!tokenResp.allowed || !tokenResp.apiKey) { + const msg = tokenResp.error ?? 
'Gemini API key not available' + console.error('[GeminiLive] Token failed:', msg) + state.statusCallback?.('error', msg) + throw new Error(msg) + } + state.apiKey = tokenResp.apiKey + state.wsBaseUrl = tokenResp.wsUrl || null + + // Request microphone + console.log('[GeminiLive] Requesting microphone...') + let permissionStream: MediaStream | null = null + try { + permissionStream = await navigator.mediaDevices.getUserMedia({ audio: true }) + console.log('[GeminiLive] Microphone granted') + } catch (error) { + console.error('[GeminiLive] Microphone denied:', error) + state.statusCallback?.('error', 'Microphone permission denied') + throw error + } finally { + permissionStream?.getTracks().forEach((t) => t.stop()) + } + + // Connect WebSocket — use proxy URL if provided (avoids region restrictions) + const wsBase = state.wsBaseUrl || DEFAULT_GEMINI_LIVE_WS_BASE + const isProxy = !!state.wsBaseUrl + const authToken = this.api.getAuthToken() || '' + const wsUrl = isProxy + ? `${wsBase}${wsBase.includes('?') ? '&' : '?'}token=${encodeURIComponent(authToken)}` + : `${wsBase}?key=${encodeURIComponent(state.apiKey)}` + console.log('[GeminiLive] Connecting WebSocket to:', wsBase, isProxy ? '(proxied)' : '(direct)') + const ws = new WebSocket(wsUrl) + state.ws = ws + + return new Promise((resolve, reject) => { + let setupDone = false + + ws.onopen = () => { + if (DEBUG) console.log('[GeminiLive] WebSocket connected, sending setup') + + const liveConfig = buildGeminiLiveConfig() + const setupMessage = { + setup: { + model: `models/${liveConfig.model}`, + generationConfig: { + responseModalities: ['AUDIO'], + speechConfig: { + voiceConfig: { + prebuiltVoiceConfig: { voiceName: 'Aoede' } + } + } + }, + systemInstruction: { + parts: [{ text: liveConfig.systemInstruction }] + }, + tools: liveConfig.tools.map((t) => ({ + functionDeclarations: t.functionDeclarations.map((fd) => ({ + name: fd.name, + description: fd.description, + parameters: fd.parameters + })) + })) + } + } + + ws.send(JSON.stringify(setupMessage)) + } + + ws.onmessage = async (event) => { + let data: Record + try { + if (event.data instanceof Blob) { + const text = await event.data.text() + data = JSON.parse(text) as Record + } else { + data = JSON.parse(event.data as string) as Record + } + } catch { + if (DEBUG) console.warn('[GeminiLive] Failed to parse message') + return + } + + // Log all message types for debugging + const msgKeys = Object.keys(data).filter(k => k !== 'serverContent' || !('modelTurn' in (data.serverContent as Record || {}))) + if (!data.serverContent) { + console.log('[GeminiLive] Message:', msgKeys.join(', '), JSON.stringify(data).slice(0, 200)) + } + + // Setup complete + if (data.setupComplete && !setupDone) { + setupDone = true + if (DEBUG) console.log('[GeminiLive] Setup complete') + state.statusCallback?.('connected') + + // Start audio capture + startAudioCapture(state.playbackContext!) 
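+          // Editor's note: from here the recorder streams base64 PCM16 chunks into
+          // sendAudioChunk (defined below), which wraps each one roughly as:
+          //   { realtimeInput: { mediaChunks: [{ mimeType: 'audio/pcm;rate=16000', data }] } }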
+ + // Send initial context if available (no clientContent greeting — it breaks tool calls) + if (config.initialContext) { + sendClientContent(`[Context] ${config.initialContext}`) + } + + resolve() + return + } + + // Server content (audio / text / turn complete) + const serverContent = data.serverContent as { + modelTurn?: { parts?: Array<{ inlineData?: { data: string; mimeType: string }; text?: string }> } + turnComplete?: boolean + } | undefined + + if (serverContent) { + if (serverContent.modelTurn?.parts) { + // Model is generating — mute mic to prevent barge-in from noise + if (!state.modelSpeaking) { + state.modelSpeaking = true + state.recorder?.setMuted(true) + } + for (const part of serverContent.modelTurn.parts) { + if (part.inlineData?.data) { + state.player?.enqueue(part.inlineData.data) + } + if (part.text) { + console.log('[GeminiLive] Text:', part.text) + } + } + } + if (serverContent.turnComplete) { + console.log('[GeminiLive] Turn complete') + // Model done — unmute mic for next user turn + state.modelSpeaking = false + state.recorder?.setMuted(false) + } + } + + // Tool calls + const toolCall = data.toolCall as { + functionCalls?: Array<{ name: string; args: Record; id: string }> + } | undefined + + if (toolCall?.functionCalls && toolCall.functionCalls.length > 0) { + console.log('[GeminiLive] Tool calls:', toolCall.functionCalls.map((c) => c.name)) + + const responses = await handleGeminiFunctionCalls( + toolCall.functionCalls as GeminiFunctionCall[] + ) + + // Send tool responses back + if (state.ws?.readyState === WebSocket.OPEN) { + state.ws.send(JSON.stringify({ + toolResponse: { + functionResponses: responses.map((r) => ({ + id: r.id, + name: r.name, + response: r.response + })) + } + })) + } + } + } + + ws.onerror = (event) => { + console.error('[GeminiLive] WebSocket error:', event) + if (!setupDone) { + state.statusCallback?.('error', 'WebSocket connection failed') + reject(new Error('WebSocket connection failed')) + } + } + + ws.onclose = (event) => { + if (DEBUG) console.log('[GeminiLive] WebSocket closed:', event.code, event.reason) + cleanup() + resetRealtimeSessionState() + if (!setupDone) { + const message = event.reason || 'WebSocket closed before setup completed' + state.statusCallback?.('error', message) + reject(new Error(message)) + return + } + state.statusCallback?.('disconnected') + } + }) + } + + async endSession(): Promise { + cleanup() + resetRealtimeSessionState() + state.statusCallback?.('disconnected') + } + + sendTextMessage(message: string): void { + sendClientContent(message) + } + + sendContextualUpdate(update: string): void { + // Send as a system-like context message + sendClientContent(`[System Context Update] ${update}`) + } +} + +function sendClientContent(text: string): void { + if (!state.ws || state.ws.readyState !== WebSocket.OPEN) return + state.ws.send(JSON.stringify({ + clientContent: { + turns: [{ role: 'user', parts: [{ text }] }], + turnComplete: true + } + })) +} + +function sendAudioChunk(base64Pcm: string): void { + if (!state.ws || state.ws.readyState !== WebSocket.OPEN) return + // Don't send audio while model is speaking + if (state.modelSpeaking) return + state.ws.send(JSON.stringify({ + realtimeInput: { + mediaChunks: [{ + mimeType: 'audio/pcm;rate=16000', + data: base64Pcm + }] + } + })) +} + +function startAudioCapture(playbackContext: AudioContext): void { + state.player = new GeminiAudioPlayer(playbackContext) + state.recorder = new GeminiAudioRecorder() + + state.recorder.start( + (pcm16Chunk) => 
sendAudioChunk(pcm16Chunk), + (error) => { + console.error('[GeminiLive] Audio capture error:', error) + state.statusCallback?.('error', 'Microphone error') + } + ) + + // Apply initial mute state — the React effect may have run before the recorder existed + if (state.micMuted) { + state.recorder.setMuted(true) + } +} + +// --- React component --- + +export interface GeminiLiveVoiceSessionProps { + api: ApiClient + micMuted?: boolean + onStatusChange?: StatusCallback + onRegistered?: () => void + getSession?: (sessionId: string) => Session | null + sendMessage?: (sessionId: string, message: string) => void + approvePermission?: (sessionId: string, requestId: string) => Promise + denyPermission?: (sessionId: string, requestId: string) => Promise +} + +export function GeminiLiveVoiceSession({ + api, + micMuted = false, + onStatusChange, + onRegistered, + getSession, + sendMessage, + approvePermission, + denyPermission +}: GeminiLiveVoiceSessionProps) { + const hasRegistered = useRef(false) + + // Store status callback + useEffect(() => { + state.statusCallback = onStatusChange || null + return () => { state.statusCallback = null } + }, [onStatusChange]) + + // Register session store for client tools + useEffect(() => { + if (getSession && sendMessage && approvePermission && denyPermission) { + registerSessionStore({ + getSession: (sessionId: string) => + getSession(sessionId) as { agentState?: { requests?: Record } } | null, + sendMessage, + approvePermission, + denyPermission + }) + } + }, [getSession, sendMessage, approvePermission, denyPermission]) + + // Register voice session once + useEffect(() => { + if (!hasRegistered.current) { + try { + registerVoiceSession(new GeminiLiveVoiceSessionImpl(api)) + hasRegistered.current = true + onRegistered?.() + } catch (error) { + console.error('[GeminiLive] Failed to register voice session:', error) + } + } + }, [api]) // eslint-disable-line react-hooks/exhaustive-deps + + // Sync mic mute state — also persist to module state so startAudioCapture can apply it + useEffect(() => { + state.micMuted = micMuted + if (state.recorder) { + state.recorder.setMuted(micMuted) + } + }, [micMuted]) + + // Handle barge-in: clear audio queue when user starts speaking + const handleBargeIn = useCallback(() => { + if (state.player?.isPlaying()) { + state.player.clearQueue() + } + }, []) + + // Cleanup on unmount + useEffect(() => { + return () => { + cleanup() + } + }, []) + + return null +} diff --git a/web/src/realtime/QwenVoiceSession.tsx b/web/src/realtime/QwenVoiceSession.tsx new file mode 100644 index 0000000000..f0624cc2da --- /dev/null +++ b/web/src/realtime/QwenVoiceSession.tsx @@ -0,0 +1,417 @@ +import { useEffect, useRef, useCallback } from 'react' +import { registerVoiceSession, resetRealtimeSessionState } from './RealtimeSession' +import { registerSessionStore } from './realtimeClientTools' +import { fetchQwenToken } from '@/api/voice' +import { GeminiAudioRecorder } from './gemini/audioRecorder' +import { GeminiAudioPlayer } from './gemini/audioPlayer' +import { realtimeClientTools } from './realtimeClientTools' +import { + QWEN_REALTIME_MODEL, + QWEN_REALTIME_VOICE, + VOICE_SYSTEM_PROMPT, + VOICE_CHINESE_LANGUAGE_BLOCK, + VOICE_TOOL_DEFINITIONS +} from '@hapi/protocol/voice' +import type { VoiceSession, VoiceSessionConfig, StatusCallback } from './types' +import type { ApiClient } from '@/api/client' +import type { Session } from '@/types/api' + +const DEBUG = import.meta.env.DEV + +// Qwen WebSocket connects via Hub proxy (browser can't set 
diff --git a/web/src/realtime/QwenVoiceSession.tsx b/web/src/realtime/QwenVoiceSession.tsx
new file mode 100644
index 0000000000..f0624cc2da
--- /dev/null
+++ b/web/src/realtime/QwenVoiceSession.tsx
@@ -0,0 +1,417 @@
+import { useEffect, useRef, useCallback } from 'react'
+import { registerVoiceSession, resetRealtimeSessionState } from './RealtimeSession'
+import { registerSessionStore } from './realtimeClientTools'
+import { fetchQwenToken } from '@/api/voice'
+import { GeminiAudioRecorder } from './gemini/audioRecorder'
+import { GeminiAudioPlayer } from './gemini/audioPlayer'
+import { realtimeClientTools } from './realtimeClientTools'
+import {
+  QWEN_REALTIME_MODEL,
+  QWEN_REALTIME_VOICE,
+  VOICE_SYSTEM_PROMPT,
+  VOICE_CHINESE_LANGUAGE_BLOCK,
+  VOICE_TOOL_DEFINITIONS
+} from '@hapi/protocol/voice'
+import type { VoiceSession, VoiceSessionConfig, StatusCallback } from './types'
+import type { ApiClient } from '@/api/client'
+import type { Session } from '@/types/api'
+
+const DEBUG = import.meta.env.DEV
+
+// Qwen WebSocket connects via Hub proxy (browser can't set Authorization header)
+
+interface QwenState {
+  ws: WebSocket | null
+  recorder: GeminiAudioRecorder | null
+  player: GeminiAudioPlayer | null
+  playbackContext: AudioContext | null
+  statusCallback: StatusCallback | null
+  apiKey: string | null
+  wsBaseUrl: string | null
+  micMuted: boolean
+}
+
+const state: QwenState = {
+  ws: null,
+  recorder: null,
+  player: null,
+  playbackContext: null,
+  statusCallback: null,
+  apiKey: null,
+  wsBaseUrl: null,
+  micMuted: false
+}
+
+let eventCounter = 0
+function nextEventId(): string {
+  return `evt_${++eventCounter}`
+}
+
+function cleanup() {
+  if (state.recorder) {
+    state.recorder.dispose()
+    state.recorder = null
+  }
+  if (state.player) {
+    state.player.dispose()
+    state.player = null
+  }
+  if (state.playbackContext && state.playbackContext.state !== 'closed') {
+    void state.playbackContext.close()
+  }
+  state.playbackContext = null
+  if (state.ws) {
+    if (state.ws.readyState === WebSocket.OPEN || state.ws.readyState === WebSocket.CONNECTING) {
+      state.ws.close()
+    }
+    state.ws = null
+  }
+}
+
+function sendEvent(type: string, payload?: Record<string, unknown>): void {
+  if (!state.ws || state.ws.readyState !== WebSocket.OPEN) return
+  state.ws.send(JSON.stringify({
+    event_id: nextEventId(),
+    type,
+    ...payload
+  }))
+}
+
+class QwenVoiceSessionImpl implements VoiceSession {
+  private api: ApiClient
+
+  constructor(api: ApiClient) {
+    this.api = api
+  }
+
+  async startSession(config: VoiceSessionConfig): Promise<void> {
+    cleanup()
+    state.statusCallback?.('connecting')
+
+    // Create playback AudioContext immediately while still inside the user
+    // gesture (click/tap). Mobile browsers require this for autoplay policy.
+    // Store in state so cleanup() can close it on failure or stop.
+    state.playbackContext = new AudioContext({ sampleRate: 24000 })
+    await state.playbackContext.resume()
+
+    // Check Qwen availability (hub no longer sends the raw API key)
+    const tokenResp = await fetchQwenToken(this.api)
+    if (!tokenResp.allowed) {
+      const msg = tokenResp.error ?? 'DashScope API key not available'
+      state.statusCallback?.('error', msg)
+      throw new Error(msg)
+    }
+    state.apiKey = null // key stays server-side
+    state.wsBaseUrl = tokenResp.wsUrl || null
+
+    // Request microphone
+    let permissionStream: MediaStream | null = null
+    try {
+      permissionStream = await navigator.mediaDevices.getUserMedia({ audio: true })
+    } catch (error) {
+      state.statusCallback?.('error', 'Microphone permission denied')
+      throw error
+    } finally {
+      permissionStream?.getTracks().forEach((t) => t.stop())
+    }
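+    // Note: the stream is stopped right away; this getUserMedia call is only a
+    // permission pre-flight so the recorder can open its own 16 kHz stream
+    // later without a second browser prompt (assuming the same device/origin).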
+
+    // Connect via Hub WebSocket proxy (DashScope requires Authorization header,
+    // which browser WebSocket API doesn't support)
+    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
+    const defaultProxyUrl = `${protocol}//${window.location.host}/api/voice/qwen-ws`
+    const proxyUrl = state.wsBaseUrl || defaultProxyUrl
+    const model = QWEN_REALTIME_MODEL
+    const authToken = this.api.getAuthToken() || ''
+    const separator = proxyUrl.includes('?') ? '&' : '?'
+    const wsUrl = `${proxyUrl}${separator}model=${encodeURIComponent(model)}&token=${encodeURIComponent(authToken)}`
+    const ws = new WebSocket(wsUrl)
+    state.ws = ws
+
+    return new Promise<void>((resolve, reject) => {
+      let sessionReady = false
+
+      ws.onopen = () => {
+        if (DEBUG) console.log('[Qwen] WebSocket connected')
+      }
+
+      ws.onmessage = async (event) => {
+        let data: Record<string, unknown>
+        try {
+          data = JSON.parse(event.data as string) as Record<string, unknown>
+        } catch {
+          if (DEBUG) console.warn('[Qwen] Failed to parse message')
+          return
+        }
+
+        const eventType = data.type as string
+
+        // Session created - send configuration
+        if (eventType === 'session.created' && !sessionReady) {
+          if (DEBUG) console.log('[Qwen] Session created')
+
+          // Build tools config
+          const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({
+            type: 'function' as const,
+            name: td.name,
+            description: td.description,
+            parameters: td.parameters
+          }))
+
+          // Send session.update with full configuration
+          const basePrompt = VOICE_SYSTEM_PROMPT + VOICE_CHINESE_LANGUAGE_BLOCK
+          const instructions = config.initialContext
+            ? `${basePrompt}\n\n[Current Context]\n${config.initialContext}`
+            : basePrompt
+
+          sendEvent('session.update', {
+            session: {
+              modalities: ['text', 'audio'],
+              voice: QWEN_REALTIME_VOICE,
+              input_audio_format: 'pcm',
+              output_audio_format: 'pcm',
+              instructions,
+              temperature: 0.7,
+              turn_detection: {
+                type: 'server_vad',
+                threshold: 0.5,
+                silence_duration_ms: 800,
+                prefix_padding_ms: 300
+              },
+              tools,
+              tool_choice: 'auto'
+            }
+          })
+          return
+        }
+
+        // Session updated - ready to go
+        if (eventType === 'session.updated') {
+          sessionReady = true
+          if (DEBUG) console.log('[Qwen] Session configured')
+          state.statusCallback?.('connected')
+          startAudioCapture(state.playbackContext!)
+          resolve()
+          return
+        }
+
+        // Audio output streaming
+        if (eventType === 'response.audio.delta') {
+          const delta = data.delta as string
+          if (delta) {
+            state.player?.enqueue(delta)
+          }
+          return
+        }
+
+        // Text transcript (for debug)
+        if (eventType === 'response.audio_transcript.delta' && DEBUG) {
+          console.log('[Qwen] Transcript:', data.delta)
+          return
+        }
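+
+        // Tool-call round trip, as handled below: once the server finishes
+        // streaming arguments we run the matching client tool, return its
+        // result via conversation.item.create (type 'function_call_output'),
+        // then issue response.create so the model resumes the spoken turn.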
+
+        // Function call complete
+        if (eventType === 'response.function_call_arguments.done') {
+          const callId = data.call_id as string
+          const fnName = data.name as string
+          const argsStr = data.arguments as string
+
+          if (DEBUG) console.log('[Qwen] Tool call:', fnName, argsStr)
+
+          let args: Record<string, unknown> = {}
+          try { args = JSON.parse(argsStr) } catch { /* empty */ }
+
+          // Execute the tool
+          const handler = fnName === 'messageCodingAgent'
+            ? realtimeClientTools.messageCodingAgent
+            : fnName === 'processPermissionRequest'
+              ? realtimeClientTools.processPermissionRequest
+              : null
+
+          const result = handler
+            ? await handler(args)
+            : `error (unknown tool: ${fnName})`
+
+          // Send function result back
+          sendEvent('conversation.item.create', {
+            item: {
+              type: 'function_call_output',
+              call_id: callId,
+              output: typeof result === 'string' ? result : JSON.stringify(result)
+            }
+          })
+          // Trigger model to continue
+          sendEvent('response.create')
+          return
+        }
+
+        // VAD: user started speaking - barge-in
+        if (eventType === 'input_audio_buffer.speech_started') {
+          if (state.player?.isPlaying()) {
+            state.player.clearQueue()
+          }
+          return
+        }
+
+        // Response done
+        if (eventType === 'response.done' && DEBUG) {
+          const resp = data.response as Record<string, unknown> | undefined
+          const usage = resp?.usage as Record<string, unknown> | undefined
+          if (usage) console.log('[Qwen] Usage:', usage)
+          return
+        }
+
+        // Error
+        if (eventType === 'error') {
+          const err = data.error as { message?: string } | undefined
+          const message = err?.message || 'Realtime session setup failed'
+          console.error('[Qwen] Server error:', message)
+          state.statusCallback?.('error', message)
+          if (!sessionReady) {
+            reject(new Error(message))
+            ws.close()
+          }
+          return
+        }
+      }
+
+      ws.onerror = (event) => {
+        console.error('[Qwen] WebSocket error:', event)
+        if (!sessionReady) {
+          state.statusCallback?.('error', 'WebSocket connection failed')
+          reject(new Error('WebSocket connection failed'))
+        }
+      }
+
+      ws.onclose = (event) => {
+        if (DEBUG) console.log('[Qwen] WebSocket closed:', event.code, event.reason)
+        cleanup()
+        resetRealtimeSessionState()
+        if (!sessionReady) {
+          const message = event.reason || 'WebSocket closed before setup completed'
+          state.statusCallback?.('error', message)
+          reject(new Error(message))
+          return
+        }
+        state.statusCallback?.('disconnected')
+      }
+    })
+  }
+
+  async endSession(): Promise<void> {
+    cleanup()
+    resetRealtimeSessionState()
+    state.statusCallback?.('disconnected')
+  }
+
+  sendTextMessage(message: string): void {
+    // Send text as a user message via conversation.item.create
+    sendEvent('conversation.item.create', {
+      item: {
+        type: 'message',
+        role: 'user',
+        content: [{ type: 'input_text', text: message }]
+      }
+    })
+    sendEvent('response.create')
+  }
+
+  sendContextualUpdate(update: string): void {
+    // Send context as a system-like user message
+    sendEvent('conversation.item.create', {
+      item: {
+        type: 'message',
+        role: 'user',
+        content: [{ type: 'input_text', text: `[System Context Update] ${update}` }]
+      }
+    })
+  }
+}
+
+function startAudioCapture(playbackContext: AudioContext): void {
+  state.player = new GeminiAudioPlayer(playbackContext)
+  state.recorder = new GeminiAudioRecorder()
+
+  state.recorder.start(
+    (base64Pcm) => {
+      sendEvent('input_audio_buffer.append', { audio: base64Pcm })
+    },
+    (error) => {
+      console.error('[Qwen] Audio capture error:', error)
+      state.statusCallback?.('error', 'Microphone error')
+    }
+  )
+
+  // Apply initial mute state — the React effect may have run before the recorder existed
+  if (state.micMuted) {
+    state.recorder.setMuted(true)
+  }
+}
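+
+// Design note: the module-level `state` object makes this backend a singleton
+// by construction (at most one live Qwen session per page), which matches the
+// single-active-voice-session model assumed by the register/reset helpers
+// imported above.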
+
+// --- React component ---
+
+export interface QwenVoiceSessionProps {
+  api: ApiClient
+  micMuted?: boolean
+  onStatusChange?: StatusCallback
+  onRegistered?: () => void
+  getSession?: (sessionId: string) => Session | null
+  sendMessage?: (sessionId: string, message: string) => void
+  approvePermission?: (sessionId: string, requestId: string) => Promise<void>
+  denyPermission?: (sessionId: string, requestId: string) => Promise<void>
+}
+
+export function QwenVoiceSession({
+  api,
+  micMuted = false,
+  onStatusChange,
+  onRegistered,
+  getSession,
+  sendMessage,
+  approvePermission,
+  denyPermission
+}: QwenVoiceSessionProps) {
+  const hasRegistered = useRef(false)
+
+  useEffect(() => {
+    state.statusCallback = onStatusChange || null
+    return () => { state.statusCallback = null }
+  }, [onStatusChange])
+
+  useEffect(() => {
+    if (getSession && sendMessage && approvePermission && denyPermission) {
+      registerSessionStore({
+        getSession: (sessionId: string) =>
+          getSession(sessionId) as { agentState?: { requests?: Record<string, unknown> } } | null,
+        sendMessage,
+        approvePermission,
+        denyPermission
+      })
+    }
+  }, [getSession, sendMessage, approvePermission, denyPermission])
+
+  useEffect(() => {
+    if (!hasRegistered.current) {
+      try {
+        registerVoiceSession(new QwenVoiceSessionImpl(api))
+        hasRegistered.current = true
+        onRegistered?.()
+      } catch (error) {
+        console.error('[Qwen] Failed to register voice session:', error)
+      }
+    }
+  }, [api]) // eslint-disable-line react-hooks/exhaustive-deps
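+
+  // The ref guard keeps registration idempotent under React 18 Strict Mode,
+  // where effects run twice in development; only the first run registers.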
+
+  // Sync mic mute state — also persist to module state so startAudioCapture can apply it
+  useEffect(() => {
+    state.micMuted = micMuted
+    if (state.recorder) {
+      state.recorder.setMuted(micMuted)
+    }
+  }, [micMuted])
+
+  useEffect(() => {
+    return () => { cleanup() }
+  }, [])
+
+  return null
+}
diff --git a/web/src/realtime/RealtimeVoiceSession.tsx b/web/src/realtime/RealtimeVoiceSession.tsx
index fff9b7b44b..7bfac5953a 100644
--- a/web/src/realtime/RealtimeVoiceSession.tsx
+++ b/web/src/realtime/RealtimeVoiceSession.tsx
@@ -126,6 +126,7 @@ export interface RealtimeVoiceSessionProps {
   api: ApiClient
   micMuted?: boolean
   onStatusChange?: StatusCallback
+  onRegistered?: () => void
   getSession?: (sessionId: string) => Session | null
   sendMessage?: (sessionId: string, message: string) => void
   approvePermission?: (sessionId: string, requestId: string) => Promise<void>
@@ -136,6 +137,7 @@ export function RealtimeVoiceSession({
   api,
   micMuted: micMutedProp = false,
   onStatusChange,
+  onRegistered,
   getSession,
   sendMessage,
   approvePermission,
@@ -231,6 +233,7 @@ export function RealtimeVoiceSession({
       try {
         registerVoiceSession(new RealtimeVoiceSessionImpl(api))
         hasRegistered.current = true
+        onRegistered?.()
       } catch (error) {
         console.error('[Voice] Failed to register voice session:', error)
       }
diff --git a/web/src/realtime/VoiceBackendSession.tsx b/web/src/realtime/VoiceBackendSession.tsx
new file mode 100644
index 0000000000..b990d07bd7
--- /dev/null
+++ b/web/src/realtime/VoiceBackendSession.tsx
@@ -0,0 +1,63 @@
+import { lazy, Suspense, useCallback, useEffect, useState } from 'react'
+import { RealtimeVoiceSession } from './RealtimeVoiceSession'
+import type { RealtimeVoiceSessionProps } from './RealtimeVoiceSession'
+import { fetchVoiceBackend } from '@/api/voice'
+import type { ApiClient } from '@/api/client'
+import type { VoiceBackendType } from '@hapi/protocol/voice'
+
+// Lazy-load alternative backends to avoid bundling when using ElevenLabs
+const GeminiLiveVoiceSession = lazy(() =>
+  import('./GeminiLiveVoiceSession').then((m) => ({ default: m.GeminiLiveVoiceSession }))
+)
+const QwenVoiceSession = lazy(() =>
+  import('./QwenVoiceSession').then((m) => ({ default: m.QwenVoiceSession }))
+)
+
+export type VoiceBackendSessionProps = RealtimeVoiceSessionProps & {
+  api: ApiClient
+  onReadyChange?: (ready: boolean) => void
+}
+
+/**
+ * Dynamically selects the voice session component based on the hub's configured backend.
+ * Queries GET /voice/backend once on mount and renders the appropriate component.
+ * Only signals readiness after the selected backend has mounted and registered its session.
+ */
+export function VoiceBackendSession(props: VoiceBackendSessionProps) {
+  const [backend, setBackend] = useState<VoiceBackendType | null>(null)
+
+  useEffect(() => {
+    let cancelled = false
+    fetchVoiceBackend(props.api).then((resp) => {
+      if (!cancelled) setBackend(resp.backend)
+    })
+    return () => {
+      cancelled = true
+      props.onReadyChange?.(false)
+    }
+  }, [props.api]) // eslint-disable-line react-hooks/exhaustive-deps
+
+  const handleRegistered = useCallback(() => {
+    props.onReadyChange?.(true)
+  }, [props.onReadyChange])
+
+  if (!backend) return null
+
+  if (backend === 'gemini-live') {
+    return (
+      <Suspense fallback={null}>
+        <GeminiLiveVoiceSession {...props} onRegistered={handleRegistered} />
+      </Suspense>
+    )
+  }
+
+  if (backend === 'qwen-realtime') {
+    return (
+      <Suspense fallback={null}>
+        <QwenVoiceSession {...props} onRegistered={handleRegistered} />
+      </Suspense>
+    )
+  }
+
+  return <RealtimeVoiceSession {...props} onRegistered={handleRegistered} />
+}
diff --git a/web/src/realtime/gemini/audioPlayer.ts b/web/src/realtime/gemini/audioPlayer.ts
new file mode 100644
index 0000000000..23d1d341e4
--- /dev/null
+++ b/web/src/realtime/gemini/audioPlayer.ts
@@ -0,0 +1,75 @@
+import { base64ToArrayBuffer, pcm16ToFloat32 } from './pcmUtils';
+
+export class GeminiAudioPlayer {
+  private audioContext: AudioContext;
+  private ownsContext: boolean;
+  private lastEndTime: number = 0;
+  private activeSources: AudioBufferSourceNode[] = [];
+
+  constructor(audioContext?: AudioContext) {
+    if (audioContext) {
+      this.audioContext = audioContext;
+      this.ownsContext = false;
+    } else {
+      this.audioContext = new AudioContext({ sampleRate: 24000 });
+      this.ownsContext = true;
+    }
+    this.lastEndTime = this.audioContext.currentTime;
+  }
+
+  enqueue(base64Pcm: string): void {
+    if (this.audioContext.state === 'suspended') {
+      this.audioContext.resume();
+    }
+
+    const arrayBuffer = base64ToArrayBuffer(base64Pcm);
+    const float32Data = pcm16ToFloat32(arrayBuffer);
+
+    if (float32Data.length === 0) return;
+
+    const audioBuffer = this.audioContext.createBuffer(1, float32Data.length, 24000);
+    audioBuffer.copyToChannel(new Float32Array(float32Data), 0);
+
+    const source = this.audioContext.createBufferSource();
+    source.buffer = audioBuffer;
+    source.connect(this.audioContext.destination);
+
+    const startTime = Math.max(this.audioContext.currentTime, this.lastEndTime);
+
+    source.onended = () => {
+      const index = this.activeSources.indexOf(source);
+      if (index > -1) {
+        this.activeSources.splice(index, 1);
+      }
+    };
+
+    source.start(startTime);
+    this.activeSources.push(source);
+
+    this.lastEndTime = startTime + audioBuffer.duration;
+  }
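+
+  // Scheduling note: `lastEndTime` is a running clock on the AudioContext
+  // timeline, so chunks play gapless back-to-back. For example, a 4800-sample
+  // chunk at 24 kHz lasts 4800 / 24000 = 0.2 s, so the next chunk is scheduled
+  // to start exactly 200 ms after this one.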
+
+  clearQueue(): void {
+    this.activeSources.forEach(source => {
+      try {
+        source.stop();
+      } catch (e) {
+        // Ignore if already stopped
+      }
+      source.disconnect();
+    });
+    this.activeSources = [];
+    this.lastEndTime = this.audioContext.currentTime;
+  }
+
+  isPlaying(): boolean {
+    return this.lastEndTime > this.audioContext.currentTime;
+  }
+
+  dispose(): void {
+    this.clearQueue();
+    if (this.ownsContext && this.audioContext.state !== 'closed') {
+      this.audioContext.close();
+    }
+  }
+}
diff --git a/web/src/realtime/gemini/audioRecorder.ts b/web/src/realtime/gemini/audioRecorder.ts
new file mode 100644
index 0000000000..98813212a0
--- /dev/null
+++ b/web/src/realtime/gemini/audioRecorder.ts
@@ -0,0 +1,139 @@
+import { float32ToPcm16, arrayBufferToBase64 } from './pcmUtils';
+
+// Inline worklet source to avoid Vite bundling issues with ?url imports.
+// AudioWorklet.addModule() requires a URL to valid JS, so we create a Blob URL.
+const WORKLET_SOURCE = `
+class PcmRecorderProcessor extends AudioWorkletProcessor {
+  constructor() {
+    super();
+    this.buffer = new Float32Array(4096);
+    this.idx = 0;
+  }
+  process(inputs) {
+    const input = inputs[0];
+    if (input && input.length > 0) {
+      const channel = input[0];
+      for (let i = 0; i < channel.length; i++) {
+        this.buffer[this.idx++] = channel[i];
+        if (this.idx >= 4096) {
+          this.port.postMessage({ samples: this.buffer.slice() });
+          this.idx = 0;
+        }
+      }
+    }
+    return true;
+  }
+}
+registerProcessor('pcm-recorder-processor', PcmRecorderProcessor);
+`;
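+
+// Buffering math: at the 16 kHz capture rate used below, a full 4096-sample
+// buffer represents 4096 / 16000 = 0.256 s, so the worklet posts one PCM
+// message (and the client sends one WebSocket frame) roughly every 256 ms.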
+
+function createWorkletUrl(): string {
+  const blob = new Blob([WORKLET_SOURCE], { type: 'application/javascript' });
+  return URL.createObjectURL(blob);
+}
+
+export class GeminiAudioRecorder {
+  private audioContext: AudioContext | null = null;
+  private mediaStream: MediaStream | null = null;
+  private sourceNode: MediaStreamAudioSourceNode | null = null;
+  private workletNode: AudioWorkletNode | null = null;
+  private scriptNode: ScriptProcessorNode | null = null;
+
+  async start(onChunk: (base64Pcm: string) => void, onError?: (error: Error) => void): Promise<void> {
+    try {
+      this.mediaStream = await navigator.mediaDevices.getUserMedia({
+        audio: { sampleRate: 16000, channelCount: 1 }
+      });
+
+      this.mediaStream.getTracks().forEach((track) => {
+        track.onended = () => {
+          if (onError) onError(new Error('Microphone disconnected'));
+        };
+      });
+
+      this.audioContext = new AudioContext({ sampleRate: 16000 });
+      if (this.audioContext.state === 'suspended') {
+        await this.audioContext.resume();
+      }
+
+      this.sourceNode = this.audioContext.createMediaStreamSource(this.mediaStream);
+
+      try {
+        const workletUrl = createWorkletUrl();
+        await this.audioContext.audioWorklet.addModule(workletUrl);
+        URL.revokeObjectURL(workletUrl);
+
+        this.workletNode = new AudioWorkletNode(this.audioContext, 'pcm-recorder-processor');
+        this.workletNode.port.onmessage = (event) => {
+          const pcm16 = float32ToPcm16(event.data.samples);
+          const base64 = arrayBufferToBase64(pcm16);
+          onChunk(base64);
+        };
+        // Connect source → worklet → silent sink → destination.
+        // The downstream connection is required so the audio graph pulls
+        // frames through the worklet node and port.onmessage fires.
+        const sink = this.audioContext.createGain();
+        sink.gain.value = 0;
+        this.sourceNode.connect(this.workletNode);
+        this.workletNode.connect(sink);
+        sink.connect(this.audioContext.destination);
+      } catch (e) {
+        console.warn('[GeminiLive] AudioWorklet failed, falling back to ScriptProcessorNode', e);
+        this.scriptNode = this.audioContext.createScriptProcessor(4096, 1, 1);
+        this.scriptNode.onaudioprocess = (event) => {
+          const inputData = event.inputBuffer.getChannelData(0);
+          const pcm16 = float32ToPcm16(new Float32Array(inputData));
+          const base64 = arrayBufferToBase64(pcm16);
+          onChunk(base64);
+        };
+        this.sourceNode.connect(this.scriptNode);
+        this.scriptNode.connect(this.audioContext.destination);
+      }
+    } catch (e) {
+      if (onError) onError(e instanceof Error ? e : new Error(String(e)));
+      throw e;
+    }
+  }
+
+  stop(): void {
+    if (this.mediaStream) {
+      this.mediaStream.getTracks().forEach(track => {
+        track.onended = null;
+        track.stop();
+      });
+      this.mediaStream = null;
+    }
+
+    if (this.scriptNode) {
+      this.scriptNode.disconnect();
+      this.scriptNode = null;
+    }
+
+    if (this.workletNode) {
+      this.workletNode.disconnect();
+      this.workletNode = null;
+    }
+
+    if (this.sourceNode) {
+      this.sourceNode.disconnect();
+      this.sourceNode = null;
+    }
+
+    if (this.audioContext) {
+      this.audioContext.close();
+      this.audioContext = null;
+    }
+  }
+
+  setMuted(muted: boolean): void {
+    if (this.mediaStream) {
+      this.mediaStream.getAudioTracks().forEach(track => {
+        track.enabled = !muted;
+      });
+    }
+  }
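+
+  // Muting via track.enabled keeps the capture graph running: a disabled
+  // track produces silent frames, so the worklet stays alive and unmuting
+  // resumes instantly without rebuilding the audio graph.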
+
+  dispose(): void {
+    this.stop();
+  }
+}
diff --git a/web/src/realtime/gemini/pcm-recorder.worklet.ts b/web/src/realtime/gemini/pcm-recorder.worklet.ts
new file mode 100644
index 0000000000..404f65445b
--- /dev/null
+++ b/web/src/realtime/gemini/pcm-recorder.worklet.ts
@@ -0,0 +1,35 @@
+// AudioWorklet processor runs in a separate scope with its own globals.
+// These declarations satisfy TypeScript without pulling in DOM lib types.
+declare class AudioWorkletProcessor {
+  readonly port: MessagePort
+  constructor()
+}
+declare function registerProcessor(name: string, ctor: new () => AudioWorkletProcessor): void
+
+class PcmRecorderProcessor extends AudioWorkletProcessor {
+  private buffer: Float32Array;
+  private bufferSize = 4096;
+  private bufferIndex = 0;
+
+  constructor() {
+    super();
+    this.buffer = new Float32Array(this.bufferSize);
+  }
+
+  process(inputs: Float32Array[][]): boolean {
+    const input = inputs[0];
+    if (input && input.length > 0) {
+      const channel = input[0];
+      for (let i = 0; i < channel.length; i++) {
+        this.buffer[this.bufferIndex++] = channel[i];
+        if (this.bufferIndex >= this.bufferSize) {
+          this.port.postMessage({ samples: this.buffer.slice() });
+          this.bufferIndex = 0;
+        }
+      }
+    }
+    return true;
+  }
+}
+
+registerProcessor('pcm-recorder-processor', PcmRecorderProcessor);
diff --git a/web/src/realtime/gemini/pcmUtils.test.ts b/web/src/realtime/gemini/pcmUtils.test.ts
new file mode 100644
index 0000000000..2e0be05c3f
--- /dev/null
+++ b/web/src/realtime/gemini/pcmUtils.test.ts
@@ -0,0 +1,60 @@
+import { describe, test, expect } from 'bun:test'
+import {
+  float32ToPcm16,
+  pcm16ToFloat32,
+  arrayBufferToBase64,
+  base64ToArrayBuffer
+} from './pcmUtils'
+
+describe('pcmUtils', () => {
+  describe('float32ToPcm16 / pcm16ToFloat32 round-trip', () => {
+    test('preserves signal within quantization error', () => {
+      const input = new Float32Array([0, 0.5, -0.5, 1.0, -1.0])
+      const pcm16 = float32ToPcm16(input)
+      const output = pcm16ToFloat32(pcm16)
+
+      expect(output.length).toBe(input.length)
+      for (let i = 0; i < input.length; i++) {
+        expect(Math.abs(output[i] - input[i])).toBeLessThan(0.001)
+      }
+    })
+
+    test('clamps values outside [-1, 1]', () => {
+      const input = new Float32Array([2.0, -2.0])
+      const pcm16 = float32ToPcm16(input)
+      const output = pcm16ToFloat32(pcm16)
+
+      expect(Math.abs(output[0] - 1.0)).toBeLessThan(0.001)
+      expect(Math.abs(output[1] - (-1.0))).toBeLessThan(0.001)
+    })
+
+    test('handles empty input', () => {
+      const input = new Float32Array(0)
+      const pcm16 = float32ToPcm16(input)
+      expect(pcm16.byteLength).toBe(0)
+      const output = pcm16ToFloat32(pcm16)
+      expect(output.length).toBe(0)
+    })
+  })
+
+  describe('arrayBufferToBase64 / base64ToArrayBuffer round-trip', () => {
+    test('preserves binary data', () => {
+      const original = new Uint8Array([0, 1, 127, 128, 255])
+      const base64 = arrayBufferToBase64(original.buffer)
+      const restored = new Uint8Array(base64ToArrayBuffer(base64))
+
+      expect(restored.length).toBe(original.length)
+      for (let i = 0; i < original.length; i++) {
+        expect(restored[i]).toBe(original[i])
+      }
+    })
+
+    test('handles empty buffer', () => {
+      const empty = new ArrayBuffer(0)
+      const base64 = arrayBufferToBase64(empty)
+      expect(base64).toBe('')
+      const restored = base64ToArrayBuffer(base64)
+      expect(restored.byteLength).toBe(0)
+    })
+  })
+})
diff --git a/web/src/realtime/gemini/pcmUtils.ts b/web/src/realtime/gemini/pcmUtils.ts
new file mode 100644
index 0000000000..67e2928fc0
--- /dev/null
+++ b/web/src/realtime/gemini/pcmUtils.ts
@@ -0,0 +1,39 @@
+export function float32ToPcm16(samples: Float32Array): ArrayBuffer {
+  const buffer = new ArrayBuffer(samples.length * 2);
+  const view = new DataView(buffer);
+  for (let i = 0; i < samples.length; i++) {
+    let s = Math.max(-1, Math.min(1, samples[i]));
+    s = s < 0 ? s * 0x8000 : s * 0x7FFF;
+    view.setInt16(i * 2, s, true);
+  }
+  return buffer;
+}
+
+export function pcm16ToFloat32(buffer: ArrayBuffer): Float32Array {
+  const int16Array = new Int16Array(buffer);
+  const float32Array = new Float32Array(int16Array.length);
+  for (let i = 0; i < int16Array.length; i++) {
+    const s = int16Array[i];
+    float32Array[i] = s < 0 ? s / 0x8000 : s / 0x7FFF;
+  }
+  return float32Array;
+}
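+
+// Scaling note: negatives scale by 0x8000 (32768) and positives by 0x7FFF
+// (32767) because int16 is asymmetric. Example: 0.5 encodes to
+// truncate(0.5 * 32767) = 16383, which decodes back to 16383 / 32767
+// ≈ 0.49997, inside the 0.001 tolerance the tests above assert.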
+
+export function arrayBufferToBase64(buffer: ArrayBuffer): string {
+  let binary = '';
+  const bytes = new Uint8Array(buffer);
+  const len = bytes.byteLength;
+  for (let i = 0; i < len; i++) {
+    binary += String.fromCharCode(bytes[i]);
+  }
+  return btoa(binary);
+}
+
+export function base64ToArrayBuffer(base64: string): ArrayBuffer {
+  const binary = atob(base64);
+  const bytes = new Uint8Array(binary.length);
+  for (let i = 0; i < binary.length; i++) {
+    bytes[i] = binary.charCodeAt(i);
+  }
+  return bytes.buffer;
+}
diff --git a/web/src/realtime/gemini/toolAdapter.test.ts b/web/src/realtime/gemini/toolAdapter.test.ts
new file mode 100644
index 0000000000..5d98d6d4d0
--- /dev/null
+++ b/web/src/realtime/gemini/toolAdapter.test.ts
@@ -0,0 +1,28 @@
+import { describe, test, expect } from 'bun:test'
+import { handleGeminiFunctionCall, handleGeminiFunctionCalls } from './toolAdapter'
+import type { GeminiFunctionCall } from './toolAdapter'
+
+describe('toolAdapter', () => {
+  test('returns error for unknown tool', async () => {
+    const call: GeminiFunctionCall = {
+      name: 'unknownTool',
+      args: {},
+      id: 'call-1'
+    }
+    const resp = await handleGeminiFunctionCall(call)
+    expect(resp.name).toBe('unknownTool')
+    expect(resp.id).toBe('call-1')
+    expect(resp.response.result).toContain('unknown tool')
+  })
+
+  test('handles multiple calls sequentially, preserving order', async () => {
+    const calls: GeminiFunctionCall[] = [
+      { name: 'unknownA', args: {}, id: 'a' },
+      { name: 'unknownB', args: {}, id: 'b' }
+    ]
+    const responses = await handleGeminiFunctionCalls(calls)
+    expect(responses.length).toBe(2)
+    expect(responses[0].id).toBe('a')
+    expect(responses[1].id).toBe('b')
+  })
+})
diff --git a/web/src/realtime/gemini/toolAdapter.ts b/web/src/realtime/gemini/toolAdapter.ts
new file mode 100644
index 0000000000..dbf4dee9c9
--- /dev/null
+++ b/web/src/realtime/gemini/toolAdapter.ts
@@ -0,0 +1,76 @@
+import { realtimeClientTools } from '../realtimeClientTools'
+
+/**
+ * Gemini Live API function call from server.
+ * Matches the `toolCall` shape in a BidiGenerateContent serverMessage.
+ */
+export interface GeminiFunctionCall {
+  name: string
+  args: Record<string, unknown>
+  id: string
+}
+
+/**
+ * Response sent back to Gemini Live via `toolResponse`.
+ */
+export interface GeminiFunctionResponse {
+  name: string
+  id: string
+  response: { result: string }
+}
+
+type ClientToolHandler = (parameters: unknown) => Promise<string>
+
+const toolHandlers: Record<string, ClientToolHandler | undefined> = {
+  messageCodingAgent: realtimeClientTools.messageCodingAgent,
+  processPermissionRequest: realtimeClientTools.processPermissionRequest
+}
+
+/**
+ * Execute a Gemini Live function call using the existing client tool handlers.
+ * Returns a GeminiFunctionResponse ready to send back over the WebSocket.
+ */
+export async function handleGeminiFunctionCall(
+  call: GeminiFunctionCall
+): Promise<GeminiFunctionResponse> {
+  const handler = toolHandlers[call.name]
+
+  if (!handler) {
+    return {
+      name: call.name,
+      id: call.id,
+      response: { result: `error (unknown tool: ${call.name})` }
+    }
+  }
+
+  try {
+    const result = await handler(call.args)
+    return {
+      name: call.name,
+      id: call.id,
+      response: { result }
+    }
+  } catch (error) {
+    const message = error instanceof Error ? error.message : 'unknown error'
+    return {
+      name: call.name,
+      id: call.id,
+      response: { result: `error (${message})` }
+    }
+  }
+}
+
+/**
+ * Process multiple function calls sequentially to avoid racing on shared
+ * session state (e.g. processPermissionRequest resolving the same pending
+ * request twice when calls run in parallel).
+ */
+export async function handleGeminiFunctionCalls(
+  calls: GeminiFunctionCall[]
+): Promise<GeminiFunctionResponse[]> {
+  const responses: GeminiFunctionResponse[] = []
+  for (const call of calls) {
+    responses.push(await handleGeminiFunctionCall(call))
+  }
+  return responses
+}
diff --git a/web/src/realtime/index.ts b/web/src/realtime/index.ts
index a7fa2fbe99..1e080123b5 100644
--- a/web/src/realtime/index.ts
+++ b/web/src/realtime/index.ts
@@ -15,8 +15,11 @@ export {
 // Client tools
 export { realtimeClientTools, registerSessionStore } from './realtimeClientTools'
 
-// Voice session component
+// Voice session components
 export { RealtimeVoiceSession, type RealtimeVoiceSessionProps } from './RealtimeVoiceSession'
+export { GeminiLiveVoiceSession, type GeminiLiveVoiceSessionProps } from './GeminiLiveVoiceSession'
+export { QwenVoiceSession, type QwenVoiceSessionProps } from './QwenVoiceSession'
+export { VoiceBackendSession, type VoiceBackendSessionProps } from './VoiceBackendSession'
 
 // Voice hooks
 export { voiceHooks, registerVoiceHooksStore } from './hooks/voiceHooks'
diff --git a/web/src/sw.ts b/web/src/sw.ts
index ebe55dc0a7..62be9dd29d 100644
--- a/web/src/sw.ts
+++ b/web/src/sw.ts
@@ -21,6 +21,9 @@ type PushPayload = {
   }
 }
 
+// Let the new service worker wait until all tabs close before activating.
+// Immediate skipWaiting + clientsClaim can break lazy-loaded chunks (e.g. voice)
+// when the old app shell requests hashes that the new precache no longer serves.
 precacheAndRoute(self.__WB_MANIFEST)
 
 registerRoute(
diff --git a/web/tsconfig.json b/web/tsconfig.json
index 8b0682a4bb..de7bcdca50 100644
--- a/web/tsconfig.json
+++ b/web/tsconfig.json
@@ -11,5 +11,6 @@
       "@/*": ["./src/*"]
     }
   },
-  "include": ["src"]
+  "include": ["src"],
+  "exclude": ["src/**/*.test.ts", "src/**/*.test.tsx", "src/**/*.spec.ts", "src/**/*.spec.tsx"]
 }