Problem
The current ElevenLabs voice solution has the following limitations:
- Limited code understanding: it cannot truly comprehend Claude Code context
- High cost
Proposal
Use Gemini 3 Flash as the voice understanding layer, leveraging its native multimodal capabilities for smarter voice interaction.
Background: Why Gemini 3 Flash
| Capability | Description |
|---|---|
| Native multimodal | Unified understanding of audio + text + code |
| 24 languages | Including Chinese, English, etc. |
| API support | Supports streaming and non-streaming calls |
Core Features
1. Enhancement Level System (Configurable)
| Level | Feature | Behavior / example |
|---|---|---|
| Level 0 | Pure ASR | Voice-to-text only, no context |
| Level 1 ⭐ | Voice tolerance | "ree-act" → "React", inferred from project context |
| Level 2 ⭐⭐ | Reference resolution | "this" → current focused file |
| Level 3 | Intent enhancement | "add a button" → "add a submit button at the bottom of the form" |
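One way the configurable levels could map to behavior is to treat them as cumulative, so each level adds instructions on top of the ones below it. The sketch below assumes that reading; `EnhancementLevel`, `LevelContext`, and `buildInstructions` are hypothetical names for illustration, not part of any existing API, and the instruction wording is a placeholder.

```typescript
// Hypothetical sketch: all names here are illustrative assumptions.
enum EnhancementLevel {
  PureASR = 0,
  VoiceTolerance = 1,
  ReferenceResolution = 2,
  IntentEnhancement = 3,
}

// Minimal context needed by the levels described in the table (assumed shape).
interface LevelContext {
  projectTerms: string[]; // e.g. ["React", "TypeScript"]
  focusedFile?: string;   // e.g. "UserProfile.tsx"
}

// Builds the list of instructions handed to the voice model.
// Levels are assumed to be cumulative: level N includes everything below it.
function buildInstructions(level: EnhancementLevel, ctx: LevelContext): string[] {
  const lines: string[] = ["Transcribe the user's speech to text."];
  if (level >= EnhancementLevel.VoiceTolerance) {
    lines.push(
      `Correct misheard terms using project vocabulary: ${ctx.projectTerms.join(", ")}.`
    );
  }
  if (level >= EnhancementLevel.ReferenceResolution && ctx.focusedFile) {
    lines.push(
      `Resolve references such as "this" to the focused file: ${ctx.focusedFile}.`
    );
  }
  if (level >= EnhancementLevel.IntentEnhancement) {
    lines.push("Expand terse requests into a more specific instruction.");
  }
  return lines;
}
```

With Level 0 only the transcription instruction is emitted, which keeps the pure-ASR path free of any session context.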
2. User Scenarios
| User says | Current solution | Gemini solution |
|---|---|---|
| "change this to hoox" | ❌ Cannot understand | ✅ "change UserProfile.tsx to use hooks" |
| "allow" | ✅ Approve permission | ✅ Approve permission |
| "rewrite that with type-scrip" | ❌ Transcription error | ✅ "rewrite utils.js with TypeScript" |
3. Architecture
User voice → Gemini 3 Flash → Text instruction → Claude Code
                   ↑
   Session context (history, permission requests, tool state)
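The flow above can be sketched as a small pipeline in which the session context is rendered to text and passed to the model alongside the audio. This is a minimal sketch under stated assumptions: `SessionContext`, `VoiceModel`, `renderContext`, and `voiceToInstruction` are hypothetical names, and the model is injected as a function so that no particular Gemini client API is implied.

```typescript
// Hypothetical sketch: shapes and names are assumptions, not an existing API.
interface SessionContext {
  history: string[];                 // recent conversation or tool messages
  pendingPermission?: string;        // e.g. "run npm test"
  toolState: Record<string, string>; // e.g. { focusedFile: "UserProfile.tsx" }
}

// The voice model is injected; a real implementation would wrap the Gemini API here.
type VoiceModel = (audio: Uint8Array, contextPrompt: string) => Promise<string>;

// Renders session context into plain text the model can condition on.
function renderContext(ctx: SessionContext): string {
  const parts: string[] = [];
  if (ctx.history.length > 0) {
    // Keep only the five most recent entries to bound prompt size (arbitrary choice).
    parts.push(`Recent history:\n${ctx.history.slice(-5).join("\n")}`);
  }
  if (ctx.pendingPermission) {
    parts.push(`Pending permission request: ${ctx.pendingPermission}`);
  }
  const tools = Object.keys(ctx.toolState).map((k) => `${k}=${ctx.toolState[k]}`);
  if (tools.length > 0) {
    parts.push(`Tool state: ${tools.join(", ")}`);
  }
  return parts.join("\n\n");
}

// User voice -> model (with session context) -> text instruction for Claude Code.
async function voiceToInstruction(
  audio: Uint8Array,
  ctx: SessionContext,
  model: VoiceModel
): Promise<string> {
  const instruction = await model(audio, renderContext(ctx));
  return instruction.trim();
}
```

Injecting the model as a function keeps the context-rendering logic testable with a stub, independent of whichever streaming or non-streaming call is eventually used.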