feat(voice): Context-aware voice interaction using Gemini 3 Flash #397

@tangyu

Problem

The current ElevenLabs voice solution has the following limitations:

  • Limited code understanding: it cannot interpret Claude Code context
  • High cost

Proposal

Use Gemini 3 Flash as the voice understanding layer, leveraging its native multimodal capabilities for smarter voice interaction.

Background: Why Gemini 3 Flash

| Capability | Description |
| --- | --- |
| Native multimodal | Unified understanding of audio + text + code |
| 24 languages | Including Chinese, English, etc. |
| API support | Supports streaming and non-streaming calls |

Core Features

1. Enhancement Level System (Configurable)

| Level | Feature | Example |
| --- | --- | --- |
| Level 0 | Pure ASR | Voice-to-text only, no context |
| Level 1 ⭐ | Voice tolerance | "ree-act" → "React", inferred from project context |
| Level 2 ⭐⭐ | Reference resolution | "this" → current focused file |
| Level 3 | Intent enhancement | "add a button" → "add a submit button at the bottom of the form" |
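
To make the configurable levels concrete, here is a minimal sketch of how a level setting could select the instructions sent to the model. The names `EnhancementLevel` and `build_instructions` and the rule wording are hypothetical, and the sketch assumes levels are cumulative (Level 2 also applies Level 1), which the proposal does not state.

```python
from enum import IntEnum


class EnhancementLevel(IntEnum):
    """Hypothetical names for the four levels listed in the table."""
    PURE_ASR = 0
    VOICE_TOLERANCE = 1
    REFERENCE_RESOLUTION = 2
    INTENT_ENHANCEMENT = 3


# Rule text per level. The wording is illustrative, not taken from the proposal.
_LEVEL_RULES = {
    EnhancementLevel.VOICE_TOLERANCE:
        "Correct misheard technical terms using the project context.",
    EnhancementLevel.REFERENCE_RESOLUTION:
        "Replace vague references such as 'this' with the focused file.",
    EnhancementLevel.INTENT_ENHANCEMENT:
        "Expand terse requests into a specific instruction.",
}


def build_instructions(level: EnhancementLevel) -> list[str]:
    """Return the prompt rules that apply at the given level.

    Assumption: levels are cumulative, so a higher level keeps the
    rules of every lower level.
    """
    rules = ["Transcribe the user's speech to text."]
    rules += [text for lvl, text in _LEVEL_RULES.items() if lvl <= level]
    return rules
```

With this shape, Level 0 yields only the transcription rule, and each higher level appends one more rule, so the level can be exposed as a single integer in user configuration.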

2. User Scenarios

| User says | Current solution | Gemini solution |
| --- | --- | --- |
| "change this to hoox" | ❌ Cannot understand | ✅ "change UserProfile.tsx to use hooks" |
| "allow" | ✅ Approve permission | ✅ Approve permission |
| "rewrite that with type-scrip" | ❌ Transcription error | ✅ "rewrite utils.js with TypeScript" |

3. Architecture

User voice → Gemini 3 Flash → Text instruction → Claude Code
                  ↑
         Session context (history, permission requests, tool state)
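
The flow in the diagram could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: `SessionContext`, `VoiceModel`, and `voice_to_instruction` are hypothetical names, and the model is passed in as a plain callable so that no specific Gemini client API is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SessionContext:
    """Hypothetical container for the context named in the diagram."""
    history: list[str] = field(default_factory=list)
    pending_permission: Optional[str] = None  # e.g. a tool awaiting approval
    tool_state: dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        """Flatten the context into text that accompanies the audio."""
        lines = ["Recent history:"] + [f"- {h}" for h in self.history]
        if self.pending_permission:
            lines.append(f"Pending permission request: {self.pending_permission}")
        for key, value in self.tool_state.items():
            lines.append(f"{key}: {value}")
        return "\n".join(lines)


# Stand-in for the model call: takes audio bytes and a text prompt, returns text.
VoiceModel = Callable[[bytes, str], str]


def voice_to_instruction(audio: bytes, ctx: SessionContext, model: VoiceModel) -> str:
    """Turn user voice plus session context into a text instruction."""
    instruction = model(audio, ctx.to_prompt()).strip()
    # Record the result so later utterances can refer back to it.
    ctx.history.append(instruction)
    return instruction
```

Injecting the model as a callable keeps the pipeline testable with a stub and leaves the choice of streaming or non-streaming calls to the caller.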
