feat(voice): Context-aware voice interaction using Gemini 3 Flash #397

## Problem

The current ElevenLabs voice solution has two limitations:
- Limited code understanding: it does not comprehend the Claude Code context
- High cost

## Proposal

Use Gemini 3 Flash as the voice understanding layer, leveraging its native multimodal capabilities to interpret speech against the Claude Code session context.

## Background: Why Gemini 3 Flash

| Capability | Description |
|------------|-------------|
| Native multimodal | Unified understanding of audio + text + code |
| 24 languages | Including Chinese, English, etc. |
| API support | Supports streaming and non-streaming calls |

## Core Features

### 1. Enhancement Level System (Configurable)

| Level | Feature | Behavior / example |
|-------|---------|---------|
| Level 0 | Pure ASR | Voice-to-text only, no context |
| Level 1 ⭐ | Voice tolerance | "ree-act" → "React", inferred from project context |
| Level 2 ⭐⭐ | Reference resolution | "this" → current focused file |
| Level 3 | Intent enhancement | "add a button" → "add a submit button at the bottom of the form" |
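
A minimal sketch of how the configurable levels could be represented. It assumes that each level includes the behavior of the levels below it, which the table does not state explicitly. The names `EnhancementLevel`, `LEVEL_INSTRUCTIONS`, and `instructions_for`, and the instruction wording, are hypothetical.

```python
from enum import IntEnum


class EnhancementLevel(IntEnum):
    """Levels from the table above (names are hypothetical)."""
    PURE_ASR = 0
    VOICE_TOLERANCE = 1
    REFERENCE_RESOLUTION = 2
    INTENT_ENHANCEMENT = 3


# Hypothetical instruction text per level; the wording is illustrative only.
LEVEL_INSTRUCTIONS = {
    EnhancementLevel.PURE_ASR: "Transcribe the audio verbatim.",
    EnhancementLevel.VOICE_TOLERANCE: "Correct misheard technical terms using the project context.",
    EnhancementLevel.REFERENCE_RESOLUTION: "Replace words like 'this' or 'that' with the file they refer to.",
    EnhancementLevel.INTENT_ENHANCEMENT: "Expand terse requests into specific instructions.",
}


def instructions_for(level: EnhancementLevel) -> list[str]:
    """Return the instructions for the chosen level and every level below it.

    Assumption: levels are cumulative.
    """
    return [text for lvl, text in LEVEL_INSTRUCTIONS.items() if lvl <= level]
```

With this shape, a user setting such as `EnhancementLevel(2)` would enable transcription, voice tolerance, and reference resolution while leaving intent enhancement off.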

### 2. User Scenarios

| User says | Current solution | Gemini solution |
|-----------|------------------|-----------------|
| "change this to hoox" | ❌ Cannot understand | ✅ "change UserProfile.tsx to use hooks" |
| "allow" | ✅ Approve permission | ✅ Approve permission |
| "rewrite that with type-scrip" | ❌ Transcription error | ✅ "rewrite utils.js with TypeScript" |
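
The "allow" row suggests that short permission replies can be handled separately from free-form instructions. One possible routing step is sketched below. The function name `route_transcript`, the phrase set, and the returned tuple shape are assumptions for illustration and not part of the existing implementation.

```python
from typing import Optional

# Hypothetical phrase set; only "allow" appears in the scenarios above.
APPROVE_PHRASES = {"allow"}


def route_transcript(text: str, pending_permission: Optional[str]) -> tuple[str, str]:
    """Return (kind, payload): a permission approval or a plain instruction.

    A reply is treated as an approval only while a permission request is pending.
    """
    normalized = text.strip().lower().rstrip(".!")
    if pending_permission is not None and normalized in APPROVE_PHRASES:
        return ("approve", pending_permission)
    return ("instruction", text.strip())
```

Anything that is not a recognized permission reply falls through as an instruction, which is where the context-aware rewriting would apply.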

### 3. Architecture

```
User voice → Gemini 3 Flash → Text instruction → Claude Code
                  ↑
         Session context (history, permission requests, tool state)
```
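
The context input in the diagram could be rendered as text and sent alongside the audio. The sketch below only shows the prompt assembly; it makes no Gemini API call. `SessionContext`, `build_context_prompt`, the field names, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    """Hypothetical container for the context named in the diagram."""
    history: list[str] = field(default_factory=list)
    permission_requests: list[str] = field(default_factory=list)
    tool_state: dict[str, str] = field(default_factory=dict)


def build_context_prompt(ctx: SessionContext, max_history: int = 5) -> str:
    """Render session context as text to accompany the user's audio."""
    lines = ["Rewrite the spoken request as a text instruction for Claude Code."]
    # Keep only the most recent history entries to bound the prompt size.
    recent = ctx.history[-max_history:] if max_history > 0 else []
    if recent:
        lines.append("Recent history:")
        lines += [f"- {item}" for item in recent]
    if ctx.permission_requests:
        lines.append("Pending permission requests:")
        lines += [f"- {item}" for item in ctx.permission_requests]
    if ctx.tool_state:
        lines.append("Tool state:")
        lines += [f"- {key}: {value}" for key, value in sorted(ctx.tool_state.items())]
    return "\n".join(lines)
```

For example, a `tool_state` entry such as `{"focused_file": "UserProfile.tsx"}` would give the model what it needs to resolve "this" in the scenario table.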
