Voice Conversation Mode: 3 UX improvements needed (streaming TTS, auto re-listen, barge-in) #652

## Description

The "Start Voice Conversation" feature currently works in a **walkie-talkie style** rather than a true conversation mode. Three key improvements would make it feel like a real-time voice assistant (similar to ChatGPT Advanced Voice Mode).

## Current Behavior

1. The user clicks the button → speaks → the AI generates the **full text response** → converts the **entire response** to audio → plays it
2. After the AI finishes, the mic stays off, so the user must **manually click the button again** for each new turn
3. While the AI is responding (text or audio), the mic is **completely disabled**, so there is no way to interrupt

## Proposed Improvements

### 1. Sentence-level Streaming TTS
**Problem:** The current flow waits for the LLM to finish generating the entire response, then converts all text to audio at once. This creates a noticeable delay, especially for longer responses.

**Proposed:** Split the LLM's streaming text output into sentences, and send each sentence to TTS as soon as it's available. Play audio chunks progressively instead of waiting for the full response.

```
Current:  LLM [full response] → TTS [full audio] → Play
Proposed: LLM [sentence 1] → TTS → Play
          LLM [sentence 2] → TTS → Play (overlap with generation)
          ...
```

Many TTS providers (Edge TTS, MiMo TTS, OpenAI TTS) already support streaming audio output — the infrastructure just needs to be wired up.
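To illustrate the sentence-splitting step, here is a minimal sketch of a buffer that accepts streamed text deltas and emits each sentence once it is complete. `SentenceChunker`, `push`, and `flush` are hypothetical names, not existing Hermes APIs, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a deliberately naive assumption that will mis-split abbreviations such as "e.g. this".

```typescript
// Hypothetical helper: buffers streamed LLM text and emits complete sentences.
class SentenceChunker {
  private buffer = "";

  // Append a streamed text delta; return any sentences that are now complete.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Assumed boundary: ., ! or ? followed by whitespace. Waiting for the
    // whitespace avoids splitting inside "3.14" or on a half-received token.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the LLM stream ends to emit whatever text is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}

// Usage: each returned sentence would be handed to the TTS provider.
const chunker = new SentenceChunker();
const first = chunker.push("Hello wor");    // []
const second = chunker.push("ld. How are"); // ["Hello world."]
const third = chunker.push(" you? Fine");   // ["How are you?"]
const last = chunker.flush();               // ["Fine"]
```

Each emitted sentence could be sent to TTS straight away and its audio appended to a playback queue, so synthesis of sentence 2 overlaps with playback of sentence 1.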

### 2. Auto Re-listen After Response
**Problem:** After the AI finishes speaking, the conversation stops. The user must manually click the mic button again for each turn.

**Proposed:** Automatically re-engage the microphone after the AI's audio finishes playing, so the conversation flows naturally without manual clicks.

```
Current:  [Click] → Speak → AI responds → STOP (wait for click)
Proposed: [Click] → Speak → AI responds → Auto-listen → Speak → AI responds → ...
```
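The turn loop above can be modelled as a small state machine. This is only a sketch: the state and event names (`listening`, `playbackEnd`, and so on) are assumptions made for illustration and do not correspond to existing Hermes code. The key transition is `speaking` → `listening` on `playbackEnd`, which replaces the manual click.

```typescript
// Hypothetical states and events for one voice conversation session.
type VoiceState = "idle" | "listening" | "thinking" | "speaking";
type VoiceEvent = "start" | "utteranceEnd" | "responseReady" | "playbackEnd" | "stop";

// Pure transition function: returns the next state for a given event.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  if (event === "stop") return "idle"; // the user ends the conversation
  if (state === "idle" && event === "start") return "listening";
  if (state === "listening" && event === "utteranceEnd") return "thinking";
  if (state === "thinking" && event === "responseReady") return "speaking";
  // Auto re-listen: reopen the mic as soon as playback ends.
  if (state === "speaking" && event === "playbackEnd") return "listening";
  return state; // ignore events that do not apply in the current state
}

// Usage: one click, then two full turns without clicking again.
const events: VoiceEvent[] = [
  "start", "utteranceEnd", "responseReady", "playbackEnd",
  "utteranceEnd", "responseReady", "playbackEnd",
];
const finalState = events.reduce(nextState, "idle" as VoiceState); // "listening"
```

An explicit `stop` event (for example, clicking the button again or a silence timeout) is assumed here so that the loop has a way to end.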

### 3. Barge-in (Interrupt) Support
**Problem:** Once the AI starts responding, there's no way to interrupt or redirect the conversation. The user has to wait for the full response to finish.

**Proposed:** Allow the user to speak while the AI is responding. When speech is detected:
- Stop the current TTS playback immediately
- Process the new user input
- Respond to the new context

This is especially important for longer AI responses where the user may want to say "stop" or change direction.
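One possible shape for the "stop playback immediately" step is sketched below, assuming audio is played sentence by sentence from a queue. `PlaybackQueue` and its methods are hypothetical names. The sketch uses a generation counter as a cancellation token: a player records the token when it starts a sentence and stops as soon as the token goes stale.

```typescript
// Hypothetical queue of sentences waiting to be spoken.
class PlaybackQueue {
  private pending: string[] = [];
  private generation = 0;

  // Token a player captures when it starts; a stale token means "stop now".
  currentGeneration(): number {
    return this.generation;
  }

  isStale(token: number): boolean {
    return token !== this.generation;
  }

  enqueue(sentence: string): void {
    this.pending.push(sentence);
  }

  // Next sentence to play, or undefined when the queue is empty.
  next(): string | undefined {
    return this.pending.shift();
  }

  // Call when speech is detected on the mic during playback.
  bargeIn(): void {
    this.generation += 1; // invalidates any in-flight playback
    this.pending = [];    // drops sentences that were not spoken yet
  }
}

// Usage: the user interrupts while the first sentence is playing.
const queue = new PlaybackQueue();
queue.enqueue("First sentence.");
queue.enqueue("Second sentence.");
const token = queue.currentGeneration();
const playing = queue.next();             // "First sentence."
queue.bargeIn();
const mustStop = queue.isStale(token);    // true
const remaining = queue.next();           // undefined
```

Keeping the mic open during playback means it may pick up the AI's own voice, so a real implementation would likely need echo cancellation or a speech-detection threshold before calling `bargeIn()`.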

## Impact

These three changes together would transform the voice feature from a basic "speech-to-text input + text-to-speech output" tool into a genuine **real-time voice conversation experience**, which is increasingly expected by users familiar with ChatGPT Voice, Siri, etc.

## Environment

- Hermes Desktop (Electron GUI)
- Hermes Agent v2.x
- TTS: Edge TTS (also tested with MiMo TTS, which natively supports streaming)
- STT: MiMo ASR (command provider)
- OS: Windows 11
