Description
The "Start Voice Conversation" feature currently works in a walkie-talkie style rather than a true conversation mode. Three key improvements would make it feel like a real-time voice assistant (similar to ChatGPT Advanced Voice Mode).
Current Behavior
- User clicks button → speaks → AI generates full text response → converts entire response to audio → plays
- After the AI finishes, the mic stays off; the user must click the button again for each new turn
- While the AI is responding (text or audio), the mic is completely disabled, so there is no way to interrupt
Proposed Improvements
1. Sentence-level Streaming TTS
Problem: The current flow waits for the LLM to finish generating the entire response, then converts all text to audio at once. This creates a noticeable delay, especially for longer responses.
Proposed: Split the LLM's streaming text output into sentences, and send each sentence to TTS as soon as it's available. Play audio chunks progressively instead of waiting for the full response.
Current:  LLM [full response] → TTS [full audio] → Play
Proposed: LLM [sentence 1] → TTS → Play
          LLM [sentence 2] → TTS → Play (overlaps with generation)
          ...
Many TTS providers (Edge TTS, MiMo TTS, OpenAI TTS) already support streaming audio output, so the infrastructure just needs to be wired up.
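To make the sentence-splitting step concrete, here is a minimal sketch in Python. It is not taken from the Hermes codebase: the name `iter_sentences` is hypothetical, and the boundary rule (`.`, `!` or `?` followed by whitespace) is an assumption that will mis-split abbreviations such as "Dr. Smith". Each yielded sentence would be handed to the TTS provider as soon as it appears.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
# Waiting for the whitespace avoids splitting "3.14" when the chunk ends at "3.".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of LLM text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.start()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # The stream has ended: flush whatever text is left.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hello the", "re. How are", " you? I am", " fine"]
    for sentence in iter_sentences(stream):
        print(sentence)  # prints "Hello there.", "How are you?", "I am fine"
```

Because the function is a generator, the first sentence can go to TTS while the LLM is still producing the rest of the response.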
2. Auto Re-listen After Response
Problem: After the AI finishes speaking, the conversation stops. The user must manually click the mic button again for each turn.
Proposed: Automatically re-engage the microphone after the AI's audio finishes playing, so the conversation flows naturally without manual clicks.
Current: [Click] → Speak → AI responds → STOP (wait for click)
Proposed: [Click] → Speak → AI responds → Auto-listen → Speak → AI responds → ...
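One way to picture the proposed loop is as a small state machine. The sketch below is illustrative only: the state and event names are hypothetical, and it assumes that clicking the button during an active conversation ends it, which this request does not specify.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class Event(Enum):
    BUTTON_CLICKED = auto()
    UTTERANCE_ENDED = auto()
    RESPONSE_READY = auto()
    PLAYBACK_FINISHED = auto()


TRANSITIONS = {
    (State.IDLE, Event.BUTTON_CLICKED): State.LISTENING,
    (State.LISTENING, Event.UTTERANCE_ENDED): State.THINKING,
    (State.THINKING, Event.RESPONSE_READY): State.SPEAKING,
    # The proposed change: return to listening instead of going idle.
    (State.SPEAKING, Event.PLAYBACK_FINISHED): State.LISTENING,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    # Assumption: a click during an active conversation ends it.
    if event is Event.BUTTON_CLICKED and state is not State.IDLE:
        return State.IDLE
    # Events that do not apply to the current state are ignored.
    return TRANSITIONS.get((state, event), state)
```

With this table, a single click followed by a complete turn leaves the conversation in `LISTENING`, ready for the next utterance without another click.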
3. Barge-in (Interrupt) Support
Problem: Once the AI starts responding, there's no way to interrupt or redirect the conversation. The user has to wait for the full response to finish.
Proposed: Allow the user to speak while the AI is responding. When speech is detected:
- Stop the current TTS playback immediately
- Process the new user input
- Respond to the new context
This is especially important for longer AI responses where the user may want to say "stop" or change direction.
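The interrupt steps above could be sketched with `asyncio` task cancellation. This is a simulation under stated assumptions, not the app's actual audio path: `play_sentences` stands in for real playback with `asyncio.sleep`, the `speech_detected` event stands in for a voice-activity detector, and all function names are hypothetical.

```python
import asyncio
from typing import List, Optional, Tuple


async def play_sentences(sentences: List[str], played: List[str],
                         seconds_per_sentence: float) -> None:
    """Stand-in for audio playback: each sentence takes a fixed time."""
    for sentence in sentences:
        await asyncio.sleep(seconds_per_sentence)
        played.append(sentence)  # recorded only once fully "played"


async def speak_with_barge_in(sentences: List[str],
                              speech_detected: asyncio.Event,
                              played: List[str],
                              seconds_per_sentence: float = 0.2) -> bool:
    """Play sentences until done or until speech is detected.

    Returns True if playback was interrupted by the user.
    """
    playback = asyncio.create_task(
        play_sentences(sentences, played, seconds_per_sentence))
    detector = asyncio.create_task(speech_detected.wait())
    await asyncio.wait({playback, detector},
                       return_when=asyncio.FIRST_COMPLETED)
    interrupted = not playback.done()
    # Stop whichever task is still running; cancelling a finished task is a no-op.
    playback.cancel()
    detector.cancel()
    await asyncio.gather(playback, detector, return_exceptions=True)
    return interrupted


async def demo(interrupt_after: Optional[float]) -> Tuple[bool, List[str]]:
    """Simulate a response, optionally interrupted after some seconds."""
    speech_detected = asyncio.Event()
    played: List[str] = []
    if interrupt_after is not None:
        loop = asyncio.get_running_loop()
        loop.call_later(interrupt_after, speech_detected.set)
    interrupted = await speak_with_barge_in(
        ["One.", "Two.", "Three."], speech_detected, played)
    return interrupted, played


if __name__ == "__main__":
    print(asyncio.run(demo(interrupt_after=0.3)))
    print(asyncio.run(demo(interrupt_after=None)))
```

When `speak_with_barge_in` returns `True`, the caller would discard any queued sentences and route the newly captured speech to STT as the next user turn. A real implementation would also need echo cancellation so the AI's own audio does not trigger the detector.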
Impact
These three changes together would transform the voice feature from a basic "speech-to-text input + text-to-speech output" tool into a genuine real-time voice conversation experience, which is increasingly expected by users familiar with ChatGPT Voice, Siri, etc.
Environment
- Hermes Desktop (Electron GUI)
- Hermes Agent v2.x
- TTS: Edge TTS (also tested with MiMo TTS, which natively supports streaming)
- STT: MiMo ASR (command provider)
- OS: Windows 11