diff --git a/AGENTS.md b/AGENTS.md
index 73b7f34..1083805 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -283,17 +283,18 @@ Task patterns allow query-based customization. Each pattern has:
## TTS Plugin (`tts.ts`)
### Overview
-Reads the final agent response aloud when a session completes. Supports two engines:
-- **OS TTS**: Native macOS `say` command (default, instant)
-- **Chatterbox**: High-quality neural TTS with voice cloning
+Reads the final agent response aloud when a session completes. Supports three engines:
+- **Coqui TTS**: High-quality neural TTS (default) - Model: `tts_models/en/vctk/vits` with p226 voice
+- **OS TTS**: Native macOS `say` command (instant, no setup)
+- **Chatterbox**: Alternative neural TTS with voice cloning
### Features
-- **Dual engine support**: OS TTS (instant) or Chatterbox (high quality)
-- **Server mode**: Chatterbox model stays loaded for fast subsequent requests
-- **Shared server**: Single Chatterbox instance shared across all OpenCode sessions
+- **Three-engine support**: Coqui TTS (recommended), OS TTS (instant), Chatterbox
+- **Server mode**: TTS model stays loaded for fast subsequent requests
+- **Shared server**: Single TTS instance shared across all OpenCode sessions
- **Lock mechanism**: Prevents multiple server startups from concurrent sessions
- **Device auto-detection**: Supports CUDA, MPS (Apple Silicon), CPU
-- **Turbo model**: 10x faster Chatterbox inference
+- **Multi-speaker support**: Coqui VCTK model supports 109 speakers (p226 default)
- Cleans markdown/code from text before speaking
- Truncates long messages (1000 char limit)
- Skips judge/reflection sessions
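The markdown-cleanup and 1000-character truncation above can be sketched roughly as follows. This is an illustrative approximation, not the plugin's actual implementation; the regexes and helper name are assumptions:

```python
import re

MAX_CHARS = 1000  # matches the documented truncation limit

def clean_for_speech(text: str) -> str:
    """Illustrative sketch: strip markdown/code markers, then truncate."""
    # Drop fenced code blocks entirely
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Drop inline code backticks but keep the content
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Drop emphasis, heading, and blockquote markers
    text = re.sub(r"[*_#>]+", " ", text)
    # Collapse whitespace and enforce the length limit
    text = re.sub(r"\s+", " ", text).strip()
    return text[:MAX_CHARS]
```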
@@ -304,11 +305,17 @@ Edit `~/.config/opencode/tts.json`:
```json
{
"enabled": true,
- "engine": "chatterbox",
+ "engine": "coqui",
"os": {
"voice": "Samantha",
"rate": 200
},
+ "coqui": {
+ "model": "vctk_vits",
+ "device": "mps",
+ "speaker": "p226",
+ "serverMode": true
+ },
"chatterbox": {
"device": "mps",
"useTurbo": true,
@@ -318,14 +325,24 @@ Edit `~/.config/opencode/tts.json`:
}
```
-### Chatterbox Server Files
-Located in `~/.config/opencode/opencode-helpers/chatterbox/`:
+### Coqui TTS Models
+| Model | Description | Speed |
+|-------|-------------|-------|
+| `vctk_vits` | Multi-speaker VITS (109 speakers, p226 recommended) | Fast |
+| `vits` | LJSpeech single speaker | Fast |
+| `jenny` | Jenny voice | Medium |
+| `xtts_v2` | XTTS with voice cloning | Slower |
+| `bark` | Multilingual neural TTS | Slower |
+| `tortoise` | Very high quality | Very slow |
+
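The short names in the table map to full Coqui model identifiers. A minimal sketch of that mapping, using the IDs that appear in the `tts.py` changes in this diff (the `xtts_v2` and `bark` IDs are not shown here, so they are omitted):

```python
# Short config names -> Coqui TTS model identifiers (from tts.py in this diff)
COQUI_MODELS = {
    "vctk_vits": "tts_models/en/vctk/vits",
    "vits": "tts_models/en/ljspeech/vits",
    "jenny": "tts_models/en/jenny/jenny",
    "tortoise": "tts_models/en/multi-dataset/tortoise-v2",
}

def resolve_model(name: str) -> str:
    """Look up a short name, raising a clear error for unknown models."""
    try:
        return COQUI_MODELS[name]
    except KeyError:
        raise ValueError(f"unknown Coqui model: {name!r}") from None
```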
+### Coqui Server Files
+Located in `~/.config/opencode/opencode-helpers/coqui/`:
- `tts.py` - One-shot TTS script
- `tts_server.py` - Persistent server script
- `tts.sock` - Unix socket for IPC
- `server.pid` - Running server PID
- `server.lock` - Startup lock file
-- `venv/` - Python virtualenv with chatterbox-tts
+- `venv/` - Python virtualenv with TTS package
### Testing
```bash
@@ -335,14 +352,14 @@ npm run test:tts:manual # Actually speaks test phrases
### Debugging
```bash
-# Check if Chatterbox server is running
-ls -la ~/.config/opencode/opencode-helpers/chatterbox/tts.sock
+# Check if Coqui server is running
+ls -la ~/.config/opencode/opencode-helpers/coqui/tts.sock
# Check server PID
-cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid
+cat ~/.config/opencode/opencode-helpers/coqui/server.pid
# Stop server manually
-kill $(cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid)
+kill $(cat ~/.config/opencode/opencode-helpers/coqui/server.pid)
# Server logs go to stderr
# The server restarts automatically on the next TTS request
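Before running the `kill` command above, it can help to confirm the recorded PID is still alive. A small sketch using signal 0, which probes a process without sending anything (the function name is illustrative):

```python
import os

def server_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, but existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # Process exists but belongs to another user
        return True
    return True
```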
diff --git a/README.md b/README.md
index fba616d..7a67fe8 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ This plugin adds a **judge layer** that automatically evaluates task completion
- **Automatic task verification** - Judge evaluates completion after each agent response
- **Self-healing workflow** - Agent receives feedback and continues if work is incomplete
- **Telegram notifications** - Get notified when tasks finish, reply via text or voice
-- **Local TTS** - Hear responses read aloud (Coqui XTTS, Chatterbox, macOS)
+- **Local TTS** - Hear responses read aloud (Coqui VCTK/VITS, Chatterbox, macOS)
- **Voice-to-text** - Reply to Telegram with voice messages, transcribed by local Whisper
## Quick Install
@@ -152,10 +152,92 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
| Engine | Quality | Speed | Setup |
|--------|---------|-------|-------|
-| **Coqui XTTS v2** | Excellent | 2-5s | Auto-installed, Python 3.9+ |
+| **Coqui TTS** | Excellent | Fast to medium | Auto-installed, Python 3.9-3.11 |
| **Chatterbox** | Excellent | 2-5s | Auto-installed, Python 3.11 |
| **macOS say** | Good | Instant | None |
+### Coqui TTS Models
+
+| Model | Description | Multi-Speaker | Speed |
+|-------|-------------|---------------|-------|
+| `vctk_vits` | VCTK VITS (109 speakers, **recommended**) | Yes (p226 default) | Fast |
+| `vits` | LJSpeech single speaker | No | Fast |
+| `jenny` | Jenny voice | No | Medium |
+| `xtts_v2` | XTTS v2 with voice cloning | Yes (via voiceRef) | Slower |
+| `bark` | Multilingual neural TTS | No | Slower |
+| `tortoise` | Very high quality | No | Very slow |
+
+**Recommended**: `vctk_vits` with speaker `p226` (clear, professional British male voice)
+
+### VCTK Speakers (vctk_vits model)
+
+The VCTK corpus contains 109 speakers with a range of English accents. Speaker IDs use the format `pXXX`.
+
+**Popular speaker choices:**
+
+| Speaker | Gender | Accent | Description |
+|---------|--------|--------|-------------|
+| `p226` | Male | English | Clear, professional (recommended) |
+| `p225` | Female | English | Clear, neutral |
+| `p227` | Male | English | Deep voice |
+| `p228` | Female | English | Warm tone |
+| `p229` | Female | English | Higher pitch |
+| `p230` | Female | English | Soft voice |
+| `p231` | Male | English | Standard |
+| `p232` | Male | English | Casual |
+| `p233` | Female | Scottish | Scottish accent |
+| `p234` | Female | Scottish | Scottish accent |
+| `p236` | Female | English | Professional |
+| `p237` | Male | Scottish | Scottish accent |
+| `p238` | Female | N. Irish | Northern Irish |
+| `p239` | Female | English | Young voice |
+| `p240` | Female | English | Mature voice |
+| `p241` | Male | Scottish | Scottish accent |
+| `p243` | Male | English | Deep, authoritative |
+| `p244` | Female | English | Bright voice |
+| `p245` | Male | Irish | Irish accent |
+| `p246` | Male | Scottish | Scottish accent |
+| `p247` | Male | Scottish | Scottish accent |
+| `p248` | Female | Indian | Indian English |
+| `p249` | Female | Scottish | Scottish accent |
+| `p250` | Female | English | Standard |
+| `p251` | Male | Indian | Indian English |
+
+
+**All 109 VCTK speakers:**
+
+```
+p225, p226, p227, p228, p229, p230, p231, p232, p233, p234,
+p236, p237, p238, p239, p240, p241, p243, p244, p245, p246,
+p247, p248, p249, p250, p251, p252, p253, p254, p255, p256,
+p257, p258, p259, p260, p261, p262, p263, p264, p265, p266,
+p267, p268, p269, p270, p271, p272, p273, p274, p275, p276,
+p277, p278, p279, p280, p281, p282, p283, p284, p285, p286,
+p287, p288, p292, p293, p294, p295, p297, p298, p299, p300,
+p301, p302, p303, p304, p305, p306, p307, p308, p310, p311,
+p312, p313, p314, p316, p317, p318, p323, p326, p329, p330,
+p333, p334, p335, p336, p339, p340, p341, p343, p345, p347,
+p351, p360, p361, p362, p363, p364, p374, p376, ED
+```
+
+### XTTS v2 Speakers
+
+XTTS v2 is primarily a voice cloning model. Use the `voiceRef` option to clone any voice:
+
+```json
+{
+ "coqui": {
+ "model": "xtts_v2",
+ "voiceRef": "/path/to/reference-voice.wav",
+ "language": "en"
+ }
+}
+```
+
+Supported languages: `en`, `es`, `fr`, `de`, `it`, `pt`, `pl`, `tr`, `ru`, `nl`, `cs`, `ar`, `zh-cn`, `ja`, `hu`, `ko`
+
### Configuration
`~/.config/opencode/tts.json`:
@@ -165,10 +247,21 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
"enabled": true,
"engine": "coqui",
"coqui": {
- "model": "xtts_v2",
+ "model": "vctk_vits",
"device": "mps",
+ "speaker": "p226",
"serverMode": true
},
+ "os": {
+ "voice": "Samantha",
+ "rate": 200
+ },
+ "chatterbox": {
+ "device": "mps",
+ "useTurbo": true,
+ "serverMode": true,
+ "exaggeration": 0.5
+ },
"telegram": {
"enabled": true,
"uuid": "",
@@ -179,12 +272,49 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
}
```
+### Configuration Options
+
+#### Engine Selection
+
+| Option | Description |
+|--------|-------------|
+| `engine` | `"coqui"` (default), `"chatterbox"`, or `"os"` |
+
+#### Coqui Options (`coqui`)
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `model` | TTS model (see table above) | `"vctk_vits"` |
+| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
+| `speaker` | Speaker ID for multi-speaker models | `"p226"` |
+| `serverMode` | Keep model loaded for fast requests | `true` |
+| `voiceRef` | Path to voice clip for cloning (XTTS) | - |
+| `language` | Language code for XTTS | `"en"` |
+
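The auto-detect default for `device` can be approximated as below. This sketch assumes PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks, and falls back to CPU when torch is not installed:

```python
def detect_device() -> str:
    """Best-effort device pick: cuda > mps > cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may lack the mps backend entirely
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```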
+#### Chatterbox Options (`chatterbox`)
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
+| `useTurbo` | Use Turbo model (10x faster) | `true` |
+| `serverMode` | Keep model loaded | `true` |
+| `exaggeration` | Emotion level (0.0-1.0) | `0.5` |
+| `voiceRef` | Path to voice clip for cloning | - |
+
+#### OS TTS Options (`os`)
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `voice` | macOS voice name (run `say -v ?` to list) | `"Samantha"` |
+| `rate` | Words per minute | `200` |
+
### Toggle Commands
```
/tts Toggle on/off
/tts on Enable
/tts off Disable
+/tts status Check current state
```
---
@@ -469,7 +599,7 @@ npm run test:tts:manual
## Requirements
- OpenCode v1.0+
-- **TTS**: macOS (for `say`), Python 3.9+ (Coqui), Python 3.11 (Chatterbox)
+- **TTS**: macOS (for `say`), Python 3.9-3.11 (Coqui), Python 3.11 (Chatterbox)
- **Telegram voice**: ffmpeg (`brew install ffmpeg`)
- **Dependencies**: `bun` (OpenCode installs deps from package.json)
diff --git a/telegram.ts b/telegram.ts
index c1ec857..44be54f 100644
--- a/telegram.ts
+++ b/telegram.ts
@@ -688,7 +688,7 @@ async function transcribeAudio(
}
try {
- const response = await fetch(`http://127.0.0.1:${port}/transcribe-base64`, {
+ const response = await fetch(`http://127.0.0.1:${port}/transcribe`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
diff --git a/tts.ts b/tts.ts
index a219017..dcd0a18 100644
--- a/tts.ts
+++ b/tts.ts
@@ -205,7 +205,13 @@ const REFLECTION_POLL_INTERVAL_MS = 500 // Poll interval for verdict file
type TTSEngine = "coqui" | "chatterbox" | "os"
// Coqui TTS model types
-type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "jenny"
+// - bark: Multilingual neural TTS (slower, higher quality)
+// - xtts_v2: XTTS v2 with voice cloning support
+// - tortoise: Very high quality but slow
+// - vits: Fast VITS model (LJSpeech single speaker)
+// - vctk_vits: VCTK multi-speaker VITS (supports speaker selection, e.g., p226)
+// - jenny: Jenny voice model
+type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "vctk_vits" | "jenny"
interface TTSConfig {
enabled?: boolean
@@ -215,14 +221,14 @@ interface TTSConfig {
voice?: string // Voice name (e.g., "Samantha", "Alex"). Run `say -v ?` on macOS to list voices
rate?: number // Speaking rate in words per minute (default: 200)
}
- // Coqui TTS options (supports bark, xtts_v2, tortoise, vits, etc.)
+ // Coqui TTS options (supports bark, xtts_v2, tortoise, vits, vctk_vits, etc.)
coqui?: {
- model?: CoquiModel // Model to use: "bark", "xtts_v2", "tortoise", "vits" (default: "xtts_v2")
+ model?: CoquiModel // Model to use: "vctk_vits" (recommended), "xtts_v2", "vits", etc.
device?: "cuda" | "cpu" | "mps" // GPU, CPU, or Apple Silicon (default: auto-detect)
// XTTS-specific options
voiceRef?: string // Path to reference voice clip for cloning (XTTS)
language?: string // Language code for XTTS (default: "en")
- speaker?: string // Speaker name for XTTS (default: "Ana Florence")
+ speaker?: string // Speaker name/ID (e.g., "p226" for vctk_vits, "Ana Florence" for xtts)
serverMode?: boolean // Keep model loaded for fast subsequent requests (default: true)
}
// Chatterbox-specific options
@@ -337,9 +343,9 @@ async function loadConfig(): Promise<TTSConfig> {
enabled: true,
engine: "coqui",
coqui: {
- model: "xtts_v2",
+ model: "vctk_vits",
device: "mps",
- language: "en",
+ speaker: "p226",
serverMode: true
},
os: {
@@ -1103,11 +1109,11 @@ def main():
parser = argparse.ArgumentParser(description="Coqui TTS")
parser.add_argument("text", help="Text to synthesize")
parser.add_argument("--output", "-o", required=True, help="Output WAV file")
- parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
+ parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "mps", "cpu"])
parser.add_argument("--voice-ref", help="Reference voice audio path (for XTTS voice cloning)")
parser.add_argument("--language", default="en", help="Language code (for XTTS)")
- parser.add_argument("--speaker", default="Ana Florence", help="Speaker name for XTTS (e.g., 'Ana Florence', 'Claribel Dervla')")
+ parser.add_argument("--speaker", default="p226", help="Speaker ID for multi-speaker models (e.g., 'p226' for vctk_vits)")
args = parser.parse_args()
try:
@@ -1159,6 +1165,11 @@ def main():
tts = TTS("tts_models/en/ljspeech/vits")
tts = tts.to(device)
tts.tts_to_file(text=args.text, file_path=args.output)
+ elif args.model == "vctk_vits":
+ # VCTK VITS multi-speaker model - clear, professional voices
+ tts = TTS("tts_models/en/vctk/vits")
+ tts = tts.to(device)
+ tts.tts_to_file(text=args.text, file_path=args.output, speaker=args.speaker)
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")
tts = tts.to(device)
@@ -1186,10 +1197,10 @@ import argparse
def main():
parser = argparse.ArgumentParser(description="Coqui TTS Server")
parser.add_argument("--socket", required=True, help="Unix socket path")
- parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
+ parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "cpu", "mps"])
parser.add_argument("--voice-ref", help="Default reference voice (for XTTS)")
- parser.add_argument("--speaker", default="Ana Florence", help="Default XTTS speaker")
+ parser.add_argument("--speaker", default="p226", help="Default speaker ID (e.g., 'p226' for vctk_vits)")
parser.add_argument("--language", default="en", help="Default language")
args = parser.parse_args()
@@ -1222,6 +1233,8 @@ def main():
tts = TTS("tts_models/en/multi-dataset/tortoise-v2")
elif args.model == "vits":
tts = TTS("tts_models/en/ljspeech/vits")
+ elif args.model == "vctk_vits":
+ tts = TTS("tts_models/en/vctk/vits")
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")
@@ -1265,6 +1278,9 @@ def main():
tts.tts_to_file(text=text, file_path=output, speaker_wav=voice_ref, language=language)
else:
tts.tts_to_file(text=text, file_path=output, speaker=speaker, language=language)
+ elif args.model in ("vctk_vits",):
+ # Multi-speaker models use speaker ID
+ tts.tts_to_file(text=text, file_path=output, speaker=speaker)
else:
tts.tts_to_file(text=text, file_path=output)
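The server mode above boils down to JSON requests over a Unix socket (`tts.sock`). A minimal, self-contained sketch of that request/reply pattern; the field names (`text`, `status`, `spoke`) are assumptions, not the plugin's real protocol:

```python
import json
import os
import socket
import tempfile
import threading

SOCK = os.path.join(tempfile.mkdtemp(), "tts.sock")
ready = threading.Event()

def serve_once() -> None:
    """Toy server: accept a single JSON request and reply with a status."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK)
    srv.listen(1)
    ready.set()  # socket is now accepting connections
    conn, _ = srv.accept()
    req = json.loads(conn.recv(65536).decode())
    conn.sendall(json.dumps({"status": "ok", "spoke": req["text"]}).encode())
    conn.close()
    srv.close()

def request(text: str) -> dict:
    """Toy client: mirrors how a session would hand text to the shared server."""
    ready.wait()
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(SOCK)
    cli.sendall(json.dumps({"text": text}).encode())
    reply = json.loads(cli.recv(65536).decode())
    cli.close()
    return reply

server = threading.Thread(target=serve_once)
server.start()
reply = request("hello world")
server.join()
```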