Merged
47 changes: 32 additions & 15 deletions AGENTS.md
@@ -283,17 +283,18 @@ Task patterns allow query-based customization. Each pattern has:
## TTS Plugin (`tts.ts`)

### Overview
Reads the final agent response aloud when a session completes. Supports two engines:
- **OS TTS**: Native macOS `say` command (default, instant)
- **Chatterbox**: High-quality neural TTS with voice cloning
Reads the final agent response aloud when a session completes. Supports three engines:
- **Coqui TTS**: High-quality neural TTS (default) - Model: `tts_models/en/vctk/vits` with p226 voice
- **OS TTS**: Native macOS `say` command (instant, no setup)
- **Chatterbox**: Alternative neural TTS with voice cloning

### Features
- **Dual engine support**: OS TTS (instant) or Chatterbox (high quality)
- **Server mode**: Chatterbox model stays loaded for fast subsequent requests
- **Shared server**: Single Chatterbox instance shared across all OpenCode sessions
- **Multiple engine support**: Coqui TTS (recommended), OS TTS (instant), Chatterbox
- **Server mode**: TTS model stays loaded for fast subsequent requests
- **Shared server**: Single TTS instance shared across all OpenCode sessions
- **Lock mechanism**: Prevents multiple server startups from concurrent sessions
- **Device auto-detection**: Supports CUDA, MPS (Apple Silicon), CPU
- **Turbo model**: 10x faster Chatterbox inference
- **Multi-speaker support**: Coqui VCTK model supports 109 speakers (p226 default)
- Cleans markdown/code from text before speaking
- Truncates long messages (1000 char limit)
- Skips judge/reflection sessions
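
The markdown cleanup and 1000-character truncation listed above can be sketched as follows. The function name and exact regexes are illustrative only, not the plugin's actual implementation in `tts.ts`:

```python
import re

TTS_CHAR_LIMIT = 1000  # mirrors the plugin's 1000-char truncation

def clean_for_speech(text: str, limit: int = TTS_CHAR_LIMIT) -> str:
    """Strip markdown/code so the TTS engine reads plain prose."""
    # Drop fenced code blocks entirely
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Drop inline-code backticks but keep their content
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Unwrap bold/italic markers
    text = re.sub(r"[*_]{1,3}([^*_]+)[*_]{1,3}", r"\1", text)
    # Strip heading markers and markdown link syntax
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Collapse whitespace, then truncate
    text = re.sub(r"\s+", " ", text).strip()
    return text[:limit]
```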
@@ -304,11 +305,17 @@ Edit `~/.config/opencode/tts.json`:
```json
{
"enabled": true,
"engine": "chatterbox",
"engine": "coqui",
"os": {
"voice": "Samantha",
"rate": 200
},
"coqui": {
"model": "vctk_vits",
"device": "mps",
"speaker": "p226",
"serverMode": true
},
"chatterbox": {
"device": "mps",
"useTurbo": true,
@@ -318,14 +325,24 @@ Edit `~/.config/opencode/tts.json`:
}
```
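
Resolution of this file can be approximated as a shallow per-section merge over the plugin defaults (those defaults appear in the `loadConfig` hunk of `tts.ts` later in this diff). The helper below is an illustrative Python sketch, not the plugin's own loader:

```python
import json
from pathlib import Path

# Defaults mirroring the loadConfig() values in tts.ts
DEFAULTS = {
    "enabled": True,
    "engine": "coqui",
    "coqui": {"model": "vctk_vits", "device": "mps", "speaker": "p226", "serverMode": True},
    "os": {"voice": "Samantha", "rate": 200},
}

def load_tts_config(path: Path = Path.home() / ".config/opencode/tts.json") -> dict:
    """Merge user settings over defaults; a missing file yields the defaults."""
    cfg = {k: (dict(v) if isinstance(v, dict) else v) for k, v in DEFAULTS.items()}
    if path.exists():
        user = json.loads(path.read_text())
        for key, value in user.items():
            if isinstance(value, dict) and isinstance(cfg.get(key), dict):
                cfg[key].update(value)  # per-section shallow merge
            else:
                cfg[key] = value
    return cfg
```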

### Chatterbox Server Files
Located in `~/.config/opencode/opencode-helpers/chatterbox/`:
### Coqui TTS Models
| Model | Description | Speed |
|-------|-------------|-------|
| `vctk_vits` | Multi-speaker VITS (109 speakers, p226 recommended) | Fast |
| `vits` | LJSpeech single speaker | Fast |
| `jenny` | Jenny voice | Medium |
| `xtts_v2` | XTTS with voice cloning | Slower |
| `bark` | Multilingual neural TTS | Slower |
| `tortoise` | Very high quality | Very slow |

### Coqui Server Files
Located in `~/.config/opencode/opencode-helpers/coqui/`:
- `tts.py` - One-shot TTS script
- `tts_server.py` - Persistent server script
- `tts.sock` - Unix socket for IPC
- `server.pid` - Running server PID
- `server.lock` - Startup lock file
- `venv/` - Python virtualenv with chatterbox-tts
- `venv/` - Python virtualenv with TTS package
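
The `tts.sock` entry is the IPC channel between the plugin and `tts_server.py`. The diff does not show the wire format, so the JSON-lines framing and field names below are assumptions for illustration; only the socket path comes from the file list above:

```python
import json
import os
import socket

SOCKET = os.path.expanduser("~/.config/opencode/opencode-helpers/coqui/tts.sock")

def build_request(text: str, output: str, speaker: str = "p226") -> bytes:
    """Encode one synthesis request as newline-delimited JSON (assumed framing)."""
    return (json.dumps({"text": text, "output": output, "speaker": speaker}) + "\n").encode()

def synthesize(text: str, output: str) -> None:
    """Send a request to the persistent server over the Unix socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET)
        s.sendall(build_request(text, output))
        s.recv(4096)  # wait for the server's acknowledgement
```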

### Testing
```bash
@@ -335,14 +352,14 @@ npm run test:tts:manual # Actually speaks test phrases

### Debugging
```bash
# Check if Chatterbox server is running
ls -la ~/.config/opencode/opencode-helpers/chatterbox/tts.sock
# Check if Coqui server is running
ls -la ~/.config/opencode/opencode-helpers/coqui/tts.sock

# Check server PID
cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid
cat ~/.config/opencode/opencode-helpers/coqui/server.pid

# Stop server manually
kill $(cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid)
kill $(cat ~/.config/opencode/opencode-helpers/coqui/server.pid)

# Check server logs (stderr)
# Server automatically restarts on next TTS request
138 changes: 134 additions & 4 deletions README.md
@@ -29,7 +29,7 @@ This plugin adds a **judge layer** that automatically evaluates task completion
- **Automatic task verification** - Judge evaluates completion after each agent response
- **Self-healing workflow** - Agent receives feedback and continues if work is incomplete
- **Telegram notifications** - Get notified when tasks finish, reply via text or voice
- **Local TTS** - Hear responses read aloud (Coqui XTTS, Chatterbox, macOS)
- **Local TTS** - Hear responses read aloud (Coqui VCTK/VITS, Chatterbox, macOS)
- **Voice-to-text** - Reply to Telegram with voice messages, transcribed by local Whisper

## Quick Install
@@ -152,10 +152,92 @@ Text-to-speech with Telegram integration for remote notifications and two-way co

| Engine | Quality | Speed | Setup |
|--------|---------|-------|-------|
| **Coqui XTTS v2** | Excellent | 2-5s | Auto-installed, Python 3.9+ |
| **Coqui TTS** | Excellent | Fast-Medium | Auto-installed, Python 3.9-3.11 |
| **Chatterbox** | Excellent | 2-5s | Auto-installed, Python 3.11 |
| **macOS say** | Good | Instant | None |

### Coqui TTS Models

| Model | Description | Multi-Speaker | Speed |
|-------|-------------|---------------|-------|
| `vctk_vits` | VCTK VITS (109 speakers, **recommended**) | Yes (p226 default) | Fast |
| `vits` | LJSpeech single speaker | No | Fast |
| `jenny` | Jenny voice | No | Medium |
| `xtts_v2` | XTTS v2 with voice cloning | Yes (via voiceRef) | Slower |
| `bark` | Multilingual neural TTS | No | Slower |
| `tortoise` | Very high quality | No | Very slow |

**Recommended**: `vctk_vits` with speaker `p226` (clear, professional British male voice)

### VCTK Speakers (vctk_vits model)

The VCTK corpus contains 109 speakers with a range of English accents. Speaker IDs follow the `pXXX` format.

**Popular speaker choices:**

| Speaker | Gender | Accent | Description |
|---------|--------|--------|-------------|
| `p226` | Male | English | Clear, professional (recommended) |
| `p225` | Female | English | Clear, neutral |
| `p227` | Male | English | Deep voice |
| `p228` | Female | English | Warm tone |
| `p229` | Female | English | Higher pitch |
| `p230` | Female | English | Soft voice |
| `p231` | Male | English | Standard |
| `p232` | Male | English | Casual |
| `p233` | Female | Scottish | Scottish accent |
| `p234` | Female | Scottish | Scottish accent |
| `p236` | Female | English | Professional |
| `p237` | Male | Scottish | Scottish accent |
| `p238` | Female | N. Irish | Northern Irish |
| `p239` | Female | English | Young voice |
| `p240` | Female | English | Mature voice |
| `p241` | Male | Scottish | Scottish accent |
| `p243` | Male | English | Deep, authoritative |
| `p244` | Female | English | Bright voice |
| `p245` | Male | Irish | Irish accent |
| `p246` | Male | Scottish | Scottish accent |
| `p247` | Male | Scottish | Scottish accent |
| `p248` | Female | Indian | Indian English |
| `p249` | Female | Scottish | Scottish accent |
| `p250` | Female | English | Standard |
| `p251` | Male | Indian | Indian English |

<details>
<summary>All 109 VCTK speakers</summary>

```
p225, p226, p227, p228, p229, p230, p231, p232, p233, p234,
p236, p237, p238, p239, p240, p241, p243, p244, p245, p246,
p247, p248, p249, p250, p251, p252, p253, p254, p255, p256,
p257, p258, p259, p260, p261, p262, p263, p264, p265, p266,
p267, p268, p269, p270, p271, p272, p273, p274, p275, p276,
p277, p278, p279, p280, p281, p282, p283, p284, p285, p286,
p287, p288, p292, p293, p294, p295, p297, p298, p299, p300,
p301, p302, p303, p304, p305, p306, p307, p308, p310, p311,
p312, p313, p314, p316, p317, p318, p323, p326, p329, p330,
p333, p334, p335, p336, p339, p340, p341, p343, p345, p347,
p351, p360, p361, p362, p363, p364, p374, p376, ED
```

</details>
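
Since the corpus skips some numbers (there is no `p235` or `p242`, and `ED` is the lone non-numeric entry), a cheap shape check before writing `tts.json` can catch typos early. A hypothetical helper, not part of the plugin:

```python
import re

# VCTK IDs are "p" plus exactly three digits; "ED" is the one exception
_VCTK_ID = re.compile(r"^p\d{3}$")

def is_vctk_speaker_id(speaker: str) -> bool:
    """Check the ID's shape only; this does not verify the ID exists in the corpus."""
    return speaker == "ED" or bool(_VCTK_ID.match(speaker))
```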

### XTTS v2 Speakers

XTTS v2 is primarily a voice cloning model. Use the `voiceRef` option to clone any voice:

```json
{
"coqui": {
"model": "xtts_v2",
"voiceRef": "/path/to/reference-voice.wav",
"language": "en"
}
}
```

Supported languages: `en`, `es`, `fr`, `de`, `it`, `pt`, `pl`, `tr`, `ru`, `nl`, `cs`, `ar`, `zh-cn`, `ja`, `hu`, `ko`

### Configuration

`~/.config/opencode/tts.json`:
@@ -165,10 +247,21 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
"enabled": true,
"engine": "coqui",
"coqui": {
"model": "xtts_v2",
"model": "vctk_vits",
"device": "mps",
"speaker": "p226",
"serverMode": true
},
"os": {
"voice": "Samantha",
"rate": 200
},
"chatterbox": {
"device": "mps",
"useTurbo": true,
"serverMode": true,
"exaggeration": 0.5
},
"telegram": {
"enabled": true,
"uuid": "<your-uuid>",
@@ -179,12 +272,49 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
}
```

### Configuration Options

#### Engine Selection

| Option | Description |
|--------|-------------|
| `engine` | `"coqui"` (default), `"chatterbox"`, or `"os"` |

#### Coqui Options (`coqui`)

| Option | Description | Default |
|--------|-------------|---------|
| `model` | TTS model (see table above) | `"vctk_vits"` |
| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
| `speaker` | Speaker ID for multi-speaker models | `"p226"` |
| `serverMode` | Keep model loaded for fast requests | `true` |
| `voiceRef` | Path to voice clip for cloning (XTTS) | - |
| `language` | Language code for XTTS | `"en"` |
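
These options correspond to the flags `tts.py` declares via `argparse` (visible later in this diff). A sketch of the mapping (purely illustrative; the real translation lives in `tts.ts`):

```python
def coqui_args(text: str, output: str, cfg: dict) -> list:
    """Map tts.json "coqui" options onto the flags tts.py accepts."""
    argv = [
        text, "--output", output,
        "--model", cfg.get("model", "vctk_vits"),
        "--speaker", cfg.get("speaker", "p226"),
    ]
    if cfg.get("device"):
        argv += ["--device", cfg["device"]]
    if cfg.get("voiceRef"):  # XTTS voice cloning
        argv += ["--voice-ref", cfg["voiceRef"]]
    if cfg.get("language"):
        argv += ["--language", cfg["language"]]
    return argv
```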

#### Chatterbox Options (`chatterbox`)

| Option | Description | Default |
|--------|-------------|---------|
| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
| `useTurbo` | Use Turbo model (10x faster) | `true` |
| `serverMode` | Keep model loaded | `true` |
| `exaggeration` | Emotion level (0.0-1.0) | `0.5` |
| `voiceRef` | Path to voice clip for cloning | - |

#### OS TTS Options (`os`)

| Option | Description | Default |
|--------|-------------|---------|
| `voice` | macOS voice name (run `say -v ?` to list) | `"Samantha"` |
| `rate` | Words per minute | `200` |

### Toggle Commands

```
/tts Toggle on/off
/tts on Enable
/tts off Disable
/tts status Check current state
```
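
The toggle semantics above can be modeled in a few lines; this parser is a hypothetical sketch, not the plugin's command handler:

```python
def parse_tts_command(raw: str, current: bool) -> bool:
    """Resolve a /tts subcommand against the current enabled state."""
    parts = raw.strip().split()
    if not parts or parts[0] != "/tts":
        return current
    if len(parts) == 1:
        return not current  # bare /tts toggles
    # "status" (or anything unrecognized) leaves the state unchanged
    return {"on": True, "off": False}.get(parts[1], current)
```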

---
@@ -469,7 +599,7 @@ npm run test:tts:manual
## Requirements

- OpenCode v1.0+
- **TTS**: macOS (for `say`), Python 3.9+ (Coqui), Python 3.11 (Chatterbox)
- **TTS**: macOS (for `say`), Python 3.9-3.11 (Coqui), Python 3.11 (Chatterbox)
- **Telegram voice**: ffmpeg (`brew install ffmpeg`)
- **Dependencies**: `bun` (OpenCode installs deps from package.json)

2 changes: 1 addition & 1 deletion telegram.ts
@@ -688,7 +688,7 @@ async function transcribeAudio(
}

try {
const response = await fetch(`http://127.0.0.1:${port}/transcribe-base64`, {
const response = await fetch(`http://127.0.0.1:${port}/transcribe`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
36 changes: 26 additions & 10 deletions tts.ts
@@ -205,7 +205,13 @@ const REFLECTION_POLL_INTERVAL_MS = 500 // Poll interval for verdict file
type TTSEngine = "coqui" | "chatterbox" | "os"

// Coqui TTS model types
type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "jenny"
// - bark: Multilingual neural TTS (slower, higher quality)
// - xtts_v2: XTTS v2 with voice cloning support
// - tortoise: Very high quality but slow
// - vits: Fast VITS model (LJSpeech single speaker)
// - vctk_vits: VCTK multi-speaker VITS (supports speaker selection, e.g., p226)
// - jenny: Jenny voice model
type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "vctk_vits" | "jenny"

interface TTSConfig {
enabled?: boolean
@@ -215,14 +221,14 @@ interface TTSConfig {
voice?: string // Voice name (e.g., "Samantha", "Alex"). Run `say -v ?` on macOS to list voices
rate?: number // Speaking rate in words per minute (default: 200)
}
// Coqui TTS options (supports bark, xtts_v2, tortoise, vits, etc.)
// Coqui TTS options (supports bark, xtts_v2, tortoise, vits, vctk_vits, etc.)
coqui?: {
model?: CoquiModel // Model to use: "bark", "xtts_v2", "tortoise", "vits" (default: "xtts_v2")
model?: CoquiModel // Model to use: "vctk_vits" (recommended), "xtts_v2", "vits", etc.
device?: "cuda" | "cpu" | "mps" // GPU, CPU, or Apple Silicon (default: auto-detect)
// XTTS-specific options
voiceRef?: string // Path to reference voice clip for cloning (XTTS)
language?: string // Language code for XTTS (default: "en")
speaker?: string // Speaker name for XTTS (default: "Ana Florence")
speaker?: string // Speaker name/ID (e.g., "p226" for vctk_vits, "Ana Florence" for xtts)
serverMode?: boolean // Keep model loaded for fast subsequent requests (default: true)
}
// Chatterbox-specific options
@@ -337,9 +343,9 @@ async function loadConfig(): Promise<TTSConfig> {
enabled: true,
engine: "coqui",
coqui: {
model: "xtts_v2",
model: "vctk_vits",
device: "mps",
language: "en",
speaker: "p226",
serverMode: true
},
os: {
@@ -1103,11 +1109,11 @@ def main():
parser = argparse.ArgumentParser(description="Coqui TTS")
parser.add_argument("text", help="Text to synthesize")
parser.add_argument("--output", "-o", required=True, help="Output WAV file")
parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "mps", "cpu"])
parser.add_argument("--voice-ref", help="Reference voice audio path (for XTTS voice cloning)")
parser.add_argument("--language", default="en", help="Language code (for XTTS)")
parser.add_argument("--speaker", default="Ana Florence", help="Speaker name for XTTS (e.g., 'Ana Florence', 'Claribel Dervla')")
parser.add_argument("--speaker", default="p226", help="Speaker ID for multi-speaker models (e.g., 'p226' for vctk_vits)")
args = parser.parse_args()

try:
@@ -1159,6 +1165,11 @@ def main():
tts = TTS("tts_models/en/ljspeech/vits")
tts = tts.to(device)
tts.tts_to_file(text=args.text, file_path=args.output)
elif args.model == "vctk_vits":
# VCTK VITS multi-speaker model - clear, professional voices
tts = TTS("tts_models/en/vctk/vits")
tts = tts.to(device)
tts.tts_to_file(text=args.text, file_path=args.output, speaker=args.speaker)
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")
tts = tts.to(device)
@@ -1186,10 +1197,10 @@ import argparse
def main():
parser = argparse.ArgumentParser(description="Coqui TTS Server")
parser.add_argument("--socket", required=True, help="Unix socket path")
parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "cpu", "mps"])
parser.add_argument("--voice-ref", help="Default reference voice (for XTTS)")
parser.add_argument("--speaker", default="Ana Florence", help="Default XTTS speaker")
parser.add_argument("--speaker", default="p226", help="Default speaker ID (e.g., 'p226' for vctk_vits)")
parser.add_argument("--language", default="en", help="Default language")
args = parser.parse_args()

@@ -1222,6 +1233,8 @@ def main():
tts = TTS("tts_models/en/multi-dataset/tortoise-v2")
elif args.model == "vits":
tts = TTS("tts_models/en/ljspeech/vits")
elif args.model == "vctk_vits":
tts = TTS("tts_models/en/vctk/vits")
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")

@@ -1265,6 +1278,9 @@ def main():
tts.tts_to_file(text=text, file_path=output, speaker_wav=voice_ref, language=language)
else:
tts.tts_to_file(text=text, file_path=output, speaker=speaker, language=language)
elif args.model in ("vctk_vits",):
# Multi-speaker models use speaker ID
tts.tts_to_file(text=text, file_path=output, speaker=speaker)
else:
tts.tts_to_file(text=text, file_path=output)
