diff --git a/AGENTS.md b/AGENTS.md index 73b7f34..1083805 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -283,17 +283,18 @@ Task patterns allow query-based customization. Each pattern has: ## TTS Plugin (`tts.ts`) ### Overview -Reads the final agent response aloud when a session completes. Supports two engines: -- **OS TTS**: Native macOS `say` command (default, instant) -- **Chatterbox**: High-quality neural TTS with voice cloning +Reads the final agent response aloud when a session completes. Supports three engines: +- **Coqui TTS**: High-quality neural TTS (default) - Model: `tts_models/en/vctk/vits` with p226 voice +- **OS TTS**: Native macOS `say` command (instant, no setup) +- **Chatterbox**: Alternative neural TTS with voice cloning ### Features -- **Dual engine support**: OS TTS (instant) or Chatterbox (high quality) -- **Server mode**: Chatterbox model stays loaded for fast subsequent requests -- **Shared server**: Single Chatterbox instance shared across all OpenCode sessions +- **Multiple engine support**: Coqui TTS (recommended), OS TTS (instant), Chatterbox +- **Server mode**: TTS model stays loaded for fast subsequent requests +- **Shared server**: Single TTS instance shared across all OpenCode sessions - **Lock mechanism**: Prevents multiple server startups from concurrent sessions - **Device auto-detection**: Supports CUDA, MPS (Apple Silicon), CPU -- **Turbo model**: 10x faster Chatterbox inference +- **Multi-speaker support**: Coqui VCTK model supports 109 speakers (p226 default) - Cleans markdown/code from text before speaking - Truncates long messages (1000 char limit) - Skips judge/reflection sessions @@ -304,11 +305,17 @@ Edit `~/.config/opencode/tts.json`: ```json { "enabled": true, - "engine": "chatterbox", + "engine": "coqui", "os": { "voice": "Samantha", "rate": 200 }, + "coqui": { + "model": "vctk_vits", + "device": "mps", + "speaker": "p226", + "serverMode": true + }, "chatterbox": { "device": "mps", "useTurbo": true, @@ -318,14 +325,24 @@ 
Edit `~/.config/opencode/tts.json`: } ``` -### Chatterbox Server Files -Located in `~/.config/opencode/opencode-helpers/chatterbox/`: +### Coqui TTS Models +| Model | Description | Speed | +|-------|-------------|-------| +| `vctk_vits` | Multi-speaker VITS (109 speakers, p226 recommended) | Fast | +| `vits` | LJSpeech single speaker | Fast | +| `jenny` | Jenny voice | Medium | +| `xtts_v2` | XTTS with voice cloning | Slower | +| `bark` | Multilingual neural TTS | Slower | +| `tortoise` | Very high quality | Very slow | + +### Coqui Server Files +Located in `~/.config/opencode/opencode-helpers/coqui/`: - `tts.py` - One-shot TTS script - `tts_server.py` - Persistent server script - `tts.sock` - Unix socket for IPC - `server.pid` - Running server PID - `server.lock` - Startup lock file -- `venv/` - Python virtualenv with chatterbox-tts +- `venv/` - Python virtualenv with TTS package ### Testing ```bash @@ -335,14 +352,14 @@ npm run test:tts:manual # Actually speaks test phrases ### Debugging ```bash -# Check if Chatterbox server is running -ls -la ~/.config/opencode/opencode-helpers/chatterbox/tts.sock +# Check if Coqui server is running +ls -la ~/.config/opencode/opencode-helpers/coqui/tts.sock # Check server PID -cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid +cat ~/.config/opencode/opencode-helpers/coqui/server.pid # Stop server manually -kill $(cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid) +kill $(cat ~/.config/opencode/opencode-helpers/coqui/server.pid) # Check server logs (stderr) # Server automatically restarts on next TTS request diff --git a/README.md b/README.md index fba616d..7a67fe8 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ This plugin adds a **judge layer** that automatically evaluates task completion - **Automatic task verification** - Judge evaluates completion after each agent response - **Self-healing workflow** - Agent receives feedback and continues if work is incomplete - **Telegram notifications** - 
Get notified when tasks finish, reply via text or voice -- **Local TTS** - Hear responses read aloud (Coqui XTTS, Chatterbox, macOS) +- **Local TTS** - Hear responses read aloud (Coqui VCTK/VITS, Chatterbox, macOS) - **Voice-to-text** - Reply to Telegram with voice messages, transcribed by local Whisper ## Quick Install @@ -152,10 +152,92 @@ Text-to-speech with Telegram integration for remote notifications and two-way co | Engine | Quality | Speed | Setup | |--------|---------|-------|-------| -| **Coqui XTTS v2** | Excellent | 2-5s | Auto-installed, Python 3.9+ | +| **Coqui TTS** | Excellent | Fast-Medium | Auto-installed, Python 3.9-3.11 | | **Chatterbox** | Excellent | 2-5s | Auto-installed, Python 3.11 | | **macOS say** | Good | Instant | None | +### Coqui TTS Models + +| Model | Description | Multi-Speaker | Speed | +|-------|-------------|---------------|-------| +| `vctk_vits` | VCTK VITS (109 speakers, **recommended**) | Yes (p226 default) | Fast | +| `vits` | LJSpeech single speaker | No | Fast | +| `jenny` | Jenny voice | No | Medium | +| `xtts_v2` | XTTS v2 with voice cloning | Yes (via voiceRef) | Slower | +| `bark` | Multilingual neural TTS | No | Slower | +| `tortoise` | Very high quality | No | Very slow | + +**Recommended**: `vctk_vits` with speaker `p226` (clear, professional British male voice) + +### VCTK Speakers (vctk_vits model) + +The VCTK corpus contains 109 speakers with various English accents. Speaker IDs are in format `pXXX`. 
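Because a wrong speaker ID only fails once the model is invoked, the `pXXX` format can be checked up front. A minimal sketch (the `is_vctk_speaker` helper is hypothetical, not part of the plugin; it validates the ID format only, not whether the model actually ships that speaker):

```python
import re

# VCTK speaker IDs are "p" plus three digits, e.g. "p226"; the released
# speaker list also contains one special entry, "ED".
VCTK_ID = re.compile(r"p\d{3}|ED")

def is_vctk_speaker(speaker: str) -> bool:
    """True if `speaker` matches the VCTK ID format (format check only)."""
    return VCTK_ID.fullmatch(speaker) is not None
```

Catching a typo like `"P226"` or a leftover macOS voice name like `"Samantha"` here gives a clearer error than a model-side failure.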
+ +**Popular speaker choices:** + +| Speaker | Gender | Accent | Description | +|---------|--------|--------|-------------| +| `p226` | Male | English | Clear, professional (recommended) | +| `p225` | Female | English | Clear, neutral | +| `p227` | Male | English | Deep voice | +| `p228` | Female | English | Warm tone | +| `p229` | Female | English | Higher pitch | +| `p230` | Female | English | Soft voice | +| `p231` | Male | English | Standard | +| `p232` | Male | English | Casual | +| `p233` | Female | Scottish | Scottish accent | +| `p234` | Female | Scottish | Scottish accent | +| `p236` | Female | English | Professional | +| `p237` | Male | Scottish | Scottish accent | +| `p238` | Female | N. Irish | Northern Irish | +| `p239` | Female | English | Young voice | +| `p240` | Female | English | Mature voice | +| `p241` | Male | Scottish | Scottish accent | +| `p243` | Male | English | Deep, authoritative | +| `p244` | Female | English | Bright voice | +| `p245` | Male | Irish | Irish accent | +| `p246` | Male | Scottish | Scottish accent | +| `p247` | Male | Scottish | Scottish accent | +| `p248` | Female | Indian | Indian English | +| `p249` | Female | Scottish | Scottish accent | +| `p250` | Female | English | Standard | +| `p251` | Male | Indian | Indian English | + +
+All 109 VCTK speakers + +``` +p225, p226, p227, p228, p229, p230, p231, p232, p233, p234, +p236, p237, p238, p239, p240, p241, p243, p244, p245, p246, +p247, p248, p249, p250, p251, p252, p253, p254, p255, p256, +p257, p258, p259, p260, p261, p262, p263, p264, p265, p266, +p267, p268, p269, p270, p271, p272, p273, p274, p275, p276, +p277, p278, p279, p280, p281, p282, p283, p284, p285, p286, +p287, p288, p292, p293, p294, p295, p297, p298, p299, p300, +p301, p302, p303, p304, p305, p306, p307, p308, p310, p311, +p312, p313, p314, p316, p317, p318, p323, p326, p329, p330, +p333, p334, p335, p336, p339, p340, p341, p343, p345, p347, +p351, p360, p361, p362, p363, p364, p374, p376, ED +``` + +
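To try one of the voices above, only the `speaker` field in `~/.config/opencode/tts.json` needs to change. For example, switching to the Scottish female voice `p233` (a config fragment, assuming the defaults shown elsewhere in this README):

```json
{
  "engine": "coqui",
  "coqui": {
    "model": "vctk_vits",
    "speaker": "p233",
    "serverMode": true
  }
}
```

VCTK VITS selects the speaker at synthesis time, so with `serverMode` on a running server should not need a restart to pick up a new speaker.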
+ +### XTTS v2 Speakers + +XTTS v2 is primarily a voice cloning model. Use the `voiceRef` option to clone any voice: + +```json +{ + "coqui": { + "model": "xtts_v2", + "voiceRef": "/path/to/reference-voice.wav", + "language": "en" + } +} +``` + +Supported languages: `en`, `es`, `fr`, `de`, `it`, `pt`, `pl`, `tr`, `ru`, `nl`, `cs`, `ar`, `zh-cn`, `ja`, `hu`, `ko` + ### Configuration `~/.config/opencode/tts.json`: @@ -165,10 +247,21 @@ Text-to-speech with Telegram integration for remote notifications and two-way co "enabled": true, "engine": "coqui", "coqui": { - "model": "xtts_v2", + "model": "vctk_vits", "device": "mps", + "speaker": "p226", "serverMode": true }, + "os": { + "voice": "Samantha", + "rate": 200 + }, + "chatterbox": { + "device": "mps", + "useTurbo": true, + "serverMode": true, + "exaggeration": 0.5 + }, "telegram": { "enabled": true, "uuid": "", @@ -179,12 +272,49 @@ Text-to-speech with Telegram integration for remote notifications and two-way co } ``` +### Configuration Options + +#### Engine Selection + +| Option | Description | +|--------|-------------| +| `engine` | `"coqui"` (default), `"chatterbox"`, or `"os"` | + +#### Coqui Options (`coqui`) + +| Option | Description | Default | +|--------|-------------|---------| +| `model` | TTS model (see table above) | `"vctk_vits"` | +| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect | +| `speaker` | Speaker ID for multi-speaker models | `"p226"` | +| `serverMode` | Keep model loaded for fast requests | `true` | +| `voiceRef` | Path to voice clip for cloning (XTTS) | - | +| `language` | Language code for XTTS | `"en"` | + +#### Chatterbox Options (`chatterbox`) + +| Option | Description | Default | +|--------|-------------|---------| +| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect | +| `useTurbo` | Use Turbo model (10x faster) | `true` | +| `serverMode` | Keep model loaded | `true` | +| `exaggeration` | Emotion level (0.0-1.0) | `0.5` | +| `voiceRef` | Path to voice clip for cloning | - 
| + +#### OS TTS Options (`os`) + +| Option | Description | Default | +|--------|-------------|---------| +| `voice` | macOS voice name (run `say -v ?` to list) | `"Samantha"` | +| `rate` | Words per minute | `200` | + ### Toggle Commands ``` /tts Toggle on/off /tts on Enable /tts off Disable +/tts status Check current state ``` --- @@ -469,7 +599,7 @@ npm run test:tts:manual ## Requirements - OpenCode v1.0+ -- **TTS**: macOS (for `say`), Python 3.9+ (Coqui), Python 3.11 (Chatterbox) +- **TTS**: macOS (for `say`), Python 3.9-3.11 (Coqui), Python 3.11 (Chatterbox) - **Telegram voice**: ffmpeg (`brew install ffmpeg`) - **Dependencies**: `bun` (OpenCode installs deps from package.json) diff --git a/telegram.ts b/telegram.ts index c1ec857..44be54f 100644 --- a/telegram.ts +++ b/telegram.ts @@ -688,7 +688,7 @@ async function transcribeAudio( } try { - const response = await fetch(`http://127.0.0.1:${port}/transcribe-base64`, { + const response = await fetch(`http://127.0.0.1:${port}/transcribe`, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ diff --git a/tts.ts b/tts.ts index a219017..dcd0a18 100644 --- a/tts.ts +++ b/tts.ts @@ -205,7 +205,13 @@ const REFLECTION_POLL_INTERVAL_MS = 500 // Poll interval for verdict file type TTSEngine = "coqui" | "chatterbox" | "os" // Coqui TTS model types -type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "jenny" +// - bark: Multilingual neural TTS (slower, higher quality) +// - xtts_v2: XTTS v2 with voice cloning support +// - tortoise: Very high quality but slow +// - vits: Fast VITS model (LJSpeech single speaker) +// - vctk_vits: VCTK multi-speaker VITS (supports speaker selection, e.g., p226) +// - jenny: Jenny voice model +type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "vctk_vits" | "jenny" interface TTSConfig { enabled?: boolean @@ -215,14 +221,14 @@ interface TTSConfig { voice?: string // Voice name (e.g., "Samantha", "Alex"). 
Run `say -v ?` on macOS to list voices rate?: number // Speaking rate in words per minute (default: 200) } - // Coqui TTS options (supports bark, xtts_v2, tortoise, vits, etc.) + // Coqui TTS options (supports bark, xtts_v2, tortoise, vits, vctk_vits, etc.) coqui?: { - model?: CoquiModel // Model to use: "bark", "xtts_v2", "tortoise", "vits" (default: "xtts_v2") + model?: CoquiModel // Model to use: "vctk_vits" (recommended), "xtts_v2", "vits", etc. device?: "cuda" | "cpu" | "mps" // GPU, CPU, or Apple Silicon (default: auto-detect) // XTTS-specific options voiceRef?: string // Path to reference voice clip for cloning (XTTS) language?: string // Language code for XTTS (default: "en") - speaker?: string // Speaker name for XTTS (default: "Ana Florence") + speaker?: string // Speaker name/ID (e.g., "p226" for vctk_vits, "Ana Florence" for xtts) serverMode?: boolean // Keep model loaded for fast subsequent requests (default: true) } // Chatterbox-specific options @@ -337,9 +343,9 @@ async function loadConfig(): Promise { enabled: true, engine: "coqui", coqui: { - model: "xtts_v2", + model: "vctk_vits", device: "mps", - language: "en", + speaker: "p226", serverMode: true }, os: { @@ -1103,11 +1109,11 @@ def main(): parser = argparse.ArgumentParser(description="Coqui TTS") parser.add_argument("text", help="Text to synthesize") parser.add_argument("--output", "-o", required=True, help="Output WAV file") - parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"]) + parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"]) parser.add_argument("--device", default="cuda", choices=["cuda", "mps", "cpu"]) parser.add_argument("--voice-ref", help="Reference voice audio path (for XTTS voice cloning)") parser.add_argument("--language", default="en", help="Language code (for XTTS)") - parser.add_argument("--speaker", default="Ana Florence", help="Speaker name for XTTS 
(e.g., 'Ana Florence', 'Claribel Dervla')") + parser.add_argument("--speaker", default="p226", help="Speaker ID for multi-speaker models (e.g., 'p226' for vctk_vits)") args = parser.parse_args() try: @@ -1159,6 +1165,11 @@ def main(): tts = TTS("tts_models/en/ljspeech/vits") tts = tts.to(device) tts.tts_to_file(text=args.text, file_path=args.output) + elif args.model == "vctk_vits": + # VCTK VITS multi-speaker model - clear, professional voices + tts = TTS("tts_models/en/vctk/vits") + tts = tts.to(device) + tts.tts_to_file(text=args.text, file_path=args.output, speaker=args.speaker) elif args.model == "jenny": tts = TTS("tts_models/en/jenny/jenny") tts = tts.to(device) @@ -1186,10 +1197,10 @@ import argparse def main(): parser = argparse.ArgumentParser(description="Coqui TTS Server") parser.add_argument("--socket", required=True, help="Unix socket path") - parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"]) + parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"]) parser.add_argument("--device", default="cuda", choices=["cuda", "cpu", "mps"]) parser.add_argument("--voice-ref", help="Default reference voice (for XTTS)") - parser.add_argument("--speaker", default="Ana Florence", help="Default XTTS speaker") + parser.add_argument("--speaker", default="p226", help="Default speaker ID (e.g., 'p226' for vctk_vits)") parser.add_argument("--language", default="en", help="Default language") args = parser.parse_args() @@ -1222,6 +1233,8 @@ def main(): tts = TTS("tts_models/en/multi-dataset/tortoise-v2") elif args.model == "vits": tts = TTS("tts_models/en/ljspeech/vits") + elif args.model == "vctk_vits": + tts = TTS("tts_models/en/vctk/vits") elif args.model == "jenny": tts = TTS("tts_models/en/jenny/jenny") @@ -1265,6 +1278,9 @@ def main(): tts.tts_to_file(text=text, file_path=output, speaker_wav=voice_ref, language=language) else: 
tts.tts_to_file(text=text, file_path=output, speaker=speaker, language=language) + elif args.model in ("vctk_vits",): + # Multi-speaker models use speaker ID + tts.tts_to_file(text=text, file_path=output, speaker=speaker) else: tts.tts_to_file(text=text, file_path=output)
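Taken together, the per-model branches added to `tts.py` and `tts_server.py` reduce to a name-to-checkpoint lookup plus a flag for whether synthesis needs a `speaker` argument. A minimal sketch using only the model paths visible in this diff (the `resolve_model` helper is illustrative, not part of the plugin; paths for `bark` and `xtts_v2` are not visible in this hunk and are omitted):

```python
# Coqui model IDs as they appear in the embedded helper scripts.
MODEL_PATHS = {
    "vits": "tts_models/en/ljspeech/vits",                   # LJSpeech, single speaker
    "vctk_vits": "tts_models/en/vctk/vits",                  # VCTK, 109 speakers
    "jenny": "tts_models/en/jenny/jenny",                    # Jenny voice
    "tortoise": "tts_models/en/multi-dataset/tortoise-v2",   # very slow, high quality
}

# Multi-speaker models require a `speaker` argument at synthesis time.
MULTI_SPEAKER = {"vctk_vits"}

def resolve_model(name: str) -> tuple[str, bool]:
    """Map a config model name to (Coqui model path, needs_speaker)."""
    return MODEL_PATHS[name], name in MULTI_SPEAKER
```

Centralizing the mapping this way keeps the one-shot script and the server from drifting apart when a new model is added.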