Merged
47 changes: 32 additions & 15 deletions AGENTS.md
@@ -283,17 +283,18 @@ Task patterns allow query-based customization. Each pattern has:
## TTS Plugin (`tts.ts`)

### Overview
Reads the final agent response aloud when a session completes. Supports two engines:
- **OS TTS**: Native macOS `say` command (default, instant)
- **Chatterbox**: High-quality neural TTS with voice cloning
Reads the final agent response aloud when a session completes. Supports three engines:
- **Coqui TTS**: High-quality neural TTS (default) - Model: `tts_models/en/vctk/vits` with p226 voice
- **OS TTS**: Native macOS `say` command (instant, no setup)
- **Chatterbox**: Alternative neural TTS with voice cloning

### Features
- **Dual engine support**: OS TTS (instant) or Chatterbox (high quality)
- **Server mode**: Chatterbox model stays loaded for fast subsequent requests
- **Shared server**: Single Chatterbox instance shared across all OpenCode sessions
- **Multiple engine support**: Coqui TTS (recommended), OS TTS (instant), Chatterbox
- **Server mode**: TTS model stays loaded for fast subsequent requests
- **Shared server**: Single TTS instance shared across all OpenCode sessions
- **Lock mechanism**: Prevents multiple server startups from concurrent sessions
- **Device auto-detection**: Supports CUDA, MPS (Apple Silicon), CPU
- **Turbo model**: 10x faster Chatterbox inference
- **Multi-speaker support**: Coqui VCTK model supports 109 speakers (p226 default)
- Cleans markdown/code from text before speaking
- Truncates long messages (1000 char limit)
- Skips judge/reflection sessions
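
The markdown cleanup and 1000-character truncation listed above can be sketched as follows. The function name and exact regexes are illustrative only, not the plugin's actual implementation in `tts.ts`:

```python
import re

TTS_CHAR_LIMIT = 1000  # mirrors the plugin's 1000-char truncation

def clean_for_speech(text: str, limit: int = TTS_CHAR_LIMIT) -> str:
    """Strip markdown/code so the TTS engine reads plain prose."""
    # Drop fenced code blocks entirely
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Drop inline-code backticks but keep their content
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Unwrap bold/italic markers
    text = re.sub(r"[*_]{1,3}([^*_]+)[*_]{1,3}", r"\1", text)
    # Strip heading markers and markdown link syntax
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Collapse whitespace, then truncate
    text = re.sub(r"\s+", " ", text).strip()
    return text[:limit]
```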
@@ -304,11 +305,17 @@ Edit `~/.config/opencode/tts.json`:
```json
{
"enabled": true,
"engine": "chatterbox",
"engine": "coqui",
"os": {
"voice": "Samantha",
"rate": 200
},
"coqui": {
"model": "vctk_vits",
"device": "mps",
"speaker": "p226",
"serverMode": true
},
"chatterbox": {
"device": "mps",
"useTurbo": true,
@@ -318,14 +325,24 @@ Edit `~/.config/opencode/tts.json`:
}
```
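
Resolution of this file can be approximated as a shallow per-section merge over the plugin defaults (those defaults appear in the `loadConfig` hunk of `tts.ts` later in this diff). The helper below is an illustrative Python sketch, not the plugin's own loader:

```python
import json
from pathlib import Path

# Defaults mirroring the loadConfig() values in tts.ts
DEFAULTS = {
    "enabled": True,
    "engine": "coqui",
    "coqui": {"model": "vctk_vits", "device": "mps", "speaker": "p226", "serverMode": True},
    "os": {"voice": "Samantha", "rate": 200},
}

def load_tts_config(path: Path = Path.home() / ".config/opencode/tts.json") -> dict:
    """Merge user settings over defaults; a missing file yields the defaults."""
    cfg = {k: (dict(v) if isinstance(v, dict) else v) for k, v in DEFAULTS.items()}
    if path.exists():
        user = json.loads(path.read_text())
        for key, value in user.items():
            if isinstance(value, dict) and isinstance(cfg.get(key), dict):
                cfg[key].update(value)  # per-section shallow merge
            else:
                cfg[key] = value
    return cfg
```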

### Chatterbox Server Files
Located in `~/.config/opencode/opencode-helpers/chatterbox/`:
### Coqui TTS Models
| Model | Description | Speed |
|-------|-------------|-------|
| `vctk_vits` | Multi-speaker VITS (109 speakers, p226 recommended) | Fast |
| `vits` | LJSpeech single speaker | Fast |
| `jenny` | Jenny voice | Medium |
| `xtts_v2` | XTTS with voice cloning | Slower |
| `bark` | Multilingual neural TTS | Slower |
| `tortoise` | Very high quality | Very slow |

### Coqui Server Files
Located in `~/.config/opencode/opencode-helpers/coqui/`:
- `tts.py` - One-shot TTS script
- `tts_server.py` - Persistent server script
- `tts.sock` - Unix socket for IPC
- `server.pid` - Running server PID
- `server.lock` - Startup lock file
- `venv/` - Python virtualenv with chatterbox-tts
- `venv/` - Python virtualenv with TTS package
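
The `tts.sock` entry is the IPC channel between the plugin and `tts_server.py`. The diff does not show the wire format, so the JSON-lines framing and field names below are assumptions for illustration; only the socket path comes from the file list above:

```python
import json
import os
import socket

SOCKET = os.path.expanduser("~/.config/opencode/opencode-helpers/coqui/tts.sock")

def build_request(text: str, output: str, speaker: str = "p226") -> bytes:
    """Encode one synthesis request as newline-delimited JSON (assumed framing)."""
    return (json.dumps({"text": text, "output": output, "speaker": speaker}) + "\n").encode()

def synthesize(text: str, output: str) -> None:
    """Send a request to the persistent server over the Unix socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET)
        s.sendall(build_request(text, output))
        s.recv(4096)  # wait for the server's acknowledgement
```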

### Testing
```bash
@@ -335,14 +352,14 @@ npm run test:tts:manual # Actually speaks test phrases

### Debugging
```bash
# Check if Chatterbox server is running
ls -la ~/.config/opencode/opencode-helpers/chatterbox/tts.sock
# Check if Coqui server is running
ls -la ~/.config/opencode/opencode-helpers/coqui/tts.sock

# Check server PID
cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid
cat ~/.config/opencode/opencode-helpers/coqui/server.pid

# Stop server manually
kill $(cat ~/.config/opencode/opencode-helpers/chatterbox/server.pid)
kill $(cat ~/.config/opencode/opencode-helpers/coqui/server.pid)

# Check server logs (stderr)
# Server automatically restarts on next TTS request
138 changes: 134 additions & 4 deletions README.md
@@ -29,7 +29,7 @@ This plugin adds a **judge layer** that automatically evaluates task completion
- **Automatic task verification** - Judge evaluates completion after each agent response
- **Self-healing workflow** - Agent receives feedback and continues if work is incomplete
- **Telegram notifications** - Get notified when tasks finish, reply via text or voice
- **Local TTS** - Hear responses read aloud (Coqui XTTS, Chatterbox, macOS)
- **Local TTS** - Hear responses read aloud (Coqui VCTK/VITS, Chatterbox, macOS)
- **Voice-to-text** - Reply to Telegram with voice messages, transcribed by local Whisper

## Quick Install
@@ -152,10 +152,92 @@ Text-to-speech with Telegram integration for remote notifications and two-way co

| Engine | Quality | Speed | Setup |
|--------|---------|-------|-------|
| **Coqui XTTS v2** | Excellent | 2-5s | Auto-installed, Python 3.9+ |
| **Coqui TTS** | Excellent | Fast-Medium | Auto-installed, Python 3.9-3.11 |
| **Chatterbox** | Excellent | 2-5s | Auto-installed, Python 3.11 |
| **macOS say** | Good | Instant | None |

### Coqui TTS Models

| Model | Description | Multi-Speaker | Speed |
|-------|-------------|---------------|-------|
| `vctk_vits` | VCTK VITS (109 speakers, **recommended**) | Yes (p226 default) | Fast |
| `vits` | LJSpeech single speaker | No | Fast |
| `jenny` | Jenny voice | No | Medium |
| `xtts_v2` | XTTS v2 with voice cloning | Yes (via voiceRef) | Slower |
| `bark` | Multilingual neural TTS | No | Slower |
| `tortoise` | Very high quality | No | Very slow |

**Recommended**: `vctk_vits` with speaker `p226` (clear, professional British male voice)

### VCTK Speakers (vctk_vits model)

The VCTK corpus contains 109 speakers with a range of English accents. Speaker IDs follow the `pXXX` format.

**Popular speaker choices:**

| Speaker | Gender | Accent | Description |
|---------|--------|--------|-------------|
| `p226` | Male | English | Clear, professional (recommended) |
| `p225` | Female | English | Clear, neutral |
| `p227` | Male | English | Deep voice |
| `p228` | Female | English | Warm tone |
| `p229` | Female | English | Higher pitch |
| `p230` | Female | English | Soft voice |
| `p231` | Male | English | Standard |
| `p232` | Male | English | Casual |
| `p233` | Female | Scottish | Scottish accent |
| `p234` | Female | Scottish | Scottish accent |
| `p236` | Female | English | Professional |
| `p237` | Male | Scottish | Scottish accent |
| `p238` | Female | N. Irish | Northern Irish |
| `p239` | Female | English | Young voice |
| `p240` | Female | English | Mature voice |
| `p241` | Male | Scottish | Scottish accent |
| `p243` | Male | English | Deep, authoritative |
| `p244` | Female | English | Bright voice |
| `p245` | Male | Irish | Irish accent |
| `p246` | Male | Scottish | Scottish accent |
| `p247` | Male | Scottish | Scottish accent |
| `p248` | Female | Indian | Indian English |
| `p249` | Female | Scottish | Scottish accent |
| `p250` | Female | English | Standard |
| `p251` | Male | Indian | Indian English |

<details>
<summary>All 109 VCTK speakers</summary>

```
p225, p226, p227, p228, p229, p230, p231, p232, p233, p234,
p236, p237, p238, p239, p240, p241, p243, p244, p245, p246,
p247, p248, p249, p250, p251, p252, p253, p254, p255, p256,
p257, p258, p259, p260, p261, p262, p263, p264, p265, p266,
p267, p268, p269, p270, p271, p272, p273, p274, p275, p276,
p277, p278, p279, p280, p281, p282, p283, p284, p285, p286,
p287, p288, p292, p293, p294, p295, p297, p298, p299, p300,
p301, p302, p303, p304, p305, p306, p307, p308, p310, p311,
p312, p313, p314, p316, p317, p318, p323, p326, p329, p330,
p333, p334, p335, p336, p339, p340, p341, p343, p345, p347,
p351, p360, p361, p362, p363, p364, p374, p376, ED
```

</details>
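
Since the corpus skips some numbers (there is no `p235` or `p242`, and `ED` is the lone non-numeric entry), a cheap shape check before writing `tts.json` can catch typos early. A hypothetical helper, not part of the plugin:

```python
import re

# VCTK IDs are "p" plus exactly three digits; "ED" is the one exception
_VCTK_ID = re.compile(r"^p\d{3}$")

def is_vctk_speaker_id(speaker: str) -> bool:
    """Check the ID's shape only; this does not verify the ID exists in the corpus."""
    return speaker == "ED" or bool(_VCTK_ID.match(speaker))
```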

### XTTS v2 Speakers

XTTS v2 is primarily a voice cloning model. Use the `voiceRef` option to clone any voice:

```json
{
"coqui": {
"model": "xtts_v2",
"voiceRef": "/path/to/reference-voice.wav",
"language": "en"
}
}
```

Supported languages: `en`, `es`, `fr`, `de`, `it`, `pt`, `pl`, `tr`, `ru`, `nl`, `cs`, `ar`, `zh-cn`, `ja`, `hu`, `ko`

### Configuration

`~/.config/opencode/tts.json`:
@@ -165,10 +247,21 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
"enabled": true,
"engine": "coqui",
"coqui": {
"model": "xtts_v2",
"model": "vctk_vits",
"device": "mps",
"speaker": "p226",
"serverMode": true
},
"os": {
"voice": "Samantha",
"rate": 200
},
"chatterbox": {
"device": "mps",
"useTurbo": true,
"serverMode": true,
"exaggeration": 0.5
},
"telegram": {
"enabled": true,
"uuid": "<your-uuid>",
@@ -179,12 +272,49 @@ Text-to-speech with Telegram integration for remote notifications and two-way co
}
```

### Configuration Options

#### Engine Selection

| Option | Description |
|--------|-------------|
| `engine` | `"coqui"` (default), `"chatterbox"`, or `"os"` |

#### Coqui Options (`coqui`)

| Option | Description | Default |
|--------|-------------|---------|
| `model` | TTS model (see table above) | `"vctk_vits"` |
| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
| `speaker` | Speaker ID for multi-speaker models | `"p226"` |
| `serverMode` | Keep model loaded for fast requests | `true` |
| `voiceRef` | Path to voice clip for cloning (XTTS) | - |
| `language` | Language code for XTTS | `"en"` |
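
These options correspond to the flags `tts.py` declares via `argparse` (visible later in this diff). A sketch of the mapping (purely illustrative; the real translation lives in `tts.ts`):

```python
def coqui_args(text: str, output: str, cfg: dict) -> list:
    """Map tts.json "coqui" options onto the flags tts.py accepts."""
    argv = [
        text, "--output", output,
        "--model", cfg.get("model", "vctk_vits"),
        "--speaker", cfg.get("speaker", "p226"),
    ]
    if cfg.get("device"):
        argv += ["--device", cfg["device"]]
    if cfg.get("voiceRef"):  # XTTS voice cloning
        argv += ["--voice-ref", cfg["voiceRef"]]
    if cfg.get("language"):
        argv += ["--language", cfg["language"]]
    return argv
```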

#### Chatterbox Options (`chatterbox`)

| Option | Description | Default |
|--------|-------------|---------|
| `device` | `"cuda"`, `"mps"`, or `"cpu"` | auto-detect |
| `useTurbo` | Use Turbo model (10x faster) | `true` |
| `serverMode` | Keep model loaded | `true` |
| `exaggeration` | Emotion level (0.0-1.0) | `0.5` |
| `voiceRef` | Path to voice clip for cloning | - |

#### OS TTS Options (`os`)

| Option | Description | Default |
|--------|-------------|---------|
| `voice` | macOS voice name (run `say -v ?` to list) | `"Samantha"` |
| `rate` | Words per minute | `200` |

### Toggle Commands

```
/tts Toggle on/off
/tts on Enable
/tts off Disable
/tts status Check current state
```
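
The toggle semantics above can be modeled in a few lines; this parser is a hypothetical sketch, not the plugin's command handler:

```python
def parse_tts_command(raw: str, current: bool) -> bool:
    """Resolve a /tts subcommand against the current enabled state."""
    parts = raw.strip().split()
    if not parts or parts[0] != "/tts":
        return current
    if len(parts) == 1:
        return not current  # bare /tts toggles
    # "status" (or anything unrecognized) leaves the state unchanged
    return {"on": True, "off": False}.get(parts[1], current)
```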

---
@@ -469,7 +599,7 @@ npm run test:tts:manual
## Requirements

- OpenCode v1.0+
- **TTS**: macOS (for `say`), Python 3.9+ (Coqui), Python 3.11 (Chatterbox)
- **TTS**: macOS (for `say`), Python 3.9-3.11 (Coqui), Python 3.11 (Chatterbox)
- **Telegram voice**: ffmpeg (`brew install ffmpeg`)
- **Dependencies**: `bun` (OpenCode installs deps from package.json)

2 changes: 1 addition & 1 deletion telegram.ts
@@ -688,7 +688,7 @@ async function transcribeAudio(
}

try {
const response = await fetch(`http://127.0.0.1:${port}/transcribe-base64`, {
const response = await fetch(`http://127.0.0.1:${port}/transcribe`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
36 changes: 26 additions & 10 deletions tts.ts
@@ -205,7 +205,13 @@ const REFLECTION_POLL_INTERVAL_MS = 500 // Poll interval for verdict file
type TTSEngine = "coqui" | "chatterbox" | "os"

// Coqui TTS model types
type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "jenny"
// - bark: Multilingual neural TTS (slower, higher quality)
// - xtts_v2: XTTS v2 with voice cloning support
// - tortoise: Very high quality but slow
// - vits: Fast VITS model (LJSpeech single speaker)
// - vctk_vits: VCTK multi-speaker VITS (supports speaker selection, e.g., p226)
// - jenny: Jenny voice model
type CoquiModel = "bark" | "xtts_v2" | "tortoise" | "vits" | "vctk_vits" | "jenny"

interface TTSConfig {
enabled?: boolean
@@ -215,14 +221,14 @@ interface TTSConfig {
voice?: string // Voice name (e.g., "Samantha", "Alex"). Run `say -v ?` on macOS to list voices
rate?: number // Speaking rate in words per minute (default: 200)
}
// Coqui TTS options (supports bark, xtts_v2, tortoise, vits, etc.)
// Coqui TTS options (supports bark, xtts_v2, tortoise, vits, vctk_vits, etc.)
coqui?: {
model?: CoquiModel // Model to use: "bark", "xtts_v2", "tortoise", "vits" (default: "xtts_v2")
model?: CoquiModel // Model to use: "vctk_vits" (recommended), "xtts_v2", "vits", etc.
device?: "cuda" | "cpu" | "mps" // GPU, CPU, or Apple Silicon (default: auto-detect)
// XTTS-specific options
voiceRef?: string // Path to reference voice clip for cloning (XTTS)
language?: string // Language code for XTTS (default: "en")
speaker?: string // Speaker name for XTTS (default: "Ana Florence")
speaker?: string // Speaker name/ID (e.g., "p226" for vctk_vits, "Ana Florence" for xtts)
serverMode?: boolean // Keep model loaded for fast subsequent requests (default: true)
}
// Chatterbox-specific options
@@ -337,9 +343,9 @@ async function loadConfig(): Promise<TTSConfig> {
enabled: true,
engine: "coqui",
coqui: {
model: "xtts_v2",
model: "vctk_vits",
device: "mps",
language: "en",
speaker: "p226",
serverMode: true
},
os: {
@@ -1103,11 +1109,11 @@ def main():
parser = argparse.ArgumentParser(description="Coqui TTS")
parser.add_argument("text", help="Text to synthesize")
parser.add_argument("--output", "-o", required=True, help="Output WAV file")
parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "mps", "cpu"])
parser.add_argument("--voice-ref", help="Reference voice audio path (for XTTS voice cloning)")
parser.add_argument("--language", default="en", help="Language code (for XTTS)")
parser.add_argument("--speaker", default="Ana Florence", help="Speaker name for XTTS (e.g., 'Ana Florence', 'Claribel Dervla')")
parser.add_argument("--speaker", default="p226", help="Speaker ID for multi-speaker models (e.g., 'p226' for vctk_vits)")
args = parser.parse_args()

try:
@@ -1159,6 +1165,11 @@ def main():
tts = TTS("tts_models/en/ljspeech/vits")
tts = tts.to(device)
tts.tts_to_file(text=args.text, file_path=args.output)
elif args.model == "vctk_vits":
# VCTK VITS multi-speaker model - clear, professional voices
tts = TTS("tts_models/en/vctk/vits")
tts = tts.to(device)
tts.tts_to_file(text=args.text, file_path=args.output, speaker=args.speaker)
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")
tts = tts.to(device)
@@ -1186,10 +1197,10 @@ import argparse
def main():
parser = argparse.ArgumentParser(description="Coqui TTS Server")
parser.add_argument("--socket", required=True, help="Unix socket path")
parser.add_argument("--model", default="xtts_v2", choices=["bark", "xtts_v2", "tortoise", "vits", "jenny"])
parser.add_argument("--model", default="vctk_vits", choices=["bark", "xtts_v2", "tortoise", "vits", "vctk_vits", "jenny"])
parser.add_argument("--device", default="cuda", choices=["cuda", "cpu", "mps"])
parser.add_argument("--voice-ref", help="Default reference voice (for XTTS)")
parser.add_argument("--speaker", default="Ana Florence", help="Default XTTS speaker")
parser.add_argument("--speaker", default="p226", help="Default speaker ID (e.g., 'p226' for vctk_vits)")
parser.add_argument("--language", default="en", help="Default language")
args = parser.parse_args()

@@ -1222,6 +1233,8 @@ def main():
tts = TTS("tts_models/en/multi-dataset/tortoise-v2")
elif args.model == "vits":
tts = TTS("tts_models/en/ljspeech/vits")
elif args.model == "vctk_vits":
tts = TTS("tts_models/en/vctk/vits")
elif args.model == "jenny":
tts = TTS("tts_models/en/jenny/jenny")

@@ -1265,6 +1278,9 @@ def main():
tts.tts_to_file(text=text, file_path=output, speaker_wav=voice_ref, language=language)
else:
tts.tts_to_file(text=text, file_path=output, speaker=speaker, language=language)
elif args.model in ("vctk_vits",):
# Multi-speaker models use speaker ID
tts.tts_to_file(text=text, file_path=output, speaker=speaker)
else:
tts.tts_to_file(text=text, file_path=output)
