Self-hosted text-to-speech microservice for the Cognitive Companion platform. Exposes an OpenAI-compatible API (/v1/audio/speech) and integrates with Home Assistant via the Wyoming protocol through a wyoming_openai sidecar.
- Multi-engine architecture: swap between TTS models via config
- OpenAI-compatible API (`POST /v1/audio/speech`): drop-in replacement for OpenAI TTS
- Home Assistant integration: Wyoming protocol bridge via `wyoming_openai` sidecar
- True token-level streaming: Svara streams audio as SNAC frames are decoded; Edge TTS proxies streaming from the remote service
- Voice cloning: upload reference audio samples (10-30s) for personalised voices
- Multi-language: English (Indian accent) and Tamil as primary languages
- GPU accelerated: runs on NVIDIA GPUs (DGX Spark, RTX 4080, etc.)
| Engine | Model | Languages | Voice Cloning | Streaming | VRAM | License |
|---|---|---|---|---|---|---|
| svara (default) | kenpath/svara-tts-v1 | English (IN), Tamil, Hindi, Telugu, + 10 more Indian languages | No | Yes (token-level) | ~8 GB (bf16) | Apache-2.0 |
| parler | ai4bharat/indic-parler-tts | English (IN), Tamil, Hindi, Telugu, + 7 more | No (text prompts) | Chunked | ~4 GB (fp16) | Apache-2.0 |
| fish_speech | fishaudio/s2-pro | 80+ languages incl. English, Tamil, Hindi | Yes | Chunked | ~8-12 GB (bf16) | Fish Audio Research* |
| seamless | facebook/seamless-m4t-v2-large | 36 languages incl. English, Tamil, Hindi, Telugu | No | No | ~6-8 GB (fp16) | CC-BY-NC-4.0* |
| edge_tts | openai-edge-tts | 40+ languages (Microsoft Edge voices) | No | Yes (proxied) | None (remote) | MIT |
\* fish_speech: Free for research and non-commercial use. Commercial use requires a licence from Fish Audio.
\* seamless: CC-BY-NC-4.0, non-commercial use only. Commercial use requires permission from Meta.
| Engine | Streaming Type | Notes |
|---|---|---|
| svara | Token-level | Batched SNAC decode. Tokens are grouped into 7-token frames; frames are accumulated into configurable batches (stream_frame_buffer, default 21 frames = ~210ms) and decoded together for clean audio. |
| parler | Chunked | Full generation then chunked output. |
| fish_speech | Chunked | Full generation then chunked output. |
| seamless | Not supported | Batch inference only. Falls back to chunked output when stream=true. |
| edge_tts | Proxied | Streams PCM chunks directly from the remote openai-edge-tts service. |
- Best quality for Indian English + Tamil: `svara` (3B-parameter model, excellent prosody)
- Most languages + voice cloning: `fish_speech` (80+ languages, high-quality cloning from short samples)
- Lower VRAM / emotion control: `parler` (uses text descriptions to control voice characteristics)
- Multilingual TTS (non-commercial): `seamless` (36 speech output languages via Meta's SeamlessM4T v2)
- No GPU / remote synthesis: `edge_tts` (pass-through to a remote openai-edge-tts service using Microsoft Edge TTS voices)
```bash
cp .env.example .env
docker compose up -d
```

First start downloads models (~6 GB). Subsequent starts use cached models.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[svara,dev]"
uvicorn app.main:app --host 0.0.0.0 --port 8600 --reload
```

Run the test suite:

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

Generate speech from text. Drop-in compatible with the OpenAI TTS API.
```bash
curl -X POST http://localhost:8600/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "svara",
    "input": "Hello, how are you today?",
    "voice": "speaker_0",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output speech.mp3
```

Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | `svara` | Engine: `svara`, `parler`, `fish_speech`, `seamless`, `edge_tts` |
| `input` | string | required | Text to synthesise (max 4096 chars) |
| `voice` | string | `default` | Voice ID or description (engine-specific) |
| `response_format` | string | `mp3` | Output format: `mp3`, `wav`, `opus`, `flac`, `pcm` |
| `speed` | float | `1.0` | Speed multiplier (0.25 to 4.0) |
| `language` | string | auto | Language code: `en`, `ta`, `hi`, etc. |
| `stream` | bool | `false` | Stream raw PCM chunks |
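The same request can be issued from any HTTP client. A minimal Python sketch using `requests` (the endpoint and fields are as documented above; the host and port assume the default local deployment):

```python
import requests

# Synthesise speech and save the result; mirrors the curl example above.
resp = requests.post(
    "http://localhost:8600/v1/audio/speech",
    json={
        "model": "svara",
        "input": "Hello, how are you today?",
        "voice": "speaker_0",
        "response_format": "mp3",
        "speed": 1.0,
    },
    timeout=120,  # the first request may be slow while the model warms up
)
resp.raise_for_status()
with open("speech.mp3", "wb") as f:
    f.write(resp.content)
```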
Streaming:
```bash
curl -X POST http://localhost:8600/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "svara", "input": "Hello", "stream": true}' \
  --output stream.pcm
```

Streaming returns raw 16-bit signed PCM at 24 kHz mono. Response headers include `X-Sample-Rate`, `X-Sample-Width`, and `X-Channels`.
With Svara, streaming uses true token-level generation: tokens are intercepted during model.generate(), accumulated into batches of SNAC frames, and decoded together for clean audio. The first audio arrives after ~210ms (21 frames), with subsequent batches following incrementally.
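From the 21-frame / ~210 ms figure above, each 7-token SNAC frame decodes to roughly 10 ms of audio. To consume the stream, read the body incrementally and wrap the raw PCM in a WAV container using the response headers. A sketch with `requests` and the standard-library `wave` module (the header names are documented above; treating `X-Sample-Width` as a byte count is an assumption):

```python
import wave

import requests

# Stream raw PCM from the service and write it to a playable WAV file.
with requests.post(
    "http://localhost:8600/v1/audio/speech",
    json={"model": "svara", "input": "Hello", "stream": True},
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    sample_rate = int(resp.headers.get("X-Sample-Rate", 24000))
    sample_width = int(resp.headers.get("X-Sample-Width", 2))  # assumed: bytes per sample
    channels = int(resp.headers.get("X-Channels", 1))

    with wave.open("stream.wav", "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # Each batch of decoded SNAC frames arrives as soon as it is ready.
        for chunk in resp.iter_content(chunk_size=4096):
            wav.writeframes(chunk)
```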
List available TTS engines:

```bash
curl http://localhost:8600/v1/models
```

List all voices (built-in + custom samples):

```bash
curl http://localhost:8600/api/v1/voices
```

Upload a voice reference sample for voice cloning:
```bash
curl -X POST http://localhost:8600/api/v1/voices/upload \
  -F "voice_id=grandma" \
  -F "name=Grandma's Voice" \
  -F "language=ta" \
  -F "description=Warm Tamil voice" \
  -F "file=@reference.wav"
```

Delete a custom voice sample.

Check service health:

```bash
curl http://localhost:8600/health
```

Returns GPU status, loaded engines, and voice sample count.
Engines that support voice cloning (fish_speech) can use reference audio samples to generate speech in a specific voice.
- Record a 10-30 second clear audio sample of the target voice
- Use WAV format, 16-bit, mono, 24 kHz (other formats are accepted but will be resampled; see the conversion sketch after this list)
- Ensure minimal background noise
- The speaker should talk naturally: reading a paragraph works well
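If a recording does not match that format, you can normalise it before uploading. A sketch using `pydub` (an assumption: any tool that produces 16-bit / 24 kHz / mono WAV works equally well; pydub requires ffmpeg, and the input filename here is hypothetical):

```python
from pydub import AudioSegment

# Convert an arbitrary recording to the recommended upload format:
# WAV, 16-bit (2 bytes per sample), mono, 24 kHz.
sample = AudioSegment.from_file("raw_recording.m4a")
sample = sample.set_frame_rate(24000).set_channels(1).set_sample_width(2)
sample.export("reference.wav", format="wav")
```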
Via API:
```bash
curl -X POST http://localhost:8600/api/v1/voices/upload \
  -F "voice_id=grandma" \
  -F "name=Grandma" \
  -F "language=ta" \
  -F "file=@grandma_sample.wav"
```

Via filesystem (for pre-provisioning):
Place samples in data/voice_samples/ with this structure:
```
data/voice_samples/
├── grandma/
│   ├── meta.json
│   └── reference.wav
└── caregiver/
    ├── meta.json
    └── reference.wav
```
meta.json format:
```json
{
  "name": "Grandma",
  "language": "ta",
  "description": "Warm Tamil voice for familiar interactions",
  "sample_rate": 24000
}
```
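For pre-provisioning at build or deploy time, the same layout can be generated with a few lines of Python (a sketch; the directory structure and `meta.json` fields follow the documentation above, and the source WAV filename is hypothetical):

```python
import json
import shutil
from pathlib import Path

# Pre-provision a voice sample under data/voice_samples/<voice_id>/.
voice_dir = Path("data/voice_samples/grandma")
voice_dir.mkdir(parents=True, exist_ok=True)

meta = {
    "name": "Grandma",
    "language": "ta",
    "description": "Warm Tamil voice for familiar interactions",
    "sample_rate": 24000,
}
(voice_dir / "meta.json").write_text(json.dumps(meta, ensure_ascii=False, indent=2))
shutil.copy("grandma_sample.wav", voice_dir / "reference.wav")
```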
Once uploaded, use the `voice_id` in synthesis requests:

```bash
curl -X POST http://localhost:8600/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "fish_speech", "input": "Hello paatti", "voice": "grandma"}' \
  --output speech.mp3
```

Home Assistant's voice pipeline uses the Wyoming protocol for STT and TTS. The wyoming_openai proxy bridges the TTS service's OpenAI-compatible API to Wyoming, allowing Home Assistant to use any of the configured TTS engines.
```bash
docker compose --profile wyoming up -d
```

This starts the wyoming-openai sidecar on port 10300. It connects to the TTS service at `http://tts:8600/v1` and exposes a Wyoming server.
- Go to Settings > Devices & Services > Add Integration
- Search for Wyoming Protocol
- Enter the host IP and port `10300`
- Home Assistant will discover the available TTS voices
The TTS engines then appear as voice options in any Home Assistant voice assistant pipeline.
The Wyoming sidecar is configured via environment variables in .env:
```
WYOMING_TTS_OPENAI_URL=http://tts:8600/v1
WYOMING_TTS_MODELS=svara,parler,fish_speech,seamless,edge_tts
WYOMING_TTS_VOICES=speaker_0,speaker_1,female_calm,male_clear,default
```

All configuration is in `config/settings.yaml` with `${ENV_VAR}` interpolation.
Edit config/settings.yaml:
```yaml
engines:
  enabled:
    - svara      # primary (GPU)
    - edge_tts   # fallback (remote)
  default: svara
```

You can load multiple engines simultaneously. The `model` field in the OpenAI API request selects which engine to use:
```bash
# Use svara (default)
curl -X POST .../v1/audio/speech -d '{"input": "Hello", "model": "svara"}'

# Use edge_tts (remote)
curl -X POST .../v1/audio/speech -d '{"input": "Hello", "model": "edge_tts"}'
```

| Engine | Voice Format | Examples |
|---|---|---|
| svara | Speaker tag | speaker_0 (female), speaker_1 (male) |
| parler | Preset name or free-text description | female_calm, female_tamil, male_clear, or any description string |
| fish_speech | default or custom voice sample ID | default, grandma (uploaded via voice cloning API) |
| seamless | Speaker index | default, speaker_0, speaker_1, speaker_2 |
| edge_tts | Edge TTS voice name | en-IN-NeerjaExpressiveNeural, ta-IN-PallaviNeural, hi-IN-SwaraNeural |
The Parler engine uses text descriptions to control voice characteristics. Built-in presets:
| Preset | Description |
|---|---|
| female_calm | Calm Indian English female, moderate pace, warm tone |
| male_clear | Clear Indian English male, steady pace, friendly tone |
| female_tamil | Gentle Tamil female, slow and clear |
| male_tamil | Clear Tamil male, moderate pace |
| female_elderly_friendly | Warm, gentle, slow: ideal for speaking to seniors |
Custom descriptions work too:
{"model": "parler", "voice": "A cheerful young woman speaks quickly with excitement"}Fish Speech (fish_speech) supports high-quality voice cloning from short reference samples. Upload a 10-30 second WAV sample via the voice management API, then use the voice_id in requests:
```bash
# Upload a reference sample
curl -X POST http://localhost:8600/api/v1/voices/upload \
  -F "voice_id=grandma" \
  -F "name=Grandma" \
  -F "language=ta" \
  -F "file=@grandma_sample.wav"

# Use the cloned voice with Fish Speech
curl -X POST http://localhost:8600/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "fish_speech", "input": "Hello paatti", "voice": "grandma"}' \
  --output speech.mp3
```

Fish Speech supports 80+ languages. Pass the `language` field for best results when synthesising non-English text.
SeamlessM4T (seamless) uses integer speaker IDs. Available voices: default (speaker 0), speaker_0, speaker_1, speaker_2. The engine supports 36 speech output languages. Specify the language field with a BCP-47 code:
```bash
curl -X POST http://localhost:8600/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "seamless", "input": "வணக்கம்", "language": "ta"}' \
  --output speech.mp3
```

Fish Speech requires the `fish-speech` package from GitHub (not on PyPI):
```bash
pip install git+https://github.com/fishaudio/fish-speech.git

# Or when installing the tts-service:
pip install -e ".[fish_speech,dev]"
```

Fish Speech is not part of `.[all]` because its dependency chain conflicts with parler-tts.
The model weights (~6 GB) are downloaded from HuggingFace on first use.
The edge_tts engine proxies requests to a remote openai-edge-tts service, which uses Microsoft Edge TTS voices. No local GPU or model downloads required.
Setup: Deploy openai-edge-tts on your network (e.g. via Docker), then point the TTS service to it:
```yaml
engines:
  enabled:
    - edge_tts
  default: edge_tts
  edge_tts:
    base_url: "http://192.168.1.31:6060/v1"
    default_voice: "en-IN-NeerjaExpressiveNeural"
    default_speed: 0.85
```

Or via environment variable: `EDGE_TTS_URL=http://192.168.1.31:6060/v1`
Popular voices for Indian languages:
| Voice ID | Language | Description |
|---|---|---|
| en-IN-NeerjaExpressiveNeural | English (IN) | Expressive Indian English female (default) |
| en-IN-PrabhatNeural | English (IN) | Indian English male |
| ta-IN-PallaviNeural | Tamil | Tamil female |
| ta-IN-ValluvarNeural | Tamil | Tamil male |
| hi-IN-SwaraNeural | Hindi | Hindi female |
| hi-IN-MadhurNeural | Hindi | Hindi male |
| te-IN-ShrutiNeural | Telugu | Telugu female |
| kn-IN-SapnaNeural | Kannada | Kannada female |
| ml-IN-SobhanaNeural | Malayalam | Malayalam female |
See the full list of voices at the Edge TTS voice gallery.
| Engine | Min VRAM | Recommended | CPU Fallback |
|---|---|---|---|
| svara | 8 GB (bf16) | 16 GB | Not practical |
| parler | 4 GB (fp16) | 8 GB | Slow but works |
| fish_speech | 8 GB (bf16) | 16 GB | Not practical |
| seamless | 6 GB (fp16) | 8 GB | Slow but works |
| edge_tts | N/A | N/A | N/A (remote service) |
Tested on:
- NVIDIA DGX Spark (Grace + Blackwell, 128 GB unified): all engines run comfortably
- NVIDIA RTX 4080 (16 GB): svara in bf16, parler in fp16; fish_speech and seamless also fit in bf16/fp16
```bash
# Build and push to local registry
docker build -t localhost:32000/tts-service:latest .
docker push localhost:32000/tts-service:latest

# Deploy
kubectl apply -f kubernetes/base/pvc.yaml
kubectl apply -f kubernetes/local/deployment.yaml
kubectl apply -f kubernetes/base/service.yaml
```

The service is available at `tts-svc:8600` within the cluster.
The Cognitive Companion backend connects to this service via its TTS client. Configure in the Cognitive Companion's config/settings.yaml:
```yaml
tts:
  url: "http://tts-service:8600"
  default_voice: "speaker_0"
  default_speed: 0.85
```

The `TTSClient` in `backend/integrations/tts.py` supports both batch generation (`generate_audio()`) and streaming (`stream_audio()`). The announcement notification channel streams audio directly to PWA clients via WebSocket for real-time playback.
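For reference, a minimal sketch of what such a client can look like against this service's API. The method names mirror the ones mentioned above, but this is an illustration only, not the actual implementation in `backend/integrations/tts.py`:

```python
from collections.abc import Iterator

import requests


class TTSClient:
    """Minimal client sketch for the TTS service's OpenAI-compatible API."""

    def __init__(self, url: str = "http://tts-service:8600",
                 voice: str = "speaker_0", speed: float = 0.85):
        self.url = url
        self.voice = voice
        self.speed = speed

    def generate_audio(self, text: str, fmt: str = "mp3") -> bytes:
        # Batch synthesis: one request, full audio file in the response body.
        resp = requests.post(
            f"{self.url}/v1/audio/speech",
            json={"model": "svara", "input": text, "voice": self.voice,
                  "response_format": fmt, "speed": self.speed},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.content

    def stream_audio(self, text: str) -> Iterator[bytes]:
        # Streaming synthesis: yields raw PCM chunks as they are decoded.
        with requests.post(
            f"{self.url}/v1/audio/speech",
            json={"model": "svara", "input": text, "voice": self.voice,
                  "speed": self.speed, "stream": True},
            stream=True,
            timeout=120,
        ) as resp:
            resp.raise_for_status()
            yield from resp.iter_content(chunk_size=4096)
```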