Add TTS endpoint (/v1/audio/speech) to local server#471

Open
j0nl1 wants to merge 1 commit into argmaxinc:main from j0nl1:main

j0nl1 commented May 8, 2026

Summary

Adds a text-to-speech endpoint to the local server (argmax-cli serve), enabling the same server to handle both STT and TTS via OpenAI-compatible APIs.

  • POST /v1/audio/transcriptions — STT (existing)
  • POST /v1/audio/speech — TTS (new)

Changes

  • Import TTSKit in ServeCLI.swift
  • Load Qwen3-TTS 0.6B model alongside WhisperKit on server startup
  • Add POST /v1/audio/speech as a manual Vapor route (same pattern as /health)
  • OpenAI voice name mapping (alloy → ryan, nova → serena, echo → aiden, shimmer → vivian, onyx → eric, fable → dylan) with passthrough for native Qwen3 voice names
  • 10 languages supported via short code or full name (es/spanish, en/english, fr/french, etc.)
  • WAV response encoder (24kHz mono 16-bit PCM)
  • Updated endpoint listing on root / route
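The voice mapping in the list above can be sketched as a dictionary lookup with a passthrough fallback. This is a minimal illustration, not the PR's code: the names `openAIToQwenVoice` and `resolveVoice` are hypothetical, and the case-insensitive lookup is an assumption the description does not state.

```swift
// Hypothetical names: `openAIToQwenVoice` and `resolveVoice` are illustrative
// and are not taken from the PR's source.
let openAIToQwenVoice: [String: String] = [
    "alloy": "ryan",
    "nova": "serena",
    "echo": "aiden",
    "shimmer": "vivian",
    "onyx": "eric",
    "fable": "dylan",
]

/// Maps an OpenAI voice name to its Qwen3 counterpart. Any other name is
/// returned unchanged, so native Qwen3 voice names pass straight through.
/// Assumption: OpenAI names are matched case-insensitively.
func resolveVoice(_ requested: String) -> String {
    return openAIToQwenVoice[requested.lowercased()] ?? requested
}
```

With this shape, `resolveVoice("nova")` yields `serena`, while a native name such as `serena` is returned as given.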

Motivation

The local server currently handles only transcription. Adding TTS makes it a complete voice server, useful for smart home assistants, accessibility tools, or any application that needs both STT and TTS from a single local endpoint on Apple Silicon.

Real-world usage

I built a Home Assistant custom integration that consumes this endpoint: ha-argmax-tts. It registers as a TTS provider in Home Assistant, allowing voice assistants to use argmax for local speech synthesis with no cloud APIs needed. The integration supports all 10 languages, configurable voice/model selection via the UI, and connection validation via /health.

Request format

POST /v1/audio/speech
{
  "input": "Hello world",
  "voice": "nova",
  "language": "en",
  "model": "qwen3-tts-0.6b"
}
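For readers wiring up a client or a handler, the request body above could be modeled as a `Codable` struct. This is a sketch under assumptions: `SpeechRequest` and `decodeSpeechRequest` are hypothetical names, and the description does not say which fields are optional, so all four are treated as required here.

```swift
import Foundation

// Hypothetical type: the field names mirror the JSON body shown above, but
// the struct itself is an illustration, not the PR's actual request model.
struct SpeechRequest: Codable, Equatable {
    let input: String     // text to synthesize
    let voice: String     // OpenAI or native Qwen3 voice name
    let language: String  // short code or full name
    let model: String     // e.g. "qwen3-tts-0.6b"
}

/// Decodes a JSON request body into a `SpeechRequest`.
/// Throws a `DecodingError` if a field is missing or has the wrong type.
func decodeSpeechRequest(_ data: Data) throws -> SpeechRequest {
    return try JSONDecoder().decode(SpeechRequest.self, from: data)
}
```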

Returns audio/wav (24kHz mono 16-bit PCM).
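The response format can be illustrated with a small WAV encoder that wraps samples in a standard 44-byte RIFF header. This is a sketch, not the PR's encoder: `encodeWAV` and `littleEndianBytes` are hypothetical names, and it assumes the synthesizer produces mono `Float` samples in the range -1...1, which the description does not confirm.

```swift
import Foundation

/// Returns the little-endian byte representation of an integer.
func littleEndianBytes<T: FixedWidthInteger>(_ value: T) -> [UInt8] {
    return withUnsafeBytes(of: value.littleEndian) { Array($0) }
}

/// Wraps mono Float samples (assumed to lie in -1...1) in a RIFF/WAVE
/// container as 16-bit PCM. Hypothetical helper, not the PR's implementation.
func encodeWAV(samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * (bitsPerSample / 8)      // bytes per frame
    let byteRate = sampleRate * UInt32(blockAlign)       // bytes per second
    let dataSize = UInt32(samples.count) * UInt32(blockAlign)

    var wav = Data()
    wav.append(contentsOf: Array("RIFF".utf8))
    wav.append(contentsOf: littleEndianBytes(36 + dataSize)) // size after this field
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    wav.append(contentsOf: littleEndianBytes(UInt32(16)))    // fmt chunk size
    wav.append(contentsOf: littleEndianBytes(UInt16(1)))     // format tag 1 = PCM
    wav.append(contentsOf: littleEndianBytes(channels))
    wav.append(contentsOf: littleEndianBytes(sampleRate))
    wav.append(contentsOf: littleEndianBytes(byteRate))
    wav.append(contentsOf: littleEndianBytes(blockAlign))
    wav.append(contentsOf: littleEndianBytes(bitsPerSample))
    wav.append(contentsOf: Array("data".utf8))
    wav.append(contentsOf: littleEndianBytes(dataSize))

    for sample in samples {
        // Clamp to the valid range; non-finite samples become silence.
        let clamped = sample.isFinite ? max(-1, min(1, sample)) : 0
        let pcm = Int16((clamped * 32767).rounded())
        wav.append(contentsOf: littleEndianBytes(pcm))
    }
    return wav
}
```

At 24 kHz mono 16-bit, one second of audio is 48,000 bytes of sample data plus the 44-byte header.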

Testing

Tested on a Mac Studio (Apple Silicon) with the Qwen3-TTS 0.6B model. TTS generation completes in ~300-500 ms for short sentences.

- Import TTSKit and load Qwen3-TTS 0.6B model on server startup
- New POST /v1/audio/speech endpoint compatible with OpenAI TTS API
- Voice mapping: OpenAI names (alloy, echo, nova...) to Qwen3 voices
- 10 language support: es, en, fr, de, pt, it, ja, ko, zh, ru
- WAV encoder (24kHz mono 16-bit PCM)
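The "short code or full name" language handling could look roughly like the following. This is a sketch under assumptions: `languageNamesByCode` and `normalizeLanguage` are hypothetical names, and only the spanish, english, and french spellings appear in the description. The remaining full names are assumed to be the plain English language names, and returning the short code as the canonical form is also an assumption.

```swift
// The ten codes come from the commit notes above. Full names other than
// spanish, english, and french are assumptions.
let languageNamesByCode: [String: String] = [
    "es": "spanish",
    "en": "english",
    "fr": "french",
    "de": "german",
    "pt": "portuguese",
    "it": "italian",
    "ja": "japanese",
    "ko": "korean",
    "zh": "chinese",
    "ru": "russian",
]

/// Normalizes a short code or full language name to its short code.
/// Returns nil for unsupported input. Hypothetical helper, not the PR's code.
func normalizeLanguage(_ raw: String) -> String? {
    let key = raw.lowercased()
    if languageNamesByCode[key] != nil {
        return key  // already a supported short code
    }
    // Otherwise look for a matching full name and return its code.
    return languageNamesByCode.first(where: { $0.value == key })?.key
}
```

Returning `nil` for unknown input leaves it to the caller to decide between rejecting the request and falling back to a default language.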