
KittenTTS FastAPI

A high-performance, lightweight Text-to-Speech (TTS) server built with FastAPI, wrapping the ONNX KittenTTS v0.8 model, which runs efficiently on CPU with optional GPU acceleration.

This project provides a robust, production-ready interface for the ultra-lightweight KittenTTS model and engine. It features a modern Web UI, true GPU acceleration via ONNX Runtime, and full OpenAI API compatibility for easy integration into existing workflows.


✨ Features

Production-ready FastAPI wrapper around KittenTTS, focused on fast local/self-hosted deployment.

  • Modern Web UI: Text input, voice controls, playback, and download.
  • OpenAI-Compatible API: Includes /v1/models, /v1/audio/speech, and /v1/audio/voices.
  • GPU Acceleration: Uses ONNX Runtime GPU providers when available.
  • CPU Friendly: Lightweight model (~15M params, under 25MB).
  • Long-Text Support: Optional chunking and merged output.
  • Robust Text Preprocessing: Cleans noisy artifacts and normalizes tricky text forms for more stable synthesis.
  • Env-Based Config: Configure runtime with .env and KITTEN_* vars.
  • Browser UI State: UI preferences are stored in local browser storage.
  • Built-in Voices: 8 voices included (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo).
  • Piper Alternative: A compact, low-overhead, self-hosted alternative to Piper TTS.

🐳 Docker Quickstart

Fastest way to run on CPU with the published image:

docker run -it -d \
  --name kittentts-fastapi \
  --restart unless-stopped \
  -e KITTEN_MODEL_REPO_ID="KittenML/kitten-tts-nano-0.8-fp32" \
  -p 8005:8005 \
  ghcr.io/richardr1126/kittentts-fastapi-cpu

Also runs well on Raspberry Pi (64-bit OS).

Environment Variables (Optional)

Supported environment variables:

  • KITTEN_SERVER_HOST (default: 0.0.0.0)
  • KITTEN_SERVER_PORT (default: 8005)
  • KITTEN_SERVER_ENABLE_PERFORMANCE_MONITOR (default: false)
  • KITTEN_MODEL_REPO_ID (default: KittenML/kitten-tts-nano-0.8-fp32)
  • KITTEN_TTS_DEVICE (default: auto, options: auto, cpu, cuda)
  • KITTEN_MODEL_CACHE (default: model_cache)
  • KITTEN_GEN_DEFAULT_SPEED (default: 1.1)
  • KITTEN_GEN_DEFAULT_LANGUAGE (default: en)
  • KITTEN_AUDIO_FORMAT (default: wav, options: wav, mp3, opus, aac)
  • KITTEN_AUDIO_SAMPLE_RATE (default: 24000)
  • KITTEN_TEXT_PROFILE (default: balanced, options: balanced, narration, dialogue)
  • KITTEN_TEXT_PROFILES_JSON (optional JSON object to override/extend profile defaults)
  • KITTEN_UI_TITLE (default: Kitten TTS Server)
  • KITTEN_UI_SHOW_LANGUAGE_SELECT (default: true)
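
For example, a minimal .env for a GPU host might look like this (values illustrative; unset variables fall back to the defaults above):

```shell
# .env — illustrative values only
KITTEN_SERVER_PORT=8005
KITTEN_TTS_DEVICE=cuda
KITTEN_AUDIO_FORMAT=mp3
KITTEN_TEXT_PROFILE=narration
KITTEN_TEXT_PROFILES_JSON={"narration":{"pause_strength":"strong"}}
```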

πŸ› οΈ Local Installation

1. Prerequisites

  • Python: 3.13+
  • uv: Install uv
  • Audio runtime libs: libsndfile and ffmpeg available on system path.

2. Local Setup

# Clone the repository
git clone https://github.com/richardr1126/KittenTTS-FastAPI.git
cd KittenTTS-FastAPI

# Create local environment config
cp .env.example .env

# Sync dependencies and create virtual environment (CPU/default)
uv sync

# Run the server
uv run src/server.py

For NVIDIA GPU local installs, sync the dedicated dependency group:

uv sync --group nvidia

Then set KITTEN_TTS_DEVICE=cuda in .env (or export it in your shell) before starting the server.

After startup, the server logs the exact UI URL to visit (typically http://localhost:8005/).

🐳 Docker Compose Setup

To deploy with Docker Compose, create .env first (cp .env.example .env), then run:

CPU (Default)

docker compose up -d --build

NVIDIA GPU

Make sure you have the NVIDIA Container Toolkit installed.

docker compose -f docker-compose-gpu.yml up -d --build

πŸ“– API Usage

OpenAI Compatible Endpoint (/v1/audio/speech)

curl http://localhost:8005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello from the Kitten TTS FastAPI server!",
    "voice": "Jasper",
    "speed": 1.1,
    "response_format": "mp3"
  }' \
  --output speech.mp3

The model field accepts the canonical tts-1 and also the aliases KittenTTS and kitten-tts.
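
The same request can be made from Python using only the standard library. This is a minimal sketch, assuming the server is running locally on port 8005; speech_payload and synthesize are illustrative helper names, not part of the project:

```python
import json
import urllib.request

def speech_payload(text, voice="Jasper", speed=1.1, fmt="mp3"):
    """Build the JSON body for the OpenAI-compatible /v1/audio/speech route."""
    return {
        "model": "tts-1",
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": fmt,
    }

def synthesize(text, out_path="speech.mp3", base_url="http://localhost:8005"):
    """POST the request to the server and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(speech_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Call synthesize("Hello from Python!") with the server running to produce speech.mp3.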

Model List Endpoint (/v1/models)

curl http://localhost:8005/v1/models

Voice List Endpoint (/v1/audio/voices)

curl http://localhost:8005/v1/audio/voices

Custom Endpoint (/tts) with Text Profile

curl http://localhost:8005/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Alice: Hi there.\nBob: Hey, ready to start?",
    "voice": "Jasper",
    "output_format": "mp3",
    "split_text": true,
    "chunk_size": 120,
    "speed": 1.0,
    "text_options": {
      "profile": "dialogue"
    }
  }' \
  --output speech.mp3

/tts supports request-level overrides via text_options for: profile, remove_punctuation, normalize_pause_punctuation, pause_strength, dialogue_turn_splitting, speaker_label_mode, and max_punct_run. All other preprocessing flags use the profile-defined server defaults.
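
Conceptually, split_text and chunk_size break long input into sentence-aligned pieces before synthesis. The sketch below illustrates that idea; it is not the server's actual implementation:

```python
import re

def chunk_text(text, chunk_size=120):
    """Split text into chunks of roughly chunk_size characters,
    preferring sentence boundaries (illustrative of split_text/chunk_size)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is synthesized independently and the audio is merged into a single output.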

Interactive Docs

Visit http://localhost:8005/docs for the full Swagger UI.


βš™οΈ Configuration

Server settings are loaded from environment variables (.env for local/dev). Copy .env.example to .env and edit values as needed, then restart the server.

  • KITTEN_TTS_DEVICE: auto, cuda, or cpu.
  • KITTEN_AUDIO_FORMAT: wav, mp3, opus, or aac.
  • KITTEN_MODEL_REPO_ID: Hugging Face model repo.
  • KITTEN_MODEL_CACHE: Model cache directory path.
  • KITTEN_TEXT_PROFILE: Active text profile (balanced, narration, dialogue).
  • KITTEN_TEXT_PROFILES_JSON: Optional JSON object merged onto default text_processing.profiles at startup (example: {"balanced":{"pause_strength":"strong"},"dialogue":{"dialogue_turn_splitting":true}}).
  • Profile defaults: Full preprocessing defaults (cleanup + normalization pipeline flags) are defined in text_processing.profiles in src/config.py.
  • Override model: Selected profile provides the baseline. /tts can override only this focused subset via text_options: remove_punctuation, normalize_pause_punctuation, pause_strength, dialogue_turn_splitting, speaker_label_mode, max_punct_run.
  • Text preprocessing is enabled by default and includes cleanup/normalization steps (for example, mixed alphanumeric tokens like gpt4 are normalized to improve phonemization stability).
  • OpenAI route compatibility: /v1/audio/speech keeps strict OpenAI request fields; advanced text options are configured server-side (profile/defaults).
  • UI state (last text, voice, theme) is stored in browser localStorage, not on the API server.
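
The KITTEN_TEXT_PROFILES_JSON merge can be pictured as a per-profile, per-key update onto the defaults. A minimal sketch, with illustrative profile defaults (the real ones live in text_processing.profiles in src/config.py):

```python
import json

# Illustrative defaults only; not the project's actual profile definitions.
DEFAULT_PROFILES = {
    "balanced": {"pause_strength": "normal", "dialogue_turn_splitting": False},
    "dialogue": {"pause_strength": "normal", "dialogue_turn_splitting": False},
}

def merge_profiles(defaults, overrides_json):
    """Merge a KITTEN_TEXT_PROFILES_JSON-style object onto the defaults,
    key by key, without discarding settings the override leaves unspecified."""
    merged = {name: dict(flags) for name, flags in defaults.items()}
    for name, flags in json.loads(overrides_json).items():
        merged.setdefault(name, {}).update(flags)
    return merged
```

With the example override from above, only the named keys change; everything else keeps its default.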

πŸ› οΈ Troubleshooting

  • Phonemizer Errors: The app uses the bundled espeakng_loader; if your platform blocks loading bundled dynamic libraries, install the system espeak-ng package.
  • GPU Not Used: Ensure torch.cuda.is_available() is True. The container/host must have NVIDIA drivers and the Container Toolkit.
  • Audio Errors on Linux: Ensure libsndfile1 is installed (sudo apt install libsndfile1).

πŸ™ Acknowledgements


πŸ“„ License

This project is licensed under the Apache License 2.0. See the LICENSE file for details. This license choice aligns with the upstream KittenTTS project/model licensing used by this repository.
