🎙️ Local TTS Studio

Open-source, GPU-accelerated speech studio for single-voice generation and multi-speaker podcast production — 100% offline.

Think: a local, self-hosted ElevenLabs alternative you fully control.

Quick Start · Features · Architecture · Configuration · API Reference · Contributing

Why Local TTS Studio?

Cloud TTS services	Local TTS Studio
Per-minute billing that scales with usage	$0 marginal cost — run unlimited generations
Audio leaves your network	100% local & private: generated audio never leaves your machine
Rate limits and vendor lock-in	No API keys required for core TTS
Limited voice customization	Design, clone, or pick from 9 preset voices

Performance: ~8–12s per generation on RTX 3060 · bfloat16 inference · 24 kHz output

✨ Features

Single Voice Generation

Custom Voice: 9 multilingual presets (English, Chinese, Japanese, Korean, and 7 more languages)
Voice Design: describe a voice in natural language and generate it
Voice Clone: clone a voice from a short audio sample, using either ICL (in-context learning) or x-vector mode

Podcast Mode — Script-to-Audio Compiler

Up to 10 speakers per production with mixed voice types
Per-segment timing, volume, and emotion control
Deterministic rendering — same script produces identical audio
Fault-tolerant pipeline — failed segments get silence placeholders instead of crashing the entire render
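
The fault-tolerance behaviour described above can be illustrated with a minimal sketch. This is not the project's actual render loop: `render_segments`, `fake_synthesize`, and the fixed `placeholder_ms` policy are hypothetical names and choices made for illustration. Only the 24 kHz sample rate comes from this README.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE = 24_000  # the README states 24 kHz output


def render_segments(
    texts: Sequence[str],
    synthesize: Callable[[str], List[float]],
    placeholder_ms: int = 1000,
) -> List[float]:
    """Concatenate per-segment audio, substituting silence when a segment fails."""
    output: List[float] = []
    for text in texts:
        try:
            samples = synthesize(text)
        except Exception:
            # Assumed policy: a fixed-length block of silence per failed segment.
            samples = [0.0] * (SAMPLE_RATE * placeholder_ms // 1000)
        output.extend(samples)
    return output


def fake_synthesize(text: str) -> List[float]:
    """Stand-in for a real TTS call: fails on empty text, else 10 samples per character."""
    if not text:
        raise ValueError("empty segment")
    return [0.1] * (10 * len(text))


# The middle segment fails, yet the render still completes.
audio = render_segments(["hi", "", "ok"], fake_synthesize, placeholder_ms=1)
print(len(audio))
```

Catching the error per segment keeps the output timeline intact, so the segments after a failure are still rendered in order.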

v3: Timeline Studio (New)

Multi-track timeline editor with speech + music lanes
Music Library — search royalty-free tracks from Jamendo, Freesound, and Openverse
Audio ducking — music auto-lowers under speech segments
Loop, trim, fade — per-track audio manipulation
Live timeline preview — estimated duration updates as you edit
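
To make the ducking idea concrete, here is a small pure-Python sketch that lowers music samples wherever speech is present and ramps the gain linearly at both edges. The function name `duck_music`, the `duck_gain` and `ramp` parameters, and the linear ramp shape are assumptions for illustration; the project's `audio_pipeline` may work differently.

```python
from typing import List, Sequence, Tuple


def duck_music(
    music: Sequence[float],
    speech_spans: Sequence[Tuple[int, int]],
    duck_gain: float = 0.25,
    ramp: int = 4,
) -> List[float]:
    """Lower music under speech spans, with linear ramps before and after.

    speech_spans holds (start, end) sample indices, end exclusive.
    """
    ramp = max(1, ramp)  # avoid division by zero
    gains = [1.0] * len(music)
    for start, end in speech_spans:
        for i in range(max(0, start - ramp), min(len(music), end + ramp)):
            if i < start:
                t = (start - i) / ramp      # approaching speech: gain falls
            elif i >= end:
                t = (i - end + 1) / ramp    # leaving speech: gain recovers
            else:
                t = 0.0                     # under speech: fully ducked
            gain = duck_gain + (1.0 - duck_gain) * min(1.0, t)
            # Where spans overlap, keep the lowest gain.
            gains[i] = min(gains[i], gain)
    return [sample * gain for sample, gain in zip(music, gains)]


ducked = duck_music([1.0] * 12, [(4, 8)], duck_gain=0.5, ramp=2)
print(ducked)
```

In a real mix the ramp would span tens of milliseconds of samples rather than two, which avoids audible clicks at segment boundaries.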

Screenshots

[Screenshots: Custom Voice, Voice Design, Voice Clone, Podcast Mode]

🚀 Quick Start

Requirements: NVIDIA GPU (6 GB+ VRAM) · Python 3.10+ · ~15 GB disk space

git clone https://github.com/sammy995/Local-TTS-Studio.git
cd Local-TTS-Studio
pip install -r requirements.txt
conda install -c conda-forge ffmpeg -y
python run_local.py

Open http://localhost:8000 — that's it.

First run downloads models automatically (~10 GB). Subsequent starts take ~30 s for model loading.

📋 Detailed Installation

Hardware

	Minimum	Recommended
GPU	GTX 1660 (6 GB VRAM)	RTX 3060+ (8 GB+ VRAM)
RAM	16 GB	32 GB
Disk	15 GB	20 GB

Step-by-Step

# 1. Clone
git clone https://github.com/sammy995/Local-TTS-Studio.git
cd Local-TTS-Studio

# 2. Create environment
conda create -n local-tts python=3.12 -y
conda activate local-tts
pip install -r requirements.txt

# 3. Install ffmpeg (required for MP3/M4A export)
conda install -c conda-forge ffmpeg -y
# Alternative (Windows): winget install Gyan.FFmpeg

# 4. Launch
python run_local.py

Optional: Pre-download Models

Skip the first-run download by pulling models ahead of time:

pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz       --local-dir models/Qwen3-TTS-Tokenizer-12Hz
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir models/Qwen3-TTS-12Hz-1.7B-VoiceDesign
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base        --local-dir models/Qwen3-TTS-12Hz-1.7B-Base
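
Before launching, it can be handy to check which of these directories are already populated. The sketch below assumes the app looks for models under `models/<repo name>`, as the `--local-dir` arguments above suggest; `missing_models` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import List

# Repository names taken from the download commands above.
MODEL_REPOS = [
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
]


def missing_models(models_dir: Path) -> List[str]:
    """Return the repos whose local directory is absent or empty."""
    missing = []
    for repo in MODEL_REPOS:
        local = models_dir / repo.split("/", 1)[1]
        if not local.is_dir() or not any(local.iterdir()):
            missing.append(repo)
    return missing


def download_commands(models_dir: Path) -> List[str]:
    """Build the huggingface-cli commands for whatever is still missing."""
    return [
        f"huggingface-cli download {repo} --local-dir {models_dir / repo.split('/', 1)[1]}"
        for repo in missing_models(models_dir)
    ]
```

Calling `download_commands(Path("models"))` from the repository root lists the commands still to run. A non-empty directory does not prove the download completed, so treat this as a quick check only.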

🏗️ Architecture

Hexagonal (ports & adapters) layout — core logic has zero framework dependencies:

Local-TTS-Studio/
├── core/                   # Pure domain logic (no I/O)
│   ├── tts_engine.py       #   TTS generation interface
│   ├── model_manager.py    #   Model loading & lifecycle
│   └── audio_pipeline.py   #   Mix, duck, loop, trim, fade, resample
├── services/               # Stateless orchestration
│   ├── tts_service.py      #   Single-voice generation
│   ├── podcast_service.py  #   Multi-speaker render pipeline
│   ├── podcast_models.py   #   Pydantic models for podcast scripts
│   └── music_service.py    #   Jamendo / Freesound / Openverse client
├── infra/                  # Side-effect adapters
│   └── storage.py          #   File I/O & output management
├── runtimes/               # Delivery mechanism
│   ├── local_api.py        #   FastAPI server & endpoints
│   └── config_loader.py    #   YAML config reader
├── simple-ui.html          # Single-file frontend (~3 400 lines)
├── config.yaml             # All tunables in one place
├── requirements.txt
└── run_local.py            # Entry point
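
As a rough illustration of the ports-and-adapters idea, the sketch below defines two ports as `Protocol` classes and a service that depends only on them, so it can be exercised with in-memory fakes. All class and method names here (`TTSEngine`, `Storage`, `TTSService`, `FakeEngine`, `MemoryStorage`) are hypothetical and are not taken from the project's source.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class TTSEngine(Protocol):
    """Port: what the service needs from any TTS backend."""

    def synthesize(self, text: str, speaker: str) -> List[float]: ...


class Storage(Protocol):
    """Port: where rendered audio is written."""

    def save(self, name: str, samples: List[float]) -> str: ...


@dataclass
class TTSService:
    """Orchestration layer: knows the ports, not FastAPI or the GPU."""

    engine: TTSEngine
    storage: Storage

    def generate(self, text: str, speaker: str) -> str:
        if not text.strip():
            raise ValueError("text must not be empty")
        samples = self.engine.synthesize(text, speaker)
        return self.storage.save(f"{speaker}.wav", samples)


class FakeEngine:
    """Test adapter: one silent sample per input character."""

    def synthesize(self, text: str, speaker: str) -> List[float]:
        return [0.0] * len(text)


class MemoryStorage:
    """Test adapter: keeps output in a dict instead of on disk."""

    def __init__(self) -> None:
        self.files: Dict[str, List[float]] = {}

    def save(self, name: str, samples: List[float]) -> str:
        self.files[name] = samples
        return name


storage = MemoryStorage()
service = TTSService(engine=FakeEngine(), storage=storage)
path = service.generate("Hello", "Serena")
print(path, len(storage.files[path]))
```

Swapping `FakeEngine` for a real model adapter, or `MemoryStorage` for a file-based one, would leave `TTSService` unchanged, which is the property the layout is meant to provide.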

Key Design Decisions

Decision	Rationale
Single-file HTML frontend	Zero build step — open and go
Hexagonal backend	Core logic is testable without FastAPI
Speaker-stable deterministic seeds	Same speaker always gets the same voice timbre
Per-segment fault tolerance	One failed TTS segment can't crash the whole podcast
Music ducking in `audio_pipeline`	Keeps mixing logic out of the render loop
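
One way to obtain speaker-stable seeds is to hash the speaker identifier, sketched below. This is an assumption about how such seeds could be derived, not a description of the project's code; `speaker_seed` is a hypothetical helper. Python's built-in `hash()` is avoided because it is salted per process for strings.

```python
import hashlib
import random


def speaker_seed(speaker_id: str) -> int:
    """Derive a 32-bit seed that is stable across processes and machines."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


# Two generators seeded for the same speaker produce the same sequence.
rng_a = random.Random(speaker_seed("host"))
rng_b = random.Random(speaker_seed("host"))
same_sequence = [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]
print(same_sequence)
```

With a sampling-based model, the same seed would be passed to the framework's random number generator before each segment for that speaker.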

⚙️ Configuration

All settings live in config.yaml:

# Model size — switch to 0.6B if running low on VRAM
models:
  default_size: "1.7B"    # or "0.6B"

# Music library API keys (optional — for Timeline Studio)
music_apis:
  jamendo:
    client_id: ""          # Free → https://devportal.jamendo.com
  freesound:
    token: ""              # Free → https://freesound.org/apiv2/apply
  openverse:
    token: ""              # Optional (anonymous access works)

Tip: Copy .env.example to .env to keep secrets out of config.yaml. The app reads both config.yaml and .env.
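
How values from the two sources are combined is not documented here. One plausible rule, assumed in the sketch below, is that a non-empty environment variable overrides the matching key from `config.yaml`. The environment variable names and the `resolve_music_keys` helper are hypothetical; check `.env.example` for the real names.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names, chosen for illustration only.
ENV_KEYS = {
    ("jamendo", "client_id"): "JAMENDO_CLIENT_ID",
    ("freesound", "token"): "FREESOUND_TOKEN",
    ("openverse", "token"): "OPENVERSE_TOKEN",
}


def resolve_music_keys(
    music_apis: Mapping[str, Mapping[str, str]],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Overlay non-empty environment values on the music_apis config section."""
    env = os.environ if env is None else env
    resolved = {name: dict(fields) for name, fields in music_apis.items()}
    for (provider, field), variable in ENV_KEYS.items():
        value = env.get(variable, "")
        if value:
            resolved.setdefault(provider, {})[field] = value
    return resolved


# music_apis as it would look after parsing the YAML shown above.
from_yaml = {
    "jamendo": {"client_id": ""},
    "freesound": {"token": "abc"},
    "openverse": {"token": ""},
}
resolved = resolve_music_keys(from_yaml, env={"JAMENDO_CLIENT_ID": "xyz"})
print(resolved)
```

Passing `env` explicitly keeps the function easy to test; leaving it out falls back to the real process environment.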

📡 API Reference

All endpoints are served at http://localhost:8000.

Core TTS

Method	Endpoint	Description
`POST`	`/api/v1/tts/generate`	Single-voice generation (custom, design, or clone)
`GET`	`/api/v1/voices`	List available preset voices
`GET`	`/api/v1/models/status`	Model load state & GPU memory

Podcast — v2 (Script Mode)

Method	Endpoint	Description
`POST`	`/api/v2/podcast/render`	Render a multi-speaker script to audio

Podcast — v3 (Timeline Studio)

Method	Endpoint	Description
`POST`	`/api/v3/podcast/render`	Render a timeline (speech + music tracks)

Music Library

Method	Endpoint	Description
`GET`	`/api/v1/music/search`	Search royalty-free music (Jamendo, Freesound, Openverse)
`POST`	`/api/v1/music/download`	Download & cache a track locally
`GET`	`/api/v1/music/assets`	List cached music assets

Example: Generate speech

curl -X POST http://localhost:8000/api/v1/tts/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is Local TTS Studio.",
    "mode": "custom_voice",
    "speaker": "Serena",
    "language": "English"
  }'
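
The same request can be sent from Python using only the standard library. The endpoint and JSON fields mirror the curl example above. The response format is not documented in this README, so the sketch returns the raw bytes and content type without interpreting them; `build_tts_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Tuple

BASE_URL = "http://localhost:8000"


def build_tts_request(
    text: str, speaker: str = "Serena", language: str = "English"
) -> urllib.request.Request:
    """Build the POST request for single-voice generation."""
    payload = {
        "text": text,
        "mode": "custom_voice",
        "speaker": speaker,
        "language": language,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/tts/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text: str, timeout: float = 120.0) -> Tuple[str, bytes]:
    """Send the request and return (content type, raw body)."""
    request = build_tts_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", ""), response.read()
```

With the server running, `generate("Hello world, this is Local TTS Studio.")` returns the body as bytes. Inspect the content type before deciding whether to parse it as JSON or write it to an audio file.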

🔧 Troubleshooting

Problem	Fix
CUDA out of memory	Set `default_size: "0.6B"` in `config.yaml`, or close other GPU programs
MP3 / M4A export fails	Install ffmpeg: `conda install -c conda-forge ffmpeg -y`
First generation slow (~30 s)	Normal: the models are still loading. Subsequent generations take about 8–12 s on an RTX 3060
Flash Attention warning	Safe to ignore (optional optimization)
Music search returns no results	Add API keys to `config.yaml` → `music_apis` section
Voice clone sounds different each time	Provide `ref_text` alongside `ref_audio` for ICL mode (more stable than x-vector)

🤝 Contributing

Contributions are welcome! Here's how to get started:

Fork the repository
Create a branch: git checkout -b feature/your-feature
Make changes: follow the existing hexagonal structure
Test: make sure python -m py_compile runtimes/local_api.py exits without errors (unlike a bare py_compile.compile() call, this form returns a non-zero exit code on a syntax error)
Submit a PR with a clear description
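
Until a pytest suite exists, a slightly broader syntax check can cover every backend package in one go. The sketch below is a suggestion, not an existing project script. The directory names come from the architecture tree, and `find_syntax_errors` and `main` are hypothetical helpers.

```python
import py_compile
import sys
from pathlib import Path
from typing import Iterable, List

BACKEND_DIRS = ("core", "services", "infra", "runtimes")


def find_syntax_errors(paths: Iterable[Path]) -> List[str]:
    """Byte-compile each file and collect one message per failure."""
    errors = []
    for path in paths:
        try:
            # doraise=True raises on failure instead of only printing to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{path}: {exc.msg}")
    return errors


def main(root: str = ".") -> int:
    """Check all backend packages; return a process exit code."""
    files = sorted(
        path
        for directory in BACKEND_DIRS
        for path in Path(root, directory).rglob("*.py")
    )
    errors = find_syntax_errors(files)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

Saved as a script that ends with `sys.exit(main())`, this gives a non-zero exit code on any syntax error, which suits a pre-commit hook or CI step. It catches syntax errors only, not import or runtime failures.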

Areas We'd Love Help With

Streaming audio output (chunked WAV)
WebSocket progress events during render
Additional TTS model backends (Bark, XTTS-v2)
Docker image for one-command deployment
Test suite (pytest) for core/ and services/

📄 License

This project is licensed under the MIT License.

Model License: Qwen3-TTS models are subject to their original license terms. This application is a UI wrapper and does not claim ownership of underlying models.

🙏 Acknowledgements

Qwen3-TTS: the TTS model family that powers all generation (1.7B by default, 0.6B as the low-VRAM option)
Jamendo, Freesound, Openverse: royalty-free music APIs
FastAPI: the async Python web framework

Resources: Qwen3-TTS Paper · Models on HuggingFace · Official Repo
