🎙️ Local TTS Studio

Open-source, GPU-accelerated speech studio for single-voice generation and multi-speaker podcast production, running 100% offline.

Python 3.10+ · FastAPI · Qwen3-TTS · MIT License

Think: a local, self-hosted ElevenLabs alternative you fully control.

Getting Started · Features · Architecture · Configuration · API Reference · Contributing


Why Local TTS Studio?

Cloud TTS services | Local TTS Studio
Per-minute billing that scales with usage | $0 marginal cost: run unlimited generations
Audio leaves your network | 100% local and private: nothing leaves your machine
Rate limits and vendor lock-in | No API keys required for core TTS
Limited voice customization | Design, clone, or pick from 9 preset voices

Performance: ~8–12 s per generation on an RTX 3060 · bfloat16 inference · 24 kHz output


✨ Features

Single Voice Generation

  • Custom Voice: 9 multilingual presets (English, Chinese, Japanese, Korean, + 7 more)
  • Voice Design: describe a voice in natural language and generate it
  • Voice Clone: clone any voice from a short audio sample (ICL or x-vector modes)

Podcast Mode: Script-to-Audio Compiler

  • Up to 10 speakers per production with mixed voice types
  • Per-segment timing, volume, and emotion control
  • Deterministic rendering: the same script produces identical audio
  • Fault-tolerant pipeline: failed segments get silence placeholders instead of crashing the entire render
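
The fault-tolerance idea can be illustrated with a small sketch. This is not the project's actual render loop: the function name `render_segments`, the `synthesize` callback, and the fixed one-second placeholder are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE = 24_000  # 24 kHz, the output rate quoted in this README


def render_segments(
    segments: Sequence[dict],
    synthesize: Callable[[dict], List[float]],
    fallback_seconds: float = 1.0,
) -> tuple[List[float], List[int]]:
    """Concatenate per-segment audio, substituting silence for failures.

    Returns the joined samples and the indices of segments that failed.
    """
    output: List[float] = []
    failed: List[int] = []
    for index, segment in enumerate(segments):
        try:
            samples = synthesize(segment)
        except Exception:
            # Hypothetical policy: a fixed-length silent placeholder keeps
            # the render going when one segment cannot be synthesized.
            samples = [0.0] * int(fallback_seconds * SAMPLE_RATE)
            failed.append(index)
        output.extend(samples)
    return output, failed
```

Returning the failed indices alongside the audio lets a caller report which segments were replaced by silence rather than hiding the error.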

v3: Timeline Studio (New)

  • Multi-track timeline editor with speech + music lanes
  • Music Library β€” search royalty-free tracks from Jamendo, Freesound, and Openverse
  • Audio ducking: music auto-lowers under speech segments
  • Loop, trim, fade: per-track audio manipulation
  • Live timeline preview: estimated duration updates as you edit
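
As a rough illustration of ducking, the sketch below attenuates a music track wherever speech is active. It is a simplified stand-in, not the code in audio_pipeline.py: `duck_music`, the span format, and the default gain of 0.25 are assumptions, and a real mixer would probably ramp the gain to avoid audible clicks.

```python
from typing import List, Sequence, Tuple


def duck_music(
    music: Sequence[float],
    speech_spans: Sequence[Tuple[int, int]],
    duck_gain: float = 0.25,
) -> List[float]:
    """Return a copy of `music` attenuated wherever speech is active.

    `speech_spans` holds (start, end) sample indices, end exclusive.
    """
    ducked = list(music)
    for start, end in speech_spans:
        # Clamp the span to the track so out-of-range spans are harmless.
        for i in range(max(start, 0), min(end, len(ducked))):
            # Scale the original sample so overlapping spans do not
            # attenuate the same sample twice.
            ducked[i] = music[i] * duck_gain
    return ducked
```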

Screenshots

[Screenshots: Custom Voice, Voice Design, Voice Clone, Podcast Mode]

🚀 Quick Start

Requirements: NVIDIA GPU (6 GB+ VRAM) · Python 3.10+ · ~15 GB disk space

git clone https://github.com/sammy995/Local-TTS-Studio.git
cd Local-TTS-Studio
pip install -r requirements.txt
conda install -c conda-forge ffmpeg -y
python run_local.py

Open http://localhost:8000 in your browser. That's it.

First run downloads models automatically (~10 GB). Subsequent starts take ~30 s for model loading.

📋 Detailed Installation

Hardware

Component | Minimum | Recommended
GPU | GTX 1660 (6 GB VRAM) | RTX 3060+ (8 GB+ VRAM)
RAM | 16 GB | 32 GB
Disk | 15 GB | 20 GB

Step-by-Step

# 1. Clone
git clone https://github.com/sammy995/Local-TTS-Studio.git
cd Local-TTS-Studio

# 2. Create environment
conda create -n local-tts python=3.12 -y
conda activate local-tts
pip install -r requirements.txt

# 3. Install ffmpeg (required for MP3/M4A export)
conda install -c conda-forge ffmpeg -y
# Alternative (Windows): winget install Gyan.FFmpeg

# 4. Launch
python run_local.py

Optional: Pre-download Models

Skip the first-run download by pulling models ahead of time:

pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz       --local-dir models/Qwen3-TTS-Tokenizer-12Hz
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir models/Qwen3-TTS-12Hz-1.7B-VoiceDesign
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base        --local-dir models/Qwen3-TTS-12Hz-1.7B-Base
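
Before launching, it can be handy to confirm that the pre-downloaded folders are in place. The sketch below only checks for the directory names used in the commands above. The helper `missing_models` is hypothetical, and whether the app itself looks under models/ is an assumption based on the `--local-dir` paths.

```python
from pathlib import Path
from typing import List

# Directory names taken from the download commands above.
MODEL_DIRS = [
    "Qwen3-TTS-Tokenizer-12Hz",
    "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Qwen3-TTS-12Hz-1.7B-Base",
]


def missing_models(models_root: Path = Path("models")) -> List[str]:
    """List model folders that are absent or empty under `models_root`."""
    missing = []
    for name in MODEL_DIRS:
        folder = models_root / name
        # An empty folder usually means an interrupted or skipped download.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_models()` from the repository root returns an empty list when all four folders exist and contain files.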

🏗️ Architecture

Hexagonal (ports & adapters) layout, in which core logic has zero framework dependencies:

Local-TTS-Studio/
├── core/                   # Pure domain logic (no I/O)
│   ├── tts_engine.py       #   TTS generation interface
│   ├── model_manager.py    #   Model loading & lifecycle
│   └── audio_pipeline.py   #   Mix, duck, loop, trim, fade, resample
├── services/               # Stateless orchestration
│   ├── tts_service.py      #   Single-voice generation
│   ├── podcast_service.py  #   Multi-speaker render pipeline
│   ├── podcast_models.py   #   Pydantic models for podcast scripts
│   └── music_service.py    #   Jamendo / Freesound / Openverse client
├── infra/                  # Side-effect adapters
│   └── storage.py          #   File I/O & output management
├── runtimes/               # Delivery mechanism
│   ├── local_api.py        #   FastAPI server & endpoints
│   └── config_loader.py    #   YAML config reader
├── simple-ui.html          # Single-file frontend (~3,400 lines)
├── config.yaml             # All tunables in one place
├── requirements.txt
└── run_local.py            # Entry point
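
To make the ports-and-adapters idea concrete, here is a minimal sketch of how a service can depend on an interface rather than on a framework or a GPU model. The class names `TTSEngine`, `TTSService`, and `SilentEngine` are illustrative assumptions and do not mirror the actual contents of core/ or services/.

```python
from typing import List, Protocol


class TTSEngine(Protocol):
    """Port: what the service layer needs from any TTS backend."""

    def synthesize(self, text: str, speaker: str) -> List[float]: ...


class TTSService:
    """Orchestration that depends only on the port, not on a framework."""

    def __init__(self, engine: TTSEngine) -> None:
        self._engine = engine

    def generate(self, text: str, speaker: str) -> List[float]:
        cleaned = text.strip()
        if not cleaned:
            raise ValueError("text must not be empty")
        return self._engine.synthesize(cleaned, speaker)


class SilentEngine:
    """Test adapter: returns silence so no GPU or model is needed."""

    def synthesize(self, text: str, speaker: str) -> List[float]:
        return [0.0] * len(text)
```

With this shape, `TTSService(SilentEngine())` can be exercised in a unit test, while the real runtime would pass in an adapter that wraps the model.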

Key Design Decisions

Decision | Rationale
Single-file HTML frontend | Zero build step: open and go
Hexagonal backend | Core logic is testable without FastAPI
Speaker-stable deterministic seeds | Same speaker always gets the same voice timbre
Per-segment fault tolerance | One failed TTS segment can't crash the whole podcast
Music ducking in audio_pipeline | Keeps mixing logic out of the render loop
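
One way to obtain speaker-stable seeds is to derive them from the speaker identifier with a cryptographic hash. The sketch below shows that approach. It is an assumption about how such seeds could be produced, not a description of the project's implementation, and `speaker_seed` is a hypothetical name.

```python
import hashlib


def speaker_seed(speaker_id: str) -> int:
    """Map a speaker identifier to a stable 32-bit seed.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would change between runs.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # The first four bytes give an integer in the range [0, 2**32).
    return int.from_bytes(digest[:4], "big")
```

The resulting integer could then be handed to whatever random number generator the backend uses before each segment is synthesized, so that a given speaker starts from the same state on every render.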

⚙️ Configuration

All settings live in config.yaml:

# Model size: switch to 0.6B if running low on VRAM
models:
  default_size: "1.7B"    # or "0.6B"

# Music library API keys (optional, for Timeline Studio)
music_apis:
  jamendo:
    client_id: ""          # Free: https://devportal.jamendo.com
  freesound:
    token: ""              # Free: https://freesound.org/apiv2/apply
  openverse:
    token: ""              # Optional (anonymous access works)

Tip: Copy .env.example to .env for secret management. The app reads both files.
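
The sketch below shows one way to work out which music providers are usable once config.yaml has been parsed into a dictionary (for example with a YAML library, which is assumed here and not shown). The helper `configured_providers` is hypothetical; only the key names come from the config layout above.

```python
from typing import Dict, List


def configured_providers(config: Dict) -> List[str]:
    """Return music providers usable with the given config mapping."""
    apis = config.get("music_apis") or {}
    providers = []
    # Empty strings count as "not configured".
    if (apis.get("jamendo") or {}).get("client_id"):
        providers.append("jamendo")
    if (apis.get("freesound") or {}).get("token"):
        providers.append("freesound")
    # The config comments state that Openverse works anonymously.
    providers.append("openverse")
    return providers
```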


📡 API Reference

All endpoints are served at http://localhost:8000.

Core TTS

Method | Endpoint | Description
POST | /api/v1/tts/generate | Single-voice generation (custom, design, or clone)
GET | /api/v1/voices | List available preset voices
GET | /api/v1/models/status | Model load state & GPU memory

Podcast v2 (Script Mode)

Method | Endpoint | Description
POST | /api/v2/podcast/render | Render a multi-speaker script to audio

Podcast v3 (Timeline Studio)

Method | Endpoint | Description
POST | /api/v3/podcast/render | Render a timeline (speech + music tracks)

Music Library

Method | Endpoint | Description
GET | /api/v1/music/search | Search royalty-free music (Jamendo, Freesound, Openverse)
POST | /api/v1/music/download | Download & cache a track locally
GET | /api/v1/music/assets | List cached music assets

Example: Generate speech

curl -X POST http://localhost:8000/api/v1/tts/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is Local TTS Studio.",
    "mode": "custom_voice",
    "speaker": "Serena",
    "language": "English"
  }'
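
The same request can be issued from Python with only the standard library. This is a sketch: the payload fields are copied from the curl example, while the helper names `build_tts_request` and `generate_speech` and the 120-second timeout are assumptions. The format of the response body is not documented here, so the function simply returns the raw bytes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_tts_request(text: str, speaker: str = "Serena",
                      language: str = "English") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "text": text,
        "mode": "custom_voice",
        "speaker": speaker,
        "language": language,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/tts/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_speech(text: str, timeout: float = 120.0) -> bytes:
    """Send the request and return the raw response body.

    Whether the body is audio bytes or JSON is not specified here, so the
    caller should inspect it before saving.
    """
    request = build_tts_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read()
```

With the server running, `generate_speech("Hello world")` performs the call. Keeping request construction separate makes the payload easy to check without a GPU.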

🔧 Troubleshooting

Problem | Fix
CUDA out of memory | Set default_size: "0.6B" in config.yaml, or close other GPU programs
MP3 / M4A export fails | Install ffmpeg: conda install -c conda-forge ffmpeg -y
First generation slow (~30 s) | Normal: the model is loading. Subsequent runs take 8–12 s
Flash Attention warning | Safe to ignore (optional optimization)
Music search returns no results | Add API keys to the music_apis section of config.yaml
Voice clone sounds different each time | Provide ref_text alongside ref_audio for ICL mode (more stable than x-vector)

🀝 Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository
  2. Create a branch: git checkout -b feature/your-feature
  3. Make changes: follow the existing hexagonal structure
  4. Test: make sure python -m py_compile runtimes/local_api.py exits without errors (unlike a bare py_compile.compile() call, this returns a non-zero exit code when compilation fails)
  5. Submit a PR with a clear description
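
Until a pytest suite exists, a slightly broader syntax check than the single-file command in step 4 can be scripted. The sketch below compiles every Python file under a directory and reports failures. The helper `find_syntax_errors` is a hypothetical addition, not part of the repository.

```python
import py_compile
from pathlib import Path
from typing import List


def find_syntax_errors(root: Path) -> List[str]:
    """Compile every .py file under `root`; return paths that fail."""
    failures = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True turns compile errors into exceptions; without it
            # py_compile only prints the error and carries on.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            failures.append(str(path))
    return failures
```

Pointing it at core/, services/, infra/, and runtimes/ separately avoids walking into a virtual environment that may live in the repository root.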

Areas We'd Love Help With

  • Streaming audio output (chunked WAV)
  • WebSocket progress events during render
  • Additional TTS model backends (Bark, XTTS-v2)
  • Docker image for one-command deployment
  • Test suite (pytest) for core/ and services/

📄 License

This project is licensed under the MIT License.

Model License: Qwen3-TTS models are subject to their original license terms. This application is a UI wrapper and does not claim ownership of underlying models.


🙏 Acknowledgements

Resources: Qwen3-TTS Paper · Models on HuggingFace · Official Repo
