An end-to-end pipeline for voice cloning — collecting and labelling training data, then fine-tuning a TTS model to reproduce a target voice.
dataset-builder prepares a labelled speech dataset from raw video sources: it takes video files and produces a HuggingFace-compatible dataset of clean, transcribed audio clips ready for TTS fine-tuning.
See dataset-builder/README.md for full usage.
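
As a rough illustration of what "HuggingFace-compatible" can look like on disk, the sketch below writes a `metadata.csv` in the style of the Hugging Face `audiofolder` loader, which expects a `file_name` column. Whether dataset-builder uses this exact layout is an assumption; the clip paths, the `transcription` column name, and the `write_metadata` helper are hypothetical.

```python
import csv
import io


def write_metadata(rows, out):
    """Write (file_name, transcription) pairs as an audiofolder-style metadata.csv."""
    writer = csv.writer(out, lineterminator="\n")
    # `file_name` is the column the audiofolder loader looks for;
    # `transcription` is an assumed name for the text label column.
    writer.writerow(["file_name", "transcription"])
    for file_name, text in rows:
        writer.writerow([file_name, text.strip()])


# Hypothetical clip names and transcripts, for illustration only.
clips = [
    ("clips/0001.wav", "Hello there."),
    ("clips/0002.wav", "Well, this is a second clip. "),
]

buf = io.StringIO()
write_metadata(clips, buf)
print(buf.getvalue())
```

Transcripts containing commas are quoted automatically by the `csv` module, so the file stays parseable.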
| Model | Stage | Why |
|---|---|---|
| Silero VAD | Speech filtering | Lightweight CPU-based voice activity detector. Scans 15s audio chunks and extracts only segments containing human speech, discarding silence and background noise before any heavy processing. |
| Meta SAM Audio | Voice isolation | GPU-accelerated source separation model (run on Colab). Separates the target speaker's voice from background music, ambient sound, and other speakers, producing clean isolated vocal tracks. |
| faster-whisper | Transcription | CPU int8-quantised Whisper. Transcribes the cleaned audio clips to produce the text labels required for TTS fine-tuning. Runs fully locally with no API calls. |
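
The speech-filtering stage in the table works on 15s chunks. The sketch below shows one way the chunk boundaries and a minimum-duration filter could be computed. It is not the builder's actual code: the 16 kHz sample rate, the `min_seconds` threshold, and both helper names are assumptions, and the segment dicts are only shaped like the `start`/`end` sample offsets a VAD typically returns.

```python
def chunk_bounds(total_samples, sample_rate=16000, chunk_seconds=15):
    """Return (start, end) sample indices covering the audio in fixed-size chunks."""
    step = sample_rate * chunk_seconds
    return [(s, min(s + step, total_samples)) for s in range(0, total_samples, step)]


def keep_speech(segments, min_seconds=1.0, sample_rate=16000):
    """Drop detected speech segments shorter than min_seconds (assumed threshold)."""
    min_samples = int(min_seconds * sample_rate)
    return [seg for seg in segments if seg["end"] - seg["start"] >= min_samples]


# 40 seconds of 16 kHz audio splits into two full chunks and one 10 s remainder.
bounds = chunk_bounds(40 * 16000)
print(bounds)

# Hypothetical VAD output: a 0.5 s blip and a 3 s utterance.
segments = [{"start": 0, "end": 8000}, {"start": 16000, "end": 64000}]
print(keep_speech(segments))
```

Filtering short segments before voice isolation keeps the GPU stage from spending time on clips too brief to be useful as training examples.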

model-trainer fine-tunes Orpheus TTS (3B, Llama backbone) on the dataset produced by dataset-builder, using LoRA via unsloth. It runs in Google Colab on a GPU.
Key steps in orpheus-tts.ipynb:
- Load dataset from Google Drive or HF Hub
- Assign a voice name (e.g. "david"), used as the speaker prefix at inference
- Tokenize audio with SNAC codec + assemble training sequences
- Fine-tune with LoRA (configurable rank, epochs, batch size)
- Save merged 16-bit weights + LoRA adapters to Drive (optionally push to HF Hub)
See model-trainer/orpheus-tts.ipynb for full usage.
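
To make the "speaker prefix" step concrete, here is a minimal sketch assuming a `<voice>: <text>` convention for the text side of each training example. The exact prompt format (including any special tokens) is defined by the notebook and siren-api/README.md, so treat `format_example` as a hypothetical helper rather than the project's implementation.

```python
def format_example(voice, text):
    """Prefix a transcript with the speaker name, e.g. 'david: Hello there.'"""
    voice = voice.strip()
    if not voice:
        raise ValueError("voice name must not be empty")
    return f"{voice}: {text.strip()}"


# The same prefix has to be used at training time and at inference time,
# otherwise the fine-tuned voice will not be selected.
print(format_example("david", " Hello there. "))
```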
siren-api is a minimal streaming TTS service that wraps a llama-server running an Orpheus-format GGUF and returns chunked WAV audio. Endpoints: POST /tts, GET /voices, GET /health, plus a small browser demo at /. Pointing it at a different fine-tuned voice is purely an env-var change.
On Apple Silicon, run natively; don't use Docker. Docker on Mac can't reach Metal, so llama-server falls back to CPU (~14 tok/s vs ~80–150 tok/s native, i.e. ~5–10× slowdown). Use Docker only for CI / cloud where the container reaches a real CUDA GPU.
# 1. convert your fine-tuned HF checkpoint → GGUF (one-time, per voice)
brew install llama.cpp
pip install --force-reinstall \
"git+https://github.com/ggml-org/llama.cpp.git@master#subdirectory=gguf-py"
mkdir -p models/gguf/<voice>
python "$(brew --prefix llama.cpp)/bin/convert_hf_to_gguf.py" \
models/raw-16bit/<voice> \
--outfile models/gguf/<voice>/model-f16.gguf \
--outtype f16
llama-quantize models/gguf/<voice>/model-f16.gguf \
models/gguf/<voice>/model-q4_k_m.gguf Q4_K_M
# 2. start llama-server with full Metal offload
llama-server \
-m models/gguf/<voice>/model-q4_k_m.gguf \
-c 8192 --port 1234 -ngl 999
# 3. start the API (separate shell)
cd siren-api
cp .env.example .env # set DEFAULT_VOICE / VOICES to match the trained speaker prefix
uv sync
uv run uvicorn main:app --host 0.0.0.0 --port 8000
# 4. open the demo
open http://localhost:8000/

cd siren-api
cp .env.example .env # set MODEL_DIR, MODEL_FILE, DEFAULT_VOICE, VOICES
docker compose up --build

curl http://localhost:8000/voices # available voices + currently loaded model
curl -X POST http://localhost:8000/tts \
-H 'content-type: application/json' \
-d '{"text":"Hello there.","voice":"<voice>"}' \
--output out.wav

See siren-api/README.md for the full reference (config knobs, debugging hangs, all the prompt-format details).
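
For callers that prefer Python over curl, the sketch below sends the same JSON body to POST /tts using only the standard library and streams the response to disk. The `text` and `voice` fields and the port come from the curl example; the helper names, the 8 KiB read size, and the assumption that the response body can be written out as-is are illustrative.

```python
import json
import urllib.request


def build_tts_request(text, voice, base_url="http://localhost:8000"):
    """Build the POST /tts request with the same JSON body as the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, voice, out_path="out.wav", base_url="http://localhost:8000"):
    """Send the request and write the audio to out_path as it arrives."""
    req = build_tts_request(text, voice, base_url)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        # Read in small pieces so a long utterance is not buffered fully in memory.
        while chunk := resp.read(8192):
            f.write(chunk)
    return out_path


# Usage, with the API running locally and "david" as the trained voice:
#   synthesize("Hello there.", "david")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.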


