Siren - A voice cloning framework

An end-to-end pipeline for voice cloning — collecting and labelling training data, then fine-tuning a TTS model to reproduce a target voice.

Modules

Dataset Builder

Prepares a labelled speech dataset from raw video sources. The pipeline takes video files and produces a HuggingFace-compatible dataset of clean, transcribed audio clips ready for TTS fine-tuning.

See dataset-builder/README.md for full usage.

Models used

Model	Stage	Why
Silero VAD	Speech filtering	Lightweight CPU-based voice activity detector. Scans 15s audio chunks and extracts only segments containing human speech, discarding silence and background noise before any heavy processing.
Meta SAM Audio	Voice isolation	GPU-accelerated source separation model (run on Colab). Separates the target speaker's voice from background music, ambient sound, and other speakers, producing clean isolated vocal tracks.
faster-whisper	Transcription	CPU int8-quantised Whisper. Transcribes the cleaned audio clips to produce the text labels required for TTS fine-tuning. Runs fully locally with no API calls.
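
The fixed-length windowing in the speech-filtering stage can be sketched as below. This is an illustration only, not the dataset-builder's actual code: the function name `chunk_bounds` is hypothetical, and the 16 kHz sample rate is an assumption (only the 15 s chunk length comes from the table above).

```python
from typing import Iterator, Tuple


def chunk_bounds(
    total_samples: int,
    sample_rate: int = 16_000,   # assumption: audio resampled to 16 kHz before VAD
    chunk_seconds: float = 15.0, # chunk length stated in the table above
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices for consecutive fixed-length chunks."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    for start in range(0, total_samples, chunk_len):
        # The last chunk is shorter when the audio length is not a multiple of chunk_len.
        yield start, min(start + chunk_len, total_samples)


if __name__ == "__main__":
    # 40 s of audio at 16 kHz splits into two full chunks and one 10 s remainder.
    for start, end in chunk_bounds(40 * 16_000):
        print(start, end)
```

Each window would then be passed to the voice activity detector, and only the spans it reports as speech kept for the later stages.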

Model Trainer

Orpheus TTS

Fine-tunes Orpheus TTS (3B, Llama backbone) on the dataset produced by dataset-builder using LoRA via unsloth. Runs in Google Colab on a GPU.

Key steps in orpheus-tts.ipynb:

Load dataset from Google Drive or HF Hub
Assign a voice name (e.g. "david") — used as speaker prefix at inference
Tokenize audio with SNAC codec + assemble training sequences
Fine-tune with LoRA (configurable rank, epochs, batch size)
Save merged 16-bit weights + LoRA adapters to Drive (optionally push to HF Hub)
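
As a rough illustration of the speaker-prefix step, the sketch below builds the text side of a prompt. It assumes the prefix takes the form `<voice>: <text>`; the helper name `build_prompt` is hypothetical, and the notebook's exact prompt format (including any special tokens) may differ, so check orpheus-tts.ipynb before relying on it.

```python
def build_prompt(voice: str, text: str) -> str:
    """Prefix the text with the voice name (assumed "<voice>: <text>" layout)."""
    voice = voice.strip()
    text = text.strip()
    if not voice or not text:
        raise ValueError("voice and text must both be non-empty")
    return f"{voice}: {text}"


if __name__ == "__main__":
    print(build_prompt("david", "Hello there."))
```

Whatever the exact format, the same voice name has to be used at training time and at inference time, otherwise the model never sees the prefix it was fine-tuned on.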

See model-trainer/orpheus-tts.ipynb for full usage.

Siren API

Minimal streaming TTS service that wraps a llama-server running an Orpheus-format GGUF and returns chunked WAV audio. Endpoints: POST /tts, GET /voices, GET /health, plus a small browser demo at /. Pointing it at a different fine-tuned voice is purely an env-var change.
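
One common way to return WAV audio before the total length is known is to send a 44-byte header with placeholder sizes and then stream raw PCM after it. The sketch below shows that technique in isolation. Whether siren-api does exactly this is an assumption, as are the 24 kHz, mono, 16-bit defaults and the function name `streaming_wav_header`.

```python
import struct


def streaming_wav_header(
    sample_rate: int = 24_000,  # assumption: codec output rate
    channels: int = 1,          # assumption: mono
    bits: int = 16,             # assumption: 16-bit PCM
) -> bytes:
    """Build a canonical 44-byte PCM WAV header with placeholder length fields."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    unknown = 0xFFFFFFFF  # total size is not known yet while streaming
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", unknown, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,
        b"data", unknown,
    )


if __name__ == "__main__":
    header = streaming_wav_header()
    print(len(header), header[:4], header[8:12])
```

The header goes out as the first chunk of the response and every later chunk is PCM samples only. Most players tolerate the placeholder sizes, although strict parsers may not.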

Quick start

On Apple Silicon, run natively rather than in Docker. Docker containers on macOS cannot access Metal, so llama-server falls back to CPU (~14 tok/s vs ~80–150 tok/s native, roughly a 6–11× slowdown). Use Docker only for CI or cloud deployments where the container can reach a real CUDA GPU.
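
The slowdown follows directly from the throughput figures quoted above. A quick check, using only those numbers (the variable names are illustrative):

```python
# Throughput figures quoted in the paragraph above, in tokens per second.
cpu_tok_s = 14
native_tok_s = (80, 150)

# Slowdown factor = native throughput / CPU-fallback throughput.
slowdowns = [native / cpu_tok_s for native in native_tok_s]

if __name__ == "__main__":
    low, high = slowdowns
    print(f"slowdown: {low:.1f}x to {high:.1f}x")
```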

Native (recommended on Mac)

# 1. convert your fine-tuned HF checkpoint → GGUF (one-time, per voice)
brew install llama.cpp
pip install --force-reinstall \
  "git+https://github.com/ggml-org/llama.cpp.git@master#subdirectory=gguf-py"

mkdir -p models/gguf/<voice>
python "$(brew --prefix llama.cpp)/bin/convert_hf_to_gguf.py" \
  models/raw-16bit/<voice> \
  --outfile models/gguf/<voice>/model-f16.gguf \
  --outtype f16
llama-quantize models/gguf/<voice>/model-f16.gguf \
  models/gguf/<voice>/model-q4_k_m.gguf Q4_K_M

# 2. start llama-server with full Metal offload
llama-server \
  -m models/gguf/<voice>/model-q4_k_m.gguf \
  -c 8192 --port 1234 -ngl 999

# 3. start the API (separate shell)
cd siren-api
cp .env.example .env       # set DEFAULT_VOICE / VOICES to match the trained speaker prefix
uv sync
uv run uvicorn main:app --host 0.0.0.0 --port 8000

# 4. open the demo
open http://localhost:8000/
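
When scripting the steps above, it can help to wait for the API to come up before opening the demo. The sketch below polls GET /health. It assumes the endpoint answers HTTP 200 once the service is ready, which is not confirmed here; the helper names `http_ok` and `wait_until_healthy` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "API did not become healthy")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.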

Docker Compose (CI / cloud GPU)

cd siren-api
cp .env.example .env       # set MODEL_DIR, MODEL_FILE, DEFAULT_VOICE, VOICES
docker compose up --build

Endpoints

curl http://localhost:8000/voices    # available voices + currently loaded model
curl -X POST http://localhost:8000/tts \
  -H 'content-type: application/json' \
  -d '{"text":"Hello there.","voice":"<voice>"}' \
  --output out.wav
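
The same POST /tts call can be made from Python with only the standard library. This is a minimal sketch mirroring the curl example above; the helper names `build_tts_request` and `synthesize_to_file` are hypothetical, and error handling is left out.

```python
import json
import urllib.request


def build_tts_request(
    text: str, voice: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST /tts request with the same JSON body as the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(
    text: str,
    voice: str,
    path: str,
    base_url: str = "http://localhost:8000",
    chunk_size: int = 8192,
) -> None:
    """Write the audio response to `path` as it arrives."""
    request = build_tts_request(text, voice, base_url)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)


if __name__ == "__main__":
    # Replace "david" with the speaker prefix your model was trained on.
    synthesize_to_file("Hello there.", "david", "out.wav")
```

Reading in fixed-size chunks keeps memory use flat for long utterances, since the audio is written to disk while the server is still generating it.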

See siren-api/README.md for the full reference (config knobs, debugging hangs, all the prompt-format details).
