VODER is a Local, Free, Offline, professional-grade voice processing and transformation tool that enables seamless conversion between speech, text, and music. Built for creators, developers, and audio professionals, VODER delivers high-quality synthesis, voice cloning, transcription, music generation, sound effects, and speech enhancement capabilities through an intuitive interface.
🚀 Ready in Colab: Open VODER in Google Colab. Open the link, connect to a runtime, and press Run All (or run the cells one by one until the last one). Once execution completes, VODER is ready to use directly in your browser — no installation required.

🤖 For AI agents and automated tools: See Bots.md

```bash
# Clone the repository
git clone https://github.com/HAKORADev/VODER.git
cd VODER

# Install dependencies
pip install -r requirements.txt

# IMPORTANT: After installing requirements, upgrade protobuf to avoid compatibility issues
pip install --upgrade protobuf==5.29.6

# Launch GUI
python src/voder.py

# Or use CLI mode
python src/voder.py cli
```
```bash
# Install FFmpeg (required for audio processing)
# Windows: winget install FFmpeg
# macOS: brew install ffmpeg
# Linux: sudo apt install ffmpeg

# Additional system dependencies (required for new features)
# Linux: sudo apt install sox
# macOS: brew install sox

# yt-dlp
pip install yt-dlp
```

New Dependencies (v04/08/2026 update): VODER now requires `yt-dlp` (for YouTube/Bilibili/TikTok URL support), `easyocr` and `onnxruntime` (for image text extraction), `lightning` (for pyannote model loading), `sox` (for audio manipulation), `einx`, `x-transformers`, `safetensors`, and `soxr` (for UniSE speech enhancement), and `tqdm`/`packaging`. These are included in `requirements.txt` — simply run `pip install -r requirements.txt` after pulling the latest version.
VODER offers nine distinct voice processing modes, each designed for specific audio transformation needs:
| Mode | Description | Input | Output |
|---|---|---|---|
| STT+TTS | Speech-to-Text then Text-to-Speech | Audio | Audio |
| TTS | Text-to-Speech with Voice Design | Text | Audio |
| TTS+VC | Text-to-Speech + Voice Cloning | Text + Reference | Audio |
| STS | Speech-to-Speech (Voice Conversion) | Audio + Reference | Audio |
| TTM | Text-to-Music Generation | Text | Audio |
| TTM+VC | Text-to-Music + Voice Conversion | Text + Reference | Audio |
| STT | Speech-to-Text (Transcription) | Audio / Video / Image / URL | Text |
| SE | Speech Enhancement (Denoise/Dereverb) | Audio / Video | Audio / Video |
| SFX | Sound Effects Generation | Text | Audio |
MSTS (Music-STS): STS mode now supports musical inputs. When processing songs or musical audio, select "musical inputs?" to use the Seed-VC v1 model (44.1kHz) instead of the standard v2 model (22.05kHz), providing better voice conversion quality for music content.
SE (Speech Enhancement) is a standalone mode for improving audio quality by removing noise, reducing reverberation, and restoring speech clarity.
Supported Inputs:
- Audio files (WAV, MP3, FLAC, OGG, etc.)
- Video files (MP4, MKV, AVI, etc.) — audio is extracted automatically
Features:
- Denoising — removes background noise and artifacts
- Dereverberation — reduces room echo and reverb effects
- Speech restoration — enhances clarity and intelligibility
- Outputs at 16kHz sample rate (optimized for speech)
- Not designed for musical enhancement — use for speech content only
Quick Examples:
```bash
# Basic speech enhancement
python src/voder.py se "noisy_audio.wav"

# Enhance audio from video
python src/voder.py se "recording.mp4"

# Save to specific location
python src/voder.py se "audio.wav" result "/path/to/enhanced.wav"
```

CLI Usage:

```bash
# Interactive mode
python src/voder.py cli
# Select option 7 (SE)

# One-liner mode
python src/voder.py se "audio_file.wav" result "/output/enhanced.wav"
```

SFX (Sound Effects) is a standalone mode for generating custom sound effects from text descriptions.
Features:
- Text-to-audio generation for any sound effect
- Configurable duration (1-30 seconds)
- Adjustable inference steps (1-100, default 30)
- Adjustable guidance scale (1.0-10.0, default 4.5)
- 44.1kHz output quality
Quick Examples:
```bash
# Generate a simple sound effect (default 10 seconds)
python src/voder.py sfx sound "thunder rumbling in the distance"

# Specify duration
python src/voder.py sfx sound "rain on a tin roof" duration 15

# Adjust quality parameters
python src/voder.py sfx sound "explosion with debris" duration 5 steps 50 guide 3.5

# Save to specific location
python src/voder.py sfx sound "footsteps on gravel" duration 8 result "/output/footsteps.wav"
```

Parameters:
| Parameter | Description | Range | Default |
|---|---|---|---|
| `sound` | Text description of the sound effect | Any text | Required |
| `duration` | Duration in seconds | 1-30 | Required |
| `steps` | Inference steps (quality vs. speed) | 1-100 | 30 |
| `guide` | Guidance scale (adherence to prompt) | 1.0-10.0 | 4.5 |
| `result` | Output file path | Any path | Optional |
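For intuition, the one-liner syntax above (alternating parameter names and values) could be parsed along these lines. `parse_kv_args` is an illustrative sketch, not VODER's actual CLI parser:

```python
def parse_kv_args(argv):
    """Parse VODER-style one-liner arguments into a dict.

    Illustrative sketch only; naively assumes a value never collides
    with a parameter name.
    """
    known = {"sound", "duration", "steps", "guide", "result"}
    args, i = {}, 0
    while i < len(argv):
        if argv[i] in known and i + 1 < len(argv):
            args[argv[i]] = argv[i + 1]  # consume "key value" pair
            i += 2
        else:
            i += 1  # skip unrecognized tokens
    return args

print(parse_kv_args(["sound", "thunder", "duration", "5", "steps", "50"]))
```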
Sound Prompt Tips:
- Be descriptive but concise
- Include environmental context (e.g., "in a forest", "in a small room")
- Specify intensity (e.g., "distant", "loud", "faint")
- Combine multiple elements (e.g., "thunder with heavy rain")
STT is a standalone transcription mode available as a one-line CLI command. It transcribes audio, video, images, or YouTube URLs into plain text with optional enhancements.
Supported Inputs:
- Audio files (WAV, MP3, FLAC, OGG, etc.)
- Video files (MP4, MKV, AVI, etc.)
- Image files containing text (PNG, JPG, etc.) — text is extracted via OCR before transcription
- YouTube / Bilibili / TikTok URLs — downloaded and processed automatically
Features:
- Clean text transcription output
- Optional timestamps for word-level or segment-level timing
- Optional dialogue mode that detects and formats multi-speaker conversations
- Optional speaker diarization that identifies individual speakers by name/label
- Batch processing — pass multiple files/URLs in a single command to process them all at once
- Results saved to a specified output file or printed to the terminal
Quick Examples:
```bash
# Basic transcription
python src/voder.py stt "audio.wav"

# With timestamps
python src/voder.py stt "audio.wav" timestamp

# With dialogue formatting
python src/voder.py stt "audio.wav" dialogue

# Batch processing
python src/voder.py stt "audio1.wav" "audio2.wav"

# Transcribe a YouTube video directly
python src/voder.py stt "https://youtube.com/watch?v=..."

# Save TTS output to a specific file
python src/voder.py tts script "Hello" voice "male" result "/path/to/output.txt"
```

VODER features a powerful row-based dialogue editor designed for creating multi-speaker audio content such as podcasts, AI news broadcasts, audiobooks, and conversational content. This system enables script-based generation where multiple characters speak with distinct voices in a cohesive narrative flow.
GUI Dialogue Input:
- Each line is a separate row with Character and Dialogue fields.
- New rows are added automatically when you fill the last row.
- First row has no delete button; subsequent rows can be deleted individually.
- Voice prompts or audio assignments appear dynamically for every character found in the script.
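As a rough illustration of the row convention above (a `Character:` prefix per line), a script row could be split like this. `split_line` and its fallback for lines without a colon are hypothetical, not part of VODER's code:

```python
def split_line(row: str):
    """Split a script row into (character, dialogue).

    Sketch of the colon convention described above; treating a
    colon-free row as having no character is an assumption.
    """
    if ":" in row:
        character, text = row.split(":", 1)  # split only on the first colon
        return character.strip(), text.strip()
    return None, row.strip()  # single-mode line, no character

print(split_line("James: Welcome to our podcast!"))
```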
Script Directives (Per-Line):
VODER now supports powerful directives that can be appended to any dialogue line for fine-grained control:
| Directive | Format | Description |
|---|---|---|
| `/time:nn` | `/time:5` | Position this line at 5 seconds from the start |
| `/time:nn-nn` | `/time:10-3` | Position at 10s, cut 3s from the end |
| `/time:nn+nn` | `/time:5+2` | Position at 5s, cut 2s from the start |
| `/time:nn-nn+nn` | `/time:10-3+2` | Position at 10s, cut 3s from the end, cut 2s from the start |
| `/level:0-100` | `/level:75` | Set the volume level for this line (default: 100) |
| `/duration:1-30` | `/duration:10` | Duration for SFX lines (required for the `sfx:` character) |
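To make the directive grammar concrete, here is a minimal sketch of how the trailing directives on a line might be extracted. `parse_directives` is hypothetical, not VODER's actual parser:

```python
import re

def parse_directives(line: str):
    """Split a dialogue line into its text and a dict of directives.

    Hypothetical sketch of the /time, /level, and /duration syntax
    described above -- not VODER's implementation.
    """
    directives = {}
    # /time:10-3+2 -> start at 10s, cut 3s from the end, cut 2s from the start
    m = re.search(r"/time:(\d+)(?:-(\d+))?(?:\+(\d+))?", line)
    if m:
        directives["start"] = int(m.group(1))
        if m.group(2):
            directives["cut_end"] = int(m.group(2))
        if m.group(3):
            directives["cut_start"] = int(m.group(3))
    m = re.search(r"/level:(\d+)", line)
    if m:
        directives["level"] = int(m.group(1))
    m = re.search(r"/duration:(\d+)", line)
    if m:
        directives["duration"] = int(m.group(1))
    # strip all directives to recover the plain dialogue text
    text = re.sub(r"/(time|level|duration):[\d+\-]+", "", line).strip()
    return text, directives

print(parse_directives("sfx: door creaking open /duration:3 /level:30"))
```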
SFX Lines in Dialogue:
You can now embed sound effects directly in dialogue scripts using the special `sfx:` character:

```
James: Welcome to our podcast!
sfx: door creaking open /duration:3
Sarah: Hello everyone, glad to be here.
sfx: gentle background music /duration:10 /level:30
James: Let's dive into today's topic.
```

SFX Line Requirements:
- The Character field must be `sfx` (case-insensitive)
- The `/duration:nn` directive is required (1-30 seconds)
- An optional `/level:0-100` controls the volume
Optional Background Music:
- When generating dialogue (TTS or TTS+VC mode), VODER can automatically add ambient background music that matches the length of the spoken audio.
- A dialog appears before processing, asking: "Enter music description (or press Skip):"
- If a description is provided (e.g., `"soft piano, cinematic strings"`), VODER:
  - Generates music via ACE-Step, using the description as the style prompt and `"..."` as empty lyrics.
  - Automatically fits the music duration to the exact length of the dialogue.
  - Mixes the music at 35% volume relative to the dialogue (configurable via the `level` parameter).
  - Cleans up temporary files and saves the final result with an `_m` suffix (e.g., `voder_tts_dialogue_..._m.wav`).
- If the user skips, processing proceeds normally without music.
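The mixing step described above (music under dialogue at a percentage of full volume) can be sketched in plain Python. `mix_background` and its loop-to-fit behaviour are illustrative assumptions, not VODER's DSP code; samples are floats in [-1, 1]:

```python
def mix_background(dialogue, music, level=35):
    """Mix music under dialogue at `level` percent of full volume.

    Illustrative sketch: the music is looped (or implicitly trimmed)
    to match the dialogue length -- an assumption, not VODER's code.
    """
    gain = level / 100.0
    mixed = []
    for i, sample in enumerate(dialogue):
        m = music[i % len(music)] if music else 0.0  # loop music to fit
        mixed.append(sample + gain * m)
    return mixed

print(mix_background([0.0, 0.0], [1.0, 1.0], level=50))
```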
Music Volume Level Control:
The level parameter allows fine-grained control over background music volume throughout the dialogue:
```bash
# Constant volume
python src/voder.py tts script "James: Hello" voice "James: male" music "soft piano" level "50"

# Time-based segments (from 0s: 20%, at 30s: fade to 50%)
python src/voder.py tts script "James: Hello" voice "James: male" music "soft piano" level "0:20-30:50"

# With fade transitions
python src/voder.py tts script "James: Hello" voice "James: male" music "cinematic" level "0:20-30:50+60"
```

Level Format:
- `"volume"` — Constant volume percentage (e.g., `"35"` for 35%)
- `"start:vol-end:vol"` — Volume at the start time, a different volume at the end time
- `"start:from-to+fade"` — Fade from one volume to another over the specified duration
This feature is available in both GUI and CLI modes (interactive and one‑line). It is only triggered for dialogue scripts (i.e., more than one line, or a single line containing a colon).
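One plausible reading of the level format can be sketched as follows. `parse_level_spec` and `volume_at` are illustrative helpers (not VODER's code), and the `+fade` suffix is ignored here since its exact semantics are not pinned down above:

```python
def parse_level_spec(spec: str):
    """Parse a level spec like "35" or "0:20-30:50" into (time, volume) points.

    Illustrative interpretation of the format described above; a
    trailing "+fade" (e.g. "50+60") is stripped and ignored.
    """
    if ":" not in spec:
        return [(0.0, float(spec))]  # constant volume
    points = []
    for seg in spec.split("-"):
        t, vol = seg.split(":")
        points.append((float(t), float(vol.split("+")[0])))
    return points

def volume_at(points, t):
    """Volume at time t: hold before/after, ramp linearly between points."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]

pts = parse_level_spec("0:20-30:50")
print(volume_at(pts, 15))
```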
Example Script (conceptual):

```
James: Welcome to our podcast! Today we'll explore AI advances.
Sarah: Thanks James! I'm excited to discuss my latest research.
sfx: keyboard typing /duration:5 /level:40
James: Let's dive in. First, tell us about neural networks.
```
Key Features:
- Multi-character script support with real-time character extraction
- Individual voice prompts for each character (TTS mode)
- Reference audio assignment per character via dropdown numbers (TTS+VC mode)
- SFX character support — embed sound effects in dialogue
- Script directives — time positioning, volume control, duration control per line
- Optional background music – automatically generated, duration‑fitted, volume‑controlled
- Automatic audio concatenation with proper pacing
- Ideal for podcasts, news segments, interviews, and storytelling
The dialogue system is available in both TTS (Voice Design) and TTS+VC (Voice Cloning) modes, allowing you to create voices either through descriptive prompts or by cloning from real audio samples.
VODER now supports cross-platform source input — a unified input pipeline that accepts audio, video, images, and URLs across multiple processing modes. This enables powerful new workflows:
- YouTube / Bilibili / TikTok URL Support: Paste a video URL directly as input in STT, STT+TTS, and dialogue modes. VODER automatically downloads the audio track and processes it — no manual downloading or conversion required.
- Image Text Extraction (OCR): Feed image files (PNG, JPG, etc.) as input. VODER uses EasyOCR to extract embedded text, which is then processed as dialogue script content. This works in STT, TTS, and TTS+VC modes — enabling workflows like "photo of a script → spoken audio."
- Automatic Voice Clip Extraction: When processing multi-speaker audio (e.g., a podcast recording), VODER can automatically identify and extract individual speaker segments. This replaces the previous manual approach of splitting audio files.
- Speaker Diarization: Powered by pyannote, VODER identifies who spoke when in multi-speaker audio. Each speaker is labeled consistently, and the diarization output can be combined with transcription for fully annotated results.
Multi-Speaker Input — Now Supported! Previous versions of VODER required manually separating multi-speaker audio before processing. With the new Intelligent Source Analysis system, VODER can now accept multi-speaker audio directly. The speaker diarization pipeline automatically identifies speakers, extracts their voice clips, and makes them available for voice cloning and transcription. See Guide.md for the updated workflow.
VODER leverages state-of-the-art open-source models for professional-grade audio processing:
- Speech Recognition: openai/whisper — Whisper for accurate audio transcription
- Voice Synthesis: QwenLM/Qwen3-TTS — Qwen3-TTS for natural text-to-speech
- Voice Conversion: Plachtaa/seed-vc — Seed-VC for speech-to-speech transformation
- Music Generation: ace-step/ACE-Step-1.5 — ACE-Step for lyrics-to-music synthesis
- Sound Effects: declare-lab/TangoFlux — TangoFlux for text-to-audio generation
- Speech Enhancement: alibaba/unified-audio — UniSE for denoising, dereverberation, and speech restoration
- Speaker Diarization: pyannote/speaker-diarization-community-1 — pyannote for identifying and labeling individual speakers in multi-speaker audio
- Image Text Extraction: EasyOCR — EasyOCR for extracting text from images, enabling image-to-speech workflows
- Launch: `python src/voder.py`
- Select a mode from the dropdown (9 available modes)
- Load input files based on the mode:
  - STT+TTS: Load base audio (content), then load target audio (voice)
  - STT: Load audio, video, or an image, or enter a URL for transcription
  - TTS: Enter dialogue row by row in the script area, and fill in the automatically generated voice prompts for each character. Optional: before generation, a dialog will ask if you want background music; enter a description or press Skip.
  - TTS+VC: Enter dialogue rows, load voice reference audio files (each assigned a number), then assign each character an audio number via the dropdown. Optional: the same background music dialog appears before generation.
  - STS: Load base audio and target voice audio
  - TTM: Enter lyrics and a style prompt
  - TTM+VC: Enter lyrics and a style prompt, and load target voice audio
  - SE: Load an audio or video file for enhancement
- Click "Generate" (TTS/TTS+VC/TTM/TTM+VC), "Patch" (STT+TTS/STS), "Transcribe" (STT), or "Enhance" (SE)
- Listen to the output and save the results
```bash
python src/voder.py cli
```

The interactive CLI now supports full dialogue creation:
- Enter multiple lines (an empty line finishes input).
- Lines without a colon → single mode (one text, one voice prompt/audio).
- Lines with a colon (`Character: text`) → dialogue mode.
- Use `sfx: description /duration:nn` for embedded sound effects.
- VODER will ask for a voice prompt (TTS) or an audio file path (TTS+VC) for each character, in order.
- After collecting all voice prompts/assignments, you will be asked `Add background music? (y/N):`. If you answer `y` or `yes`, you can enter a music description and optionally a level specification; leaving the description blank skips the music.
One‑line commands now support dialogue mode through multiple values per parameter, as well as the optional `music` and `level` parameters for background music.
Single mode examples:
```bash
python src/voder.py tts script "Hello world" voice "female, cheerful"
python src/voder.py tts+vc script "Hello" target "voice.wav"
python src/voder.py tts ocr "path/to/image.png" voice "text: female voice"
python src/voder.py tts+vc ocr "path/to/image.png" target "text: voice.wav"
python src/voder.py sts base "input.wav" target "voice.wav"
python src/voder.py sts base "source.wav" target "reference.wav" mimic
python src/voder.py ttm lyrics "Verse 1:\nWalking down the empty street\nFeeling the rhythm in my feet" styling "upbeat pop" duration 30
python src/voder.py ttm+vc lyrics "..." styling "pop" duration 30 target "voice.wav"
python src/voder.py se "noisy_audio.wav"
python src/voder.py sfx sound "thunder" duration 5
```

STT mode examples:
```bash
# Basic transcription
python src/voder.py stt "audio.wav"

# With timestamps
python src/voder.py stt "audio.wav" timestamp

# With dialogue formatting
python src/voder.py stt "audio.wav" dialogue

# Batch processing — multiple files in one command
python src/voder.py stt "audio1.wav" "audio2.wav"

# Transcribe a YouTube video directly
python src/voder.py stt "https://youtube.com/watch?v=..."

# Save TTS output to a specific file
python src/voder.py tts script "Hello" voice "male" result "/path/to/output.txt"
```

Dialogue mode examples (TTS):
```bash
python src/voder.py tts script "James: Welcome to the show!" "Sarah: Glad to be here." voice "James: deep male voice, authoritative" "Sarah: bright female voice, energetic"
python src/voder.py tts script "James: Welcome to the show!" "Sarah: Glad to be here." voice "James: deep male voice, authoritative" "Sarah: bright female voice, energetic" music "soft piano, cinematic"
python src/voder.py tts script "James: Hello" "sfx: door bell /duration:3" voice "James: deep male" music "ambient" level "0:30-60:50"
```

Dialogue mode examples (TTS+VC):
```bash
python src/voder.py tts+vc script "James: Let's start with AI." "Sarah: I've been working on this for years." target "James: /path/to/james_voice.wav" "Sarah: /path/to/sarah_voice.wav"
python src/voder.py tts+vc script "James: Let's start with AI." "Sarah: I've been working on this for years." target "James: /path/to/james_voice.wav" "Sarah: /path/to/sarah_voice.wav" music "ambient electronic, chill" level "40"
```

Cross-use feature (TTS and TTS+VC):
Both TTS and TTS+VC one-line modes support mixing generated and cloned voices in the same dialogue. Use `voice "Character: prompt"` for generated voices and `target "Character: path"` for cloned voices:

```bash
# James uses a generated voice, Sarah uses a cloned voice (TTS mode)
python src/voder.py tts script "James: Hello!" "Sarah: Hi there!" voice "James: deep male voice" target "Sarah: /path/to/sarah_voice.wav"

# James uses a cloned voice, Sarah uses a generated voice (TTS+VC mode)
python src/voder.py tts+vc script "James: Welcome!" "Sarah: Thanks!" target "James: /path/to/james_voice.wav" voice "Sarah: bright female voice"
```

Note: A character cannot have both `voice` and `target` assignments — each character must use either a generated or a cloned voice, not both.
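The exclusivity rule above can be checked mechanically. `validate_assignments` is a hypothetical helper for illustration, not part of VODER:

```python
def validate_assignments(characters, voice, target):
    """Check that every character has exactly one of a voice prompt
    (generated voice) or a cloning target, mirroring the rule above.
    Hypothetical helper, not VODER's code.
    """
    errors = []
    for c in characters:
        has_voice, has_target = c in voice, c in target
        if has_voice and has_target:
            errors.append(f"{c}: has both voice and target")
        elif not has_voice and not has_target:
            errors.append(f"{c}: missing voice or target")
    return errors

print(validate_assignments(
    ["James", "Sarah"],
    voice={"James": "deep male voice"},
    target={"Sarah": "/path/to/sarah_voice.wav"},
))
```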
SFX mode examples:
```bash
python src/voder.py sfx sound "rain on a tin roof" duration 10
python src/voder.py sfx sound "thunder rumbling" duration 5 steps 50 guide 3.5
python src/voder.py sfx sound "footsteps on gravel" duration 8 result "/output/footsteps.wav"
```

Note: STT+TTS mode is not available in the one-line CLI because it requires interactive text editing.
If the `music` parameter is supplied in single mode (plain text without a colon), it is ignored with a warning.
| Component | Specification |
|---|---|
| CPU | 4-6 cores |
| RAM | 12GB+ system memory |
| GPU (CUDA) | Optional (CPU-only operation supported) |
| VRAM | 4GB minimum (6GB recommended, 16GB for best performance) |
| Storage | SSD recommended |
Note: VODER runs entirely on CPU. No GPU is required for any mode. However, having a GPU with sufficient VRAM can significantly improve processing speed for certain modes.
Speaker diarization (STT with diarization or multi-speaker analysis) adds additional memory requirements:
- RAM: Expect approximately 2–3GB more system memory when using speaker diarization, as the pyannote model loads alongside the transcription pipeline
- HF_TOKEN: The pyannote speaker-diarization-community-1 model requires a Hugging Face access token with accepted terms of use. Set the `HF_TOKEN` environment variable before running:

  ```bash
  export HF_TOKEN="hf_your_token_here"
  ```

  You can obtain a token from huggingface.co/settings/tokens after accepting the model's license on its Hugging Face page.
VODER is designed to maximize output quality rather than speed. Meeting the minimum requirements ensures reliable operation — the focus is on achieving professional-grade audio results, not processing benchmarks. More RAM allows for longer audio generation and more complex workflows. For the best experience with all features (including speaker diarization and speech enhancement), 16GB+ RAM is recommended.
- Unified Audio Pipeline: Nine processing modes in a single interface eliminate the need for multiple tools
- Centralized Model Management: A unified model management system handles loading, caching, and offloading of all AI models — ensuring efficient resource usage and preventing memory accumulation across multi-step workflows
- Intelligent Dialogue Editor: Row‑based script input with automatic character tracking and per‑voice assignment
- Script Directives: Per-line control over timing, volume, and duration for precise audio production
- SFX Integration: Embed sound effects directly in dialogue scripts using the special `sfx:` character
- Voice Cloning: Extract and replicate voice characteristics from reference audio samples
- Music Generation: Lyrics-to-music synthesis with style control and voice conversion
- Sound Effects Generation: Text-to-audio synthesis for custom sound design
- Speech Enhancement: Denoise, dereverberate, and restore speech audio
- Cross-Modal Transformation: Speech-to-speech, text-to-speech, speech-to-text, and text-to-text conversions
- Cross-Platform Source Input: Unified input pipeline accepts audio files, video files, images, and URLs (YouTube, Bilibili, TikTok) — no manual format conversion required
- Automatic Speaker Identification: Multi-speaker audio is automatically segmented and labeled using pyannote speaker diarization, with individual voice clips extracted for downstream processing
- Speaker Diarization with Word-Level Alignment: Combines Whisper transcription with pyannote diarization to produce speaker-labeled, timestamped transcripts with per-word speaker attribution
- MSTS (Music-STS): STS mode supports musical inputs using Seed-VC v1 at 44.1kHz for better music voice conversion
- Memory Optimization: Models are now explicitly offloaded after each operation to prevent memory accumulation in session-based workflows
- Background Music for Dialogue: Automatically generated, duration‑fitted, volume‑controlled ambient music with time-based level adjustments
- Guide.md — Detailed usage guide, technical implementation, and creative techniques
- CHANGELOG.md — Development history and version changes
- Bots.md — Guidelines for AI agents and automated systems
- voder-skill.md — Direct Agent skill
- Languages.md — Supported languages across all components, auto‑detection capabilities, and language configuration
Note: VODER does not maintain PyPI packages or pre-built binaries. Running from source ensures access to the most recent features and improvements.
VODER is open-source (MIT License) and welcomes contributions:
- New voice processing modes
- Additional model integrations
- UI/UX improvements
- Performance optimizations
- Documentation and translations
- Bug reports and feature requests
Please submit pull requests or issues via GitHub.
MIT License — See LICENSE for full details.
Built with appreciation for the open-source AI voice synthesis community and the amazing models that power VODER.
