A comprehensive AI-powered Text-to-Speech application with voice cloning, script generation, and real-time voice chat capabilities.
- Standard TTS: High-quality speech synthesis using Tacotron2 and HiFiGAN models
- Voice Cloning: Create custom voices from audio samples using XTTS v2
- Multiple Audio Formats: Support for WAV, MP3, FLAC, M4A, and OGG
- Text Preprocessing: Automatic number-to-words conversion and text cleaning
- Intelligent Script Writing: Generate compelling scripts using Ollama-powered LLMs
- Topic-based Generation: Create content from simple topic descriptions
- Conversation History: Maintain chat context for iterative improvements
- Direct Integration: Copy generated scripts directly to TTS input
- Voice Upload: Add custom voices with audio samples (5-120 seconds)
- Voice Validation: Automatic audio quality and format checking
- Voice Library: Manage multiple voices with metadata and descriptions
- Real-time Refresh: Dynamic voice list updates
- Real-time Conversation: Live voice chat using FastRTC streaming
- Speech-to-Text: Convert voice input to text for AI processing
- Configurable Responses: Adjustable response length and voice selection
- Low-latency Streaming: Optimized for responsive voice interactions
- Multi-tab Interface: Organized workflow with dedicated sections
- Progress Tracking: Real-time status updates and error handling
- Statistics Dashboard: System performance and usage metrics
- Configuration Management: Export/import settings and voice configurations
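As a rough illustration of the text-preprocessing feature, the sketch below uses only the standard library (the app itself relies on `num2words`; the `preprocess` helper and its tiny number table are hypothetical) to collapse whitespace and spell out small integers:

```python
import re

# Illustrative stand-in for num2words, covering small integers only
_SMALL = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
          6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}

def preprocess(text: str) -> str:
    """Collapse whitespace and spell out small integers before TTS."""
    text = re.sub(r"\s+", " ", text.strip())
    return re.sub(r"\b\d+\b",
                  lambda m: _SMALL.get(int(m.group()), m.group()),
                  text)
```

For example, `preprocess("Chapter  3 has   7 sections")` yields `"Chapter three has seven sections"`; the real pipeline handles arbitrary numbers via `num2words`.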
```shell
# Core Python packages
pip install torch torchaudio
pip install speechbrain
pip install TTS
pip install gradio
pip install requests
pip install numpy
pip install num2words

# FastRTC for real-time audio
pip install fastrtc
pip install sphn
```

- Ollama Server: Required for AI script generation and chat functionality
- Audio Processing: torchaudio for audio manipulation and format conversion
The easiest way to run the TTS Studio is using Docker Compose, which automatically sets up all dependencies including Ollama.
- Clone the Repository

```shell
git clone <repository-url>
cd tts-studio
```

- Configure Environment Variables

Create a `.env` file:

```shell
WWWUSER=1000
WWWGROUP=1001
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=deepseek-r1:7b
```

- Launch with Docker Compose

```shell
docker-compose up -d
```

The application will be available at:

- TTS Studio: http://localhost:1602
- Ollama API: http://localhost:11434
- Initialize the AI Model

```shell
# Wait for Ollama to start, then pull the model
docker-compose exec ollama ollama pull deepseek-r1:7b
```

- Clone the Repository

```shell
git clone <repository-url>
cd tts-studio
```

- Install Python Dependencies

```shell
pip install -r requirements.txt
```

- Set Up Ollama

```shell
# Install Ollama (visit https://ollama.ai)
ollama pull deepseek-r1:7b  # or your preferred model
```

- Configure Environment

```shell
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="deepseek-r1:7b"
```

- Create Required Directories

```shell
mkdir -p voice_samples generated_audio
```

- Run the Application

```shell
python main.py
```

The interface will launch at http://localhost:1602
- Enter text in the input field
- Select a voice from the dropdown
- Click "Generate Audio" to create speech
- Download or play the generated audio
- Navigate to the "Voice Cloning" tab
- Upload a clear audio sample (5-120 seconds)
- Provide a name for the voice
- Click "Upload Voice" to add it to your library
- The new voice will appear in the voice selection dropdown
- Go to the "AI Script Writer" tab
- Enter a topic or subject
- Click "Generate Script" for AI-powered content
- Edit the generated script as needed
- Use "Copy to TTS" to transfer the script for audio generation
- Open the "Live Voice Chat" tab
- Configure response voice and settings
- Click the microphone to start talking
- The AI will respond with synthesized speech
- Adjust STT and response length settings as needed
- `OLLAMA_BASE_URL`: Ollama server endpoint (default: `http://ollama:11434`)
- `OLLAMA_MODEL`: AI model for script generation (default: `deepseek-r1:7b`)
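These two settings can be read with their documented defaults as sketched below; the `ollama_settings` helper is an illustrative name, not part of the application:

```python
import os

def ollama_settings(env=os.environ) -> tuple:
    """Return (base_url, model), falling back to the documented defaults."""
    return (
        env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        env.get("OLLAMA_MODEL", "deepseek-r1:7b"),
    )
```

Passing a plain dict instead of `os.environ` makes the fallback behaviour easy to test.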
- Sample Rate: 22050 Hz (configurable)
- Max Audio Length: 120 seconds for voice samples
- Supported Formats: WAV, MP3, FLAC, M4A, OGG
- Max File Size: 50MB for uploads
Voice settings are stored in `voice_config.json`:

```json
{
  "voice_id": {
    "name": "Voice Display Name",
    "description": "Voice description",
    "type": "cloned",
    "enabled": true,
    "created_at": "2024-01-01T00:00:00",
    "metadata": {}
  }
}
```

```
tts-studio/
├── main.py              # Main application file
├── voice_config.json    # Voice configuration
├── voice_samples/       # Uploaded voice samples
│   └── voice_id/
│       └── reference.wav
├── generated_audio/     # Generated TTS outputs
├── tmp_tts/             # Temporary TTS model files
└── tmp_vocoder/         # Temporary vocoder files
```
The application connects to Ollama for AI-powered features:
- Script generation using LLM models
- Conversation management
- Real-time chat responses
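For script generation, the app would send a request to Ollama's `/api/generate` endpoint shaped roughly as below; the `model`, `prompt`, and `stream` fields are part of Ollama's HTTP API, while the prompt template and helper name are assumptions:

```python
def build_script_request(topic: str, model: str = "deepseek-r1:7b") -> dict:
    """Build a non-streaming Ollama /api/generate request body."""
    return {
        "model": model,
        "prompt": f"Write a short, engaging script about: {topic}",
        "stream": False,
    }
```

The resulting dict would be JSON-encoded and POSTed to `OLLAMA_BASE_URL` + `/api/generate`.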
- Standard Voice: Tacotron2 + HiFiGAN (SpeechBrain)
- Voice Cloning: XTTS v2 (Coqui TTS)
- Audio Processing: PyTorch Audio
Real-time audio streaming for voice chat:
- Low-latency audio transmission
- Bidirectional communication
- Configurable audio parameters
Ollama Connection Failed
- Verify Ollama is running: `ollama list`
- Check the base URL configuration
- Ensure the specified model is available
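Model availability can also be checked programmatically against Ollama's `/api/tags` endpoint, as in this sketch (the endpoint is part of Ollama's HTTP API; the helper names are illustrative):

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list:
    """Extract model names from an /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list:
    """Return the models the Ollama server has pulled, or raise on failure."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

If `list_ollama_models()` raises, Ollama is unreachable; if the configured model is missing from the result, pull it first.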
Voice Cloning Errors
- Verify audio sample quality (clear, noise-free)
- Check file format compatibility
- Ensure the sample meets the minimum duration requirement (5 seconds)
Audio Generation Failed
- Check text length (max 1000 characters)
- Verify voice file existence
- Review system resources and disk space
FastRTC Streaming Issues
- Check microphone permissions
- Verify network connectivity
- Ensure audio device compatibility
Memory Usage
- Models are cached after first load
- Generated audio files are automatically cleaned
- Voice validation runs periodically
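The "cached after first load" behaviour can be illustrated with `functools.lru_cache`; `load_model` below stands in for the real SpeechBrain/Coqui loaders and is not the application's actual API:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(name: str) -> dict:
    # The expensive load runs only once per model name;
    # later calls return the cached object.
    return {"name": name, "weights": object()}
```

Repeated calls with the same name return the identical cached object, so the model is held in memory rather than reloaded.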
Processing Speed
- GPU acceleration (if available)
- Concurrent audio processing
- Optimized text preprocessing
- `VoiceManager`: Handles voice configuration and file management
- `OllamaChat`: Manages AI chat and script generation
- `TTSGenerator`: Coordinates TTS model execution
- `EnhancedVoiceChatHandler`: Real-time voice chat processing
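The `VoiceManager` responsibilities can be sketched as below; the class name mirrors the document, but the fields and methods are assumptions about a minimal shape, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Voice:
    name: str
    description: str = ""
    type: str = "cloned"
    enabled: bool = True

class VoiceManager:
    """Keeps the voice registry in memory; persistence omitted for brevity."""

    def __init__(self):
        self._voices = {}

    def add(self, voice_id: str, voice: Voice) -> None:
        self._voices[voice_id] = voice

    def list_enabled(self) -> list:
        return [vid for vid, v in self._voices.items() if v.enabled]
```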
- Extend the appropriate class (VoiceManager, TTSGenerator, etc.)
- Add UI components in `create_interface()`
- Wire event handlers for new functionality
- Update configuration schema if needed
- Voice sample validation
- Audio format compatibility
- Ollama model availability
- FastRTC streaming functionality
This project uses various open-source components:
- SpeechBrain (Apache 2.0)
- Coqui TTS (MPL 2.0)
- Gradio (Apache 2.0)
- FastRTC (MIT)
For issues and questions:
- Check the troubleshooting section
- Verify all dependencies are installed
- Review system logs for detailed error messages
- Ensure external services (Ollama) are properly configured
Contributions welcome for:
- Additional TTS model support
- New voice processing features
- UI/UX improvements
- Performance optimizations
- Documentation updates