TrueTone is an AI-powered voice authenticity detection system that classifies an audio clip as real or AI-generated. It analyzes speech patterns, tone variations, and audio characteristics using machine learning.
As AI-generated voice technology spreads, detecting fake or manipulated audio has become a major challenge. TrueTone addresses this with real-time detection on CPU hardware, helping to improve security, reduce misinformation, and build trust in digital audio communication.
| Feature | Description |
|---|---|
| Real-Time Microphone Capture | Continuous 16 kHz mono audio capture via sounddevice with sliding-window chunk generation |
| WAV Audio Upload | File upload support through the Streamlit dashboard with full pipeline replay |
| System Audio Loopback | Capture speaker output through soundcard for monitoring playback or call audio |
| AI Detection Engine | Lightweight Wav2Vec2 / RawNetLite-compatible models via Hugging Face, optimized for CPU inference |
| Silero VAD Integration | Speech detection using Silero VAD for filtering non-speech audio, plus energy-based and hybrid gates |
| Streamlit Dashboard | Real-time probability meter, waveform visualization, historical probability graph, and warning alerts |
| Warning Alerts | Automatic red warning banner when AI probability exceeds a configurable threshold |
| Multi-Threaded Architecture | Separate threads for audio capture, VAD, AI inference, and UI updates via Python threading and queue |
| Temporal Aggregation | False-positive reduction through EMA smoothing, trend analysis, and hysteresis state machine |
| Audio Feature Analysis | Spectral entropy, pitch drift, jitter, shimmer, HNR, cadence consistency, breathiness scoring |
| Detection Event Logging | Event log table and session analytics visualization in the dashboard |
| CPU-Only Execution | Runs on standard consumer hardware without GPU (auto-detects CUDA if available) |
| Testing Suite | Unit tests, batch testing tools, threshold tuning, and model comparison utilities |
TRUE TONE uses a modular real-time architecture to capture microphone or uploaded audio and process it in small chunks. The system preprocesses the audio using normalization, silence filtering, and resampling before sending it to AI models like RawNetLite and Wav2Vec2 for synthetic voice detection. Detection results are displayed on a Streamlit dashboard with live alerts and waveform analysis, while Python threading and queues enable smooth real-time processing and responsive UI updates.
┌──────────────────────────────────────────────────────────────────────────────┐
│                      TRUE TONE - Pipeline Architecture                       │
│                                                                              │
│ ┌───────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌─────────┐ │
│ │ Audio Source  │───▶│ Thread Queue │───▶│ Inference Thread │───▶│ Scores  │ │
│ │ (Capture      │    │ (bounded)    │    │                  │    │ (deque) │ │
│ │  Thread)      │    └──────────────┘    │ ┌──────────────┐ │    └────┬────┘ │
│ └───────┬───────┘                        │ │ Silero VAD   │ │         │      │
│         │                                │ │ Speech Gate  │ │    ┌────┴────┐ │
│ ┌───────┴───────┐                        │ └──────┬───────┘ │    │Streamlit│ │
│ │ • Microphone  │                        │ ┌──────┴───────┐ │    │Dashboard│ │
│ │ • System Audio│                        │ │ Wav2Vec2 /   │ │    │  (UI)   │ │
│ │ • WAV Upload  │                        │ │ RawNetLite   │ │    │         │ │
│ │ • File Replay │                        │ │ Detector     │ │    │• Meter  │ │
│ └───────────────┘                        │ └──────┬───────┘ │    │• Wave   │ │
│                                          │ ┌──────┴───────┐ │    │• Graph  │ │
│                                          │ │ Feature      │ │    │• Alerts │ │
│                                          │ │ Fusion +     │ │    │• Logs   │ │
│                                          │ │ Temporal     │ │    └─────────┘ │
│                                          │ │ Aggregator   │ │                │
│                                          │ └──────────────┘ │                │
│                                          └──────────────────┘                │
└──────────────────────────────────────────────────────────────────────────────┘
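The hand-off from the capture thread to the inference thread shown in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the project's `pipeline/orchestrator.py`: the function names, the RMS energy gate standing in for the Silero VAD, and the placeholder score are all assumptions made for the example.

```python
import math
import queue
import threading
from collections import deque

_STOP = object()  # sentinel marking the end of the stream (illustrative)


def rms(chunk):
    """Root-mean-square energy of a chunk of float samples."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(x * x for x in chunk) / len(chunk))


def capture_worker(chunks, out_q):
    """Stand-in for the capture thread: pushes chunks into the bounded queue."""
    for chunk in chunks:
        out_q.put(chunk)  # blocks while the queue is full (back-pressure)
    out_q.put(_STOP)


def inference_worker(in_q, scores, energy_threshold=0.01):
    """Stand-in for the inference thread: gate each chunk, then score it."""
    while True:
        chunk = in_q.get()
        if chunk is _STOP:
            break
        energy = rms(chunk)
        if energy < energy_threshold:
            continue  # simple energy gate: skip near-silent chunks
        # Placeholder score; a real detector would run a model here.
        scores.append(min(1.0, energy))


def run_pipeline(chunks, max_queue=4, history=100):
    """Run both workers to completion and return the collected scores."""
    q = queue.Queue(maxsize=max_queue)
    scores = deque(maxlen=history)  # bounded score history for the UI
    producer = threading.Thread(target=capture_worker, args=(chunks, q))
    consumer = threading.Thread(target=inference_worker, args=(q, scores))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return list(scores)
```

With a bounded queue, a slow consumer blocks the producer instead of letting memory grow. A live capture loop might prefer to drop the oldest chunk instead, which this sketch does not attempt.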
- Audio Capture: 3-second overlapping chunks (48,000 samples at 16 kHz) via sounddevice
- Preprocessing: Mono conversion, resampling to 16 kHz, volume normalization, noise filtering
- Speech Detection: Silero VAD filters non-speech audio; energy-based gate as fast pre-filter
- AI Inference: Lightweight Wav2Vec2 / RawNetLite models generate AI probability score (0.0–1.0)
- Feature Fusion: Handcrafted DSP features fused with neural model scores
- Temporal Aggregation: EMA smoothing, rolling averages, hysteresis for false-positive reduction
- Dashboard Display: Real-time probability meter, waveform, score history, warning alerts
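The chunk arithmetic behind the audio capture step can be sketched as follows. The function name and defaults are illustrative assumptions based on the figures quoted in this README (3-second chunks at 16 kHz with a 2-second overlap). Trailing samples that do not fill a whole window are dropped here, whereas the project may pad or buffer them.

```python
def sliding_windows(samples, sample_rate=16_000, chunk_seconds=3.0,
                    overlap_seconds=2.0):
    """Yield fixed-length, overlapping windows from a sequence of samples.

    Works on any sliceable sequence (a list here; a NumPy array would
    behave the same way).
    """
    chunk_len = round(chunk_seconds * sample_rate)  # 48,000 by default
    stride = round((chunk_seconds - overlap_seconds) * sample_rate)
    if stride <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    # The last start index is the final position where a full window fits.
    for start in range(0, len(samples) - chunk_len + 1, stride):
        yield samples[start:start + chunk_len]


if __name__ == "__main__":
    five_seconds = [0.0] * (5 * 16_000)
    windows = list(sliding_windows(five_seconds))
    print(len(windows), "windows of", len(windows[0]), "samples")
```

With the defaults, a new window starts every 16,000 samples (one second), so consecutive windows share two seconds of audio.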
True-Tone/
├── app.py                          # Main entry point (dashboard / terminal / file modes)
├── live_pipeline.py                # Standalone threaded live detection pipeline
├── requirements.txt                # Python dependencies
│
├── audio/                          # Audio capture and processing modules
│   ├── mic_capture.py              # Real-time microphone capture (sounddevice)
│   ├── system_capture.py           # System audio loopback capture (soundcard)
│   ├── wav_loader.py               # WAV file ingestion and replay
│   ├── wav_utils.py                # Audio loading, saving, chunking utilities
│   ├── vad.py                      # VAD: EnergySpeechGate, SileroSpeechGate, HybridSpeechGate
│   └── buffer.py                   # Thread-safe audio buffer with sliding window
│
├── processing/                     # Audio processing and feature extraction
│   ├── preprocessor.py             # Mono conversion, resampling, normalization, pad/trim
│   └── features.py                 # Handcrafted DSP/behavioral features
│
├── inference/                      # AI detection engine
│   └── detector.py                 # HuggingFace Wav2Vec2/RawNetLite ensemble detector
│
├── pipeline/                       # Multi-threaded pipeline orchestration
│   ├── orchestrator.py             # Capture → VAD → Detector orchestrator (threading + queue)
│   └── temporal.py                 # Temporal confidence aggregation with hysteresis
│
├── ui/                             # Frontend dashboard
│   └── dashboard.py                # Streamlit dashboard with live meter, logs, analytics
│
├── tools/                          # Testing, evaluation, and tuning utilities
│   ├── run_tests.py                # Batch accuracy testing with confusion matrix
│   ├── compare_models.py           # Side-by-side model comparison
│   ├── evaluate_streaming.py       # Session-level streaming evaluation
│   ├── tune_threshold.py           # Threshold optimization on labeled datasets
│   ├── tune_streaming_threshold.py # Threshold tuning with temporal aggregation
│   ├── download_online_samples.py  # Download public real/fake test samples
│   └── generate_test_audio.py      # Generate synthetic test audio
│
├── tests/                          # Unit test suite
│   └── test_core.py                # Tests for preprocessing, VAD, chunking, features
│
├── ARCHITECTURAL_BLUEPRINT.md      # Detailed architecture and 6-day execution plan
├── ROBUSTNESS_UPGRADE_PLAN.md      # Roadmap for ensemble expansion and robustness
├── CONTRIBUTING.md                 # Development and contribution guidelines
└── LICENSE                         # MIT License
- Python 3.9 or later
- A working microphone (for live capture) or audio files for file-based analysis
- ~500 MB disk space for model download (cached after first run)
# Clone the repository
git clone https://github.com/WHENKEY2007/True-Tone.git
cd True-Tone
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
streamlit run ui/dashboard.py
Select an audio source (Microphone / System Audio / WAV File) and press
# Live microphone detection
python app.py --mode terminal --source mic
# System audio loopback
python app.py --mode terminal --source system --device 0
# Analyze a single audio file
python app.py --mode file --source-file path/to/audio.wav
| Requirement | Status | Implementation |
|---|---|---|
| Real-time microphone capture | ✅ | audio/mic_capture.py: sounddevice with overlapping windows |
| WAV audio upload | ✅ | Streamlit file uploader in dashboard sidebar |
| Sliding-window chunk generation | ✅ | 3-second chunks with configurable overlap (default: 2 s overlap, giving a 1 s stride) |
| Speech detection (Silero VAD) | ✅ | audio/vad.py: SileroSpeechGate, HybridSpeechGate, EnergySpeechGate |
| Noise filtering & silence removal | ✅ | Energy gate pre-filter + Silero VAD speech/silence classification |
| Audio normalization & 16 kHz resampling | ✅ | processing/preprocessor.py: mono, resample, peak-normalize |
| AI probability score (0.0–1.0) | ✅ | inference/detector.py: per-chunk and aggregated scores |
| CPU-only execution | ✅ | Default CPU inference, auto-detects CUDA if available |
| Real-time probability meter | ✅ | Large score display with color coding in dashboard |
| Waveform visualization | ✅ | Live matplotlib waveform plot in dashboard |
| Historical probability graph | ✅ | Score history chart with threshold line |
| Warning alerts | ✅ | Red/yellow/green alerts based on configurable threshold |
| Detection event logging | ✅ | Event log table + session analytics in dashboard |
| False-positive reduction | ✅ | pipeline/temporal.py: EMA, rolling average, hysteresis |
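One way to combine EMA smoothing with a hysteresis state machine for false-positive reduction is sketched below. The class name, the smoothing factor, and the two thresholds are hypothetical values chosen for illustration; they are not taken from `pipeline/temporal.py`.

```python
class TemporalAggregator:
    """Smooth per-chunk scores with an EMA and apply two-threshold hysteresis.

    The alert turns on only when the smoothed score reaches
    `enter_threshold`, and turns off only when it falls to
    `exit_threshold`, so a single noisy chunk does not toggle the state.
    """

    def __init__(self, alpha=0.3, enter_threshold=0.7, exit_threshold=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if exit_threshold > enter_threshold:
            raise ValueError("exit_threshold must not exceed enter_threshold")
        self.alpha = alpha
        self.enter_threshold = enter_threshold
        self.exit_threshold = exit_threshold
        self.ema = None        # no history until the first score arrives
        self.alerting = False

    def update(self, score):
        """Fold one raw score into the EMA and return (ema, alerting)."""
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.alpha * score + (1.0 - self.alpha) * self.ema
        if not self.alerting and self.ema >= self.enter_threshold:
            self.alerting = True
        elif self.alerting and self.ema <= self.exit_threshold:
            self.alerting = False
        return self.ema, self.alerting


if __name__ == "__main__":
    agg = TemporalAggregator(alpha=0.5)
    for raw in [0.2, 1.0, 1.0, 0.4, 0.2]:
        ema, alert = agg.update(raw)
        print(f"raw={raw:.2f} ema={ema:.2f} alert={alert}")
```

The gap between the two thresholds sets how sticky the alert is: a wider gap suppresses flicker at the cost of a slower return to the normal state.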
| Requirement | Status | Implementation |
|---|---|---|
| Python + Streamlit | ✅ | Python 3.9+, Streamlit dashboard |
| CPU hardware, no GPU | ✅ | All inference on CPU by default |
| Audio as NumPy arrays | ✅ | float32 mono arrays throughout |
| 3-second chunks (48,000 samples at 16 kHz) | ✅ | Configurable via --chunk-seconds |
| Silero VAD integration | ✅ | SileroSpeechGate in audio/vad.py |
| Wav2Vec2 / RawNetLite models | ✅ | HuggingFace audio-classification pipeline |
| 1–2 second inference latency | ✅ | Measured latency per chunk displayed in dashboard |
| Multi-threaded (threading + queue) | ✅ | pipeline/orchestrator.py: capture thread + inference thread |
| Separate threads for capture, VAD, inference, UI | ✅ | Orchestrator manages thread lifecycle |
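The per-chunk latency listed in the table can be measured with a wall-clock timer around the detector call. The helper names and the dummy detector below are assumptions for illustration. Comparing latency against the stride is one possible health check, not something the project is known to implement.

```python
import time


def timed_inference(detect, chunk):
    """Call `detect(chunk)` and return (score, latency_in_seconds)."""
    start = time.perf_counter()
    score = detect(chunk)
    latency = time.perf_counter() - start
    return score, latency


def keeps_up(latency_seconds, stride_seconds=1.0):
    """True if one inference finishes before the next chunk is due.

    If this is False for long, a bounded queue between capture and
    inference fills up and chunks have to wait or be dropped.
    """
    return latency_seconds <= stride_seconds


if __name__ == "__main__":
    def dummy_detector(chunk):
        # Hypothetical stand-in for a model call; always returns 0.5.
        return 0.5

    score, latency = timed_inference(dummy_detector, [0.0] * 48_000)
    print(f"score={score:.2f} latency={latency * 1000:.3f} ms "
          f"keeps_up={keeps_up(latency)}")
```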
| Technology | Purpose |
|---|---|
| Python 3.9+ | Core application language |
| Streamlit | Interactive dashboard frontend |
| PyTorch | Neural network inference runtime |
| Torchaudio | Audio processing and transforms |
| NumPy | Array operations and audio data handling |
| SciPy | Signal processing, resampling, spectral analysis |
| SoundDevice | Real-time microphone audio capture |
| SoundCard | System audio loopback capture |
| PyAudio | Optional fallback audio capture support |
| Silero VAD | Neural voice activity detection |
| Hugging Face Transformers | Pre-trained model loading and inference pipeline |
| Wav2Vec2 | Primary speech representation model for detection |
| RawNetLite | Lightweight waveform-based countermeasure model |
| Threading & Queue | Multi-threaded pipeline orchestration |
| Matplotlib | Waveform and score visualization |
| Librosa | Pitch estimation (YIN) and audio feature extraction |
- ✅ Real-time microphone audio capture
- ✅ WAV audio ingestion
- ✅ Speech activity detection and filtering (Silero VAD)
- ✅ AI-based synthetic speech detection
- ✅ Live probability scoring (0.0–1.0)
- ✅ Streamlit visualization dashboard
- ✅ Real-time warning alerts
- ✅ CPU-only execution support
- ✅ Modular audio processing pipeline
- ✅ Demo-ready live detection workflow
- ✅ False-positive reduction through temporal aggregation
- Training or fine-tuning AI models
- GPU acceleration and TensorRT optimization
- Kubernetes or cloud-native deployment
- Zoom SDK or Recall.ai integration
- Enterprise-scale distributed infrastructure
- Speaker diarization
- Custom dataset creation
- Mobile application support
- WebSocket-based streaming infrastructure
- Multi-cloud deployment environments
See ROBUSTNESS_UPGRADE_PLAN.md for the detailed roadmap.
- Mobile platform integration
- Enhanced waveform visualization
- Advanced probability meter UI
- Email-based alert notifications
- Multi-language voice detection
- Real-time browser audio monitoring
- Advanced AI ensemble detection models (AASIST, RawNet2, WavLM, Whisper, HuBERT)
- Voice spoof attack classification
- Dark mode dashboard support
TRUE TONE is a lightweight and practical AI voice deepfake detection platform designed for real-time synthetic speech analysis. The project emphasizes modular architecture, CPU-efficient execution, and rapid deployment while delivering meaningful live AI detection capabilities. Its primary strengths lie in real-time responsiveness, simplified deployment, and effective integration of modern audio classification pipelines. Future growth opportunities include improving detection accuracy, expanding supported audio sources, and integrating advanced AI ensemble techniques for stronger real-world resilience.
This project is licensed under the MIT License; see the LICENSE file for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.