🎙️ True Tone: AI-Powered Voice Authenticity Detection

Python 3.9+ | Streamlit | PyTorch | License: MIT

True Tone is an AI-powered voice authenticity detection system that classifies an audio clip as real or AI-generated. It analyzes speech patterns, tone variations, and other audio characteristics with machine learning models and reports an AI probability score between 0.0 and 1.0.

AI voice generation is improving quickly, which makes fake or manipulated audio harder to catch by ear. True Tone addresses this by scoring speech in 3-second chunks with a target inference latency of 1–2 seconds on CPU, helping reduce misinformation and build trust in digital audio communication.


✨ Key Features

| Feature | Description |
| --- | --- |
| 🎤 Real-Time Microphone Capture | Continuous 16 kHz mono audio capture via sounddevice with sliding-window chunk generation |
| 📁 WAV Audio Upload | File upload support through the Streamlit dashboard with full pipeline replay |
| 🔊 System Audio Loopback | Capture speaker output through soundcard for monitoring playback or call audio |
| 🧠 AI Detection Engine | Lightweight Wav2Vec2 / RawNetLite-compatible models via Hugging Face, optimized for CPU inference |
| 🎯 Silero VAD Integration | Speech detection using Silero VAD for filtering non-speech audio, plus energy-based and hybrid gates |
| 📊 Streamlit Dashboard | Real-time probability meter, waveform visualization, historical probability graph, and warning alerts |
| ⚠️ Warning Alerts | Automatic red warning banner when AI probability exceeds configurable threshold |
| 🔀 Multi-Threaded Architecture | Separate threads for audio capture, VAD, AI inference, and UI updates via Python threading and queue |
| 📈 Temporal Aggregation | False-positive reduction through EMA smoothing, trend analysis, and hysteresis state machine |
| 🔬 Audio Feature Analysis | Spectral entropy, pitch drift, jitter, shimmer, HNR, cadence consistency, breathiness scoring |
| 📋 Detection Event Logging | Event log table and session analytics visualization in the dashboard |
| ⚡ CPU-Only Execution | Runs on standard consumer hardware without GPU (auto-detects CUDA if available) |
| 🧪 Testing Suite | Unit tests, batch testing tools, threshold tuning, and model comparison utilities |
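
The energy-based gate listed above can be illustrated in a few lines. This is a minimal sketch, not the code in audio/vad.py: the function names `rms_dbfs` and `is_probably_speech` and the -40 dBFS threshold are assumptions chosen for the example.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """Return the RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def is_probably_speech(samples: Sequence[float], threshold_db: float = -40.0) -> bool:
    """Cheap pre-filter: pass a chunk on only if it is louder than the threshold."""
    return rms_dbfs(samples) > threshold_db


# One second of silence and one second of a 220 Hz tone at 16 kHz.
silence = [0.0] * 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
print(is_probably_speech(silence))  # False
print(is_probably_speech(tone))     # True
```

A gate like this only measures loudness, so it cannot tell speech from other loud sounds. That is presumably why the project pairs it with Silero VAD in a hybrid gate.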

🏗️ System Architecture

True Tone uses a modular real-time architecture that captures microphone, system, or uploaded audio and processes it in short overlapping chunks. Each chunk is preprocessed (mono conversion, resampling to 16 kHz, normalization, silence filtering) and then passed to a Wav2Vec2 or RawNetLite model for synthetic voice detection. Results appear on a Streamlit dashboard with live alerts and waveform plots. Python threads and bounded queues decouple capture from inference so the UI stays responsive.

┌──────────────────────────────────────────────────────────────────────────────┐
│                       TRUE TONE: Pipeline Architecture                       │
│                                                                              │
│ ┌───────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌─────────┐ │
│ │ Audio Source  │───▶│ Thread Queue │───▶│ Inference Thread │───▶│ Scores  │ │
│ │ (Capture      │    │  (bounded)   │    │                  │    │ (deque) │ │
│ │  Thread)      │    └──────────────┘    │ ┌──────────────┐ │    └────┬────┘ │
│ └───────┬───────┘                        │ │ Silero VAD   │ │         │      │
│         │                                │ │ Speech Gate  │ │    ┌────┴────┐ │
│ ┌───────┴───────┐                        │ └──────┬───────┘ │    │Streamlit│ │
│ │ • Microphone  │                        │ ┌──────┴───────┐ │    │Dashboard│ │
│ │ • System Audio│                        │ │ Wav2Vec2 /   │ │    │  (UI)   │ │
│ │ • WAV Upload  │                        │ │ RawNetLite   │ │    │         │ │
│ │ • File Replay │                        │ │ Detector     │ │    │• Meter  │ │
│ └───────────────┘                        │ └──────┬───────┘ │    │• Wave   │ │
│                                          │ ┌──────┴───────┐ │    │• Graph  │ │
│                                          │ │ Feature      │ │    │• Alerts │ │
│                                          │ │ Fusion +     │ │    │• Logs   │ │
│                                          │ │ Temporal     │ │    └─────────┘ │
│                                          │ │ Aggregator   │ │                │
│                                          │ └──────────────┘ │                │
│                                          └──────────────────┘                │
└──────────────────────────────────────────────────────────────────────────────┘

Processing Pipeline

  1. Audio Capture: 3-second overlapping chunks (48,000 samples at 16 kHz) via sounddevice
  2. Preprocessing: Mono conversion, resampling to 16 kHz, volume normalization, noise filtering
  3. Speech Detection: Silero VAD filters non-speech audio; energy-based gate as fast pre-filter
  4. AI Inference: Lightweight Wav2Vec2 / RawNetLite models generate AI probability score (0.0–1.0)
  5. Feature Fusion: Handcrafted DSP features fused with neural model scores
  6. Temporal Aggregation: EMA smoothing, rolling averages, hysteresis for false-positive reduction
  7. Dashboard Display: Real-time probability meter, waveform, score history, warning alerts
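
The chunk arithmetic in step 1 works out as 3 s × 16,000 samples/s = 48,000 samples per chunk. The sketch below computes the window boundaries for overlapping chunks. The function name `chunk_bounds` is hypothetical, and the 2-second default overlap (a 1-second stride) is taken from the requirements table later in this README. Dropping a trailing partial window is a simplification made for the example.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000


def chunk_bounds(n_samples: int, chunk_seconds: float = 3.0,
                 overlap_seconds: float = 2.0,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of full-length overlapping chunks."""
    chunk = int(chunk_seconds * sample_rate)
    stride = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or stride <= 0:
        raise ValueError("chunk must be positive and longer than the overlap")
    return [(start, start + chunk)
            for start in range(0, n_samples - chunk + 1, stride)]


# Five seconds of audio gives three 3-second windows, one second apart.
bounds = chunk_bounds(5 * SAMPLE_RATE)
print(bounds)  # [(0, 48000), (16000, 64000), (32000, 80000)]
```

The returned pairs can be used to slice a NumPy array or a plain list, for example `audio[start:end]`.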

📂 Project Structure

True-Tone/
├── app.py                          # Main entry point (dashboard / terminal / file modes)
├── live_pipeline.py                # Standalone threaded live detection pipeline
├── requirements.txt                # Python dependencies
│
├── audio/                          # Audio capture and processing modules
│   ├── mic_capture.py              # Real-time microphone capture (sounddevice)
│   ├── system_capture.py           # System audio loopback capture (soundcard)
│   ├── wav_loader.py               # WAV file ingestion and replay
│   ├── wav_utils.py                # Audio loading, saving, chunking utilities
│   ├── vad.py                      # VAD: EnergySpeechGate, SileroSpeechGate, HybridSpeechGate
│   └── buffer.py                   # Thread-safe audio buffer with sliding window
│
├── processing/                     # Audio processing and feature extraction
│   ├── preprocessor.py             # Mono conversion, resampling, normalization, pad/trim
│   └── features.py                 # Handcrafted DSP/behavioral features
│
├── inference/                      # AI detection engine
│   └── detector.py                 # Hugging Face Wav2Vec2/RawNetLite ensemble detector
│
├── pipeline/                       # Multi-threaded pipeline orchestration
│   ├── orchestrator.py             # Capture → VAD → Detector orchestrator (threading + queue)
│   └── temporal.py                 # Temporal confidence aggregation with hysteresis
│
├── ui/                             # Frontend dashboard
│   └── dashboard.py                # Streamlit dashboard with live meter, logs, analytics
│
├── tools/                          # Testing, evaluation, and tuning utilities
│   ├── run_tests.py                # Batch accuracy testing with confusion matrix
│   ├── compare_models.py           # Side-by-side model comparison
│   ├── evaluate_streaming.py       # Session-level streaming evaluation
│   ├── tune_threshold.py           # Threshold optimization on labeled datasets
│   ├── tune_streaming_threshold.py # Threshold tuning with temporal aggregation
│   ├── download_online_samples.py  # Download public real/fake test samples
│   └── generate_test_audio.py      # Generate synthetic test audio
│
├── tests/                          # Unit test suite
│   └── test_core.py                # Tests for preprocessing, VAD, chunking, features
│
├── ARCHITECTURAL_BLUEPRINT.md      # Detailed architecture and 6-day execution plan
├── ROBUSTNESS_UPGRADE_PLAN.md      # Roadmap for ensemble expansion and robustness
├── CONTRIBUTING.md                 # Development and contribution guidelines
└── LICENSE                         # MIT License

🚀 Getting Started

Prerequisites

  • Python 3.9 or later
  • A working microphone (for live capture) or audio files for file-based analysis
  • ~500 MB disk space for model download (cached after first run)

Installation

# Clone the repository
git clone https://github.com/WHENKEY2007/True-Tone.git
cd True-Tone

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate        # Linux/macOS
# venv\Scripts\activate         # Windows

# Install dependencies
pip install -r requirements.txt

Launch the Streamlit Dashboard

streamlit run ui/dashboard.py

Select an audio source (Microphone / System Audio / WAV File) and press ▶️ Start to begin real-time detection.

Terminal Mode

# Live microphone detection
python app.py --mode terminal --source mic

# System audio loopback
python app.py --mode terminal --source system --device 0

# Analyze a single audio file
python app.py --mode file --source-file path/to/audio.wav

📋 Requirements Fulfillment

Functional Requirements

| Requirement | Status | Implementation |
| --- | --- | --- |
| Real-time microphone capture | ✅ | audio/mic_capture.py: sounddevice with overlapping windows |
| WAV audio upload | ✅ | Streamlit file uploader in dashboard sidebar |
| Sliding-window chunk generation | ✅ | 3-second chunks with configurable overlap (default 2 s = 1 s stride) |
| Speech detection (Silero VAD) | ✅ | audio/vad.py: SileroSpeechGate, HybridSpeechGate, EnergySpeechGate |
| Noise filtering & silence removal | ✅ | Energy gate pre-filter + Silero VAD speech/silence classification |
| Audio normalization & 16 kHz resampling | ✅ | processing/preprocessor.py: mono, resample, peak-normalize |
| AI probability score (0.0–1.0) | ✅ | inference/detector.py: per-chunk and aggregated scores |
| CPU-only execution | ✅ | Default CPU inference, auto-detects CUDA if available |
| Real-time probability meter | ✅ | Large score display with color coding in dashboard |
| Waveform visualization | ✅ | Live matplotlib waveform plot in dashboard |
| Historical probability graph | ✅ | Score history chart with threshold line |
| Warning alerts | ✅ | Red/yellow/green alerts based on configurable threshold |
| Detection event logging | ✅ | Event log table + session analytics in dashboard |
| False-positive reduction | ✅ | pipeline/temporal.py: EMA, rolling average, hysteresis |
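
The false-positive reduction row combines two standard techniques: an exponential moving average (EMA) that damps single-chunk spikes, and hysteresis, where the alert switches on at a higher threshold than the one at which it switches off. The sketch below shows the idea only. The class name `TemporalAggregator`, the smoothing factor 0.3, and the thresholds 0.7 and 0.5 are assumptions for illustration and are not taken from pipeline/temporal.py.

```python
class TemporalAggregator:
    """EMA smoothing plus a two-threshold (hysteresis) alert state."""

    def __init__(self, alpha: float = 0.3, on_threshold: float = 0.7,
                 off_threshold: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if off_threshold > on_threshold:
            raise ValueError("off_threshold must not exceed on_threshold")
        self.alpha = alpha
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.ema = None      # no score seen yet
        self.alert = False

    def update(self, score: float) -> bool:
        """Feed one per-chunk score in [0, 1]; return the current alert state."""
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.alpha * score + (1.0 - self.alpha) * self.ema
        if not self.alert and self.ema >= self.on_threshold:
            self.alert = True    # sustained high scores switch the alert on
        elif self.alert and self.ema < self.off_threshold:
            self.alert = False   # alert clears only after a clear drop
        return self.alert


agg = TemporalAggregator()
raw_scores = [0.2, 0.95, 0.2, 0.9, 0.9, 0.9, 0.9, 0.6, 0.1, 0.1, 0.1]
states = [agg.update(s) for s in raw_scores]
print(states)
```

In this run the isolated 0.95 spike does not raise the alert. The alert turns on only after several consecutive high scores and stays on through the 0.6 reading, which is the behavior hysteresis is meant to provide.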

Technical Requirements

| Requirement | Status | Implementation |
| --- | --- | --- |
| Python + Streamlit | ✅ | Python 3.9+, Streamlit dashboard |
| CPU hardware, no GPU | ✅ | All inference on CPU by default |
| Audio as NumPy arrays | ✅ | float32 mono arrays throughout |
| 3-second chunks (48,000 samples at 16 kHz) | ✅ | Configurable via --chunk-seconds |
| Silero VAD integration | ✅ | SileroSpeechGate in audio/vad.py |
| Wav2Vec2 / RawNetLite models | ✅ | Hugging Face audio-classification pipeline |
| 1–2 second inference latency | ✅ | Measured latency per chunk displayed in dashboard |
| Multi-threaded (threading + queue) | ✅ | pipeline/orchestrator.py: capture thread + inference thread |
| Separate threads for capture, VAD, inference, UI | ✅ | Orchestrator manages thread lifecycle |

🛡️ Technologies Used

| Technology | Purpose |
| --- | --- |
| Python 3.9+ | Core application language |
| Streamlit | Interactive dashboard frontend |
| PyTorch | Neural network inference runtime |
| Torchaudio | Audio processing and transforms |
| NumPy | Array operations and audio data handling |
| SciPy | Signal processing, resampling, spectral analysis |
| SoundDevice | Real-time microphone audio capture |
| SoundCard | System audio loopback capture |
| PyAudio | Optional fallback audio capture support |
| Silero VAD | Neural voice activity detection |
| Hugging Face Transformers | Pre-trained model loading and inference pipeline |
| Wav2Vec2 | Primary speech representation model for detection |
| RawNetLite | Lightweight waveform-based countermeasure model |
| Threading & Queue | Multi-threaded pipeline orchestration |
| Matplotlib | Waveform and score visualization |
| Librosa | Pitch estimation (YIN) and audio feature extraction |

🔍 In Scope

  • ✅ Real-time microphone audio capture
  • ✅ WAV audio ingestion
  • ✅ Speech activity detection and filtering (Silero VAD)
  • ✅ AI-based synthetic speech detection
  • ✅ Live probability scoring (0.0–1.0)
  • ✅ Streamlit visualization dashboard
  • ✅ Real-time warning alerts
  • ✅ CPU-only execution support
  • ✅ Modular audio processing pipeline
  • ✅ Demo-ready live detection workflow
  • ✅ False-positive reduction through temporal aggregation

🚫 Out of Scope

  • Training or fine-tuning AI models
  • GPU acceleration and TensorRT optimization
  • Kubernetes or cloud-native deployment
  • Zoom SDK or Recall.ai integration
  • Enterprise-scale distributed infrastructure
  • Speaker diarization
  • Custom dataset creation
  • Mobile application support
  • WebSocket-based streaming infrastructure
  • Multi-cloud deployment environments

🗺️ Future Enhancements

See ROBUSTNESS_UPGRADE_PLAN.md for the detailed roadmap.

  • 📱 Mobile platform integration
  • 📊 Enhanced waveform visualization
  • 🎯 Advanced probability meter UI
  • 📧 Email-based alert notifications
  • 🌍 Multi-language voice detection
  • 🌐 Real-time browser audio monitoring
  • 🔀 Advanced AI ensemble detection models (AASIST, RawNet2, WavLM, Whisper, HuBERT)
  • 🛡️ Voice spoof attack classification
  • 🌙 Dark mode dashboard support

📝 Conclusion

True Tone is a lightweight, practical platform for real-time detection of AI-generated speech. The project emphasizes modular architecture, CPU-efficient execution, and simple deployment. Its main strengths are real-time responsiveness and straightforward integration of existing audio classification pipelines. Future work includes improving detection accuracy, supporting more audio sources, and adding larger model ensembles for robustness on real-world audio.


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
