Speech-to-phoneme recognition system built around the XLS-R-300M wav2vec2 model. The project has evolved through three phases: a C++ real-time inference engine, a Qt6/QML GUI, and a fine-tuning pipeline for adapting the model to child and adult speech corpora.
The initial goal was to run wav2vec2 phoneme recognition in real time from a microphone using C++ and ONNX Runtime. The base model (facebook/wav2vec2-lv-60-espeak-cv-ft) was exported to ONNX and paired with a miniaudio-based audio capture loop. A chunk-size experiment tool was also built to determine the minimum viable audio segment for inference (400 samples, due to the wav2vec2 convolutional receptive field).
A desktop GUI was added using Qt6/QML with a PySide6 Python bridge. The app spawns the C++ auto_resample binary as a subprocess and parses its tagged output (TR: lines) to display live transcriptions.
The current focus is adapting the base model to domain-specific speech using CTC fine-tuning:
-
UXTD fine-tuning — Fine-tuned on the UXTD child speech corpus to improve phoneme recognition for children's speech. Text prompts are converted to IPA phonemes via
phonemizer(espeak backend, no stress marks). The CNN feature encoder is frozen; only the transformer and CTC head are trained. -
TaL80 fine-tuning — Continued fine-tuning from the UXTD checkpoint on the TaL80 adult speech corpus (48 kHz source audio, resampled to 16 kHz). Includes fixes for CTC NaN loss (filtering utterances >10s), bf16 training, lower learning rate (3e-5), and gradient clipping.
-
3-model benchmark — A benchmark script compares Phoneme Error Rate (PER) and inference time across the original, UXTD-finetuned, and TaL80-finetuned models using ONNX inference on the UXTD test set.
- C++17 compiler
- CMake
- Python 3 (for scripts and GUI)
- Qt6 + PySide6 (for GUI only)
- GPU + CUDA (for fine-tuning)
espeak-ng(for phonemizer)
pip install -r scripts/requirements.txtCore packages: torch, transformers, onnxruntime, optimum, soundfile, numpy.
Fine-tuning packages: datasets, evaluate, phonemizer, librosa, pandas, wandb.
mkdir -p build && cd build
cmake ..
make| Target | Description |
|---|---|
auto_resample |
Real-time microphone capture + inference |
chunk_experiment |
Batch WAV processing at various chunk sizes |
InferenceRunner |
Qt6 GUI app (only built if Qt6 is found) |
# Real-time microphone inference
./build/auto_resample
# Batch chunk-size experiment (results -> output/experiment_results_raw.csv)
./build/chunk_experiment
# GUI (requires Qt6 + PySide6)
cd app && python main.pypython scripts/finetune_wav2vec2_uxtd.py- Base model:
facebook/wav2vec2-lv-60-espeak-cv-ft(IPA phoneme CTC, 393 vocab tokens) - Dataset: UXTD child speech corpus. Utterances from
tests/utterances_by_length.csv; speaker splits from UltraSuite docs. - Audio: 22050 Hz source, resampled to 16 kHz. Max duration 10s.
- Training: batch 2 x 8 grad accum, lr 3e-4, 500 warmup steps, 30 epochs, fp16, gradient checkpointing. Feature encoder frozen.
- Metric: Phoneme Error Rate (PER) via WER on phoneme sequences. Best model by lowest dev PER.
- Output:
wav2vec2-uxtd-finetuned/
python scripts/finetune_wav2vec2_tal80.py- Base model: UXTD fine-tuned checkpoint (
wav2vec2-uxtd-finetuned/checkpoint-784) - Dataset: TaL80 adult speech corpus. 48 kHz source audio, resampled to 16 kHz.
- Training: bf16, lr 3e-5, gradient clipping. Utterances >10s filtered to prevent NaN loss.
- Output:
scripts/wav2vec2-tal80-finetuned/ - Logging: Weights & Biases (dataset stats, sample predictions, GPU memory, grad norms)
python scripts/benchmark_original_vs_finetuned.pyCompares original, UXTD-finetuned, and TaL80-finetuned models (ONNX) on the UXTD test set. Outputs per-utterance and summary CSVs to output/.
src/realtime_autoresample.cpp— Real-time engine. Captures audio via miniaudio, buffers with mutex, runs ONNX inference every ~1s.src/chunk_experiment.cpp— Batch WAV processing at chunk sizes 50–5000ms.app/— Qt6/QML GUI with PySide6 Python bridge.scripts/finetune_wav2vec2_uxtd.py— UXTD child speech fine-tuning pipeline.scripts/finetune_wav2vec2_tal80.py— TaL80 adult speech fine-tuning (from UXTD checkpoint).scripts/benchmark_original_vs_finetuned.py— 3-model PER and latency comparison.scripts/export_to_onnx.py— Model export to ONNX format.
| Path | Description |
|---|---|
onnx_output/model.onnx |
Base ONNX model (~1.2 GB) |
onnx_models/ |
All ONNX models (original, UXTD, TaL80) for benchmarking |
vocab/vocab.json |
Vocabulary (42 IPA phoneme tokens) |
libs/ |
Bundled libraries (miniaudio, nlohmann/json, ONNX Runtime v1.19.2) |
tests/ |
TIMIT test WAV files and UXTD utterance CSV |
- Audio normalization is done per-chunk: zero-mean, unit-variance with epsilon 1e-5.
- CTC greedy decoding: argmax over logits, collapse consecutive duplicates, skip special tokens (
[PAD],<pad>,<s>,</s>,<unk>), replace|with space. - The base model has a known off-by-one where the space token (id=392) exceeds
vocab_size(392); fine-tuning scripts detect and fix this by resizinglm_head. - Linux is the primary platform; macOS and Windows are supported.