Fast Rust mel spectrogram and VAD primitives for ASR systems.

Release note: 0.4.0 moves CPU mel and Kaldi fbank computation onto sparse
filterbank projection derived from the same dense reference matrices. On the
Parakeet/NeMo JFK benchmark, the pure Rust frontend is now close to C/libtorch
CPU trace performance while preserving fixture parity.

mel-spec is built around the parts of speech pipelines that need to be cheap,
predictable, and easy to embed: STFT, Whisper-compatible log-mel features,
Kaldi-style filterbanks, TGA spectrogram interchange, and a lightweight VAD that
reuses the same mel/STFT features.
| Feature | What it is for |
|---|---|
| Whisper-compatible mel | Log-mel spectrograms aligned with whisper.cpp, PyTorch, and librosa. |
| Kaldi-compatible fbank | 80-bin Kaldi-style filterbank features for speaker and audio models. |
| Streaming STFT | Overlap-and-save STFT for live audio pipelines. |
| Model-free VAD | Fast speech/non-speech decisions and timestamps from mel spectrogram structure. |
| TGA mel images | Store and pass quantized mel spectrograms as simple 8-bit TGA files. |
| Local Whisper WASM | Hush uses mel-spec mel tensors/TGA segments for fully local browser Whisper transcription. |
| Native GPU backends | Experimental CUDA and wgpu paths for batched native mel generation. |
| Browser worker demo | WASM worker and SharedArrayBuffer example for live browser audio. |
use mel_spec::prelude::*;
let samples = vec![0.0_f32; 16_000];
let mel_frames = Spectrogram::compute_mel_spectrogram_cpu(
&samples,
400,
160,
80,
16_000.0,
);
println!("frames={}", mel_frames.len());

The focused API examples are kept in the example READMEs so the top-level README stays readable:
| Example | Description |
|---|---|
| browser | Stream microphone or WAV audio to a WASM mel worker. |
| mel_tga | Convert raw audio to TGA mel spectrogram images. |
| tga_whisper | Transcribe precomputed TGA mel spectrograms with whisper.cpp. |
| stream_whisper | Stream ffmpeg audio through mel, VAD, and Whisper. |
| vad_ten_eval | Evaluate mel-spec VAD against the vendored TEN-VAD testset. |
mel-spec includes a lightweight, model-free VAD. It does not load a neural VAD
runtime; it looks for speech-like Sobel edge structure in mel spectrogram frames
and can attach STFT-derived timestamps to each decision.
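
To illustrate the idea of scoring Sobel edge structure in mel frames, here is a minimal sketch. It is not the mel-spec implementation: the function name `sobel_edge_energy`, the `frames[time][mel_bin]` layout, and the threshold value are all assumptions made for this example.

```rust
/// Mean Sobel gradient magnitude over a mel patch stored as frames[time][mel_bin].
/// Hypothetical helper for illustration; not part of the mel-spec API.
fn sobel_edge_energy(frames: &[Vec<f32>]) -> f32 {
    let t = frames.len();
    let m = frames.first().map_or(0, |row| row.len());
    // The 3x3 kernel needs at least a 3x3 patch with equal-length rows.
    if t < 3 || m < 3 || frames.iter().any(|row| row.len() != m) {
        return 0.0;
    }
    let mut total = 0.0_f32;
    for i in 1..t - 1 {
        for j in 1..m - 1 {
            // Gradient along the time axis.
            let gt = (frames[i + 1][j - 1] + 2.0 * frames[i + 1][j] + frames[i + 1][j + 1])
                - (frames[i - 1][j - 1] + 2.0 * frames[i - 1][j] + frames[i - 1][j + 1]);
            // Gradient along the mel (frequency) axis.
            let gf = (frames[i - 1][j + 1] + 2.0 * frames[i][j + 1] + frames[i + 1][j + 1])
                - (frames[i - 1][j - 1] + 2.0 * frames[i][j - 1] + frames[i + 1][j - 1]);
            total += (gt * gt + gf * gf).sqrt();
        }
    }
    total / ((t - 2) * (m - 2)) as f32
}

fn main() {
    // A flat patch has no edges; a single bright mel band has strong ones.
    let flat = vec![vec![0.5_f32; 5]; 5];
    let mut ridge = vec![vec![0.0_f32; 5]; 5];
    for row in ridge.iter_mut() {
        row[2] = 1.0;
    }
    // Assumed threshold, chosen only for this example.
    let threshold = 1.0;
    println!("flat speech-like: {}", sobel_edge_energy(&flat) > threshold);
    println!("ridge speech-like: {}", sobel_edge_energy(&ridge) > threshold);
}
```

A real detector would likely combine such a score with smoothing over time and tuned thresholds; this sketch only shows why harmonic ridges in a mel image produce a stronger edge response than flat background.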
Current balanced default on the checked-in TEN-VAD testset:
| System | Setting | Macro precision | Macro recall | Macro F1 | Macro FPR | RTFx |
|---|---|---|---|---|---|---|
| mel-spec | balanced default | 0.8751 | 0.8785 | 0.8566 | 0.3946 | 819.6 |
| mel-spec | high-F1 sweep result | 0.8165 | 0.9635 | 0.8769 | 0.6459 | 828.9 |
| Silero | tuned threshold 0.13 | 0.8897 | 0.9388 | 0.9088 | 0.3602 | 110.3 |
| Silero | default threshold 0.50 | 0.9379 | 0.8630 | 0.8826 | 0.1778 | 110.6 |
The balanced default is not trying to beat learned VADs at strict endpointing. It is a fast built-in option that reuses ASR mel features and avoids another model dependency. Tuned Silero is still more accurate overall; TEN-VAD is the source of the labels and upstream reports stronger precision/recall than Silero and WebRTC on the same testset.
Detailed method, provenance, commands, speed notes, and per-file results are in doc/vad/README.md.
TGA spectrograms are useful when you want a simple interchange format for mel features. They can be inspected as images, spliced, stored, and passed to the Whisper examples without keeping the original audio around.
This path is now live in Hush as local browser ASR. The browser uses
mel-spec's Whisper-compatible log-mel output, stores captured speech segments
as compact 8-bit TGA images, decodes them back to an 80-mel Float32Array, and
passes that tensor directly to a custom whisper.cpp WASM binding via
whisper_set_mel. The active Hush deployment verifies that local WASM Whisper
can transcribe from the mel tensor without posting microphone audio to a server.
This uses the direct-mel entry point we proposed in a pull request against
whisper.cpp, not the stock browser example that feeds PCM audio into whisper.wasm.
Mel spectrograms are also robust under heavy quantization. Whisper does not need high-precision PCM once the signal has been projected into mel space: 8-bit TGA images preserve the information the model sees, and even coarse rounding of mel values can retain useful transcription quality.
Original: [0.158, 0.266, 0.076, 0.196, 0.167, ...]
Rounded: [0.2, 0.3, 0.1, 0.2, 0.2, ...]
(top: original mel values, bottom: values rounded to 1.0e-1 before image
quantization)
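
The two quantization steps described above can be sketched as follows. This is an illustration only: the helper names and the `[min, max]` value range used to map mel values onto 8-bit pixels are assumptions, and the actual TGA encoding used by mel-spec may scale values differently.

```rust
/// Map mel values in [min, max] onto 8-bit pixel values (assumed linear scaling).
fn quantize_u8(values: &[f32], min: f32, max: f32) -> Vec<u8> {
    debug_assert!(max > min);
    values
        .iter()
        .map(|&v| ((v.clamp(min, max) - min) / (max - min) * 255.0).round() as u8)
        .collect()
}

/// Inverse of `quantize_u8`, up to half a quantization step of error.
fn dequantize_u8(pixels: &[u8], min: f32, max: f32) -> Vec<f32> {
    pixels
        .iter()
        .map(|&p| min + (p as f32 / 255.0) * (max - min))
        .collect()
}

/// Coarse rounding to one decimal place, as in the Original/Rounded example.
fn round_to_tenth(values: &[f32]) -> Vec<f32> {
    values.iter().map(|&v| (v * 10.0).round() / 10.0).collect()
}

fn main() {
    let original = [0.158_f32, 0.266, 0.076, 0.196, 0.167];
    let rounded = round_to_tenth(&original);
    let pixels = quantize_u8(&original, 0.0, 1.0);
    let restored = dequantize_u8(&pixels, 0.0, 1.0);
    println!("rounded:  {:?}", rounded);
    println!("pixels:   {:?}", pixels);
    println!("restored: {:?}", restored);
}
```

With a linear mapping over a unit range, the 8-bit round trip changes each value by at most about 0.002, which is far smaller than the 0.05 worst-case change from rounding to one decimal place.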
Benchmarks on Apple M1 Pro, single-threaded release build:
| Audio Length | Frames | Time | Throughput |
|---|---|---|---|
| 10s | 997 | 21ms | 476x realtime |
| 60s | 5997 | 124ms | 484x realtime |
| 300s | 29997 | 622ms | 482x realtime |
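
For reference, the throughput column is the audio duration divided by the processing time. A minimal sketch, with a hypothetical helper name `realtime_factor`:

```rust
/// Realtime factor: seconds of audio processed per second of wall-clock time.
/// Hypothetical helper for illustration.
fn realtime_factor(audio_secs: f64, elapsed_secs: f64) -> f64 {
    audio_secs / elapsed_secs
}

fn main() {
    // (audio length in seconds, measured time in seconds) from the table above.
    for (audio, elapsed) in [(10.0, 0.021), (60.0, 0.124), (300.0, 0.622)] {
        println!("{audio}s -> {:.0}x realtime", realtime_factor(audio, elapsed));
    }
}
```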
mel() and the Kaldi filterbank builder still produce dense filterbank
matrices for reference, fixture comparison, and interchange with other
toolchains. Runtime mel/fbank computation derives sparse projection tables from
those dense matrices, so the executed math is checked against the same reference
weights instead of maintaining a separate filterbank definition.
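
As a rough sketch of how sparse projection tables can be derived from a dense filterbank, the example below keeps only the contiguous non-zero span of each filter row. The `SparseRow` type and function names are hypothetical and do not reflect mel-spec's internal data structures.

```rust
/// One filter stored as its first non-zero bin plus the contiguous weights.
/// Hypothetical type for illustration.
struct SparseRow {
    start: usize,
    weights: Vec<f32>,
}

/// Derive sparse rows from a dense [n_mels][n_bins] reference matrix.
fn sparsify(dense: &[Vec<f32>]) -> Vec<SparseRow> {
    dense
        .iter()
        .map(|row| match row.iter().position(|&w| w != 0.0) {
            None => SparseRow { start: 0, weights: Vec::new() },
            Some(first) => {
                let last = row.iter().rposition(|&w| w != 0.0).unwrap();
                SparseRow { start: first, weights: row[first..=last].to_vec() }
            }
        })
        .collect()
}

/// Project a power spectrum through the sparse rows.
fn project_sparse(rows: &[SparseRow], power: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|r| r.weights.iter().zip(&power[r.start..]).map(|(w, p)| w * p).sum())
        .collect()
}

/// Dense reference projection used to check the sparse path.
fn project_dense(dense: &[Vec<f32>], power: &[f32]) -> Vec<f32> {
    dense
        .iter()
        .map(|row| row.iter().zip(power).map(|(w, p)| w * p).sum())
        .collect()
}

fn main() {
    // Two small triangular filters over six frequency bins.
    let dense = vec![
        vec![0.0, 0.5, 1.0, 0.5, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 0.5, 1.0, 0.5],
    ];
    let power = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let sparse = sparsify(&dense);
    println!("dense:  {:?}", project_dense(&dense, &power));
    println!("sparse: {:?}", project_sparse(&sparse, &power));
}
```

Because triangular mel filters are non-zero over only a narrow band of bins, the sparse path skips most multiplications while using exactly the weights from the dense reference.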
asr-api also uses mel-spec filterbanks in its Parakeet/TDT frontend. We
benchmarked that Rust frontend against a CPU TorchScript trace of the original
NeMo Parakeet featurizer (featurizer_cpu.pt) on the JFK sample
(11s, mono 16 kHz) on the same M1 Mac.
The benchmark is useful for two reasons:
- It catches frontend contract drift: the first run found a one-frame mismatch
  (128x1100 vs 128x1101) caused by dropping NeMo's final centered/padded frame.
- After matching the frame count, the Rust features are numerically very close
  to the traced NeMo frontend.
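
The one-frame mismatch is consistent with a common centered-STFT frame-count convention. The sketch below assumes a hop of 160 samples and exactly 176,000 input samples (11 s at 16 kHz); both numbers are assumptions for illustration, and the function names are hypothetical.

```rust
/// Frame count for a centered STFT that pads both ends of the signal:
/// one frame at sample 0 plus one for every full hop.
fn centered_frames(n_samples: usize, hop: usize) -> usize {
    1 + n_samples / hop
}

/// Frame count if the final centered/padded frame is dropped.
fn centered_frames_without_last(n_samples: usize, hop: usize) -> usize {
    n_samples / hop
}

fn main() {
    // Assumed: 11 s of 16 kHz audio and a 10 ms hop.
    let n_samples = 11 * 16_000;
    let hop = 160;
    println!("with final frame:    {}", centered_frames(n_samples, hop));
    println!("without final frame: {}", centered_frames_without_last(n_samples, hop));
}
```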
| Featurizer | Shape | Mean | p50 | p95 | RTFx |
|---|---|---|---|---|---|
| Rust Parakeet frontend using mel-spec | 128x1101 | 2.341ms | 2.334ms | 2.406ms | 4699.62 |
| TorchScript CPU trace | 128x1101 | 2.244ms | 2.206ms | 2.813ms | 4902.22 |
Feature comparison across the full tensor:
| Metric | Value |
|---|---|
| MAE | 0.001183 |
| RMSE | 0.023699 |
| Max absolute error | 3.965733 |
| Correlation | 0.999719 |
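
For readers who want to reproduce this kind of comparison on their own tensors, here is a minimal sketch of the four metrics computed over two flattened feature arrays. The `Diff` struct and `compare` function are hypothetical names, not part of the benchmark harness.

```rust
/// Summary of element-wise differences between two feature tensors.
/// Hypothetical type for illustration.
struct Diff {
    mae: f64,
    rmse: f64,
    max_abs: f64,
    corr: f64,
}

/// Compare two flattened tensors of equal, non-zero length.
/// `corr` is the Pearson correlation and is NaN if either input is constant.
fn compare(a: &[f32], b: &[f32]) -> Option<Diff> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let n = a.len() as f64;
    let mean_a = a.iter().map(|&x| x as f64).sum::<f64>() / n;
    let mean_b = b.iter().map(|&x| x as f64).sum::<f64>() / n;
    let (mut abs_sum, mut sq_sum, mut max_abs) = (0.0_f64, 0.0_f64, 0.0_f64);
    let (mut cov, mut var_a, mut var_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        let d = (x - y).abs();
        abs_sum += d;
        sq_sum += d * d;
        max_abs = max_abs.max(d);
        cov += (x - mean_a) * (y - mean_b);
        var_a += (x - mean_a) * (x - mean_a);
        var_b += (y - mean_b) * (y - mean_b);
    }
    Some(Diff {
        mae: abs_sum / n,
        rmse: (sq_sum / n).sqrt(),
        max_abs,
        corr: cov / (var_a * var_b).sqrt(),
    })
}

fn main() {
    let rust_features = [1.0_f32, 2.0, 3.0, 4.0];
    let reference = [1.0_f32, 2.0, 3.0, 5.0];
    if let Some(d) = compare(&rust_features, &reference) {
        println!("MAE={} RMSE={} max={} corr={:.4}", d.mae, d.rmse, d.max_abs, d.corr);
    }
}
```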
The Rust Parakeet path uses BatchLogMelSpectrogram from the existing mel
module. It keeps FFT scratch buffers alive between calls and applies the mel
filterbank sparsely. This brings the pure Rust frontend close to the traced CPU
frontend while avoiding the libtorch/PyTorch runtime dependency. The comparison
harness lives in asr-torch as parakeet_featurizer_bench.
The CPU path is the default and is already fast enough for many streaming and batch workloads. Experimental native GPU backends are available behind feature flags:
| Feature | Backend |
|---|---|
| cuda | NVIDIA-only cuFFT plus CUDA mel projection. |
| wgpu | Native Rust GPU backend for Metal, Vulkan, and DX12 systems. |
The top-level library tests do not automatically build every standalone example crate. Use these commands when changing examples:
cargo test --release
cargo build --release --manifest-path examples/mel_tga/Cargo.toml
cargo build --release --manifest-path examples/stream_whisper/Cargo.toml
cargo build --release --manifest-path examples/tga_whisper/Cargo.toml
cargo run --release --manifest-path examples/vad_ten_eval/Cargo.toml
(cd examples/browser && npm ci && npm test)

The Whisper examples compile against the wavey-ai/whisper-rs fork and require
a GGML Whisper model to run inference.
The Hush live browser demo is active at:
https://wavey.ai/code/hush/?v=20260515-35
The source remains at wavey-ai/hush. With the current tuned settings it works as a live browser VAD, spectrogram debugging view, and local Whisper WASM transcription demo. It exposes mel structure, Sobel edges, ridge tracks, candidate speech regions, and the local transcript in real time. The VAD itself should still be treated as experimental rather than a drop-in replacement for a learned VAD; one strong use case is as a browser-side feature/debugging front end or cheap prefilter before a stronger VAD/ASR model.

