Mel Spec

Fast Rust mel spectrogram and VAD primitives for ASR systems.

Release note: 0.4.0 moves CPU mel and Kaldi fbank computation onto sparse filterbank projection derived from the same dense reference matrices. On the Parakeet/NeMo JFK benchmark, the pure Rust frontend is now close to C/libtorch CPU trace performance while preserving fixture parity.

mel-spec is built around the parts of speech pipelines that need to be cheap, predictable, and easy to embed: STFT, Whisper-compatible log-mel features, Kaldi-style filterbanks, TGA spectrogram interchange, and a lightweight VAD that reuses the same mel/STFT features.

Main Features

Feature	What it is for
Whisper-compatible mel	Log-mel spectrograms aligned with whisper.cpp, PyTorch, and librosa.
Kaldi-compatible fbank	80-bin Kaldi-style filterbank features for speaker and audio models.
Streaming STFT	Overlap-and-save STFT for live audio pipelines.
Model-free VAD	Fast speech/non-speech decisions and timestamps from mel spectrogram structure.
TGA mel images	Store and pass quantized mel spectrograms as simple 8-bit TGA files.
Local Whisper WASM	Hush uses `mel-spec` mel tensors/TGA segments for fully local browser Whisper transcription.
Native GPU backends	Experimental CUDA and `wgpu` paths for batched native mel generation.
Browser worker demo	WASM worker and SharedArrayBuffer example for live browser audio.

Quick Start

use mel_spec::prelude::*;

let samples = vec![0.0_f32; 16_000];
let mel_frames = Spectrogram::compute_mel_spectrogram_cpu(
    &samples,
    400,
    160,
    80,
    16_000.0,
);

println!("frames={}", mel_frames.len());

The focused API examples are kept in the example READMEs so the top-level README stays readable:

Example	Description
browser	Stream microphone or WAV audio to a WASM mel worker.
mel_tga	Convert raw audio to TGA mel spectrogram images.
tga_whisper	Transcribe precomputed TGA mel spectrograms with whisper.cpp.
stream_whisper	Stream ffmpeg audio through mel, VAD, and Whisper.
vad_ten_eval	Evaluate `mel-spec` VAD against the vendored TEN-VAD testset.

Voice Activity Detection

mel-spec includes a lightweight, model-free VAD. It does not load a neural VAD runtime; it looks for speech-like Sobel edge structure in mel spectrogram frames and can attach STFT-derived timestamps to each decision.

Current balanced default on the checked-in TEN-VAD testset:

System	Setting	Macro precision	Macro recall	Macro F1	Macro FPR	RTFx
`mel-spec`	balanced default	0.8751	0.8785	0.8566	0.3946	819.6
`mel-spec`	high-F1 sweep result	0.8165	0.9635	0.8769	0.6459	828.9
Silero	tuned threshold `0.13`	0.8897	0.9388	0.9088	0.3602	110.3
Silero	default threshold `0.50`	0.9379	0.8630	0.8826	0.1778	110.6

The balanced default is not trying to beat learned VADs at strict endpointing. It is a fast built-in option that reuses ASR mel features and avoids another model dependency. Tuned Silero is still more accurate overall; TEN-VAD is the source of the labels and upstream reports stronger precision/recall than Silero and WebRTC on the same testset.

Detailed method, provenance, commands, speed notes, and per-file results are in doc/vad/README.md.

TGA Spectrograms

TGA spectrograms are useful when you want a simple interchange format for mel features. They can be inspected as images, spliced, stored, and passed to the Whisper examples without keeping the original audio around.

This path is now live in Hush as local browser ASR. The browser uses mel-spec's Whisper-compatible log-mel output, stores captured speech segments as compact 8-bit TGA images, decodes them back to an 80-mel Float32Array, and passes that tensor directly to a custom whisper.cpp WASM binding via whisper_set_mel. The active Hush deployment verifies that local WASM Whisper can transcribe from the mel tensor without posting microphone audio to a server. This uses the direct-mel endpoint/entry point we PR'd against whisper.cpp, not the stock browser example that feeds PCM audio into whisper.wasm.

"the quest for peace."

Mel spectrograms are also robust under heavy quantization. Whisper does not need high-precision PCM once the signal has been projected into mel space: 8-bit TGA images preserve the information the model sees, and even coarse rounding of mel values can retain useful transcription quality.

Original: [0.158, 0.266, 0.076, 0.196, 0.167, ...]
Rounded:  [0.2,   0.3,   0.1,   0.2,   0.2,   ...]

(top: original mel values, bottom: values rounded to 1.0e-1 before image quantization)

Performance

Benchmarks on Apple M1 Pro, single-threaded release build:

Audio Length	Frames	Time	Throughput
10s	997	21ms	476x realtime
60s	5997	124ms	484x realtime
300s	29997	622ms	482x realtime

mel() and the Kaldi filterbank builder still produce dense filterbank matrices for reference, fixture comparison, and interchange with other toolchains. Runtime mel/fbank computation derives sparse projection tables from those dense matrices, so the executed math is checked against the same reference weights instead of maintaining a separate filterbank definition.

Parakeet/NeMo Frontend Check

asr-api also uses mel-spec filterbanks in its Parakeet/TDT frontend. We benchmarked that Rust frontend against a CPU TorchScript trace of the original NeMo Parakeet featurizer (featurizer_cpu.pt) on the JFK sample (11s, mono 16 kHz) on the same M1 Mac.

The benchmark is useful for two reasons:

It catches frontend contract drift: the first run found a one-frame mismatch (128x1100 vs 128x1101) caused by dropping NeMo's final centered/padded frame.
After matching the frame count, the Rust features are numerically very close to the traced NeMo frontend.

Featurizer	Shape	Mean	p50	p95	RTFx
Rust Parakeet frontend using `mel-spec`	`128x1101`	`2.341ms`	`2.334ms`	`2.406ms`	`4699.62`
TorchScript CPU trace	`128x1101`	`2.244ms`	`2.206ms`	`2.813ms`	`4902.22`

Feature comparison across the full tensor:

Metric	Value
MAE	`0.001183`
RMSE	`0.023699`
Max absolute error	`3.965733`
Correlation	`0.999719`

The Rust Parakeet path uses BatchLogMelSpectrogram from the existing mel module. It keeps FFT scratch buffers alive between calls and applies the mel filterbank sparsely. This brings the pure Rust frontend close to the traced CPU frontend while avoiding the libtorch/PyTorch runtime dependency. The comparison harness lives in asr-torch as parakeet_featurizer_bench.

The CPU path is the default and is already fast enough for many streaming and batch workloads. Experimental native GPU backends are available behind feature flags:

Feature	Backend
`cuda`	NVIDIA-only cuFFT plus CUDA mel projection.
`wgpu`	Native Rust GPU backend for Metal, Vulkan, and DX12 systems.

Build Checks

The top-level library tests do not automatically build every standalone example crate. Use these commands when changing examples:

cargo test --release
cargo build --release --manifest-path examples/mel_tga/Cargo.toml
cargo build --release --manifest-path examples/stream_whisper/Cargo.toml
cargo build --release --manifest-path examples/tga_whisper/Cargo.toml
cargo run --release --manifest-path examples/vad_ten_eval/Cargo.toml
(cd examples/browser && npm ci && npm test)

The Whisper examples compile against the wavey-ai/whisper-rs fork and require a GGML Whisper model to run inference.

Hush Demo

The Hush live browser demo is active at:

https://wavey.ai/code/hush/?v=20260515-35

The source remains at wavey-ai/hush. With the current tuned settings it works as a live browser VAD, spectrogram debugging view, and local Whisper WASM transcription demo. It exposes mel structure, Sobel edges, ridge tracks, candidate speech regions, and the local transcript in real time. The VAD itself should still be treated as experimental rather than a drop-in replacement for a learned VAD; one strong use case is as a browser-side feature/debugging front end or cheap prefilter before a stronger VAD/ASR model.

Name		Name	Last commit message	Last commit date
Latest commit History 138 Commits
.github/workflows		.github/workflows
doc		doc
examples		examples
src		src
testdata		testdata
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.toml		Cargo.toml
LICENSE-MIT		LICENSE-MIT
README.md		README.md
build.rs		build.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mel Spec

Main Features

Quick Start

Voice Activity Detection

TGA Spectrograms

Performance

Parakeet/NeMo Frontend Check

Build Checks

Hush Demo

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Mel Spec

Main Features

Quick Start

Voice Activity Detection

TGA Spectrograms

Performance

Parakeet/NeMo Frontend Check

Build Checks

Hush Demo

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages