Rust-first tooling for MOSS-Audio-Tokenizer-Nano RVQ artifacts, with a small WASM layer for browser playback through ONNX Runtime Web.
The crate owns the deterministic codec-adjacent work:
- .mossnano container parsing and writing
- 10-bit RVQ token packing and unpacking
- metadata, duration, and bitrate accounting
- streaming decode chunk scheduling
- ONNX token layout conversion from [quantizer, frame] to [time, quantizer]
- decoded PCM accumulation
- PCM16 WAV writing
- wasm-bindgen exports for browser and Node.js use
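The layout conversion in the list above can be sketched as a plain transpose. This is a minimal illustration, not the crate's actual code: the function name `qf_to_tq_i32` is hypothetical, and it assumes the unpacked codes are a flat row-major `[quantizer, frame]` slice of `u16` values that the ONNX graph wants as `int32`.

```rust
/// Reorders RVQ codes from row-major [quantizer, frame] to row-major
/// [time, quantizer], widening each code to i32 on the way.
/// Hypothetical helper; the crate's real function may differ.
pub fn qf_to_tq_i32(codes_qf: &[u16], quantizers: usize, frames: usize) -> Option<Vec<i32>> {
    // Reject inputs whose length does not match the declared shape.
    if codes_qf.len() != quantizers.checked_mul(frames)? {
        return None;
    }
    let mut codes_tq = vec![0i32; codes_qf.len()];
    for q in 0..quantizers {
        for t in 0..frames {
            // Source index: quantizer-major. Destination index: time-major.
            codes_tq[t * quantizers + q] = i32::from(codes_qf[q * frames + t]);
        }
    }
    Some(codes_tq)
}
```

With 2 quantizers and 3 frames, the input `[0, 1, 2, 10, 11, 12]` becomes `[0, 10, 1, 11, 2, 12]`.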
The neural model still runs through the official MOSS Nano ONNX graphs using
onnxruntime-web. This keeps the Rust/WASM surface small and predictable while
preserving the path to browser playback.
This is an early experimental repo. Decode playback uses the official
moss_audio_tokenizer_decode_step.onnx graph with transformer offsets and
attention cache tensors carried across chunks. The cache position tensors must
start at -1, matching the native model reset path.
moss_audio_tokenizer_decode_full.onnx is still useful for whole-file reference
decodes. Do not reset that full graph independently for each playback chunk:
that creates audible boundary artifacts and does not match native output.
Tested locally with MOSS-Audio-Tokenizer-Nano RVQ16 stereo artifacts at 48 kHz.
MOSS Nano emits one RVQ token frame per 3,840 decoded samples. At 48 kHz this is an 80 ms quantum, so second-based chunk targets must snap to whole token frames.
| Target | Token frames | Actual duration |
|---|---|---|
| 1.333 s | 17 | 1.36 s |
| 1.8 s | 23 | 1.84 s |
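The snapping in the table can be reproduced with a few lines of integer arithmetic. This is a sketch under one assumption: the target is rounded up to the next whole token frame. That matches both rows, but the crate may round differently in other cases. The function names are illustrative.

```rust
/// Decoded samples per RVQ token frame, as stated above.
pub const SAMPLES_PER_TOKEN_FRAME: u64 = 3_840;

/// Snaps a chunk target in seconds to a whole number of token frames.
/// Rounding up is an assumption that happens to match the table rows.
pub fn snap_chunk_frames(target_seconds: f64, sample_rate: u32) -> u64 {
    // Convert to samples first so the ceiling is taken on integers.
    let target_samples = (target_seconds * f64::from(sample_rate)).round() as u64;
    let frames = (target_samples + SAMPLES_PER_TOKEN_FRAME - 1) / SAMPLES_PER_TOKEN_FRAME;
    // Never schedule an empty chunk.
    frames.max(1)
}

/// Actual chunk duration in seconds for a given number of token frames.
pub fn chunk_seconds(frames: u64, sample_rate: u32) -> f64 {
    (frames * SAMPLES_PER_TOKEN_FRAME) as f64 / f64::from(sample_rate)
}
```

At 48 kHz, `snap_chunk_frames(1.333, 48_000)` gives 17 frames and `snap_chunk_frames(1.8, 48_000)` gives 23 frames.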
The WASM API exposes MossNanoDecodeStream:
const stream = new MossNanoDecodeStream(artifactBytes, 17);
while (stream.hasNext()) {
  const start = stream.nextStartFrame();
  const tokenFrames = stream.nextTokenFrames();
  const codes = stream.nextCodesTqI32();
  // Run decode_step.onnx with:
  //   audio_codes: [1, tokenFrames, quantizers], int32
  //   audio_code_lengths: [1], int32
  //   plus the carried transformer/attention state tensors
  stream.pushDecodedPlanar(decodedPlanarF32, channels, decodedFrames);
}
const wavBytes = stream.finishPcm16Wav();

Rust handles chunk scheduling, token slicing, token transposition, decoded audio assembly, and WAV writing. JavaScript loads ONNX Runtime Web, invokes the stateful decoder graph for each chunk, and feeds every state output back into the next chunk.
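The audio assembly and WAV writing steps could look roughly like the sketch below. It rests on assumptions not confirmed here: that "planar" means a flat `[channel][frame]` buffer, and that samples are clamped to [-1, 1] and scaled by 32767. Both function names are hypothetical, and the crate's own conversion may differ.

```rust
/// Converts planar f32 audio (flat [channel][frame]) to interleaved PCM16.
/// Clamping and the 32767 scale factor are assumptions of this sketch.
pub fn planar_f32_to_interleaved_i16(planar: &[f32], channels: usize, frames: usize) -> Option<Vec<i16>> {
    if channels == 0 || planar.len() != channels.checked_mul(frames)? {
        return None;
    }
    let mut out = Vec::with_capacity(planar.len());
    for t in 0..frames {
        for c in 0..channels {
            let sample = planar[c * frames + t].clamp(-1.0, 1.0);
            out.push((sample * 32767.0).round() as i16);
        }
    }
    Some(out)
}

/// Wraps interleaved PCM16 samples in a canonical 44-byte RIFF/WAVE header.
pub fn pcm16_wav(samples: &[i16], sample_rate: u32, channels: u16) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32;
    let block_align = channels * 2; // bytes per frame across all channels
    let byte_rate = sample_rate * u32::from(block_align);
    let mut wav = Vec::with_capacity(44 + samples.len() * 2);
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes());
    wav.extend_from_slice(b"WAVE");
    wav.extend_from_slice(b"fmt ");
    wav.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    wav.extend_from_slice(&1u16.to_le_bytes()); // format tag 1 = integer PCM
    wav.extend_from_slice(&channels.to_le_bytes());
    wav.extend_from_slice(&sample_rate.to_le_bytes());
    wav.extend_from_slice(&byte_rate.to_le_bytes());
    wav.extend_from_slice(&block_align.to_le_bytes());
    wav.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    for sample in samples {
        wav.extend_from_slice(&sample.to_le_bytes());
    }
    wav
}
```

Interleaving only at the end keeps each decoded chunk appendable per channel, which is one plausible reason to accept planar input from the ONNX side.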
The current .mossnano container is intentionally tiny:
| Bytes | Field |
|---|---|
| 0..8 | ASCII magic MOSSNANO |
| 8..12 | sample_rate, little-endian u32 |
| 12..16 | channels, little-endian u32 |
| 16..20 | original_samples, little-endian u32 |
| 20..24 | quantizers, little-endian u32 |
| 24..28 | frames, little-endian u32 |
| 28..32 | codebook_size, little-endian u32 |
| 32.. | LSB-first packed RVQ codes |
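The 32-byte header in the table can be read with a short parser. The sketch below is illustrative only: the type and function names are hypothetical, the duration helper assumes `original_samples` counts samples per channel, and the bit width is derived as the ceiling of log2 of `codebook_size`, which is an assumption beyond the 1024-entry case.

```rust
/// Fields of the 32-byte .mossnano header, in file order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MossNanoHeader {
    pub sample_rate: u32,
    pub channels: u32,
    pub original_samples: u32,
    pub quantizers: u32,
    pub frames: u32,
    pub codebook_size: u32,
}

pub const HEADER_LEN: usize = 32;

/// Parses the header and returns it with the packed payload that follows.
pub fn parse_header(bytes: &[u8]) -> Option<(MossNanoHeader, &[u8])> {
    if bytes.len() < HEADER_LEN || !bytes.starts_with(b"MOSSNANO") {
        return None;
    }
    // Every field after the magic is a little-endian u32.
    let field = |offset: usize| {
        u32::from_le_bytes([bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3]])
    };
    let header = MossNanoHeader {
        sample_rate: field(8),
        channels: field(12),
        original_samples: field(16),
        quantizers: field(20),
        frames: field(24),
        codebook_size: field(28),
    };
    Some((header, &bytes[HEADER_LEN..]))
}

impl MossNanoHeader {
    /// Duration in seconds, assuming original_samples is a per-channel count.
    pub fn duration_seconds(&self) -> f64 {
        f64::from(self.original_samples) / f64::from(self.sample_rate)
    }

    /// Bits per packed code, assumed to be ceil(log2(codebook_size)).
    pub fn bits_per_code(&self) -> u32 {
        32 - self.codebook_size.saturating_sub(1).leading_zeros()
    }

    /// Total payload size in bits: one code per quantizer per frame.
    pub fn payload_bits(&self) -> u64 {
        u64::from(self.quantizers) * u64::from(self.frames) * u64::from(self.bits_per_code())
    }
}
```

Dividing `payload_bits()` by `duration_seconds()` gives a payload bitrate estimate that excludes the fixed header.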
For MOSS Nano RVQ16, codebook_size = 1024, so each token is packed into 10
bits. The packed code order is [quantizer, frame].
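To make the 10-bit packing concrete, here is a minimal pack/unpack pair. It reads "LSB-first" as: code `i` occupies stream bits `10*i .. 10*i + 10`, where stream bit `k` is bit `k % 8` of byte `k / 8`. That reading is an assumption, and the function names are illustrative rather than the crate's API.

```rust
/// Bits per code for a 1024-entry codebook.
pub const CODE_BITS: u32 = 10;

/// Packs codes LSB-first at 10 bits each; any final partial byte is zero-padded.
pub fn pack_codes_10bit(codes: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity((codes.len() * CODE_BITS as usize + 7) / 8);
    let mut acc: u64 = 0; // bit accumulator, low bits are emitted first
    let mut acc_bits: u32 = 0;
    for &code in codes {
        acc |= (u64::from(code) & 0x3FF) << acc_bits;
        acc_bits += CODE_BITS;
        while acc_bits >= 8 {
            out.push((acc & 0xFF) as u8);
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if acc_bits > 0 {
        out.push((acc & 0xFF) as u8);
    }
    out
}

/// Unpacks `count` 10-bit codes; returns None if the input is too short.
pub fn unpack_codes_10bit(packed: &[u8], count: usize) -> Option<Vec<u16>> {
    let mut out = Vec::with_capacity(count);
    let mut bytes = packed.iter();
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    for _ in 0..count {
        // Refill the accumulator until a whole code is available.
        while acc_bits < CODE_BITS {
            acc |= u64::from(*bytes.next()?) << acc_bits;
            acc_bits += 8;
        }
        out.push((acc & 0x3FF) as u16);
        acc >>= CODE_BITS;
        acc_bits -= CODE_BITS;
    }
    Some(out)
}
```

Under this bit order, the four codes `[1, 1023, 512, 0]` pack into the five bytes `01 FC 0F 20 00`, and unpacking those bytes with `count = 4` returns the original codes.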
Inspect a .mossnano artifact:
cargo run -- info path/to/file.mossnano

Unpack codes to little-endian u16 values:
cargo run -- unpack-u16le path/to/file.mossnano target/codes.u16le

Install Rust and Node dependencies:
rustup target add wasm32-unknown-unknown
cargo install wasm-bindgen-cli --version 0.2.121
cd web
npm install
cd ..

Build the WASM package:
scripts/build-wasm.sh

Run the Rust/WASM smoke test:
node scripts/wasm-smoke.mjs

Weights are intentionally not committed. Download the official
browser-oriented ONNX bundle into weights/:
scripts/download-onnx.sh

Expected files:
- moss_audio_tokenizer_decode_full.onnx
- moss_audio_tokenizer_decode_step.onnx
- moss_audio_tokenizer_decode_shared.data
- moss_audio_tokenizer_encode.onnx
- moss_audio_tokenizer_encode.data
- codec_browser_onnx_meta.json
Decode-only playback needs the decoder graph and shared decoder data, about 45 MB total. Encode plus decode needs about 90 MB.
Decode a .mossnano artifact with the 1.333-second target. This uses the
stateful decode_step graph by default:
node scripts/decode-node.mjs \
--input path/to/file.mossnano \
--output target/decoded.wav \
--chunk-seconds 1.333

Decode with the 1.8-second target:
node scripts/decode-node.mjs \
--input path/to/file.mossnano \
--output target/decoded-1p8.wav \
--chunk-seconds 1.8

You can also pass exact token-frame chunks:
node scripts/decode-node.mjs --input path/to/file.mossnano --chunk-frames 23

For a whole-file reference pass through decode_full.onnx, pass the full token
frame count and opt into the full decoder:
node scripts/decode-node.mjs \
--input path/to/file.mossnano \
--output target/decoded-full.wav \
--chunk-frames 50 \
--decoder full

Compare a chunked output against a reference and inspect chunk joins:
node scripts/compare-wav-boundaries.mjs \
--reference target/decoded-full.wav \
--candidate target/decoded.wav \
--chunk-frames 17

Start a local server:
cd web
npm run serve

Open http://localhost:8765/web/, choose a .mossnano file, and leave the
model root as:
../weights/MOSS-Audio-Tokenizer-Nano-ONNX/
The page loads the Rust WASM package, fetches the ONNX decoder graph and shared external data, decodes chunk by chunk, and creates a playable WAV blob in the browser.
Run the native tests:
cargo test

Run formatting and JS syntax checks:
cargo fmt --check
node --check scripts/decode-node.mjs
node --check web/mossnano-player.js

Generated files and downloaded weights are ignored by git:
- target/
- weights/
- web/node_modules/
- web/pkg/
MIT