Streaming HTTP + WebSocket server wrapping NVIDIA Parakeet Realtime EOU (120M) via parakeet-rs. Cache-aware streaming ASR with end-of-utterance detection, designed to drop into any app that needs low-latency transcription.
## Features

- ~160 ms chunked streaming over WebSocket, partial text emitted per chunk
- End-of-utterance (EOU) detection: `final` events and automatic decoder cache resets
- Plain HTTP endpoint for batch WAV transcription
- Single ~5 MB Rust binary, dynamically loads onnxruntime
- CUDA 12.x acceleration on Windows x64
## Quick start (prebuilt binary)

- Download `parakeet-realtime-server-v0.1.0-win-x64.zip` from the latest release.
- Unzip into a folder of your choice (call it `<install>`).
- Clone this repo anywhere and copy `scripts/` into `<install>/scripts/` (or just download the four PowerShell scripts individually).
- In `<install>`:

```powershell
.\scripts\fetch-cuda-deps.ps1   # one-time, ~2 GB of NVIDIA DLLs
.\scripts\download-models.ps1   # one-time, ~480 MB of ONNX weights
.\parakeet-realtime-server.exe --model-dir models --port 9005
```

Health check: `curl http://127.0.0.1:9005/health` should return `{"ready":true}` after warm-up.
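Scripted setups can wait for warm-up instead of polling by hand. A minimal stdlib-only Python sketch (the helper names, timeout, and poll interval are mine; only the URL and the `{"ready": ...}` response shape come from this README):

```python
import json
import time
import urllib.request

HEALTH_URL = "http://127.0.0.1:9005/health"  # assumed default host/port

def is_ready(body: bytes) -> bool:
    """True once /health reports {"ready": true}."""
    return json.loads(body).get("ready") is True

def wait_for_server(url: str = HEALTH_URL, timeout_s: float = 120.0) -> bool:
    """Poll /health once a second until the model is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(resp.read()):
                    return True
        except OSError:
            pass  # connection refused: server not up yet
        time.sleep(1.0)
    return False
```

Calling `wait_for_server()` before opening the WebSocket avoids racing the model warm-up.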
## Building from source

Prerequisites:

- Rust stable, installed via rustup (`rustup default stable-x86_64-pc-windows-msvc`)
- Visual Studio 2022 Build Tools with the "Desktop development with C++" workload
- CUDA 12.x toolkit (set `CUDA_PATH`)
- cuDNN 9.x DLLs (the `fetch-cuda-deps.ps1` script grabs these)
- 7-Zip on PATH (needed by `fetch-cuda-deps.ps1` to extract one archive)
```powershell
git clone https://github.com/pauldaywork/parakeet-realtime-server
cd parakeet-realtime-server
.\scripts\build.ps1
```

Output lands in `dist/`. Then:

```powershell
.\scripts\fetch-cuda-deps.ps1
.\scripts\download-models.ps1
.\dist\parakeet-realtime-server.exe --model-dir dist\models --port 9005
```

## HTTP API

### GET /health

Readiness check. Returns `{"ready": false}` while the model warms up and `{"ready": true}` once a test transcription completes.
### POST /v1/audio/transcriptions

Multipart WAV upload. Field name: `file`. Returns `{"text": "..."}`.
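The endpoint can also be called from Python with only the standard library. A sketch (the helper names are mine; only the field name `file`, the route, and the `{"text": ...}` response shape come from the endpoint description above):

```python
import json
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/wav"):
    """Build a multipart/form-data body with a single file field."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str,
               url: str = "http://127.0.0.1:9005/v1/audio/transcriptions") -> str:
    """Upload a WAV file and return the transcribed text."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path, f.read())
    req = urllib.request.Request(url, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```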
```shell
curl -F file=@audio.wav http://127.0.0.1:9005/v1/audio/transcriptions
```

## WebSocket streaming

Connect to `ws://127.0.0.1:9005/stream` and send binary frames containing raw int16 LE PCM at 16 kHz. The server accumulates the audio into 160 ms chunks and emits JSON text frames:

```json
{"type": "partial", "text": "hello wor"}
{"type": "partial", "text": "hello world"}
{"type": "final", "text": "hello world"}
```

Send `{"type":"end"}` to flush any remaining audio. The server resets its decoder cache after each `final` event, so multiple utterances can be streamed over one connection.
## Flags

| Flag | Default | Description |
|---|---|---|
| `--model-dir` | (required) | Folder containing `encoder.onnx`, `decoder_joint.onnx`, `tokenizer.json` |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `9005` | TCP port |
| `--chunk-ms` | `160` | Chunk size in ms (affects latency) |
| `--device` | `cuda` | `cuda` or `cpu` |
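Putting the flags together, a CPU-only launch with a larger chunk size might look like this (the values are illustrative, not recommendations):

```powershell
.\parakeet-realtime-server.exe --model-dir models --device cpu --chunk-ms 320 --port 9005
```

A larger `--chunk-ms` means fewer inference calls but later partials; the default of 160 ms favors latency.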
## Examples

- `examples/mic-streaming.html`: zero-dependency single-file demo; open it in any browser after starting the server.
- `examples/web-app/`: a full Vite/React app with mic picker, live transcript, raw event log, and health badge.
## Troubleshooting

| Symptom | Fix |
|---|---|
| Exit code `0xc0000135` at startup | Missing CUDA/cuDNN DLLs. Run `scripts\fetch-cuda-deps.ps1`. |
| `Could not locate cudnn_graph64_9.dll` | cuDNN 9.x sub-DLL missing. Same fix. |
| `/health` returns `{"ready": false}` forever | Model files not found. Check that `--model-dir` points at a folder containing `encoder.onnx`, `decoder_joint.onnx`, and `tokenizer.json`. |
| Partial events stop mid-stream | Client stopped sending PCM but didn't send `{"type":"end"}`. Send it before closing. |
## License

MIT. See LICENSE.

## Credits
- altunenes/parakeet-rs — the underlying Rust parakeet inference crate this server wraps.
- NVIDIA Parakeet Realtime EOU 120M — the model.
- whisper.cpp — repo-structure inspiration.