A local-first desktop daemon that runs open-source speech-to-text models on your own NVIDIA GPU and exposes them over a small HTTP + WebSocket API.
Bring your own GPU, pick a model from the UI, and your apps talk to
http://localhost:8000 instead of a cloud endpoint. One-shot file
transcription via REST, live streaming via WebSocket, no data leaves the
machine.
- Desktop app (Electrobun) — a small window with a model toggle list, GPU memory bar, and settings for cache path / API port / HuggingFace token. Downloads models on demand, stashes them under your cache dir, and starts/stops the inference daemons as you enable each model.
- HTTP + WebSocket gateway on `http://localhost:8000` — REST for one-shot file uploads, WS for live mic streaming. Same wire shape as OpenAI's transcription API (close enough that existing clients work).
- TypeScript client library in `clients/typescript` — zero-dep, works in browsers / Node / Bun / Deno. See its own README for the full API.
- Browser demo in `examples/test-web-app` — a Vite + React page that records from your mic and renders transcripts live against a single model. Doubles as the canonical example of how to use the client.
- Compare-models demo in `examples/compare-models` — a Vite + React grid that fans one mic stream out to every ready WS-capable model simultaneously, with per-lane metrics and Markdown session reports. See its own README.
Under the hood there are two live paths:
- VAD-segmented models (`moonshine-base`, `parakeet-tdt-v2`, `cohere-transcribe`): Silero VAD running on the server chunks incoming mic audio into speech segments; each segment goes through a model daemon for transcription, and the client gets one `transcript.final` per utterance.
- Native-streaming models (`parakeet-realtime-eou`): the daemon does its own end-of-utterance detection and emits continuous partials plus utterance finals. The server just pipes PCM straight to it.
GPU memory footprint is held in one place per model so enabling three models actually costs three models' worth of VRAM — no surprises.
All models live in src/bun/model-manager/registry.ts.
Adding a new sherpa-onnx model is a registry edit, not a code edit (the
sherpa-onnx CLI args are data-driven off that file).
| Model id | Backend | Capabilities | ~VRAM | Notes |
|---|---|---|---|---|
| `whisper-large-v3-turbo` | whisper.cpp | REST | ~3 GB | Q4_K quant, best speed/accuracy balance |
| `parakeet-tdt-v2` | sherpa-onnx | REST + WS (vad) | ~2 GB | NVIDIA Parakeet TDT, fast, English only |
| `moonshine-base` | sherpa-onnx | REST + WS (vad) | ~1 GB | Lightweight realtime model |
| `cohere-transcribe` | sherpa-onnx | REST + WS (vad) | ~5 GB | 2B params, 14 languages |
| `parakeet-realtime-eou` | parakeet-rs | REST + WS (streaming) | ~1.5 GB | Native streaming with partial hypotheses + on-device EOU |
- `rest` — model can be hit via `POST /v1/audio/transcriptions`.
- `ws` — model can be streamed to via `WS /v1/audio/transcriptions/stream`. Server-side Silero VAD chops the stream into utterances; the client sees one `transcript.final` per segment.
- `ws-streaming` — same endpoint, but the daemon owns streaming and emits continuous `transcript.partial` frames and utterance-aligned `transcript.final` frames. The server forwards PCM without running VAD.
Clients don't pick a mode — they just connect; the server dispatches to the VAD path or the streaming path based on the model's capabilities.
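That dispatch can be sketched in a few lines; the `Capability` type and function name here are illustrative (the real logic lives in `src/bun/gateway/server.ts`):

```typescript
// Hypothetical capability type mirroring the registry's capabilities array.
type Capability = "rest" | "ws" | "ws-streaming";

// Pick the live path from the model's capabilities; clients never choose.
function pickStreamPath(capabilities: Capability[]): "streaming" | "vad" | null {
  if (capabilities.includes("ws-streaming")) return "streaming"; // daemon-owned partials + finals
  if (capabilities.includes("ws")) return "vad";                 // server-side Silero VAD segmentation
  return null;                                                   // REST-only model: reject the WS upgrade
}
```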
- Windows with an NVIDIA GPU (Linux/macOS are not currently supported — the bundled binaries are Windows-only).
- Bun for running and building.
- 7-Zip on `PATH` (needed by `scripts/setup-binaries.ts` to extract ffmpeg and the CUDA archive).
- HuggingFace token (optional) — only needed for gated models.
```
bun install
bun run scripts/setup-binaries.ts   # one-time: downloads native backends + CUDA DLLs
bun run dev                         # electrobun dev — builds + opens the desktop window
```

`setup-binaries.ts` populates `binaries/win32/` with:

- `whisper-server/` — whisper.cpp `whisper-server.exe` + ggml DLLs
- `sherpa-onnx/` — `sherpa-onnx-offline-websocket-server.exe` + onnxruntime
- `parakeet-realtime/` — `parakeet-realtime-server.exe` (pulled from its standalone repo's GitHub release)
- `ffmpeg/` — `ffmpeg.exe` + ffmpeg DLLs
- `cuda-runtime/` — CUDA 12.x + cuDNN 9.x DLLs, shared by all three GPU backends. Every backend adapter prepends this directory to `PATH` at spawn time rather than each shipping its own copy.
The first dev run creates config/settings.json with sensible defaults.
Toggle a model on in the UI and it'll download (progress bar shows in
the model card), load into VRAM, and start serving immediately. The
API is live at http://localhost:8000 as soon as the main process boots.
```
cd examples/test-web-app
bun install
bunx vite
# open http://localhost:5173, pick moonshine-base, click Record
```

Or run the compare-models grid against every ready WS model at once:
```
cd examples/compare-models
bun install
bunx vite
# open http://localhost:5174, click Record
```

To produce a distributable:

```
bun run build   # electrobun build — produces a distributable under build/
```

Base URL: http://localhost:8000 (configurable via the Settings tab).
| Method | Path | Purpose |
|---|---|---|
| GET | `/v1/models` | List every registered model with current status and capabilities |
| GET | `/v1/models/:id` | Single model, includes `download_progress` while downloading |
| GET | `/v1/gpu/status` | GPU name + VRAM used/total/free |
| POST | `/v1/audio/transcriptions` | Multipart `file` + `model` fields. Returns `{ text }`. Accepts any format ffmpeg can decode; max 100 MB. |
| WS | `/v1/audio/transcriptions/stream?model=<id>` | Live mic streaming. Send int16 LE PCM at 16 kHz mono as binary frames. Server sends JSON text frames: `transcript.partial`, `transcript.final`, `error`. Send `{"type":"end"}` to flush and close. |
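Browser audio arrives as Float32 samples, so a client has to convert before sending. A minimal sketch of producing the int16 LE binary frames the WS endpoint expects (helper name is illustrative; the client library does this for you):

```typescript
// Convert Web Audio Float32 samples ([-1, 1]) to int16 little-endian PCM,
// the binary frame format /v1/audio/transcriptions/stream expects.
function floatTo16BitPcm(samples: Float32Array): ArrayBuffer {
  const buf = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid wraparound
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buf;
}

// Usage against a live socket (sketch):
//   ws.send(floatTo16BitPcm(chunk));          // binary audio frame
//   ws.send(JSON.stringify({ type: "end" })); // flush and close
```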
Errors use an OpenAI-compatible envelope:
{ "error": { "message": "...", "type": "invalid_request_error", "code": "invalid_model" } }All error codes are listed in clients/typescript/src/errors.ts.
src/bun/ Backend daemon (the thing `electrobun dev` boots)
├── index.ts Entry: config → lifecycle → gateway → Electrobun window
├── types.ts Shared types (ModelId, ModelConfig, Settings, Capability, ...)
├── rpc.ts RPC schema for the desktop UI ↔ backend bridge
├── paths.ts Static paths: APP_ROOT, BINARIES_DIR, ASSETS_DIR, CUDA_RUNTIME_DIR
├── gateway/
│ ├── server.ts Elysia app: REST + WS routes, VAD vs streaming dispatch, queue
│ └── errors.ts OpenAI-compatible error envelope helper
├── adapters/
│ ├── adapter.ts The Adapter interface (start/stop/transcribe/streamingTranscribe?)
│ ├── whisper-cpp.ts whisper.cpp subprocess adapter
│ ├── sherpa-onnx.ts sherpa-onnx daemon adapter (registry-driven CLI args)
│ ├── parakeet-rs.ts parakeet-realtime-server adapter (native streaming WS)
│ ├── vad.ts Silero VAD stream wrapper (per server-side VAD session)
│ └── ffmpeg.ts Audio → 16 kHz WAV conversion for REST uploads
├── model-manager/
│ ├── registry.ts Static MODELS dict (the source of truth)
│ ├── lifecycle.ts State machine per model (disabled → ready → error)
│ ├── orchestrator.ts enable/disable/retry model flows
│ ├── downloader.ts HuggingFace + GitHub release download + extract
│ └── vram.ts nvidia-smi wrapper for GPU status
└── config/
├── store.ts settings.json read/write
└── credentials.ts HuggingFace token storage
src/views/main-ui/ Desktop UI (React, runs in the Electrobun WebView)
├── index.tsx Root
├── rpc.ts Typed stub talking back to src/bun
├── types.ts UI-local types
├── components/ ModelsTab, SettingsTab, ModelCard, VramBar, ...
└── hooks/ useModels, useGpu, useSettings — polls + RPC
clients/typescript/ Published client library (zero-dep)
├── src/
│ ├── client.ts AudioModelsClient (list/get/transcribe/streamTranscribe)
│ ├── stream.ts TranscriptionStream (native ReadableStream iterator)
│ ├── errors.ts AudioModelsApiError + AudioModelsStreamError
│ ├── types.ts Exported public types
│ └── index.ts Barrel
├── test-integration.ts End-to-end test against a real in-process gateway
└── README.md Full API docs
examples/test-web-app/ Single-model browser demo (Vite + React)
├── src/
│ ├── App.tsx Top-level layout + state wiring
│ ├── config.ts Base URL
│ ├── components/ Controls, TranscriptPanel, EventLogPanel, ...
│ ├── hooks/ useStreamRecorder, useModels, useMicDevices, ...
│ └── audio/
│ └── pcmStream.ts Owns the Web Audio API lifecycle (getUserMedia,
│ AudioContext, AudioWorklet) and yields chunks
└── public/
└── pcm-worklet.js Audio worklet: buffers 128-sample quanta into
2048-sample frames for the hook to forward
examples/compare-models/ Fan-out browser demo — every ready WS model at once
├── src/ Reducer-driven grid, per-lane metrics + event log
├── reports/ Markdown session reports written on each Stop
└── README.md Run + manual test plan
tests/ Unit + integration tests (bun test)
├── adapters/ ffmpeg, sherpa-onnx arg builder, vad, whisper-cpp
├── config/ settings store
├── gateway/ server + errors envelope (hits the live Elysia app)
├── model-manager/ lifecycle, registry, orchestrator, downloader
├── integration/ Real-daemon tests (skipped when models aren't cached)
└── rpc.test.ts UI ↔ backend RPC contract
scripts/ One-off tooling
├── setup-binaries.ts Download + extract all native backends and CUDA DLLs
├── api-smoke-test.ts Boots the full app, enables a model, exercises every route
├── e2e-test.ts Full-stack server + client check
└── verify-concurrent-ws.ts Opens N concurrent WS streams across different models
binaries/win32/ Pre-built native executables (populated by setup-binaries.ts)
├── whisper-server/ whisper.cpp whisper-server.exe + ggml DLLs
├── sherpa-onnx/ sherpa-onnx-offline-websocket-server.exe + onnxruntime
├── parakeet-realtime/ parakeet-realtime-server.exe (from standalone repo release)
├── ffmpeg/ ffmpeg.exe + ffmpeg DLLs
└── cuda-runtime/ Shared CUDA 12.x + cuDNN 9.x DLLs (prepended to PATH per-backend)
assets/ Small bundled assets (Silero VAD model, test WAVs)
One-shot file upload (POST /v1/audio/transcriptions):
Client → Elysia handler → validate form → ffmpeg (if not WAV) → 16 kHz WAV
→ p-queue for the target model (concurrency 1)
→ Adapter.transcribe(wav)
└── whisper.cpp: spawn whisper-server, read stdout text
└── sherpa-onnx: short-lived WS to local daemon, send PCM, read JSON
└── parakeet-rs: POST multipart to daemon's /v1/audio/transcriptions
→ { text } response
The queue enforces backpressure with queue.size + queue.pending >= 5
(429 queue_full with Retry-After: 5) and a 60 s per-task timeout
(504 queue_timeout). See src/bun/gateway/server.ts.
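The admission check described above can be sketched as follows (`size` and `pending` are p-queue's real properties; the handler shape and helper name are illustrative):

```typescript
// Decide whether a new REST transcription may enter the per-model queue.
// `size` = tasks waiting, `pending` = tasks currently running (p-queue terms).
const MAX_BACKLOG = 5;

function admit(queue: { size: number; pending: number }):
  | { ok: true }
  | { ok: false; status: 429; retryAfter: number } {
  if (queue.size + queue.pending >= MAX_BACKLOG) {
    // Maps to the 429 queue_full response with Retry-After: 5.
    return { ok: false, status: 429, retryAfter: 5 };
  }
  return { ok: true };
}
```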
Live streaming, VAD-segmented (WS …/stream?model=moonshine-base etc.):
Browser mic → AudioContext(16 kHz) → AudioWorklet → Float32 chunks
→ client.streamTranscribe()
→ WS binary frames (int16 LE PCM)
→ server: VadStream per connection
→ Silero VAD detects silence boundaries
→ Adapter.transcribeSamples(float32, 16000)
→ ws.send({ type: "transcript.final", text })
A connection holds one VadStream instance. On {type:"end"} or close,
the VAD is flushed one last time so in-flight audio still gets a final
transcript before the socket goes away.
Live streaming, native (WS …/stream?model=parakeet-realtime-eou):
Browser mic → AudioContext(16 kHz) → AudioWorklet → Float32 chunks
→ client.streamTranscribe()
→ WS binary frames (int16 LE PCM)
→ server: ParakeetRsAdapter.streamingTranscribe(pcm, signal)
→ opens WS to parakeet-realtime-server daemon
→ forwards PCM chunks as-is
→ relays partial + final frames (daemon's own EOU)
→ ws.send({ type: "transcript.partial" | "transcript.final", text })
No server-side VAD. The daemon emits partials continuously during speech and a final when it detects end-of-utterance on-device.
- One API in one place. Elysia (`.ws()` and REST on the same chain) replaced a hand-rolled `Bun.serve` router + separate websocket config. Routing, CORS, and upgrade validation live in one file.
- Registry-driven model config. Everything about a model — download source, expected files, VAD thresholds, sherpa CLI flags, VRAM estimate, capabilities — is data in `registry.ts`.
- Capability-driven streaming dispatch. The gateway picks the VAD path or the native-streaming path from the model's `capabilities` array. No per-model switch statements in the handler.
- Shared CUDA runtime. CUDA 12.x + cuDNN 9.x DLLs live in one place (`binaries/win32/cuda-runtime/`) rather than being duplicated next to each backend. Each adapter prepends it to `PATH` when spawning its subprocess, so Windows' DLL search picks them up transparently.
- p-queue per model, concurrency 1. A GPU daemon is never asked to process two REST requests at once. Concurrency is bounded at the gateway.
- VAD lives server-side (for VAD models). Clients send raw PCM; the server decides when a speech segment has ended. The client library is stateless about audio semantics — it just forwards bytes. Native streaming models bypass this entirely.
- Zero-dep client library. The TypeScript client only uses `fetch` + `WebSocket` globals and native `ReadableStream` async iteration, so one package works in browsers, Node 18+, Bun, and Deno without a bundler dance.
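The stream-consumption pattern the zero-dep design relies on can be shown in isolation; the event shape here is illustrative, not the client's actual type:

```typescript
// Transcript events delivered through a native ReadableStream, no library code.
type TranscriptEvent = { type: "transcript.partial" | "transcript.final"; text: string };

function eventsToStream(events: TranscriptEvent[]): ReadableStream<TranscriptEvent> {
  return new ReadableStream<TranscriptEvent>({
    start(controller) {
      for (const ev of events) controller.enqueue(ev);
      controller.close();
    },
  });
}

// Consume with the stream's reader. (In Bun / Node 18+ / Deno you can also
// `for await (const ev of stream)`, since ReadableStream is async-iterable there.)
async function collectFinals(stream: ReadableStream<TranscriptEvent>): Promise<string[]> {
  const finals: string[] = [];
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value.type === "transcript.final") finals.push(value.text);
  }
  return finals;
}
```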
```
bun test                    # all unit + integration (some skip without cached models)
bun test tests/gateway/     # just the gateway (fast, pure in-process)
bun run typecheck           # tsc --noEmit
```

End-to-end checks against real downloaded models:

```
bun run scripts/e2e-test.ts              # spins up the full server + client
bun run scripts/api-smoke-test.ts        # exercises every REST route + WS stream
bun run scripts/verify-concurrent-ws.ts  # N concurrent WS streams, different models
```

The e2e scripts reuse `.e2e-cache/` so they share downloaded weights across runs.
- The main entry point for the backend is `src/bun/index.ts`. Read it top-to-bottom and you'll see how config, lifecycle, adapters, and the gateway are wired.
- Everything the gateway exposes to the network lives in `src/bun/gateway/server.ts` — one file.
- The desktop UI is a plain React app living under `src/views/main-ui/` and talks to the backend over a typed RPC channel defined in `src/bun/rpc.ts`.
- The parakeet-realtime daemon is developed in a separate repo (`pauldaywork/parakeet-realtime-server`). `setup-binaries.ts` pulls its release zip; this repo just wires the adapter around it.
- `docs/superpowers/specs/` and `docs/superpowers/plans/` hold the design documents the project was built against — useful historical context if you want to understand why something is shaped the way it is.