Audio Models API — Desktop App

A local-first desktop daemon that runs open-source speech-to-text models on your own NVIDIA GPU and exposes them over a small HTTP + WebSocket API.

Bring your own GPU, pick a model from the UI, and your apps talk to http://localhost:8000 instead of a cloud endpoint. One-shot file transcription via REST, live streaming via WebSocket, no data leaves the machine.

What's in the box

  • Desktop app (Electrobun) — a small window with a model toggle list, GPU memory bar, and settings for cache path / API port / HuggingFace token. Downloads models on demand, stashes them under your cache dir, and starts/stops the inference daemons as you enable each model.
  • HTTP + WebSocket gateway on http://localhost:8000 — REST for one-shot file uploads, WS for live mic streaming. Same wire shape as OpenAI's transcription API (close enough that existing clients work).
  • TypeScript client library in clients/typescript — zero-dep, works in browsers / Node / Bun / Deno. See its own README for the full API.
  • Browser demo in examples/test-web-app — a Vite + React page that records from your mic and renders transcripts live against a single model. Doubles as the canonical example of how to use the client.
  • Compare-models demo in examples/compare-models — a Vite + React grid that fans one mic stream out to every ready WS-capable model simultaneously, with per-lane metrics and Markdown session reports. See its own README.

Under the hood there are two live paths:

  • VAD-segmented models (moonshine-base, parakeet-tdt-v2, cohere-transcribe): Silero VAD running on the server chunks incoming mic audio into speech segments, each segment goes through a model daemon for transcription, and the client gets one transcript.final per utterance.
  • Native-streaming models (parakeet-realtime-eou): the daemon does its own end-of-utterance detection and emits continuous partials plus utterance finals. The server just pipes PCM straight to it.

Each model's GPU memory footprint is tracked in one place, so enabling three models really does cost three models' worth of VRAM, with no surprises.

Supported models

All models live in src/bun/model-manager/registry.ts. Adding a new sherpa-onnx model is a registry edit, not a code edit (the sherpa-onnx CLI args are data-driven off that file).
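
As a rough illustration, a new entry might look like the sketch below; the field names are assumptions for illustration, not the actual shape defined in registry.ts:

const myNewModel = {
  id: "my-sherpa-model",                     // hypothetical model id
  backend: "sherpa-onnx",
  capabilities: ["rest", "ws"],
  vramEstimateMb: 1024,                      // feeds the UI's GPU memory bar
  download: { source: "huggingface", repo: "someone/my-sherpa-model" },
  sherpaArgs: { encoder: "encoder.onnx", decoder: "decoder.onnx" },
};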

| Model id | Backend | Capabilities | ~VRAM | Notes |
| --- | --- | --- | --- | --- |
| whisper-large-v3-turbo | whisper.cpp | REST | ~3 GB | Q4_K quant, best speed/accuracy balance |
| parakeet-tdt-v2 | sherpa-onnx | REST + WS (vad) | ~2 GB | NVIDIA Parakeet TDT, fast, English only |
| moonshine-base | sherpa-onnx | REST + WS (vad) | ~1 GB | Lightweight realtime model |
| cohere-transcribe | sherpa-onnx | REST + WS (vad) | ~5 GB | 2B params, 14 languages |
| parakeet-realtime-eou | parakeet-rs | REST + WS (streaming) | ~1.5 GB | Native streaming with partial hypotheses + on-device EOU |

Capability legend

  • rest — model can be hit via POST /v1/audio/transcriptions.
  • ws — model can be streamed to via WS /v1/audio/transcriptions/stream. Server-side Silero VAD chops the stream into utterances; client sees one transcript.final per segment.
  • ws-streaming — same endpoint, but the daemon owns streaming and emits continuous transcript.partial frames and utterance-aligned transcript.final frames. The server forwards PCM without running VAD.

Clients don't pick a mode — they just connect; the server dispatches to the VAD path or the streaming path based on the model's capabilities.
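
A minimal sketch of that dispatch, assuming a capabilities array that uses the legend above (the function name is illustrative):

type Capability = "rest" | "ws" | "ws-streaming";

function pickLivePath(capabilities: Capability[]): "vad" | "streaming" {
  if (capabilities.includes("ws-streaming")) return "streaming"; // daemon owns EOU
  if (capabilities.includes("ws")) return "vad";                 // server-side Silero VAD
  throw new Error("model has no WS capability");
}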

Requirements

  • Windows with an NVIDIA GPU (Linux/macOS are not currently supported — the bundled binaries are Windows-only).
  • Bun for running and building.
  • 7-Zip on PATH (needed by scripts/setup-binaries.ts to extract ffmpeg and the CUDA archive).
  • HuggingFace token (optional) — only needed for gated models.

Getting started

bun install
bun run scripts/setup-binaries.ts    # one-time: downloads native backends + CUDA DLLs
bun run dev                          # electrobun dev — builds + opens the desktop window

setup-binaries.ts populates binaries/win32/ with:

  • whisper-server/ — whisper.cpp whisper-server.exe + ggml DLLs
  • sherpa-onnx/sherpa-onnx-offline-websocket-server.exe + onnxruntime
  • parakeet-realtime/parakeet-realtime-server.exe (pulled from its standalone repo's GitHub release)
  • ffmpeg/ffmpeg.exe + ffmpeg DLLs
  • cuda-runtime/ — CUDA 12.x + cuDNN 9.x DLLs, shared by all three GPU backends. Every backend adapter prepends this directory to PATH at spawn time rather than each shipping its own copy.
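
A minimal sketch of that PATH prepend with Bun.spawn, assuming Windows ";" separators (both paths are placeholders):

const CUDA_RUNTIME_DIR = "binaries/win32/cuda-runtime"; // the real constant lives in src/bun/paths.ts

const daemon = Bun.spawn(["binaries/win32/whisper-server/whisper-server.exe"], {
  env: { ...process.env, PATH: `${CUDA_RUNTIME_DIR};${process.env.PATH}` },
});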

The first dev run creates config/settings.json with sensible defaults. Toggle a model on in the UI and it'll download (progress bar shows in the model card), load into VRAM, and start serving immediately. The API is live at http://localhost:8000 as soon as the main process boots.

Point a browser at it

cd examples/test-web-app
bun install
bunx vite
# open http://localhost:5173, pick moonshine-base, click Record

Or run the compare-models grid against every ready WS model at once:

cd examples/compare-models
bun install
bunx vite
# open http://localhost:5174, click Record

Build a packaged app

bun run build        # electrobun build — produces a distributable under build/

The HTTP + WebSocket API

Base URL: http://localhost:8000 (configurable via the Settings tab).

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /v1/models | List every registered model with current status and capabilities |
| GET | /v1/models/:id | Single model; includes download_progress while downloading |
| GET | /v1/gpu/status | GPU name + VRAM used/total/free |
| POST | /v1/audio/transcriptions | Multipart file + model fields. Returns { text }. Accepts any format ffmpeg can decode; max 100 MB. |
| WS | /v1/audio/transcriptions/stream?model=<id> | Live mic streaming. Send int16 LE PCM at 16 kHz mono as binary frames. Server sends JSON text frames: transcript.partial, transcript.final, error. Send {"type":"end"} to flush and close. |
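
A minimal one-shot upload sketch with fetch and FormData (the file name and model id are placeholders):

const form = new FormData();
form.append("file", Bun.file("speech.wav"));      // anything ffmpeg can decode
form.append("model", "whisper-large-v3-turbo");

const res = await fetch("http://localhost:8000/v1/audio/transcriptions", {
  method: "POST",
  body: form,
});
const { text } = await res.json();

And a bare-bones streaming sketch, assuming you already have 16 kHz mono Int16Array chunks from a capture pipeline (the TypeScript client wraps this protocol for you):

const ws = new WebSocket("ws://localhost:8000/v1/audio/transcriptions/stream?model=moonshine-base");
ws.onmessage = (e) => {
  const msg = JSON.parse(e.data as string);
  if (msg.type === "transcript.final") console.log(msg.text);
};
// for each captured chunk (Int16Array): ws.send(chunk.buffer)
// when done:                            ws.send(JSON.stringify({ type: "end" }))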

Errors use an OpenAI-compatible envelope:

{ "error": { "message": "...", "type": "invalid_request_error", "code": "invalid_model" } }

All error codes are listed in clients/typescript/src/errors.ts.
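
Continuing the upload sketch above, a hedged example of checking the envelope by hand (the client library surfaces these as AudioModelsApiError instead):

if (!res.ok) {
  const { error } = await res.json();
  // e.g. error.code === "invalid_model" for an unknown model id
  throw new Error(`${error.type}/${error.code}: ${error.message}`);
}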

Project layout

src/bun/                  Backend daemon (the thing `electrobun dev` boots)
├── index.ts              Entry: config → lifecycle → gateway → Electrobun window
├── types.ts              Shared types (ModelId, ModelConfig, Settings, Capability, ...)
├── rpc.ts                RPC schema for the desktop UI ↔ backend bridge
├── paths.ts              Static paths: APP_ROOT, BINARIES_DIR, ASSETS_DIR, CUDA_RUNTIME_DIR
├── gateway/
│   ├── server.ts         Elysia app: REST + WS routes, VAD vs streaming dispatch, queue
│   └── errors.ts         OpenAI-compatible error envelope helper
├── adapters/
│   ├── adapter.ts        The Adapter interface (start/stop/transcribe/streamingTranscribe?)
│   ├── whisper-cpp.ts    whisper.cpp subprocess adapter
│   ├── sherpa-onnx.ts    sherpa-onnx daemon adapter (registry-driven CLI args)
│   ├── parakeet-rs.ts    parakeet-realtime-server adapter (native streaming WS)
│   ├── vad.ts            Silero VAD stream wrapper (per server-side VAD session)
│   └── ffmpeg.ts         Audio → 16 kHz WAV conversion for REST uploads
├── model-manager/
│   ├── registry.ts       Static MODELS dict (the source of truth)
│   ├── lifecycle.ts      State machine per model (disabled → ready → error)
│   ├── orchestrator.ts   enable/disable/retry model flows
│   ├── downloader.ts     HuggingFace + GitHub release download + extract
│   └── vram.ts           nvidia-smi wrapper for GPU status
└── config/
    ├── store.ts          settings.json read/write
    └── credentials.ts    HuggingFace token storage

src/views/main-ui/        Desktop UI (React, runs in the Electrobun WebView)
├── index.tsx             Root
├── rpc.ts                Typed stub talking back to src/bun
├── types.ts              UI-local types
├── components/           ModelsTab, SettingsTab, ModelCard, VramBar, ...
└── hooks/                useModels, useGpu, useSettings — polls + RPC

clients/typescript/       Published client library (zero-dep)
├── src/
│   ├── client.ts         AudioModelsClient (list/get/transcribe/streamTranscribe)
│   ├── stream.ts         TranscriptionStream (native ReadableStream iterator)
│   ├── errors.ts         AudioModelsApiError + AudioModelsStreamError
│   ├── types.ts          Exported public types
│   └── index.ts          Barrel
├── test-integration.ts   End-to-end test against a real in-process gateway
└── README.md             Full API docs

examples/test-web-app/    Single-model browser demo (Vite + React)
├── src/
│   ├── App.tsx           Top-level layout + state wiring
│   ├── config.ts         Base URL
│   ├── components/       Controls, TranscriptPanel, EventLogPanel, ...
│   ├── hooks/            useStreamRecorder, useModels, useMicDevices, ...
│   └── audio/
│       └── pcmStream.ts  Owns the Web Audio API lifecycle (getUserMedia,
│                          AudioContext, AudioWorklet) and yields chunks
└── public/
    └── pcm-worklet.js    Audio worklet: buffers 128-sample quanta into
                           2048-sample frames for the hook to forward

examples/compare-models/  Fan-out browser demo — every ready WS model at once
├── src/                  Reducer-driven grid, per-lane metrics + event log
├── reports/              Markdown session reports written on each Stop
└── README.md             Run + manual test plan

tests/                    Unit + integration tests (bun test)
├── adapters/             ffmpeg, sherpa-onnx arg builder, vad, whisper-cpp
├── config/               settings store
├── gateway/              server + errors envelope (hits the live Elysia app)
├── model-manager/        lifecycle, registry, orchestrator, downloader
├── integration/          Real-daemon tests (skipped when models aren't cached)
└── rpc.test.ts           UI ↔ backend RPC contract

scripts/                  One-off tooling
├── setup-binaries.ts      Download + extract all native backends and CUDA DLLs
├── api-smoke-test.ts      Boots the full app, enables a model, exercises every route
├── e2e-test.ts            Full-stack server + client check
└── verify-concurrent-ws.ts  Opens N concurrent WS streams across different models

binaries/win32/           Pre-built native executables (populated by setup-binaries.ts)
├── whisper-server/        whisper.cpp whisper-server.exe + ggml DLLs
├── sherpa-onnx/           sherpa-onnx-offline-websocket-server.exe + onnxruntime
├── parakeet-realtime/     parakeet-realtime-server.exe (from standalone repo release)
├── ffmpeg/                ffmpeg.exe + ffmpeg DLLs
└── cuda-runtime/          Shared CUDA 12.x + cuDNN 9.x DLLs (prepended to PATH per-backend)
assets/                   Small bundled assets (Silero VAD model, test WAVs)

How a request actually flows

One-shot file upload (POST /v1/audio/transcriptions):

Client → Elysia handler → validate form → ffmpeg (if not WAV) → 16 kHz WAV
      → p-queue for the target model (concurrency 1)
      → Adapter.transcribe(wav)
        └── whisper.cpp:   spawn whisper-server, read stdout text
        └── sherpa-onnx:   short-lived WS to local daemon, send PCM, read JSON
        └── parakeet-rs:   POST multipart to daemon's /v1/audio/transcriptions
      → { text } response

The queue enforces backpressure with queue.size + queue.pending >= 5 (429 queue_full with Retry-After: 5) and a 60 s per-task timeout (504 queue_timeout). See src/bun/gateway/server.ts.
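
A minimal sketch of that policy, assuming p-queue's constructor-level timeout options (names are illustrative, not the gateway's actual code):

import PQueue from "p-queue";

const queue = new PQueue({ concurrency: 1, timeout: 60_000, throwOnTimeout: true });

function enqueue<T>(task: () => Promise<T>): Promise<T> {
  if (queue.size + queue.pending >= 5) {
    throw Object.assign(new Error("queue_full"), { status: 429, retryAfter: 5 });
  }
  return queue.add(task) as Promise<T>; // a timed-out task surfaces as 504 queue_timeout
}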

Live streaming, VAD-segmented (WS …/stream?model=moonshine-base etc.):

Browser mic → AudioContext(16 kHz) → AudioWorklet → Float32 chunks
           → client.streamTranscribe()
           → WS binary frames (int16 LE PCM)
           → server: VadStream per connection
                     → Silero VAD detects silence boundaries
                     → Adapter.transcribeSamples(float32, 16000)
                     → ws.send({ type: "transcript.final", text })

A connection holds one VadStream instance. On {type:"end"} or close, the VAD is flushed one last time so in-flight audio still gets a final transcript before the socket goes away.
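
Sketched roughly (VadStream's real interface lives in src/bun/adapters/vad.ts, so these names are assumptions):

async function onStreamEnd(
  vad: { flush(): Float32Array | null },
  transcribe: (samples: Float32Array, sampleRate: number) => Promise<string>,
  send: (frame: object) => void,
) {
  const tail = vad.flush(); // drain any utterance still in flight
  if (tail && tail.length > 0) {
    send({ type: "transcript.final", text: await transcribe(tail, 16000) });
  }
}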

Live streaming, native (WS …/stream?model=parakeet-realtime-eou):

Browser mic → AudioContext(16 kHz) → AudioWorklet → Float32 chunks
           → client.streamTranscribe()
           → WS binary frames (int16 LE PCM)
           → server: ParakeetRsAdapter.streamingTranscribe(pcm, signal)
                     → opens WS to parakeet-realtime-server daemon
                     → forwards PCM chunks as-is
                     → relays partial + final frames (daemon's own EOU)
                     → ws.send({ type: "transcript.partial" | "transcript.final", text })

No server-side VAD. The daemon emits partials continuously during speech and a final when it detects end-of-utterance on-device.
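
The relay amounts to piping binary frames one way and JSON frames the other; a rough sketch, with the daemon URL as a placeholder and open/close buffering elided:

function relay(client: WebSocket, daemonUrl: string) {
  const upstream = new WebSocket(daemonUrl);
  client.onmessage = (e) => {
    if (typeof e.data !== "string") upstream.send(e.data); // PCM passes through as-is
  };
  upstream.onmessage = (e) => client.send(e.data as string); // partial/final frames relayed
}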

Key architectural choices

  • One API in one place. Elysia (.ws() and REST on the same chain) replaced a hand-rolled Bun.serve router + separate websocket config. Routing, CORS, and upgrade validation live in one file.
  • Registry-driven model config. Everything about a model — download source, expected files, VAD thresholds, sherpa CLI flags, VRAM estimate, capabilities — is data in registry.ts.
  • Capability-driven streaming dispatch. The gateway picks the VAD path or the native-streaming path from the model's capabilities array. No per-model switch statements in the handler.
  • Shared CUDA runtime. CUDA 12.x + cuDNN 9.x DLLs live in one place (binaries/win32/cuda-runtime/) rather than being duplicated next to each backend. Each adapter prepends it to PATH when spawning its subprocess, so Windows' DLL search picks them up transparently.
  • p-queue per model, concurrency 1. A GPU daemon is never asked to process two REST requests at once. Concurrency is bounded at the gateway.
  • VAD lives server-side (for VAD models). Clients send raw PCM; the server decides when a speech segment has ended. The client library is stateless about audio semantics — it just forwards bytes. Native streaming models bypass this entirely.
  • Zero-dep client library. The TypeScript client only uses fetch + WebSocket globals and native ReadableStream async iteration, so one package works in browsers, Node 18+, Bun, and Deno without a bundler dance.
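
For the client side, a hedged usage sketch (the import specifier is a placeholder and option shapes are assumptions; see clients/typescript/README.md for the real API):

import { AudioModelsClient } from "audio-models-client"; // placeholder specifier

const client = new AudioModelsClient({ baseUrl: "http://localhost:8000" });
const stream = client.streamTranscribe({ model: "moonshine-base" }); // TranscriptionStream

for await (const event of stream) { // native ReadableStream async iteration
  if (event.type === "transcript.final") console.log(event.text);
}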

Running tests

bun test                     # all unit + integration (some skip without cached models)
bun test tests/gateway/      # just the gateway (fast, pure in-process)
bun run typecheck            # tsc --noEmit

End-to-end checks against real downloaded models:

bun run scripts/e2e-test.ts              # spins up the full server + client
bun run scripts/api-smoke-test.ts        # exercises every REST route + WS stream
bun run scripts/verify-concurrent-ws.ts  # N concurrent WS streams, different models

The e2e scripts reuse .e2e-cache/ so they share downloaded weights across runs.

Contributing / hacking on this

  • The main entry point for the backend is src/bun/index.ts. Read it top-to-bottom and you'll see how config, lifecycle, adapters, and the gateway are wired.
  • Everything the gateway exposes to the network lives in src/bun/gateway/server.ts — one file.
  • The desktop UI is a plain React app living under src/views/main-ui/ and talks to the backend over a typed RPC channel defined in src/bun/rpc.ts.
  • The parakeet-realtime daemon is developed in a separate repo (pauldaywork/parakeet-realtime-server). setup-binaries.ts pulls its release zip; this repo just wires the adapter around it.
  • docs/superpowers/specs/ and docs/superpowers/plans/ hold the design documents the project was built against — useful historical context if you want to understand why something is shaped the way it is.
