okvad

okvad is a unified multi-engine Voice Activity Detection (VAD) library for the web.

Source repository: https://github.com/hosamsh/okvad

Supports:

  • Three VAD algorithms: DSP (pure JavaScript), WebRTC (WebAssembly), Silero (ONNX neural network)
  • Automatic recording: Built-in audio recording with pre-roll capture and automatic segment splitting
  • Output formatting: Pre-configured presets for popular APIs (OpenAI, Azure, etc.)
  • Multiple build variants: ESM for bundlers (Vite, Webpack) and IIFE for plain HTML; works with vanilla JavaScript, React, Vue, Next.js, and any bundler
  • TypeScript ready: type definitions included

Installation

With Bundlers (Vite, Webpack, Next.js, etc.)

npm install hosamsh/okvad
# or
npm install git+https://github.com/hosamsh/okvad.git

okvad is currently distributed via GitHub. The commands above pull the library straight from https://github.com/hosamsh/okvad.

import { Vad, USE_CASES, ALGOS } from 'okvad';

const vad = new Vad({
  algo: ALGOS.WEBRTC,
  useCase: USE_CASES.STREAMING,
  onUtteranceEnd: (segment) => {
    console.log('Captured speech:', segment.duration.toFixed(2), 'seconds');
  }
});

await vad.start();

Plain HTML (No Build Step)

<script src="path/to/okvad-webrtc.browser.min.js"></script>
<script>
  const vad = new OkVad.Vad({ algo: 'webrtc' });
  vad.start();
</script>

Build Variants

Choose the variant that matches your needs:

| Variant | Size | Algorithms | Best for |
| --- | --- | --- | --- |
| `okvad/core` | 15 KB | DSP only | Minimal bundle size |
| `okvad/webrtc` | 75 KB | + WebRTC (WASM) | General purpose |
| `okvad/silero` | 25 KB + CDN | + Silero (ONNX) | Highest accuracy |
| `okvad` | 90 KB | All three | Full flexibility |
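The variant names above suggest subpath imports. Assuming the package exposes them as entry points (check the package's `exports` field for the exact names), picking the WebRTC-only build might look like:

```javascript
// Hypothetical subpath import matching the "okvad/webrtc" variant above.
import { Vad } from 'okvad/webrtc';

const vad = new Vad({ algo: 'webrtc' });
await vad.start();
```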

Quick Start

Basic Usage

import { Vad, ALGOS } from 'okvad';

const vad = new Vad({
  algo: ALGOS.WEBRTC,
  onUtteranceStart: () => console.log('Utterance detected'),
  onUtteranceEnd: (segment) => {
    console.log('Complete utterance:', segment);
    // segment contains: audioData, blob, url, duration, sampleRate
  }
});

await vad.start();
// ...later
await vad.stop();

Use Case Presets

Choose a preset optimized for your use case:

import { Vad, USE_CASES, ALGOS } from 'okvad';

// Real-time streaming (fast response, tolerates breathing pauses)
const vad1 = new Vad({
  algo: ALGOS.DSP,
  useCase: USE_CASES.STREAMING
});

// Speech transcription (captures complete sentences)
const vad2 = new Vad({
  algo: ALGOS.WEBRTC,
  useCase: USE_CASES.TRANSCRIPTION
});

// Voice commands (instant response for short utterances)
const vad3 = new Vad({
  algo: ALGOS.SILERO,
  useCase: USE_CASES.COMMANDS
});

Auto-Formatted Output for APIs

import { Vad, PRESETS } from 'okvad';

// OpenAI Realtime API (PCM16, 24kHz, base64)
const vad = new Vad({
  preset: PRESETS.OPENAI_REALTIME,
  onFrame: (result) => {
    if (result.smoothedSpeech) {
      websocket.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: result.audio  // Already formatted
      }));
    }
  }
});

Available delivery presets: PRESETS.OPENAI_REALTIME, PRESETS.OPENAI_REALTIME_MULAW_8K, PRESETS.OPENAI_REALTIME_TRANSCRIBE, PRESETS.OPENAI_TRANSCRIBE_BATCH, PRESETS.CUSTOM
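For context on what a preset like PRESETS.OPENAI_REALTIME produces, the "PCM16, base64" step can be sketched as a standalone helper. The library handles this internally; this function is purely illustrative:

```javascript
// Convert Float32 samples (range -1..1) to little-endian PCM16, then base64.
function float32ToPcm16Base64(samples) {
  const buf = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to a signed 16-bit integer.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  // Node: Buffer; in a browser you would base64-encode via btoa over a binary string.
  return Buffer.from(buf).toString('base64');
}
```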

Automatic Recording

const vad = new Vad({
  algo: 'webrtc',
  maxSegmentMs: 180000,  // Auto-split at 3 minutes
  onUtteranceEnd: (segment) => {
    // Download recording
    const link = document.createElement('a');
    link.href = segment.url;
    link.download = 'speech.wav';
    link.click();
  }
});

await vad.start();

Recording is automatically enabled whenever you provide onUtteranceEnd or onUtteranceChunk callbacks (unless you explicitly set enableRecording: false).
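Because `segment.blob` is a ready-made WAV file, batch transcription can be as simple as posting it with FormData. The sketch below assumes an OpenAI-style multipart endpoint and a hypothetical `OPENAI_API_KEY` variable:

```javascript
const vad = new Vad({
  algo: 'webrtc',
  onUtteranceEnd: async (segment) => {
    const form = new FormData();
    form.append('file', segment.blob, 'speech.wav');  // segment.blob is a WAV Blob
    form.append('model', 'whisper-1');
    const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
      method: 'POST',
      headers: { Authorization: `Bearer ${OPENAI_API_KEY}` },  // hypothetical key variable
      body: form,
    });
    console.log((await res.json()).text);
  },
});

await vad.start();
```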

Debug Logging

Enable verbose logs (prefixed with [VAD]) by either:

  • Passing debug: true to the Vad constructor for instance-scoped logging.
  • Calling setDebug(true) to enable logging globally (exported from okvad), and setDebug(false) to silence it again.
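In code, the two options look like this (`setDebug` is exported from the package root, per the note above):

```javascript
import { Vad, setDebug } from 'okvad';

setDebug(true);                        // global: all instances log with [VAD]
const vad = new Vad({ debug: true });  // or instance-scoped logging only
```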

API Reference

Constructor Options

new Vad({
  algo: 'dsp' | 'webrtc' | 'silero',     // Default: 'dsp'
  useCase: 'streaming' | 'transcription' | 'commands',  // Default: 'streaming'
  preset: PRESETS.OPENAI_REALTIME,        // Delivery preset (see PRESETS constants)
  sampleRate: number,                     // Default: 16000
  maxSegmentMs: number,                   // Default: 180000 (3 minutes)
  
  // Timing (milliseconds, auto-configured per algorithm)
  speechHangoverMs: number,               // Silence duration before ending speech
  preSpeechPadMs: number,                 // Required speech before detection
  
  // Callbacks
  onUtteranceStart: () => void,
  onUtteranceEnd: (segment) => void,      // Auto-enables recording
  onUtteranceChunk: (chunk) => void,      // Auto-enables recording for streaming STT
  onFrame: (result) => void,
  onError: (error) => void,
  
  // Algorithm-specific
  webrtcMode: 0 | 1 | 2 | 3,             // WebRTC aggressiveness
  positiveSpeechThreshold: number,       // Silero sensitivity
  
  // Advanced
  output: { format, encoding, sampleRate },
  debug: boolean
});

Methods

await vad.start()          // Start listening for speech
await vad.stop()           // Stop listening and cleanup
await vad.destroy()        // Release all resources
await vad.getPreRoll()     // Get audio captured before speech start
vad.createWavSegment()     // Manual recording encoder
vad.updateSettings({})     // Update settings dynamically
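A typical lifecycle tying these methods together might look like the following sketch (assuming `updateSettings` accepts a partial options object, as its name suggests):

```javascript
const vad = new Vad({ algo: 'dsp' });
await vad.start();                              // begin listening

vad.updateSettings({ speechHangoverMs: 500 });  // tune behavior on the fly
const preRoll = await vad.getPreRoll();         // audio buffered before speech began

await vad.stop();                               // stop listening
await vad.destroy();                            // release microphone and buffers
```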

Properties

vad.running               // Boolean: currently processing audio
vad.recording             // Boolean: currently recording (auto-recording only)
vad.vadAlgo               // String: active algorithm ('dsp'|'webrtc'|'silero')

Frame Result

{
  isSpeech: boolean,           // Raw VAD result
  smoothedSpeech: boolean,     // Smoothed result (after hangover)
  samples: Float32Array,       // Audio samples for frame
  audio: ArrayBuffer | string | Blob, // Processed audio in configured format
  probability: number,         // 0-1 (Silero only)
  energy: number               // Frame energy (DSP only)
}

Utterance Segment (Recording)

{
  audioData: Float32Array,    // Raw samples
  blob: Blob,                 // WAV file
  url: string,                // Object URL (download/playback)
  duration: number,           // Seconds
  sampleRate: number,         // Hz
  channels: number,           // Always 1
  samples: number             // Total count
}
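Since `url` is a ready object URL, a segment can be played back directly. Revoking the URL afterwards is standard object-URL hygiene (the library is not documented to do this for you) and avoids leaking memory:

```javascript
const vad = new Vad({
  algo: 'webrtc',
  onUtteranceEnd: (segment) => {
    const player = new Audio(segment.url);
    player.onended = () => URL.revokeObjectURL(segment.url);  // free the blob URL
    player.play();
  },
});
```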

TypeScript Support

Full type definitions included:

import type { VadOptions, UtteranceSegment, FrameResult } from 'okvad';

const options: VadOptions = {
  algo: 'webrtc',
  onUtteranceEnd: (segment: UtteranceSegment) => {}
};

Live Demo

Check the demos/ folder:

  • demos/record.html - Record speech with different algorithms
  • demos/stream.html - Real-time streaming example

Development

npm install
npm run build           # Build all variants
npm run build:types     # Generate TypeScript definitions
npm run lint            # Check code quality
npm run lint:fix        # Auto-fix issues

Browser Support

Requires:

  • Modern browser with Web Audio API (Chrome, Firefox, Safari, Edge)
  • HTTPS or localhost (microphone access required)
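These requirements can be checked up front with a small, framework-free guard (a sketch; `webkitAudioContext` is the legacy Safari name):

```javascript
// Returns true only in a browser with microphone + Web Audio support
// running in a secure context (HTTPS or localhost).
function canRunVad() {
  if (typeof window === 'undefined' || typeof navigator === 'undefined') return false;
  const hasMic = !!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia);
  const hasAudio = !!(window.AudioContext || window.webkitAudioContext);
  const secure = window.isSecureContext || window.location.hostname === 'localhost';
  return hasMic && hasAudio && secure;
}
```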

License

MIT License - See LICENSE

Third-Party Components

This library incorporates third-party code. See THIRD-PARTY-NOTICES.md and PATENTS.md for details.
