viam-modules/filtered-audio

Module filtered-audio

The filtered-audio module provides a model that filters audio input from a source microphone based on wake words.

Supported Platforms

  • Darwin ARM64
  • Linux x64
  • Linux ARM64

Models

This module provides the following model(s):

  • viam:filtered-audio:wake-word-filter

Model viam:filtered-audio:wake-word-filter

Configuration

The following attribute template can be used to configure this model:

For vosk:

{
  "source_microphone": "<AUDIO_IN NAME>",
  "wake_words": ["<word>"]
}

For openwakeword:

{
  "source_microphone": <AUDIO_IN NAME>,
  "detection_engine": "openwakeword",
  "oww_model_path": "<path or URL to .onnx model>"
}
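
Putting the attributes together, a hypothetical component entry in a machine configuration might look like the following. The component name, microphone name, and wake word are placeholders, and the entry mirrors the shape of the source-microphone example later in this document; note that use_grammar is set to false because fuzzy_threshold requires it:

```json
{
  "name": "wake-filter",
  "type": "audio_in",
  "model": "viam:filtered-audio:wake-word-filter",
  "attributes": {
    "source_microphone": "my-microphone",
    "wake_words": ["hey robot"],
    "vad_aggressiveness": 3,
    "silence_duration_ms": 900,
    "use_grammar": false,
    "fuzzy_threshold": 2
  }
}
```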

Configuration Attributes

The following attributes are available for the viam:filtered-audio:wake-word-filter model:

| Name | Type | Inclusion | Description |
|------|------|-----------|-------------|
| source_microphone | string | Required | Name of a Viam AudioIn component to receive and filter audio from. |
| detection_engine | string | Optional | Wake word detection engine to use. Options: vosk, openwakeword. Default: vosk. |
| vad_aggressiveness | int | Optional | Sensitivity of the WebRTC VAD (voice activity detection). A higher number is more restrictive in reporting speech, increasing missed detections; a lower number is less restrictive but may report background noise as speech. Range: 0-3. Default: 3. |
| silence_duration_ms | int | Optional | Milliseconds of continuous silence needed before speech is considered finished. Default: 900. |
| min_speech_ms | int | Optional | Minimum length (in milliseconds) a speech segment must be to be treated as valid speech; shorter sounds are ignored. Default: 300. |

Vosk Attributes

| Name | Type | Inclusion | Description |
|------|------|-----------|-------------|
| wake_words | string array | Required | Wake words to filter speech. All speech segments said after the wake words will be returned from get_audio. |
| vosk_model | string | Optional | Vosk model to use for speech recognition. Accepts a model name, directory path, or zip file path. Default: vosk-model-small-en-us-0.15. See the list of available models. For models larger than 1 GB, download manually and provide the file path. |
| use_grammar | bool | Optional | When true, Vosk uses grammar-constrained recognition limited to the wake words, which is more accurate for short wake words. When false, Vosk uses full transcription mode, which is more accurate for longer wake phrases (3+ words). Default: true. |
| vosk_grammar_confidence | float | Optional | Minimum confidence threshold (0.0-1.0) for wake word recognition. Lower-confidence matches are rejected. Default: 0.7. |
| fuzzy_threshold | int | Optional | Enables fuzzy wake word matching. The threshold (0-5) is the maximum number of character edits (insertions, deletions, substitutions) allowed between the transcript and the wake word. If not set, exact matching is used. Note: use_grammar must be set to false to use fuzzy matching. |

OpenWakeWord Attributes

These attributes apply when detection_engine is set to openwakeword.

| Name | Type | Inclusion | Description |
|------|------|-----------|-------------|
| oww_model_path | string | Required | Path or URL to a custom .onnx wake word model file. Local paths and HTTP/HTTPS URLs are supported. URL models are downloaded and cached in VIAM_MODULE_DATA. |
| oww_threshold | float | Optional | Detection confidence threshold (0.0-1.0). A higher value requires more confidence before triggering, reducing false positives. Default: 0.5. |

Source Microphone Requirements

The source microphone must provide audio in the following format:

| Requirement | Value | Description |
|-------------|-------|-------------|
| Codec | PCM16 | 16-bit PCM audio format |
| Sample Rate | 16000 Hz | Required for the Vosk model |
| Channels | 1 (Mono) | Stereo audio is not supported |
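
Given this format (16 kHz, mono, 16-bit samples), each second of audio occupies 16000 × 2 × 1 = 32,000 bytes. A small illustrative helper, not part of the module, for converting buffer sizes to durations:

```python
SAMPLE_RATE = 16000   # Hz, required by the filter
BYTES_PER_SAMPLE = 2  # PCM16 = 16-bit samples
CHANNELS = 1          # mono

BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 32000

def duration_ms(num_bytes: int) -> float:
    """Duration in milliseconds of a PCM16 mono 16 kHz byte buffer."""
    return num_bytes / BYTES_PER_SECOND * 1000

print(duration_ms(32000))  # 1000.0 -- one second of audio
```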

Example configuration for source microphone:

{
  "name": "my-microphone",
  "type": "audio_in",
  "model": "...",
  "attributes": {
    "sample_rate": 16000,
    "channels": 1
  }
}

Recommended Source Microphone: Use the viam:system-audio module, which supports resampling and can output 16 kHz mono PCM16 audio from any system microphone.

Training a Custom OpenWakeWord Model

To use detection_engine: openwakeword you need a custom .onnx model trained on your wake word. Use the openWakeWord automatic training notebook to generate one.

Once trained, set oww_model_path to the local path or a URL pointing to the .onnx file.

Fuzzy Wake Word Matching

The wake word filter supports fuzzy matching using Levenshtein distance (edit distance) via the rapidfuzz library. This improves accuracy when speech recognition produces slight variations (e.g., "hey robot" transcribed as "the robot").

Enabling Fuzzy Matching

To enable fuzzy wake word matching, add fuzzy_threshold to your configuration:

{
  "source_microphone": "mic",
  "wake_words": ["hey robot"],
  "fuzzy_threshold": 2
}

How It Works

Fuzzy matching compares the wake phrase against the first few words of the transcript. It measures how many character edits (insertions, deletions, substitutions) are needed to transform one into the other. This handles common speech-to-text errors such as "hey robot" being transcribed as "the robot" (edit distance 2: delete the leading "t" of "the" and insert a "y" after "he"):

| Transcribed | Wake Word | Distance | Match (threshold=2) |
|-------------|-----------|----------|---------------------|
| "the robot say something" | "hey robot" | 2 | ✓ |
| "hey Robert what time" | "hey robot" | 2 | ✓ |
| "a robot turn on lights" | "hey robot" | 3 | ✗ |
| "please hey robot do it" | "hey robot" | - | ✗ (not at start) |

Threshold Guidelines

| Threshold | Use Case |
|-----------|----------|
| 1 | Very strict - for short wake words or quiet environments |
| 2-3 | Recommended for most wake words |
| 4-5 | Lenient - for noisy environments (may increase false positives) |

get_audio()

The wake word filter implements the AudioIn get_audio() method:

Parameters

  • codec: Must be "pcm16". Other codecs are not supported.
  • duration_seconds: Use 0 for continuous streaming.
  • previous_timestamp_ns: Use 0 to start from current time.

Stream Behavior

The filter returns a continuous stream that:

  1. Monitors continuously for wake words using VAD (Voice Activity Detection) and the configured detection engine (Vosk or openWakeWord)
  2. Only yields chunks when a wake word is detected followed by speech
  3. Uses empty chunks to signal speech segment boundaries

Stream Protocol:

  • Normal chunks: Contain audio data (16kHz mono PCM16) for detected speech segments
  • Empty chunks: Signal the end of a speech segment (audio_data has length 0)

After yielding a speech segment and empty chunk, the filter resumes listening for the next wake word automatically.

Example Usage

Basic accumulation and processing:

# Get continuous stream
audio_stream = await filter.get_audio("pcm16", 0, 0)

segment = bytearray()

async for chunk in audio_stream:
    audio_data = chunk.audio.audio_data

    if len(audio_data) == 0:
        # Empty chunk = segment ended
        if segment:
            process_speech_segment(bytes(segment))
            segment.clear()
    else:
        # Normal chunk - accumulate audio
        segment.extend(audio_data)

Clients should continue consuming chunks even while processing previous segments to avoid stream disconnection.
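
One way to follow that advice is to hand each completed segment to a background task so the consuming loop never blocks. A minimal asyncio sketch, assuming the same chunk shape as the example above (chunk.audio.audio_data); the stream and handler names are illustrative:

```python
import asyncio

async def consume_segments(audio_stream, handle_segment):
    """Accumulate chunks into segments; process each segment off the hot loop."""
    segment = bytearray()
    pending: set[asyncio.Task] = set()
    async for chunk in audio_stream:
        data = chunk.audio.audio_data
        if len(data) == 0:  # empty chunk = segment boundary
            if segment:
                # Process in the background; keep consuming immediately.
                task = asyncio.create_task(handle_segment(bytes(segment)))
                pending.add(task)
                task.add_done_callback(pending.discard)
                segment = bytearray()
        else:
            segment.extend(data)
    if pending:  # drain any outstanding work when the stream ends
        await asyncio.gather(*pending)
```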

See examples/ directory for complete usage examples.

Do command

The wake word filter supports do_command() for pausing and resuming detection. This is useful for voice assistants that need to prevent the filter from detecting its own TTS (text-to-speech) output.

Supported Commands

| Command | Description |
|---------|-------------|
| pause_detection | Pauses wake word detection. Audio is still consumed but not processed. |
| resume_detection | Resumes wake word detection. |

Example Usage

# Pause detection before playing TTS audio
await filter.do_command({"pause_detection": None})

await audio_output.play(audio_data)

# Resume detection after TTS finishes
await filter.do_command({"resume_detection": None})
