Built to take advantage of Intel Core Ultra capabilities through OpenVINO:
- voice activity detection runs on CPU
- speech-to-text transcription can run through OpenVINO Whisper on GPU
- video detection can run on the NPU
- the LLM can run on GPU, NPU, or through an OpenAI-compatible API
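As a quick check of which of these devices OpenVINO can actually see on a given machine, a snippet like the following (assuming the `openvino` Python package is installed) lists the available targets:

```python
# List the inference devices OpenVINO can target on this machine (CPU, GPU, NPU, ...).
# Device names and availability depend on installed drivers and hardware.
import openvino as ov

core = ov.Core()
for device in core.available_devices:
    print(device, "-", core.get_property(device, "FULL_DEVICE_NAME"))
```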
The core of the project is robot.py. From an interactive console it can:
- load local LLMs from Hugging Face using OpenVINO GenAI
- use an external OpenAI-compatible model
- transcribe microphone input with classic Whisper or OpenVINO Whisper
- preload Whisper on startup to reduce first-use delay
- speak responses through multiple TTS backends
- run continuous auto-listen with Silero VAD
- react to camera presence events
- benchmark models and record metrics
- expose an OpenAI-compatible endpoint at `http://0.0.0.0:1311/v1/chat/completions`
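Once that server is running (see `/start_server` below), any OpenAI-compatible client should be able to talk to it. A minimal sketch with the official `openai` Python package, assuming the default port and that no real API key is enforced; the model name is a placeholder:

```python
# Query robot.py's local OpenAI-compatible endpoint on port 1311.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1311/v1", api_key="not-needed")  # key value assumed irrelevant
response = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whatever LLM is currently loaded
    messages=[{"role": "user", "content": "Hello, robot!"}],
)
print(response.choices[0].message.content)
```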
robot.py works as an interactive REPL:
- It loads configuration from `robot_config.json`.
- It loads model catalogs from `~/ov_models`.
- It preloads the configured Whisper backend.
- It tries to restore the previously used LLM.
- It waits for commands (`/models`, `/panel`, `/listen`, `/config`, etc.) or regular prompts.
- When it receives text:
  - it repeats it through TTS if `repeat=true`
  - otherwise it sends it to the active LLM and plays the response if audio is enabled
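Conceptually the loop is a simple read-and-dispatch cycle. A simplified sketch of that idea (not the actual implementation; the `repeat` and `audio` keys and the handler names are illustrative):

```python
# Simplified REPL dispatch: slash-prefixed input is a command, everything else
# is either echoed through TTS (repeat mode) or sent to the active LLM.
def repl(config, handle_command, ask_llm, speak):
    while True:
        text = input("> ").strip()
        if not text:
            continue
        if text.startswith("/"):
            if text == "/exit":
                break
            handle_command(text)
        elif config.get("repeat"):
            speak(text)
        else:
            reply = ask_llm(text)
            print(reply)
            if config.get("audio"):
                speak(reply)
```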
It also supports:
- manual `/listen` mode with `SPACE` start/stop and `ESC` exit
- continuous `/auto_listen on` mode with Silero VAD (see the sketch after this list)
- an optional `/panel opencv` or `/panel qt` window for rendering the robot avatar, camera preview, toggles, and VAD bars
- headless camera/vision processing even when the panel is closed
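For auto-listen, Silero VAD is what decides where speech starts and stops. A standalone sketch of that segmentation step using the `silero-vad` package on a 16 kHz mono recording (file name and parameters are illustrative; the assistant applies the same idea to live microphone audio):

```python
# Find speech segments in a 16 kHz mono recording with Silero VAD.
# Requires the `silero-vad` package (which pulls in torch).
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
audio = read_audio("recording.wav", sampling_rate=16000)
segments = get_speech_timestamps(audio, model, sampling_rate=16000, return_seconds=True)
for seg in segments:
    print(f"speech from {seg['start']:.2f}s to {seg['end']:.2f}s")
```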
The optional control panel shows a robot avatar, camera area, runtime switches, and audio/VAD bars.
With a face detection model enabled, the assistant can:
- detect when people appear in the camera
- greet people when they arrive
- say contextual lines when the number of visible people changes
- say lines when it is left alone
- interrupt its own audio and say "me cayo" if everyone disappears from the camera while it is speaking
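At its core this is a reaction to changes in the number of detected faces. A schematic sketch of that logic (the hooks, spoken lines, and function name are illustrative, not the actual robot.py code):

```python
# Schematic presence-event logic: compare the current face count with the previous one
# and trigger a spoken reaction. `speak` and `stop_audio` stand in for the real TTS hooks.
def on_people_count(count, previous, speaking, speak, stop_audio):
    if previous == 0 and count > 0:
        speak("Hello there!")                      # someone arrived
    elif count == 0 and previous > 0:
        if speaking:
            stop_audio()
            speak("me cayo")                       # everyone left while it was talking
        else:
            speak("Alone again.")                  # left alone
    elif count != previous:
        speak(f"Now I can see {count} of you.")    # group size changed
```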
The camera worker is independent from the panel. That means /camera on, /vision on, and /vision_events on can keep running without rendering the panel window.
- Local via `openvino_genai.LLMPipeline` (see the sketch after this list)
- External via an OpenAI-compatible API
- `openai-whisper`
- `openvino_genai.WhisperPipeline`
- Silero VAD for auto-listen segmentation
- Windows SAPI on Windows
- OpenVINO Text2SpeechPipeline
- Kokoro ONNX
- BabelVox
- eSpeak NG
- Hume TADA voice-conditioned TTS with reference audio
- Reference voice capture from the microphone for Hume TADA
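The local pipelines come from OpenVINO GenAI. A minimal sketch of loading an exported chat model and an exported Whisper model (directory names, devices, and the sample file are placeholders for entries from the model catalogs):

```python
# Run a local LLM and OpenVINO Whisper through openvino_genai.
# The model directories are placeholders for OpenVINO-exported models (e.g. via optimum-cli).
import librosa
import openvino_genai

llm = openvino_genai.LLMPipeline("ov_models/phi-4-mini-int4", "GPU")
print(llm.generate("Say hello in one sentence.", max_new_tokens=64))

whisper = openvino_genai.WhisperPipeline("ov_models/whisper-base", "CPU")
speech, _ = librosa.load("sample.wav", sr=16000)  # Whisper expects 16 kHz mono audio
print(whisper.generate(speech.tolist()))
```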
- Local LLM chat through OpenVINO GenAI on CPU, GPU, NPU, or AUTO
- External LLM chat through an OpenAI-compatible endpoint
- Classic Whisper STT and OpenVINO Whisper STT
- Whisper preload on startup
- Continuous auto-listen with Silero VAD
- Streaming TTS while the LLM is still generating (sketched after this list)
- Multiple TTS backends: Windows SAPI, OpenVINO, Kokoro, BabelVox, eSpeak NG, Hume TADA
- Experimental Hume TADA backend that can synthesize new text conditioned on a reference voice clip plus transcript
- Optional control panel with robot avatar, camera preview, switches, and VAD bars
- Camera presence detection and reactive voice behavior
- Face detection through OpenVINO vision models
- Vision event logging and throttled console debugging
- Headless camera/vision processing without opening the panel
- Benchmarking and per-device compatibility tracking
- OpenAI-compatible local server on port `1311`
- OS-specific install scripts and requirements for Windows and Linux
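Streaming TTS, mentioned above, works by speaking complete sentences as soon as the LLM emits them instead of waiting for the full answer. A rough sketch of that pattern using the openvino_genai streamer callback (the sentence splitting and the `speak` stand-in are illustrative):

```python
# Speak finished sentences while the LLM is still generating, via the streamer callback.
import openvino_genai

def make_streamer(speak):
    buffer = []

    def on_token(token: str) -> bool:
        buffer.append(token)
        text = "".join(buffer)
        if text.rstrip().endswith((".", "!", "?")):  # naive sentence boundary
            speak(text.strip())
            buffer.clear()
        return False  # False tells the pipeline to keep generating

    return on_token

pipe = openvino_genai.LLMPipeline("ov_models/phi-4-mini-int4", "CPU")  # placeholder path
pipe.generate("Tell me a two-sentence story.", max_new_tokens=128,
              streamer=make_streamer(print))  # print stands in for the TTS backend
```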
- `robot.py`: main application
- `robot_config.json`: persisted configuration
- `AGENTS.md`: repository context for coding agents
- `vision_models.json`: vision model catalog
- `ov_models/models.json`: LLM model catalog
- `requirements-tada.txt`: optional Hume TADA stack
TADA note:
- `requirements-tada.txt` is optional on purpose
- it installs the extra stack needed for the Hume TADA backend
Example of the application after loading the Phi-4 model on the Intel NPU.
Example showing NPU usage while the assistant is running a model.
The model selection list used to choose which LLM to load.
Example of a chat session in the interactive console.
The repository ships OS-specific dependency files: `requirements-windows.txt` for Windows and `requirements-linux.txt` for Linux.
Also:
- `espeakng` requires the `espeak-ng` executable
- Linux also needs the system libraries required by `sounddevice` and PortAudio
- gated or private Hugging Face models use `~/ov_models/hf_auth.json`

Expected Hugging Face token format: `{"hf_token": "hf_xxx"}`
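One way to consume that file when downloading gated or private models (a sketch; whether robot.py does exactly this internally is not shown here) is to pass the token to the standard `huggingface_hub` login helper:

```python
# Read the stored Hugging Face token and authenticate huggingface_hub with it.
import json
from pathlib import Path

from huggingface_hub import login

auth_path = Path.home() / "ov_models" / "hf_auth.json"
token = json.loads(auth_path.read_text())["hf_token"]
login(token=token)
```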
Create and activate a compatible Python environment, install the dependencies for your OS, and then run the app.

Windows:

```
pip install -r .\requirements-windows.txt
python .\robot.py
```

Linux: install espeak-ng and the PortAudio development/runtime packages with your package manager first, then:

```
pip install -r ./requirements-linux.txt
python ./robot.py
```

For a first session, the recommended flow is:
- Run `/models`
- Choose a local LLM or configure `/llm_backend external`
- Adjust audio and STT settings with `/config`
- Optionally run `/panel`
- Optionally enable `/camera on`, `/vision on`, and `/vision_events on`
- Try `/listen`, `/auto_listen on`, or type prompts directly
- `/help`
- `/models`
- `/add_model`
- `/delete`
- `/config`
- `/voices`
- `/llm_backend local|external`
- `/tts_backend windows|openvino|kokoro|babelvox|espeakng|tada`
- `/audio <on|off>`
- `/audio_inputs`
- `/audio_input_select`
- `/audio_monitor <on|off>`
- `/panel`
- `/camera <on|off>`
- `/vision <on|off>`
- `/vision_events <on|off>`
- `/vision_models`
- `/vision_select`
- `/vision_model`
- `/vision_labels`
- `/vision_device <name>`
- `/log <on|off|seconds>`
- `/repeat <true|false>`
- `/listen`
- `/auto_listen <on|off>`
- `/whisper_models`
- `/whisper_add`
- `/whisper_select`
- `/openvino_tts_models`
- `/openvino_tts_add`
- `/openvino_tts_select`
- `/kokoro_models`
- `/kokoro_select`
- `/babelvox_models`
- `/babelvox_select`
- `/stats`
- `/all_models`
- `/clear_stats`
- `/benchmark`
- `/start_server`
- `/exit`
The main configuration lives in `robot_config.json`. Among other things, it stores:
- LLM backend (`local` or `external`)
- current model and device (`CPU`, `GPU`, `NPU`, `AUTO`)
- TTS backend
- Whisper backend and model settings
- audio on/off
- TTS streaming
- system prompt
- `max_new_tokens`
- camera, panel, and vision options
- auto-listen and Silero VAD settings
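Since it is plain JSON, the file can be inspected or edited directly; the exact key names are internal to robot.py, so the safest quick look is simply to pretty-print whatever is there:

```python
# Dump the persisted configuration without assuming any particular key names.
import json
from pathlib import Path

config = json.loads(Path("robot_config.json").read_text())
print(json.dumps(config, indent=2, ensure_ascii=False))
```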
Catalogs, metrics, and compatibility data live under ~/ov_models.
The automated test suite lives under `tests` and uses pytest.
Run:
pytest




