Mentioned in Awesome OpenVINO

Stormvino

OpenAI-compatible LLM server for Intel Arc GPUs. Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any client that accepts a base_url. No NVIDIA required.
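Because the server speaks the OpenAI API, a client only needs a base URL. The sketch below uses the Python standard library rather than any SDK. It assumes the server runs on the same machine on the default port 11435, and it borrows a model name from the tested-models table further down; the helper names `build_chat_request` and `chat` are made up for illustration.

```python
import json
import urllib.request

# Assumption: server on this machine, default port, OpenAI-style /v1 prefix.
BASE_URL = "http://localhost:11435/v1"


def build_chat_request(prompt, model="qwen3-8b-int4-ov", base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. An OpenAI SDK client pointed at the same base URL would be the more usual choice; the standard-library version is shown only to keep the example dependency-free.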


Hardware compatibility

| GPU | VRAM | Status | Notes |
| --- | --- | --- | --- |
| Arc B60 | 24 GB | ✅ Production | EnvyStorm reference machine |
| Arc B50 | 16 GB | 🔜 Testing | TinyB (install in progress) |
| Arc B65 | TBD | 🔜 Planned | Next after B50 confirmed |
| Arc B70 | TBD | 🔜 Planned | |
| Other Arc | any | ⚙️ Auto-tuned | VRAM detected at runtime |

OS: Linux Mint 22.x / Ubuntu 24.04 (Noble). Kernel: linux-oem-24.04 required for Battlemage (B-series) GPUs. System RAM: 16 GB minimum. Disk: 50 GB+ for a useful model set.


Install paths — pick one

🤖 Claude Code (recommended for single machine)

Fully automated. Claude Code (CC) asks three questions and handles everything, including the mandatory kernel reboot. You watch.

Step 1 — Install Claude Code if you haven't:

npm install -g @anthropic-ai/claude-code

Step 2 — Clone the repo and start CC in it:

git clone https://github.com/Jermalk/stormvino.git /opt/ov_server
cd /opt/ov_server
claude

Step 3 — In the CC chat, type exactly:

Run the Stormvino installation runbook. @CC_INSTALL.md

The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed. CC reads it and takes over. Answer the 3 questions it asks, then watch.

→ See CC_INSTALL.md for what CC does at each phase.

⚙️ Ansible (recommended for multiple machines / repeatable deploys)

One command installs on any number of Arc machines simultaneously. Detects GPU VRAM at runtime and tunes config automatically. Fully headless — handles reboots without human intervention.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
# edit vars/main.yml (3 lines) — then:
ansible-playbook -i hosts.yml stormvino.yml

→ See ANSIBLE.md for the full plan and current implementation status.

📖 Manual (full control, learn every step)

Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.

git clone https://github.com/Jermalk/stormvino.git
cd stormvino
./install.sh    # detects hardware, routes to the right path

→ See INSTALL.md.


What you get

| Endpoint | Description |
| --- | --- |
| POST /v1/chat/completions | OpenAI-compatible chat, streaming supported |
| POST /v1/embeddings | Sentence embeddings (multilingual-e5-large) |
| GET /v1/models | List discovered models |
| POST /v1/images/generations | Image generation (SDXL, optional) |
| POST /v1/audio/transcriptions | Speech-to-text (Whisper, optional) |
| POST /v1/audio/speech | Text-to-speech (Kokoro / Piper, optional) |
| GET /health | Server health + loaded models + VRAM stats |
| GET /monitor | Web dashboard: live VRAM, throughput, request log |

Default port: 11435. Accessible over LAN.
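Streaming on the chat endpoint can be consumed line by line. The sketch below assumes the server follows the usual OpenAI streaming convention (server-sent events, one `data: {...}` JSON chunk per line, terminated by `data: [DONE]`); that is not confirmed here, and `iter_stream_text` is a hypothetical helper name. The sample lines are hand-written, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece


# Hand-written sample in the assumed format.
sample = [
    b'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n',
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample)))  # Hello
```

Against a live server, the same generator can be fed the response object returned by `urllib.request.urlopen` for a request whose body sets `"stream": true`, since that object iterates over raw lines.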

Tested models (B60 / 24 GB VRAM)

| Model | VRAM | Role |
| --- | --- | --- |
| qwen3-14b-int4-ov | 9.1 GB | Default: reasoning, coding, chat |
| qwen3-8b-int4-ov | 4.6 GB | Agent turns, fast responses |
| multilingual-e5-large-int8 | 563 MB | Embeddings + task routing |
| whisper-large-v3-int8-ov | ~2 GB | Speech-to-text |
| qwen2.5-vl-7b-int4-ov | ~5 GB | Vision: image understanding |

→ See MODELS.md for conversion instructions and VRAM budget tables.
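As a rough illustration of budgeting, the figures in the table can simply be added up and compared with the card's VRAM. This is only arithmetic on the numbers above, not the method used in MODELS.md. The approximate entries (~2 GB, ~5 GB) are taken at face value, and KV cache and runtime overhead are ignored unless passed in through the hypothetical `reserve_gb` parameter.

```python
# VRAM figures copied from the tested-models table (GB).
MODEL_VRAM_GB = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
    "multilingual-e5-large-int8": 0.563,
    "whisper-large-v3-int8-ov": 2.0,  # listed as ~2 GB
    "qwen2.5-vl-7b-int4-ov": 5.0,     # listed as ~5 GB
}


def vram_needed(models, reserve_gb=0.0):
    """Sum the listed weights, plus an optional reserve for KV cache etc."""
    return sum(MODEL_VRAM_GB[name] for name in models) + reserve_gb


def fits(models, vram_gb, reserve_gb=0.0):
    """True if the selected models plus the reserve fit in vram_gb."""
    return vram_needed(models, reserve_gb) <= vram_gb


everything = list(MODEL_VRAM_GB)
print(f"{vram_needed(everything):.3f} GB")  # 21.263 GB
print(fits(everything, vram_gb=24))         # True
print(fits(everything, vram_gb=16))         # False
```

By these numbers all five models total about 21.3 GB, which leaves under 3 GB of a 24 GB card for everything else, so a real deployment would load fewer models at once.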


Quick health check

curl -s http://localhost:11435/health | python3 -m json.tool
curl -s http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'
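The same health check can be scripted, for example to wait for the server after a restart. This is a minimal sketch using only the standard library. The function names are hypothetical, and the response is treated as opaque JSON because the exact fields of the /health body are not listed here.

```python
import json
import time
import urllib.error
import urllib.request


def get_health(base="http://localhost:11435", timeout=5):
    """Return the parsed JSON body of GET /health."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(base="http://localhost:11435", attempts=30, delay=2.0):
    """Poll /health until it answers; return the body, or None if it never does."""
    for _ in range(attempts):
        try:
            return get_health(base)
        except (urllib.error.URLError, OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: retry after a pause.
            time.sleep(delay)
    return None
```

Calling `wait_until_healthy()` blocks for at most roughly `attempts * delay` seconds plus per-request timeouts, then returns either the parsed health body or `None`.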

Libraries stack

Inference (server runtime)

| Library | Version |
| --- | --- |
| openvino | 2026.1.0 |
| openvino-genai | 2026.1.0.0 |
| openvino-tokenizers | 2026.1.0.0 |
| optimum-intel | 1.27.0 |
| optimum | 2.1.0 |
| transformers | 4.57.6 |
| tokenizers | 0.22.2 |

Model conversion (offline, via optimum-cli)

| Library | Version |
| --- | --- |
| nncf | 3.1.0 |
| onnx | 1.21.0 |
| onnxruntime | 1.25.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 0.36.2 |

Configuration

Runtime settings live in config.json. The installers auto-patch the key settings below based on the detected GPU VRAM:

| Key | Description |
| --- | --- |
| device | OpenVINO device, auto-detected (e.g. GPU.1) |
| kv_cache_size_gb | KV cache per model, tuned to VRAM tier |
| max_loaded_models | Models held in VRAM simultaneously |
| default_model | Model used when client doesn't specify |
| embedding_model | Embedding model directory name |
| postgres_dsn | Observability database connection string |

Full reference: INSTALL.md § Phase 7.
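To make the shape of the file concrete, here is a hypothetical config.json containing the six keys from the table, with a small check that none are missing. Only the key names, the `GPU.1` device example, and the model names come from this README. The numeric values, the DSN, and the flat layout of the file are placeholders and assumptions, and the real file may contain more settings.

```python
import json

# Key names taken from the configuration table.
EXPECTED_KEYS = {
    "device",
    "kv_cache_size_gb",
    "max_loaded_models",
    "default_model",
    "embedding_model",
    "postgres_dsn",
}

# Placeholder values for illustration only; the installers pick the real ones.
EXAMPLE_CONFIG = """
{
  "device": "GPU.1",
  "kv_cache_size_gb": 4,
  "max_loaded_models": 2,
  "default_model": "qwen3-14b-int4-ov",
  "embedding_model": "multilingual-e5-large-int8",
  "postgres_dsn": "postgresql://user:password@localhost:5432/stormvino"
}
"""


def missing_keys(config):
    """Return the expected keys that are absent from a parsed config dict."""
    return sorted(EXPECTED_KEYS - set(config))


config = json.loads(EXAMPLE_CONFIG)
print(missing_keys(config))  # []
```

To check a real installation, replace `json.loads(EXAMPLE_CONFIG)` with `json.load(open("config.json"))` run from the server directory.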


Architecture

| Layer | Component |
| --- | --- |
| HTTP | FastAPI + Uvicorn, single worker |
| LLM inference | openvino_genai.LLMPipeline, executor-offloaded |
| VLM inference | openvino_genai.VLMPipeline |
| Embeddings | OVModelForFeatureExtraction (optimum-intel) |
| Task routing | Embedding similarity + signal detection |
| STT | openvino_genai.WhisperPipeline |
| TTS | Kokoro-ONNX (EN) + Piper (PL) |
| Observability | PostgreSQL 16 + pgvector |
| Monitor UI | Svelte + uPlot |
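
"Executor-offloaded" inference in a single-worker async server usually means the blocking generate call runs in a thread pool so the event loop stays free for other requests. The sketch below shows that general pattern with plain asyncio. It is not Stormvino's code: `blocking_generate` is a hypothetical stand-in for a real pipeline call, and the single-thread pool is an assumption.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: one worker thread, so generations run one at a time.
_executor = ThreadPoolExecutor(max_workers=1)


def blocking_generate(prompt):
    """Hypothetical stand-in for a blocking inference call."""
    time.sleep(0.05)
    return f"echo: {prompt}"


async def generate(prompt):
    """Run the blocking call in the executor without stalling the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_generate, prompt)


async def demo():
    """Issue two generations while a heartbeat task keeps ticking."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    hb = asyncio.create_task(heartbeat())
    replies = await asyncio.gather(generate("a"), generate("b"))
    hb.cancel()
    return replies, ticks
```

Running `asyncio.run(demo())` returns both replies together with a heartbeat count above one, which shows the loop kept scheduling other work while the blocking calls ran.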

Hardware reports welcome

Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue with the GPU model, VRAM, kernel version, and tokens/sec. Every report helps build the matrix for everyone.


Origin

Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.

The philosophy is unchanged: build the simplest thing that gives full visibility first, tune quality only after you can observe it.

Built with Claude Code.

About

OpenAI API server for OpenVINO, for when OVMS is too big
