OpenAI-compatible LLM server for Intel Arc GPUs.
Runs local inference via OpenVINO. Speaks the OpenAI API — drop it behind any
client that accepts a base_url. No NVIDIA required.
| GPU | VRAM | Status | Notes |
|---|---|---|---|
| Arc B60 | 24 GB | ✅ Production | EnvyStorm reference machine |
| Arc B50 | 16 GB | 🔜 Testing | TinyB — install in progress |
| Arc B65 | TBD | 🔜 Planned | Next after B50 confirmed |
| Arc B70 | TBD | 🔜 Planned | |
| Other Arc | any | ⚙️ Auto-tuned | VRAM detected at runtime |
OS: Linux Mint 22.x / Ubuntu 24.04 (Noble). Kernel: `linux-oem-24.04`, required for Battlemage (B-series) GPUs. System RAM: 16 GB minimum. Disk: 50 GB+ for a useful model set.
Fully automated. Claude Code (CC) asks 3 questions and handles everything, including the mandatory kernel reboot. You watch.
Step 1. Install Claude Code if you haven't:

    npm install -g @anthropic-ai/claude-code

Step 2. Clone the repo and start CC in it:

    git clone https://github.com/Jermalk/stormvino.git /opt/ov_server
    cd /opt/ov_server
    claude

Step 3. In the CC chat, type exactly:

    Run the Stormvino installation runbook. @CC_INSTALL.md

The @CC_INSTALL.md mention loads the runbook directly — no file dragging needed.
CC reads it and takes over. Answer the 3 questions it asks, then watch.
→ See CC_INSTALL.md for what CC does at each phase.
One command installs on any number of Arc machines simultaneously. Detects GPU VRAM at runtime and tunes config automatically. Fully headless: handles reboots without human intervention.

    git clone https://github.com/Jermalk/stormvino.git
    cd stormvino
    # edit vars/main.yml (3 lines), then:
    ansible-playbook -i hosts.yml stormvino.yml

→ See ANSIBLE.md for the full plan and current implementation status.
Step-by-step guide with a verification test between every phase. Covers kernel, drivers, Python env, PostgreSQL, models, and systemd services.

    git clone https://github.com/Jermalk/stormvino.git
    cd stormvino
    ./install.sh   # detects hardware, routes to the right path

→ See INSTALL.md.
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | OpenAI-compatible chat, streaming supported |
| `POST /v1/embeddings` | Sentence embeddings (multilingual-e5-large) |
| `GET /v1/models` | List discovered models |
| `POST /v1/images/generations` | Image generation (SDXL, optional) |
| `POST /v1/audio/transcriptions` | Speech-to-text (Whisper, optional) |
| `POST /v1/audio/speech` | Text-to-speech (Kokoro / Piper, optional) |
| `GET /health` | Server health + loaded models + VRAM stats |
| `GET /monitor` | Web dashboard: live VRAM, throughput, request log |
Default port: 11435. Accessible over LAN.
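As a rough illustration of calling the chat endpoint from Python, the sketch below builds the request with the standard library only. It assumes the server is on the same machine on the default port and that it accepts the usual OpenAI request fields (`model`, `messages`, `stream`); the helper names `build_chat_request` and `send` are hypothetical and not part of Stormvino.

```python
import json
import urllib.request

# Assumption: server on this machine, default port from the text above.
BASE_URL = "http://localhost:11435"


def build_chat_request(prompt: str, model: str = "qwen3-8b-int4-ov",
                       stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send a non-streaming request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, `send(build_chat_request("Hello"))` should return an OpenAI-style response dictionary. For LAN access, replace `localhost` in `BASE_URL` with the server's address.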
| Model | VRAM | Role |
|---|---|---|
| `qwen3-14b-int4-ov` | 9.1 GB | Default: reasoning, coding, chat |
| `qwen3-8b-int4-ov` | 4.6 GB | Agent turns, fast responses |
| `multilingual-e5-large-int8` | 563 MB | Embeddings + task routing |
| `whisper-large-v3-int8-ov` | ~2 GB | Speech-to-text |
| `qwen2.5-vl-7b-int4-ov` | ~5 GB | Vision: image understanding |
→ See MODELS.md for conversion instructions and VRAM budget tables.
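To get a feel for the VRAM arithmetic, the sketch below sums the model sizes from the table and compares the total with a GPU's VRAM. The figures are copied from the table (the `~` entries are approximate). The `reserve_gb` parameter is a hypothetical allowance for KV cache and runtime overhead, which the table does not quantify. MODELS.md remains the authoritative budget.

```python
# Model sizes in GB, copied from the table above ("~" values are approximate).
MODEL_VRAM_GB = {
    "qwen3-14b-int4-ov": 9.1,
    "qwen3-8b-int4-ov": 4.6,
    "multilingual-e5-large-int8": 0.563,
    "whisper-large-v3-int8-ov": 2.0,
    "qwen2.5-vl-7b-int4-ov": 5.0,
}


def weights_total_gb(models: list[str]) -> float:
    """Sum the table's VRAM figures for the given model names."""
    return sum(MODEL_VRAM_GB[name] for name in models)


def fits(models: list[str], gpu_vram_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus a reserve fit into the GPU's VRAM.

    reserve_gb is a hypothetical allowance for KV cache and overhead.
    """
    return weights_total_gb(models) + reserve_gb <= gpu_vram_gb
```

For example, the two Qwen3 models plus the embedding model come to about 14.3 GB of weights. That fits a 16 GB card only if less than roughly 1.7 GB is reserved for everything else, while all five models (about 21.3 GB) need the 24 GB tier.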
Check that the server is up:

    curl -s http://localhost:11435/health | python3 -m json.tool

Send a chat request:

    curl -s http://localhost:11435/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen3-8b-int4-ov","messages":[{"role":"user","content":"Hello"}]}'

Inference (server runtime)
| Library | Version |
|---|---|
| openvino | 2026.1.0 |
| openvino-genai | 2026.1.0.0 |
| openvino-tokenizers | 2026.1.0.0 |
| optimum-intel | 1.27.0 |
| optimum | 2.1.0 |
| transformers | 4.57.6 |
| tokenizers | 0.22.2 |
Model conversion (offline, via optimum-cli)
| Library | Version |
|---|---|
| nncf | 3.1.0 |
| onnx | 1.21.0 |
| onnxruntime | 1.25.0 |
| safetensors | 0.7.0 |
| huggingface_hub | 0.36.2 |
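One way to check an environment against these pins is to compare them with the installed package metadata. The sketch below uses `importlib.metadata` from the standard library. The pin dictionaries are copied from the two tables, and the helper names are hypothetical. It compares version strings exactly, which may be stricter than the project requires.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the tables above.
RUNTIME_PINS = {
    "openvino": "2026.1.0",
    "openvino-genai": "2026.1.0.0",
    "openvino-tokenizers": "2026.1.0.0",
    "optimum-intel": "1.27.0",
    "optimum": "2.1.0",
    "transformers": "4.57.6",
    "tokenizers": "0.22.2",
}
CONVERSION_PINS = {
    "nncf": "3.1.0",
    "onnx": "1.21.0",
    "onnxruntime": "1.25.0",
    "safetensors": "0.7.0",
    "huggingface_hub": "0.36.2",
}


def installed_version(package: str):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins: dict, lookup=installed_version) -> dict:
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Running `find_mismatches(RUNTIME_PINS)` inside the server's Python environment returns an empty dictionary when every pinned package is installed at the listed version.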
Runtime settings live in `config.json`. The installers auto-patch these key settings based on detected GPU VRAM:
| Key | Description |
|---|---|
| `device` | OpenVINO device, auto-detected (e.g. `GPU.1`) |
| `kv_cache_size_gb` | KV cache per model, tuned to VRAM tier |
| `max_loaded_models` | Models held in VRAM simultaneously |
| `default_model` | Model used when the client doesn't specify one |
| `embedding_model` | Embedding model directory name |
| `postgres_dsn` | Observability database connection string |
Full reference: INSTALL.md § Phase 7.
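A quick sanity check of a hand-edited config could look like the sketch below. It assumes `config.json` is a flat JSON object with the keys from the table at the top level, which this README does not confirm; the function names are hypothetical.

```python
import json
from pathlib import Path

# Keys listed in the table above; assumed to sit at the top level of the file.
EXPECTED_KEYS = (
    "device",
    "kv_cache_size_gb",
    "max_loaded_models",
    "default_model",
    "embedding_model",
    "postgres_dsn",
)


def missing_keys(config: dict) -> list[str]:
    """Return the expected keys that are absent, in table order."""
    return [key for key in EXPECTED_KEYS if key not in config]


def load_config(path: str = "config.json") -> dict:
    """Read the JSON file and fail early if an expected key is absent."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_keys(config)
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```

Checking only for presence keeps the sketch independent of value formats, which are documented in INSTALL.md rather than here.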
| Layer | Component |
|---|---|
| HTTP | FastAPI + Uvicorn, single worker |
| LLM inference | openvino_genai.LLMPipeline, executor-offloaded |
| VLM inference | openvino_genai.VLMPipeline |
| Embeddings | OVModelForFeatureExtraction (optimum-intel) |
| Task routing | Embedding similarity + signal detection |
| STT | openvino_genai.WhisperPipeline |
| TTS | Kokoro-ONNX (EN) + Piper (PL) |
| Observability | PostgreSQL 16 + pgvector |
| Monitor UI | Svelte + uPlot |
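The "executor-offloaded" entry in the table can be illustrated with a generic asyncio pattern: a blocking generation call runs in a worker thread so the single event loop stays free for other requests. This is a sketch of the general technique, not Stormvino's code. `blocking_generate` is a hypothetical stand-in for the real pipeline call, and the single-thread executor is an assumption.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: one worker thread, so only one generation runs at a time.
_executor = ThreadPoolExecutor(max_workers=1)


def blocking_generate(prompt: str) -> str:
    """Hypothetical stand-in for a blocking inference call."""
    time.sleep(0.05)  # simulate time spent on the GPU
    return f"echo: {prompt}"


async def generate(prompt: str) -> str:
    """Run the blocking call in the executor without blocking the loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_generate, prompt)


async def demo() -> tuple[str, int]:
    """Generate once while a heartbeat task shows the loop still runs."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    text = await generate("Hello")
    beat.cancel()
    return text, ticks
```

Running `asyncio.run(demo())` returns the generated text together with a nonzero tick count, which shows the event loop kept servicing other work while the blocking call was in progress.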
Tested Stormvino on a GPU not in the compatibility table? Open a hardware report issue with the GPU model, VRAM, kernel version, and tokens/sec. Every report builds out the matrix for everyone.
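If it helps when preparing a report, the sketch below computes tokens/sec and formats the four requested fields. It assumes you time a request yourself (for example with `time.perf_counter()`) and take the generated token count from the response's `usage.completion_tokens` field, if the server populates it as the OpenAI format does. The report layout is hypothetical, not the project's issue template.

```python
import platform


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def hardware_report(gpu_model: str, vram_gb: float,
                    completion_tokens: int, elapsed_seconds: float) -> str:
    """Format the four fields the hardware report asks for."""
    rate = tokens_per_second(completion_tokens, elapsed_seconds)
    return "\n".join([
        f"GPU model: {gpu_model}",
        f"VRAM: {vram_gb} GB",
        f"Kernel version: {platform.release()}",  # kernel of this machine
        f"Tokens/sec: {rate:.1f}",
    ])
```

Run it on the machine that hosts the GPU so that `platform.release()` reports the kernel actually driving the card.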
Stormvino grew out of Shangri-Lab, a personal lab built by an IT architect from Silesia who had no Python background, a pair of Intel Arc GPUs, and a firm belief that local inference shouldn't require NVIDIA hardware or magic frameworks.
The philosophy is unchanged: build the simplest thing that gives full visibility first, tune quality only after you can observe it.
Built with Claude Code.