Ollama-like LLM experience on Intel Arc GPUs using OpenVINO
Features • Quick Start • Installation • Usage • Docker • Docs
illama (Intel llama) provides an Ollama-like experience for running LLMs locally on Intel Arc GPUs. It uses OpenVINO for optimized inference and provides both a CLI and an OpenAI-compatible API.
- Ollama-like CLI - Familiar commands: `illama pull`, `illama run`, `illama ps`
- OpenAI-compatible API - Works with OpenWebUI and other OpenAI clients
- Single-model loading - Optimized for consumer GPUs with limited VRAM
- Idle auto-eviction - Frees GPU memory when not in use
- INT4/INT8/FP16 quantization - Flexible precision options
- Docker support - Ready for Portainer stack deployment
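The single-model and idle auto-eviction features can be pictured with a small sketch. This is not illama's actual implementation; the `SingleModelCache` class, its `loader` callback, and the injectable `clock` are hypothetical names used only for illustration. It assumes the policy is "keep at most one model loaded, and drop it once it has been unused for a configurable number of seconds".

```python
import time
from typing import Callable, Optional


class SingleModelCache:
    """Hypothetical sketch: hold at most one model, evict it when idle."""

    def __init__(
        self,
        loader: Callable[[str], object],
        idle_ttl_sec: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader              # loads a model by name (assumed)
        self._idle_ttl_sec = idle_ttl_sec  # seconds of inactivity before eviction
        self._clock = clock                # injectable so the logic is testable
        self._name: Optional[str] = None
        self._model: Optional[object] = None
        self._last_used = 0.0

    @property
    def loaded(self) -> Optional[str]:
        """Name of the currently loaded model, or None."""
        return self._name

    def get(self, name: str) -> object:
        """Return the requested model, replacing any other loaded model."""
        if self._name != name:
            self.evict()  # single-model policy: free memory before loading
            self._model = self._loader(name)
            self._name = name
        self._last_used = self._clock()
        return self._model

    def evict_if_idle(self) -> bool:
        """Evict the model if it has been unused for at least the TTL."""
        idle_for = self._clock() - self._last_used
        if self._name is not None and idle_for >= self._idle_ttl_sec:
            self.evict()
            return True
        return False

    def evict(self) -> None:
        """Drop the reference so the model's memory can be reclaimed."""
        self._name = None
        self._model = None
```

A server built this way would call `get()` on every request and `evict_if_idle()` from a periodic background task. Dropping the Python reference is only the first step; whether GPU memory is released immediately depends on the inference runtime.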
- Intel Arc GPU (B50, A770, A750, A380)
- Ubuntu 24.04+ (or compatible Linux)
- Intel GPU drivers (Level Zero runtime)
- 8GB+ GPU VRAM recommended
# Install
pip install -e .
# Check system
illama doctor
# Pull a model (converts to OpenVINO INT4)
illama pull Qwen/Qwen3-8B --weight-format int4
# Run interactively
illama run Qwen3-8B
>>> Hello, how are you?
# Or start the API server
illama serve
- Intel GPU drivers:
sudo apt update
sudo apt install -y intel-gpu-tools level-zero
- Verify GPU:
sudo intel_gpu_top
# Clone the repository
git clone https://github.com/mkhomutskyi/illama.git
cd illama
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install
pip install -e .
# For development
pip install -e ".[dev]"
Option A: Symlink (recommended for development)
# After pip install, create symlink to the venv's illama
sudo ln -s $(pwd)/venv/bin/illama /usr/local/bin/illama
Option B: Standalone binary with PyInstaller
# Install PyInstaller
pip install pyinstaller
# Build standalone binary
pyinstaller --onefile --name illama illama_cli/__main__.py
# Install to system
sudo mv dist/illama /usr/local/bin/
sudo chmod +x /usr/local/bin/illama
# Verify
illama --version
Option C: pip install globally (not recommended - may conflict with system packages)
sudo pip install .
Some models require authentication:
export HF_TOKEN="your-huggingface-token"
# or
huggingface-cli login

| Command | Description |
|---|---|
| `illama pull <model>` | Download, convert, and register a model |
| `illama rm <model>` | Remove model from registry |
| `illama list` | List registered models |
| `illama ps` | Show loaded model status |
| `illama run <model> [prompt]` | Chat with a model |
| `illama run <model> -v` | Chat with performance metrics |
| `illama serve` | Start the API server |
| `illama doctor` | System diagnostics |

# From HuggingFace (auto-converts to OpenVINO)
illama pull microsoft/Phi-4-mini-reasoning --weight-format int4
# Different quantization
illama pull Qwen/Qwen3-8B --weight-format int8
# Interactive mode
illama run Phi-4-mini-reasoning
>>> What is the capital of France?
# Single prompt
illama run Qwen3-8B "Explain quantum computing"
# With performance metrics (non-streaming)
illama run Qwen3-8B "Hello" -v
# Output: Hello! How can I help you?
# eval: 42 tokens | prompt: 10 | 28.50 t/s | gen: 1.47s
# Start server
illama serve --port 11434
# Test
curl http://localhost:11434/v1/models | jq
# Chat completion
curl -X POST http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B",
"messages": [{"role": "user", "content": "Hello!"}]
}'
cd docker
docker compose up -d
- Go to Portainer → Stacks → Add Stack
- Paste contents of docker/docker-compose.yml
- Set environment variable HF_TOKEN
- Deploy
After deploying, access OpenWebUI at http://your-server:3000.
Configuration:
- Settings → Connections
- OpenAI API Base URL: http://illama-manager:11434/v1
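Any OpenAI-compatible client can target the same base URL, not only OpenWebUI. The sketch below uses only the Python standard library and mirrors the `curl` chat-completion call shown earlier. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes illama returns the standard OpenAI chat-completion shape (`choices[0].message.content`), which this README does not spell out.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: the standard OpenAI chat-completion format.
    return body["choices"][0]["message"]["content"]
```

From the host, a call such as `chat("http://localhost:11434/v1", "Qwen3-8B", "Hello!")` should behave like the `curl` example; inside the Docker stack, the base URL would be `http://illama-manager:11434/v1` instead.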
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `ILLAMA_DEVICE` | `GPU` | Device: GPU, CPU, AUTO |
| `ILLAMA_ONE_MODEL` | `1` | Single-model policy |
| `ILLAMA_IDLE_TTL_SEC` | `600` | Idle timeout (seconds) |
| `ILLAMA_PORT` | `11434` | API server port |
| `HF_TOKEN` | - | HuggingFace token |

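As an illustration of how these variables and their defaults fit together, here is a minimal sketch of reading them in Python. The `Settings` dataclass and `load_settings` function are hypothetical and not part of illama; treating `ILLAMA_ONE_MODEL=1` as "enabled" and any other value as "disabled" is an assumption, since the table gives only the default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_DEVICES = {"GPU", "CPU", "AUTO"}


@dataclass(frozen=True)
class Settings:
    device: str
    one_model: bool
    idle_ttl_sec: int
    port: int
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    device = env.get("ILLAMA_DEVICE", "GPU").upper()
    if device not in VALID_DEVICES:
        raise ValueError(f"ILLAMA_DEVICE must be one of {sorted(VALID_DEVICES)}")
    return Settings(
        device=device,
        # Assumption: "1" enables the single-model policy, anything else disables it.
        one_model=env.get("ILLAMA_ONE_MODEL", "1") == "1",
        idle_ttl_sec=int(env.get("ILLAMA_IDLE_TTL_SEC", "600")),
        port=int(env.get("ILLAMA_PORT", "11434")),
        # HF_TOKEN has no default; an empty string is treated as unset.
        hf_token=env.get("HF_TOKEN") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.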
- Architecture - System design and components
- Model Compatibility - Supported models and mappings
- Quantization - INT4/INT8/FP16 guide
- Troubleshooting - Common issues and solutions
Recommended: use INT4 for the best balance of speed and memory use on Intel Arc.
| Format | VRAM (7B model) | Speed |
|---|---|---|
| FP16 | ~14 GB | Baseline |
| INT8 | ~7 GB | Faster |
| INT4 | ~3.5 GB | Fastest |
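The VRAM column follows from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces the table's figures for a 7B model. The function name `estimate_weight_gb` is hypothetical, and the estimate covers weights only; the KV cache, activations, and quantization metadata such as scales add to it in practice.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def estimate_weight_gb(num_params: float, weight_format: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    bits = BITS_PER_WEIGHT[weight_format.lower()]
    return num_params * bits / 8 / 1e9


for fmt in ("fp16", "int8", "int4"):
    print(f"{fmt}: ~{estimate_weight_gb(7e9, fmt):.1f} GB")
```

By the same formula, an 8B model such as Qwen3-8B needs roughly 4 GB for INT4 weights, which is why INT4 fits comfortably within the recommended 8 GB of VRAM.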
- GGUF not supported - Use llama.cpp SYCL for GGUF models
- VLMs experimental - Vision models need extra setup
- Gated models - Require HuggingFace acceptance (Llama, Gemma, etc.)
See CONTRIBUTING.md for guidelines.
Apache 2.0 - see LICENSE
- OpenVINO - Intel's inference toolkit
- Ollama - Inspiration for CLI/UX
- OpenWebUI - Web interface
- Optimum Intel - Model conversion