illama

Ollama-like LLM experience on Intel Arc GPUs using OpenVINO

Features • Quick Start • Installation • Usage • Docker • Docs

illama (Intel llama) provides an Ollama-like experience for running LLMs locally on Intel Arc GPUs. It uses OpenVINO for optimized inference and exposes both a CLI and an OpenAI-compatible API.

Features

🚀 Ollama-like CLI - Familiar commands: illama pull, illama run, illama ps
🔌 OpenAI-compatible API - Works with OpenWebUI and other OpenAI clients
💾 Single-model loading - Optimized for consumer GPUs with limited VRAM
⏱️ Idle auto-eviction - Frees GPU memory when not in use
📦 INT4/INT8/FP16 quantization - Flexible precision options
🐳 Docker support - Ready for Portainer stack deployment

Hardware Requirements

Intel Arc GPU (B50, A770, A750, A380)
Ubuntu 24.04+ (or compatible Linux)
Intel GPU drivers (Level Zero runtime)
8GB+ GPU VRAM recommended

Quick Start

# Install
pip install -e .

# Check system
illama doctor

# Pull a model (converts to OpenVINO INT4)
illama pull Qwen/Qwen3-8B --weight-format int4

# Run interactively
illama run Qwen3-8B
>>> Hello, how are you?

# Or start the API server
illama serve

Installation

Prerequisites

Intel GPU drivers:

sudo apt update
sudo apt install -y intel-gpu-tools level-zero

Verify GPU:
```
sudo intel_gpu_top
```

Install illama

# Clone the repository
git clone https://github.com/mkhomutskyi/illama.git
cd illama

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install
pip install -e .

# For development
pip install -e ".[dev]"

System-wide Installation (Optional)

Option A: Symlink (recommended for development)

# After pip install, create symlink to the venv's illama
sudo ln -s "$(pwd)/venv/bin/illama" /usr/local/bin/illama

Option B: Standalone binary with PyInstaller

# Install PyInstaller
pip install pyinstaller

# Build standalone binary
pyinstaller --onefile --name illama illama_cli/__main__.py

# Install to system
sudo mv dist/illama /usr/local/bin/
sudo chmod +x /usr/local/bin/illama

# Verify
illama --version

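Option C: Global pip install (not recommended: it may conflict with system packages, and on Ubuntu 24.04 pip refuses to install into the externally managed system Python unless --break-system-packages is passed)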

sudo pip install .

Configure HuggingFace Token

Some models require authentication:

export HF_TOKEN="your-huggingface-token"
# or
huggingface-cli login

Usage

CLI Commands

| Command | Description |
| --- | --- |
| `illama pull <model>` | Download, convert, and register a model |
| `illama rm <model>` | Remove model from registry |
| `illama list` | List registered models |
| `illama ps` | Show loaded model status |
| `illama run <model> [prompt]` | Chat with a model |
| `illama run <model> -v` | Chat with performance metrics |
| `illama serve` | Start the API server |
| `illama doctor` | System diagnostics |

Pull a Model

# From HuggingFace (auto-converts to OpenVINO)
illama pull microsoft/Phi-4-mini-reasoning --weight-format int4

# Different quantization
illama pull Qwen/Qwen3-8B --weight-format int8

Chat

# Interactive mode
illama run Phi-4-mini-reasoning
>>> What is the capital of France?

# Single prompt
illama run Qwen3-8B "Explain quantum computing"

# With performance metrics (non-streaming)
illama run Qwen3-8B "Hello" -v
# Output: Hello! How can I help you?
# eval: 42 tokens | prompt: 10 | 28.50 t/s | gen: 1.47s

API Server

# Start server
illama serve --port 11434

# Test
curl http://localhost:11434/v1/models | jq

# Chat completion
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
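
The same chat completion call can be made from Python. The sketch below uses only the standard library and mirrors the curl request above. It assumes the server is running on localhost:11434 and that the response follows the OpenAI chat completion schema (`choices[0].message.content`), which has not been verified against illama itself. The helper names `build_chat_request` and `chat` are illustrative and not part of illama.

```python
import json
import urllib.request

# Assumed base URL; matches `illama serve --port 11434` above.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, base_url=BASE_URL, timeout=120):
    """Send the request and return the assistant's reply text.

    Assumption: the response body has the OpenAI shape
    {"choices": [{"message": {"content": ...}}]}.
    """
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running and a model registered, `print(chat("Qwen3-8B", "Hello!"))` should print the reply. Keeping request construction separate from sending makes the payload easy to inspect without a live server.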

Docker

Quick Deploy with Docker Compose

cd docker
docker compose up -d

Portainer Stack

1. Go to Portainer → Stacks → Add Stack
2. Paste the contents of docker/docker-compose.yml
3. Set the HF_TOKEN environment variable
4. Deploy

OpenWebUI Integration

After deploying, access OpenWebUI at http://your-server:3000.

To connect OpenWebUI to illama, open Settings → Connections in OpenWebUI and set:

OpenAI API Base URL: http://illama-manager:11434/v1

Configuration

Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `ILLAMA_DEVICE` | `GPU` | Device: GPU, CPU, AUTO |
| `ILLAMA_ONE_MODEL` | `1` | Single-model policy |
| `ILLAMA_IDLE_TTL_SEC` | `600` | Idle timeout (seconds) |
| `ILLAMA_PORT` | `11434` | API server port |
| `HF_TOKEN` | - | HuggingFace token |
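
As an illustration of how these variables and their defaults fit together, here is a minimal sketch of a settings loader that a wrapper script might use. It is not illama's actual implementation. The `Settings` class, `load_settings`, and `is_idle_expired` are hypothetical names, and treating `ILLAMA_ONE_MODEL=1` as "enabled" is an assumption based on the default shown in the table.

```python
import os
from dataclasses import dataclass
from typing import Optional

VALID_DEVICES = {"GPU", "CPU", "AUTO"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    device: str = "GPU"
    one_model: bool = True
    idle_ttl_sec: int = 600
    port: int = 11434
    hf_token: Optional[str] = None


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    device = env.get("ILLAMA_DEVICE", "GPU").upper()
    if device not in VALID_DEVICES:
        raise ValueError(f"ILLAMA_DEVICE must be one of {sorted(VALID_DEVICES)}")
    return Settings(
        device=device,
        # Assumption: "1" enables the single-model policy, anything else disables it.
        one_model=env.get("ILLAMA_ONE_MODEL", "1") == "1",
        idle_ttl_sec=int(env.get("ILLAMA_IDLE_TTL_SEC", "600")),
        port=int(env.get("ILLAMA_PORT", "11434")),
        hf_token=env.get("HF_TOKEN"),
    )


def is_idle_expired(last_used, now, ttl_sec):
    """Return True once the model has been unused for at least ttl_sec seconds."""
    return (now - last_used) >= ttl_sec
```

With no variables set, `load_settings({})` yields the defaults from the table. `is_idle_expired` shows the kind of check an idle timeout implies: a model last used 600 or more seconds ago would be eligible for eviction under the default TTL.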

Documentation

Architecture - System design and components
Model Compatibility - Supported models and mappings
Quantization - INT4/INT8/FP16 guide
Troubleshooting - Common issues and solutions

Quantization

Recommended: use INT4 for the best balance of speed and memory use on Intel Arc.

| Format | VRAM (7B model) | Speed |
| --- | --- | --- |
| FP16 | ~14 GB | Baseline |
| INT8 | ~7 GB | Faster |
| INT4 | ~3.5 GB | Fastest |
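
The VRAM figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces them for a 7B model. It counts weights only, so the KV cache, activations, and quantization metadata (scales and zero points) come on top. The function name `estimate_weight_gb` is illustrative.

```python
# Bits used to store each weight in the three formats from the table.
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def estimate_weight_gb(num_params, weight_format):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    bits = BITS_PER_WEIGHT[weight_format.lower()]
    return num_params * bits / 8 / 1e9


for fmt in ("fp16", "int8", "int4"):
    print(f"{fmt}: ~{estimate_weight_gb(7e9, fmt):.1f} GB")
```

For an 8B model such as Qwen3-8B, the same formula gives roughly 4 GB of weights at INT4, which is why INT4 suits cards near the 8 GB recommendation.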

Limitations

GGUF not supported - Use llama.cpp SYCL for GGUF models
VLMs experimental - Vision models need extra setup
Gated models - Require HuggingFace acceptance (Llama, Gemma, etc.)

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache 2.0 - see LICENSE

Acknowledgments

OpenVINO - Intel's inference toolkit
Ollama - Inspiration for CLI/UX
OpenWebUI - Web interface
Optimum Intel - Model conversion
