Stop hand-writing --tensor-split, -ot, and KV-cache flags. llm-server
is an auto-tuned launcher for GGUF models: it measures your GPUs, RAM, and PCIe
topology, picks the right backend (llama.cpp or the faster ik_llama.cpp fork),
computes multi-GPU and MoE expert placement, and serves an OpenAI-compatible
API — one command from GGUF file to running endpoint.
llm-server model.gguf # local GGUF → served
llm-server unsloth/Qwen3.6-27B-GGUF --download # HF repo → hardware-matched quant → served
llm-server model.gguf --ai-tune # benchmark flag sets, cache the fastestBuilt for machines Ollama serves poorly: mismatched multi-GPU rigs (24GB + 12GB + 12GB), big MoE models split across VRAM and RAM, and anyone who wants llama.cpp's full flag surface with measured — not guessed — defaults.
Recommended self-contained setup on Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/raketenkater/llm-server/main/setup.sh | bashNative Windows setup from PowerShell:
iwr -useb https://raw.githubusercontent.com/raketenkater/llm-server/main/install.ps1 | iexNative Windows NVIDIA CUDA setup:
iwr -useb https://raw.githubusercontent.com/raketenkater/llm-server/main/install.ps1 -OutFile install.ps1
powershell -ExecutionPolicy Bypass -File .\install.ps1 -Backend cudaSince v3.0.0, prebuilt release bundles
(Linux CPU/Vulkan, macOS arm64 Metal, Windows x86_64 CPU) are downloaded and verified against the
published SHA256SUMS — no compile needed. Linux CUDA/ik_llama.cpp installs build from source
for your exact GPU architecture. Windows NVIDIA CUDA installs use a native llama.cpp CUDA backend,
either from an optional llm-server-windows-x86_64-cuda.zip release asset or by building it locally.
From a clone:
git clone https://github.com/raketenkater/llm-server.git
cd llm-server
./setup.shThis creates a clean app home under ~/llm-server:
llm-server CLI launcher
llm-server-gui terminal GUI launcher
models/ GGUF models and downloaded vision projectors
.bin/ Go binary, tools, and bundled backend when available
.config/ local config loaded by the launcher
.cache/ AI Tune and model index cache
.logs/ setup and server logs
.src/ backend source/build fallback
Use it with:
~/llm-server/llm-server-gui
~/llm-server/llm-server <repo/name> --downloadClassic install to ~/.local/bin:
curl -fsSL https://raw.githubusercontent.com/raketenkater/llm-server/main/install.sh | bashInstaller controls:
LLM_INSTALL_MODE=release ./install.sh # require a prebuilt bundle
LLM_INSTALL_MODE=build ./install.sh # force source build
LLM_INSTALL_BACKEND=skip ./install.sh # install launcher/tools only
LLM_INSTALL_PY_DEPS=skip ./install.sh # skip downloader Python deps
LLM_INSTALL_PREFIX=/usr/local/bin ./install.shExisting Bash installs are treated as legacy. The installer preserves them as
llm-server-bash when replacing the primary llm-server command with Go v3.
Typical raw command for a heterogeneous 3-GPU box:
llama-server -m model.gguf --ctx-size 32768 --tensor-split 24,12,12 --split-mode layer --cache-type-k q4_0 --cache-type-v q4_0 --threads 8 --threads-batch 8 -b 8192 -ub 1024 --jinja --flash-attn on --port 8081With llm-server:
llm-server model.ggufvs raw llama.cpp. Upstream recently gained --fit (auto GPU layers,
tensor-split, context targeting ~85-90% VRAM, and some MoE tensor
overrides) — if that is all you need,
raw llama.cpp may be enough. llm-server goes further: it selects the backend
(ik_llama.cpp is meaningfully faster on CUDA for many models, but its flag
dialect differs), chooses KV-cache quantization and batch sizes from measured
probes, finds and validates vision projectors and speculative-decoding drafts,
benchmarks candidate flag sets against each other (--ai-tune), and restarts
or falls back to mainline on crashes. Unknown flags pass straight through, so
nothing upstream is ever out of reach.
vs Ollama. Ollama optimizes for one-command simplicity on common hardware and has a far larger ecosystem. llm-server targets the machines where Ollama's conservative heuristics leave performance behind: mismatched multi-GPU rigs, MoE models split across VRAM/RAM, ik_llama.cpp speed, and full llama.cpp flag access. If you have one GPU and want zero configuration, use Ollama.
vs llama-swap. llama-swap is a proxy that hot-swaps between configured
model commands; you still write each model's flags yourself. llm-server
computes those flags. They compose well: point llama-swap entries at
llm-server dry-run output, or use llm-server daemon (control API with
/reload) for single-model swapping.
| Capability | raw llama.cpp | llm-server |
|---|---|---|
| Multi-GPU placement | --fit (recent) |
automatic, PCIe/bandwidth-weighted |
| Heterogeneous GPU split | --fit (recent) |
automatic |
| MoE expert placement | --fit/manual -ot (recent) |
automatic, backend-aware, with --n-cpu-moe fallback |
| Backend selection (ik_llama vs mainline vs Vulkan) | manual | automatic, dialect-aware |
| KV-cache type / batch sizing | manual | probe-measured |
| AI Tune (measured flag search) | no | yes, cached per model+hardware |
| Hardware-matched quant download | no | yes (HF search + intelligence ranking) |
| Vision projector lookup | manual | automatic local/HF lookup |
| Speculative decoding | manual | validated, backend-aware modes |
| Crash recovery / backend fallback | no | yes |
- Go v3 command line with Linux, macOS, native Windows, and cross-build support.
- Backend-aware launch flags for ik_llama.cpp and llama.cpp.
- Multi-GPU placement using VRAM, free memory, and PCIe weighting.
- MoE-aware expert placement with
-ot/--n-cpu-moefallback paths. - AI Tune: benchmarks candidate flag sets and caches the fastest valid result.
- Community tune pool: first launches reuse configs measured by others on the
same GPU set, sanitized to safe performance flags (
LLM_COMMUNITY_TUNES=offto opt out). - Model downloader that searches Hugging Face GGUF repos and picks a hardware-aware quant.
- GUI recommended-download fast path ranked by intelligence signal after hardware fit.
- Vision projector lookup and validation for multimodal GGUF models.
- Speculative decoding modes for MTP, EAGLE-3, validated draft models, and explicit ngram modes.
- Startup update checks for interactive users, with rollback on failed updates.
- Terminal UI via
llm-server-guiorllm-server gui.
# Launch a local model
llm-server ~/models/model.gguf
# Download a GGUF from Hugging Face, then launch it
llm-server unsloth/Qwen3.6-27B-GGUF --download
# Run AI Tune once, then reuse the cached result
llm-server model.gguf --ai-tune
# Print the backend command without launching
llm-server model.gguf --dry-run
# Run the terminal UI
llm-server-gui# Backends
llm-server --backend ik_llama model.gguf
llm-server --backend llama model.gguf
llm-server --backend vulkan model.gguf
# Placement and memory
llm-server model.gguf --gpus 0,1
llm-server model.gguf --ram-budget 90G
llm-server model.gguf --ctx-size 32768
llm-server model.gguf --kv-quality mid
llm-server model.gguf --kv-placement gpu
# Vision
llm-server model.gguf --vision
llm-server model.gguf --mmproj /path/to/mmproj.gguf
# Tuning and cached configs
llm-server model.gguf --ai-tune
llm-server model.gguf --ai-tune --retune
llm-server --show-configs
llm-server model.gguf --tune-cache ~/.cache/llm-server/tune.json
# Speculative decoding
llm-server model.gguf --spec auto
llm-server model.gguf --spec mtp
llm-server model.gguf --spec eagle3
llm-server model.gguf --spec draft
llm-server model.gguf --spec ngram-mod
# Maintenance
llm-server --update
llm-server model.gguf --benchmarkUnknown flags are passed through to llama-server, so upstream options remain
available without wrapper changes.
--ai-tune starts from the launcher heuristic, benchmarks it, tests candidate
flag sets, and stores the best successful result in the local cache. The served
model can propose candidate flags, but the launcher validates them against
backend help, memory headroom, crash behavior, and benchmark results before a
cache entry is reused.
Release claims should be tied to benchmark artifacts. See docs/performance.md for the benchmark format and the current speculative-decoding findings.
--spec auto only enables a real validated path:
- MTP when the target GGUF has NextN/MTP metadata and the backend supports it.
- EAGLE-3 when a matching speculator is available and the backend advertises it.
- A compatible draft GGUF found locally or through Hugging Face search.
- Off when no validated path exists.
Ngram modes are explicit because they are workload-sensitive. Recent local tests showed a large gain on structured/repetitive output and a regression on code continuation with low draft acceptance. See docs/speculative-decoding.md.
Compare raw backend launch against Go v3:
scripts/bench-v3-comparison.sh model.gguf --server-bin /path/to/llama-server --ctx-size 32768 --rounds 3Optional historical comparison against an installed Bash v2 launcher:
scripts/bench-v3-comparison.sh model.gguf --server-bin /path/to/llama-server --bash-bin ~/.local/bin/llm-server-bashThe script writes JSON logs and a Markdown summary under .benchmarks/. Generated
benchmark runs are ignored by Git; commit only curated summaries.
Methodology: "raw" rows run the same backend binary with its own default
flags at the same context size on the same hardware — the gains come from
placement, KV/batch selection, and tuned flags, not from comparing different
backends. Every number is reproducible with
scripts/bench-v3-comparison.sh (JSON +
Markdown artifacts); the exact commands for each row are in
docs/performance.md. If your numbers differ, open an
issue with the artifact — regressions against these tables are treated as bugs.
Measured on 2026-06-10 on Linux with RTX 3090 Ti 24GB, RTX 3060 12GB, RTX 4070 12GB, 128GB RAM, and i7-10700K. Context was 32k for every row. Dense-model rows use the long prompt profile with 512 generated tokens and the median of three rounds. MoE rows use 256 generated tokens because the 95GB split model has multi-minute startup/repack time.
| Model | Backend / mode | Decode tok/s | Result |
|---|---|---|---|
| Qwen3.5 4B Q4_K_M | raw IK CUDA backend | 122.66 | baseline |
| Qwen3.5 4B Q4_K_M | v3 IK CUDA default | 156.21 | +27% vs raw |
| Qwen3.5 4B Q4_K_M | v3 IK CUDA AI-tune | 183.85 | +50% vs raw |
| Qwen3.5 4B Q4_K_M | llama.cpp CPU | 11.26 | CPU fallback |
| Qwen3.5 4B Q4_K_M | llama.cpp Vulkan default | 158.42 | cross-platform GPU path |
| Qwen3.5 4B Q4_K_M | llama.cpp Vulkan AI-tune | 169.61 | -ub 512, +7% vs Vulkan default |
| Qwen3.6 27B Q5_K_M | raw IK CUDA backend | failed | OOM on this 24GB primary GPU setup |
| Qwen3.6 27B Q5_K_M | v3 IK CUDA default | 37.68 | stable 32k context |
| Qwen3.6 27B Q5_K_M | v3 IK CUDA AI-tune | 37.69 | baseline correctly kept |
| Qwen3.6 27B Q5_K_M | llama.cpp Vulkan default | 37.72 | stable 32k context |
| Qwen3.6 27B Q5_K_M | llama.cpp Vulkan AI-tune | 37.67 | baseline correctly kept |
| Qwen3.6 27B Q5_K_M | IK CUDA speculative auto | 11.16 | not a release default; draft acceptance ~18% |
| MiniMax M2.7 UD-Q3_K_XL | v3 IK CUDA MoE default | 11.27 | 230B/A10B MoE, 95GB split GGUF |
| MiniMax M2.7 UD-Q3_K_XL | MoE AI-tune -b 1024 |
11.29 | below 1% tune threshold |
| MiniMax M2.7 UD-Q3_K_XL | MoE AI-tune -b 1536 |
11.30 | below 1% tune threshold |
AI-tune uses a 1% noise floor before replacing the default config. That is why small MoE differences are reported but not selected. The MoE path currently prioritizes stability, full context, and safe expert placement over chasing tiny single-run gains. The next major MoE tuning target is persistent in-process candidate testing, because relaunching a 95GB split model dominates tune time.
- CUDA/NVIDIA: ik_llama.cpp source build by default, or a manually published CUDA release bundle.
- Vulkan: llama.cpp Vulkan build.
- Metal/macOS: llama.cpp Metal build or release bundle.
- Windows: native x86_64 CPU release bundle; NVIDIA CUDA via
install.ps1 -Backend cudaor a customLLAMA_SERVER. - CPU: llama.cpp CPU build or release bundle.
Linux:
curl,git,python3cmake,make, and a compiler when building a backend from source- NVIDIA driver and CUDA toolkit for CUDA source builds
vulkaninfofor Vulkan auto-detection
macOS:
- Apple Silicon recommended
- Xcode command line tools when building from source
- Metal-capable llama.cpp build or release bundle
Windows:
- Windows 10/11 x86_64
- PowerShell 5+
- Python available as
python3,python, orpyfor GGUF parsing/downloader helpers - Native CPU release bundle is produced automatically
- NVIDIA driver, CUDA Toolkit with
nvcc, CMake, Git, and Visual Studio C++ Build Tools forinstall.ps1 -Backend cuda - Windows Vulkan is not a supported llm-server target
- Architecture
- Performance
- Launch performance tables
- Speculative decoding
- Model recommendations
- Release checklist
- Repository hygiene
- Changelog
MIT
