A from-scratch CUDA inference engine for NVIDIA Blackwell.
Single-GPU, sm_120a, native NVFP4 weight + KV — ~200 tok/s decode on Qwen3.6-35B-A3B NVFP4 MoE.
Every line written by Claude Code.
A from-scratch CUDA inference engine for one architecture: NVIDIA Blackwell, compute capability 12.0. No portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet, CUDA Graphs, and an OpenAI- + Anthropic-compatible HTTP server.
Every line was generated by an AI coding agent (Claude Code) — a long-running experiment in how far that approach scales on serious systems work.
- Optimized for one chip family. Hot-path kernels use Blackwell-specific features: PDL, Green Contexts, NVFP4 block-scaled MMA, FP8 MMA `kind::f8f6f4`, TMA warp-specialized grouped GEMM, packed `cvt.e4m3x2`. No SM80 / SM90 fallback paths.
- Native NVFP4 weight + KV. SafeTensors NVFP4 prequant (NVIDIA Model Optimizer + llm-compressor) loads directly into NVFP4 tensor-core kernels. The NVFP4 KV cache (`--kv-nvfp4`) compresses context 3.9× at parity decode tok/s — Qwen3-8B Q8 fits 40k ctx in the same VRAM that holds 16k as FP16.
- Gated DeltaNet + Mamba2 + MoE hybrids. Qwen3.5, Qwen3.6, Nemotron-H run as full hybrids (GDN / Mamba2 + attention + MoE) with fused multi-token recurrent scans and register-cached state.
- OpenAI + Anthropic API surface. `imp-server` speaks `/v1/chat/completions` and `/v1/messages` (streaming + non-streaming) with prefix caching, JSON-schema constraining, and tool calling.
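A minimal sketch of a `/v1/messages` call, assuming the request body follows the public Anthropic Messages API shape (`model`, `max_tokens`, `messages`); the model name here is just a placeholder for whatever `--model` loaded:

```bash
# Sketch: Anthropic-style endpoint. "model" is a placeholder; imp serves
# whatever was passed to --model at startup.
curl -s http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"local","max_tokens":64,"messages":[{"role":"user","content":"Hello!"}]}'
```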
Experimental. The codebase is single-author / single-target / single-GPU. There are open bugs (see TODO.md), some quantization paths produce coherent output only on specific model families, and prefill numbers vary up to 2.6× across container restarts because of cuBLAS autotuning. Don't deploy this anywhere it matters.
Everything runs in Docker; no local CUDA toolkit needed.
```bash
# 1. Clone
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Drop a GGUF or SafeTensors model into ./models/
mkdir -p models
# (Example: any *.gguf or NVFP4 prequant SafeTensors directory)

# 3. Build the server image
docker compose build imp-server

# 4. Serve it
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

# 5. Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
```

See docs/usage.md for the full CLI reference, server flags, and C-API embedding guide.
Target. NVIDIA Blackwell sm_120 family — GeForce RTX 5090 / 5080 / 5070 Ti, RTX PRO 6000 Blackwell.
Tested. RTX 5090 only (GB202, 32 GB GDDR7). Every perf number in this repo is from one machine.
Fatbin layout.
- `arch=compute_120a, code=sm_120a` — SASS for GB202 (RTX 5090, RTX PRO 6000). Architecture-specific feature set (NVFP4 block-scaled MMA, FP8 MMA `kind::f8f6f4`, TMA warp-specialized grouped GEMM).
- `arch=compute_120f, code=compute_120f` — family-portable PTX, JIT-compiled on load for GB203 (RTX 5080, 5070 Ti). Loses TMA-WS grouped GEMM tactics; CUTLASS picks the next-best fallback automatically, but the NVFP4 fast path on Mamba2 shapes is degraded. Disable with `-DIMP_DISABLE_120F_FALLBACK=ON` for a smaller, RTX 5090-only fatbin.
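To build only the sm_120a SASS path, the `-DIMP_DISABLE_120F_FALLBACK=ON` option named above combines with the standard configure step from the Build section; a sketch:

```bash
# Drop the compute_120f PTX fallback: smaller fatbin, GB202 (RTX 5090 / RTX PRO 6000) only
cmake -B build -DCMAKE_BUILD_TYPE=Release -DIMP_DISABLE_120F_FALLBACK=ON
cmake --build build -j$(nproc)
```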
Out of scope. No support for Hopper, Ada, Ampere, or earlier. No AMD / Apple / CPU paths.
| Family | Variants | Quantizations |
|---|---|---|
| Qwen3 / Qwen3-MoE | dense + MoE | Q4_K_M, Q6_K, Q8_0, NVFP4, MXFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 (26B-A4B MoE) | MoE | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Llama / Mistral / Mixtral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0), FP8 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Nemotron-H | Mamba2 + Attention + MoE | GGUF |
Tested-and-verified models with VRAM and decode tok/s: docs/supported-models.md.
Decode highlights (greedy, 256 output tokens, 3-rep average, RTX 5090, refreshed 2026-05-10):
- Llama-3.2-3B Q8_0: 306 tok/s
- Nemotron-3-Nano-30B-A3B NVFP4 (hybrid Mamba2+MoE+attention): 325 tok/s
- Qwen3.6-35B-A3B Q4_K_M (MoE): 243 tok/s with `IMP_EXPERT_OVERHEAD_PCT=10`
- Qwen3-Coder-30B-A3B NVFP4: 261 tok/s
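The `IMP_EXPERT_OVERHEAD_PCT=10` setting above reads like an environment variable; a sketch of passing it through the quick-start Docker invocation, assuming it is picked up from the environment (if it is actually an imp.conf key, the imp.conf reference below is the authority; the model filename is a placeholder):

```bash
# Assumes IMP_EXPERT_OVERHEAD_PCT is read from the environment; model path is a placeholder
docker run --gpus all -e IMP_EXPERT_OVERHEAD_PCT=10 \
  -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/qwen3.6-35b-a3b-q4_k_m.gguf
```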
Long-context prefill (`pp=8192`) is consistently ahead of llama.cpp on dense models: 1.13× to 1.70× across the models in docs/performance.md. NVFP4 prequant decode (Qwen3.6, Gemma-4, Qwen3-Coder) lands at 200–260 tok/s.
Full numbers, methodology, and the tests/perf_baseline.json regression gate: docs/performance.md.
Caveats. Numbers are from one machine, one run series. Prefill (`pp512`) shows up to 2.6× variance across container restarts due to cuBLAS algo selection — the docs use decode (`tg256`) for any A/B comparison. A different RTX 5090, different driver, different CUDA build, or different llama.cpp commit will produce different numbers.
```bash
# Inside the dev container, or with CUDA 13.2+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```

Full build options, test commands, and verify-gate setup: docs/usage.md.
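Outside Docker, the built server can be pointed at a local model the same way as the container. A sketch, assuming the binary ends up at `build/imp-server` (the output path is an assumption; only `imp-server` and `--model` appear elsewhere in this README):

```bash
# Assumed binary location; --model matches the Docker invocation above
./build/imp-server --model ./models/your-model.gguf
```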
| Document | Description |
|---|---|
| Usage & reference | Build, server, CLI, C API |
| Supported models | Tested model families with VRAM + tok/s |
| Quantization | GGUF Q*_K, NVFP4, MXFP4, FP8 KV — formats, pipelines, trade-offs |
| Performance | Decode + prefill throughput, methodology |
| imp.conf reference | All runtime configuration keys |
| sm_120a kernels | Kernel optimization notes |
| Roadmap | Open bugs and in-flight performance work |
| Changelog | Per-release notes |
See CONTRIBUTING.md for build, test, and PR workflow.
MIT — see LICENSE.
Built by @kekzl with Claude Code as a long-running experiment.
Stands on the shoulders of llama.cpp — the GGUF format, the GGML quantization schemes, and most of the practical conventions for local LLM inference were established there.
Heavy use of CUTLASS for SM120 FMHA, NVFP4 / MXFP4 GEMM, and grouped MoE kernels. Other references: Flash Attention 2, EAGLE, NVIDIA Model Optimizer, llm-compressor.