A from-scratch CUDA inference engine for NVIDIA Blackwell.
Single-GPU, sm_120a, native NVFP4 weight + KV — ~200 tok/s decode on Qwen3.6-35B-A3B NVFP4 MoE.
Every line written by Claude Code.
A from-scratch CUDA inference engine for one architecture: NVIDIA Blackwell, compute capability 12.0. No portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet, CUDA Graphs, and an OpenAI- + Anthropic-compatible HTTP server.
Every line was generated by an AI coding agent (Claude Code) — a long-running experiment in how far that approach scales on serious systems work.
- Optimized for one chip family. Hot-path kernels use Blackwell-specific features: PDL, Green Contexts, NVFP4 block-scaled MMA, FP8 MMA `kind::f8f6f4`, TMA warp-specialized grouped GEMM, packed `cvt.e4m3x2`. No SM80 / SM90 fallback paths.
- Native NVFP4 weight + KV. SafeTensors NVFP4 prequant (NVIDIA Model Optimizer + llm-compressor) loads directly into NVFP4 tensor-core kernels. The NVFP4 KV cache (`--kv-nvfp4`) compresses context 3.9× at parity decode tok/s — Qwen3-8B Q8 fits 40k ctx in the same VRAM that holds 16k as FP16.
- Gated DeltaNet + Mamba2 + MoE hybrids. Qwen3.5, Qwen3.6, Nemotron-H run as full hybrids (GDN / Mamba2 + attention + MoE) with fused multi-token recurrent scans and register-cached state.
- OpenAI + Anthropic API surface. `imp-server` speaks `/v1/chat/completions` and `/v1/messages` (streaming + non-streaming) with prefix caching, JSON-schema constraining, and tool calling.
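A minimal sketch of a `/v1/messages` call, assuming the request body follows the public Anthropic Messages API shape (`model`, `max_tokens`, `messages`); the model name here is just a placeholder for whatever `--model` loaded:

```bash
# Sketch: Anthropic-style endpoint. "model" is a placeholder; imp serves
# whatever was passed to --model at startup.
curl -s http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"local","max_tokens":64,"messages":[{"role":"user","content":"Hello!"}]}'
```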
Experimental. The codebase is single-author / single-target / single-GPU. There are open bugs (see TODO.md), some quantization paths produce coherent output only on specific model families, and prefill numbers vary up to 2.6× across container restarts because of cuBLAS autotuning. Don't deploy this anywhere it matters.
Everything runs in Docker; no local CUDA toolkit needed.
```bash
# 1. Clone
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Drop a GGUF or SafeTensors model into ./models/
mkdir -p models
# (Example: any *.gguf or NVFP4 prequant SafeTensors directory)

# 3. Build the server image
docker compose build imp-server

# 4. Serve it
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

# 5. Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
```

See docs/usage.md for the full CLI reference, server flags, and C-API embedding guide.
Target. NVIDIA Blackwell sm_120 family — GeForce RTX 5090 / 5080 / 5070 Ti, RTX PRO 6000 Blackwell.
Tested. RTX 5090 only (GB202, 32 GB GDDR7). Every perf number in this repo is from one machine.
Fatbin layout.
- `arch=compute_120a, code=sm_120a` — SASS for GB202 (RTX 5090, RTX PRO 6000). Architecture-specific feature set (NVFP4 block-scaled MMA, FP8 MMA `kind::f8f6f4`, TMA warp-specialized grouped GEMM).
- `arch=compute_120f, code=compute_120f` — family-portable PTX, JIT-compiled on load for GB203 (RTX 5080, 5070 Ti). Loses TMA-WS grouped GEMM tactics; CUTLASS picks the next-best fallback automatically, but the NVFP4 fast path on Mamba2 shapes is degraded. Disable with `-DIMP_DISABLE_120F_FALLBACK=ON` for a smaller, RTX 5090-only fatbin.
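To build only the sm_120a SASS path, the `-DIMP_DISABLE_120F_FALLBACK=ON` option named above combines with the standard configure step from the Build section; a sketch:

```bash
# Drop the compute_120f PTX fallback: smaller fatbin, GB202 (RTX 5090 / RTX PRO 6000) only
cmake -B build -DCMAKE_BUILD_TYPE=Release -DIMP_DISABLE_120F_FALLBACK=ON
cmake --build build -j$(nproc)
```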
Out of scope. No support for Hopper, Ada, Ampere, or earlier. No AMD / Apple / CPU paths.
| Family | Variants | Quantizations |
|---|---|---|
| Qwen3 / Qwen3-MoE | dense + MoE | Q4_K_M, Q6_K, Q8_0, NVFP4, MXFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 (26B-A4B MoE) | MoE | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Llama / Mistral / Mixtral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0), FP8 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Nemotron-H | Mamba2 + Attention + MoE | GGUF |
Tested-and-verified models with VRAM and decode tok/s: docs/supported-models.md.
Decode highlights (greedy, 256 output tokens, 3-rep average, RTX 5090, refreshed 2026-05-10):
- Llama-3.2-3B Q8_0: 306 tok/s
- Nemotron-3-Nano-30B-A3B NVFP4 (hybrid Mamba2+MoE+attention): 325 tok/s
- Qwen3.6-35B-A3B Q4_K_M (MoE): 243 tok/s with `IMP_EXPERT_OVERHEAD_PCT=10`
- Qwen3-Coder-30B-A3B NVFP4: 261 tok/s
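The `IMP_EXPERT_OVERHEAD_PCT=10` setting above reads like an environment variable; a sketch of passing it through the quick-start Docker invocation, assuming it is picked up from the environment (if it is actually an imp.conf key, the imp.conf reference below is the authority; the model filename is a placeholder):

```bash
# Assumes IMP_EXPERT_OVERHEAD_PCT is read from the environment; model path is a placeholder
docker run --gpus all -e IMP_EXPERT_OVERHEAD_PCT=10 \
  -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/qwen3.6-35b-a3b-q4_k_m.gguf
```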
Long-context prefill (`pp=8192`) is consistently ahead of llama.cpp on dense models: 1.13× to 1.70× across the models in docs/performance.md. NVFP4 prequant decode (Qwen3.6, Gemma-4, Qwen3-Coder) lands at 200–260 tok/s.
Full numbers, methodology, and the tests/perf_baseline.json regression gate: docs/performance.md.
Caveats. Numbers are from one machine, one run series. Prefill (`pp512`) shows up to 2.6× variance across container restarts due to cuBLAS algo selection — the docs use decode (`tg256`) for any A/B comparison. A different RTX 5090, different driver, different CUDA build, or different llama.cpp commit will produce different numbers.
```bash
# Inside the dev container, or with CUDA 13.2+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```

Full build options, test commands, and verify-gate setup: docs/usage.md.
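Outside Docker, the built server can be pointed at a local model the same way as the container. A sketch, assuming the binary ends up at `build/imp-server` (the output path is an assumption; only `imp-server` and `--model` appear elsewhere in this README):

```bash
# Assumed binary location; --model matches the Docker invocation above
./build/imp-server --model ./models/your-model.gguf
```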
| Document | Description |
|---|---|
| Usage & reference | Build, server, CLI, C API |
| Supported models | Tested model families with VRAM + tok/s |
| Quantization | GGUF Q*_K, NVFP4, MXFP4, FP8 KV — formats, pipelines, trade-offs |
| Performance | Decode + prefill throughput, methodology |
| imp.conf reference | All runtime configuration keys |
| sm_120a kernels | Kernel optimization notes |
| Roadmap | Open bugs and in-flight performance work |
| Changelog | Per-release notes |
See CONTRIBUTING.md for build, test, and PR workflow.
MIT — see LICENSE.
Built by @kekzl with Claude Code as a long-running experiment.
Stands on the shoulders of llama.cpp — the GGUF format, the GGML quantization schemes, and most of the practical conventions for local LLM inference were established there.
Heavy use of CUTLASS for SM120 FMHA, NVFP4 / MXFP4 GEMM, and grouped MoE kernels. Other references: Flash Attention 2, EAGLE, NVIDIA Model Optimizer, llm-compressor.