imp

A from-scratch CUDA inference engine for NVIDIA Blackwell.
Single-GPU, sm_120a, native NVFP4 weight + KV — ~200 tok/s decode on Qwen3.6-35B-A3B NVFP4 MoE.
Every line written by Claude Code.

License: MIT · CUDA 13.2+ · C++20 · Status: experimental


What it is

A from-scratch CUDA inference engine for one architecture: NVIDIA Blackwell, compute capability 12.0. No portability layer, no FP16 dequant fallback in the hot path, no wrapper around llama.cpp or vLLM. imp ships its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet, CUDA Graphs, and an OpenAI- + Anthropic-compatible HTTP server.

Every line was generated by an AI coding agent (Claude Code) — a long-running experiment in how far that approach scales on serious systems work.

Why it might be interesting

  • Optimized for one chip family. Hot-path kernels use Blackwell-specific features: PDL, Green Contexts, NVFP4 block-scaled MMA, FP8 MMA kind::f8f6f4, TMA warp-specialized grouped GEMM, packed cvt.e4m3x2. No SM80 / SM90 fallback paths.
  • Native NVFP4 weight + KV. SafeTensors NVFP4 prequant (NVIDIA Model Optimizer + llm-compressor) loads directly into NVFP4 tensor-core kernels. The NVFP4 KV cache (--kv-nvfp4, shown in the sketch below this list) compresses context 3.9× with decode tok/s at parity: Qwen3-8B Q8 fits 40k ctx in the same VRAM that holds 16k as FP16.
  • Gated DeltaNet + Mamba2 + MoE hybrids. Qwen3.5, Qwen3.6, Nemotron-H run as full hybrids (GDN / Mamba2 + attention + MoE) with fused multi-token recurrent scans and register-cached state.
  • OpenAI + Anthropic API surface. imp-server speaks /v1/chat/completions and /v1/messages (streaming + non-streaming) with prefix caching, JSON-schema constraining, and tool calling.
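
A combined sketch of the --kv-nvfp4 flag and the Anthropic-style endpoint (the model filename and prompt are placeholders; the flag and endpoint path are the ones listed above):

# Serve with the NVFP4 KV cache enabled
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/qwen3-8b-q8_0.gguf --kv-nvfp4

# Non-streaming request against the Anthropic-compatible endpoint
curl -s http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'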

Status

Experimental. The codebase is single-author / single-target / single-GPU. There are open bugs (see TODO.md), some quantization paths produce coherent output only on specific model families, and prefill numbers vary by up to 2.6× across container restarts because of cuBLAS autotuning. Don't deploy this anywhere it matters.

Quickstart

Everything runs in Docker; no local CUDA toolkit needed.

# 1. Clone
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Drop a GGUF or SafeTensors model into ./models/
mkdir -p models
# (Example: any *.gguf or NVFP4 prequant SafeTensors directory)

# 3. Build the server image
docker compose build imp-server

# 4. Serve it
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

# 5. Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
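
Both endpoint families also stream. As a sketch, the same request with the OpenAI-style "stream" field returns server-sent events, assuming that field is honored as the API-surface section above describes (curl -N disables output buffering):

# Optional: stream the same completion as server-sent events
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64,"stream":true}'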

See docs/usage.md for the full CLI reference, server flags, and C-API embedding guide.

Supported hardware

Target. NVIDIA Blackwell sm_120 family — GeForce RTX 5090 / 5080 / 5070 Ti, RTX PRO 6000 Blackwell.

Tested. RTX 5090 only (GB202, 32 GB GDDR7). Every perf number in this repo is from one machine.

Fatbin layout.

  • arch=compute_120a, code=sm_120a — SASS for GB202 (RTX 5090, RTX PRO 6000). Architecture-specific feature set (NVFP4 block-scaled MMA, FP8 MMA kind::f8f6f4, TMA warp-specialized grouped GEMM).
  • arch=compute_120f, code=compute_120f — family-portable PTX, JIT-compiled on load for GB203 (RTX 5080, 5070 Ti). Loses TMA-WS grouped GEMM tactics; CUTLASS picks the next-best fallback automatically, but the NVFP4 fast-path on Mamba2 shapes is degraded. Disable with -DIMP_DISABLE_120F_FALLBACK=ON for a smaller, RTX 5090-only fatbin.
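
In nvcc terms, the layout above corresponds roughly to the following gencode pair (a sketch; the real arch list lives in the project's CMake, and the file name is a placeholder):

# SASS for GB202 plus family-portable PTX for GB203 (JIT-compiled at load)
nvcc -O3 \
  -gencode arch=compute_120a,code=sm_120a \
  -gencode arch=compute_120f,code=compute_120f \
  -c kernels.cu -o kernels.o

# RTX 5090-only build without the 120f PTX fallback
cmake -B build -DCMAKE_BUILD_TYPE=Release -DIMP_DISABLE_120F_FALLBACK=ON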

Out of scope. No support for Hopper, Ada, Ampere, or earlier. No AMD / Apple / CPU paths.

Supported models

| Family | Variants | Quantizations |
| --- | --- | --- |
| Qwen3 / Qwen3-MoE | dense + MoE | Q4_K_M, Q6_K, Q8_0, NVFP4, MXFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 (26B-A4B MoE) | MoE | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Llama / Mistral / Mixtral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0), FP8 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Nemotron-H | Mamba2 + attention + MoE | GGUF |

Tested-and-verified models with VRAM and decode tok/s: docs/supported-models.md.

Performance

Decode highlights (greedy, 256 output tokens, 3-rep average, RTX 5090, refreshed 2026-05-10):

  • Llama-3.2-3B Q8_0: 306 tok/s
  • Nemotron-3-Nano-30B-A3B NVFP4 (hybrid Mamba2+MoE+attention): 325 tok/s
  • Qwen3.6-35B-A3B Q4_K_M (MoE): 243 tok/s with IMP_EXPERT_OVERHEAD_PCT=10 (launch sketch below this list)
  • Qwen3-Coder-30B-A3B NVFP4: 261 tok/s
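
A sketch of passing the IMP_EXPERT_OVERHEAD_PCT=10 setting from the Qwen3.6 row, assuming it is read from the environment as its name suggests (the model filename is a placeholder):

# Launch matching the Qwen3.6-35B-A3B Q4_K_M row above
docker run --gpus all -v ./models:/models -p 8080:8080 \
  -e IMP_EXPERT_OVERHEAD_PCT=10 \
  imp:latest --model /models/qwen3.6-35b-a3b-q4_k_m.gguf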

Long-context prefill (pp=8192) is consistently ahead of llama.cpp on dense models: 1.13× to 1.70× across the models in docs/performance.md. NVFP4 prequant decode (Qwen3.6, Gemma-4, Qwen3-Coder) ranges from 200 to 260 tok/s.

Full numbers, methodology, and the tests/perf_baseline.json regression gate: docs/performance.md.

Caveats. Numbers are from one machine, one run series. Prefill (pp512) varies by up to 2.6× across container restarts due to cuBLAS algorithm selection, so the docs use decode (tg256) for any A/B comparison. A different RTX 5090, different driver, different CUDA build, or different llama.cpp commit will produce different numbers.

Building from source

# Inside the dev container, or with CUDA 13.2+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Full build options, test commands, and verify-gate setup: docs/usage.md.

Documentation

| Document | Description |
| --- | --- |
| Usage & reference | Build, server, CLI, C API |
| Supported models | Tested model families with VRAM + tok/s |
| Quantization | GGUF Q*_K, NVFP4, MXFP4, FP8 KV: formats, pipelines, trade-offs |
| Performance | Decode + prefill throughput, methodology |
| imp.conf reference | All runtime configuration keys |
| sm_120a kernels | Kernel optimization notes |
| Roadmap | Open bugs and in-flight performance work |
| Changelog | Per-release notes |

Contributing

See CONTRIBUTING.md for build, test, and PR workflow.

License

MIT — see LICENSE.

Acknowledgements

Built by @kekzl with Claude Code as a long-running experiment.

Stands on the shoulders of llama.cpp — the GGUF format, the GGML quantization schemes, and most of the practical conventions for local LLM inference were established there.

Heavy use of CUTLASS for SM120 FMHA, NVFP4 / MXFP4 GEMM, and grouped MoE kernels. Other references: Flash Attention 2, EAGLE, NVIDIA Model Optimizer, llm-compressor.
