All numbers come from one machine, one run series. Reproducing them on a different RTX 5090, driver, or imp:test build will give different numbers — sometimes meaningfully so. The CI gate in tests/perf_baseline.json is the authoritative regression check; everything below is descriptive.
| Setting | Value |
|---|---|
| Hardware | Single NVIDIA RTX 5090, 32 GB GDDR7, Blackwell sm_120f, custom water loop |
| Toolchain | CUDA 13.2.1, CUTLASS v4.4.2, GCC 13, RelWithDebInfo or Release Docker build |
| imp config | NVFP4 decode cache + FP8 prefill (non-GDN) / FP16 prefill (GDN), CUDA Graphs on (where the model supports it) |
| llama.cpp | b8445+, flash attention on, full offload (-ngl 99) |
| Sampling | Greedy (temp = 0) |
| Repetitions | 3 (decode); pp512 numbers vary by up to 2.6× across container restarts due to cuBLAS algorithm selection |
| Reported | Mean of repetitions; decode (tg256) is the reliable A/B signal |
Refresh the CI baseline with scripts/gen_perf_baseline.sh after any intentional perf change.
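To make the "decode is the reliable A/B signal" rule concrete, here is a minimal sketch of the kind of tolerance check a CI gate like this performs. The schema of `tests/perf_baseline.json`, the `check_decode_regression` helper, and the 10% tolerance are assumptions for illustration, not the actual gate.

```python
# Hedged sketch of a decode-only regression check; schema and tolerance are assumed.
import json

TOLERANCE = 0.10  # assumed: flag drops larger than 10% on decode (tg256)

def check_decode_regression(measured: dict[str, float],
                            baseline_path: str = "tests/perf_baseline.json") -> list[str]:
    """Compare measured decode tok/s against the stored baseline.

    Only decode is compared: prefill (pp512) can swing by up to 2.6x across
    container restarts due to cuBLAS algorithm selection.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)  # assumed shape: {"Qwen3-4B Q8_0": 236.0, ...}
    failures = []
    for model, expected in baseline.items():
        got = measured.get(model)
        if got is not None and got < expected * (1.0 - TOLERANCE):
            failures.append(f"{model}: {got:.0f} tok/s vs baseline {expected:.0f} "
                            f"(allowed drop {TOLERANCE:.0%})")
    return failures
```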
imp numbers refreshed: 2026-05-10 (post PR #156 chunked-prefill-hybrid + PR #157 auto max_seq_len 16K cap). llama.cpp comparison columns are from earlier captures; treat as historical reference.
Decode (tg256): tokens generated per second — the metric that determines how fast a model responds.
| Model | Params | Quant | imp | llama.cpp | Notes |
|---|---|---|---|---|---|
| Qwen3-4B | 4.0B | Q8_0 | 236 | 244 | |
| Qwen3-8B | 8.2B | Q8_0 | 149 | 157 | |
| Qwen3.5-4B (GDN) | 4.0B | Q8_0 | 222 | 180 | |
| Qwen3.5-9B (GDN) | 9.2B | Q8_0 | 142 | — | |
| Llama-3.2-3B | 3.2B | Q8_0 | 306 | — | |
| Qwen3-Coder-30B-A3B | 30B (3B active) | NVFP4 | 261 | — | post PR #88 (was 51 with --no-graphs) |
| Qwen3-Coder-30B-A3B | 30B (3B active) | Q6_K | 236 | — | post moe_expert_offload_fix |
| Qwen3.6-35B-A3B | 35B (3B active) | Q4_K_M | 243 | — | IMP_EXPERT_OVERHEAD_PCT=10 |
| Qwen3.6-35B-A3B | 35B (3B active) | NVFP4 | 225 | — | post PR #88 (was 117–142) |
| Nemotron-3-Nano-30B-A3B | 30B (3B active) | NVFP4 | 325 | — | hybrid Mamba2+MoE+attention |
| Gemma-4-26B-A4B-it | 26B (4B active) | Q4_K_M | 187 | 151 | |
| Gemma-4-26B-A4B-it | 26B (4B active) | NVFP4 | 205 | — | post PR #88 (was 157–180) |
| Gemma-4-26B-A4B-it | 26B (4B active) | Q5_K_M | 65 | — | not re-tested 2026-05-10 |
| Mistral-Small-3.2 | 24B | NVFP4 | 101 | — | not re-tested 2026-05-10 |
Prefill (pp512): tokens processed per second during the prompt ingestion phase.
| Model | Params | Quant | imp | llama.cpp | Notes |
|---|---|---|---|---|---|
| Qwen3-4B | 4.0B | Q8_0 | 23189 | 21337 | |
| Qwen3-8B | 8.2B | Q8_0 | 14453 | 14172 | |
| Qwen3.5-4B (GDN) | 4.0B | Q8_0 | 14091 | 11149 | |
| Qwen3.5-9B (GDN) | 9.2B | Q8_0 | 10162 | — | |
| Llama-3.2-3B | 3.2B | Q8_0 | 27041 | — | |
| Qwen3-Coder-30B-A3B | 30B (3B active) | NVFP4 | 1258 | — | cuBLAS variance ±2.6× |
| Qwen3-Coder-30B-A3B | 30B (3B active) | Q6_K | 5643 | — | |
| Qwen3.6-35B-A3B | 35B (3B active) | Q4_K_M | 3076 | — | IMP_EXPERT_OVERHEAD_PCT=10 |
| Qwen3.6-35B-A3B | 35B (3B active) | NVFP4 | 1092 | — | |
| Nemotron-3-Nano-30B-A3B | 30B (3B active) | NVFP4 | 690 | — | hybrid Mamba2+MoE+attention |
| Gemma-4-26B-A4B-it | 26B (4B active) | Q4_K_M | 1840 | 196 | |
| Gemma-4-26B-A4B-it | 26B (4B active) | NVFP4 | 1472 | — | |
| Gemma-4-26B-A4B-it | 26B (4B active) | Q5_K_M | 88 | — | not re-tested 2026-05-10 |
| Mistral-Small-3.2 | 24B | NVFP4 | 12804 | — | not re-tested 2026-05-10 |
Gemma-4 notes: CUDA Graphs are now enabled (PRs #11–#14 unified forward_decode_async, PR #20 rope_freqs fix, 2026-04-20 SWA long-context fix). Decode is now 1.21× llama.cpp on Q4_K_M. The previous gap came from two separate bugs: the pipeline kernel's split-K path issued only one 16-byte cp.async per load (missing half the data at head_dim=512 on global layers), and the cuBLAS dispatch gate forced global layers through a broken FMHA fallback above n=1024. Prefill remains dominated by the CUTLASS grouped-GEMM advantage over llama.cpp's serial expert processing. Q5_K_M is recommended when output quality matters on complex prompts — Q4_K_M can degenerate on code-gen.
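The grouped-vs-serial expert point above is the crux of the prefill advantage. The numpy sketch below is only a conceptual contrast between looping one small GEMV per token and grouping tokens by expert before a batched GEMM; it is not the CUTLASS grouped-GEMM path, and the top-1 routing is a simplification.

```python
# Conceptual contrast: per-token serial expert dispatch vs grouping tokens by expert.
import numpy as np

def moe_serial(x, experts, assign):
    # One small GEMV per token -- many tiny launches in a real engine.
    return np.stack([experts[assign[i]] @ x[i] for i in range(len(x))])

def moe_grouped(x, experts, assign):
    # Gather tokens per expert, run one batched GEMM per expert -- few large launches.
    out = np.empty((len(x), experts.shape[1]))
    for e in range(len(experts)):
        idx = np.where(assign == e)[0]
        if len(idx):
            out[idx] = x[idx] @ experts[e].T
    return out

x = np.random.randn(64, 32)                # 64 prefill tokens, hidden=32
experts = np.random.randn(8, 16, 32)       # 8 experts, out=16
assign = np.random.randint(0, 8, size=64)  # top-1 routing for simplicity
assert np.allclose(moe_serial(x, experts, assign), moe_grouped(x, experts, assign))
```

During decode only one token is in flight, so there is nothing to group — which is why decode on MoE models falls back to per-expert GEMV regardless of how good the grouped prefill path is.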
Note: GDN models now use FP16 prefill weights (v0.5.1) instead of FP8 for numerical stability. This reduces prefill throughput by ~8% vs v0.5 FP8 numbers but fixes multi-turn chat degeneration.
| Scenario | imp v0.5 | imp v0.5.1 | llama.cpp |
|---|---|---|---|
| Single-turn | ✅ correct | ✅ correct | ✅ correct |
| 2-turn chat | ❌ degenerate | ✅ correct | ✅ correct |
| 5-turn chat (4B) | ❌ garbage | ✅ correct | ✅ correct |
| 7-turn chat (9B) | ❌ garbage | ✅ correct | ✅ correct |
v0.5.1 root cause: FP8 weight precision + chunked prefill state management.
v0.6 root cause (Qwen3.5 "broken" output): The Jinja2 engine lacked {% macro %} support. Qwen3.5's chat template uses a render_content macro for multimodal content handling — without macro support, user content rendered as "None", causing the model to ignore prompts. Fixed in v0.6 with full Jinja2 macro support.
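To see why missing `{% macro %}` support blanks out user content, here is a minimal stand-in template rendered with the reference jinja2 library. The template is a simplified illustration in the same spirit as a `render_content` macro, not Qwen3.5's actual chat template.

```python
# Illustration of the macro dependency, using the reference jinja2 library.
from jinja2 import Template

template_src = """
{%- macro render_content(content) -%}
{%- if content is string -%}{{ content }}{%- else -%}
{%- for part in content -%}{{ part.text }}{%- endfor -%}
{%- endif -%}
{%- endmacro -%}
{%- for m in messages -%}
<|{{ m.role }}|>{{ render_content(m.content) }}
{% endfor -%}
"""

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
print(Template(template_src).render(messages=messages))
# -> <|user|>Write a haiku about GPUs.
# An engine that parses {% macro %} blocks but never registers the macro body
# resolves render_content(...) to an undefined value, so the user turn renders
# as "None" and the model effectively receives an empty prompt.
```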
Before v0.7 the FP8 FMHA prefill path at n > 1024 emitted NaN on every
attention layer due to a shared-memory S_tile overlap (fixed in PR #33). The
cliff was invisible to all pp512 / pp1024 benches because those lengths
dispatch to cuBLAS attention. Post-fix numbers below verify the path is not
only correct but also competitive with llama.cpp across the 2K–8K range.
All measurements: RTX 5090, greedy, 2-rep average, tokens/sec.
| Model | pp512 | pp1024 | pp2048 | pp4096 | pp8192 |
|---|---|---|---|---|---|
| Qwen3-4B Q8_0 — imp v0.7 | 22 984 | 27 115 | 18 880 | 13 568 | 13 566 |
| llama.cpp | 15 786 | 12 437 | 13 083 | 11 009 | 7 978 |
| speedup | ×1.46 | ×2.18 | ×1.44 | ×1.23 | ×1.70 |
| Qwen3-8B Q8_0 — imp v0.7 | 13 849 | 17 428 | 13 999 | 11 105 | 11 050 |
| llama.cpp | 11 349 | 11 172 | 10 079 | 8 755 | 6 749 |
| speedup | ×1.22 | ×1.56 | ×1.39 | ×1.27 | ×1.64 |
| Qwen3-32B Q4_K_M — imp v0.7 | 1 932 | 2 316 | 2 301 | 2 040 | 2 040 |
| llama.cpp | 3 094 | 2 929 | 2 684 | 2 302 | 1 802 |
| speedup | ×0.62 | ×0.79 | ×0.86 | ×0.89 | ×1.13 |
| Mistral-24B Q6_K — imp v0.7 | 2 092 | 2 906 | 3 312 | 3 591 | 3 595 |
| llama.cpp | 3 914 | 3 855 | 3 683 | 3 469 | 3 058 |
| speedup | ×0.53 | ×0.75 | ×0.90 | ×1.04 | ×1.18 |
| Qwen3.5-4B GDN Q8_0 — imp v0.7 | 13 494 | 14 778 | 13 487 | 13 016 | 13 090 |
Observations:
- pp=8192 is imp's strongest point — ×1.13 to ×1.70 faster than llama.cpp on every model tested. Pre-v0.7 this range was garbage.
- Qwen3-4B/8B show a 1024→2048 throughput dip (27 k → 19 k tok/s on 4B) because the dispatcher switches from cuBLAS attention to FP8 FMHA at n=1024. Output remains correct; smoothing the cliff is future work (raise the cuBLAS cap or tune the FP8-FMHA kernel). A sketch of the dispatch decision follows this list.
- Qwen3-32B Q4_K_M is weight-bound — throughput is flat across lengths because the dense GEMMs dominate over attention cost.
- GDN (Qwen3.5-4B) is flat by design — O(n) prefill, not O(n²); only 8 of 32 layers are attention, so the FMHA fix barely shows up in these numbers but is still required for correct output.
- pp=512 on large dense models (Qwen3-32B, Mistral-24B) is ~0.5–0.6× llama.cpp — a known cuBLAS autotuning / launch-overhead issue unrelated to this release.
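The dispatch decision behind the 1024→2048 cliff is essentially a length threshold. The sketch below is hedged: the threshold constant and names are assumptions for illustration, not imp's internal API.

```python
# Hedged sketch of the length-based attention dispatch described above.
CUBLAS_ATTENTION_CAP = 1024  # assumed: chunks at or below this use cuBLAS attention

def attention_backend(n_tokens: int) -> str:
    """Return which prefill attention path a chunk of n_tokens dispatches to."""
    return "cublas" if n_tokens <= CUBLAS_ATTENTION_CAP else "fp8_fmha"

for n in (512, 1024, 2048, 4096, 8192):
    print(n, attention_backend(n))
# 512/1024 -> cublas (the fast side of the cliff); 2048+ -> fp8_fmha. This is
# also why the pre-v0.7 FP8 FMHA NaNs were invisible to pp512/pp1024 benches.
```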
| KV Cache | Decode tg256 | Prefill pp512 | tg @ 5K ctx | tg @ 20K ctx | VRAM |
|---|---|---|---|---|---|
| FP16 (default since PR #51) | 319 | 25808 | 213 | 156 | 100% |
| FP8 E4M3 (`--kv-fp8`) | 319 | 25808 | 213 | 156 | 50% |
| INT4 (`--kv-int4`) | 305 | 16272 | 190 | 122 | 25% |
| TurboQuant (`--kv-turboquant`) | 256 | 23376 | — | 70 | ~15% |
| TurboQuant Lite | 258 | 15097 | — | 91 | ~12% |
Note: Default is FP16 as of PR #51 (was implicit auto-FP8). FP8 has
perf parity with FP16 on Qwen3 and Qwen3.5/3.6 GDN with the
auto-deterministic-cuBLAS gate from PR #52, but breaks Llama, Mistral-Small-3.1,
DeepSeek-R1-Distill out of the box. Use --kv-fp8 per-model after testing.
Memo: memory/kv_dtype_tradeoffs_2026_04_24.md.
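A minimal sketch of what "use --kv-fp8 per-model after testing" amounts to in practice. The model-family lists come from the note above; the `pick_kv_dtype` helper and the substring matching are hypothetical, not imp's configuration API.

```python
# Hedged per-model KV dtype choice based on the note above.
KV_FP8_KNOWN_GOOD = ("qwen3", "qwen3.5", "qwen3.6")                   # perf parity with FP16
KV_FP8_KNOWN_BROKEN = ("llama", "mistral-small-3.1", "deepseek-r1-distill")

def pick_kv_dtype(model_name: str, want_fp8: bool) -> str:
    name = model_name.lower()
    if want_fp8 and any(f in name for f in KV_FP8_KNOWN_BROKEN):
        return "fp16"      # --kv-fp8 is known to break these out of the box
    if want_fp8 and any(f in name for f in KV_FP8_KNOWN_GOOD):
        return "fp8_e4m3"
    return "fp16"          # default since PR #51
```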
- Qwen3.5 GDN: Gated DeltaNet hybrid architecture (24 GDN + 8 attention + 32 FFN layers). Output quality matches llama.cpp for both single-turn and multi-turn.
- TurboQuant: PolarQuant INT4 K directions + QJL sketch correction + INT4 V. MXFP4 variant available on sm_120+. A generic INT4 KV round-trip sketch (not the full PolarQuant/QJL pipeline) follows this list.
- Prefill variance: cuBLAS autotuning can cause up to 2.6x variance in prefill numbers between container restarts. Decode numbers are stable. Compare decode only for reliable A/B testing.
- MXFP4 Prefill: CUTLASS block-scaled GEMM for prefill (`--mxfp4-prefill`). Currently ~10% slower than FP8 cuBLASLt for Q8_0 models due to activation quantization overhead.
- Qwen3-Coder-30B-A3B: NVIDIA Model Optimizer NVFP4 prequant (128 experts, 8 active). Loaded from SafeTensors. Decode uses per-expert NVFP4 GEMV (serial dispatch); prefill uses CUTLASS NVFP4 GEMM for dense + per-expert NVFP4 GEMV for MoE. Multi-turn chat verified working.
- Gemma-4 output quality: Q4_K_M can degenerate on complex code-gen prompts (Fibonacci → backtick loop). Root cause is accumulated FP16 drift over 30 layers, not a single-layer bug. Q5_K_M and Q8_0 produce clean output — use those when quality matters. Long context up to ~11800 tokens supported with `--min-kv-tokens 14000` (from the 2026-04-20 KV-budget fix).
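Where the 25% VRAM figure for the INT4 row in the KV-cache table comes from: a generic symmetric INT4 round-trip with per-token scales. This is a sketch of the memory arithmetic only, assuming plain per-token symmetric quantization; it is not the PolarQuant/QJL TurboQuant scheme.

```python
# Generic symmetric INT4 round-trip for a KV slab (illustrative, not TurboQuant).
import numpy as np

def int4_roundtrip(kv: np.ndarray):
    """Quantize a [tokens, head_dim] FP16 KV slab to INT4 and back."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 7.0 + 1e-8  # per-token scale
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)     # 4-bit range [-8, 7]
    packed_bytes = q.size // 2                                    # two nibbles per byte
    return q.astype(np.float16) * scale, packed_bytes

kv = np.random.randn(256, 128).astype(np.float16)
deq, nbytes = int4_roundtrip(kv)
print(f"fp16: {kv.nbytes} B, int4 payload: {nbytes} B (+ {kv.shape[0] * 2} B of scales)")
# -> 65536 B fp16 vs 16384 B int4 payload: ~25% before counting the small scale overhead.
```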
| Spec | Value |
|---|---|
| GPU (this benchmark) | NVIDIA GeForce RTX 5090 |
| Other supported GPUs (same sm_120f) | RTX PRO 5000 Blackwell (48 GB), RTX PRO 6000 Blackwell (96 GB) |
| Architecture | Blackwell (GB202), sm_120 |
| VRAM | 32 GB GDDR7, 512-bit, 1792 GB/s (RTX 5090) |
| SMs | 170 |
| CUDA Cores | 21,760 |
| Tensor Cores | 680 (5th gen) |
| FP4 Tensor | 3,354 TOPS |
| FP8 Tensor | 1,677 TFLOPS |
| L2 Cache | 96 MB |
| TDP | 575 W |
| Cooling | Custom water loop (no thermal throttling) |
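As a back-of-envelope sanity check on the decode numbers, a memory-bandwidth ceiling can be estimated from the spec table, assuming decode is bandwidth-bound and every weight byte is read once per token. The weight sizes below are rough estimates (~8.5 bits/weight for Q8_0), so treat the percentages as ballpark only.

```python
# Naive bandwidth-bound decode ceiling vs measured tg256 (rough estimates only).
BANDWIDTH_GBPS = 1792  # RTX 5090 GDDR7 from the table above

for name, approx_gb, measured in [
    ("Qwen3-4B Q8_0", 4.3, 236),
    ("Qwen3-8B Q8_0", 8.7, 149),
]:
    ceiling = BANDWIDTH_GBPS / approx_gb
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} "
          f"({measured / ceiling:.0%} of the naive bound)")
```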