
Releases: kekzl/imp

v0.9.0 — NVFP4 prefill fast-path, NVFP4 KV cache, BitDecoding parity, NemotronH, sm_120a

10 May 19:16
2ce3093


First release after v0.8.0 — 60+ PRs of NVFP4 maturation, new architectures, and the hybrid-prefill correctness pass.

Highlights

  • NVFP4 MoE prefill fast-path (#160) — Qwen3-Coder-30B-A3B-NVFP4 pp512 1241 → 13046 tok/s (×10.5).
  • NVFP4 KV cache (#108, #125) — opt-in --kv-nvfp4; 16k → 40k tokens at the same VRAM, parity with FP16 baseline after vectorized PTX dequant.
  • BitDecoding TC paged decode (Phases 0-3) (#142, #145, #146, #147, #148, #149) — parity with FP16 baseline (193 vs 193 tok/s) on Qwen3-4B Q8 NVFP4-KV.
  • NemotronH hybrid Mamba2+MoE+Attention NVFP4 (#104, #109) — tg128 42 → 319 tok/s after dynamic NVFP4 MoE reserve.
  • sm_120f → sm_120a build target (#105) — full RTX 5090 feature set (FP8 MMA .kind::f8f6f4, NVFP4 block-scaled MMA, TMA warp-specialized grouped GEMM tactics).
  • 1024 → 4096 prefill cliff closed (#110) — pp4096 +18-28% across Q8 baselines.
  • Chunked prefill correctness (#130) + hybrid GDN+MoE / Mamba2+MoE archs (#156).

Notable additions

  • Native SentencePiece (.model) parser (#128) — drops Python fallback for Mistral-family tokenizers.
  • Multimodal Qwen3.6-VL NVFP4 loader (#152).
  • Zero-config SafeTensors auto-detect (#116).
  • Server: tools + JSON-schema coordination (#103, #112, #119), opt-in --log-requests JSONL (#155).
  • CUDA 13.2 modernization (#131): cudaMemcpyWithAttributesAsync, add.f32x2.
  • GHCR release pipeline (#101).

Known issues (carry-over)

  • NVFP4 MoE prefill ceiling at ~16k tok/s warm vs vLLM single-seq 18.5k = 1.42× gap. Next-step memo in repo.
  • Spec-decode / MTP still off on NVFP4 decode-cache models.
  • CUTLASS NVFP4 sm_120 non-determinism under graph-replay (skip-guard retained for llm-compressor exports).
  • Prefill throughput shows up to 2.6× variance between container restarts due to cuBLAS autotuning. Compare decode-only for reliable A/B.

Full notes: CHANGELOG.md.

Docker image (auto-built on this release): ghcr.io/kekzl/imp:0.9.0, :0.9, :0, :latest.

v0.8.0 — NVFP4 prequant production-ready, native tool calling, public release cleanup

03 May 10:58
cac9955


NVFP4-prequant SafeTensors hits production: Mistral-3.2 / Gemma-4 /
Qwen3.6 / Qwen3-Coder all coherent on single-turn, sampling, multi-turn
and short long-context. FP8 KV warmup calibration fixed for Llama and
GDN families. CUDA Graphs lit up for prequant SafeTensors. Forty-plus
PRs since v0.7.0.

Server + tools (PR #97)

  • Native function calling for Gemma-4 + Qwen3.6 — root-cause was a
    tokenizer bug, not just missing parsers. encode_spm / encode_gpt2 /
    encode_gemma4 now run a longest-match pre-split pass against
    CONTROL-flagged added tokens before BPE. Multi-character markers like
    <|tool_call> (Gemma-4 token id 48) were being BPE'd as raw UTF-8
    bytes — the model never saw the trained marker in its prompt's
    tools-rendering and answered with markdown JSON code blocks instead
    of the native protocol. Fixed: token 48/49 round-trip as their
    assigned id. Added parse_tool_calls_gemma() for Gemma's non-JSON
    syntax (<|tool_call>call:NAME{key:value}<tool_call|> with
    <|"|>...<|"|> string escapes), and extended parse_tool_calls_chatml()
    to branch on body shape so Qwen3.6's XML-styled
    <function=...><parameter=...> payload parses too. End-to-end
    verified on Gemma-4 Q4_K_M (finish_reason=tool_calls, 19 tokens
    completion) and Qwen3.6-NVFP4 (finish_reason=tool_calls with
    reasoning_content alongside). A sketch of the pre-split pass follows this list.
  • Faster cold start (24s → 18s on Qwen3.6 NVFP4) — skip MTP / vision-only
    SafeTensors shards when neither is wired up (~5s, 2.4 GiB of
    mmap + header parse + page-cache pressure avoided), MAP_POPULATE +
    MADV_WILLNEED on weight mmaps, pinned staging ring 2x64 MiB →
    4x128 MiB, Pass-2 expert upload re-arms cudaMemGetInfo cache so
    per-tensor checked_cuda_malloc skips ~15k sync calls on 128-expert
    MoE, concurrent SafeTensors shard parse (3 shards in parallel
    threads), exposed name_is_skipped() to deduplicate the shard-skip
    filter and translate_name's skip rules.
  • Server fixes (Open WebUI on Qwen3.6-NVFP4) — UTF-8 boundary walk
    in reasoning stream (German umlauts came out as f��r because the
    7-byte tail-overlap landed mid-multibyte), drop leaked stop tokens
    (<|im_end|> / <|endoftext|>) before the is_last gate, restrict
    "[Reasoning truncated]" notice to finish == "length", post-</think>
    grace 4 → 16 tokens, repetition_penalty default 1.0 → 1.05 to break
    multi-turn loop degeneration, workspace skips FP8 / MXFP4 scratch
    for paths we won't use (~6.4 GiB VRAM headroom on Qwen3.6 NVFP4 GDN).
  • Open WebUI tools enabled in docker-compose — DuckDuckGo web
    search (no API key), Pyodide code interpreter (browser-side, no
    sandbox service), URL fetch, native function calling toggleable
    per message via the chat-input icons.
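
The longest-match pre-split in the first bullet can be sketched roughly as follows (a minimal host-side sketch with made-up types, not imp's actual encode_spm / encode_gpt2 / encode_gemma4 code): scan the input for CONTROL-flagged added tokens, longest marker first, emit those with their trained ids, and hand only the spans in between to normal BPE.

#include <string>
#include <string_view>
#include <utility>
#include <vector>

// token_id >= 0: an added/control token emitted verbatim with its trained id.
// token_id == -1: a plain-text span that still goes through normal BPE.
struct Piece { std::string_view text; int token_id; };

// `added` is assumed to be sorted longest-marker-first so a long marker such as
// <|tool_call> cannot lose to a shorter prefix.
std::vector<Piece> presplit_added_tokens(
        std::string_view input,
        const std::vector<std::pair<std::string, int>>& added) {
    std::vector<Piece> out;
    size_t start = 0, i = 0;
    while (i < input.size()) {
        bool matched = false;
        for (const auto& [marker, id] : added) {
            if (input.compare(i, marker.size(), marker) == 0) {
                if (i > start) out.push_back({input.substr(start, i - start), -1});
                out.push_back({input.substr(i, marker.size()), id});  // keep the trained id
                i += marker.size();
                start = i;
                matched = true;
                break;
            }
        }
        if (!matched) ++i;
    }
    if (start < input.size()) out.push_back({input.substr(start), -1});
    return out;
}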

Fixed

  • FP8 KV warmup-calibration bug (#89) — Engine::warmup() ran a forward
    pass with synthetic BOS tokens; the FP8 write path's online calibration
    treated this as the FIRST prefill, locked kv_scales_[layer] to a
    too-small absmax, and never recalibrated. Real generation then overflowed
    FP8 dynamic range on Llama-3.2-3B Q8_0 and Qwen3.5-4B GDN Q8_0 (output
    degenerated within ~30 tokens, e.g. " France, and, 2008, 201, 201, …").
    Fix: Engine::warmup() drops the kv_calibrated_ flags at end-of-warmup;
    the FP8 write path promotes the scale monotonically via std::max so
    the warmup observation survives if it's already wider, and real prefill
    widens it further when needed. Long generation (100 tokens) on
    Llama-3.2-3B FP8 KV now produces a clean factually-correct list of world
    capitals. A sketch of the scale-promotion logic appears after this list.
  • NVFP4 prequant CUTLASS prefill cache (#88) — Phase 0 promotion sets
    Tensor.qtype = NVFP4 directly on the main weight tensors, but Phase 3b
    (CUTLASS cache build) only iterated the legacy wcache_.nvfp4 map.
    Prequant SafeTensors prefill therefore fell through to gemm_nvfp4
    dequant→cuBLAS, allocating ~40 MiB FP16 scratch per layer per prefill —
    graph-incompatible AND noisy on SmoothQuant-calibrated Mistral-3.2-NVFP4.
    Phase 0b loop registers all dense + out_proj_ prequant tensors in
    cutlass_nvfp4 directly. Standard pp512/tg256 bench post-fix:
    Mistral-3.2-NVFP4 tg 81→101, Qwen3.6-NVFP4 tg 117–142→217,
    Gemma-4-NVFP4 tg 157–180→213, Qwen3-Coder-30B-A3B-NVFP4 tg 51→272
    (--no-cuda-graphs no longer needed). Mistral-3.2-NVFP4 long-context
    Lorem×11 numerical-hash garbage → coherent text.
  • NVFP4 prequant MoE decode fast-path (#85) — Qwen3.6-NVFP4 went 8.34 →
    117–142 tok/s (~14–17×); Gemma-4-NVFP4 went ~42 → 157–180 tok/s (~4×).
    Three bugs: can_decode_fast whitelist did not include NVFP4-prequant
    models; cache_moe_native_nvfp4 had to be added to build the contiguous
    per-expert NVFP4 buffer for SafeTensors per-expert layouts; per-layer
    free of per-expert allocations (32 GiB VRAM ceiling on 35B-A3B).
  • Six Qwen3.5/3.6-NVFP4 SafeTensors loader bugs (#81) blocking coherent
    decode: (1) RMSNorm 1+W convention now honoured via
    UploadCtx::arch_norm_offset, (2) GDN head layout HF-grouped vs
    GGUF-tiled with kernel grouped_layout flag, (3) partial_rotary_factor
    read from both top-level and nested rope_parameters,
    (4) rope_theta from nested rope_parameters.rope_theta,
    (5) A_log → -exp(A_log) transform applied to BF16/F16 SafeTensors path
    only, (6) fp32_scan y_buf populated outside debug_forward. Per-layer
    correlation vs GGUF Q4_K_M now ≥0.997 across all 40 layers; output
    matches the GGUF oracle for the standard verification prompt.
  • Qwen3.5 GDN Q8_0 α/β qtype mismatch (#59) — upload_weight pre-dequanted
    Q8 → FP16 without updating qtype. Dispatcher mis-interpreted bytes →
    state collapse ( my my my…).
  • MXFP4 GDN-fallback dequant (#58) — replaced buggy CPU path with GPU kernel.
  • MXFP4 FP16-fallback VRAM oversubscription diagnostic (#60) — clear error
    message for the Qwen3.5-27B-MXFP4 IMA-on-load case (was silent).
  • Qwen3.5-MXFP4 A_log from blk.X.ssm_dt.weight (#61).
  • MoE expert-offload auto-pick (#54) — defaults try 10 % overhead first
    before falling back to 30 %. Qwen3-Coder-30B Q6_K 77 → 234 tok/s.
  • Mistral-3.2-NVFP4 use_default_system_prompt (#78) — honour the
    tokenizer-config flag and skip the 600-token jinja default system prompt.
    "I am the capital of France?" → "Paris".
  • Server <channel|> swallowing answer body on Gemma-4 (#39).
  • Gemma-4 byte-fallback on common names (#37).
  • Server reasoning_content for chat-template-injected <think> (#86).
  • verify auto re-execs in imp:test when host CMake is missing (#70) —
    unblocks make verify-fast for clean-host workflows.
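
The warmup-calibration fix reduces to two rules, restated below with illustrative member names (a hedged sketch, not the actual Engine internals): the FP8 write path only ever widens the per-layer scale, and end-of-warmup clears the calibrated flags so the first real prefill recalibrates instead of trusting the synthetic-BOS pass.

#include <algorithm>
#include <vector>

struct KvFp8Calib {
    std::vector<float> kv_scale;     // per-layer absmax driving the FP8 write scale
    std::vector<bool>  calibrated;   // true once a layer has observed a prefill

    // FP8 write path: monotonic promotion via std::max, so a wider warmup
    // observation survives and a wider real prefill still widens it further.
    void observe(int layer, float absmax) {
        kv_scale[layer] = std::max(kv_scale[layer], absmax);
        calibrated[layer] = true;
    }

    // Called at end-of-warmup: drop the flags so the synthetic-BOS pass is
    // never treated as the "first prefill" that locks the scale.
    void end_of_warmup() {
        std::fill(calibrated.begin(), calibrated.end(), false);
    }
};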

Added

  • KV-cache safety default flip (#51) — default KV dtype is now FP16; FP8
    is opt-in via --kv-fp8 / imp.conf:kv_cache.dtype="fp8". Fixes Mistral,
    DeepSeek, and Qwen3.5-GDN out of the box on first decode.
  • Auto-deterministic cuBLAS when FP8 KV active (#52) — pins cuBLAS algo
    selection to avoid quant-dequant noise → softmax NaN. Necessary fix; not
    sufficient for all archs (see docs/roadmap.md "FP8 KV stride bug").
  • CUDA Graph coverage expansion (#53) — speculative-verify graphs, SigLIP
    vision graph, default mem-pool retain, cudaGraphExecUpdate re-capture.
  • SM120 FMHA optimisation pass — Project B Stage 4 (#55, #56) — float4
    tile loads + HW FP4 conversion. +11–13 % prefill on Qwen3-4B Q8_0 at
    pp=8192. Stage 5 (mxf4nvf4.block_scale.scale_vec::4X.m16n8k64) layouts
    verified byte-exact, integration is the next open Project B item.
  • NVFP4 SafeTensors loader from llm-compressor (Phase 1, #63; Phase 2
    Item 1 Mistral3, #64; Phase 2 Item 2 Gemma-4 extras + per-row gemv
    bypass, #65). Mistral3-NVFP4 decode tg ≈ 81 tok/s post Phase 2 Item 1.
    Gemma-4-NVFP4 (llm-compressor) decodes coherent end-to-end at ~34 tok/s
    with default flags after #65 routes M>1 expert GEMV through gemm_nvfp4
    dequant→cuBLAS (legacy serial path's per-row gemv_nvfp4_kpar loop
    produced wrong output at Gemma-4 expert dimensions; M=1 decode path is
    unchanged).
  • Qwen3.6-NVFP4 SafeTensors plumbing (Phase 1 #71) — load-only.
    Decode lit up later via #85.
  • JSON config plumbing (#74, #77) — generation_config.json sampling
    defaults, special_tokens_map.json, Mistral V3 tokenizer-config flags.
  • Tokenizer-config use_default_system_prompt=false honoured (#78) — see
    Fixed.
  • Type-system + config refactor (#72) — unified QType, Tensor sidecars,
    imp.conf (TOML, ~50 former IMP_* env vars now keys). New top-level
    imp.conf.example. CLI --set kv_cache.dtype=fp8 for per-run overrides.
  • NVFP4 collapsed load-time scratch (#73) — single Model map.
  • FP32 attention S-matrix + Qwen3.5 QK-norm split (#66) — improves
    numerical headroom on long-context attention.
  • Diagnostic env vars for NVFP4 + attention (#79) — reproducer support
    for the long-context NVFP4 bug.
  • Anthropic /v1/messages endpoint (Phase 1 non-streaming #35,
    Phase 2 streaming #36) — synthetic SSE stream over the OpenAI handler.
  • Storage-planner enumerates shared-expert FFN + top-level embeddings/LM head
    (#38, #40) — fixes silent miss in MoE memory budget.
  • Strengthened GDN coherence test (#48) — detects recurrent-state collapse.
  • Strengthened Gemma-4 NVFP4 e2e gate (#68) — Paris coherence assertion.
  • Synthetic gemv_kpar M>1 per-row-loop bug repro test (#69).
  • Split imp-tests into 8 per-module binaries (#57) — speeds up filtered
    test runs.
  • tools/analysis/ PTX survey scripts (#67) — re-runnable cvt / MMA /
    async-TMA / atomics / SFU / cluster surveys for sm_120f after CUDA upgrades.

Changed

  • Default KV dtype is FP16 (#51) — see Added. Was implicit auto-FP8.
  • imp.conf is now the configuration interface (#72) — ~50 IMP_* env
    vars retired; sectioned TOML keys (runtime.cuda_graphs, kv_cache.dtype,
    attention.fp8_fmha, …). Loading precedence: --config → `$IMP_C...

v0.7.0 — Long-context correctness + Gemma-4/GDN stabilization

23 Apr 11:56
9e68101


Big correctness + platform release covering 195 commits since v0.6. The long-context dispatch cliff is gone, Gemma-4 and the Qwen 3.5 / 3.6 GDN family now produce clean output on Blackwell, CUDA 13.2.1 with stream priorities and mem-sync domains is live, and the StreamingLLM smart-KV mode is available.

Headline

  • FP8 FMHA S_tile smem overlap fix (#33) — pp > 1024 now coherent across every tested architecture. Previously all attention layers emitted NaN above the cuBLAS dispatch boundary. Up to ×1.70 vs llama.cpp at pp=8192 on Qwen3-4B.
  • Qwen 3.5 / 3.6 GDN stabilization (#28, #30) — gdn_scan_fused_kernel __launch_bounds__(HD, 2) miscompile fixed, partial-RoPE pair-offset fixed, ssm_state_dtype never auto-downgraded for GDN (the FP32 scan was overflowing into the next layer's state). Qwen 3.6 tg256 36 → 57 tok/s.
  • Gemma-4 suite — CUDA graphs on the decode fast-path (#11-#14), rope_freqs on global layers (#20), SWA long-context (#21), host-resident MoE gate_up split (e879bcd), split-K cp.async chunk loop for head_dim=512. Q4_K_M decode 55 → 183 tok/s (×1.21 vs llama.cpp).
  • Platform — CUDA 13.2.1 base images (#16), stream priorities + mem-sync domains + cluster spread (#17), StreamingLLM smart KV cache (#26), weight-storage refactor with TensorKind + StoragePlanner + gemm_dispatch (#27), CUTLASS 3.x NVFP4 Grouped GEMM scaffold (#22), ModelArch::QWEN36_MOE scaffold (#23).

Long-Context Prefill (new — pp=8192)

Previously broken. Now functional and ahead of llama.cpp on every tested model:

Model imp v0.7 (tok/s) llama.cpp (tok/s) Speedup
Qwen3-4B Q8_0 13,566 7,978 ×1.70
Qwen3-8B Q8_0 11,050 6,749 ×1.64
Qwen3.5-4B GDN Q8_0 13,090
Mistral-24B Q6_K 3,595 3,058 ×1.18
Qwen3-32B Q4_K_M 2,040 1,802 ×1.13

Diagnostic / infra

  • IMP_DEBUG_RAW meta-flag (#29), IMP_EXPERT_OVERHEAD_PCT hint on graph disable (#32)
  • tools/analysis/layer_diff.py — per-layer tensor diff vs llama.cpp
  • Gemma4GraphsTest e2e regression
  • FmhaFP8Test.Qwen35LikeHD256_GQA41_SeqMultiTile — catches the bug class from #33

Known issues (carried from CHANGELOG)

  • Qwen3-Coder-30B-A3B NVFP4 still needs --no-cuda-graphs (general-MoE D2H routing is graph-incompatible; Gemma-4 excepted via its decode fast-path).
  • Prefill throughput has up to 2.6× variance between container restarts due to cuBLAS autotuning — compare decode-only for reliable A/B testing.
  • 1024→2048 throughput dip on small dense models (Qwen3-4B: 27k → 19k tok/s at the dispatch boundary). Output correct; smoothing is future work.
  • MXFP4 GGUFs use imp-proprietary tensor-type 31 — llama.cpp reads it as the removed Q4_0_4_4, so cross-tool PPL comparison is not possible without a standard-format export.

Full changelog

See CHANGELOG.md for the complete Keep-a-Changelog entry.


v0.6 — Qwen3.5, MXFP4, Jinja2 Macros, HuggingFace Hub

02 Apr 19:27


Highlights

Qwen3.5 (Gated DeltaNet) now works correctly. The root cause was a missing Jinja2 {% macro %} feature — Qwen3.5's chat template uses macros for multimodal content handling. Without macro support, user prompts rendered as "None" and the model ignored all input. Fixed with full Jinja2 macro support (MacroNode, parse_macro, call_macro with positional args, kwargs, and defaults).
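
For a sense of what full macro support has to cover, argument binding alone looks roughly like this (a hedged sketch with made-up types, not imp's MacroNode / call_macro code): positional arguments fill parameters in order, keyword arguments fill by name, and anything still unbound falls back to its declared default.

#include <cstddef>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct Param { std::string name; std::optional<std::string> default_value; };

std::map<std::string, std::string> bind_macro_args(
        const std::vector<Param>& params,
        const std::vector<std::string>& positional,
        const std::map<std::string, std::string>& kwargs) {
    std::map<std::string, std::string> scope;
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (i < positional.size())
            scope[params[i].name] = positional[i];                 // positional args first
        else if (auto it = kwargs.find(params[i].name); it != kwargs.end())
            scope[params[i].name] = it->second;                    // then keyword args
        else if (params[i].default_value)
            scope[params[i].name] = *params[i].default_value;      // then declared defaults
        else
            throw std::runtime_error("missing macro argument: " + params[i].name);
    }
    return scope;
}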

Native MXFP4 GGUF weight format. Tensor-core-native 4-bit weights (FP4 E2M1 + UE8M0 block scales) feed directly into Blackwell's CUTLASS block-scaled GEMM — zero dequant overhead. Includes a Python converter (tools/convert_mxfp4.py) and full runtime integration with FP16 decode fallback.
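
What the format decodes to can be sketched on the host (illustrative only — the runtime feeds the packed blocks straight into CUTLASS block-scaled GEMM rather than dequantizing, and the nibble order here is an assumption): each block is 32 FP4 E2M1 values plus one shared UE8M0 power-of-two scale, which is where 4.25 bits/weight comes from (32×4 + 8 = 136 bits per 32 weights).

#include <cmath>
#include <cstdint>

// FP4 E2M1 magnitudes by 3-bit code: 0, 0.5, 1, 1.5, 2, 3, 4, 6; bit 3 is the sign.
static const float kE2M1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f };

// packed: 16 bytes holding 32 four-bit codes; ue8m0_scale: shared exponent byte.
void dequant_mxfp4_block(const uint8_t packed[16], uint8_t ue8m0_scale, float out[32]) {
    const float scale = std::ldexp(1.0f, int(ue8m0_scale) - 127);  // UE8M0: 2^(e - 127)
    for (int i = 0; i < 16; ++i) {
        out[2 * i + 0] = kE2M1[packed[i] & 0x0F] * scale;          // low nibble first (assumed)
        out[2 * i + 1] = kE2M1[(packed[i] >> 4) & 0x0F] * scale;   // high nibble second
    }
}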

HuggingFace Hub integration. Load models directly from HuggingFace by repo ID instead of local paths. GPTQ SafeTensors dequant and tokenizer.json loader included.

Performance (RTX 5090, CUDA 13.2)

Model Quant Decode (tok/s) Prefill (tok/s)
Qwen3-4B Q8_0 377 27,201
Qwen3-8B Q8_0 255 17,636
Qwen3.5-4B (GDN) Q8_0 306 14,823
Qwen3.5-9B (GDN) Q8_0 134 8,520
Llama-3.2-3B Q8_0 208 22,544

What's New

Features

  • Jinja2 macro support — {% macro name(args) %}...{% endmacro %} with positional/keyword args and defaults
  • Native MXFP4 GGUF — GGML_TYPE_MXFP4 (type 31), 4.25 bits/weight, CUTLASS tensor-core GEMM
  • MXFP4 converter — tools/convert_mxfp4.py (HuggingFace BF16/FP16 → MXFP4 GGUF)
  • HuggingFace Hub — load models by repo ID (--model Qwen/Qwen3-8B)
  • GPTQ dequant — SafeTensors GPTQ models load with on-the-fly dequantization
  • tokenizer.json loader — HuggingFace tokenizer format support
  • N-gram speculative decoding — --ngram-spec CLI/server flag, multi-sequence decode verify (sketch after this list)
  • Jinja2 engine improvements — slice, is-tests (string/iterable/mapping/number), strip(chars), tojson filter
  • --min-kv-tokens — guaranteed KV cache capacity before weight cache allocation
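
The drafting half of the n-gram flag can be sketched as follows (illustrative, not the imp implementation): find the most recent earlier occurrence of the trailing n-gram in the context and propose the tokens that followed it; the multi-sequence decode verify then accepts or rejects the draft against the real model.

#include <cstddef>
#include <vector>

std::vector<int> ngram_draft(const std::vector<int>& ctx, std::size_t n, std::size_t max_draft) {
    if (ctx.size() < n + 1) return {};
    // Walk backwards over candidate start positions strictly before the suffix.
    for (std::size_t start = ctx.size() - n; start-- > 0; ) {
        bool match = true;
        for (std::size_t j = 0; j < n; ++j)
            if (ctx[start + j] != ctx[ctx.size() - n + j]) { match = false; break; }
        if (!match) continue;
        // Propose whatever followed that earlier occurrence as the draft.
        std::vector<int> draft;
        for (std::size_t k = start + n; k < ctx.size() && draft.size() < max_draft; ++k)
            draft.push_back(ctx[k]);
        return draft;
    }
    return {};   // no match: fall back to ordinary decoding
}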

Bug Fixes

  • Qwen3.5 chat template — Jinja2 macro support fixes "None" content rendering
  • N-gram spec verify — replaced pseudo-prefill (KV divergence) with multi-sequence decode verify
  • Gemma-3 multi-turn — three root causes fixed (cuBLAS cache, softcap, banned tokens)
  • CMake sm_120/120f conflict — skip 120f gencode when 120 already in CMAKE_CUDA_ARCHITECTURES
  • GDN L2 norm epsilon — fused kernel (1e-12) now matches decode kernel (1e-6)
  • Think token banning — token_type metadata from GGUF for correct <think> handling
  • Server defaults — default context length, strip banned tokens from output

Infrastructure

  • CUDA 13.2, CUTLASS v4.4.2, GoogleTest v1.17.0, cpp-httplib v0.40.0, nlohmann/json v3.12.0
  • Dead EAGLE-3 code removed
  • TODO.md refreshed with current status

Breaking Changes

None. GGUF models from v0.5.1 continue to work unchanged.

Tested Models

Qwen3-4B, Qwen3-8B, Qwen3-32B, Qwen3-Coder-30B (MoE), Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B (GDN), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B, Llama-3.2-3B, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral

Quickstart

docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

v0.5.1: Fix GDN multi-turn chat

28 Mar 01:32


What's Fixed

GDN (Qwen3.5) multi-turn chat — conversations with 2+ turns produced degenerate output (repeated tokens, garbage). llama.cpp worked correctly. Root cause: FP8 E4M3 weight quantization (3-bit mantissa) causes precision errors that accumulate through the GDN delta rule scan when processing repeated chat template special tokens.

Changes

  • FP16 prefill weights for GDN: Auto-detected, ~8% prefill throughput reduction vs FP8, but correct multi-turn output
  • Chunked prefill state carry-forward: Recurrent state no longer reset between prefill chunks
  • Conv1d chunk boundary fix: Reads previous chunk context instead of zero-padding (see the sketch after this list)
  • Prefix caching guard: Disabled for recurrent models (token skipping breaks sequential state)
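
The conv1d boundary behaviour is easiest to see in a scalar sketch (illustrative, not the fused kernel): the causal window at the start of a chunk reads the saved tail of the previous chunk instead of zero-padding, and the tail is re-saved for the next chunk.

#include <cstddef>
#include <vector>

// x: the current prefill chunk for one channel; w: K filter taps;
// tail: the last K-1 samples of the previous chunk (K-1 zeros before the first chunk).
std::vector<float> causal_conv1d_chunk(const std::vector<float>& x,
                                       const std::vector<float>& w,
                                       std::vector<float>& tail) {
    const int K = static_cast<int>(w.size());
    std::vector<float> ext(tail);                       // previous context, not zeros
    ext.insert(ext.end(), x.begin(), x.end());
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t)
        for (int k = 0; k < K; ++k)
            y[t] += w[k] * ext[t + k];                  // causal window ending at x[t]
    tail.assign(ext.end() - (K - 1), ext.end());        // carry context into the next chunk
    return y;
}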

Benchmarks (RTX 5090)

Model Decode (tok/s) Prefill (tok/s)
Qwen3-4B 375 24,055
Qwen3-8B 255 17,746
Qwen3.5-4B (GDN) 308 14,687
Qwen3.5-9B (GDN) 134 8,418
Gemma-3-12B 129 6,998

Multi-Turn Quality

Scenario v0.5 v0.5.1
5-turn chat (4B) ❌ garbage ✅ correct
7-turn chat (9B) ❌ garbage ✅ correct

Full Changelog: v0.5...v0.5.1

v0.4.1 — Qwen3.5 9B fix + cuBLASLt robustness

21 Mar 17:41


Bug Fixes

Qwen3.5 9B+ model quality fix:

  • NVFP4 (4-bit) decode cache auto-disabled for GDN (Gated DeltaNet) models. The delta rule scan accumulates quantization error in the recurrent state across tokens — NVFP4 caused repeated <|im_start|> tokens on Qwen3.5-9B and garbage output on 27B. FP8 prefill + dp4a Q8_0 decode preserves enough precision.
  • Qwen3.5-4B was unaffected (smaller weight matrices tolerate 4-bit), but the fix applies globally to all GDN models for safety.

cuBLASLt crash-to-fallback:

  • cublasLtMatmul failures (CUDA 13.2 status 7 on sm_120 for certain M/K/N) now fall back to cublasGemmEx instead of silently continuing with corrupted output. Affects all three cuBLASLt paths (generic GEMM, INT compute, FP8-scaled).
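
The fallback shape is roughly the following (a hedged sketch with simplified FP16/FP32 details, not the actual imp wrapper): any non-success status from cublasLtMatmul re-issues the same GEMM through cublasGemmEx rather than leaving a possibly-corrupted output buffer in play.

#include <cublasLt.h>
#include <cublas_v2.h>

// Assumes FP16 A/B/C with FP32 accumulation and pre-built cuBLASLt descriptors.
cublasStatus_t gemm_with_fallback(cublasLtHandle_t lt, cublasHandle_t blas,
                                  cublasLtMatmulDesc_t op,
                                  cublasLtMatrixLayout_t a_l, cublasLtMatrixLayout_t b_l,
                                  cublasLtMatrixLayout_t c_l,
                                  const cublasLtMatmulAlgo_t* algo,
                                  const void* A, const void* B, void* C,
                                  int m, int n, int k, int lda, int ldb, int ldc,
                                  void* ws, size_t ws_bytes, cudaStream_t stream) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasLtMatmul(lt, op, &alpha, A, a_l, B, b_l,
                                       &beta, C, c_l, C, c_l,
                                       algo, ws, ws_bytes, stream);
    if (st == CUBLAS_STATUS_SUCCESS) return st;
    // Fallback: slower, but avoids silently continuing with corrupted output.
    return cublasGemmEx(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16F, lda, B, CUDA_R_16F, ldb,
                        &beta, C, CUDA_R_16F, ldc,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}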

Performance

Qwen3.5 decode (RTX 5090, Q8_0):

Model v0.4 v0.4.1 Note
Qwen3.5-4B 327 tok/s 253 tok/s No NVFP4 → dp4a path (correct before and after)
Qwen3.5-9B ❌ broken 136 tok/s Was producing garbage, now works
Qwen3.5-27B ~12 tok/s 31 tok/s VRAM-limited (27 GB model on 32 GB card)

Non-GDN models unaffected — Qwen3, Gemma-3, LLaMA, etc. continue using NVFP4 at full speed.

Other

  • Async sampling with pinned host memory (truly async cudaMemcpyAsync)
  • Batched logprobs D2H: single cudaStreamSynchronize for N sequences instead of N syncs

v0.4 — Qwen3.5 Gated DeltaNet support

21 Mar 12:42


What's New

Qwen3.5 (Gated DeltaNet) architecture support — 4B, 27B, 35B-A3B MoE models now work with correct, coherent output. Fused CUDA kernels make imp significantly faster than llama.cpp on this architecture.

Benchmarks (Qwen3.5-4B Q8_0, RTX 5090)

Metric imp v0.4 llama.cpp b8445 Speedup
Decode (tg128) 327 tok/s 180 tok/s +82%
Prefill (pp512) 16,017 tok/s 11,149 tok/s +44%
Prefill (pp128) 5,799 tok/s 6,136 tok/s ~1x

Key Changes

Correctness fixes:

  • RoPE frequency base for partial RoPE (rope_dim < head_dim) — affected all attention layers in Qwen3.5 (sketch after this list)
  • post_attn_norm misplacement — was applied inside attention output instead of as pre-FFN norm
  • Conv1d decode buffer aliasing (FP16/FP32 shared buffer)
  • BOS token default for GPT2 tokenizers
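
Partial RoPE rotates only the first rope_dim components of each head and derives frequencies from rope_dim rather than head_dim; a scalar sketch (half-offset pairing assumed, not necessarily imp's exact layout):

#include <cmath>
#include <vector>

// head: one attention head (head_dim floats); entries beyond rope_dim pass through.
void apply_partial_rope(std::vector<float>& head, int rope_dim, int pos,
                        float theta_base /* e.g. 10000.0f */) {
    for (int i = 0; i < rope_dim / 2; ++i) {
        // Frequency exponent is scaled by rope_dim, not head_dim.
        float freq = std::pow(theta_base, -2.0f * i / rope_dim);
        float c = std::cos(pos * freq), s = std::sin(pos * freq);
        float x0 = head[i], x1 = head[i + rope_dim / 2];
        head[i]                = x0 * c - x1 * s;
        head[i + rope_dim / 2] = x0 * s + x1 * c;
    }
}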

Performance optimizations:

  • Fused multi-token GDN scan kernel with register-cached recurrent state (125x less memory traffic)
  • Fused RMSNormGated+SiLU kernel (8192 → 1 kernel launch for pp128)
  • Fused conv1d+SiLU+FP32 prefill kernel
  • CUDA graphs enabled for GDN decode

No regressions on existing models (Qwen3, Qwen3-MoE, Gemma-3, DeepSeek, Nemotron-H).

Supported Qwen3.5 Models

  • Qwen3.5-4B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-27B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-35B-A3B MoE (Q6_K)

Requirements

  • NVIDIA GPU with sm_90+ (Hopper/Blackwell)
  • CUDA Toolkit 13.2+
  • Docker with NVIDIA Container Toolkit

v0.2

14 Mar 09:09
