
Releases: kekzl/imp

v0.9.0 — NVFP4 prefill fast-path, NVFP4 KV cache, BitDecoding parity, NemotronH, sm_120a

10 May 19:16
2ce3093


First release after v0.8.0 — 60+ PRs of NVFP4 maturation, new architectures, and the hybrid-prefill correctness pass.

Highlights

  • NVFP4 MoE prefill fast-path (#160) — Qwen3-Coder-30B-A3B-NVFP4 pp512 1241 → 13046 tok/s (×10.5).
  • NVFP4 KV cache (#108, #125) — opt-in --kv-nvfp4; 16k → 40k tokens at the same VRAM, parity with FP16 baseline after vectorized PTX dequant.
  • BitDecoding TC paged decode (Phases 0-3) (#142, #145, #146, #147, #148, #149) — parity with FP16 baseline (193 vs 193 tok/s) on Qwen3-4B Q8 NVFP4-KV.
  • NemotronH hybrid Mamba2+MoE+Attention NVFP4 (#104, #109) — tg128 42 → 319 tok/s after dynamic NVFP4 MoE reserve.
  • sm_120f → sm_120a build target (#105) — full RTX 5090 feature set (FP8 MMA .kind::f8f6f4, NVFP4 block-scaled MMA, TMA warp-specialized grouped GEMM tactics).
  • 1024 → 4096 prefill cliff closed (#110) — pp4096 +18-28% across Q8 baselines.
  • Chunked prefill correctness (#130) + hybrid GDN+MoE / Mamba2+MoE archs (#156).

Notable additions

  • Native SentencePiece (.model) parser (#128) — drops Python fallback for Mistral-family tokenizers.
  • Multimodal Qwen3.6-VL NVFP4 loader (#152).
  • Zero-config SafeTensors auto-detect (#116).
  • Server: tools + JSON-schema coordination (#103, #112, #119), opt-in --log-requests JSONL (#155).
  • CUDA 13.2 modernization (#131): cudaMemcpyWithAttributesAsync, add.f32x2.
  • GHCR release pipeline (#101).

Known issues (carry-over)

  • NVFP4 MoE prefill ceiling at ~16k tok/s warm vs vLLM single-seq 18.5k = 1.42× gap. Next-step memo in repo.
  • Spec-decode / MTP still off on NVFP4 decode-cache models.
  • CUTLASS NVFP4 sm_120 non-determinism under graph-replay (skip-guard retained for llm-compressor exports).
  • Prefill throughput shows up to 2.6× variance between container restarts due to cuBLAS autotuning. Compare decode-only for reliable A/B.

Full notes: CHANGELOG.md.

Docker image (auto-built on this release): ghcr.io/kekzl/imp:0.9.0, :0.9, :0, :latest.

v0.8.0 — NVFP4 prequant production-ready, native tool calling, public release cleanup

03 May 10:58
cac9955


NVFP4-prequant SafeTensors hits production: Mistral-3.2 / Gemma-4 /
Qwen3.6 / Qwen3-Coder all coherent on single-turn, sampling, multi-turn
and short long-context. FP8 KV warmup calibration fixed for Llama and
GDN families. CUDA Graphs lit up for prequant SafeTensors. Forty-plus
PRs since v0.7.0.

Server + tools (PR #97)

  • Native function calling for Gemma-4 + Qwen3.6 — root-cause was a
    tokenizer bug, not just missing parsers. encode_spm / encode_gpt2 /
    encode_gemma4 now run a longest-match pre-split pass against
    CONTROL-flagged added tokens before BPE. Multi-character markers like
    <|tool_call> (Gemma-4 token id 48) were being BPE'd as raw UTF-8
    bytes — the model never saw the trained marker in its prompt's
    tools-rendering and answered with markdown JSON code blocks instead
    of the native protocol. Fixed: token 48/49 round-trip as their
    assigned id. Added parse_tool_calls_gemma() for Gemma's non-JSON
    syntax (<|tool_call>call:NAME{key:value}<tool_call|> with
    <|"|>...<|"|> string escapes), and extended parse_tool_calls_chatml()
    to branch on body shape so Qwen3.6's XML-styled
    <function=...><parameter=...> payload parses too. End-to-end
    verified on Gemma-4 Q4_K_M (finish_reason=tool_calls, 19 tokens
    completion) and Qwen3.6-NVFP4 (finish_reason=tool_calls with
    reasoning_content alongside). A sketch of the pre-split pass follows this list.
  • Faster cold start (24s → 18s on Qwen3.6 NVFP4) — skip MTP / vision-only
    SafeTensors shards when neither is wired up (~5s, 2.4 GiB of
    mmap + header parse + page-cache pressure avoided), MAP_POPULATE +
    MADV_WILLNEED on weight mmaps, pinned staging ring 2x64 MiB →
    4x128 MiB, Pass-2 expert upload re-arms cudaMemGetInfo cache so
    per-tensor checked_cuda_malloc skips ~15k sync calls on 128-expert
    MoE, concurrent SafeTensors shard parse (3 shards in parallel
    threads), exposed name_is_skipped() to deduplicate the shard-skip
    filter and translate_name's skip rules.
  • Server fixes (Open WebUI on Qwen3.6-NVFP4) — UTF-8 boundary walk
    in reasoning stream (German umlauts came out as f��r because the
    7-byte tail-overlap landed mid-multibyte), drop leaked stop tokens
    (<|im_end|> / <|endoftext|>) before the is_last gate, restrict
    "[Reasoning truncated]" notice to finish == "length", post-</think>
    grace 4 → 16 tokens, repetition_penalty default 1.0 → 1.05 to break
    multi-turn loop degeneration, workspace skips FP8 / MXFP4 scratch
    for paths we won't use (~6.4 GiB VRAM headroom on Qwen3.6 NVFP4 GDN).
  • Open WebUI tools enabled in docker-compose — DuckDuckGo web
    search (no API key), Pyodide code interpreter (browser-side, no
    sandbox service), URL fetch, native function calling toggleable
    per message via the chat-input icons.
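
The longest-match pre-split in the first bullet can be sketched roughly as follows (a minimal host-side sketch with made-up types, not imp's actual encode_spm / encode_gpt2 / encode_gemma4 code): scan the input for CONTROL-flagged added tokens, longest marker first, emit those with their trained ids, and hand only the spans in between to normal BPE.

#include <string>
#include <string_view>
#include <utility>
#include <vector>

// token_id >= 0: an added/control token emitted verbatim with its trained id.
// token_id == -1: a plain-text span that still goes through normal BPE.
struct Piece { std::string_view text; int token_id; };

// `added` is assumed to be sorted longest-marker-first so a long marker such as
// <|tool_call> cannot lose to a shorter prefix.
std::vector<Piece> presplit_added_tokens(
        std::string_view input,
        const std::vector<std::pair<std::string, int>>& added) {
    std::vector<Piece> out;
    size_t start = 0, i = 0;
    while (i < input.size()) {
        bool matched = false;
        for (const auto& [marker, id] : added) {
            if (input.compare(i, marker.size(), marker) == 0) {
                if (i > start) out.push_back({input.substr(start, i - start), -1});
                out.push_back({input.substr(i, marker.size()), id});  // keep the trained id
                i += marker.size();
                start = i;
                matched = true;
                break;
            }
        }
        if (!matched) ++i;
    }
    if (start < input.size()) out.push_back({input.substr(start), -1});
    return out;
}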

Fixed

  • FP8 KV warmup-calibration bug (#89) — Engine::warmup() ran a forward
    pass with synthetic BOS tokens; the FP8 write path's online calibration
    treated this as the FIRST prefill, locked kv_scales_[layer] to a
    too-small absmax, and never recalibrated. Real generation then overflowed
    FP8 dynamic range on Llama-3.2-3B Q8_0 and Qwen3.5-4B GDN Q8_0 (output
    degenerated within ~30 tokens, e.g. " France, and, 2008, 201, 201, …").
    Fix: Engine::warmup() drops the kv_calibrated_ flags at end-of-warmup;
    the FP8 write path promotes the scale monotonically via std::max so
    the warmup observation survives if it's already wider, and real prefill
    widens it further when needed. Long generation (100 tokens) on
    Llama-3.2-3B FP8 KV now produces a clean factually-correct list of world
    capitals. A sketch of the scale-promotion logic appears after this list.
  • NVFP4 prequant CUTLASS prefill cache (#88) — Phase 0 promotion sets
    Tensor.qtype = NVFP4 directly on the main weight tensors, but Phase 3b
    (CUTLASS cache build) only iterated the legacy wcache_.nvfp4 map.
    Prequant SafeTensors prefill therefore fell through to gemm_nvfp4
    dequant→cuBLAS, allocating ~40 MiB FP16 scratch per layer per prefill —
    graph-incompatible AND noisy on SmoothQuant-calibrated Mistral-3.2-NVFP4.
    Phase 0b loop registers all dense + out_proj_ prequant tensors in
    cutlass_nvfp4 directly. Standard pp512/tg256 bench post-fix:
    Mistral-3.2-NVFP4 tg 81→101, Qwen3.6-NVFP4 tg 117–142→217,
    Gemma-4-NVFP4 tg 157–180→213, Qwen3-Coder-30B-A3B-NVFP4 tg 51→272
    (--no-cuda-graphs no longer needed). Mistral-3.2-NVFP4 long-context
    Lorem×11 numerical-hash garbage → coherent text.
  • NVFP4 prequant MoE decode fast-path (#85) — Qwen3.6-NVFP4 went 8.34 →
    117–142 tok/s (~14–17×); Gemma-4-NVFP4 went ~42 → 157–180 tok/s (~4×).
    Three bugs: can_decode_fast whitelist did not include NVFP4-prequant
    models; cache_moe_native_nvfp4 had to be added to build the contiguous
    per-expert NVFP4 buffer for SafeTensors per-expert layouts; per-layer
    free of per-expert allocations (32 GiB VRAM ceiling on 35B-A3B).
  • Six Qwen3.5/3.6-NVFP4 SafeTensors loader bugs (#81) blocking coherent
    decode: (1) RMSNorm 1+W convention now honoured via
    UploadCtx::arch_norm_offset, (2) GDN head layout HF-grouped vs
    GGUF-tiled with kernel grouped_layout flag, (3) partial_rotary_factor
    read from both top-level and nested rope_parameters,
    (4) rope_theta from nested rope_parameters.rope_theta,
    (5) A_log → -exp(A_log) transform applied to BF16/F16 SafeTensors path
    only, (6) fp32_scan y_buf populated outside debug_forward. Per-layer
    correlation vs GGUF Q4_K_M now ≥0.997 across all 40 layers; output
    matches the GGUF oracle for the standard verification prompt.
  • Qwen3.5 GDN Q8_0 α/β qtype mismatch (#59) — upload_weight pre-dequanted
    Q8 → FP16 without updating qtype. Dispatcher mis-interpreted bytes →
    state collapse ( my my my…).
  • MXFP4 GDN-fallback dequant (#58) — replaced buggy CPU path with GPU kernel.
  • MXFP4 FP16-fallback VRAM oversubscription diagnostic (#60) — clear error
    message for the Qwen3.5-27B-MXFP4 IMA-on-load case (was silent).
  • Qwen3.5-MXFP4 A_log from blk.X.ssm_dt.weight (#61).
  • MoE expert-offload auto-pick (#54) — defaults try 10 % overhead first
    before falling back to 30 %. Qwen3-Coder-30B Q6_K 77 → 234 tok/s.
  • Mistral-3.2-NVFP4 use_default_system_prompt (#78) — honour the
    tokenizer-config flag and skip the 600-token jinja default system prompt.
    "I am the capital of France?" → "Paris".
  • Server <channel|> swallowing answer body on Gemma-4 (#39).
  • Gemma-4 byte-fallback on common names (#37).
  • Server reasoning_content for chat-template-injected <think> (#86).
  • verify auto re-execs in imp:test when host CMake is missing (#70) —
    unblocks make verify-fast for clean-host workflows.
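
The warmup-calibration fix reduces to two rules, restated below with illustrative member names (a hedged sketch, not the actual Engine internals): the FP8 write path only ever widens the per-layer scale, and end-of-warmup clears the calibrated flags so the first real prefill recalibrates instead of trusting the synthetic-BOS pass.

#include <algorithm>
#include <vector>

struct KvFp8Calib {
    std::vector<float> kv_scale;     // per-layer absmax driving the FP8 write scale
    std::vector<bool>  calibrated;   // true once a layer has observed a prefill

    // FP8 write path: monotonic promotion via std::max, so a wider warmup
    // observation survives and a wider real prefill still widens it further.
    void observe(int layer, float absmax) {
        kv_scale[layer] = std::max(kv_scale[layer], absmax);
        calibrated[layer] = true;
    }

    // Called at end-of-warmup: drop the flags so the synthetic-BOS pass is
    // never treated as the "first prefill" that locks the scale.
    void end_of_warmup() {
        std::fill(calibrated.begin(), calibrated.end(), false);
    }
};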

Added

  • KV-cache safety default flip (#51) — default KV dtype is now FP16; FP8
    is opt-in via --kv-fp8 / imp.conf:kv_cache.dtype="fp8". Fixes Mistral,
    DeepSeek, and Qwen3.5-GDN out of the box on first decode.
  • Auto-deterministic cuBLAS when FP8 KV active (#52) — pins cuBLAS algo
    selection to avoid quant-dequant noise → softmax NaN. Necessary fix; not
    sufficient for all archs (see docs/roadmap.md "FP8 KV stride bug").
  • CUDA Graph coverage expansion (#53) — speculative-verify graphs, SigLIP
    vision graph, default mem-pool retain, cudaGraphExecUpdate re-capture.
  • SM120 FMHA optimisation pass — Project B Stage 4 (#55, #56) — float4
    tile loads + HW FP4 conversion. +11–13 % prefill on Qwen3-4B Q8_0 at
    pp=8192. Stage 5 (mxf4nvf4.block_scale.scale_vec::4X.m16n8k64) layouts
    verified byte-exact, integration is the next open Project B item.
  • NVFP4 SafeTensors loader from llm-compressor (Phase 1, #63; Phase 2
    Item 1 Mistral3, #64; Phase 2 Item 2 Gemma-4 extras + per-row gemv
    bypass, #65). Mistral3-NVFP4 decode tg ≈ 81 tok/s post Phase 2 Item 1.
    Gemma-4-NVFP4 (llm-compressor) decodes coherent end-to-end at ~34 tok/s
    with default flags after #65 routes M>1 expert GEMV through gemm_nvfp4
    dequant→cuBLAS (legacy serial path's per-row gemv_nvfp4_kpar loop
    produced wrong output at Gemma-4 expert dimensions; M=1 decode path is
    unchanged).
  • Qwen3.6-NVFP4 SafeTensors plumbing (Phase 1 #71) — load-only.
    Decode lit up later via #85.
  • JSON config plumbing (#74, #77) — generation_config.json sampling
    defaults, special_tokens_map.json, Mistral V3 tokenizer-config flags.
  • Tokenizer-config use_default_system_prompt=false honoured (#78) — see
    Fixed.
  • Type-system + config refactor (#72) — unified QType, Tensor sidecars,
    imp.conf (TOML, ~50 former IMP_* env vars now keys). New top-level
    imp.conf.example. CLI --set kv_cache.dtype=fp8 for per-run overrides.
  • NVFP4 collapsed load-time scratch (#73) — single Model map.
  • FP32 attention S-matrix + Qwen3.5 QK-norm split (#66) — improves
    numerical headroom on long-context attention.
  • Diagnostic env vars for NVFP4 + attention (#79) — reproducer support
    for the long-context NVFP4 bug.
  • Anthropic /v1/messages endpoint (Phase 1 non-streaming #35,
    Phase 2 streaming #36) — synthetic SSE stream over the OpenAI handler.
  • Storage-planner enumerates shared-expert FFN + top-level embeddings/LM head
    (#38, #40) — fixes silent miss in MoE memory budget.
  • Strengthened GDN coherence test (#48) — detects recurrent-state collapse.
  • Strengthened Gemma-4 NVFP4 e2e gate (#68) — Paris coherence assertion.
  • Synthetic gemv_kpar M>1 per-row-loop bug repro test (#69).
  • Split imp-tests into 8 per-module binaries (#57) — speeds up filtered
    test runs.
  • tools/analysis/ PTX survey scripts (#67) — re-runnable cvt / MMA /
    async-TMA / atomics / SFU / cluster surveys for sm_120f after CUDA upgrades.

Changed

  • Default KV dtype is FP16 (#51) — see Added. Was implicit auto-FP8.
  • imp.conf is now the configuration interface (#72) — ~50 IMP_* env
    vars retired; sectioned TOML keys (runtime.cuda_graphs, kv_cache.dtype,
    attention.fp8_fmha, …). Loading precedence: --config → `$IMP_C...

v0.7.0 — Long-context correctness + Gemma-4/GDN stabilization

23 Apr 11:56
9e68101


Big correctness + platform release covering 195 commits since v0.6. The long-context dispatch cliff is gone, Gemma-4 and the Qwen 3.5 / 3.6 GDN family now produce clean output on Blackwell, CUDA 13.2.1 with stream priorities and mem-sync domains is live, and the StreamingLLM smart-KV mode is available.

Headline

  • FP8 FMHA S_tile smem overlap fix (#33) — pp > 1024 now coherent across every tested architecture. Previously all attention layers emitted NaN above the cuBLAS dispatch boundary. Up to ×1.70 vs llama.cpp at pp=8192 on Qwen3-4B.
  • Qwen 3.5 / 3.6 GDN stabilization (#28, #30) — gdn_scan_fused_kernel __launch_bounds__(HD, 2) miscompile fixed, partial-RoPE pair-offset fixed, ssm_state_dtype never auto-downgraded for GDN (the FP32 scan was overflowing into the next layer's state). Qwen 3.6 tg256 36 → 57 tok/s.
  • Gemma-4 suite — CUDA graphs on the decode fast-path (#11-#14), rope_freqs on global layers (#20), SWA long-context (#21), host-resident MoE gate_up split (e879bcd), split-K cp.async chunk loop for head_dim=512. Q4_K_M decode 55 → 183 tok/s (×1.21 vs llama.cpp).
  • Platform — CUDA 13.2.1 base images (#16), stream priorities + mem-sync domains + cluster spread (#17), StreamingLLM smart KV cache (#26), weight-storage refactor with TensorKind + StoragePlanner + gemm_dispatch (#27), CUTLASS 3.x NVFP4 Grouped GEMM scaffold (#22), ModelArch::QWEN36_MOE scaffold (#23).

Long-Context Prefill (new — pp=8192)

Previously broken. Now functional and ahead of llama.cpp on every tested model:

Model imp v0.7 (tok/s) llama.cpp (tok/s) Speedup
Qwen3-4B Q8_0 13,566 7,978 ×1.70
Qwen3-8B Q8_0 11,050 6,749 ×1.64
Qwen3.5-4B GDN Q8_0 13,090
Mistral-24B Q6_K 3,595 3,058 ×1.18
Qwen3-32B Q4_K_M 2,040 1,802 ×1.13

Diagnostic / infra

  • IMP_DEBUG_RAW meta-flag (#29), IMP_EXPERT_OVERHEAD_PCT hint on graph disable (#32)
  • tools/analysis/layer_diff.py — per-layer tensor diff vs llama.cpp
  • Gemma4GraphsTest e2e regression
  • FmhaFP8Test.Qwen35LikeHD256_GQA41_SeqMultiTile — catches the bug class from #33

Known issues (carried from CHANGELOG)

  • Qwen3-Coder-30B-A3B NVFP4 still needs --no-cuda-graphs (general-MoE D2H routing is graph-incompatible; Gemma-4 excepted via its decode fast-path).
  • Prefill throughput has up to 2.6× variance between container restarts due to cuBLAS autotuning — compare decode-only for reliable A/B testing.
  • 1024→2048 throughput dip on small dense models (Qwen3-4B: 27k → 19k tok/s at the dispatch boundary). Output correct; smoothing is future work.
  • MXFP4 GGUFs use imp-proprietary tensor-type 31 — llama.cpp reads it as the removed Q4_0_4_4, so cross-tool PPL comparison is not possible without a standard-format export.

Full changelog

See CHANGELOG.md for the complete Keep-a-Changelog entry.


v0.6 — Qwen3.5, MXFP4, Jinja2 Macros, HuggingFace Hub

02 Apr 19:27


Highlights

Qwen3.5 (Gated DeltaNet) now works correctly. The root cause was a missing Jinja2 {% macro %} feature — Qwen3.5's chat template uses macros for multimodal content handling. Without macro support, user prompts rendered as "None" and the model ignored all input. Fixed with full Jinja2 macro support (MacroNode, parse_macro, call_macro with positional args, kwargs, and defaults).
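
For a sense of what full macro support has to cover, argument binding alone looks roughly like this (a hedged sketch with made-up types, not imp's MacroNode / call_macro code): positional arguments fill parameters in order, keyword arguments fill by name, and anything still unbound falls back to its declared default.

#include <cstddef>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct Param { std::string name; std::optional<std::string> default_value; };

std::map<std::string, std::string> bind_macro_args(
        const std::vector<Param>& params,
        const std::vector<std::string>& positional,
        const std::map<std::string, std::string>& kwargs) {
    std::map<std::string, std::string> scope;
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (i < positional.size())
            scope[params[i].name] = positional[i];                 // positional args first
        else if (auto it = kwargs.find(params[i].name); it != kwargs.end())
            scope[params[i].name] = it->second;                    // then keyword args
        else if (params[i].default_value)
            scope[params[i].name] = *params[i].default_value;      // then declared defaults
        else
            throw std::runtime_error("missing macro argument: " + params[i].name);
    }
    return scope;
}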

Native MXFP4 GGUF weight format. Tensor-core-native 4-bit weights (FP4 E2M1 + UE8M0 block scales) feed directly into Blackwell's CUTLASS block-scaled GEMM — zero dequant overhead. Includes a Python converter (tools/convert_mxfp4.py) and full runtime integration with FP16 decode fallback.
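
What the format decodes to can be sketched on the host (illustrative only — the runtime feeds the packed blocks straight into CUTLASS block-scaled GEMM rather than dequantizing, and the nibble order here is an assumption): each block is 32 FP4 E2M1 values plus one shared UE8M0 power-of-two scale, which is where 4.25 bits/weight comes from (32×4 + 8 = 136 bits per 32 weights).

#include <cmath>
#include <cstdint>

// FP4 E2M1 magnitudes by 3-bit code: 0, 0.5, 1, 1.5, 2, 3, 4, 6; bit 3 is the sign.
static const float kE2M1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f };

// packed: 16 bytes holding 32 four-bit codes; ue8m0_scale: shared exponent byte.
void dequant_mxfp4_block(const uint8_t packed[16], uint8_t ue8m0_scale, float out[32]) {
    const float scale = std::ldexp(1.0f, int(ue8m0_scale) - 127);  // UE8M0: 2^(e - 127)
    for (int i = 0; i < 16; ++i) {
        out[2 * i + 0] = kE2M1[packed[i] & 0x0F] * scale;          // low nibble first (assumed)
        out[2 * i + 1] = kE2M1[(packed[i] >> 4) & 0x0F] * scale;   // high nibble second
    }
}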

HuggingFace Hub integration. Load models directly from HuggingFace by repo ID instead of local paths. GPTQ SafeTensors dequant and tokenizer.json loader included.

Performance (RTX 5090, CUDA 13.2)

Model Quant Decode (tok/s) Prefill (tok/s)
Qwen3-4B Q8_0 377 27,201
Qwen3-8B Q8_0 255 17,636
Qwen3.5-4B (GDN) Q8_0 306 14,823
Qwen3.5-9B (GDN) Q8_0 134 8,520
Llama-3.2-3B Q8_0 208 22,544

What's New

Features

  • Jinja2 macro support — {% macro name(args) %}...{% endmacro %} with positional/keyword args and defaults
  • Native MXFP4 GGUF — GGML_TYPE_MXFP4 (type 31), 4.25 bits/weight, CUTLASS tensor-core GEMM
  • MXFP4 converter — tools/convert_mxfp4.py (HuggingFace BF16/FP16 → MXFP4 GGUF)
  • HuggingFace Hub — load models by repo ID (--model Qwen/Qwen3-8B)
  • GPTQ dequant — SafeTensors GPTQ models load with on-the-fly dequantization
  • tokenizer.json loader — HuggingFace tokenizer format support
  • N-gram speculative decoding — --ngram-spec CLI/server flag, multi-sequence decode verify (sketch after this list)
  • Jinja2 engine improvements — slice, is-tests (string/iterable/mapping/number), strip(chars), tojson filter
  • --min-kv-tokens — guaranteed KV cache capacity before weight cache allocation
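
The drafting half of the n-gram flag can be sketched as follows (illustrative, not the imp implementation): find the most recent earlier occurrence of the trailing n-gram in the context and propose the tokens that followed it; the multi-sequence decode verify then accepts or rejects the draft against the real model.

#include <cstddef>
#include <vector>

std::vector<int> ngram_draft(const std::vector<int>& ctx, std::size_t n, std::size_t max_draft) {
    if (ctx.size() < n + 1) return {};
    // Walk backwards over candidate start positions strictly before the suffix.
    for (std::size_t start = ctx.size() - n; start-- > 0; ) {
        bool match = true;
        for (std::size_t j = 0; j < n; ++j)
            if (ctx[start + j] != ctx[ctx.size() - n + j]) { match = false; break; }
        if (!match) continue;
        // Propose whatever followed that earlier occurrence as the draft.
        std::vector<int> draft;
        for (std::size_t k = start + n; k < ctx.size() && draft.size() < max_draft; ++k)
            draft.push_back(ctx[k]);
        return draft;
    }
    return {};   // no match: fall back to ordinary decoding
}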

Bug Fixes

  • Qwen3.5 chat template — Jinja2 macro support fixes "None" content rendering
  • N-gram spec verify — replaced pseudo-prefill (KV divergence) with multi-sequence decode verify
  • Gemma-3 multi-turn — three root causes fixed (cuBLAS cache, softcap, banned tokens)
  • CMake sm_120/120f conflict — skip 120f gencode when 120 already in CMAKE_CUDA_ARCHITECTURES
  • GDN L2 norm epsilon — fused kernel (1e-12) now matches decode kernel (1e-6)
  • Think token banning — token_type metadata from GGUF for correct <think> handling
  • Server defaults — default context length, strip banned tokens from output

Infrastructure

  • CUDA 13.2, CUTLASS v4.4.2, GoogleTest v1.17.0, cpp-httplib v0.40.0, nlohmann/json v3.12.0
  • Dead EAGLE-3 code removed
  • TODO.md refreshed with current status

Breaking Changes

None. GGUF models from v0.5.1 continue to work unchanged.

Tested Models

Qwen3-4B, Qwen3-8B, Qwen3-32B, Qwen3-Coder-30B (MoE), Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B (GDN), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B, Llama-3.2-3B, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral

Quickstart

docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

v0.5.1: Fix GDN multi-turn chat

28 Mar 01:32


What's Fixed

GDN (Qwen3.5) multi-turn chat — conversations with 2+ turns produced degenerate output (repeated tokens, garbage). llama.cpp worked correctly. Root cause: FP8 E4M3 weight quantization (3-bit mantissa) causes precision errors that accumulate through the GDN delta rule scan when processing repeated chat template special tokens.

Changes

  • FP16 prefill weights for GDN: Auto-detected, ~8% prefill throughput reduction vs FP8, but correct multi-turn output
  • Chunked prefill state carry-forward: Recurrent state no longer reset between prefill chunks
  • Conv1d chunk boundary fix: Reads previous chunk context instead of zero-padding (see the sketch after this list)
  • Prefix caching guard: Disabled for recurrent models (token skipping breaks sequential state)
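
The conv1d boundary behaviour is easiest to see in a scalar sketch (illustrative, not the fused kernel): the causal window at the start of a chunk reads the saved tail of the previous chunk instead of zero-padding, and the tail is re-saved for the next chunk.

#include <cstddef>
#include <vector>

// x: the current prefill chunk for one channel; w: K filter taps;
// tail: the last K-1 samples of the previous chunk (K-1 zeros before the first chunk).
std::vector<float> causal_conv1d_chunk(const std::vector<float>& x,
                                       const std::vector<float>& w,
                                       std::vector<float>& tail) {
    const int K = static_cast<int>(w.size());
    std::vector<float> ext(tail);                       // previous context, not zeros
    ext.insert(ext.end(), x.begin(), x.end());
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t)
        for (int k = 0; k < K; ++k)
            y[t] += w[k] * ext[t + k];                  // causal window ending at x[t]
    tail.assign(ext.end() - (K - 1), ext.end());        // carry context into the next chunk
    return y;
}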

Benchmarks (RTX 5090)

Model Decode (tok/s) Prefill (tok/s)
Qwen3-4B 375 24,055
Qwen3-8B 255 17,746
Qwen3.5-4B (GDN) 308 14,687
Qwen3.5-9B (GDN) 134 8,418
Gemma-3-12B 129 6,998

Multi-Turn Quality

Scenario v0.5 v0.5.1
5-turn chat (4B) ❌ garbage ✅ correct
7-turn chat (9B) ❌ garbage ✅ correct

Full Changelog: v0.5...v0.5.1

v0.4.1 — Qwen3.5 9B fix + cuBLASLt robustness

21 Mar 17:41


Bug Fixes

Qwen3.5 9B+ model quality fix:

  • NVFP4 (4-bit) decode cache auto-disabled for GDN (Gated DeltaNet) models. The delta rule scan accumulates quantization error in the recurrent state across tokens — NVFP4 caused repeated <|im_start|> tokens on Qwen3.5-9B and garbage output on 27B. FP8 prefill + dp4a Q8_0 decode preserves enough precision.
  • Qwen3.5-4B was unaffected (smaller weight matrices tolerate 4-bit), but the fix applies globally to all GDN models for safety.

cuBLASLt crash-to-fallback:

  • cublasLtMatmul failures (CUDA 13.2 status 7 on sm_120 for certain M/K/N) now fall back to cublasGemmEx instead of silently continuing with corrupted output. Affects all three cuBLASLt paths (generic GEMM, INT compute, FP8-scaled).
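
The fallback shape is roughly the following (a hedged sketch with simplified FP16/FP32 details, not the actual imp wrapper): any non-success status from cublasLtMatmul re-issues the same GEMM through cublasGemmEx rather than leaving a possibly-corrupted output buffer in play.

#include <cublasLt.h>
#include <cublas_v2.h>

// Assumes FP16 A/B/C with FP32 accumulation and pre-built cuBLASLt descriptors.
cublasStatus_t gemm_with_fallback(cublasLtHandle_t lt, cublasHandle_t blas,
                                  cublasLtMatmulDesc_t op,
                                  cublasLtMatrixLayout_t a_l, cublasLtMatrixLayout_t b_l,
                                  cublasLtMatrixLayout_t c_l,
                                  const cublasLtMatmulAlgo_t* algo,
                                  const void* A, const void* B, void* C,
                                  int m, int n, int k, int lda, int ldb, int ldc,
                                  void* ws, size_t ws_bytes, cudaStream_t stream) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasLtMatmul(lt, op, &alpha, A, a_l, B, b_l,
                                       &beta, C, c_l, C, c_l,
                                       algo, ws, ws_bytes, stream);
    if (st == CUBLAS_STATUS_SUCCESS) return st;
    // Fallback: slower, but avoids silently continuing with corrupted output.
    return cublasGemmEx(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16F, lda, B, CUDA_R_16F, ldb,
                        &beta, C, CUDA_R_16F, ldc,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}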

Performance

Qwen3.5 decode (RTX 5090, Q8_0):

Model v0.4 v0.4.1 Note
Qwen3.5-4B 327 tok/s 253 tok/s No NVFP4 → dp4a path (correct before and after)
Qwen3.5-9B ❌ broken 136 tok/s Was producing garbage, now works
Qwen3.5-27B ~12 tok/s 31 tok/s VRAM-limited (27 GB model on 32 GB card)

Non-GDN models unaffected — Qwen3, Gemma-3, LLaMA, etc. continue using NVFP4 at full speed.

Other

  • Async sampling with pinned host memory (truly async cudaMemcpyAsync)
  • Batched logprobs D2H: single cudaStreamSynchronize for N sequences instead of N syncs

v0.4 — Qwen3.5 Gated DeltaNet support

21 Mar 12:42


What's New

Qwen3.5 (Gated DeltaNet) architecture support — 4B, 27B, 35B-A3B MoE models now work with correct, coherent output. Fused CUDA kernels make imp significantly faster than llama.cpp on this architecture.

Benchmarks (Qwen3.5-4B Q8_0, RTX 5090)

Metric imp v0.4 llama.cpp b8445 Speedup
Decode (tg128) 327 tok/s 180 tok/s +82%
Prefill (pp512) 16,017 tok/s 11,149 tok/s +44%
Prefill (pp128) 5,799 tok/s 6,136 tok/s ~1x

Key Changes

Correctness fixes:

  • RoPE frequency base for partial RoPE (rope_dim < head_dim) — affected all attention layers in Qwen3.5 (sketch after this list)
  • post_attn_norm misplacement — was applied inside attention output instead of as pre-FFN norm
  • Conv1d decode buffer aliasing (FP16/FP32 shared buffer)
  • BOS token default for GPT2 tokenizers
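
Partial RoPE rotates only the first rope_dim components of each head and derives frequencies from rope_dim rather than head_dim; a scalar sketch (half-offset pairing assumed, not necessarily imp's exact layout):

#include <cmath>
#include <vector>

// head: one attention head (head_dim floats); entries beyond rope_dim pass through.
void apply_partial_rope(std::vector<float>& head, int rope_dim, int pos,
                        float theta_base /* e.g. 10000.0f */) {
    for (int i = 0; i < rope_dim / 2; ++i) {
        // Frequency exponent is scaled by rope_dim, not head_dim.
        float freq = std::pow(theta_base, -2.0f * i / rope_dim);
        float c = std::cos(pos * freq), s = std::sin(pos * freq);
        float x0 = head[i], x1 = head[i + rope_dim / 2];
        head[i]                = x0 * c - x1 * s;
        head[i + rope_dim / 2] = x0 * s + x1 * c;
    }
}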

Performance optimizations:

  • Fused multi-token GDN scan kernel with register-cached recurrent state (125x less memory traffic)
  • Fused RMSNormGated+SiLU kernel (8192 → 1 kernel launch for pp128)
  • Fused conv1d+SiLU+FP32 prefill kernel
  • CUDA graphs enabled for GDN decode

No regressions on existing models (Qwen3, Qwen3-MoE, Gemma-3, DeepSeek, Nemotron-H).

Supported Qwen3.5 Models

  • Qwen3.5-4B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-27B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-35B-A3B MoE (Q6_K)

Requirements

  • NVIDIA GPU with sm_90+ (Hopper/Blackwell)
  • CUDA Toolkit 13.2+
  • Docker with NVIDIA Container Toolkit

v0.2

14 Mar 09:09
