
q4-imatrix weights fail to load: "expected IQ2_XXS expert tensors" #114

@aisaacsmitchell

Description


Summary

The q4-imatrix weights file added in commit ed5d30d ("q4 imatrix file in the download script") fails to load: the binary aborts with "expected IQ2_XXS expert tensors" immediately after the model header is parsed.

The download_model.sh help text recommends q4-imatrix for "machines with 256 GB RAM or more", but the current loader hardcodes IQ2_XXS as the only accepted expert tensor type.

Reproducer

Mac Studio M3 Ultra, 256 GB. Built from latest main (HEAD = 0230891 at time of report).

    ./download_model.sh q4-imatrix
    ./ds4 -p "Hello" --temp 0 -n 50 --nothink

Expected

Generates a normal response.

Actual

Two failure modes observed:

  1. Diagnostic flag surfaces the assertion cleanly:

    $ ./ds4 --first-token-test -p "Hello"
    ds4: Metal device Apple M3 Ultra, 256.00 GiB RAM
    ds4: Metal model views created in 4.6 ms, residency requested in 1518 ms, ...
    ds4: expected IQ2_XXS expert tensors
    
  2. Normal generation paths silently produce BOS spam: the model loads and runs, but emits only <|begin▁of▁sentence|> tokens. Reproduced via ./ds4 -p and via ./ds4-server with /v1/chat/completions. Performance numbers look normal (prefill 58 t/s, generation 34 t/s on M3 Ultra Metal), so the weights are being read and computed against, just not as the right quantization (see the dispatch sketch below).
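
Failure mode 2 is consistent with the compute path always running the IQ2_XXS dequantization regardless of the actual tensor type, so Q4_K blocks decode into garbage activations that the sampler collapses into BOS tokens. A minimal sketch of a type-keyed dispatch that would rule this out; every name below is hypothetical, not taken from ds4.c, and only the type IDs come from the --inspect output later in this report (16 = IQ2_XXS, 12 = Q4_K):

    /* Hypothetical expert-matmul dispatch; struct and function names are
     * illustrative, not from the ds4 source. */
    struct tensor { int type; /* ... */ };
    void matmul_iq2_xxs(float *out, const struct tensor *w, const float *x);
    void matmul_q4_k(float *out, const struct tensor *w, const float *x);
    void ds4_die(const char *msg);

    static void expert_matmul(float *out, const struct tensor *w, const float *x) {
        switch (w->type) {
        case 16: matmul_iq2_xxs(out, w, x); break; /* current sole code path */
        case 12: matmul_q4_k(out, w, x);    break; /* q4-imatrix experts */
        default: ds4_die("unsupported expert tensor type");
        }
    }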

Root cause

ds4.c:3783 and ds4.c:3855 hardcode type 16 (IQ2_XXS) as the only accepted routed-expert tensor type:

    if (w0->type != 16 || w1->type != 16) ds4_die("expected IQ2_XXS expert tensors");

But ds4.c:126 and ds4_gpu.h:559 both reference Q4_K routed experts as a supported "high-memory variant", and the README / download_model.sh direct 256 GB users to that variant.
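
A possible shape for the fix at both call sites, assuming ds4_die takes a plain string as in the line quoted above; only the type IDs (16 = IQ2_XXS, 12 = Q4_K) are from this report, and the added same-type constraint on the two tensors is part of the sketch, not a confirmed requirement:

    /* Sketch of a relaxed check for ds4.c:3783 / ds4.c:3855. */
    if ((w0->type != 16 && w0->type != 12) ||
        (w1->type != 16 && w1->type != 12) ||
        w0->type != w1->type)
        ds4_die("expected IQ2_XXS or Q4_K expert tensors");

The compute path would also need to dispatch on the type (as in the sketch above); relaxing only the load-time check would just turn the clean assertion into failure mode 2 for any kernel that still assumes the IQ2_XXS block layout.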

--inspect confirms the weights are well-formed Q4_K:

    model: DeepSeek V4 Flash
    arch:  deepseek4
    gguf:  v3, 62 metadata keys, 1328 tensors
    layers: 43
    experts: count=256 used=6
    tensor types:
      f32        492 tensors, 0.00 GiB
      f16        359 tensors, 2.04 GiB
      q8_0       345 tensors, 6.15 GiB
      q4_k       129 tensors, 145.12 GiB   ← experts as Q4_K (GGUF type 12), not IQ2_XXS
      i32          3 tensors, 0.01 GiB

File: DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf from antirez/deepseek-v4-gguf.

Workaround

For now, 256 GB Mac users have to fall back to q2-imatrix, despite the docs pointing them at q4-imatrix.

Happy to test a fix.
