Summary
The q4-imatrix weights file added in commit ed5d30d ("q4 imatrix file in the download script") fails to load — the binary aborts with "expected IQ2_XXS expert tensors" immediately after the model header is parsed.
The download_model.sh help text recommends q4-imatrix for "machines with 256 GB RAM or more", but the current loader hardcodes IQ2_XXS as the only accepted expert tensor type.
Reproducer
Mac Studio M3 Ultra, 256 GB. Built from latest main (HEAD = 0230891 at time of report).
./download_model.sh q4-imatrix
./ds4 -p "Hello" --temp 0 -n 50 --nothink
Expected
Generates a normal response.
Actual
Two failure modes observed:
- Diagnostic flag surfaces the assertion cleanly:
$ ./ds4 --first-token-test -p "Hello"
ds4: Metal device Apple M3 Ultra, 256.00 GiB RAM
ds4: Metal model views created in 4.6 ms, residency requested in 1518 ms, ...
ds4: expected IQ2_XXS expert tensors
- Normal generation paths silently produce BOS spam: the model loads and runs, but emits only <|begin▁of▁sentence|> tokens. Reproduced via ./ds4 -p and ./ds4-server with /v1/chat/completions. Performance numbers look fine (prefill 58 t/s, generation 34 t/s on M3 Ultra Metal), so the file is being read and run through the compute path, just not interpreted as the right quantization.
Root cause
ds4.c:3783 and ds4.c:3855 hardcode w->type != 16 (IQ2_XXS) for routed experts:
if (w0->type != 16 || w1->type != 16) ds4_die("expected IQ2_XXS expert tensors");
But ds4.c:126 and ds4_gpu.h:559 both reference Q4_K routed experts as a supported "high-memory variant", and the README / download_model.sh direct 256 GB users to that variant.
--inspect confirms the weights are well-formed Q4_K:
model: DeepSeek V4 Flash
arch: deepseek4
gguf: v3, 62 metadata keys, 1328 tensors
layers: 43
experts: count=256 used=6
tensor types:
f32 492 tensors, 0.00 GiB
f16 359 tensors, 2.04 GiB
q8_0 345 tensors, 6.15 GiB
q4_k 129 tensors, 145.12 GiB ← experts as Q4_K (GGUF type 12), not IQ2_XXS
i32 3 tensors, 0.01 GiB
File: DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf from antirez/deepseek-v4-gguf.
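Possible fix
A minimal sketch of what a relaxed check at the two call sites could look like, assuming the type IDs follow the standard GGUF/ggml enum (Q4_K = 12, IQ2_XXS = 16); the DS4_TYPE_* macros and ds4_expert_type_ok() helper are made-up names for illustration, not existing ds4 API. Relaxing the assertion alone is presumably not enough (the Q4_K expert dequant/matmul path also needs wiring), otherwise it just trades the abort for the BOS-spam mode above.
/* Illustrative sketch only, not the actual ds4 code. */
#define DS4_TYPE_Q4_K    12   /* the q4_k expert tensors reported by --inspect */
#define DS4_TYPE_IQ2_XXS 16   /* the type currently hardcoded at ds4.c:3783/3855 */

static int ds4_expert_type_ok(int type) {
    return type == DS4_TYPE_IQ2_XXS || type == DS4_TYPE_Q4_K;
}

/* ...and at both call sites, with w0/w1 as in the existing check: */
if (!ds4_expert_type_ok(w0->type) || !ds4_expert_type_ok(w1->type))
    ds4_die("expected IQ2_XXS or Q4_K expert tensors");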
Workaround
For now, 256 GB Mac users have to fall back to q2-imatrix despite the docs.
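Concretely, assuming q2-imatrix is fetched the same way as the other variants:
./download_model.sh q2-imatrix
./ds4 -p "Hello" --temp 0 -n 50 --nothink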
Happy to test a fix.