Summary
The q4-imatrix weights file added in commit ed5d30d ("q4 imatrix file in the download script") fails to load — the binary aborts with "expected IQ2_XXS expert tensors" immediately after the model header is parsed.
The download_model.sh help text recommends q4-imatrix for "machines with 256 GB RAM or more", but the current loader hardcodes IQ2_XXS as the only accepted expert tensor type.
Reproducer
Mac Studio M3 Ultra, 256 GB. Built from latest main (HEAD = 0230891 at time of report).
./download_model.sh q4-imatrix
./ds4 -p "Hello" --temp 0 -n 50 --nothink
Expected
Generates a normal response.
Actual
Two failure modes observed:
- Diagnostic flag surfaces the assertion cleanly:
$ ./ds4 --first-token-test -p "Hello"
ds4: Metal device Apple M3 Ultra, 256.00 GiB RAM
ds4: Metal model views created in 4.6 ms, residency requested in 1518 ms, ...
ds4: expected IQ2_XXS expert tensors
- Normal generation paths silently produce BOS spam: the model loads and runs, but emits only <|begin▁of▁sentence|> tokens. Reproduced via ./ds4 -p and ./ds4-server with /v1/chat/completions. Performance numbers look fine (prefill 58 t/s, generation 34 t/s on M3 Ultra Metal), so the file is being read and run through the compute path, just not interpreted as the right quantization.
Root cause
ds4.c:3783 and ds4.c:3855 hardcode w->type != 16 (IQ2_XXS) for routed experts:
if (w0->type != 16 || w1->type != 16) ds4_die("expected IQ2_XXS expert tensors");
But ds4.c:126 and ds4_gpu.h:559 both reference Q4_K routed experts as a supported "high-memory variant", and the README / download_model.sh direct 256 GB users to that variant.
--inspect confirms the weights are well-formed Q4_K:
model: DeepSeek V4 Flash
arch: deepseek4
gguf: v3, 62 metadata keys, 1328 tensors
layers: 43
experts: count=256 used=6
tensor types:
f32 492 tensors, 0.00 GiB
f16 359 tensors, 2.04 GiB
q8_0 345 tensors, 6.15 GiB
q4_k 129 tensors, 145.12 GiB ← experts as Q4_K (GGUF type 12), not IQ2_XXS
i32 3 tensors, 0.01 GiB
File: DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf from antirez/deepseek-v4-gguf.
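Possible fix
A minimal sketch of what a relaxed check at the two call sites could look like, assuming the type IDs follow the standard GGUF/ggml enum (Q4_K = 12, IQ2_XXS = 16); the DS4_TYPE_* macros and ds4_expert_type_ok() helper are made-up names for illustration, not existing ds4 API. Relaxing the assertion alone is presumably not enough (the Q4_K expert dequant/matmul path also needs wiring), otherwise it just trades the abort for the BOS-spam mode above.
/* Illustrative sketch only, not the actual ds4 code. */
#define DS4_TYPE_Q4_K    12   /* the q4_k expert tensors reported by --inspect */
#define DS4_TYPE_IQ2_XXS 16   /* the type currently hardcoded at ds4.c:3783/3855 */

static int ds4_expert_type_ok(int type) {
    return type == DS4_TYPE_IQ2_XXS || type == DS4_TYPE_Q4_K;
}

/* ...and at both call sites, with w0/w1 as in the existing check: */
if (!ds4_expert_type_ok(w0->type) || !ds4_expert_type_ok(w1->type))
    ds4_die("expected IQ2_XXS or Q4_K expert tensors");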
Workaround
For now, 256 GB Mac users have to fall back to q2-imatrix despite the docs.
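Concretely, assuming q2-imatrix is fetched the same way as the other variants:
./download_model.sh q2-imatrix
./ds4 -p "Hello" --temp 0 -n 50 --nothink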
Happy to test a fix.