Summary
z-lab/Qwen3.6-35B-A3B-PARO's config.json declares mtp_num_hidden_layers=1, which advertises an MTP head, but the shipped safetensors contain no MTP weights. When a runtime takes the config at face value and engages its MTP path, the result is a 0% accept rate plus ~50% decode overhead from the empty-tensor verify step. The operator-side fix is to set mtp_enabled=false manually in oMLX model_settings.json (or the equivalent), but anyone who tries this model with defaults is likely to lose considerable time debugging before finding that.
Affected runtime behavior observed
On oMLX 0.3.9.dev2 (M1 Max) without overriding mtp_enabled:
- MTP path engages because config declares an MTP layer.
- Forward pass through the empty MTP head returns an empty/uninitialized tensor each step.
- Accept rate: 0%.
- Decode wall time ~1.5× the no-MTP baseline (the MTP head is still computed but produces no useful drafts).
With mtp_enabled=false set explicitly:
- MTP path skipped.
- 31.3 t/s warm @ 32K, 200-tok decode — the actual PARO performance (this is our Mac long-ctx workhorse number).
Reproducer
- Run huggingface-cli download z-lab/Qwen3.6-35B-A3B-PARO.
- Inspect config.json and note mtp_num_hidden_layers: 1.
- Inspect model.safetensors.index.json and note that there are no language_model.mtp.* keys (for comparison, Jundot/Qwen3.6-27B-oQ6-mtp has 29 such keys, samwang0041/Qwen3.6-27B-MLX-4bit-MTP has 31, Jundot/Qwen3.6-35B-A3B-oQ4-mtp has 42).
- Serve via oMLX 0.3.9.dev2 with model_settings defaulting to mtp_enabled=true.
- Observe 0% accept rate and ~50% decode overhead in the MTP[N] server log lines.
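The two inspection steps can be automated. The following is a minimal sketch, not part of oMLX or any released tool: the function names are hypothetical, and it assumes the download directory contains config.json and a model.safetensors.index.json with the usual "weight_map" table. It also assumes that MTP tensors can be recognised by a dotted name component equal to "mtp" (as in language_model.mtp.*), and that mtp_num_hidden_layers may sit either at the top level of the config or inside a nested section.

```python
import json
from pathlib import Path


def find_key(obj, key):
    """Yield every value stored under `key` anywhere in a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def check_mtp_consistency(config: dict, index: dict) -> dict:
    """Compare the declared MTP layer count with the tensors listed in the index."""
    declared = max(
        (v for v in find_key(config, "mtp_num_hidden_layers") if isinstance(v, int)),
        default=0,
    )
    # Assumption: an MTP tensor has "mtp" as one dotted component of its name.
    mtp_keys = [
        name for name in index.get("weight_map", {}) if "mtp" in name.split(".")
    ]
    return {
        "declared_layers": declared,
        "mtp_tensor_count": len(mtp_keys),
        "mismatch": declared > 0 and not mtp_keys,
    }


def check_model_dir(model_dir) -> dict:
    """Hypothetical helper: run the check against a local download directory."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())
    return check_mtp_consistency(config, index)
```

Calling check_model_dir on the local download path should report mismatch as True for a release in the state described here, and False for a release that either omits the declaration or ships the MTP tensors.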
Fix options
(a) Drop mtp_num_hidden_layers from the released config.json. Cleanest fix — config matches what's actually shipped.
(b) Ship the MTP head weights in the safetensors release. Adds size but enables MTP for downstream users.
(c) Document the required mtp_enabled=false override in the model card / README and accept that downstream stacks will enable MTP by default based on the config. Operator-side fix only.
Most other paroquant PARO releases I checked (gemma-4-*-it-PARO, Qwen3.6-27B-PARO) ship clean configs that don't trigger this. The 35B-A3B-PARO is the outlier.
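For anyone who wants to try option (a) locally before a corrected release exists, the edit could look like the sketch below. It is untested against oMLX and rests on an assumption the report has not yet confirmed: that the runtime treats an absent mtp_num_hidden_layers as "no MTP head". The function names are hypothetical. The script removes the key wherever it appears in the JSON tree and keeps a backup of the original file.

```python
import json
from pathlib import Path


def drop_key(obj, key) -> int:
    """Remove `key` from every dict in a nested JSON object; return the removal count."""
    removed = 0
    if isinstance(obj, dict):
        if key in obj:
            del obj[key]
            removed += 1
        for value in obj.values():
            removed += drop_key(value, key)
    elif isinstance(obj, list):
        for item in obj:
            removed += drop_key(item, key)
    return removed


def strip_mtp_declaration(config_path) -> int:
    """Delete mtp_num_hidden_layers from a local config.json, keeping a .bak copy."""
    path = Path(config_path)
    original = path.read_text()
    config = json.loads(original)
    removed = drop_key(config, "mtp_num_hidden_layers")
    if removed:
        # Keep the untouched file next to the edited one (config.json.bak).
        path.with_suffix(".json.bak").write_text(original)
        path.write_text(json.dumps(config, indent=2) + "\n")
    return removed
```

A return value of 0 means the key was not present and the file was left untouched.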
Why this matters
z-lab/Qwen3.6-35B-A3B-PARO is the best Mac long-context workhorse model currently available for M1 Max (256K cold/warm, 36 t/s @ 32K via oMLX). I'd rather not have it cost downstream operators a wasted afternoon on first try. Happy to confirm option (a) works end-to-end with a quick re-pull and bench if useful.
Companion data
This is captured in our test fleet's per-model facts card alongside the MTP sampler-floor finding (filed separately at jundot/omlx) and the MoE Gemma-4 Metal OOM (filed separately, also paroquant). Happy to share full server logs or bench JSONLs.