Qwen3.6-35B-A3B-PARO: config.json declares mtp_num_hidden_layers=1 but no MTP weights shipped → 0% accept + ~50% decode overhead by default

## Summary

`z-lab/Qwen3.6-35B-A3B-PARO`'s `config.json` declares `mtp_num_hidden_layers=1`, which advertises an MTP head, but the shipped safetensors **contain no MTP weights**. When a runtime takes the config at face value and engages its MTP path, the result is a **0% accept rate plus ~50% decode overhead** from a verify step that runs on an empty tensor. The operator-side workaround is to set `mtp_enabled=false` manually in oMLX `model_settings.json` (or the equivalent in other runtimes), but anyone who tries this model with defaults is likely to lose a lot of time debugging before figuring that out.

## Affected runtime behavior observed

On oMLX 0.3.9.dev2 (M1 Max) without overriding `mtp_enabled`:
- MTP path engages because config declares an MTP layer.
- Forward pass through the empty MTP head returns an empty/uninitialized tensor each step.
- Accept rate: 0%.
- Decode wall time: ~1.5× the no-MTP baseline (the MTP head is still computed, it just produces no useful drafts).

With `mtp_enabled=false` set explicitly:
- MTP path skipped.
- 31.3 t/s warm @ 32K context, 200-token decode. This is the actual PARO performance (our Mac long-context workhorse number).
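
Why a 0% accept rate turns the MTP path into pure overhead can be sketched with a back-of-envelope throughput model. This is an illustrative estimate, not a measurement from oMLX: the function names, the i.i.d. acceptance assumption, and the single `cost_ratio` parameter (per-step cost relative to a no-MTP step) are all assumptions made for the sketch.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verify step, assuming each draft token
    is accepted independently with probability accept_rate."""
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def effective_tps(baseline_tps: float, cost_ratio: float,
                  accept_rate: float, draft_len: int = 1) -> float:
    """Rough decode throughput with speculative drafting enabled.

    cost_ratio is the wall time of one draft+verify step divided by the
    wall time of one plain decode step (1.5 means ~50% overhead).
    """
    return baseline_tps * expected_tokens_per_step(accept_rate, draft_len) / cost_ratio


if __name__ == "__main__":
    # With no accepted drafts, every step still yields exactly one token,
    # so throughput is simply the baseline divided by the cost ratio.
    print(effective_tps(baseline_tps=30.0, cost_ratio=1.5, accept_rate=0.0))
```

Under this model, an MTP head only pays for itself when the expected tokens per step exceed the cost ratio; at 0% acceptance that can never happen.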

## Reproducer

1. `huggingface-cli download z-lab/Qwen3.6-35B-A3B-PARO`.
2. Inspect `config.json` — note `mtp_num_hidden_layers: 1`.
3. Inspect `model.safetensors.index.json` — note **no** `language_model.mtp.*` keys (for comparison, `Jundot/Qwen3.6-27B-oQ6-mtp` has 29 such keys, `samwang0041/Qwen3.6-27B-MLX-4bit-MTP` has 31, `Jundot/Qwen3.6-35B-A3B-oQ4-mtp` has 42).
4. Serve via oMLX 0.3.9.dev2 with model settings left at their default (`mtp_enabled=true`).
5. Observe 0% accept rate and ~50% decode overhead in the `MTP[N]` server log lines.
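
Steps 2 and 3 can be automated with a small check like the one below. It is a minimal sketch, assuming the standard sharded-safetensors index layout (a top-level `weight_map` object mapping tensor names to shard files) and assuming MTP tensors can be recognized by an `mtp` component in the dotted tensor name, as in the `language_model.mtp.*` keys mentioned above. The function names and the fallback lookup under `text_config` are illustrative, not part of any runtime.

```python
import json
from pathlib import Path


def find_mtp_mismatch(config: dict, weight_map: dict) -> dict:
    """Compare the MTP layer count declared in config with the shipped tensor names."""
    declared = config.get("mtp_num_hidden_layers")
    if declared is None:
        # Assumption: some configs nest text-model fields under "text_config".
        nested = config.get("text_config")
        if isinstance(nested, dict):
            declared = nested.get("mtp_num_hidden_layers")
    declared = declared or 0

    # A tensor counts as an MTP weight if "mtp" is one of its dotted name parts.
    mtp_keys = [name for name in weight_map if "mtp" in name.split(".")]

    return {
        "declared_layers": declared,
        "mtp_key_count": len(mtp_keys),
        "mismatch": declared > 0 and len(mtp_keys) == 0,
    }


def check_model_dir(model_dir: str) -> dict:
    """Load config.json and the safetensors index from a downloaded snapshot."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())
    return find_mtp_mismatch(config, index["weight_map"])
```

Calling `check_model_dir` on the downloaded snapshot directory should report `mismatch: True` for this release if the observation above holds, and `False` for the comparison repos that do ship MTP keys.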

## Fix options

**(a)** Drop `mtp_num_hidden_layers` from the released `config.json`. Cleanest fix — config matches what's actually shipped.

**(b)** Ship the MTP head weights in the safetensors release. Adds size but enables MTP for downstream users.

**(c)** Document the required `mtp_enabled=false` override in the model card / README and accept that downstream stacks default-on MTP from config. Operator-side fix only.

Most other paroquant PARO releases I checked (`gemma-4-*-it-PARO`, `Qwen3.6-27B-PARO`) ship clean configs that don't trigger this. The 35B-A3B-PARO is the outlier.
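
Option (a) can be tried locally before any upstream change by stripping the field from a copy of the downloaded `config.json`. The following is a sketch only: it has not been confirmed here that removing this one field is sufficient for oMLX to skip its MTP path, and the helper names and the nested `text_config` handling are assumptions.

```python
import json
from pathlib import Path

MTP_FIELD = "mtp_num_hidden_layers"


def strip_mtp_field(config: dict) -> dict:
    """Return a copy of the config with the MTP layer-count field removed."""
    cleaned = {k: v for k, v in config.items() if k != MTP_FIELD}
    # Assumption: the field may also sit under a nested "text_config" block.
    nested = cleaned.get("text_config")
    if isinstance(nested, dict):
        cleaned["text_config"] = {k: v for k, v in nested.items() if k != MTP_FIELD}
    return cleaned


def patch_config_file(path: str) -> None:
    """Rewrite config.json in place, keeping a .bak copy of the original."""
    p = Path(path)
    original = p.read_text()
    p.with_name(p.name + ".bak").write_text(original)
    patched = strip_mtp_field(json.loads(original))
    p.write_text(json.dumps(patched, indent=2) + "\n")
```

It is safer to run `patch_config_file` on a separate copy of the model directory rather than inside the Hugging Face cache, where files are typically symlinks to shared blobs.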

## Why this matters

`z-lab/Qwen3.6-35B-A3B-PARO` is the **best Mac long-context workhorse model currently available** for M1 Max (256K cold/warm, 36 t/s @ 32K via oMLX). I'd rather not have it cost downstream operators a wasted afternoon on first try. Happy to confirm option (a) works end-to-end with a quick re-pull and bench if useful.

## Companion data

This is captured in our test fleet's per-model facts card alongside the MTP sampler-floor finding (filed separately at jundot/omlx) and the MoE Gemma-4 Metal OOM (filed separately, also paroquant). Happy to share full server logs or bench JSONLs.

