Qwen3.6-35B-A3B-PARO: config.json declares mtp_num_hidden_layers=1 but no MTP weights shipped → 0% accept + ~50% decode overhead by default #47

@sangemaru

Summary

z-lab/Qwen3.6-35B-A3B-PARO's config.json declares mtp_num_hidden_layers=1, which advertises an MTP head, but the shipped safetensors contain no MTP weights. When a runtime takes the config at face value and engages its MTP path, the result is a 0% accept rate plus ~50% decode overhead from the empty-tensor verify step. The operator-side fix is to set mtp_enabled=false manually in oMLX model_settings.json (or the equivalent in other runtimes), but anyone who tries this model with defaults loses a lot of time to confusing debugging before figuring that out.

Affected runtime behavior observed

On oMLX 0.3.9.dev2 (M1 Max) without overriding mtp_enabled:

  • MTP path engages because config declares an MTP layer.
  • Forward pass through the empty MTP head returns an empty/uninitialized tensor each step.
  • Accept rate: 0%.
  • Decode wall time: ~1.5× the no-MTP baseline (the MTP head is still computed; it just produces no useful drafts).

With mtp_enabled=false set explicitly:

  • MTP path skipped.
  • 31.3 t/s warm @ 32K, 200-tok decode — the actual PARO performance (this is our Mac long-ctx workhorse number).
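
To make the cost of the two observations above concrete, here is a rough back-of-the-envelope model in Python. It is a sketch under stated assumptions, not a description of how oMLX schedules MTP: it assumes one draft token per decode step (consistent with mtp_num_hidden_layers=1), that each step costs a fixed multiple of a plain decode step, and that the 1.5× figure and the 31.3 t/s baseline apply to the same workload, which the measurements above do not confirm. The function names are made up for illustration.

```python
def effective_tps(baseline_tps, step_cost_ratio, accept_rate):
    """Estimated tokens/s with one MTP draft token per decode step.

    baseline_tps    -- throughput with MTP disabled
    step_cost_ratio -- cost of one MTP-enabled step relative to a plain step
    accept_rate     -- fraction of draft tokens accepted (0.0 to 1.0)
    """
    # Each step always yields the verified token, plus the draft if accepted.
    tokens_per_step = 1.0 + accept_rate
    return baseline_tps * tokens_per_step / step_cost_ratio


def break_even_accept_rate(step_cost_ratio):
    """Accept rate at which MTP-enabled throughput equals the baseline."""
    return step_cost_ratio - 1.0


# Assumed inputs taken from the numbers reported in this issue.
with_empty_head = effective_tps(31.3, 1.5, 0.0)
needed = break_even_accept_rate(1.5)
print(f"estimated t/s with 0% accept: {with_empty_head:.1f}")
print(f"accept rate needed to break even: {needed:.0%}")
```

Under this simplified model, a head that costs 50% extra per step has to land roughly half of its drafts just to match the no-MTP baseline, so a 0% accept rate is pure loss.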

Reproducer

  1. huggingface-cli download z-lab/Qwen3.6-35B-A3B-PARO.
  2. Inspect config.json — note mtp_num_hidden_layers: 1.
  3. Inspect model.safetensors.index.json — note no language_model.mtp.* keys (for comparison, Jundot/Qwen3.6-27B-oQ6-mtp has 29 such keys, samwang0041/Qwen3.6-27B-MLX-4bit-MTP has 31, Jundot/Qwen3.6-35B-A3B-oQ4-mtp has 42).
  4. Serve via oMLX 0.3.9.dev2 with model_settings defaulting mtp_enabled=true.
  5. Observe 0% accept rate and ~50% decode overhead in the MTP[N] server log lines.
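
Steps 2 and 3 can be automated. The Python sketch below compares what config.json declares against what model.safetensors.index.json actually lists. It assumes the standard Hugging Face sharded-index layout (a top-level "weight_map" object keyed by tensor name) and that MTP tensors carry an "mtp." path segment, as in the language_model.mtp.* keys mentioned above. Because the key may sit at the top level or inside a nested section such as a text config, the lookup searches the whole config; that nesting is an assumption, not something verified against this repo. All function names are illustrative.

```python
import json
from pathlib import Path


def find_key(obj, key):
    """Return the first value stored under `key` in a nested dict, else None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    return None


def mtp_tensor_names(index):
    """List tensor names in the index that contain an `mtp.` path segment."""
    weight_map = index.get("weight_map", {})
    # Prefixing "." lets the same test match both "mtp.x" and "a.mtp.x".
    return sorted(name for name in weight_map if ".mtp." in f".{name}")


def check_mtp_consistency(config, index):
    """Report whether the config advertises MTP layers the weights lack."""
    declared = find_key(config, "mtp_num_hidden_layers") or 0
    shipped = mtp_tensor_names(index)
    return {
        "declared_layers": declared,
        "mtp_tensors": len(shipped),
        "mismatch": bool(declared > 0 and not shipped),
    }


def check_model_dir(model_dir):
    """Run the check against a locally downloaded model directory."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())
    return check_mtp_consistency(config, index)
```

Calling check_model_dir on the downloaded snapshot directory should report mismatch as True for the release described here, and False for a repo that ships its MTP head.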

Fix options

(a) Drop mtp_num_hidden_layers from the released config.json. Cleanest fix — config matches what's actually shipped.

(b) Ship the MTP head weights in the safetensors release. Adds size but enables MTP for downstream users.

(c) Document the required mtp_enabled=false override in the model card / README and accept that downstream stacks enable MTP by default based on the config. Operator-side fix only.
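
For reference, option (a) amounts to something like the following Python sketch, which removes the declaration wherever it appears in the config. Whether a given runtime treats a missing mtp_num_hidden_layers as "no MTP head" is an assumption that still needs the end-to-end check offered below; the helper name and the example config are made up for illustration.

```python
import copy
import json


def drop_mtp_declaration(config, key="mtp_num_hidden_layers"):
    """Return a copy of `config` with `key` removed at every nesting level."""
    cleaned = copy.deepcopy(config)

    def _strip(node):
        if isinstance(node, dict):
            node.pop(key, None)
            for value in node.values():
                _strip(value)

    _strip(cleaned)
    return cleaned


# Hypothetical config fragment; only the MTP key is taken from this issue.
example = {"hidden_size": 2048, "mtp_num_hidden_layers": 1}
print(json.dumps(drop_mtp_declaration(example), indent=2))
```

The function returns a new dict rather than editing in place, so the original config stays available for comparison before anything is written back to disk.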

Most other paroquant PARO releases I checked (gemma-4-*-it-PARO, Qwen3.6-27B-PARO) ship clean configs that don't trigger this. The 35B-A3B-PARO is the outlier.

Why this matters

z-lab/Qwen3.6-35B-A3B-PARO is the best Mac long-context workhorse model currently available for M1 Max (256K cold/warm, 36 t/s @ 32K via oMLX). I'd rather not have it cost downstream operators a wasted afternoon on the first try. Happy to confirm that option (a) works end-to-end with a quick re-pull and bench if useful.

Companion data

This is captured in our test fleet's per-model facts card alongside the MTP sampler-floor finding (filed separately at jundot/omlx) and the MoE Gemma-4 Metal OOM (filed separately, also paroquant). Happy to share full server logs or bench JSONLs.
