Text-only PARO MoE checkpoints ship a vision_config, causing inference servers to mis-route them to a VLM/multimodal engine
Summary: The published text-only PARO checkpoints for the Qwen3.6 / Gemma-4 MoE families include a vision_config block (and a …ForConditionalGeneration architecture) in config.json. Inference servers that pick the engine by the presence of vision_config therefore route these checkpoints to a vision-language engine, which then fails to load the MoE expert tensors. The vision_config block is purely vestigial: the checkpoints are text-only.
Confirmed example: z-lab/Qwen3.6-35B-A3B-PARO
architectures: ["Qwen3_5MoeForConditionalGeneration"]
model_type: "qwen3_5_moe"
vision_config: present
quantization_config.quant_method: "paroquant"
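To check whether another local checkpoint matches this pattern, the same four fields can be read from its config.json. This is a minimal sketch that assumes the nesting shown above; the helper names summarize_config and summarize_config_file are hypothetical and not part of any server.

```python
import json
from pathlib import Path


def summarize_config(config: dict) -> dict:
    """Return the four fields listed above from a parsed config.json."""
    quant = config.get("quantization_config") or {}
    return {
        "architectures": config.get("architectures"),
        "model_type": config.get("model_type"),
        "has_vision_config": "vision_config" in config,
        "quant_method": quant.get("quant_method"),
    }


def summarize_config_file(path: str) -> dict:
    """Load a config.json from disk and summarize it."""
    text = Path(path).read_text(encoding="utf-8")
    return summarize_config(json.loads(text))
```

A checkpoint that reports has_vision_config as True together with quant_method "paroquant" is a candidate for the mis-routing described below.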
Observed behavior (oMLX 0.3.9.dev2, Apple Silicon / M1 Max):
- oMLX classifies the model as VLM (because vision_config is present) and loads it on the VLM engine.
- The VLM path cannot load the MoE expert tensors and throws on model.language_model.layers.0.mlp.experts.gate_up_proj.
- For Qwen3.6-35B-A3B-PARO it then falls back to the LLM (batched) engine, but only after a failed VLM attempt, contributing to a very slow cold load (~8 min).
- For gemma-4-26B-A4B-it-PARO there is no MoE→LLM fallback on that arch, and the server is hard-killed (SIGKILL) on load.
Root cause: these are text-only quantizations with no usable vision weights, but the vision_config is retained, so any server that routes on vision_config (a common heuristic) sends them down the multimodal path.
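The heuristic in question can be illustrated with a few lines of Python. This is a sketch of the general pattern only, not oMLX's actual routing code; pick_engine is a hypothetical name, and the vision_config contents are left empty because they are not reproduced in this report.

```python
def pick_engine(config: dict) -> str:
    """Hypothetical router: choose the engine purely from config keys."""
    # Presence-only check: it never asks whether vision weights exist.
    return "vlm" if "vision_config" in config else "llm"


# Fields from the confirmed example (vision_config contents elided).
paro_config = {
    "architectures": ["Qwen3_5MoeForConditionalGeneration"],
    "model_type": "qwen3_5_moe",
    "vision_config": {},
    "quantization_config": {"quant_method": "paroquant"},
}

print(pick_engine(paro_config))  # vlm

# The same config without the vestigial block routes to the LLM engine.
stripped = {k: v for k, v in paro_config.items() if k != "vision_config"}
print(pick_engine(stripped))  # llm
```

Any router of this shape will treat the checkpoint as multimodal regardless of what the weight files contain.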
Suggested fix: strip vision_config (and vision-related top-level keys like image_token_id) from the published text-only PARO MoE checkpoints. This is the established pattern for text-only quants of multimodal-arch models (e.g. Unsloth's text-only Gemma quants drop vision_config), and it makes these load correctly as LLMs across servers without per-user workarounds.
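A minimal sketch of that stripping step, assuming the only keys to remove are the two named above (vision_config and image_token_id). The function names are hypothetical, and the architectures field is deliberately left untouched because this report does not establish whether it also needs to change.

```python
import json
import shutil
from pathlib import Path

# Keys named in this issue; extend if a checkpoint has other vision-only keys.
VISION_KEYS = ("vision_config", "image_token_id")


def strip_vision_keys(config: dict, keys=VISION_KEYS) -> dict:
    """Return a copy of the config without the given top-level keys."""
    return {k: v for k, v in config.items() if k not in keys}


def strip_config_file(path: str) -> list:
    """Rewrite config.json in place, keeping a .bak copy; return removed keys."""
    p = Path(path)
    config = json.loads(p.read_text(encoding="utf-8"))
    removed = [k for k in VISION_KEYS if k in config]
    if removed:
        # Back up the original before overwriting it.
        shutil.copy2(p, p.with_name(p.name + ".bak"))
        new_text = json.dumps(strip_vision_keys(config), indent=2) + "\n"
        p.write_text(new_text, encoding="utf-8")
    return removed
```

Calling strip_config_file("config.json") on a checkpoint directory returns the list of keys it removed, or an empty list if the file was already clean, in which case nothing is written.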
Per-user workaround (for reference): on oMLX, either set model_type_override: "llm" in model_settings.json, or remove vision_config from the local config.json — both force the LLM engine and avoid the failed VLM attempt.
Happy to provide full logs if useful.