fix(oq): prefuse per-expert MoE weights before mlx_vlm sanitize #933

Open
marxo126 wants to merge 1 commit into jundot:main from marxo126:fix/moe-vlm-per-expert-prefuse

Conversation

@marxo126

Summary

FP8 MoE VLM checkpoints (e.g. Huihui-Qwen3.6-35B-A3B-*-FP8, yujiepan/qwen3.5-moe-tiny-random) crash during oQ quantization after the FP8 dequant phase completes:

WARNING - FP8 sanitize failed ('model.language_model.layers.0.mlp.experts.gate_up_proj'), aborting
ERROR   - oQ quantization failed: <model> -> 'model.language_model.layers.0.mlp.experts.gate_up_proj'
Traceback (most recent call last):
  File "omlx/admin/oq_manager.py", line 441, in _run_quantization
  ...
  File "omlx/oq.py", line 1914, in quantize_oq_streaming
    all_weights = sanitize_fn(all_weights)
  File "omlx/oq.py", line 1476, in _vlm_sanitize
    w = model_module.Model.sanitize(proxy, weights)
  File ".../mlx_vlm/models/qwen3_5_moe/qwen3_5_moe.py", line 29, in sanitize
    gate_up_weight = weights.pop(f"{prefix}.experts.gate_up_proj")
KeyError: 'model.language_model.layers.0.mlp.experts.gate_up_proj'

Reproduced on v0.3.6 and v0.3.7rc2. FP8 dequant succeeds (31,738 tensors), then mlx_vlm's own Model.sanitize crashes because it assumes a fused experts.gate_up_proj tensor while FP8 checkpoints are laid out per-expert (experts.{n}.gate_proj.weight, etc.).
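
For concreteness, the two layouts side by side (layer 0 only; the key names are taken from the traceback above, while the per-expert shapes [I, H] for gate/up and [H, I] for down are assumed for illustration):

```python
# Per-expert layout after oMLX's FP8 dequant (what the checkpoint actually holds):
#   model.language_model.layers.0.mlp.experts.0.gate_proj.weight   # [I, H]
#   model.language_model.layers.0.mlp.experts.0.up_proj.weight     # [I, H]
#   model.language_model.layers.0.mlp.experts.0.down_proj.weight   # [H, I]
#   ... repeated for experts 1..E-1
#
# Fused layout mlx_vlm's Model.sanitize tries to pop (and fails to find):
#   model.language_model.layers.0.mlp.experts.gate_up_proj         # [E, 2*I, H]
#   model.language_model.layers.0.mlp.experts.down_proj            # [E, H, I]
```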

Fix

mlx_vlm upstream declined to handle this layout (see Blaizzy/mlx-vlm#815 / #816: "works well with official model weights"). oMLX owns the FP8 dequant → per-expert layout, so the fix belongs in oMLX rather than mlx_vlm.

Add _prefuse_moe_experts_for_vlm(weights, config) in oq.py that rebuilds the fused layout mlx_vlm expects before calling Model.sanitize (a sketch follows the list below):

  • Scans once for ...mlp.experts.{n}.(gate|up|down)_proj.weight keys via a compiled regex
  • Stacks experts in chunks of 16 (matching _DiscoveredPlan._STACK_CHUNK) with mx.clear_cache() between chunks → bounded peak memory for 128/256-expert models
  • Writes back experts.gate_up_proj ([E, 2·I, H]) and experts.down_proj ([E, H, I]) in the layout that mlx_vlm's qwen3_5_moe sanitize then splits at axis=-2 and passes through
  • No-op on dense models (num_experts unset) and already-fused checkpoints — early return, zero scan cost
  • Resolves num_experts from four config locations (num_local_experts / num_experts, top-level or text_config)
  • Partial/mismatched layers are left untouched so downstream sanitize raises a clear error rather than silently corrupting shapes
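
A minimal sketch of what such a prefuse pass could look like (the helper names, the exact regex, and the grouping logic are illustrative rather than a copy of the oq.py implementation; the chunk size, clear_cache pattern, config fallbacks, and partial-layer guard follow the bullets above):

```python
import re
import mlx.core as mx

_STACK_CHUNK = 16  # mirrors _DiscoveredPlan._STACK_CHUNK per the PR description

_EXPERT_KEY_RE = re.compile(
    r"^(?P<prefix>.+\.mlp\.experts)\.(?P<idx>\d+)\.(?P<proj>gate|up|down)_proj\.weight$"
)


def _resolve_num_experts(config: dict) -> int | None:
    # Four fallbacks: num_local_experts / num_experts, top-level or under text_config.
    for cfg in (config, config.get("text_config") or {}):
        for key in ("num_local_experts", "num_experts"):
            if cfg.get(key) is not None:
                return int(cfg[key])
    return None


def _chunked_stack(tensors: list) -> mx.array:
    # Stack a handful of experts at a time and free the cache in between,
    # so peak memory stays bounded for 128/256-expert models.
    parts = []
    for start in range(0, len(tensors), _STACK_CHUNK):
        parts.append(mx.stack(tensors[start:start + _STACK_CHUNK], axis=0))
        mx.clear_cache()
    return parts[0] if len(parts) == 1 else mx.concatenate(parts, axis=0)


def _prefuse_moe_experts_for_vlm(weights: dict, config: dict) -> dict:
    num_experts = _resolve_num_experts(config)
    if num_experts is None:  # dense model: nothing to scan, nothing to do
        return weights

    # One pass over the keys, grouping per-expert tensors by layer prefix.
    per_layer: dict = {}
    for key in weights:
        m = _EXPERT_KEY_RE.match(key)
        if m:
            per_layer.setdefault(m["prefix"], {})[(int(m["idx"]), m["proj"])] = key

    for prefix, found in per_layer.items():
        expected = {(i, p) for i in range(num_experts) for p in ("gate", "up", "down")}
        if set(found) != expected:
            continue  # partial/mismatched layer: leave it so downstream sanitize raises clearly

        gate = [weights.pop(found[(i, "gate")]) for i in range(num_experts)]
        up = [weights.pop(found[(i, "up")]) for i in range(num_experts)]
        down = [weights.pop(found[(i, "down")]) for i in range(num_experts)]

        # Gate rows first, then up rows, so a later split at axis=-2 recovers them.
        weights[f"{prefix}.gate_up_proj"] = mx.concatenate(
            [_chunked_stack(gate), _chunked_stack(up)], axis=-2
        )  # [E, 2*I, H]
        weights[f"{prefix}.down_proj"] = _chunked_stack(down)  # [E, H, I]

    return weights
```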

Call site is a single line in _vlm_sanitize before Model.sanitize(proxy, weights).
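
Roughly (sketch only; the surrounding _vlm_sanitize code and its exact signature are not shown in this PR):

```python
# Inside _vlm_sanitize, just before the upstream call:
weights = _prefuse_moe_experts_for_vlm(weights, config)   # new line: rebuild fused MoE layout
w = model_module.Model.sanitize(proxy, weights)           # existing call from the traceback
```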

Test plan

  • tests/test_oq.py::TestPrefuseMoeExpertsForVlm — 7 unit tests:
    • fuses per-expert layout end-to-end
    • round-trip split via mx.split(fused, 2, axis=-2) recovers original gate / up values (sketched after this list)
    • no-op on already-fused checkpoints
    • no-op on dense models (no num_experts)
    • partial layer (missing expert) left untouched
    • four num_experts config key fallbacks
    • 20-expert stack spills across the 16-expert chunk boundary
  • pytest tests/test_oq.py -m "not slow" → 133 passed (7 new + 126 existing, no regressions)
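
The round-trip test could look roughly like this (tiny hypothetical shapes, not the real test's values; it exercises the prefuse sketch above, or equivalently the real _prefuse_moe_experts_for_vlm from oq.py):

```python
import mlx.core as mx

def test_round_trip_split_recovers_gate_and_up():
    E, I, H = 4, 8, 6  # hypothetical tiny expert count / intermediate / hidden sizes
    gate = [mx.random.normal((I, H)) for _ in range(E)]
    up = [mx.random.normal((I, H)) for _ in range(E)]

    weights = {}
    for i in range(E):
        base = f"model.language_model.layers.0.mlp.experts.{i}"
        weights[f"{base}.gate_proj.weight"] = gate[i]
        weights[f"{base}.up_proj.weight"] = up[i]
        weights[f"{base}.down_proj.weight"] = mx.random.normal((H, I))

    config = {"text_config": {"num_experts": E}}
    fused = _prefuse_moe_experts_for_vlm(weights, config)

    gate_up = fused["model.language_model.layers.0.mlp.experts.gate_up_proj"]
    assert gate_up.shape == (E, 2 * I, H)

    # Same split mlx_vlm's sanitize performs on the fused tensor.
    g, u = mx.split(gate_up, 2, axis=-2)
    for i in range(E):
        assert mx.allclose(g[i], gate[i])
        assert mx.allclose(u[i], up[i])
```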

Scope

  • No change to the LLM (mlx-lm) sanitize path
  • No change to the discovery-based streaming sanitize for non-FP8 VLMs
  • Only fires on the VLM path: after FP8 dequant, or for any VLM whose checkpoint exposes per-expert keys

FP8 MoE VLM checkpoints (e.g. Qwen3.5-MoE variants) store experts as
`...mlp.experts.{n}.(gate|up|down)_proj.weight` after FP8 dequant, but
`mlx_vlm`'s `Model.sanitize` pops a single fused `experts.gate_up_proj`
and crashes with `KeyError` when it is not present.

Add `_prefuse_moe_experts_for_vlm` that rebuilds the fused layout in
chunks (16 experts per stack) so peak memory stays bounded for models
with hundreds of experts. Called from `_vlm_sanitize` before the
upstream sanitize. No-op on dense models and already-fused checkpoints.

Adds 7 unit tests: round-trip split invariance, already-fused no-op,
dense no-op, partial-layer guard, four `num_experts` config fallbacks,
and 20-expert chunk spillover.