
Bug: MiniMax-M2.7 JANG MoE model fails to load — switch_mlp.gate_proj shape mismatch (config says 8-bit, actual is 4-bit) #967

@Keenni

Description


Bug: MiniMax-M2.7 JANG MoE model fails to load

Model: MiniMax-M2.7-JANG_2L (JANG format v2.0, MoE, 256 local experts)
oMLX Version: latest (via App / Homebrew)
vMLX: loads the same model successfully

Error

Expected shape (256, 1536, 768) but received shape (256, 1536, 192)
for parameter model.layers.0.block_sparse_moe.switch_mlp.gate_proj.weight

Root Cause Analysis

The config.json declares switch_mlp layer quantization as:

"model.layers.0.block_sparse_moe.switch_mlp.gate_proj": {"bits": 8, "group_size": 32, "mode": "affine"}

But the actual stored tensor in safetensors is:

weight: shape=(256, 1536, 192), dtype=uint32
scales: shape=(256, 1536, 24), dtype=float16
biases: shape=(256, 1536, 24), dtype=float16
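
These shapes can be read straight from the safetensors shards; a minimal sketch using MLX (the file name is a placeholder for whichever shard actually holds the layer-0 expert tensors):

import mlx.core as mx

# Placeholder file name; substitute the shard that contains the layer-0 expert weights.
weights = mx.load("model.safetensors")

prefix = "model.layers.0.block_sparse_moe.switch_mlp.gate_proj"
for suffix in ("weight", "scales", "biases"):
    arr = weights[f"{prefix}.{suffix}"]
    print(suffix, arr.shape, arr.dtype)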

The config claims 8-bit quantization with group_size=32, but the actual storage is 4-bit:

Parameter            Claimed (config)     Actual (stored)
bits                 8                    4
group_size           32                   32
packed weight shape  (256, 1536, 768)     (256, 1536, 192)

The dequantization logic in oMLX reads bits: 8 from the config and computes an expected shape that does not match the stored tensor.

Meanwhile, the tensor on disk is packed at a lower bit width than the config claims, which is why its last dimension is so much smaller: the 3072 input features (hidden_size) occupy only 192 uint32 values per row, instead of the 768 uint32 per row that 8-bit packing (4 values per uint32) would produce.
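
For reference, this is how the expected shapes fall out of the config values, assuming MLX-style affine packing where each uint32 holds 32 / bits values along the input dimension; the helper below is illustrative, not oMLX's actual code:

def expected_quantized_shapes(num_experts, out_features, in_features, bits, group_size):
    # Packed weight row: in_features values at `bits` bits each, stored as uint32 words.
    packed = in_features * bits // 32
    # One scale and one bias per group of `group_size` input values.
    groups = in_features // group_size
    weight = (num_experts, out_features, packed)
    scales = (num_experts, out_features, groups)
    return weight, scales

# The config values for gate_proj (bits=8, group_size=32, hidden_size=3072) give
# weight (256, 1536, 768) and scales (256, 1536, 96), which is the shape oMLX
# expects and the stored tensors do not match.
print(expected_quantized_shapes(256, 1536, 3072, bits=8, group_size=32))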

The config.json quantization metadata is wrong for switch_mlp layers — it reports bits: 8 but the safetensors are actually 4-bit packed. oMLX should either:

  1. Auto-detect the actual bits from the tensor shape vs. config shape, or
  2. Allow the JANG format to store the correct per-tensor bits in a separate field

Reproduction Steps

  1. Download MiniMax-M2.7-JANG_2L model
  2. Load in oMLX → fails with shape mismatch error
  3. Load in vMLX → works fine

Model Config Snippet

{
  "model_type": "minimax_m2",
  "hidden_size": 3072,
  "intermediate_size": 1536,
  "num_hidden_layers": 62,
  "num_local_experts": 256,
  "num_experts_per_tok": 8,
  "head_dim": 128,
  "quantization": {
    "bits": 8,
    "group_size": 128,
    "mode": "affine",
    "model.layers.0.block_sparse_moe.switch_mlp.gate_proj": {"bits": 8, "group_size": 32, "mode": "affine"},
    "model.layers.0.block_sparse_moe.switch_mlp.up_proj": {"bits": 8, "group_size": 32, "mode": "affine"},
    "model.layers.0.block_sparse_moe.switch_mlp.down_proj": {"bits": 8, "group_size": 32, "mode": "affine"},
    ...
  }
}

Suggested Fix

In the MoE expert dequantization path, before applying the config's bits value, verify the actual tensor shape against what the config claims. If there's a mismatch, auto-detect the correct bit width from the tensor dimensions.
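
A minimal sketch of that check, assuming MLX-style affine packing (32 / bits values per uint32 along the input dimension); the function names are illustrative, not oMLX's actual API:

def infer_bits(packed_width, in_features):
    # The packed width is in_features * bits / 32, so the bit width can be
    # recovered directly from the stored shape.
    total_bits = packed_width * 32
    if total_bits % in_features != 0:
        raise ValueError(f"packed width {packed_width} is inconsistent with {in_features} input features")
    return total_bits // in_features

def infer_group_size(scales_width, in_features):
    # One scale (and bias) per group, so the group size follows from the scales shape.
    if in_features % scales_width != 0:
        raise ValueError(f"{scales_width} groups do not divide {in_features} input features")
    return in_features // scales_width

def resolve_quant_params(config_bits, config_group_size, weight, scales, in_features):
    # Trust the stored tensors over config.json when the two disagree.
    bits = infer_bits(weight.shape[-1], in_features)
    group_size = infer_group_size(scales.shape[-1], in_features)
    if (bits, group_size) != (config_bits, config_group_size):
        print(f"quantization metadata mismatch: config says bits={config_bits}, "
              f"group_size={config_group_size}; stored tensors imply bits={bits}, "
              f"group_size={group_size}")
    return bits, group_size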

Alternatively, store the correct per-tensor bits value in the JANG format metadata so oMLX reads the right quantization parameters.
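
If the per-tensor route is taken, the lookup on the loader side can stay close to what config.json already encodes; a sketch of resolving per-tensor overrides against the global defaults (again illustrative, not oMLX's actual loader):

def quant_params_for(quant_cfg, tensor_path):
    # Global defaults are the scalar entries ("bits", "group_size", "mode");
    # per-tensor overrides are the nested dicts keyed by the full parameter path.
    defaults = {k: v for k, v in quant_cfg.items() if not isinstance(v, dict)}
    override = quant_cfg.get(tensor_path, {})
    return {**defaults, **override}

# With the config snippet above,
# quant_params_for(config["quantization"],
#                  "model.layers.0.block_sparse_moe.switch_mlp.gate_proj")
# currently returns {"bits": 8, "group_size": 32, "mode": "affine"}. The proposal
# is for the JANG exporter to write the bits value that matches the stored tensor
# into that per-tensor entry, so this lookup no longer disagrees with the file.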
