Add MiniMax M3 model support (text-only)#1398

Open
machiabeli wants to merge 2 commits into ml-explore:main from machiabeli:feat/minimax-m3-vl

machiabeli commented Jun 12, 2026

Summary

  • Adds minimax_m3_vl.py model module for MiniMax M3 (427B MoE, ~23B active params)
  • Builds on the existing minimax.py (M2.7) with key additions for the M3 architecture
  • Phase 1: text-only with dense attention — vision tower and MiniMax Sparse Attention (MSA) deferred to a follow-up

Architecture highlights

| Feature | Implementation |
| --- | --- |
| Hybrid dense/MoE | First 3 layers dense MLP, rest MoE (dispatched via moe_layer_freq) |
| MoE routing | 128 experts, top-4 sigmoid + e_score_correction_bias + shared expert |
| SwigluOAI | Custom gated SiLU with alpha=1.702 scaling + limit=7.0 clamping |
| GQA | 64 heads / 4 KV heads with per-head QK norm (Gemma-style RMSNorm) |
| Partial RoPE | rotary_dim=64 (half of head_dim=128) |
| FP8 source | Block-128 dequantization in sanitize() |
| Tensor parallelism | Full shard() support for distributed inference |
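
To make the "MoE routing" row concrete, here is a minimal NumPy sketch of sigmoid top-k routing with a selection-only correction bias. It is not taken from this PR: the function name `route_tokens`, the renormalization of the selected weights, and the `1e-20` stabilizer are assumptions modeled on the usual sigmoid-plus-bias routing pattern, and the shared expert is omitted.

```python
import numpy as np

def route_tokens(logits, correction_bias, top_k=4):
    """Pick top_k experts per token (hypothetical helper, not from the PR).

    logits:          (tokens, num_experts) router outputs
    correction_bias: (num_experts,) bias used only for expert selection
    """
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float32)))  # sigmoid
    # The bias shifts which experts get chosen, not how they are mixed.
    biased = scores + correction_bias
    # Indices of the top_k largest biased scores (unordered within the top_k).
    inds = np.argpartition(-biased, top_k - 1, axis=-1)[..., :top_k]
    # Mixing weights come from the unbiased scores (assumed renormalized).
    weights = np.take_along_axis(scores, inds, axis=-1)
    weights = weights / (weights.sum(axis=-1, keepdims=True) + 1e-20)
    return inds, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 128))          # 2 tokens, 128 experts
bias = np.zeros(128, dtype=np.float32)
inds, weights = route_tokens(logits, bias)  # shapes: (2, 4) and (2, 4)
```

Because the bias is added only before the top-k selection, a large bias on one expert forces it into the selected set without changing the relative mixing weights computed from the raw sigmoid scores.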

Weight mapping

Expert weights on disk are per-expert 2D tensors (w1/w2/w3), stacked into 3D SwitchGLU format in sanitize() — same pattern as minimax.py. The language_model.model.* HF prefix is remapped to match the nested LanguageModel > TextModel structure.
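
The stacking step described above can be sketched as follows. This is an illustration in NumPy rather than the PR's code: the helper name `stack_experts`, the key layout (`{prefix}.experts.{e}.w1.weight`, `{prefix}.switch_mlp.gate_proj.weight`), and the w1/w2/w3 to gate/down/up mapping are assumptions chosen to resemble the usual per-expert checkpoint layout.

```python
import numpy as np

def stack_experts(weights, prefix, num_experts):
    """Stack per-expert 2D tensors into one 3D tensor per projection.

    Hypothetical helper: key names and the w1/w2/w3 mapping are assumed.
    """
    out = dict(weights)
    mapping = (("w1", "gate_proj"), ("w2", "down_proj"), ("w3", "up_proj"))
    for src, dst in mapping:
        keys = [f"{prefix}.experts.{e}.{src}.weight" for e in range(num_experts)]
        if all(k in out for k in keys):
            # Result shape: (num_experts, out_features, in_features)
            out[f"{prefix}.switch_mlp.{dst}.weight"] = np.stack(
                [out.pop(k) for k in keys]
            )
    return out

hidden, inter, n = 4, 8, 3
raw = {}
for e in range(n):
    raw[f"mlp.experts.{e}.w1.weight"] = np.full((inter, hidden), e, dtype=np.float32)
    raw[f"mlp.experts.{e}.w2.weight"] = np.full((hidden, inter), e, dtype=np.float32)
    raw[f"mlp.experts.{e}.w3.weight"] = np.full((inter, hidden), e, dtype=np.float32)

stacked = stack_experts(raw, "mlp", n)
```

Stacking along a new leading axis keeps each expert's matrix intact, so expert `e` can be recovered by indexing the first dimension.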

Test plan

  • from mlx_lm.models.minimax_m3_vl import Model, ModelArgs imports cleanly
  • ModelArgs.from_dict(config) parses the nested text_config correctly
  • Forward pass on tiny config produces correct output shapes
  • mlx_lm convert --hf-path MiniMaxAI/MiniMax-M3 -q --q-bits 4 completes
  • Converted model generates coherent text
  • Quantization predicates keep routing gate at 8-bit

Adds support for MiniMax M3, a 427B MoE model with ~23B active parameters.

Key features:
- Hybrid dense/MoE architecture (first 3 layers dense, rest MoE)
- 128 experts with top-4 routing + 1 shared expert
- SwigluOAI activation (gated SiLU with alpha scaling + output clamping)
- GQA (64 heads / 4 KV heads) with per-head QK normalization
- Gemma-style RMSNorm (weight + 1)
- Partial rotary embeddings (factor=0.5)
- FP8 weight dequantization in sanitize()
- Tensor parallelism via shard()

Phase 1: text-only with dense attention fallback. Vision tower and
MiniMax Sparse Attention (MSA) deferred to a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

machiabeli changed the title from "Add MiniMax M3 VL model support" to "Add MiniMax M3 model support (text-only)" on Jun 12, 2026

Without this, conversion fails with "4 parameters not in model" for the
vision tower's patch merge MLP weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TheTom added a commit to TheTom/mlx-lm that referenced this pull request on Jun 14, 2026
…locks)

The upstream PR ml-explore#1398 dequant assumed DeepSeek-style 128x128 float-scale
blocks. MiniMaxAI/MiniMax-M3-MXFP8 uses OCP microscaling: U8 E8M0 scales
(value v -> 2^(v-127)) per weight_block_size [1,32] tile. Detect U8
scale_inv and dequant accordingly; verified sane magnitudes on real
expert + attention tensors.
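
The scale decoding in that commit message (E8M0 value v maps to 2^(v-127), one scale per [1,32] tile) can be sketched in NumPy as below. This is a hedged illustration, not the referenced commit's code: the function name `dequant_e8m0` is hypothetical, the weights are assumed to be already decoded from FP8 to floats, the decoded scale is assumed to multiply the weight, and the reserved E8M0 value 255 (NaN in the OCP microscaling format) is not handled.

```python
import numpy as np

def dequant_e8m0(weight, scale_u8, block=32):
    """Apply per-tile power-of-two scales (hypothetical helper).

    weight:   (rows, cols) float values, assumed already decoded from FP8
    scale_u8: (rows, cols // block) uint8 E8M0 exponents
    """
    # E8M0 stores only a biased exponent: v -> 2^(v - 127).
    # Cast before subtracting so the uint8 arithmetic cannot wrap around.
    scale = np.exp2(scale_u8.astype(np.float32) - 127.0)
    # Expand each scale across its 1 x block tile along the last axis.
    scale = np.repeat(scale, block, axis=-1)
    return weight.astype(np.float32) * scale

w = np.ones((2, 64), dtype=np.float32)
s = np.array([[127, 128], [126, 130]], dtype=np.uint8)
out = dequant_e8m0(w, s)  # row 0: 1.0 then 2.0; row 1: 0.5 then 8.0
```

Checking the dtype of the scale tensor (uint8 versus float) is one way to tell this layout apart from float-scale 128x128 blocks, which is the detection approach the commit message describes.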