Add MiniMax M3 model support (text-only) by machiabeli · Pull Request #1398 · ml-explore/mlx-lm

machiabeli · 2026-06-12T16:32:36Z

Summary

- Adds `minimax_m3_vl.py` model module for MiniMax M3 (427B MoE, ~23B active params)
- Builds on the existing `minimax.py` (M2.7) with key additions for the M3 architecture
- Phase 1: text-only with dense attention; the vision tower and MiniMax Sparse Attention (MSA) are deferred to a follow-up

Architecture highlights

| Feature | Implementation |
| --- | --- |
| Hybrid dense/MoE | First 3 layers dense MLP, rest MoE (dispatched via `moe_layer_freq`) |
| MoE routing | 128 experts, top-4 sigmoid + `e_score_correction_bias` + shared expert |
| SwigluOAI | Custom gated SiLU with alpha=1.702 scaling + limit=7.0 clamping |
| GQA | 64 heads / 4 KV heads with per-head QK norm (Gemma-style RMSNorm) |
| Partial RoPE | `rotary_dim=64` (half of `head_dim=128`) |
| FP8 source | Block-128 dequantization in `sanitize()` |
| Tensor parallelism | Full `shard()` support for distributed inference |
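To make the SwigluOAI row concrete, here is a minimal NumPy sketch of a gated SiLU with alpha scaling and clamping. It is modeled on the publicly known gpt-oss formulation, which is an assumption: the `+ 1` term on the linear branch and the exact placement of the clamps may differ in this PR's implementation. The function and argument names are illustrative.

```python
import numpy as np


def swiglu_oai(gate, up, alpha=1.702, limit=7.0):
    """Gated SiLU with alpha scaling and clamping (assumed gpt-oss-style form)."""
    # Clamp the gate branch from above only, and the linear branch on both sides.
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    # SiLU-like gate: x * sigmoid(alpha * x).
    glu = gate * (1.0 / (1.0 + np.exp(-alpha * gate)))
    # The "+ 1" on the linear branch is an assumption borrowed from gpt-oss.
    return glu * (up + 1.0)
```

With `limit=7.0`, arbitrarily large inputs on both branches are bounded by `7 * 8 = 56`, which is the practical effect of the clamping.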

Weight mapping

Expert weights on disk are per-expert 2D tensors (`w1`/`w2`/`w3`), which `sanitize()` stacks into the 3D `SwitchGLU` format, the same pattern as `minimax.py`. The `language_model.model.*` HF prefix is remapped to match the nested `LanguageModel` > `TextModel` structure.
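The stacking step described above can be sketched as follows. This is an illustration using NumPy rather than MLX, and the key layout (`{prefix}.experts.{e}.w1.weight`, `{prefix}.switch_mlp.gate_proj.weight`) and the `w1`/`w2`/`w3` to `gate_proj`/`down_proj`/`up_proj` mapping are assumptions based on the common Mixtral-style convention, not taken from the PR diff.

```python
import numpy as np


def stack_experts(weights, prefix, num_experts):
    """Stack per-expert 2D tensors into one 3D tensor per projection."""
    # Assumed naming convention: w1 = gate, w2 = down, w3 = up.
    mapping = {"w1": "gate_proj", "w2": "down_proj", "w3": "up_proj"}
    for old, new in mapping.items():
        if f"{prefix}.experts.0.{old}.weight" not in weights:
            continue  # already stacked, or not an MoE layer
        parts = [
            weights.pop(f"{prefix}.experts.{e}.{old}.weight")
            for e in range(num_experts)
        ]
        # Result shape: (num_experts, out_features, in_features).
        weights[f"{prefix}.switch_mlp.{new}.weight"] = np.stack(parts)
    return weights
```

Popping each per-expert key while stacking keeps the output dictionary free of leftover 2D entries, which would otherwise show up as parameters not present in the model.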

Test plan

- `from mlx_lm.models.minimax_m3_vl import Model, ModelArgs` imports cleanly
- `ModelArgs.from_dict(config)` parses the nested `text_config` correctly
- Forward pass on tiny config produces correct output shapes
- `mlx_lm convert --hf-path MiniMaxAI/MiniMax-M3 -q --q-bits 4` completes
- Converted model generates coherent text
- Quantization predicates keep routing gate at 8-bit
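For the `from_dict` item in the test plan, the idea of parsing a nested `text_config` can be sketched with plain dataclasses. This is a standalone illustration, not the PR's `ModelArgs`: the class layout, the field names, and the `model_type` value used below are hypothetical, with defaults taken from the architecture table.

```python
from dataclasses import dataclass, fields


@dataclass
class TextArgs:
    # Hypothetical field names; values mirror the architecture table.
    num_attention_heads: int = 64
    num_key_value_heads: int = 4
    head_dim: int = 128
    rotary_dim: int = 64

    @classmethod
    def from_dict(cls, params):
        # Ignore keys the dataclass does not declare.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in params.items() if k in known})


@dataclass
class ModelArgs:
    model_type: str
    text_config: TextArgs

    @classmethod
    def from_dict(cls, params):
        # Only the text sub-config is parsed; vision keys are dropped.
        text = TextArgs.from_dict(params.get("text_config", {}))
        return cls(model_type=params["model_type"], text_config=text)
```

Filtering unknown keys lets the same parser accept a full multimodal config while the text-only phase ignores everything it does not yet use.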

Adds support for MiniMax M3, a 427B MoE model with ~23B active parameters.

Key features:
- Hybrid dense/MoE architecture (first 3 layers dense, rest MoE)
- 128 experts with top-4 routing + 1 shared expert
- SwigluOAI activation (gated SiLU with alpha scaling + output clamping)
- GQA (64 heads / 4 KV heads) with per-head QK normalization
- Gemma-style RMSNorm (weight + 1)
- Partial rotary embeddings (factor=0.5)
- FP8 weight dequantization in sanitize()
- Tensor parallelism via shard()

Phase 1: text-only with dense attention fallback. Vision tower and MiniMax Sparse Attention (MSA) deferred to a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
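The top-4 sigmoid routing listed among the key features can be sketched in NumPy as below. This follows the DeepSeek-V3-style pattern that the name `e_score_correction_bias` suggests, which is an assumption: the bias influences which experts are selected, while the mixing weights come from the unbiased scores. Renormalizing the selected weights to sum to 1 is also an assumption, and the shared expert (always applied) is omitted.

```python
import numpy as np


def route(logits, correction_bias, top_k=4):
    """Pick top_k experts per token from sigmoid scores plus a selection bias."""
    scores = 1.0 / (1.0 + np.exp(-logits))  # (tokens, experts)
    biased = scores + correction_bias  # bias affects selection only
    # Indices of the top_k largest biased scores (unordered within the top_k).
    idx = np.argpartition(-biased, top_k - 1, axis=-1)[..., :top_k]
    # Mixing weights use the unbiased scores of the selected experts.
    weights = np.take_along_axis(scores, idx, axis=-1)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # assumed renorm
    return idx, weights
```

Keeping the bias out of the mixing weights means it can rebalance expert load without changing how much each selected expert contributes.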

…locks)

The dequantization in upstream PR ml-explore#1398 assumed DeepSeek-style 128x128 float-scale blocks. MiniMaxAI/MiniMax-M3-MXFP8 instead uses OCP microscaling: U8 E8M0 scales (value v -> 2^(v-127)), one per `weight_block_size` [1,32] tile. This change detects U8 `scale_inv` tensors and dequantizes accordingly; the resulting magnitudes were verified as sane on real expert and attention tensors.
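The E8M0 scaling rule in this commit message can be sketched as follows. This is a NumPy illustration under stated assumptions: the FP8 payload is taken to be already decoded to floats, scales are laid out as one `uint8` per run of 32 consecutive elements along the last axis, and the function name is hypothetical rather than taken from the commit.

```python
import numpy as np


def dequant_e8m0(values, scale_u8, block=32):
    """Apply per-tile power-of-two scales: scale = 2 ** (v - 127)."""
    if scale_u8.dtype != np.uint8:
        raise TypeError("expected uint8 E8M0 scales")
    # Decode the biased exponent into a float scale.
    scale = np.exp2(scale_u8.astype(np.float32) - 127.0)
    # Expand each scale across its [1, block] tile along the last axis.
    scale = np.repeat(scale, block, axis=-1)
    return values.astype(np.float32) * scale
```

Checking the scale tensor's dtype is one simple way to tell this format apart from float-scale block quantization, in line with the detection the commit describes.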

machiabeli changed the title ~~Add MiniMax M3 VL model support~~ Add MiniMax M3 model support (text-only) Jun 12, 2026

fix: skip patch_merge_mlp vision keys in sanitize()

3b8d2c9

Without this, conversion fails with "4 parameters not in model" for the vision tower's patch merge MLP weights. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

machiabeli wants to merge 2 commits into ml-explore:main from machiabeli:feat/minimax-m3-vl
