Add MiniMax M3 model support (text-only)#1398
Open
machiabeli wants to merge 2 commits into
Adds support for MiniMax M3, a 427B MoE model with ~23B active parameters.

Key features:
- Hybrid dense/MoE architecture (first 3 layers dense, rest MoE)
- 128 experts with top-4 routing + 1 shared expert
- SwigluOAI activation (gated SiLU with alpha scaling + output clamping)
- GQA (64 heads / 4 KV heads) with per-head QK normalization
- Gemma-style RMSNorm (weight + 1)
- Partial rotary embeddings (factor=0.5)
- FP8 weight dequantization in sanitize()
- Tensor parallelism via shard()

Phase 1: text-only with dense attention fallback. Vision tower and MiniMax Sparse Attention (MSA) deferred to a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
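For readers unfamiliar with two of the features listed in the commit message, here is a minimal NumPy sketch of Gemma-style RMSNorm (scale by `1 + weight`) and partial rotary embeddings (only the first `rotary_dim` channels of each head are rotated). It is not the code from this PR: the function names, the `eps` and `base` defaults, and the half-split rotation layout are assumptions chosen for illustration.

```python
import numpy as np


def gemma_rms_norm(x, weight, eps=1e-6):
    """RMS-normalize over the last axis, then scale by (1 + weight)."""
    # eps is an assumed default, not taken from the model config.
    mean_sq = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * (1.0 + weight)


def apply_partial_rope(x, positions, rotary_dim, base=10000.0):
    """Rotate only the first `rotary_dim` channels of x; pass the rest through.

    x: (seq_len, head_dim), positions: (seq_len,).
    Uses the half-split layout (an assumption; some models interleave pairs).
    """
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

With `head_dim=128` and a rotary factor of 0.5, `rotary_dim` is 64, so the last 64 channels of every head carry no positional rotation.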
Without this, conversion fails with "4 parameters not in model" for the vision tower's patch merge MLP weights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom added a commit to TheTom/mlx-lm that referenced this pull request Jun 14, 2026
…locks)

The dequantization in upstream PR ml-explore#1398 assumed DeepSeek-style 128x128 float-scale blocks. MiniMaxAI/MiniMax-M3-MXFP8 instead uses OCP microscaling: U8 E8M0 scales (value v -> 2^(v-127)), one per weight_block_size [1,32] tile. Detect U8 scale_inv and dequantize accordingly; verified sane magnitudes on real expert + attention tensors.
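The E8M0 scaling described in this commit can be sketched as follows. This is a NumPy illustration rather than the commit's code: `dequant_e8m0` is a hypothetical name, the weights are assumed to have already been decoded from FP8 to float32, and the 0xFF code (which the OCP microscaling spec reserves for NaN) is not special-cased.

```python
import numpy as np


def dequant_e8m0(weight, scale_u8, block=32):
    """Apply per-tile E8M0 scales to an already-decoded weight matrix.

    weight:   (rows, cols) float32, FP8 values decoded upstream (assumption).
    scale_u8: (rows, cols // block) uint8 exponents, one per [1, block] tile.
    """
    rows, cols = weight.shape
    if cols % block != 0 or scale_u8.shape != (rows, cols // block):
        raise ValueError("scale shape does not match [1, block] tiling")
    # E8M0 stores only a biased exponent: v -> 2^(v - 127).
    scale = np.exp2(scale_u8.astype(np.float32) - 127.0)
    # Broadcast each tile's scale across its `block` columns.
    return weight * np.repeat(scale, block, axis=1)
```

Checking the scale tensor's dtype (uint8 versus a float type) is one way to tell this layout apart from float-scale 128x128 blocks, which matches the detection the commit message describes.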
Summary
- Adds minimax_m3_vl.py model module for MiniMax M3 (427B MoE, ~23B active params)
- Builds on minimax.py (M2.7) with key additions for the M3 architecture

Architecture highlights
- Hybrid dense/MoE layers (moe_layer_freq)
- Expert routing with e_score_correction_bias + shared expert
- Partial rotary embeddings: rotary_dim=64 (half of head_dim=128)
- FP8 weight dequantization in sanitize()
- shard() support for distributed inference

Weight mapping
Expert weights on disk are per-expert 2D tensors (w1/w2/w3), stacked into 3D SwitchGLU format in sanitize(), following the same pattern as minimax.py. The language_model.model.* HF prefix is remapped to match the nested LanguageModel > TextModel structure.

Test plan
- from mlx_lm.models.minimax_m3_vl import Model, ModelArgs imports cleanly
- ModelArgs.from_dict(config) parses the nested text_config correctly
- mlx_lm convert --hf-path MiniMaxAI/MiniMax-M3 -q --q-bits 4 completes
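As a rough illustration of the weight mapping described above, the sketch below stacks per-expert 2D tensors into one 3D tensor per projection. It uses NumPy instead of MLX so it runs anywhere, and the key layout (`{prefix}.experts.{e}.{name}.weight`, `{prefix}.switch_mlp.{name}.weight`) and the function name `stack_experts` are hypothetical; the real key names are defined by the checkpoint and by sanitize() in this PR.

```python
import numpy as np


def stack_experts(weights, prefix, num_experts):
    """Replace per-expert 2D w1/w2/w3 tensors with one stacked 3D tensor each.

    Key names here are illustrative assumptions, not the checkpoint's layout.
    """
    for name in ("w1", "w2", "w3"):
        keys = [f"{prefix}.experts.{e}.{name}.weight" for e in range(num_experts)]
        if all(k in weights for k in keys):
            # Result shape: (num_experts, out_features, in_features).
            stacked = np.stack([weights.pop(k) for k in keys])
            weights[f"{prefix}.switch_mlp.{name}.weight"] = stacked
    return weights
```

Stacking lets the MoE layer index one tensor by expert id rather than holding a separate module per expert.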