Add MiniMax-M3 (text backbone) #1401

Open: davidrhodus wants to merge 1 commit into ml-explore:main from davidrhodus:add-minimax-m3

@davidrhodus

Summary

Adds support for MiniMax-M3 as a text-only LLM (model_type: "minimax_m3") — the text backbone of MiniMaxAI/MiniMax-M3 (~427B-parameter MoE).

M3 builds on the existing minimax (M2) implementation, with these architectural differences handled here:

  • Gemma-style RMSNorm (normalize in fp32, scale by 1 + weight)
  • Per-head QK-norm over the head dimension
  • Partial RoPE (rotary_dim < head_dim)
  • SwiGLU-OAI activation (clamped gate/up with an (up + 1) term), reused for the dense MLPs, the shared expert, and the experts (via SwitchGLU with a custom activation)
  • MoE with a sigmoid router + correction bias, a shared expert, and a routed-scaling factor; the first few layers are dense MLPs (mlp_layer_types)
  • MiniMax Sparse Attention (MSA) is implemented as full causal attention. This is numerically exact for sequences of up to index_topk_blocks * index_block_size tokens, where MSA selects every key block anyway. Beyond that length the model runs dense, un-approximated attention instead of MSA's block selection, so quality is preserved at the cost of MSA's long-context speed and memory savings. The lightning-indexer tensors are dropped in sanitize.
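
For reviewers who want the arithmetic behind the list above, here is a minimal pure-Python sketch of four of the pieces: the (1 + weight) RMSNorm, the SwiGLU-OAI activation, the sigmoid router with a correction bias, and the sequence length up to which full attention matches MSA. It is not the code in this PR. All function names are hypothetical, the `alpha=1.702` and `limit=7.0` defaults are borrowed from the commonly used OAI-style formulation and are assumptions rather than M3's configured values, and renormalizing the routed weights over the selected experts is also an assumption. Python floats are double precision, so the fp32 upcast is not modeled.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gemma_rms_norm(x, weight, eps=1e-6):
    """RMS-normalize x, then scale by (1 + weight) instead of weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


def swiglu_oai(gate, up, alpha=1.702, limit=7.0):
    """Clamped SwiGLU with an (up + 1) term; alpha and limit are assumed."""
    gate = min(gate, limit)            # gate is clamped from above only
    up = max(-limit, min(up, limit))   # up is clamped on both sides
    return (up + 1.0) * gate * sigmoid(alpha * gate)


def route(logits, correction_bias, top_k, routed_scaling_factor=1.0):
    """Pick top_k experts by biased sigmoid score; weight by unbiased score."""
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    order = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    chosen = order[:top_k]
    total = sum(scores[i] for i in chosen)
    # Renormalizing over the chosen experts is an assumption of this sketch.
    return {i: routed_scaling_factor * scores[i] / total for i in chosen}


def full_attention_matches_msa(seq_len, index_topk_blocks, index_block_size):
    """True while MSA would select every key block, so dense attention is exact."""
    return seq_len <= index_topk_blocks * index_block_size
```

The correction bias only influences which experts are selected; the mixing weights come from the unbiased sigmoid scores. With a zero weight vector, `gemma_rms_norm` returns a unit-RMS vector, which is why checkpoints trained with this convention store weights centered on zero rather than one.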

The checkpoint's per-expert w1/w2/w3 weights are stacked into SwitchGLU exactly as in minimax.py.
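
As a rough illustration of that stacking step, the sketch below regroups per-expert tensors into one stacked entry per projection. It is a stand-in, not the sanitize code from this PR: the function name, the key layout (`{prefix}.experts.{e}.w1.weight` and `{prefix}.switch_mlp.gate_proj.weight`), and the w1/w2/w3 to gate/down/up mapping are assumptions, and plain nested lists take the place of arrays (the real code would stack arrays along a new leading expert axis).

```python
def stack_experts(weights, prefix, num_experts):
    """Merge per-expert w1/w2/w3 entries into stacked SwitchGLU-style entries.

    `weights` is a dict of name -> tensor; nested lists stand in for arrays here.
    """
    # Assumed mapping from checkpoint names to projection names.
    mapping = {"w1": "gate_proj", "w2": "down_proj", "w3": "up_proj"}
    for old, new in mapping.items():
        if f"{prefix}.experts.0.{old}.weight" not in weights:
            continue  # already stacked, or a dense layer with no experts
        stacked = [
            weights.pop(f"{prefix}.experts.{e}.{old}.weight")
            for e in range(num_experts)
        ]
        # Leading axis indexes the expert, as a stacked expert layer expects.
        weights[f"{prefix}.switch_mlp.{new}.weight"] = stacked
    return weights
```

Skipping layers whose first expert key is absent lets the same pass run over dense layers and over checkpoints that were already converted.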

Notes

  • This targets a text-only extraction of the backbone (the M3 checkpoint is minimax_m3_vl; the vision tower, projector, and MTP heads are not part of this LLM build). Quantized MLX builds produced with this code are published here.

Testing

  • Added a minimax_m3 entry to tests/test_models.py exercising both dense and sparse layers.
  • Verified against the model_test_runner logic: forward pass in fp32 and fp16, KV-cache decode, batch size > 1, and deepcopy all pass.

🤖 Generated with Claude Code

Adds support for MiniMax-M3 as a text-only LLM (model_type "minimax_m3"),
the text backbone of MiniMaxAI/MiniMax-M3 (~427B MoE).

M3 extends MiniMax-M2 with Gemma-style RMSNorm (scale by 1+w), per-head
QK-norm, partial RoPE, the SwiGLU-OAI activation, a shared expert plus a
routed-scaling factor in the MoE, and the first few layers being dense MLPs.
MiniMax Sparse Attention is realized as full causal attention (exact up to
topk_blocks*block_size tokens; dense beyond). The implementation reuses the
existing per-expert w1/w2/w3 sanitize layout and SwitchGLU with a custom
activation.

Includes a model test config exercising both dense and sparse layers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>