Add MiniMax-M3 (text backbone) by davidrhodus · Pull Request #1401 · ml-explore/mlx-lm

davidrhodus · 2026-06-13T16:20:19Z

Summary

Adds support for MiniMax-M3 as a text-only LLM (model_type: "minimax_m3") — the text backbone of MiniMaxAI/MiniMax-M3 (~427B-parameter MoE).

M3 builds on the existing minimax (M2) implementation, with these architectural differences handled here:

Gemma-style RMSNorm (normalize in fp32, scale by 1 + weight)
Per-head QK-norm over the head dimension
Partial RoPE (rotary_dim < head_dim)
SwiGLU-OAI activation (clamped gate/up with an (up + 1) term), reused for the dense MLPs, the shared expert, and the experts (via SwitchGLU with a custom activation)
MoE with a sigmoid router + correction bias, a shared expert, and a routed-scaling factor; the first few layers are dense MLPs (mlp_layer_types)
MiniMax Sparse Attention (MSA) is realized as full causal attention. This is numerically exact for sequences up to index_topk_blocks * index_block_size tokens, where MSA selects every key block. Beyond that length, this build computes dense, un-approximated attention instead of MSA's block-sparse selection, so quality is preserved at the cost of MSA's long-context speed and memory savings. The lightning-indexer tensors are dropped in sanitize.
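A minimal pure-Python sketch of three of the items above, for readers who want to see the arithmetic. It is not the code in this PR: the function names are hypothetical, the `alpha` and `limit` defaults and the one-sided gate clamp are assumptions borrowed from the common "OAI" SwiGLU convention, and plain lists and floats stand in for MLX arrays.

```python
import math


def gemma_rms_norm(x, weight, eps=1e-6):
    """Gemma-style RMSNorm over one vector: normalize, then scale by (1 + weight)."""
    # Python floats are float64; they stand in for the fp32 upcast here.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swiglu_oai(gate, up, alpha=1.702, limit=7.0):
    """Clamped SwiGLU with an (up + 1) term (alpha and limit are assumed values)."""
    gate = min(gate, limit)            # assumption: gate is clamped from above only
    up = max(-limit, min(up, limit))   # up is clamped on both sides
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


def msa_is_exact(seq_len, index_topk_blocks, index_block_size):
    """True when MSA would select every key block, so full attention matches it."""
    return seq_len <= index_topk_blocks * index_block_size
```

For example, with the hypothetical settings `index_topk_blocks=8` and `index_block_size=128`, full causal attention reproduces MSA for sequences of up to 1024 tokens.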

The checkpoint's per-expert w1/w2/w3 weights are stacked into SwitchGLU exactly as in minimax.py.
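The stacking step can be sketched as follows. This is an illustration under stated assumptions, not the PR's `sanitize`: the function name is hypothetical, the key prefixes (`model.layers.N.block_sparse_moe...`) and the `w1`/`w2`/`w3` to `gate_proj`/`down_proj`/`up_proj` mapping follow the Mixtral-style convention and are assumed here, and a Python list of per-expert matrices stands in for what would be a single stacked MLX array.

```python
def stack_expert_weights(weights, num_layers, num_experts):
    """Fold per-expert w1/w2/w3 entries into one stacked entry per projection."""
    # Assumed mapping from checkpoint names to SwitchGLU projection names.
    mapping = {"w1": "gate_proj", "w2": "down_proj", "w3": "up_proj"}
    for layer in range(num_layers):
        prefix = f"model.layers.{layer}.block_sparse_moe"
        for src, dst in mapping.items():
            if f"{prefix}.experts.0.{src}.weight" not in weights:
                continue  # dense layer, or weights already stacked
            per_expert = [
                weights.pop(f"{prefix}.experts.{e}.{src}.weight")
                for e in range(num_experts)
            ]
            # With MLX arrays this list would be joined along a new leading
            # (expert) axis, e.g. with mx.stack(per_expert).
            weights[f"{prefix}.switch_mlp.{dst}.weight"] = per_expert
    return weights
```

Layers without per-expert keys, such as the leading dense MLP layers, pass through unchanged.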

Notes

This targets a text-only extraction of the backbone (the M3 checkpoint is minimax_m3_vl; the vision tower, projector, and MTP heads are not part of this LLM build). Quantized MLX builds produced with this code are published here.

Testing

Added a minimax_m3 entry to tests/test_models.py exercising both dense and sparse layers.
Verified against the model_test_runner logic: forward pass in fp32 and fp16, KV-cache decode, batch size > 1, and deepcopy all pass.

🤖 Generated with Claude Code

Adds support for MiniMax-M3 as a text-only LLM (model_type "minimax_m3"), the text backbone of MiniMaxAI/MiniMax-M3 (~427B MoE). M3 extends MiniMax-M2 with Gemma-style RMSNorm (scale by 1+w), per-head QK-norm, partial RoPE, the SwiGLU-OAI activation, a shared expert plus a routed-scaling factor in the MoE, and dense MLPs in the first few layers. MiniMax Sparse Attention is realized as full causal attention (exact up to topk_blocks*block_size tokens; dense beyond). The implementation reuses the existing per-expert w1/w2/w3 sanitize layout and SwitchGLU with a custom activation. Includes a model test config exercising both dense and sparse layers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

davidrhodus wants to merge 1 commit into ml-explore:main from davidrhodus:add-minimax-m3
