[RFC][Data] Add tuned FMoE configs for MiniMax-M2.5 FP8 per_1x128 shapes (MI355X) by peymanr · Pull Request #2 · peymanr/aiter

peymanr · 2026-05-29T19:20:22Z

Part of #4

Summary

Add 28 new tuned fused-MoE (CK 2-stage) configuration rows to aiter/configs/tuned_fmoe.csv for the MiniMax-M2.5 expert shape: E=256, topk=8, hidden=3072, inter=768, activation bfloat16, weights float8_e4m3fn, QuantType.per_1x128. Rows cover M ∈ {4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512} and are tagged minimax_m2.5 in the trailing model-tag column.

This is a pure configuration / tuning-table update — no source code, kernel, or dispatch logic is modified.
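As a quick consistency check on the M coverage listed above, the set can be written out and counted. This is only a sketch: the names `M_VALUES` and `coverage_gaps` are illustrative and are not aiter identifiers.

```python
# M values copied verbatim from the Summary.
# The names below are illustrative only; they are not part of aiter.
M_VALUES = [4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128,
            144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512]


def coverage_gaps(ms, step=8, lo=8, hi=208):
    """Return the multiples of `step` in [lo, hi] that are absent from `ms`."""
    present = set(ms)
    return [m for m in range(lo, hi + 1, step) if m not in present]


print(len(M_VALUES))            # number of M values, one CSV row each
print(coverage_gaps(M_VALUES))  # multiples of 8 up to 208 that are not covered
```

Run against the list above, the count is 28 and the only multiple of 8 between 8 and 208 that does not appear is 136, which reviewers may want to confirm is intentional.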

Motivation

Without shape-specific tuned entries, the fused-MoE dispatcher falls back to the nearest generic shape, which under-utilizes the GEMM tiles for MiniMax-M2.5's (3072, 768) expert dimensions. Adding tuned entries lets the dispatcher select a measured moe_ck2stages_gemm1/gemm2 variant for each M, which is expected to improve decode and small-prefill latency for MiniMax-M2.5 inference on ROCm.
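The exact-match-then-fallback behaviour described above can be sketched as follows. Everything here is an assumption made for illustration: the key layout `(M, E, topk, hidden, inter)`, the distance metric, the placeholder kernel names, and the generic shape are hypothetical and do not reflect aiter's actual dispatch code.

```python
from typing import Dict, Tuple

# Hypothetical lookup key: (M, E, topk, hidden, inter). Not aiter's real schema.
ShapeKey = Tuple[int, int, int, int, int]


def select_kernel(table: Dict[ShapeKey, str], key: ShapeKey) -> Tuple[str, bool]:
    """Return (kernel_name, exact_match) for `key`.

    An exact hit is preferred. Otherwise the entry with the smallest sum of
    absolute differences is used (an assumed, simplistic notion of "nearest").
    """
    if key in table:
        return table[key], True
    nearest = min(table, key=lambda k: sum(abs(a - b) for a, b in zip(k, key)))
    return table[nearest], False


# Placeholder table contents; the generic shape and both names are made up.
table = {(32, 256, 8, 4096, 1024): "generic_variant"}
minimax_key = (32, 256, 8, 3072, 768)

print(select_kernel(table, minimax_key))  # no exact entry, so the fallback is used
table[minimax_key] = "tuned_variant"
print(select_kernel(table, minimax_key))  # exact entry now present
```

The point of the sketch is only that an exact row, once present, takes precedence over whatever the fallback would have chosen.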

This PR is part 2 of a 3-PR series for MiniMax-M2.5 enablement:

- PR 1 (perf): [Perf][Hardware][AMD] Speed up FP8 blockscale MoE decode path on gfx950 (to be opened)
- PR 2 (this PR): [Perf][Kernel] Add tuned fused-MoE configs for MiniMax-M2.5 FP8 per_1x128 shapes
- PR 3 (tuning): [Perf] Add tuned a8w8 blockscale GEMM configs for MiniMax-M2.5 (to be opened)

Changes

aiter/configs/tuned_fmoe.csv: append 28 rows for MiniMax-M2.5 FP8 per_1x128 shapes. Each row records:
- Selected gemm1 kernel variant (mostly moe_ck2stages_gemm1_256x16x128x256_1x4_MulABScaleExpertWeightA8W8blkscale_v1_*_silu_F8_F8_B16, with 128x16x128x128_1x2 chosen for several mid-M cases).
- Selected gemm2 kernel variant (mostly moe_ck2stages_gemm2_256x16x128x256_1x4_*, with 128x16x128x128_1x2 at M=64 and M=128).
- Measured per-stage and total kernel time (us), model tag minimax_m2.5.
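When reviewing the new rows, it can help to pull the stage and tile out of each kernel name. The sketch below assumes the naming pattern visible in the names quoted above (`moe_ck2stages_gemm<stage>_<tile>_...`); the helper `parse_tile` is hypothetical and not part of aiter.

```python
import re

# Pattern inferred from the kernel names quoted in this PR description.
# parse_tile is an illustrative helper, not an aiter function.
_TILE_RE = re.compile(
    r"moe_ck2stages_gemm(?P<stage>[12])_(?P<tile>\d+x\d+x\d+x\d+_\d+x\d+)_"
)


def parse_tile(kernel_name: str):
    """Return (stage, tile) parsed from a CK 2-stage MoE kernel name."""
    match = _TILE_RE.match(kernel_name)
    if match is None:
        raise ValueError(f"unrecognized kernel name: {kernel_name!r}")
    return int(match.group("stage")), match.group("tile")


# Name copied from the Changes list, including its elided `*` segment.
name = ("moe_ck2stages_gemm1_256x16x128x256_1x4_"
        "MulABScaleExpertWeightA8W8blkscale_v1_*_silu_F8_F8_B16")
print(parse_tile(name))
```

Only the prefix of the name is parsed, so the elided `*` segment does not need to be filled in.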

No kernels, headers, build files, or Python dispatch code are modified.

Performance (if applicable)

Below are the tuned kernel total times (us) recorded by the tuner and stored in the new CSV rows. "Before" refers to the previous CSV state, where these M/shape combinations had no exact match and the dispatcher fell back to a nearest-shape generic entry; absolute pre-PR numbers are not captured by the tuner and are marked TBD.

| M | Selected gemm1 tile | Selected gemm2 tile | Total kernel time after (us) | Total kernel time before (us) |
|---|---|---|---|---|
| 4 | 256x16x128x256_1x4 | 256x16x128x256_1x4 | 51.21 | TBD (fallback) |
| 8 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 85.42 | TBD (fallback) |
| 16 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 137.57 | TBD (fallback) |
| 32 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 202.02 | TBD (fallback) |
| 64 | 128x16x128x128_1x2 | 128x16x128x128_1x2 | 281.41 | TBD (fallback) |
| 128 | 128x16x128x128_1x2 | 128x16x128x128_1x2 | 307.55 | TBD (fallback) |
| 256 | 256x16x128x256_1x4 | 256x16x128x256_1x4 | 319.73 | TBD (fallback) |
| 512 | 256x32x128x128_1x4 | 256x32x128x128_1x4 | 333.81 | TBD (fallback) |

Full per-M numbers are in the diff for aiter/configs/tuned_fmoe.csv.
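One way to read the table is to divide each total by M, which gives an amortized kernel time per row of the batch. The sketch below does that arithmetic on the eight values shown above, assuming M is the number of tokens per call; the names `TOTAL_US` and `us_per_token` are illustrative.

```python
# (M -> total kernel time in us), copied from the table above.
# Assumption: M is the number of tokens in the call.
TOTAL_US = {
    4: 51.21, 8: 85.42, 16: 137.57, 32: 202.02,
    64: 281.41, 128: 307.55, 256: 319.73, 512: 333.81,
}


def us_per_token(total_us):
    """Return {M: total_us / M}, the amortized kernel time per token."""
    return {m: t / m for m, t in total_us.items()}


for m, value in us_per_token(TOTAL_US).items():
    print(f"M={m:>3}: {value:.2f} us/token")
```

On these numbers the amortized time falls from about 12.80 us per token at M=4 to about 0.65 us per token at M=512. This says nothing about the speedup over the fallback, which remains TBD.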

Reproduce / re-tune:

python tune_fmoe_minimax_m25_fp8_mi355x.py

Report: paste the final summary table printed by the script (per-M retuned kernel latency in us for E=256, topk=8, hidden=3072, inter=768, bf16/fp8_e4m3fn, per_1x128).

Testing

This PR only touches a CSV tuning table, so testing is limited to schema/consistency validation plus the standard aiter test entry point.

CSV row validation:
```
python validate_tuned_fmoe_csv.py
```
Report: file path validated, total / valid row counts, errors and warnings counts (expected 0 errors).

MiniMax-M2.5–specific validation (presence, dtype, QuantType, model-tag coverage):
```
python validate_minimax_m25_fp8_moe_tuning.py
```
Report: paste the SUMMARY block (count of MiniMax-M2.5 FP8 per_1x128 rows matched, missing Ms, any flagged anomalies).

aiter test entry point (sanity, on MI300X/MI355X):
```
bash .github/scripts/aiter_test.sh
```

Tested on MI300X: TBD
Tested on MI355X (gfx950): TBD
Tested on MI250X: N/A (configs target gfx94x/gfx950 FP8 paths)
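The presence and model-tag coverage check described for the MiniMax-M2.5 validation step could look roughly like the sketch below. It is not the contents of validate_minimax_m25_fp8_moe_tuning.py, which are not shown here; the column names `M` and `model_tag` and the helper `missing_ms` are assumptions for illustration and may not match the real CSV header.

```python
import csv
import io

# Expected M coverage, copied from the Summary.
EXPECTED_MS = {4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
               128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512}


def missing_ms(csv_text, tag="minimax_m2.5", m_col="M", tag_col="model_tag"):
    """Return the expected M values with no row carrying `tag`.

    `m_col` and `tag_col` are hypothetical column names; adjust them to the
    actual header of tuned_fmoe.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    found = {int(row[m_col]) for row in reader if row[tag_col] == tag}
    return sorted(EXPECTED_MS - found)


# Tiny made-up sample: two tagged rows and one row with a different tag.
sample = "M,model_tag\n4,minimax_m2.5\n8,minimax_m2.5\n16,other\n"
print(missing_ms(sample))
```

Against the real file, an empty list would mean every expected M has a tagged row.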

Documentation

No user-facing documentation changes are required: this PR only extends an internal tuning table that the fused-MoE dispatcher reads.

Dependencies

No new third-party dependencies.

Breaking Changes

None. Existing rows are untouched; only new rows are appended.

Append 28 tuned fused-MoE CK 2-stage GEMM entries to aiter/configs/tuned_fmoe.csv covering MiniMax-M2.5's expert layout (E=256, topk=8, hidden=3072, inter=768) with bf16 activation and float8_e4m3fn weights using QuantType.per_1x128, across M in {4,8,16,...,256,512}. This allows the autotuner to dispatch the optimal moe_ck2stages kernel variants for this model on gfx94x/gfx950 instead of falling back to generic shapes. No code paths are changed; this is a pure tuning-table update.

Signed-off-by: <Your Name> <your.email@example.com>

github-actions · 2026-05-29T19:20:31Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 2 --add-label <label>`.

peymanr wants to merge 1 commit into main from seed/perf-add-tuned-fmoe-configs-for-minimax-2
