[RFC][Data] Add tuned FMoE configs for MiniMax-M2.5 FP8 per_1x128 shapes (MI355X) #2

Draft
peymanr wants to merge 1 commit into main from seed/perf-add-tuned-fmoe-configs-for-minimax-2

@peymanr peymanr commented May 29, 2026

Part of #4

Summary

Add 28 new tuned fused-MoE (CK 2-stage) configuration rows to aiter/configs/tuned_fmoe.csv for the MiniMax-M2.5 expert shape: E=256, topk=8, hidden=3072, inter=768, activation bfloat16, weights float8_e4m3fn, QuantType.per_1x128. Rows cover M ∈ {4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512} and are tagged minimax_m2.5 in the trailing model-tag column.

This is a pure configuration / tuning-table update — no source code, kernel, or dispatch logic is modified.

Motivation

Without shape-specific tuned entries, the fused-MoE dispatcher falls back to the nearest generic shape, which under-utilizes the GEMM tiles for MiniMax-M2.5's (3072, 768) expert dimensions. With tuned entries in place, the dispatcher can select the moe_ck2stages_gemm1/gemm2 variant that the tuner measured as fastest for each M, improving decode and small-prefill latency for MiniMax-M2.5 inference on ROCm.
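To make the exact-match versus fallback distinction concrete, here is a minimal sketch. It is not aiter's dispatcher: the table is reduced to a single dimension (M), the `lookup` helper is hypothetical, and "nearest M" is an assumed stand-in for whatever fallback rule the real dispatcher applies across full shapes. The three tile labels are taken from rows in this PR.

```python
from bisect import bisect_left

# Hypothetical, simplified table: M -> selected gemm1 tile for one expert shape.
# The real CSV is keyed on the full shape, dtypes and quant type, not M alone.
TUNED = {
    4: "256x16x128x256_1x4",
    64: "128x16x128x128_1x2",
    512: "256x32x128x128_1x4",
}

def lookup(table, m):
    """Return (tile, exact). Falls back to the nearest tuned M (assumed policy)."""
    if m in table:
        return table[m], True
    keys = sorted(table)
    i = bisect_left(keys, m)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = keys[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda k: abs(k - m))
    return table[nearest], False

print(lookup(TUNED, 64))   # exact row exists
print(lookup(TUNED, 100))  # no row for M=100, so a neighbouring row is reused
```

Under this simplified policy, every M without its own row reuses a tile that was tuned for a different problem size, which is the situation the new rows are meant to avoid.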

This PR is part 2 of a 3-PR series for MiniMax-M2.5 enablement:

  • PR 1 (perf): [Perf][Hardware][AMD] Speed up FP8 blockscale MoE decode path on gfx950 (to be opened)
  • PR 2 (this PR): [Perf][Kernel] Add tuned fused-MoE configs for MiniMax-M2.5 FP8 per_1x128 shapes
  • PR 3 (tuning): [Perf] Add tuned a8w8 blockscale GEMM configs for MiniMax-M2.5 (to be opened)

Changes

  • aiter/configs/tuned_fmoe.csv: append 28 rows for MiniMax-M2.5 FP8 per_1x128 shapes. Each row records:
    • Selected gemm1 kernel variant (mostly moe_ck2stages_gemm1_256x16x128x256_1x4_MulABScaleExpertWeightA8W8blkscale_v1_*_silu_F8_F8_B16, with 128x16x128x128_1x2 chosen for several mid-M cases and 256x32x128x128_1x4 at M=512).
    • Selected gemm2 kernel variant (mostly moe_ck2stages_gemm2_256x16x128x256_1x4_*, with 128x16x128x128_1x2 at M=64 and M=128 and 256x32x128x128_1x4 at M=512).
    • Measured per-stage and total kernel time (us), model tag minimax_m2.5.
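For reviewers who want to tabulate which tile each row selects, the variant names quoted above can be split mechanically. The sketch below is an illustration only: `parse_variant` is a hypothetical helper, and the regular expression assumes every name follows the `moe_ck2stages_gemm<stage>_<AxBxCxD>_<PxQ>_...` pattern seen in this PR. It does not assign meaning to the individual numbers.

```python
import re

# Assumed naming pattern, inferred from the variant names quoted in this PR.
_VARIANT = re.compile(
    r"moe_ck2stages_gemm(?P<stage>[12])_"
    r"(?P<tile>\d+(?:x\d+){3})_"
    r"(?P<layout>\d+x\d+)"
)

def parse_variant(name):
    """Return (stage, tile, layout) parsed from a kernel variant name."""
    m = _VARIANT.match(name)
    if m is None:
        raise ValueError(f"unrecognized kernel variant name: {name!r}")
    tile = tuple(int(v) for v in m["tile"].split("x"))
    layout = tuple(int(v) for v in m["layout"].split("x"))
    return int(m["stage"]), tile, layout

print(parse_variant(
    "moe_ck2stages_gemm1_256x16x128x256_1x4_MulABScaleExpertWeightA8W8blkscale_v1"
))
```

Grouping the parsed tuples by M is a quick way to check claims such as "mostly 256x16x128x256_1x4" against the rows actually added.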

No kernels, headers, build files, or Python dispatch code are modified.

Performance (if applicable)

The table below lists the tuned total kernel times (us) recorded by the tuner and stored in the new CSV rows, for 8 of the 28 M values. "Before" refers to the previous CSV state, where these M/shape combinations had no exact match and the dispatcher fell back to a nearest-shape generic entry; the tuner does not capture absolute pre-PR numbers, so they are marked TBD.

| M | Selected gemm1 tile | Selected gemm2 tile | Total kernel time after (us) | Total kernel time before (us) |
|---|---|---|---|---|
| 4 | 256x16x128x256_1x4 | 256x16x128x256_1x4 | 51.21 | TBD (fallback) |
| 8 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 85.42 | TBD (fallback) |
| 16 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 137.57 | TBD (fallback) |
| 32 | 128x16x128x128_1x2 | 256x16x128x256_1x4 | 202.02 | TBD (fallback) |
| 64 | 128x16x128x128_1x2 | 128x16x128x128_1x2 | 281.41 | TBD (fallback) |
| 128 | 128x16x128x128_1x2 | 128x16x128x128_1x2 | 307.55 | TBD (fallback) |
| 256 | 256x16x128x256_1x4 | 256x16x128x256_1x4 | 319.73 | TBD (fallback) |
| 512 | 256x32x128x128_1x4 | 256x32x128x128_1x4 | 333.81 | TBD (fallback) |

Full per-M numbers are in the diff for aiter/configs/tuned_fmoe.csv.
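One way to read the totals above is to divide each by M. The sketch below does that with the eight values from the table; it assumes M counts the tokens in the batch, which the PR does not state explicitly, and `us_per_token` is a name introduced here for illustration.

```python
# Total kernel time (us) per M, copied from the table in this PR.
TOTAL_US = {
    4: 51.21, 8: 85.42, 16: 137.57, 32: 202.02,
    64: 281.41, 128: 307.55, 256: 319.73, 512: 333.81,
}

def us_per_token(total_us):
    """Amortized kernel time per token, assuming M is the token count."""
    return {m: t / m for m, t in total_us.items()}

for m, cost in us_per_token(TOTAL_US).items():
    print(f"M={m:>3}  {cost:6.2f} us/token")
```

With these numbers the amortized cost falls from about 12.8 us per token at M=4 to about 0.65 us per token at M=512, while the total grows only about 6.5x across a 128x increase in M.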

Reproduce / re-tune:

python tune_fmoe_minimax_m25_fp8_mi355x.py

Report: paste the final summary table printed by the script (per-M retuned kernel latency in us for E=256, topk=8, hidden=3072, inter=768, bf16/fp8_e4m3fn, per_1x128).

Testing

This PR only touches a CSV tuning table, so testing is limited to schema/consistency validation plus the standard aiter test entry point.

  1. CSV row validation:

    python validate_tuned_fmoe_csv.py
    

    Report: file path validated, total / valid row counts, errors and warnings counts (expected 0 errors).

  2. MiniMax-M2.5–specific validation (presence, dtype, QuantType, model-tag coverage):

    python validate_minimax_m25_fp8_moe_tuning.py
    

    Report: paste the SUMMARY block (count of MiniMax-M2.5 FP8 per_1x128 rows matched, missing Ms, any flagged anomalies).

  3. aiter test entry point (sanity, on MI300X/MI355X):

    bash .github/scripts/aiter_test.sh
    
  • Tested on MI300X: TBD
  • Tested on MI355X (gfx950): TBD
  • Tested on MI250X: N/A (configs target gfx94x/gfx950 FP8 paths)
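As an illustration of the coverage check described in step 2, the sketch below reports which expected M values have no row for a given model tag. It is not the contents of validate_minimax_m25_fp8_moe_tuning.py: the `missing_ms` helper and the column names `M` and `tag` are hypothetical and would need to be mapped to the real tuned_fmoe.csv header.

```python
import csv
import io

# The 28 M values listed in the PR summary.
EXPECTED_M = [
    4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104,
    112, 120, 128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512,
]

def missing_ms(csv_text, tag, m_col="M", tag_col="tag"):
    """Return expected M values with no row carrying `tag` (column names assumed)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    seen = {int(row[m_col]) for row in rows if row[tag_col] == tag}
    return [m for m in EXPECTED_M if m not in seen]

# Tiny in-memory example; a real run would read aiter/configs/tuned_fmoe.csv.
sample = "M,tag\n4,minimax_m2.5\n8,minimax_m2.5\n16,other_model\n"
print(missing_ms(sample, "minimax_m2.5"))
```

An empty result would indicate that all 28 M values are covered for the tag.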

Documentation

No user-facing documentation changes required — this PR only extends an internal tuning table consumed by the autotuner.

Dependencies

No new third-party dependencies.

Breaking Changes

None. Existing rows are untouched; only new rows are appended.

Append 28 tuned fused-MoE CK 2-stage GEMM entries to
aiter/configs/tuned_fmoe.csv covering MiniMax-M2.5's expert layout
(E=256, topk=8, hidden=3072, inter=768) with bf16 activation and
float8_e4m3fn weights using QuantType.per_1x128, across M in
{4,8,16,...,256,512}. This allows the autotuner to dispatch the
optimal moe_ck2stages kernel variants for this model on gfx94x/gfx950
instead of falling back to generic shapes.

No code paths are changed; this is a pure tuning-table update.

Signed-off-by: <Your Name> <your.email@example.com>
@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 2 --add-label <label>
