[RFC][Data] Add tuned FMoE configs for MiniMax-M2.5 FP8 per_1x128 shapes (MI355X)#2
Draft
peymanr wants to merge 1 commit into
Append 28 tuned fused-MoE CK 2-stage GEMM entries to
aiter/configs/tuned_fmoe.csv covering MiniMax-M2.5's expert layout
(E=256, topk=8, hidden=3072, inter=768) with bf16 activation and
float8_e4m3fn weights using QuantType.per_1x128, across M in
{4,8,16,...,256,512}. This allows the autotuner to dispatch the
optimal moe_ck2stages kernel variants for this model on gfx94x/gfx950
instead of falling back to generic shapes.
No code paths are changed; this is a pure tuning-table update.
Signed-off-by: <Your Name> <your.email@example.com>
Part of #4
Summary
Add 28 new tuned fused-MoE (CK 2-stage) configuration rows to aiter/configs/tuned_fmoe.csv for the MiniMax-M2.5 expert shape: E=256, topk=8, hidden=3072, inter=768, activation bfloat16, weights float8_e4m3fn, QuantType.per_1x128. Rows cover M ∈ {4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512} and are tagged minimax_m2.5 in the trailing model-tag column.
This is a pure configuration / tuning-table update: no source code, kernel, or dispatch logic is modified.
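The coverage claim above can be checked mechanically. The following is a minimal sketch, not part of this PR: the column names (token, model_dim, inter_dim, expert, topk, model_tag) are hypothetical stand-ins and should be replaced with the actual header of tuned_fmoe.csv.

```python
import csv
import io

# The 28 token counts (M) listed in the summary.
EXPECTED_M = [4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
              128, 144, 152, 160, 168, 176, 184, 192, 200, 208, 256, 512]

# Hypothetical column names; adjust to the real header of tuned_fmoe.csv.
SHAPE = {"expert": "256", "topk": "8", "model_dim": "3072", "inter_dim": "768"}


def missing_m(csv_text, tag="minimax_m2.5"):
    """Return the expected M values with no row for the target shape and tag."""
    found = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("model_tag") != tag:
            continue  # row belongs to another model
        if all(row.get(col) == val for col, val in SHAPE.items()):
            found.add(int(row["token"]))
    return [m for m in EXPECTED_M if m not in found]
```

Calling missing_m(open(path).read()) on the updated table would be expected to return an empty list if all 28 rows are present under the assumed column names.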
Motivation
Without shape-specific tuned entries, the fused-MoE dispatcher falls back to the nearest generic shape, which under-utilizes the GEMM tiles for MiniMax-M2.5's (3072, 768) expert dimensions. Landing tuned entries lets the autotuner pick the optimal moe_ck2stages_gemm1/gemm2 variant per M, improving decode and small-prefill latency for MiniMax-M2.5 inference on ROCm.
This PR is part 2 of a 3-PR series for MiniMax-M2.5 enablement:
1. [Perf][Hardware][AMD] Speed up FP8 blockscale MoE decode path on gfx950 (to be opened)
2. [Perf][Kernel] Add tuned fused-MoE configs for MiniMax-M2.5 FP8 per_1x128 shapes (this PR)
3. [Perf] Add tuned a8w8 blockscale GEMM configs for MiniMax-M2.5 (to be opened)
Changes
aiter/configs/tuned_fmoe.csv: append 28 rows for MiniMax-M2.5 FP8 per_1x128 shapes. Each row records:
- gemm1 kernel variant (mostly moe_ck2stages_gemm1_256x16x128x256_1x4_MulABScaleExpertWeightA8W8blkscale_v1_*_silu_F8_F8_B16, with 128x16x128x128_1x2 chosen for several mid-M cases).
- gemm2 kernel variant (mostly moe_ck2stages_gemm2_256x16x128x256_1x4_*, with 128x16x128x128_1x2 at M=64 and M=128).
- model tag minimax_m2.5.
No kernels, headers, build files, or Python dispatch code are modified.
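When reviewing which variants were selected per M, it can help to split the kernel names into their numeric groups. This is a rough sketch based only on the name pattern visible in this PR; parse_kernel_name is a hypothetical helper, and because the meaning of each number is not spelled out here, the groups are returned as opaque tuples.

```python
import re

# Matches the stage (gemm1/gemm2) and the two numeric groups that follow it,
# e.g. "256x16x128x256" and "1x4". Anything after the second group is ignored.
_KERNEL_RE = re.compile(r"^moe_ck2stages_(gemm[12])_(\d+(?:x\d+)+)_(\d+x\d+)_")


def parse_kernel_name(name):
    """Return (stage, dims, suffix) parsed from a moe_ck2stages kernel name."""
    match = _KERNEL_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized kernel name: {name!r}")
    stage, dims, suffix = match.groups()
    return (
        stage,
        tuple(int(x) for x in dims.split("x")),
        tuple(int(x) for x in suffix.split("x")),
    )
```

Grouping the new rows by the returned (dims, suffix) pair makes it easy to see where the 128x16x128x128_1x2 variant replaces the more common 256x16x128x256_1x4 one.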
Performance (if applicable)
Below are the tuned kernel total times (us) recorded by the tuner and stored in the new CSV rows. "Before" refers to the previous CSV state, where these M/shape combinations had no exact match and the dispatcher fell back to a nearest-shape generic entry; absolute pre-PR numbers are not captured by the tuner and are marked TBD.
Full per-M numbers are in the diff for aiter/configs/tuned_fmoe.csv.
Reproduce / re-tune:
Report: paste the final summary table printed by the script (per-M retuned kernel latency in us for E=256, topk=8, hidden=3072, inter=768, bf16/fp8_e4m3fn, per_1x128).
Testing
This PR only touches a CSV tuning table, so testing is limited to schema/consistency validation plus the standard aiter test entry point.
CSV row validation:
Report: file path validated, total / valid row counts, errors and warnings counts (expected 0 errors).
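For illustration, a minimal sketch of this kind of row validation is shown here. It is not the validation script used for this PR: the key columns are hypothetical, and a real check would likely also include the dtype and QuantType columns in the key.

```python
import csv
import io

# Hypothetical key columns; the real tuned_fmoe.csv header may differ.
KEY_COLS = ("token", "model_dim", "inter_dim", "expert", "topk")


def validate_rows(csv_text):
    """Count total and valid rows, collecting one error message per bad row."""
    total = valid = 0
    errors = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for lineno, row in enumerate(reader, start=2):
        total += 1
        try:
            key = tuple(int(row[col]) for col in KEY_COLS)
        except (KeyError, TypeError, ValueError):
            errors.append(f"line {lineno}: missing or non-integer key column")
            continue
        if any(value <= 0 for value in key):
            errors.append(f"line {lineno}: non-positive value in key {key}")
            continue
        if key in seen:
            errors.append(f"line {lineno}: duplicate key {key}")
            continue
        seen.add(key)
        valid += 1
    return {"total": total, "valid": valid, "errors": errors}
```

A clean table yields total == valid and an empty errors list, which corresponds to the "expected 0 errors" result above.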
MiniMax-M2.5–specific validation (presence, dtype, QuantType, model-tag coverage):
Report: paste the SUMMARY block (count of MiniMax-M2.5 FP8 per_1x128 rows matched, missing Ms, any flagged anomalies).
aiter test entry point (sanity, on MI300X/MI355X):
Documentation
No user-facing documentation changes required — this PR only extends an internal tuning table consumed by the autotuner.
Dependencies
No new third-party dependencies.
Breaking Changes
None. Existing rows are untouched; only new rows are appended.