[RFC][Data] Add tuned A8W8 blockscale GEMM configs for MiniMax-M2.5 shapes (MI355X) #3
Draft
peymanr wants to merge 1 commit into
…5 shapes

Ship pre-tuned a8w8_blockscale_tuned_gemm.csv covering N=3072 and N=4096 with K=3072 across the M sweep (1..512, plus 1024 and 8192) used by MiniMax-M2.5 dense/attention projections on cu_num=256 (MI300-class GPUs). Entries select between intrawave_v1 16x64x256, 16x128x256, 64x64x256 tiles and an intrawave_v3 128x128x128 tile for the large-M case, matching the dispatcher contract in aiter/configs/.

No code paths are changed; this is a tuning-data-only PR. Part of the MiniMax-M2.5 enablement series alongside the FP8 blockscale MoE decode perf PR and the tuned FMoE configs PR.

Signed-off-by: <your name> <your email>
Part of #4
Summary
Adds a new tuning CSV, `aiter/configs/a8w8_blockscale_tuned_gemm.csv`, with pre-selected CK kernel configurations for the A8W8 (per-token-activation, per-128x128-weight) block-scale GEMM used by MiniMax-M2.5 dense / attention projections on `cu_num=256` (MI300-class) GPUs.

This is a tuning-data-only change. No kernels, dispatch logic, or Python APIs are modified.
Motivation
MiniMax-M2.5 exercises a specific set of A8W8 block-scale GEMM shapes (N ∈ {3072, 4096}, K = 3072, with M sweeping from 1 to 512 plus 1024 and 8192). Without an entry in `a8w8_blockscale_tuned_gemm.csv`, the dispatcher falls back to a heuristic default that is not optimal for these shapes on `cu_num=256`. Landing the tuned CSV lets the runtime pick the measured-best `(kernelId, splitK)` pair directly.

This PR is part of a MiniMax-M2.5 enablement series, alongside the FP8 blockscale MoE decode perf PR and the tuned FMoE configs PR.
Changes
- `aiter/configs/a8w8_blockscale_tuned_gemm.csv` (107 rows).
- Columns: `cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio`.
- Coverage:
  - `cu_num = 256`
  - `(N, K) ∈ {(3072, 3072), (4096, 3072)}`
  - `M` sweep: 1, 2, 4, 8, 16, 24, …, 512 (step 8 in mid-range), plus 1024 and 8192.
- `libtype = ck`. Kernels selected:
  - `a8w8_blockscale_1x128x128_256x16x64x256_..._intrawave_v1`
  - `a8w8_blockscale_1x128x128_256x16x128x256_..._intrawave_v1`
  - `a8w8_blockscale_1x128x128_256x64x64x256_..._intrawave_v1`
  - `a8w8_blockscale_1x128x128_256x128x128x128_..._intrawave_v3`
- `errRatio = 0.0` (numerical match against the reference path during tuning).

Performance (if applicable)
Measured `us` / `tflops` / `bw` values from the tuner are recorded directly in the CSV. Representative entries (`cu_num=256`):

End-to-end model-level numbers (vs. the untuned heuristic fallback): TBD, to be filled in after benchmarking with the suggested driver scripts below.
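For readers cross-checking the `us` and `tflops` columns, the two are usually related by the standard GEMM operation count of 2·M·N·K. The sketch below assumes that convention; whether the tuner uses exactly this accounting is an assumption, the function name `gemm_tflops` is hypothetical, and the `us` value in the example is made up rather than taken from the CSV.

```python
def gemm_tflops(m: int, n: int, k: int, us: float) -> float:
    """Convert a measured GEMM latency in microseconds to TFLOP/s.

    Assumes the conventional 2*M*N*K operation count (one multiply and
    one add per inner-product term). This may differ from the tuner's
    own accounting.
    """
    flops = 2 * m * n * k
    seconds = us * 1e-6
    return flops / seconds / 1e12


# Hypothetical latency for the largest covered shape; not a measured value.
example = gemm_tflops(8192, 4096, 3072, us=250.0)
print(f"{example:.1f} TFLOP/s")
```

A recorded `tflops` value that disagrees sharply with this estimate would be worth a second look before the row is trusted.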
Reproduce / validate:
Paste the SUMMARY block from `validate_a8w8_blockscale_tuned_csv.sh` (row count, duplicate `(cu_num,M,N,K)` count, nonzero-`errRatio` count) and the per-GPU summary from `benchmark_a8w8_blockscale_tuner.py` here once collected:

Testing
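The three SUMMARY fields named above (row count, duplicate `(cu_num,M,N,K)` count, nonzero-`errRatio` count) could be computed roughly as follows. This is a Python sketch of the checks, not the contents of `validate_a8w8_blockscale_tuned_csv.sh`; the function name `summarize_tuned_csv` and the inline sample rows (including kernel names and timings) are placeholders invented for illustration.

```python
import csv
import io
from collections import Counter

KEY_COLUMNS = ("cu_num", "M", "N", "K")

# Placeholder rows using the CSV's column header; values are NOT real tuning data.
SAMPLE_CSV = """cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio
256,1,3072,3072,ck,0,0,1.0,placeholder_kernel_a,0.0,0.0,0.0
256,1,3072,3072,ck,1,0,1.0,placeholder_kernel_b,0.0,0.0,0.0
256,8,4096,3072,ck,2,0,1.0,placeholder_kernel_c,0.0,0.0,0.01
"""


def summarize_tuned_csv(lines):
    """Return row count, duplicate-key count, and nonzero-errRatio count."""
    rows = list(csv.DictReader(lines))
    key_counts = Counter(
        tuple(int(row[col]) for col in KEY_COLUMNS) for row in rows
    )
    # Each key seen n > 1 times contributes n - 1 duplicate rows.
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    nonzero_err = sum(1 for row in rows if float(row["errRatio"]) != 0.0)
    return {
        "rows": len(rows),
        "duplicate_keys": duplicates,
        "nonzero_err_ratio": nonzero_err,
    }


summary = summarize_tuned_csv(io.StringIO(SAMPLE_CSV))
print(summary)
```

To run it against the real file, pass an open file handle for `aiter/configs/a8w8_blockscale_tuned_gemm.csv` in place of the `io.StringIO` sample; a clean CSV would report zero duplicates and zero nonzero-`errRatio` rows.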
- `(cu_num,M,N,K)` keys and `errRatio == 0.0` for all rows: TBD; run `validate_a8w8_blockscale_tuned_csv.sh` and paste the SUMMARY block.
- `benchmark_a8w8_blockscale_tuner.py` confirms each `(M,N,K)` row still resolves to a runnable kernel on the target GPU and that the recorded `us` is within tolerance of a fresh measurement.

Documentation
No doc changes. The CSV is consumed automatically by the existing A8W8 block-scale GEMM dispatcher; no user-facing API or guide is affected.
Dependencies
No new third-party dependencies.
Breaking Changes
None. Adding a previously missing tuning CSV can only change which kernel variant the dispatcher picks for the covered `(cu_num, M, N, K)` tuples; it does not alter numerical semantics (all rows tuned with `errRatio = 0.0`), and shapes not listed continue to use the existing fallback.
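The lookup-then-fallback behaviour described here can be pictured with a small Python sketch. It is illustrative only and is not the aiter dispatcher: `load_tuned_table`, `select_kernel`, and `heuristic_fallback` are hypothetical names, exact-match lookup on `M` is an assumption, and the `kernelId` / `splitK` values are placeholders.

```python
def load_tuned_table(rows):
    """Map (cu_num, M, N, K) -> (kernelId, splitK) from parsed CSV rows."""
    return {
        (int(r["cu_num"]), int(r["M"]), int(r["N"]), int(r["K"])): (
            int(r["kernelId"]),
            int(r["splitK"]),
        )
        for r in rows
    }


def select_kernel(table, cu_num, m, n, k, fallback):
    """Use the tuned entry when the shape is covered, else the fallback."""
    key = (cu_num, m, n, k)
    if key in table:
        return table[key]
    return fallback(m, n, k)


def heuristic_fallback(m, n, k):
    # Stand-in for the existing heuristic default; returns a sentinel pair.
    return (-1, 0)


# Placeholder row; the kernelId and splitK values are not real tuning data.
table = load_tuned_table(
    [{"cu_num": "256", "M": "16", "N": "3072", "K": "3072",
      "kernelId": "7", "splitK": "0"}]
)

covered = select_kernel(table, 256, 16, 3072, 3072, heuristic_fallback)
uncovered = select_kernel(table, 256, 16, 5120, 3072, heuristic_fallback)
print(covered, uncovered)
```

Under these assumptions, only tuples present in the table take the tuned path, while every other shape reaches the fallback unchanged, which is why the change is described as non-breaking.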