[RFC][Data] Add tuned A8W8 blockscale GEMM configs for MiniMax-M2.5 shapes (MI355X) by peymanr · Pull Request #3 · peymanr/aiter

peymanr · 2026-05-29T19:20:25Z

Part of #4

Summary

Adds a new tuning CSV, aiter/configs/a8w8_blockscale_tuned_gemm.csv, with pre-selected CK kernel configurations for the A8W8 (per-token activation, per-128x128 weight) block-scale GEMM used by MiniMax-M2.5 dense / attention projections on cu_num=256 (MI355X-class) GPUs.

This is a tuning-data-only change. No kernels, dispatch logic, or Python APIs are modified.

Motivation

MiniMax-M2.5 exercises a specific set of A8W8 block-scale GEMM shapes (N ∈ {3072, 4096}, K = 3072, with M sweeping from 1 to 512 plus 1024 and 8192). Without an entry in a8w8_blockscale_tuned_gemm.csv, the dispatcher falls back to a heuristic default that is not optimal for these shapes on cu_num=256. Landing the tuned CSV lets the runtime pick the measured-best (kernelId, splitK) pair directly.
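
The lookup-then-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not aiter's dispatcher: the `TunedChoice` type, the `select_kernel` helper, the exact-key match on (cu_num, M, N, K), and the `kernel_id` value are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Hypothetical key type: one entry per tuned (cu_num, M, N, K) shape.
ShapeKey = Tuple[int, int, int, int]


class TunedChoice(NamedTuple):
    """The pair the tuned CSV is meant to supply for a shape."""
    kernel_id: int
    split_k: int


def select_kernel(
    table: Dict[ShapeKey, TunedChoice], cu_num: int, m: int, n: int, k: int
) -> Optional[TunedChoice]:
    """Return the tuned choice for a shape, or None when the shape is not listed.

    A None result stands for "use the existing heuristic default".
    Exact-key matching is an assumption made for this sketch.
    """
    return table.get((cu_num, m, n, k))


# Placeholder entry: kernel_id=7 is illustrative, not a real ID from the CSV.
tuned = {(256, 64, 3072, 3072): TunedChoice(kernel_id=7, split_k=0)}

hit = select_kernel(tuned, 256, 64, 3072, 3072)   # covered shape
miss = select_kernel(tuned, 256, 65, 3072, 3072)  # uncovered shape, falls back
```

Under these assumptions, a covered shape resolves directly to its measured-best pair, while any shape outside the CSV keeps the behavior it had before this PR.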

This PR is part of a MiniMax-M2.5 enablement series:

PR 1 (perf): [Perf][Hardware][AMD] Speed up FP8 blockscale MoE decode path on gfx950
PR 2 (tuning): [Perf][Kernel] Add tuned fused-MoE configs for MiniMax-M2.5 FP8 per_1x128 shapes
PR 3 (this PR): tuned A8W8 blockscale GEMM CSV

Changes

New file: aiter/configs/a8w8_blockscale_tuned_gemm.csv (107 rows).
Schema: cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio.
Coverage:
- cu_num = 256
- (N, K) ∈ {(3072, 3072), (4096, 3072)}
- M sweep: 1, 2, 4, 8, 16, 24, …, 512 (step 8 in mid-range), plus 1024 and 8192.
- All libtype = ck.
Selected kernels (by M range):
- Small M (≤ ~64): a8w8_blockscale_1x128x128_256x16x64x256_..._intrawave_v1
- Mid M (~88–160): a8w8_blockscale_1x128x128_256x16x128x256_..._intrawave_v1
- Larger M (≥ ~168): a8w8_blockscale_1x128x128_256x64x64x256_..._intrawave_v1
- M = 8192 (both N values) and M = 1024 at N = 4096: a8w8_blockscale_1x128x128_256x128x128x128_..._intrawave_v3
All entries report errRatio = 0.0 (numerical match against the reference path during tuning).

Performance (if applicable)

Measured us / tflops / bw values from the tuner are recorded directly in the CSV. Representative entries (cu_num=256):

| M | N | K | kernel | splitK | us | TFLOPS | BW (GB/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 3072 | 3072 | 256x16x64x256 v1 | 0 | 10.23 | 29.51 | 936.74 |
| 64 | 3072 | 3072 | 256x16x64x256 v1 | 0 | 10.26 | 117.72 | 977.19 |
| 128 | 3072 | 3072 | 256x16x128x256 v1 | 2 | 11.31 | 213.69 | 939.05 |
| 256 | 3072 | 3072 | 256x64x64x256 v1 | 0 | 11.45 | 421.82 | 1029.83 |
| 512 | 3072 | 3072 | 256x64x64x256 v1 | 2 | 16.54 | 584.28 | 855.88 |
| 1024 | 3072 | 3072 | 256x64x64x256 v1 | 0 | 24.49 | 789.24 | 770.74 |
| 8192 | 3072 | 3072 | 256x128x128x128 v3 | 3 | 122.22 | 1265.13 | 694.95 |
| 64 | 4096 | 3072 | 256x16x64x256 v1 | 0 | 9.95 | 161.94 | 1337.64 |
| 256 | 4096 | 3072 | 256x64x64x256 v1 | 0 | 11.64 | 553.64 | 1329.13 |
| 512 | 4096 | 3072 | 256x64x64x256 v1 | 2 | 17.52 | 735.46 | 1047.40 |
| 1024 | 4096 | 3072 | 256x128x128x128 v3 | 1 | 27.52 | 936.36 | 876.31 |
| 8192 | 4096 | 3072 | 256x128x128x128 v3 | 3 | 158.89 | 1297.53 | 659.96 |
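
As a sanity check on the table, the TFLOPS and BW columns appear to follow from M, N, K, and the measured time. The sketch below assumes 2·M·N·K floating-point operations per GEMM, 1 byte per input element (FP8 activations and weights), and 2 bytes per output element, with scale tensors ignored. These byte sizes are inferred from the reported numbers rather than taken from the tuner source, and the function names are illustrative.

```python
def gemm_tflops(m: int, n: int, k: int, us: float) -> float:
    """Throughput in TFLOPS for an MxNxK GEMM that took `us` microseconds."""
    flops = 2 * m * n * k      # one multiply and one add per MAC
    return flops / us / 1e6    # FLOP per microsecond -> TFLOPS


def gemm_bw_gbps(
    m: int, n: int, k: int, us: float, in_bytes: int = 1, out_bytes: int = 2
) -> float:
    """Effective bandwidth in GB/s, counting A, W, and the output once each.

    The default element sizes are assumptions (8-bit inputs, 16-bit output).
    """
    moved = in_bytes * (m * k + n * k) + out_bytes * m * n
    return moved / us / 1e3    # bytes per microsecond -> GB/s


# First table row: M=16, N=3072, K=3072, us=10.23.
row_tflops = gemm_tflops(16, 3072, 3072, 10.23)
row_bw = gemm_bw_gbps(16, 3072, 3072, 10.23)
```

With these assumptions the recomputed values agree with the tabulated ones to within rounding of the reported `us`, which suggests the columns are derived quantities rather than independent measurements.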

End-to-end model-level numbers (vs. the untuned heuristic fallback): TBD, to be filled in after benchmarking with the driver scripts below.

Reproduce / validate:

```
# 1) Structural validation of the CSV
bash validate_a8w8_blockscale_tuned_csv.sh aiter/configs/a8w8_blockscale_tuned_gemm.csv

# 2) Re-run the tuner and diff against the shipped CSV
python3 benchmark_a8w8_blockscale_tuner.py \
    --csv aiter/configs/a8w8_blockscale_tuned_gemm.csv \
    --out diff_report.txt

# 3) Model-level multi-kernel throughput (MI355X driver)
python3 bench_a8w8_blockscale_models_mi355x.py \
    --csv aiter/configs/a8w8_blockscale_tuned_gemm.csv
```

Paste the SUMMARY block from validate_a8w8_blockscale_tuned_csv.sh (row count, duplicate (cu_num,M,N,K) count, nonzero-errRatio count) and the per-GPU summary from benchmark_a8w8_blockscale_tuner.py here once collected:

MI300X (gfx942): TBD
MI350X (gfx950): TBD
MI355X (gfx950): TBD
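
For reference, the three SUMMARY quantities named above (row count, duplicate (cu_num,M,N,K) count, nonzero-errRatio count) could be computed along these lines. This is a sketch and not the contents of validate_a8w8_blockscale_tuned_csv.sh; the `summarize` helper and the sample rows are illustrative, with placeholder kernelId and kernelName values and one deliberately bad row.

```python
import csv
import io
from collections import Counter

# Column order taken from the schema listed under "Changes".
EXPECTED_COLUMNS = [
    "cu_num", "M", "N", "K", "libtype", "kernelId",
    "splitK", "us", "kernelName", "tflops", "bw", "errRatio",
]


def summarize(csv_text: str) -> dict:
    """Check the header and count rows, duplicate keys, and nonzero errRatio."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    keys = Counter((r["cu_num"], r["M"], r["N"], r["K"]) for r in rows)
    # Every occurrence of a key beyond the first counts as one duplicate.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    nonzero_err = sum(1 for r in rows if float(r["errRatio"]) != 0.0)
    return {
        "rows": len(rows),
        "duplicate_keys": duplicates,
        "nonzero_errRatio": nonzero_err,
    }


# Illustrative input only. The last row is invented to trigger both checks.
SAMPLE = """\
cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio
256,16,3072,3072,ck,0,0,10.23,placeholder_a,29.51,936.74,0.0
256,64,3072,3072,ck,0,0,10.26,placeholder_a,117.72,977.19,0.0
256,64,3072,3072,ck,1,2,11.00,placeholder_b,109.81,911.46,0.01
"""

summary = summarize(SAMPLE)
```

For the shipped file, the expectation stated in this PR is 107 rows with zero duplicates and zero nonzero errRatio entries.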

Testing

Structural CSV check (schema, column count, no duplicate (cu_num,M,N,K) keys, errRatio == 0.0 for all rows): TBD. Run validate_a8w8_blockscale_tuned_csv.sh and paste the SUMMARY block.
Tuner re-run diff: TBD. benchmark_a8w8_blockscale_tuner.py should confirm that each (M,N,K) row still resolves to a runnable kernel on the target GPU and that the recorded us is within tolerance of a fresh measurement.
Existing op tests:
```
bash .github/scripts/aiter_test.sh
```
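
The "within tolerance of a fresh measurement" criterion for the tuner re-run diff could look something like the sketch below. The 10% relative tolerance, the status labels, and the `fresh` timings are assumptions chosen for illustration; the actual threshold used by benchmark_a8w8_blockscale_tuner.py is not stated in this PR.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]  # (M, N, K)


def within_tolerance(recorded_us: float, fresh_us: float, rel_tol: float) -> bool:
    """True when the fresh timing is within rel_tol of the recorded timing."""
    return abs(fresh_us - recorded_us) <= rel_tol * recorded_us


def diff_report(
    recorded: Dict[Shape, float], fresh: Dict[Shape, float], rel_tol: float = 0.10
) -> List[Tuple[Shape, str]]:
    """Classify each recorded shape as 'ok', 'drift', or 'missing'."""
    report = []
    for shape, rec_us in sorted(recorded.items()):
        new_us = fresh.get(shape)
        if new_us is None:
            report.append((shape, "missing"))  # no fresh measurement for this shape
        elif within_tolerance(rec_us, new_us, rel_tol):
            report.append((shape, "ok"))
        else:
            report.append((shape, "drift"))
    return report


# Recorded values come from the performance table; fresh values are hypothetical.
recorded = {(16, 3072, 3072): 10.23, (64, 3072, 3072): 10.26, (128, 3072, 3072): 11.31}
fresh = {(16, 3072, 3072): 10.40, (64, 3072, 3072): 12.50}

report = diff_report(recorded, fresh)
```

A relative tolerance is used here because the recorded timings span more than an order of magnitude across the M sweep, so a fixed absolute threshold would be too loose for small M or too strict for large M.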

Documentation

No doc changes. The CSV is consumed automatically by the existing A8W8 block-scale GEMM dispatcher; no user-facing API or guide is affected.

Dependencies

No new third-party dependencies.

Breaking Changes

None. Adding a previously missing tuning CSV can only change which kernel variant the dispatcher picks for the covered (cu_num, M, N, K) tuples. It does not alter numerical semantics (all rows were tuned with errRatio = 0.0), and shapes not listed continue to use the existing fallback.

…5 shapes

Ship pre-tuned a8w8_blockscale_tuned_gemm.csv covering N=3072 and N=4096 with K=3072 across the M sweep (1..512, plus 1024 and 8192) used by MiniMax-M2.5 dense/attention projections on cu_num=256 (MI355X-class GPUs).

Entries select among the intrawave_v1 16x64x256, 16x128x256, and 64x64x256 tiles and an intrawave_v3 128x128x128 tile for the large-M case, matching the dispatcher contract in aiter/configs/.

No code paths are changed; this is a tuning-data-only PR. Part of the MiniMax-M2.5 enablement series alongside the FP8 blockscale MoE decode perf PR and the tuned FMoE configs PR.

Signed-off-by: <your name> <your email>

github-actions · 2026-05-29T19:20:33Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3 --add-label <label>

peymanr wants to merge 1 commit into main from seed/perf-add-tuned-a8w8-blockscale-gemm-conf-3
