[RFC][Data] Add tuned A8W8 blockscale GEMM configs for MiniMax-M2.5 shapes (MI355X) #3

Draft
peymanr wants to merge 1 commit into main from seed/perf-add-tuned-a8w8-blockscale-gemm-conf-3

@peymanr (Owner) commented May 29, 2026

Part of #4

Summary

Adds a new tuning CSV, aiter/configs/a8w8_blockscale_tuned_gemm.csv, with pre-selected CK kernel configurations for the A8W8 (per-token-activation, per-128x128-weight) block-scale GEMM used by MiniMax-M2.5 dense / attention projections on cu_num=256 (MI355X-class, gfx950) GPUs.

This is a tuning-data-only change. No kernels, dispatch logic, or Python APIs are modified.

Motivation

MiniMax-M2.5 exercises a specific set of A8W8 block-scale GEMM shapes (N ∈ {3072, 4096}, K = 3072, with M sweeping from 1 to 512 plus 1024 and 8192). Without an entry in a8w8_blockscale_tuned_gemm.csv, the dispatcher falls back to a heuristic default that is not optimal for these shapes on cu_num=256. Landing the tuned CSV lets the runtime pick the measured-best (kernelId, splitK) pair directly.
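As a rough illustration of the lookup described above, the sketch below keys a table on (cu_num, M, N, K) and returns None on a miss, which is where a heuristic fallback would take over. It is not the aiter dispatcher: `load_tuned` and `pick_kernel` are hypothetical names, exact-match lookup is an assumption, and because this description does not list kernelId values, the tile label from the representative table stands in for the kernel identifier.

```python
import csv
import io
from typing import Dict, Optional, Tuple

Key = Tuple[int, int, int, int]   # (cu_num, M, N, K)
Choice = Tuple[str, int]          # (kernel label, splitK)

# Two rows taken from the representative table in this PR. The real CSV has
# 12 columns and identifies kernels by kernelId; the "kernel" column here is
# a stand-in tile label because kernelId values are not shown in this PR.
SAMPLE_CSV = """\
cu_num,M,N,K,splitK,kernel
256,16,3072,3072,0,256x16x64x256 v1
256,128,3072,3072,2,256x16x128x256 v1
"""


def load_tuned(text: str) -> Dict[Key, Choice]:
    """Build a (cu_num, M, N, K) -> (kernel, splitK) table from CSV text."""
    table: Dict[Key, Choice] = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["cu_num"]), int(row["M"]), int(row["N"]), int(row["K"]))
        table[key] = (row["kernel"], int(row["splitK"]))
    return table


def pick_kernel(table: Dict[Key, Choice], cu_num: int, m: int, n: int, k: int) -> Optional[Choice]:
    """Return the tuned choice, or None so the caller can use its fallback."""
    return table.get((cu_num, m, n, k))


TUNED = load_tuned(SAMPLE_CSV)
hit = pick_kernel(TUNED, 256, 128, 3072, 3072)    # shape present in the table
miss = pick_kernel(TUNED, 256, 100, 3072, 3072)   # shape absent from the table
print(hit)
print(miss)
```

A shape that is not in the table, or a GPU with a different cu_num, yields None in this sketch; how the real dispatcher handles M values between tuned points is not specified here.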

This PR is part of a MiniMax-M2.5 enablement series:

  • PR 1 (perf): [Perf][Hardware][AMD] Speed up FP8 blockscale MoE decode path on gfx950
  • PR 2 (tuning): [Perf][Kernel] Add tuned fused-MoE configs for MiniMax-M2.5 FP8 per_1x128 shapes
  • PR 3 (this PR): tuned A8W8 blockscale GEMM CSV

Changes

  • New file: aiter/configs/a8w8_blockscale_tuned_gemm.csv (107 rows).
  • Schema: cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio.
  • Coverage:
    • cu_num = 256
    • (N, K) ∈ {(3072, 3072), (4096, 3072)}
    • M sweep: 1, 2, 4, 8, 16, 24, …, 512 (step 8 in mid-range), plus 1024 and 8192.
    • All libtype = ck.
  • Selected kernels (by M range):
    • Small M (≤ ~64): a8w8_blockscale_1x128x128_256x16x64x256_..._intrawave_v1
    • Mid M (~88–160): a8w8_blockscale_1x128x128_256x16x128x256_..._intrawave_v1
    • Larger M (≥ ~168): a8w8_blockscale_1x128x128_256x64x64x256_..._intrawave_v1
    • M = 8192 (both N values) and M = 1024 at N = 4096: a8w8_blockscale_1x128x128_256x128x128x128_..._intrawave_v3
  • All entries report errRatio = 0.0 (numerical match against the reference path during tuning).
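The properties listed above (12-column schema, unique (cu_num, M, N, K) keys, libtype = ck, errRatio = 0.0) can be checked mechanically. The following is a minimal sketch of such a check, not the contents of validate_a8w8_blockscale_tuned_csv.sh; `summarize` is a hypothetical helper, and the kernelId and kernelName values in the sample rows are placeholders, since the real values are not reproduced in this description. The third sample row is a deliberate duplicate.

```python
import csv
import io
from collections import Counter

EXPECTED_HEADER = [
    "cu_num", "M", "N", "K", "libtype", "kernelId",
    "splitK", "us", "kernelName", "tflops", "bw", "errRatio",
]

# Timing columns come from the representative table in this PR; kernelId and
# kernelName are placeholders. The last row intentionally repeats a key.
SAMPLE_CSV = """\
cu_num,M,N,K,libtype,kernelId,splitK,us,kernelName,tflops,bw,errRatio
256,16,3072,3072,ck,0,0,10.23,placeholder_name,29.51,936.74,0.0
256,64,3072,3072,ck,0,0,10.26,placeholder_name,117.72,977.19,0.0
256,64,3072,3072,ck,0,0,10.26,placeholder_name,117.72,977.19,0.0
"""


def summarize(text: str) -> dict:
    """Count rows, malformed rows, duplicate keys, non-ck rows, nonzero errRatio."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")

    keys = Counter()
    rows = bad_width = non_ck = nonzero_err = 0
    for rec in reader:
        if not rec:                      # skip blank lines
            continue
        rows += 1
        if len(rec) != len(EXPECTED_HEADER):
            bad_width += 1
            continue
        row = dict(zip(EXPECTED_HEADER, rec))
        keys[(row["cu_num"], row["M"], row["N"], row["K"])] += 1
        if row["libtype"] != "ck":
            non_ck += 1
        if float(row["errRatio"]) != 0.0:
            nonzero_err += 1

    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {
        "rows": rows,
        "bad_width": bad_width,
        "duplicate_keys": duplicates,
        "non_ck": non_ck,
        "nonzero_err": nonzero_err,
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

On the shipped file, the expectation stated in this PR would be 107 rows with zero duplicates, zero non-ck rows, and zero nonzero errRatio entries.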

Performance (if applicable)

Measured us / tflops / bw values from the tuner are recorded directly in the CSV. Representative entries (cu_num=256):

| M | N | K | kernel (tile, pipeline) | splitK | us | TFLOPS | BW (GB/s) |
|---:|---:|---:|---|---:|---:|---:|---:|
| 16 | 3072 | 3072 | 256x16x64x256 v1 | 0 | 10.23 | 29.51 | 936.74 |
| 64 | 3072 | 3072 | 256x16x64x256 v1 | 0 | 10.26 | 117.72 | 977.19 |
| 128 | 3072 | 3072 | 256x16x128x256 v1 | 2 | 11.31 | 213.69 | 939.05 |
| 256 | 3072 | 3072 | 256x64x64x256 v1 | 0 | 11.45 | 421.82 | 1029.83 |
| 512 | 3072 | 3072 | 256x64x64x256 v1 | 2 | 16.54 | 584.28 | 855.88 |
| 1024 | 3072 | 3072 | 256x64x64x256 v1 | 0 | 24.49 | 789.24 | 770.74 |
| 8192 | 3072 | 3072 | 256x128x128x128 v3 | 3 | 122.22 | 1265.13 | 694.95 |
| 64 | 4096 | 3072 | 256x16x64x256 v1 | 0 | 9.95 | 161.94 | 1337.64 |
| 256 | 4096 | 3072 | 256x64x64x256 v1 | 0 | 11.64 | 553.64 | 1329.13 |
| 512 | 4096 | 3072 | 256x64x64x256 v1 | 2 | 17.52 | 735.46 | 1047.40 |
| 1024 | 4096 | 3072 | 256x128x128x128 v3 | 1 | 27.52 | 936.36 | 876.31 |
| 8192 | 4096 | 3072 | 256x128x128x128 v3 | 3 | 158.89 | 1297.53 | 659.96 |
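For readers cross-checking the table, the tflops and bw columns appear consistent with the usual GEMM accounting: 2·M·N·K floating-point operations, and bytes moved equal to one byte per FP8 input element plus two bytes per output element. This formula is an inference from the numbers above, not something taken from the tuner source; `gemm_metrics` is a hypothetical helper, and scale tensors are ignored.

```python
def gemm_metrics(m: int, n: int, k: int, us: float, out_bytes: int = 2):
    """Return (TFLOPS, GB/s) for an M x N x K GEMM that took `us` microseconds.

    Assumptions (inferred, not confirmed by the tuner source):
      * 2*M*N*K floating-point operations,
      * 1-byte (FP8) A and B operands, `out_bytes`-byte output elements,
      * block-scale tensors are small enough to ignore.
    """
    seconds = us * 1e-6
    flops = 2 * m * n * k
    bytes_moved = m * k + n * k + m * n * out_bytes
    tflops = flops / seconds / 1e12
    gbps = bytes_moved / seconds / 1e9
    return tflops, gbps


# Two rows from the table: (M, N, K, us)
small = gemm_metrics(16, 3072, 3072, 10.23)
large = gemm_metrics(8192, 3072, 3072, 122.22)
print(small)
print(large)
```

Both rows land within rounding distance of the recorded values (about 29.5 TFLOPS and 937 GB/s for M = 16; about 1265 TFLOPS and 695 GB/s for M = 8192). The small residual is expected because the us column is rounded to two decimals.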

End-to-end model-level numbers (vs. the untuned heuristic fallback): TBD, to be filled in after benchmarking with the driver scripts below.

Reproduce / validate:

# 1) Structural validation of the CSV
bash validate_a8w8_blockscale_tuned_csv.sh aiter/configs/a8w8_blockscale_tuned_gemm.csv

# 2) Re-run the tuner and diff against the shipped CSV
python3 benchmark_a8w8_blockscale_tuner.py \
    --csv aiter/configs/a8w8_blockscale_tuned_gemm.csv \
    --out diff_report.txt

# 3) Model-level multi-kernel throughput (MI355X driver)
python3 bench_a8w8_blockscale_models_mi355x.py \
    --csv aiter/configs/a8w8_blockscale_tuned_gemm.csv

Paste the SUMMARY block from validate_a8w8_blockscale_tuned_csv.sh (row count, duplicate (cu_num,M,N,K) count, nonzero-errRatio count) and the per-GPU summary from benchmark_a8w8_blockscale_tuner.py here once collected:

  • MI300X (gfx942): TBD
  • MI350X (gfx950): TBD
  • MI355X (gfx950): TBD

Testing

  • Structural CSV check (schema, column count, no duplicate (cu_num,M,N,K) keys, errRatio == 0.0 for all rows): TBD. Run validate_a8w8_blockscale_tuned_csv.sh and paste the SUMMARY block.
  • Tuner re-run diff: TBD. benchmark_a8w8_blockscale_tuner.py should confirm that each (M,N,K) row still resolves to a runnable kernel on the target GPU and that the recorded us is within tolerance of a fresh measurement.
  • Existing op tests:
    bash .github/scripts/aiter_test.sh
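
The "within tolerance" comparison in the tuner re-run item could look roughly like the sketch below. It is not the logic of benchmark_a8w8_blockscale_tuner.py: `diff_rows` is a hypothetical helper, the 10% relative tolerance is an assumed value (this PR does not state one), and the "fresh" timings are made-up numbers used only to exercise the check.

```python
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, int, int]  # (M, N, K)


def within_tolerance(recorded_us: float, fresh_us: float, rel_tol: float) -> bool:
    """True if the fresh timing is within rel_tol of the recorded timing."""
    return abs(fresh_us - recorded_us) <= rel_tol * recorded_us


def diff_rows(
    recorded: Dict[Shape, float],
    fresh: Dict[Shape, float],
    rel_tol: float = 0.10,  # assumed tolerance, not taken from the PR
) -> List[Tuple[Shape, float, Optional[float]]]:
    """List shapes whose fresh timing is missing or outside the tolerance."""
    flagged = []
    for shape, rec_us in sorted(recorded.items()):
        new_us = fresh.get(shape)
        if new_us is None or not within_tolerance(rec_us, new_us, rel_tol):
            flagged.append((shape, rec_us, new_us))
    return flagged


# Recorded timings come from the representative table in this PR.
recorded = {
    (16, 3072, 3072): 10.23,
    (64, 3072, 3072): 10.26,
    (128, 3072, 3072): 11.31,
}
# Hypothetical fresh measurements, invented for illustration only.
fresh = {
    (16, 3072, 3072): 10.40,   # within 10%
    (64, 3072, 3072): 12.00,   # about 17% slower, flagged
    # (128, 3072, 3072) missing, e.g. the kernel failed to run: flagged
}

report = diff_rows(recorded, fresh)
for shape, rec_us, new_us in report:
    print(shape, rec_us, new_us)
```

A missing fresh timing is treated as a failure here so that a kernel that no longer runs on the target GPU is not silently skipped.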

Documentation

No doc changes. The CSV is consumed automatically by the existing A8W8 block-scale GEMM dispatcher; no user-facing API or guide is affected.

Dependencies

No new third-party dependencies.

Breaking Changes

None. Adding a previously-missing tuning CSV can only change which kernel variant the dispatcher picks for the covered (cu_num, M, N, K) tuples; it does not alter numerical semantics (all rows tuned with errRatio = 0.0) and shapes not listed continue to use the existing fallback.

…5 shapes

Ship pre-tuned a8w8_blockscale_tuned_gemm.csv covering N=3072 and
N=4096 with K=3072 across the M sweep (1..512, plus 1024 and 8192)
used by MiniMax-M2.5 dense/attention projections on cu_num=256
(MI355X-class GPUs, gfx950). Entries select between intrawave_v1 16x64x256,
16x128x256, 64x64x256 tiles and an intrawave_v3 128x128x128 tile
for the large-M case, matching the dispatcher contract in
aiter/configs/.

No code paths are changed; this is a tuning-data-only PR. Part of
the MiniMax-M2.5 enablement series alongside the FP8 blockscale MoE
decode perf PR and the tuned FMoE configs PR.

Signed-off-by: <your name> <your email>
@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3 --add-label <label>
