
bench(fmha): cp.async vs TMA bulk microbench — refutes Tier-2 LDGSTS→TMA lever #169

Merged
github-actions[bot] merged 1 commit into main from perf/fmha-v-load-tma-bench on May 14, 2026
Conversation


kekzl (Owner) commented on May 14, 2026

Summary

Phase-1 microbench gate for the LDGSTS→TMA migration lever from memory file `hw_capability_audit_complete_2026_05_10` (predicted 5-10% kernel speedup on hand-rolled FMHA-MXFP4 kernels via cp.async → cp.async.bulk.tensor.2d). The bench A/B-tests the two variants on the four FMHA V-load shapes used by current model templates.

Empirical result — TMA loses on SM120

| Shape                   | cp.async   | TMA bulk  | speedup       |
|-------------------------|-----------:|----------:|--------------:|
| HD=128 Bkv=128 (32 KiB) |  6008 GB/s | 4616 GB/s | 0.77× (⬇30%)  |
| HD=128 Bkv=64 (16 KiB)  | 14193 GB/s | 4598 GB/s | 0.32× (⬇3.1×) |
| HD=64 Bkv=128 (16 KiB)  | 14938 GB/s | 4561 GB/s | 0.31× (⬇3.3×) |
| HD=256 Bkv=64 (32 KiB)  |  5848 GB/s | 4646 GB/s | 0.79× (⬇26%)  |

Tested on RTX 5090 (sm_120a), 7 reps × 4096 iters × 170 CTAs. Same source tile, same SMEM destination, same launch geometry.

Why TMA loses here

- TMA bulk on SM120 carries a fixed per-tile setup overhead that does not amortize at the 16-32 KiB tile sizes typical for FMHA.
- cp.async with 128 threads × 16 B/issue keeps the memory engine saturated via per-thread issue fan-out.
- SM120 (consumer Blackwell) appears to have lower TMA throughput than SM100 (data-center), where the original Tier-2 estimate likely came from.

The result is consistent across all four production-relevant shapes; there is no shape regime where TMA would justify the 1-2 dev-week integration.

Decision

Abandon the LDGSTS→TMA conversion lever. The microbench-first gate saved a multi-week integration effort by surfacing negative empirical evidence up front.

Bench shipped as re-runnable infrastructure for future revisiting:
```bash
docker run --rm --gpus all imp:test imp-tests --gtest_filter='FmhaVLoadBench.*'
```

Test plan

  • build clean (`make build`)
  • verify-fast green (decode +2.26%, prefill +3.53%, graphs 1.39×)
  • All 4 bench shapes execute cleanly
  • No behavior change for production code paths — bench-only TU

🤖 Generated with Claude Code

github-actions[bot] enabled auto-merge (squash) on May 14, 2026 at 10:53
github-actions[bot] merged commit adbf4e4 into main on May 14, 2026
3 checks passed
kekzl deleted the perf/fmha-v-load-tma-bench branch on May 14, 2026 at 21:16
