
bench(fmha): cp.async vs TMA bulk microbench — refutes Tier-2 LDGSTS→TMA lever #169

Merged
github-actions[bot] merged 1 commit into main from perf/fmha-v-load-tma-bench on May 14, 2026
Conversation


kekzl (Owner) commented on May 14, 2026

Summary

Phase-1 microbench gate for the LDGSTS→TMA migration lever from memory file `hw_capability_audit_complete_2026_05_10` (predicted 5-10% kernel speedup on hand-rolled FMHA-MXFP4 kernels via cp.async → cp.async.bulk.tensor.2d). The bench A/B-tests the two variants on the four FMHA V-load shapes used by current model templates.

Empirical result — TMA loses on SM120

| Shape                   | cp.async   | TMA bulk  | speedup       |
|-------------------------|-----------:|----------:|--------------:|
| HD=128 Bkv=128 (32 KiB) |  6008 GB/s | 4616 GB/s | 0.77× (⬇30%)  |
| HD=128 Bkv=64 (16 KiB)  | 14193 GB/s | 4598 GB/s | 0.32× (⬇3.1×) |
| HD=64 Bkv=128 (16 KiB)  | 14938 GB/s | 4561 GB/s | 0.31× (⬇3.3×) |
| HD=256 Bkv=64 (32 KiB)  |  5848 GB/s | 4646 GB/s | 0.79× (⬇26%)  |

Tested on RTX 5090 (sm_120a), 7 reps × 4096 iters × 170 CTAs. Same source tile, same SMEM destination, same launch geometry.

Why TMA loses here

- TMA bulk on SM120 carries a fixed per-tile setup overhead that does not amortize at the 16-32 KiB tile sizes typical for FMHA.
- cp.async with 128 threads × 16 B/issue keeps the memory engine saturated via per-thread issue fan-out.
- SM120 (consumer Blackwell) appears to have lower TMA throughput than SM100 (data-center), where the original Tier-2 estimate likely came from.

The result is consistent across all four production-relevant shapes; there is no shape regime where TMA would justify the 1-2 dev-week integration.

Decision

Abandon the LDGSTS→TMA conversion lever. The microbench-first gate saved a multi-week integration effort by surfacing negative empirical evidence up front.

Bench shipped as re-runnable infrastructure for future revisiting:
```bash
docker run --rm --gpus all imp:test imp-tests --gtest_filter='FmhaVLoadBench.*'
```

Test plan

  • build clean (`make build`)
  • verify-fast green (decode +2.26%, prefill +3.53%, graphs 1.39×)
  • All 4 bench shapes execute cleanly
  • No behavior change for production code paths — bench-only TU

🤖 Generated with Claude Code

github-actions[bot] enabled auto-merge (squash) on May 14, 2026 at 10:53
github-actions[bot] merged commit adbf4e4 into main on May 14, 2026
3 checks passed
kekzl deleted the perf/fmha-v-load-tma-bench branch on May 14, 2026 at 21:16
