# bench(fmha): cp.async vs TMA bulk microbench — refutes Tier-2 LDGSTS→TMA lever #169
Phase-1 gate for the LDGSTS→TMA migration lever documented in memory file
`hw_capability_audit_complete_2026_05_10` (predicted 5-10% kernel speedup
on hand-rolled FMHA-MXFP4 kernels). The microbench A/B-tests the two variants
on the four FMHA V-load shapes used by current model templates
(HD = {64, 128, 256}, Bkv = {64, 128}).
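For context, here is a minimal sketch of the two copy variants under comparison. This is not the PR's bench code: the function names, the 128-thread CTA geometry, and the elided mbarrier init / phase management are all assumptions.

```cuda
#include <cstdint>

// Variant A: cp.async — all 128 threads issue 16 B copies each, keeping
// many loads in flight per cycle (per-thread issue fan-out).
__device__ void tile_load_cp_async(char* smem_dst, const char* gmem_src,
                                   int tile_bytes) {
  for (int off = threadIdx.x * 16; off < tile_bytes; off += 128 * 16) {
    unsigned s = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst + off));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(s), "l"(gmem_src + off));
  }
  asm volatile("cp.async.commit_group;\n");
  asm volatile("cp.async.wait_group 0;\n");  // tile is now resident in SMEM
  __syncthreads();
}

// Variant B: cp.async.bulk (TMA) — one elected thread issues a single bulk
// copy; completion is tracked by an mbarrier credited with tile_bytes.
// mbarrier.init (once, before first use) is elided from this sketch.
__device__ void tile_load_tma_bulk(char* smem_dst, const char* gmem_src,
                                   int tile_bytes, uint64_t* smem_mbar) {
  unsigned mbar = static_cast<unsigned>(__cvta_generic_to_shared(smem_mbar));
  unsigned dst  = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
  if (threadIdx.x == 0) {
    unsigned long long state;
    asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 %0, [%1], %2;\n"
                 : "=l"(state) : "r"(mbar), "r"(tile_bytes));
    asm volatile(
        "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
        " [%0], [%1], %2, [%3];\n"
        :: "r"(dst), "l"(gmem_src), "r"(tile_bytes), "r"(mbar) : "memory");
  }
  // Every thread spins on barrier phase 0; a real loop must alternate parity.
  unsigned done = 0;
  while (!done)
    asm volatile("{.reg .pred p;\n"
                 " mbarrier.try_wait.parity.shared::cta.b64 p, [%1], 0;\n"
                 " selp.u32 %0, 1, 0, p;}\n"
                 : "=r"(done) : "r"(mbar));
  __syncthreads();
}
```

The structural difference is the point: variant A fans issue out across all 128 threads, while variant B funnels the whole tile through one bulk command plus barrier bookkeeping.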
## Empirical result — TMA is SLOWER on SM120 for these shapes
| Shape | cp.async | TMA bulk | TMA / cp.async |
|--------------------------|-----------:|----------:|---------------:|
| HD=128, Bkv=128 (32 KiB) | 6008 GB/s | 4616 GB/s | 0.77× (⬇30%) |
| HD=128, Bkv=64 (16 KiB) | 14193 GB/s | 4598 GB/s | 0.32× (⬇3.1×) |
| HD=64, Bkv=128 (16 KiB) | 14938 GB/s | 4561 GB/s | 0.31× (⬇3.3×) |
| HD=256, Bkv=64 (32 KiB) | 5848 GB/s | 4646 GB/s | 0.79× (⬇26%) |
Both variants were tested on a real RTX 5090 (sm_120a) via gtest: 7 reps × 4096
iters × 170 CTAs. Same source tile, same SMEM destination, same launch geometry.
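The harness shape those numbers imply can be sketched as below; the function name, best-of-7 reporting policy, and dynamic-SMEM launch are assumptions, not the shipped gtest code.

```cuda
#include <cuda_runtime.h>

// Hypothetical harness matching the stated geometry. `kernel` is any variant
// that copies one tile per CTA per inner iteration into dynamic SMEM.
using TileKernel = void (*)(const char* src, int tile_bytes, int iters);

double bench_gbps(TileKernel kernel, const char* d_src, int tile_bytes) {
  constexpr int kReps = 7, kIters = 4096, kCtas = 170, kThreads = 128;
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  double best = 0.0;
  for (int r = 0; r < kReps; ++r) {
    cudaEventRecord(beg);
    kernel<<<kCtas, kThreads, tile_bytes>>>(d_src, tile_bytes, kIters);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    // Aggregate bandwidth: every CTA moves tile_bytes once per iteration.
    double gbps = double(kCtas) * kIters * tile_bytes / (ms * 1e-3) / 1e9;
    if (gbps > best) best = gbps;  // best-of-7 reporting is an assumption
  }
  cudaEventDestroy(beg);
  cudaEventDestroy(end);
  return best;
}
```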
## Why TMA loses here
- TMA bulk on SM120 has fixed setup overhead per tile that doesn't amortize
at 16-32 KiB tile sizes typical for FMHA.
- cp.async with 128 threads × 16 B/issue keeps the memory engine saturated
via per-thread issue fan-out.
- SM120 (consumer Blackwell) appears to have lower TMA throughput than
SM100 (data-center) where the original Tier-2 estimate likely came from.
The result is consistent across all four production-relevant shapes; there is
no shape regime where TMA would justify the 1-2 dev-week integration. A
back-of-envelope on the implied per-tile overhead is sketched below.
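The following hedged estimate is consistent with the fixed-overhead bullet above, but the serialization model is my assumption, not the PR's: if the 170 CTAs run one per SM and each iteration fully serializes issue and wait, the measured bandwidth gap at 16 KiB implies roughly 400 ns of un-amortized per-tile cost on the TMA path.

```cpp
#include <cstdio>

int main() {
  // Inputs taken from the 16 KiB rows of the table above.
  const double ctas = 170, tile = 16 * 1024;            // bytes per tile
  const double bw_cp_async = 14193e9, bw_tma = 4598e9;  // aggregate B/s
  // Per-tile time per CTA under a fully serialized issue->wait model:
  //   t = ctas * tile_bytes / aggregate_bw
  double t_cp  = ctas * tile / bw_cp_async * 1e9;       // ≈ 196 ns per tile
  double t_tma = ctas * tile / bw_tma * 1e9;            // ≈ 606 ns per tile
  std::printf("cp.async %.0f ns, TMA %.0f ns, delta %.0f ns\n",
              t_cp, t_tma, t_tma - t_cp);               // ≈ 410 ns overhead
}
```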
## Decision
**Abandon** the LDGSTS→TMA conversion lever. Memory files have been updated to
reflect this empirical refutation; the microbench-first gate saved a multi-week
integration effort on negative evidence.

The bench is shipped as re-runnable infrastructure so the decision can be
revisited if CUDA / driver / hardware updates change the picture:

```bash
docker run --rm --gpus all imp:test imp-tests --gtest_filter='FmhaVLoadBench.*'
```

No behavior change for any production code path — bench-only TU.