Three tightly-scoped changes for the Gluon GEMM tutorial blog:
1. Update repo references from gfx9-gluon-tutorials to
gfx950-gluon-tutorials (4 occurrences in the blog README) — the
tutorial repo was renamed to scope the name to the gfx950 ISA
variant the kernels actually target.
2. Add an author profile under blogs/authors/lixun-zhang.md and a
matching contributor card in blogs/contributor-bios.md (alphabetical
slot between Justin Chang and Mahdieh Ghazimirsaeed). Author photo
reference points at blogs/authors/data/Lixun-Zhang.jpg, to be
uploaded separately.
3. Fix two metadata values in primus-projection/README.md that the
"Blogs / Tags & Category" linter rejects:
amd_blog_hardware_platforms: 'AMD Instinct(TM) GPUs'
-> 'Instinct GPUs'
amd_blog_development_tools: includes 'AMD ROCm(TM) Software'
-> 'ROCm Software'
Body-text mentions of "AMD Instinct(TM)" are unaffected (the linter
only checks the metadata block). The fix is out of scope for this
blog, but the failing values block the same CI check on every PR, so
it is included here to turn the Tags & Category check green.
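For reference, the corrected values, sketched as a Python dict (the
real block is YAML front matter in primus-projection/README.md;
surrounding keys are omitted, and the development-tools list is shown
with a single entry):

```python
# Sketch of the two corrected metadata values, not the full front matter.
metadata = {
    "amd_blog_hardware_platforms": "Instinct GPUs",   # was 'AMD Instinct(TM) GPUs'
    "amd_blog_development_tools": ["ROCm Software"],  # was 'AMD ROCm(TM) Software'
}
```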
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restructure based on editorial review:
Opening
- Lead with the 522 -> 1619 TFLOPS / 3x speedup arc and the BF8/MXFP4
numbers, instead of opening with a topic list.
- Make audience explicit in the second paragraph (kernel devs, compiler
engineers, performance specialists).
- Borrow the repo's sharper Gluon-vs-Triton positioning ("Triton's
strength is hardware-portable productivity; Gluon is the tool when you
need to extract every last percent on a target architecture").
- Move the "blog is the map, repo is the tutorial" line to a callout
after the perf table so it lands harder once the reader has scope.
Calibration
- Add an explicit explanation of what MFMA efficiency means, and the
approximate MI355 MFMA peaks (~1650 TFLOPS FP16, ~3500 BF8, ~6200
MXFP4) so readers can calibrate "near-peak".
- Update the chart caption to call out 522 -> 1619 explicitly.
- Name the architecture precisely: MI350/MI355 (gfx950, CDNA4).
Optimization narrative
- Adopt the repo's Acts I-IV framing for v0-v9 instead of the
"early/middle/later" paraphrase.
- Sharpen verbs: v3 "eliminates LDS bank conflicts" (was "studies";
  a toy model of a bank conflict follows this list), v6 "removes
  register-copy overhead", v7 "first reaches 98% MFMA
  efficiency", v9 unpacks XCD-aware L2 locality with a brief
  parenthetical (8 XCDs, separate L2s, etc.).
- Add a thread-trace screenshot (gluon-gemm-att-near-peak.png, copied
from the tutorial repo's v7_amdgcnas_bottleneck.png) to reinforce
the "profiling is the through-line" thesis with visual evidence.
MXFP4 section
- Lead with the 5728 TFLOPS / 92% headline; explain the scale pipeline
after the result.
- Expand "GR -> LW -> LR" to "Global Read of scales into registers, LDS
Write to convert their layout, then LDS Read to feed the scaled MFMA
instruction" on first use, in a callout.
- Mention ds_read_tr explicitly as the hardware-assisted layout
conversion.
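To make the GR -> LW -> LR expansion concrete, a NumPy stand-in for
the scale pipeline. Shapes and names are illustrative only; the real
kernel stages the MXFP4 block scales through LDS and uses ds_read_tr
to change layout during the read.

```python
import numpy as np

# One scale per MXFP4 block, laid out as it arrives from memory.
scales = np.arange(16 * 4, dtype=np.uint8).reshape(16, 4)  # Global Read (GR)

lds = scales.copy()  # LDS Write (LW): stage the scales in shared memory
to_mfma = lds.T      # LDS Read (LR): a ds_read_tr-style transposed read,
                     # yielding the layout the scaled MFMA instruction expects
```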
Reproducibility
- Replace vague "ROCm Triton branch and pinned tag documented in the
repository" with an explicit link to the gfx9-gluon-tutorials-pin tag
in ROCm/triton.
- Show "bench.py --version 0" and "--version 9" back-to-back so readers
can see the 3x for themselves.
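A minimal sketch of that back-to-back run, written in Python for
consistency with the other snippets here (assumes bench.py at the
tutorial repo root; the flags are the ones named above):

```python
import subprocess

# Baseline v0, then near-peak v9: the ~3x gap shows up between the two.
subprocess.run(["python", "bench.py", "--version", "0"], check=True)
subprocess.run(["python", "bench.py", "--version", "9"], check=True)
```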
Disclaimers
- Add a Performance-varies disclaimer covering hardware/software
variability, alongside the existing third-party-content disclaimer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop "production-grade" from the lede; keep just "1619 TFLOPS near-peak kernel". The repo positions these as tutorial artifacts, not deployable libraries. - Define amdgcnas inline on first body-text mention: "combined with amdgcnas (a post-assembly peephole pass over the generated AMDGCN assembly)". Same treatment LLIR scheduler already gets in Act II. - Reword v6's register-set alternation as "double-buffering the operand registers, so consecutive K iterations swap which register set the MFMA consumes". Plainer than "alternating two register sets". - Move the slicing figure from after Act IV to right after Act III, and reword its caption to call out v7/v8 explicitly so the visual sits next to the text it illustrates. - Switch both Triton-pin links from /tree/<tag> to /releases/tag/<tag>, the canonical URL form for an annotated tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the conceptual framing ("MFMA efficiency is the share of MFMA peak
throughput the kernel actually sustains, so 98% means within 2% of
MI355's theoretical MFMA peak") but remove the concrete peak figures
per data type. The MFMA-efficiency percentages already do the
calibration work without committing the post to specific peak numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous paragraph conflated two metrics: MFMA efficiency (a
within-loop, cycle-level fraction measured from the thread trace) and
the end-to-end TFLOPS / peak ratio. They are not the same: end-to-end
also folds in epilogue stores, prologue setup, and multi-CU dispatch.
Reword to define the metric correctly and explicitly note the scope
limitation, matching docs/mfma_efficiency.md in the tutorial repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
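A small sketch of the distinction drawn above, with illustrative
field names (not rocprofv3's actual schema):

```python
def mfma_efficiency(mfma_busy_cycles: int, mainloop_cycles: int) -> float:
    # Within-loop, cycle-level: the fraction of main-loop cycles in
    # which an MFMA is issuing, measured from the thread trace.
    return mfma_busy_cycles / mainloop_cycles

def end_to_end_ratio(achieved_tflops: float, peak_tflops: float) -> float:
    # End-to-end: also folds in prologue setup, epilogue stores, and
    # multi-CU dispatch, so it generally sits below MFMA efficiency.
    return achieved_tflops / peak_tflops
```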
The companion tutorial repo (ROCm/gfx950-gluon-tutorials) finalized
the gfx950-tutorial-v0.1 Triton pin and refreshed every perf table
under it; sync the blog so its numbers, pin links, command examples,
and narrative match.
Headline updates:
- FP16 v9: 522 -> 1619 TFLOPS (3x) becomes 520 -> 1489 TFLOPS (~3x)
- BF8: 3456 -> 3257 TFLOPS; 99% -> 99.72% MFMA efficiency
- MXFP4: 5728 -> 5255 TFLOPS; 92% -> 92.41% MFMA efficiency
Other refreshes:
- Result table (a16w16/a8w8/a4w4 row) and chart caption updated
to the new TFLOPS / MFMA-eff numbers.
- Replaced gluon-gemm-performance-progression.png with the chart
regenerated by the tutorial's offline plotter against the new
/data/perf_data.csv (18 bars: each version with all measured
configs, no v7 + llir+ra column).
- Triton pin link rewritten from gfx9-gluon-tutorials-pin on
ROCm/triton (matmul_4waves) to gfx950-tutorial-v0.1 on
triton-lang/triton (commit on the gfx950-tutorial branch).
Both inline and disclaimer references updated.
- Disclaimer ROCm version bumped 6.5.0 -> 7.0 (the tutorial
scripts and rocprofv3 paths now require ROCm >= 7.0).
- bench.py example flag corrected: --use-rocprof did not exist
  in bench.py, so it was dropped; the v9 example now sets
  TRITON_ENABLE_LLIR_SCHED=1 / TRITON_ENABLE_AMDGCN_AS=1 so the
  invocation actually exercises the optimizations the comment
  promises.
- run_perf_table.py example: --use-rocprof -> --rocprof to match
  the actual flag. Both corrected invocations are sketched after
  this list.
- Slightly tightened the Act III v6 paragraph: v6's unroll-by-2
eliminates the per-iteration copy as designed, but the unroll
keeps both register sets concurrently live and pushes against
the 512-VGPR limit; v7's N-slicing is what actually delivers
the throughput by halving the B-tile footprint.
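The two corrected invocations, sketched in Python (env-var names and
flags are from the bullets above; paths are illustrative):

```python
import os
import subprocess

# v9 with the promised optimizations actually enabled:
env = dict(os.environ,
           TRITON_ENABLE_LLIR_SCHED="1",  # LLIR scheduler pass
           TRITON_ENABLE_AMDGCN_AS="1")   # amdgcnas peephole pass
subprocess.run(["python", "bench.py", "--version", "9"], env=env, check=True)

# run_perf_table.py takes --rocprof, not --use-rocprof:
subprocess.run(["python", "run_perf_table.py", "--rocprof"], check=True)
```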
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rework the post around the 10 reader-perspective fixes called out in
review:
1. Open with the most arresting fact (98.75% MFMA efficiency on
MI355) instead of a generic "measurement, not guesswork" line.
2. Reframe the MFMA-efficiency caveat from a defensive footnote
under the table to a positive methodology note inside
"Profiling drives the tutorial."
3. Promote the v6 spill regression to the centerpiece of Act III,
with a callout: a 73% TFLOPS crash, 99 spilled VGPRs, and v7's
slicing as the by-design fix. This is the tutorial's most
teachable moment and was previously buried in continuous prose.
4. Update the chart caption to call attention to the visible v6
dip in the bar chart.
5. Expand the MXFP4 section: name the broader W4 quantization
trajectory (W4A8, W4A16, GPTQ, AWQ-style), name the actual
scaled-MFMA instruction, and explain why the GR -> LW -> LR
scale pipeline is a generalizable pattern for any quantized
data path with on-disk-vs-MFMA layout mismatch.
6. Add a "From kernel to model" section that explicitly bridges
GEMM techniques to AI-workload kernels (FC layers, MoE experts,
attention matmul, KV-cache projections), names what the tutorial
does NOT yet cover (memory-bound GEMM, FlashAttention prefill /
decode, MXFP4 MoE -- on the public roadmap), and gives an
evaluator-readable answer to "why MI350 for AI": the recipe is
open and end-to-end reproducible, no black box.
7. Add a "Where to look in docs/" callout pointing at the four
standalone references (performance_philosophy, mfma_efficiency,
lds_throughput, memory_bandwidth_model) that previously were
only mentioned in passing.
8. End the body on a forward statement instead of dwindling into
disclaimers; the disclaimer block is unchanged but no longer
carries the closing impression.
9. Move the "post is the map / repo is the tutorial" blockquote
from the structural intro to just before "Try the tutorial",
where it actually directs the reader.
10. Tighten the "Try the tutorial" intro to lead with what the
reader will see (~3x in two commands) before the Triton-build
setup.
No factual numbers changed -- this commit is rhetorical, not data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the 12 reader-perspective fixes from the second review:
1. Expand Act IV / v9: the chiplet-GPU + 8-XCD insight gets a
   full paragraph plus the closed-form `f(GM) = GM + ceil(P/GM)`
   remapping objective (a toy version is sketched at the end of
   this message), the L2-miss counters (5.3M -> 4.1M), and a
   sentence framing this as the kind of MI350-specific quirk
   evaluators need up front. Was previously a single
   parenthetical at the end of one sentence.
2. Reorder/shorten "Why another GEMM tutorial?": drop the
defensive "ROCm provides libraries..." opening that was
deflating the energy of the opening hook; lead instead with a
short pedagogy-first paragraph and go straight to the data
table.
3. Add a callout after the bench.py snippet warning that
bench.py prints do_bench cache-warm numbers a few percent
below the cold-cache rocprofv3 numbers used elsewhere in
the post (prevents reader disappointment on running the
command).
4. Reframe "where the tutorial does not yet go" as "what's
coming next" -- same content (memory-bound GEMM, FA prefill /
decode, MXFP4 MoE), opposite emotional valence, ties to
"the kernels that dominate LLM inference and MoE inference."
5. "Several wrong turns" -> "a regression along the way" --
the post narrates one wrong turn (v6), not several.
6. Cut "What Gluon makes explicit" by ~25%: collapse three
setup paragraphs into one denser paragraph, keep the bullet
list and connective close.
7. Concretize "no vendor secrets": name the actual artifacts
(kernel source, LLIR scheduler pass, amdgcnas peephole,
run_perf_table.py reproducer, pinned Triton commit, MIT
license) instead of abstract claim.
8. Drop reading-order step 5 (redundant with the new docs/
subsection); replace with a one-line pointer.
9. Crisper MFMA caveat: "End-to-end TFLOPS is a different
question..." -> "End-to-end TFLOPS, which also captures
epilogue, prologue, and multi-CU dispatch, is reported
alongside it."
10. Compress the Summary: cut the generic "near-peak isn't
one trick - it's a sequence of design decisions [list]"
windup so the closing punchline ("AMD GPUs need vendor
secrets to hit peak...") lands without dilution.
11. Update chart alt-text to mention the v6 dip and the bar
structure (per-version, per-config), not just baseline-vs-
final monotonic story.
12. Set setup-time expectations in "Try the tutorial": "Setup
is ~30 minutes (Triton built from source); after that, two
commands reproduce the journey." Honest up-front so the
build step doesn't feel like a bait-and-switch after the
2-command promise.
No factual numbers changed; this commit is rhetorical.
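For the Act IV expansion in item 1, a toy version of the XCD-aware
remap and the closed-form objective. Everything here is illustrative
(function names, the divisibility assumption, and the reading of
f(GM) as the quantity to minimize); the real derivation lives in the
tutorial's Gluon source and docs.

```python
import math

NUM_XCDS = 8  # MI350-class parts: 8 XCDs, each with its own L2

def xcd_remap(pid: int, num_pids: int) -> int:
    # Hardware dispatches workgroups round-robin across the XCDs, so
    # pid % NUM_XCDS selects the XCD. Invert that so consecutive
    # logical tiles stay on one XCD and share its L2. Assumes num_pids
    # is a multiple of NUM_XCDS for simplicity.
    xcd, slot = pid % NUM_XCDS, pid // NUM_XCDS
    return xcd * (num_pids // NUM_XCDS) + slot

def best_group_size(num_tiles: int) -> int:
    # Treats the post's closed-form objective f(GM) = GM + ceil(P/GM)
    # as the quantity to minimize when picking the tile-group size.
    return min(range(1, num_tiles + 1),
               key=lambda gm: gm + math.ceil(num_tiles / gm))
```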
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Depends on ROCm/gfx950-gluon-tutorials going public first
Objective of the new blog:
Walk readers through the v0 -> v9 Gluon GEMM optimization journey on
MI350/MI355 (gfx950), from the FP16 baseline to a near-peak kernel,
with every step reproducible from the companion tutorial repo.
Signoff section must be completed prior to publishing.