
Add Gluon GEMM tutorial blog#345

Draft
zhanglx13 wants to merge 10 commits into release from
add-gluon-gemm-tutorial-blog-squashed

Conversation


@zhanglx13 zhanglx13 commented Apr 29, 2026

Depends on ROCm/gfx950-gluon-tutorials going public first

Objective of the new blog:
Please describe the intent of the blog.

Signoff section must be completed prior to publishing.

  • Technical reviewer approves publishing: (edit and replace with @githubid)
  • Editorial reviewer approves publishing: (edit and replace with @githubid)
  • Add a thumbnail image for your blog if one is available
  • Text nugget summarizing your article. 2-3 lines to draw the reader's attention. Possibly the opening paragraph can be used.
  • Blog author team signoffs

@zhanglx13 zhanglx13 requested a review from saadrahim as a code owner April 29, 2026 08:02
@zhanglx13 zhanglx13 marked this pull request as draft April 29, 2026 17:01
zhanglx13 and others added 9 commits May 2, 2026 00:04
Three tightly-scoped changes for the Gluon GEMM tutorial blog:

1. Update repo references from gfx9-gluon-tutorials to
   gfx950-gluon-tutorials (4 occurrences in the blog README) — the
   tutorial repo was renamed to scope the name to the gfx950 ISA
   variant the kernels actually target.

2. Add an author profile under blogs/authors/lixun-zhang.md and a
   matching contributor card in blogs/contributor-bios.md (alphabetical
   slot between Justin Chang and Mahdieh Ghazimirsaeed). Author photo
   reference points at blogs/authors/data/Lixun-Zhang.jpg, to be
   uploaded separately.

3. Fix two metadata values in primus-projection/README.md that the
   "Blogs / Tags & Category" linter rejects:
     amd_blog_hardware_platforms: 'AMD Instinct(TM) GPUs'
        -> 'Instinct GPUs'
     amd_blog_development_tools: includes 'AMD ROCm(TM) Software'
        -> 'ROCm Software'
   Body-text mentions of "AMD Instinct(TM)" are unaffected (the linter
   only checks the metadata block). The fix is out of scope for this
   blog, but the failing check blocks every PR until it is resolved;
   it is included here so the Tags & Category check turns green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
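
For concreteness, the metadata fix in item 3 amounts to a front-matter
change along these lines (field names and values are taken from the
commit message above; the surrounding block structure and any other
list entries are assumed, not copied from the actual README):

```yaml
amd_blog_hardware_platforms: 'Instinct GPUs'   # was 'AMD Instinct(TM) GPUs'
amd_blog_development_tools:                    # list; the rejected entry becomes:
  - 'ROCm Software'                            # was 'AMD ROCm(TM) Software'
```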
Restructure based on editorial review:

Opening
- Lead with the 522 -> 1619 TFLOPS / 3x speedup arc and the BF8/MXFP4
  numbers, instead of opening with a topic list.
- Make audience explicit in the second paragraph (kernel devs, compiler
  engineers, performance specialists).
- Borrow the repo's sharper Gluon-vs-Triton positioning ("Triton's
  strength is hardware-portable productivity; Gluon is the tool when you
  need to extract every last percent on a target architecture").
- Move the "blog is the map, repo is the tutorial" line to a callout
  after the perf table so it lands harder once the reader has scope.

Calibration
- Add an explicit explanation of what MFMA efficiency means, and the
  approximate MI355 MFMA peaks (~1650 TFLOPS FP16, ~3500 BF8, ~6200
  MXFP4) so readers can calibrate "near-peak".
- Update the chart caption to call out 522 -> 1619 explicitly.
- Name the architecture precisely: MI350/MI355 (gfx950, CDNA4).

Optimization narrative
- Adopt the repo's Acts I-IV framing for v0-v9 instead of the
  "early/middle/later" paraphrase.
- Sharpen verbs: v3 "eliminates LDS bank conflicts" (was "studies"),
  v6 "removes register-copy overhead", v7 "first reaches 98% MFMA
  efficiency", v9 unpacks XCD-aware L2 locality with a brief
  parenthetical (8 XCDs, separate L2s, etc.).
- Add a thread-trace screenshot (gluon-gemm-att-near-peak.png, copied
  from the tutorial repo's v7_amdgcnas_bottleneck.png) to reinforce
  the "profiling is the through-line" thesis with visual evidence.

MXFP4 section
- Lead with the 5728 TFLOPS / 92% headline; explain the scale pipeline
  after the result.
- Expand "GR -> LW -> LR" to "Global Read of scales into registers, LDS
  Write to convert their layout, then LDS Read to feed the scaled MFMA
  instruction" on first use, in a callout.
- Mention ds_read_tr explicitly as the hardware-assisted layout
  conversion.

Reproducibility
- Replace vague "ROCm Triton branch and pinned tag documented in the
  repository" with an explicit link to the gfx9-gluon-tutorials-pin tag
  in ROCm/triton.
- Show "bench.py --version 0" and "--version 9" back-to-back so readers
  can see the 3x for themselves.

Disclaimers
- Add a Performance-varies disclaimer covering hardware/software
  variability, alongside the existing third-party-content disclaimer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
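
The expanded "GR -> LW -> LR" wording above can be modeled with a toy
sketch. This is illustrative only: plain Python lists stand in for
registers and LDS, and a transpose stands in for the real layout
shuffle; `stage_scales` is a hypothetical name, not tutorial code.

```python
# Toy model of the GR -> LW -> LR scale pipeline: Global Read of scales
# into registers, an LDS Write that converts their layout, then an LDS
# Read in the layout the scaled-MFMA instruction consumes.

def stage_scales(scales_global):
    # GR: global read into "registers" (row-major list of rows)
    regs = [row[:] for row in scales_global]
    # LW: the LDS write converts the layout; a transpose stands in for
    # the real shuffle (ds_read_tr is the hardware-assisted conversion)
    lds = [list(col) for col in zip(*regs)]
    # LR: read back in the MFMA-friendly layout
    return lds
```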
- Drop "production-grade" from the lede; keep just "1619 TFLOPS
  near-peak kernel". The repo positions these as tutorial artifacts,
  not deployable libraries.

- Define amdgcnas inline on first body-text mention: "combined with
  amdgcnas (a post-assembly peephole pass over the generated AMDGCN
  assembly)". This is the same treatment the LLIR scheduler already
  gets in Act II.

- Reword v6's register-set alternation as "double-buffering the operand
  registers, so consecutive K iterations swap which register set the
  MFMA consumes". Plainer than "alternating two register sets".

- Move the slicing figure from after Act IV to right after Act III, and
  reword its caption to call out v7/v8 explicitly so the visual sits
  next to the text it illustrates.

- Switch both Triton-pin links from /tree/<tag> to /releases/tag/<tag>,
  the canonical URL form for an annotated tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
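
The v6 reword above describes double-buffering the operand registers.
A minimal sketch of that control flow, in plain Python with
hypothetical `load`/`mfma` callables standing in for the global loads
and the MFMA op (this is not the tutorial's kernel):

```python
# v6-style operand double-buffering: two register sets; each K
# iteration consumes one set while the next iteration's operands are
# prefetched into the other, so consecutive K iterations swap which
# set the MFMA consumes.

def run_k_loop(k_iters, load, mfma):
    regs = [None, None]          # two operand register sets
    regs[0] = load(0)            # prologue: fill the first set
    acc = 0
    for k in range(k_iters):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < k_iters:
            regs[nxt] = load(k + 1)   # prefetch into the idle set
        acc = mfma(acc, regs[cur])    # MFMA consumes the current set
    return acc
```

The cost a later commit in this PR calls out is visible here: both
register sets stay live across the loop, which is what pushes v6
against the 512-VGPR limit.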
Keep the conceptual framing ("MFMA efficiency is the share of MFMA peak
throughput the kernel actually sustains, so 98% means within 2% of
MI355's theoretical MFMA peak") but remove the concrete peak figures
per data type. The MFMA-efficiency percentages already do the
calibration work without committing the post to specific peak numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous paragraph conflated two metrics: MFMA efficiency (a
within-loop, cycle-level fraction measured from the thread trace) and
the end-to-end TFLOPS / peak ratio. They are not the same — end-to-end
also folds in epilogue stores, prologue setup, and multi-CU dispatch.

Reword to define the metric correctly and explicitly note the scope
limitation, matching docs/mfma_efficiency.md in the tutorial repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
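
To make the two-metric distinction concrete, a small illustrative
sketch (function names, counter names, and the numbers are
hypothetical, not from the tutorial repo):

```python
# MFMA efficiency is a within-loop, cycle-level fraction measured from
# the thread trace; the end-to-end TFLOPS / peak ratio also folds in
# epilogue stores, prologue setup, and multi-CU dispatch.

def mfma_efficiency(mfma_busy_cycles, loop_cycles):
    """Fraction of main-loop cycles in which MFMA instructions issue."""
    return mfma_busy_cycles / loop_cycles

def end_to_end_ratio(achieved_tflops, peak_tflops):
    """Whole-kernel ratio; always at or below the in-loop efficiency."""
    return achieved_tflops / peak_tflops

# A kernel can sustain 98% MFMA efficiency in its inner loop while
# landing lower end to end.
loop_eff = mfma_efficiency(9800, 10000)   # 0.98
```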
The companion tutorial repo (ROCm/gfx950-gluon-tutorials) finalized
the gfx950-tutorial-v0.1 Triton pin and refreshed every perf table
under it; sync the blog so its numbers, pin links, command examples,
and narrative match.

  Headline   FP16 v9   522 -> 1619 TFLOPS, 3x   becomes  520 -> 1489 TFLOPS, ~3x
  Headline   BF8       3456 TFLOPS / 99%        becomes  3257 TFLOPS / 99.72%
  Headline   MXFP4     5728 TFLOPS / 92%        becomes  5255 TFLOPS / 92.41%

Other refreshes:

  - Result table (a16w16/a8w8/a4w4 row) and chart caption updated
    to the new TFLOPS / MFMA-eff numbers.
  - Replaced gluon-gemm-performance-progression.png with the chart
    regenerated by the tutorial's offline plotter against the new
    /data/perf_data.csv (18 bars: each version with all measured
    configs, no v7 + llir+ra column).
  - Triton pin link rewritten from gfx9-gluon-tutorials-pin on
    ROCm/triton (matmul_4waves) to gfx950-tutorial-v0.1 on
    triton-lang/triton (commit on the gfx950-tutorial branch).
    Both inline and disclaimer references updated.
  - Disclaimer ROCm version bumped 6.5.0 -> 7.0 (the tutorial
    scripts and rocprofv3 paths now require ROCm >= 7.0).
  - bench.py example flag corrected: --use-rocprof did not exist
    in bench.py, so dropped it; the v9 example now sets
    TRITON_ENABLE_LLIR_SCHED=1 / TRITON_ENABLE_AMDGCN_AS=1 so the
    invocation actually exercises the optimizations the comment
    promises.
  - run_perf_table.py example: --use-rocprof -> --rocprof to match
    the actual flag.
  - Slightly tightened the Act III v6 paragraph: v6's unroll-by-2
    eliminates the per-iteration copy as designed, but the unroll
    keeps both register sets concurrently live and pushes against
    the 512-VGPR limit; v7's N-slicing is what actually delivers
    the throughput by halving the B-tile footprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rework the post around the 10 reader-perspective fixes called out in
review:

1. Open with the most arresting fact (98.75% MFMA efficiency on
   MI355) instead of a generic "measurement, not guesswork" line.
2. Reframe the MFMA-efficiency caveat from a defensive footnote
   under the table to a positive methodology note inside
   "Profiling drives the tutorial."
3. Promote the v6 spill regression to the centerpiece of Act III,
   with a callout: a 73% TFLOPS crash, 99 spilled VGPRs, and v7's
   slicing as the by-design fix. This is the tutorial's most
   teachable moment and was previously buried in continuous prose.
4. Update the chart caption to call attention to the visible v6
   dip in the bar chart.
5. Expand the MXFP4 section: name the broader W4 quantization
   trajectory (W4A8, W4A16, GPTQ, AWQ-style), name the actual
   scaled-MFMA instruction, and explain why the GR -> LW -> LR
   scale pipeline is a generalizable pattern for any quantized
   data path with on-disk-vs-MFMA layout mismatch.
6. Add a "From kernel to model" section that explicitly bridges
   GEMM techniques to AI-workload kernels (FC layers, MoE experts,
   attention matmul, KV-cache projections), names what the tutorial
   does NOT yet cover (memory-bound GEMM, FlashAttention prefill /
   decode, MXFP4 MoE -- on the public roadmap), and gives an
   evaluator-readable answer to "why MI350 for AI": the recipe is
   open and end-to-end reproducible, no black box.
7. Add a "Where to look in docs/" callout pointing at the four
   standalone references (performance_philosophy, mfma_efficiency,
   lds_throughput, memory_bandwidth_model) that previously were
   only mentioned in passing.
8. End the body on a forward statement instead of dwindling into
   disclaimers; the disclaimer block is unchanged but no longer
   carries the closing impression.
9. Move the "post is the map / repo is the tutorial" blockquote
   from the structural intro to just before "Try the tutorial",
   where it actually directs the reader.
10. Tighten the "Try the tutorial" intro to lead with what the
    reader will see (~3x in two commands) before the Triton-build
    setup.

No factual numbers changed -- this commit is rhetorical, not data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the 12 reader-perspective fixes from the second review:

  1. Expand Act IV / v9: the chiplet-GPU + 8-XCD insight gets a
     full paragraph plus the closed-form `f(GM) = GM + ceil(P/GM)`
     remapping objective, the L2-miss counters (5.3M -> 4.1M), and
     a sentence framing this as the kind of MI350-specific quirk
     evaluators need up front. Was previously a single
     parenthetical at the end of one sentence.
  2. Reorder/shorten "Why another GEMM tutorial?": drop the
     defensive "ROCm provides libraries..." opening that was
     deflating the energy of the opening hook; lead instead with a
     short pedagogy-first paragraph and go straight to the data
     table.
  3. Add a callout after the bench.py snippet warning that
     bench.py prints do_bench cache-warm numbers a few percent
     below the cold-cache rocprofv3 numbers used elsewhere in
     the post (prevents reader disappointment on running the
     command).
  4. Reframe "where the tutorial does not yet go" as "what's
     coming next" -- same content (memory-bound GEMM, FA prefill /
     decode, MXFP4 MoE), opposite emotional valence, ties to
     "the kernels that dominate LLM inference and MoE inference."
  5. "Several wrong turns" -> "a regression along the way" --
     the post narrates one wrong turn (v6), not several.
  6. Cut "What Gluon makes explicit" by ~25%: collapse three
     setup paragraphs into one denser paragraph, keep the bullet
     list and connective close.
  7. Concretize "no vendor secrets": name the actual artifacts
     (kernel source, LLIR scheduler pass, amdgcnas peephole,
     run_perf_table.py reproducer, pinned Triton commit, MIT
     license) instead of abstract claim.
  8. Drop reading-order step 5 (redundant with the new docs/
     subsection); replace with a one-line pointer.
  9. Crisper MFMA caveat: "End-to-end TFLOPS is a different
     question..." -> "End-to-end TFLOPS, which also captures
     epilogue, prologue, and multi-CU dispatch, is reported
     alongside it."
 10. Compress the Summary: cut the generic "near-peak isn't
     one trick - it's a sequence of design decisions [list]"
     windup so the closing punchline ("AMD GPUs need vendor
     secrets to hit peak...") lands without dilution.
 11. Update chart alt-text to mention the v6 dip and the bar
     structure (per-version, per-config), not just baseline-vs-
     final monotonic story.
 12. Set setup-time expectations in "Try the tutorial": "Setup
     is ~30 minutes (Triton built from source); after that, two
     commands reproduce the journey." Honest up-front so the
     build step doesn't feel like a bait-and-switch after the
     2-command promise.

No factual numbers changed; this commit is rhetorical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
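
The XCD remapping in item 1 can be sketched as follows. This is an
assumption-level model of the technique, not the tutorial's code:
it assumes round-robin dispatch (launch pid p runs on XCD p % 8) and
shows only the remap for a given group size GM; choosing GM itself by
minimizing `f(GM) = GM + ceil(P/GM)` is the separate objective the
commit names. `remap_pid` is a hypothetical helper.

```python
# XCD-aware pid remapping on a chiplet GPU with 8 XCDs and per-XCD L2
# caches: map logical tile ids so that groups of `gm` consecutive
# tiles land on the same XCD, improving L2 reuse.
NUM_XCDS = 8

def remap_pid(pid, num_pids, gm):
    assert num_pids % (NUM_XCDS * gm) == 0   # simplifying assumption
    group, slot = pid // gm, pid % gm
    xcd = group % NUM_XCDS                   # target XCD for this group
    rnd = group // NUM_XCDS                  # dispatch round
    # hardware sends launch pid p to XCD p % NUM_XCDS, so pick a launch
    # pid congruent to the target XCD
    return (rnd * gm + slot) * NUM_XCDS + xcd
```

With gm = 4 and 64 tiles, the remap is a permutation of 0..63 and the
first four logical tiles all land on XCD 0.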