
Add Gluon GEMM tutorial blog#345

Draft
zhanglx13 wants to merge 10 commits into release from
add-gluon-gemm-tutorial-blog-squashed

Conversation


@zhanglx13 zhanglx13 commented Apr 29, 2026

Depends on ROCm/gfx950-gluon-tutorials going public first

Objective of the new blog:
Please describe the intent of the blog.

Signoff section must be completed prior to publishing.

  • Technical reviewer approves publishing: (edit and replace with @githubid)
  • Editorial reviewer approves publishing: (edit and replace with @githubid)
  • Add a thumbnail image for your blog if one is available
  • Text nugget summarizing your article. 2-3 lines to draw the reader's attention. Possibly the opening paragraph can be used.
  • Blog author team signoffs

@zhanglx13 zhanglx13 requested a review from saadrahim as a code owner April 29, 2026 08:02
@zhanglx13 zhanglx13 marked this pull request as draft April 29, 2026 17:01
zhanglx13 and others added 9 commits May 2, 2026 00:04
Three tightly-scoped changes for the Gluon GEMM tutorial blog:

1. Update repo references from gfx9-gluon-tutorials to
   gfx950-gluon-tutorials (4 occurrences in the blog README) — the
   tutorial repo was renamed to scope the name to the gfx950 ISA
   variant the kernels actually target.

2. Add an author profile under blogs/authors/lixun-zhang.md and a
   matching contributor card in blogs/contributor-bios.md (alphabetical
   slot between Justin Chang and Mahdieh Ghazimirsaeed). Author photo
   reference points at blogs/authors/data/Lixun-Zhang.jpg, to be
   uploaded separately.

3. Fix two metadata values in primus-projection/README.md that the
   "Blogs / Tags & Category" linter rejects:
     amd_blog_hardware_platforms: 'AMD Instinct(TM) GPUs'
        -> 'Instinct GPUs'
     amd_blog_development_tools: includes 'AMD ROCm(TM) Software'
        -> 'ROCm Software'
   Body-text mentions of "AMD Instinct(TM)" are unaffected (the linter
   only checks the metadata block). The fix is out of scope for this
   blog, but the failing check blocks every PR until it is resolved;
   it is included here so the Tags & Category check turns green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
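
For concreteness, the metadata fix in item 3 amounts to a front-matter
change along these lines (field names and values are taken from the
commit message above; the surrounding block structure and any other
list entries are assumed, not copied from the actual README):

```yaml
amd_blog_hardware_platforms: 'Instinct GPUs'   # was 'AMD Instinct(TM) GPUs'
amd_blog_development_tools:                    # list; the rejected entry becomes:
  - 'ROCm Software'                            # was 'AMD ROCm(TM) Software'
```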
Restructure based on editorial review:

Opening
- Lead with the 522 -> 1619 TFLOPS / 3x speedup arc and the BF8/MXFP4
  numbers, instead of opening with a topic list.
- Make audience explicit in the second paragraph (kernel devs, compiler
  engineers, performance specialists).
- Borrow the repo's sharper Gluon-vs-Triton positioning ("Triton's
  strength is hardware-portable productivity; Gluon is the tool when you
  need to extract every last percent on a target architecture").
- Move the "blog is the map, repo is the tutorial" line to a callout
  after the perf table so it lands harder once the reader has scope.

Calibration
- Add an explicit explanation of what MFMA efficiency means, and the
  approximate MI355 MFMA peaks (~1650 TFLOPS FP16, ~3500 BF8, ~6200
  MXFP4) so readers can calibrate "near-peak".
- Update the chart caption to call out 522 -> 1619 explicitly.
- Name the architecture precisely: MI350/MI355 (gfx950, CDNA4).

Optimization narrative
- Adopt the repo's Acts I-IV framing for v0-v9 instead of the
  "early/middle/later" paraphrase.
- Sharpen verbs: v3 "eliminates LDS bank conflicts" (was "studies"),
  v6 "removes register-copy overhead", v7 "first reaches 98% MFMA
  efficiency", v9 unpacks XCD-aware L2 locality with a brief
  parenthetical (8 XCDs, separate L2s, etc.).
- Add a thread-trace screenshot (gluon-gemm-att-near-peak.png, copied
  from the tutorial repo's v7_amdgcnas_bottleneck.png) to reinforce
  the "profiling is the through-line" thesis with visual evidence.

MXFP4 section
- Lead with the 5728 TFLOPS / 92% headline; explain the scale pipeline
  after the result.
- Expand "GR -> LW -> LR" to "Global Read of scales into registers, LDS
  Write to convert their layout, then LDS Read to feed the scaled MFMA
  instruction" on first use, in a callout.
- Mention ds_read_tr explicitly as the hardware-assisted layout
  conversion.

Reproducibility
- Replace vague "ROCm Triton branch and pinned tag documented in the
  repository" with an explicit link to the gfx9-gluon-tutorials-pin tag
  in ROCm/triton.
- Show "bench.py --version 0" and "--version 9" back-to-back so readers
  can see the 3x for themselves.

Disclaimers
- Add a Performance-varies disclaimer covering hardware/software
  variability, alongside the existing third-party-content disclaimer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
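
The expanded "GR -> LW -> LR" wording above can be modeled with a toy
sketch. This is illustrative only: plain Python lists stand in for
registers and LDS, and a transpose stands in for the real layout
shuffle; `stage_scales` is a hypothetical name, not tutorial code.

```python
# Toy model of the GR -> LW -> LR scale pipeline: Global Read of scales
# into registers, an LDS Write that converts their layout, then an LDS
# Read in the layout the scaled-MFMA instruction consumes.

def stage_scales(scales_global):
    # GR: global read into "registers" (row-major list of rows)
    regs = [row[:] for row in scales_global]
    # LW: the LDS write converts the layout; a transpose stands in for
    # the real shuffle (ds_read_tr is the hardware-assisted conversion)
    lds = [list(col) for col in zip(*regs)]
    # LR: read back in the MFMA-friendly layout
    return lds
```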
- Drop "production-grade" from the lede; keep just "1619 TFLOPS
  near-peak kernel". The repo positions these as tutorial artifacts,
  not deployable libraries.

- Define amdgcnas inline on first body-text mention: "combined with
  amdgcnas (a post-assembly peephole pass over the generated AMDGCN
  assembly)". This is the same treatment the LLIR scheduler already
  gets in Act II.

- Reword v6's register-set alternation as "double-buffering the operand
  registers, so consecutive K iterations swap which register set the
  MFMA consumes". Plainer than "alternating two register sets".

- Move the slicing figure from after Act IV to right after Act III, and
  reword its caption to call out v7/v8 explicitly so the visual sits
  next to the text it illustrates.

- Switch both Triton-pin links from /tree/<tag> to /releases/tag/<tag>,
  the canonical URL form for an annotated tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
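
The v6 reword above describes double-buffering the operand registers.
A minimal sketch of that control flow, in plain Python with
hypothetical `load`/`mfma` callables standing in for the global loads
and the MFMA op (this is not the tutorial's kernel):

```python
# v6-style operand double-buffering: two register sets; each K
# iteration consumes one set while the next iteration's operands are
# prefetched into the other, so consecutive K iterations swap which
# set the MFMA consumes.

def run_k_loop(k_iters, load, mfma):
    regs = [None, None]          # two operand register sets
    regs[0] = load(0)            # prologue: fill the first set
    acc = 0
    for k in range(k_iters):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < k_iters:
            regs[nxt] = load(k + 1)   # prefetch into the idle set
        acc = mfma(acc, regs[cur])    # MFMA consumes the current set
    return acc
```

The cost a later commit in this PR calls out is visible here: both
register sets stay live across the loop, which is what pushes v6
against the 512-VGPR limit.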
Keep the conceptual framing ("MFMA efficiency is the share of MFMA peak
throughput the kernel actually sustains, so 98% means within 2% of
MI355's theoretical MFMA peak") but remove the concrete peak figures
per data type. The MFMA-efficiency percentages already do the
calibration work without committing the post to specific peak numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous paragraph conflated two metrics: MFMA efficiency (a
within-loop, cycle-level fraction measured from the thread trace) and
the end-to-end TFLOPS / peak ratio. They are not the same — end-to-end
also folds in epilogue stores, prologue setup, and multi-CU dispatch.

Reword to define the metric correctly and explicitly note the scope
limitation, matching docs/mfma_efficiency.md in the tutorial repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
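
To make the two-metric distinction concrete, a small illustrative
sketch (function names, counter names, and the numbers are
hypothetical, not from the tutorial repo):

```python
# MFMA efficiency is a within-loop, cycle-level fraction measured from
# the thread trace; the end-to-end TFLOPS / peak ratio also folds in
# epilogue stores, prologue setup, and multi-CU dispatch.

def mfma_efficiency(mfma_busy_cycles, loop_cycles):
    """Fraction of main-loop cycles in which MFMA instructions issue."""
    return mfma_busy_cycles / loop_cycles

def end_to_end_ratio(achieved_tflops, peak_tflops):
    """Whole-kernel ratio; always at or below the in-loop efficiency."""
    return achieved_tflops / peak_tflops

# A kernel can sustain 98% MFMA efficiency in its inner loop while
# landing lower end to end.
loop_eff = mfma_efficiency(9800, 10000)   # 0.98
```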
The companion tutorial repo (ROCm/gfx950-gluon-tutorials) finalized
the gfx950-tutorial-v0.1 Triton pin and refreshed every perf table
under it; sync the blog so its numbers, pin links, command examples,
and narrative match.

  Headline   FP16 v9   522 -> 1619 TFLOPS, 3x   becomes  520 -> 1489 TFLOPS, ~3x
  Headline   BF8       3456 TFLOPS / 99%        becomes  3257 TFLOPS / 99.72%
  Headline   MXFP4     5728 TFLOPS / 92%        becomes  5255 TFLOPS / 92.41%

Other refreshes:

  - Result table (a16w16/a8w8/a4w4 row) and chart caption updated
    to the new TFLOPS / MFMA-eff numbers.
  - Replaced gluon-gemm-performance-progression.png with the chart
    regenerated by the tutorial's offline plotter against the new
    /data/perf_data.csv (18 bars: each version with all measured
    configs, no v7 + llir+ra column).
  - Triton pin link rewritten from gfx9-gluon-tutorials-pin on
    ROCm/triton (matmul_4waves) to gfx950-tutorial-v0.1 on
    triton-lang/triton (commit on the gfx950-tutorial branch).
    Both inline and disclaimer references updated.
  - Disclaimer ROCm version bumped 6.5.0 -> 7.0 (the tutorial
    scripts and rocprofv3 paths now require ROCm >= 7.0).
  - bench.py example flag corrected: --use-rocprof did not exist
    in bench.py, so dropped it; the v9 example now sets
    TRITON_ENABLE_LLIR_SCHED=1 / TRITON_ENABLE_AMDGCN_AS=1 so the
    invocation actually exercises the optimizations the comment
    promises.
  - run_perf_table.py example: --use-rocprof -> --rocprof to match
    the actual flag.
  - Slightly tightened the Act III v6 paragraph: v6's unroll-by-2
    eliminates the per-iteration copy as designed, but the unroll
    keeps both register sets concurrently live and pushes against
    the 512-VGPR limit; v7's N-slicing is what actually delivers
    the throughput by halving the B-tile footprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rework the post around the 10 reader-perspective fixes called out in
review:

1. Open with the most arresting fact (98.75% MFMA efficiency on
   MI355) instead of a generic "measurement, not guesswork" line.
2. Reframe the MFMA-efficiency caveat from a defensive footnote
   under the table to a positive methodology note inside
   "Profiling drives the tutorial."
3. Promote the v6 spill regression to the centerpiece of Act III,
   with a callout: a 73% TFLOPS crash, 99 spilled VGPRs, and v7's
   slicing as the by-design fix. This is the tutorial's most
   teachable moment and was previously buried in continuous prose.
4. Update the chart caption to call attention to the visible v6
   dip in the bar chart.
5. Expand the MXFP4 section: name the broader W4 quantization
   trajectory (W4A8, W4A16, GPTQ, AWQ-style), name the actual
   scaled-MFMA instruction, and explain why the GR -> LW -> LR
   scale pipeline is a generalizable pattern for any quantized
   data path with on-disk-vs-MFMA layout mismatch.
6. Add a "From kernel to model" section that explicitly bridges
   GEMM techniques to AI-workload kernels (FC layers, MoE experts,
   attention matmul, KV-cache projections), names what the tutorial
   does NOT yet cover (memory-bound GEMM, FlashAttention prefill /
   decode, MXFP4 MoE -- on the public roadmap), and gives an
   evaluator-readable answer to "why MI350 for AI": the recipe is
   open and end-to-end reproducible, no black box.
7. Add a "Where to look in docs/" callout pointing at the four
   standalone references (performance_philosophy, mfma_efficiency,
   lds_throughput, memory_bandwidth_model) that previously were
   only mentioned in passing.
8. End the body on a forward statement instead of dwindling into
   disclaimers; the disclaimer block is unchanged but no longer
   carries the closing impression.
9. Move the "post is the map / repo is the tutorial" blockquote
   from the structural intro to just before "Try the tutorial",
   where it actually directs the reader.
10. Tighten the "Try the tutorial" intro to lead with what the
    reader will see (~3x in two commands) before the Triton-build
    setup.

No factual numbers changed -- this commit is rhetorical, not data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the 12 reader-perspective fixes from the second review:

  1. Expand Act IV / v9: the chiplet-GPU + 8-XCD insight gets a
     full paragraph plus the closed-form `f(GM) = GM + ceil(P/GM)`
     remapping objective, the L2-miss counters (5.3M -> 4.1M), and
     a sentence framing this as the kind of MI350-specific quirk
     evaluators need up front. Was previously a single
     parenthetical at the end of one sentence.
  2. Reorder/shorten "Why another GEMM tutorial?": drop the
     defensive "ROCm provides libraries..." opening that was
     deflating the energy of the opening hook; lead instead with a
     short pedagogy-first paragraph and go straight to the data
     table.
  3. Add a callout after the bench.py snippet warning that
     bench.py prints do_bench cache-warm numbers a few percent
     below the cold-cache rocprofv3 numbers used elsewhere in
     the post (prevents reader disappointment on running the
     command).
  4. Reframe "where the tutorial does not yet go" as "what's
     coming next" -- same content (memory-bound GEMM, FA prefill /
     decode, MXFP4 MoE), opposite emotional valence, ties to
     "the kernels that dominate LLM inference and MoE inference."
  5. "Several wrong turns" -> "a regression along the way" --
     the post narrates one wrong turn (v6), not several.
  6. Cut "What Gluon makes explicit" by ~25%: collapse three
     setup paragraphs into one denser paragraph, keep the bullet
     list and connective close.
  7. Concretize "no vendor secrets": name the actual artifacts
     (kernel source, LLIR scheduler pass, amdgcnas peephole,
     run_perf_table.py reproducer, pinned Triton commit, MIT
     license) instead of abstract claim.
  8. Drop reading-order step 5 (redundant with the new docs/
     subsection); replace with a one-line pointer.
  9. Crisper MFMA caveat: "End-to-end TFLOPS is a different
     question..." -> "End-to-end TFLOPS, which also captures
     epilogue, prologue, and multi-CU dispatch, is reported
     alongside it."
 10. Compress the Summary: cut the generic "near-peak isn't
     one trick - it's a sequence of design decisions [list]"
     windup so the closing punchline ("AMD GPUs need vendor
     secrets to hit peak...") lands without dilution.
 11. Update chart alt-text to mention the v6 dip and the bar
     structure (per-version, per-config), not just baseline-vs-
     final monotonic story.
 12. Set setup-time expectations in "Try the tutorial": "Setup
     is ~30 minutes (Triton built from source); after that, two
     commands reproduce the journey." Honest up-front so the
     build step doesn't feel like a bait-and-switch after the
     2-command promise.

No factual numbers changed; this commit is rhetorical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
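
The XCD remapping in item 1 can be sketched as follows. This is an
assumption-level model of the technique, not the tutorial's code:
it assumes round-robin dispatch (launch pid p runs on XCD p % 8) and
shows only the remap for a given group size GM; choosing GM itself by
minimizing `f(GM) = GM + ceil(P/GM)` is the separate objective the
commit names. `remap_pid` is a hypothetical helper.

```python
# XCD-aware pid remapping on a chiplet GPU with 8 XCDs and per-XCD L2
# caches: map logical tile ids so that groups of `gm` consecutive
# tiles land on the same XCD, improving L2 reuse.
NUM_XCDS = 8

def remap_pid(pid, num_pids, gm):
    assert num_pids % (NUM_XCDS * gm) == 0   # simplifying assumption
    group, slot = pid // gm, pid % gm
    xcd = group % NUM_XCDS                   # target XCD for this group
    rnd = group // NUM_XCDS                  # dispatch round
    # hardware sends launch pid p to XCD p % NUM_XCDS, so pick a launch
    # pid congruent to the target XCD
    return (rnd * gm + slot) * NUM_XCDS + xcd
```

With gm = 4 and 64 tiles, the remap is a permutation of 0..63 and the
first four logical tiles all land on XCD 0.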