[WIP] Blockscaled SM90 by KareemMusleh · Pull Request #141 · Dao-AILab/quack

KareemMusleh · 2026-05-17T22:20:07Z

I'm mainly doing this because I'm interested in mxfp8 sonicmoe. Here are some of my thoughts:

FP8 WGMMA requires its inputs to be K-major. That works fine for the forward pass, but a naive implementation of the activation gradient would require transposing the weights. I think this can be avoided by using swap_AB; we can quantize and transpose the activations in a single kernel.
I should at least attempt FP8 gather_A; it's possible that cp.async will be fast enough for loading the SFA with 32-bit alignment. Edit: after thinking about it some more, I think it might be possible to fuse the quantization of the inputs with a permute of the scales, which would let us use both gather_A for the input and TMA for the SFA.
We should probably swizzle the SFA. It may also be worth trying to store the scales as e8m0.
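To make the e8m0 idea concrete, here is a minimal sketch of how a floating-point scale could be packed into a single e8m0 byte. e8m0 has 8 exponent bits and no mantissa, so it can only represent powers of two (value `2**(byte - 127)`, with `0xFF` reserved for NaN). The function names and the choice to round the scale *up* to the next power of two are assumptions for illustration, not something taken from this PR.

```python
import math

E8M0_BIAS = 127  # e8m0 stores only a biased power-of-two exponent


def encode_e8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its e8m0 byte.

    Rounding up is an assumption of this sketch: it keeps value / scale
    from growing past the range the original scale was chosen for.
    """
    if not (scale > 0.0) or math.isinf(scale):
        raise ValueError("scale must be positive and finite")
    # scale == mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(scale)
    # An exact power of two has mantissa 0.5, so 2**(exponent - 1) == scale.
    pow2_exp = exponent - 1 if mantissa == 0.5 else exponent
    # 0xFF is reserved for NaN, so clamp the biased exponent to [0, 254].
    return max(0, min(254, pow2_exp + E8M0_BIAS))


def decode_e8m0(byte: int) -> float:
    """Return the power-of-two scale represented by an e8m0 byte."""
    return math.ldexp(1.0, byte - E8M0_BIAS)
```

The scale then costs one byte instead of four, at the price of up to a factor of two of unused dynamic range per block.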

tridao · 2026-05-18T02:30:43Z

are you doing block size 128? mxfp8 block size 32 will be very slow on hopper imo.

KareemMusleh · 2026-05-18T03:33:28Z

> are you doing block size 128? mxfp8 block size 32 will be very slow on hopper imo.

Yes, block size (1, 128) for activations and (128, 128) for weights, just like DeepGEMM.
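As a rough illustration of what those two block shapes mean for the scale tensors, the pure-Python sketch below computes one scale per (row, K-block) for activations and one scale per (N-block, K-block) for weights. The function names, the `SCALE_FLOOR` guard, and the `amax / 448` rule (448 being the largest finite FP8 e4m3 value) are assumptions for illustration; the kernels in this PR may compute scales differently.

```python
FP8_E4M3_MAX = 448.0  # largest finite e4m3 value
SCALE_FLOOR = 1e-4    # hypothetical guard so an all-zero block never yields scale 0


def _scale(amax: float) -> float:
    return max(amax, SCALE_FLOOR) / FP8_E4M3_MAX


def activation_scales(x, block=128):
    """(1, block) scaling: x is [M][K], result is [M][K // block]."""
    k = len(x[0])
    assert k % block == 0
    return [
        [_scale(max(abs(v) for v in row[k0:k0 + block])) for k0 in range(0, k, block)]
        for row in x
    ]


def weight_scales(w, block=128):
    """(block, block) scaling: w is [N][K] (K-major), result is [N // block][K // block]."""
    n, k = len(w), len(w[0])
    assert n % block == 0 and k % block == 0
    scales = []
    for n0 in range(0, n, block):
        row_scales = []
        for k0 in range(0, k, block):
            amax = max(
                abs(w[i][j])
                for i in range(n0, n0 + block)
                for j in range(k0, k0 + block)
            )
            row_scales.append(_scale(amax))
        scales.append(row_scales)
    return scales
```

With `block=128`, an `[M][K]` activation gets `M * K / 128` scales while an `[N][K]` weight gets only `(N / 128) * (K / 128)`, which is why the weight scales are so much cheaper to load.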

tridao · 2026-05-18T11:44:40Z

+            scale_a_0 = sSFA[m0, 0, stage]
+            scale_a_1 = sSFA[m1, 0, stage]
+
+            scale_b = mSFB_nk[n_tile_coord, k_tile]


there's only 1 scale for B in the whole k_tile?

KareemMusleh · 2026-05-19

For now the kernel assumes tile_n == tile_k == 128. I'll first add support for tile_n == 192 so that we get better autotuning.

I don't think tile_k != 128 is going to improve the perf, but if needed I'll implement it.

tridao · 2026-05-19

no, i mean for SFA you have 1 scale per row of A. But for SFB you have 1 scale per "tile_n" columns of B?

KareemMusleh · 2026-05-20

Yes, because right now I've hardcoded tile_n to be 128. I'll remove this constraint later.

Sorry, this whole thing is messy. I'll try to clean it up.
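The scale layout discussed in this thread can be written down as a small reference computation. The sketch below is a plain-Python model, not the kernel from this PR: for each K-block it forms the unscaled partial dot product and then applies one A scale per row and one B scale per block of `block` columns. The function name and argument layout are assumptions for illustration.

```python
def blockscaled_gemm_ref(qa, sfa, qb, sfb, block=128):
    """Reference for out = dequant(A) @ dequant(B)^T with block scales.

    qa:  [M][K] quantized activations (K-major)
    sfa: [M][K // block], one scale per row of A per K-block
    qb:  [N][K] quantized weights (K-major)
    sfb: [ceil(N / block)][K // block], one scale per (N-block, K-block)
    """
    m_dim, n_dim, k_dim = len(qa), len(qb), len(qa[0])
    assert k_dim % block == 0
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for kb in range(k_dim // block):
        k0 = kb * block
        for m in range(m_dim):
            for n in range(n_dim):
                # Unscaled partial sum over one K-block (the WGMMA result in the kernel).
                partial = sum(qa[m][k] * qb[n][k] for k in range(k0, k0 + block))
                # Column n picks the B scale of the N-block it falls in.
                out[m][n] += sfa[m][kb] * sfb[n // block][kb] * partial
    return out
```

When tile_n equals the block size, every column of a tile maps to the same `sfb[n // block][kb]`, so a single B scale per k_tile suffices. A tile_n of 192 would straddle two N-blocks, so the scale would have to be selected per column.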

KareemMusleh · 2026-05-20T08:12:04Z

Leaving this here as it might help with understanding the code

Link to PTX Docs


KareemMusleh · 2026-05-23T22:52:43Z

@tridao this seems to be getting close to done. I'm getting similar perf to DG (around 98-100% of DG).

I'll post proper benchmarks tomorrow. If the perf is good enough, I'll move on to implementing the actual fwd + bwd in sonicmoe, though I still have some ideas I want to try out, such as adding swizzle to SFA and overlapping the scale FMA with WGMMA.

KareemMusleh · 2026-05-24T17:32:57Z

Testing using the latest DG with CUDA 13.0. The comparison with m_varlen is unfair right now because DG asserts tile_m == 128 rather than tile_m == 256.

I should add support for tile_n == 192 (and padding/masking) to have a fair comparison.

tests pass

6554a44

KareemMusleh had a problem deploying to gpu-ci May 17, 2026 22:20 — with GitHub Actions Error

KareemMusleh marked this pull request as draft May 17, 2026 22:20

tridao reviewed May 18, 2026

Comment thread quack/gemm_sm90.py Outdated

A bunch of bug fixed. ~1100 TFLOPS. works with multiple math wg

eb9edfb

KareemMusleh had a problem deploying to gpu-ci May 19, 2026 08:17 — with GitHub Actions Error

add support for tile_m == 256. 1200 TFLOPS. Reuse acc_slow to minimize register usage

7847052

KareemMusleh had a problem deploying to gpu-ci May 23, 2026 12:17 — with GitHub Actions Error

lint

4b1141c

KareemMusleh had a problem deploying to gpu-ci May 23, 2026 12:19 — with GitHub Actions Error

more speed ups

2d5cd61

KareemMusleh had a problem deploying to gpu-ci May 23, 2026 13:04 — with GitHub Actions Error

unroll = 8, added multicast for SFA, SFB is not is smem. Perf is comparable to DG 1d2d

16d8c61

KareemMusleh had a problem deploying to gpu-ci May 23, 2026 16:22 — with GitHub Actions Error

refactoring

899c898

KareemMusleh had a problem deploying to gpu-ci May 23, 2026 22:30 — with GitHub Actions Error

KareemMusleh marked this pull request as ready for review May 23, 2026 22:55

benchmark against DG

3a495d1

KareemMusleh requested a deployment to gpu-ci May 24, 2026 17:54 — with GitHub Actions Waiting

KareemMusleh wants to merge 8 commits into Dao-AILab:main from KareemMusleh:sm90-blockscaled-support
