Skip to content

[WIP] Blockscaled SM90#141

Open
KareemMusleh wants to merge 8 commits into
Dao-AILab:mainfrom
KareemMusleh:sm90-blockscaled-support
Open

[WIP] Blockscaled SM90#141
KareemMusleh wants to merge 8 commits into
Dao-AILab:mainfrom
KareemMusleh:sm90-blockscaled-support

Conversation

@KareemMusleh
Copy link
Copy Markdown
Contributor

@KareemMusleh KareemMusleh commented May 17, 2026

I'm mainly doing this because I'm interested in mxfp8 sonicmoe. Here are some of my thoughts:

  1. FP8 WGMMA requires the input to be k-major, which works fine for the forward but a naive implementation to compute the activation gradient requires us to transpose the weights. I think this can be avoided by using swap_AB. We can quant + transpose the activations in a single kernel
  2. I should at least attempt doing fp8 gather_A it's possible that cp.async will be fast enough for loading the sfa with 32bit alignment. Edit: after thinking about it some more I think it might be possible to fuse the quant of the inputs with a permute of the scales. Which will allow us to both use gather_A for the input and use TMA for the SFA
  3. Should probably swizzle the sfa. Maybe should also try storing the scales as e8m0

@tridao
Copy link
Copy Markdown
Member

tridao commented May 18, 2026

are you doing block size 128? mxfp8 block size 32 will be very slow on hopper imo.

@KareemMusleh
Copy link
Copy Markdown
Contributor Author

KareemMusleh commented May 18, 2026

are you doing block size 128? mxfp8 block size 32 will be very slow on hopper imo.

Yes block size (1, 128) for activations and (128, 128) for weights. Just like DeepGEMM

Comment thread quack/gemm_sm90.py Outdated
Comment thread quack/gemm_sm90.py Outdated
Comment thread quack/gemm_sm90.py Outdated
scale_a_0 = sSFA[m0, 0, stage]
scale_a_1 = sSFA[m1, 0, stage]

scale_b = mSFB_nk[n_tile_coord, k_tile]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's only 1 scale for B in the whole k_tile?

Copy link
Copy Markdown
Contributor Author

@KareemMusleh KareemMusleh May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now the kernel assumes tile_n == tile_k == 128. I'll first be adding support for tile_n == 192 so that we can have better auto tuning.

I think that tile_k != 128 is not gonna improve the perf. But if needed I'll implement it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, i mean for SFA you have 1 scale per row of A. But for SFB you have 1 scale per "tile_n" columns of B?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes because right now I hardcoded tile_n to be 128. I'll remove this constraint later.

Sorry this whole thing is messy I'll try to clean it up

@KareemMusleh
Copy link
Copy Markdown
Contributor Author

Leaving this here as it might help with understanding the code

Link to PTX Docs
image

@KareemMusleh
Copy link
Copy Markdown
Contributor Author

@tridao this seems to be getting closer to being done. I'm getting similar perf to DG (around 98-100% of DG).

Will be posting proper benchmarks tomorrow. If the perf is good enough I'll be moving on to implementing the actual fwd + bwd in sonicmoe. Though I still have some ideas that I want to try out. Like adding swizzle to SFA and overlapping the scale FMA with WGMMA

@KareemMusleh KareemMusleh marked this pull request as ready for review May 23, 2026 22:55
@KareemMusleh
Copy link
Copy Markdown
Contributor Author

KareemMusleh commented May 24, 2026

image

testing using latest DG with cuda 13.0. The comparison with m_varlen is unfair rn because DG asserts tile_m == 128 rather than tile_m == 256

Should add support for tile_n == 192 (also padding/masking) to have a fair comparison

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants