[WIP] Blockscaled SM90 #141
are you doing block size 128? mxfp8 block size 32 will be very slow on hopper imo.
Yes, block size (1, 128) for activations and (128, 128) for weights, just like DeepGEMM.
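To make the two block shapes concrete, here is a minimal NumPy sketch of how the scales could be computed. It is not the PR's kernel code: the function names, the `(N, K)` weight layout, and the use of 448 (the largest finite `float8_e4m3fn` value) as the target range are assumptions for illustration, and the values are only scaled, not actually rounded to FP8.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed target range (largest finite float8_e4m3fn value)
BLOCK = 128           # block size discussed in the thread


def quant_activations(a):
    """(1, 128) blocks: one scale per row of A per 128 elements along K."""
    m, k = a.shape
    assert k % BLOCK == 0
    blocks = a.reshape(m, k // BLOCK, BLOCK)
    amax = np.abs(blocks).max(axis=-1)               # shape (M, K/128)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX   # guard against all-zero blocks
    q = blocks / scale[..., None]                    # scaled only, no FP8 rounding here
    return q.reshape(m, k), scale


def quant_weights(b):
    """(128, 128) blocks: one scale per 128x128 tile of B, with B stored as (N, K)."""
    n, k = b.shape
    assert n % BLOCK == 0 and k % BLOCK == 0
    blocks = b.reshape(n // BLOCK, BLOCK, k // BLOCK, BLOCK)
    amax = np.abs(blocks).max(axis=(1, 3))           # shape (N/128, K/128)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = blocks / scale[:, None, :, None]
    return q.reshape(n, k), scale
```

With these shapes, a 4 x 256 activation gets a 4 x 2 scale tensor, while a 256 x 256 weight gets only a 2 x 2 one.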
scale_a_0 = sSFA[m0, 0, stage]
scale_a_1 = sSFA[m1, 0, stage]

scale_b = mSFB_nk[n_tile_coord, k_tile]
there's only 1 scale for B in the whole k_tile?
For now the kernel assumes tile_n == tile_k == 128. I'll first be adding support for tile_n == 192 so that we can have better auto tuning.
I think that tile_k != 128 is not gonna improve the perf. But if needed I'll implement it
no, i mean for SFA you have 1 scale per row of A. But for SFB you have 1 scale per "tile_n" columns of B?
yes because right now I hardcoded tile_n to be 128. I'll remove this constraint later.
Sorry this whole thing is messy I'll try to clean it up
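For readers following the SFA/SFB question, this is a rough NumPy reference of the scaling pattern being discussed, assuming tile_n == tile_k == 128 as stated above. The function name and the tensor layouts (`b_q` as `(N, K)`, `sfa` as `(M, K/128)`, `sfb` as `(N/128, K/128)`) are hypothetical and chosen for illustration; the actual kernel applies the scales to WGMMA accumulators rather than looping like this.

```python
import numpy as np

TILE = 128  # assumes tile_n == tile_k == 128, as in the thread


def blockscaled_gemm_ref(a_q, sfa, b_q, sfb):
    """Reference blockscaled GEMM.

    a_q: (M, K) scaled activations, sfa: (M, K/128) per-row, per-k-tile scales.
    b_q: (N, K) scaled weights,     sfb: (N/128, K/128) per-tile scales.
    Returns the (M, N) result accumulated in float32.
    """
    m, k = a_q.shape
    n = b_q.shape[0]
    out = np.zeros((m, n), dtype=np.float32)
    for n_tile in range(n // TILE):
        ns = slice(n_tile * TILE, (n_tile + 1) * TILE)
        for k_tile in range(k // TILE):
            ks = slice(k_tile * TILE, (k_tile + 1) * TILE)
            partial = a_q[:, ks] @ b_q[ns, ks].T   # unscaled product for this k tile
            scale_a = sfa[:, k_tile][:, None]      # one scale per row of A
            scale_b = sfb[n_tile, k_tile]          # a single scalar for the whole B tile
            out[:, ns] += partial * (scale_a * scale_b)
    return out
```

Because the B block is 128 x 128 and matches the tile, `scale_b` is a scalar per (n_tile, k_tile), whereas `scale_a` differs per output row. A tile_n that is not a multiple of 128 (such as 192) would need more than one B scale per tile.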
Leaving this here as it might help with understanding the code
…arable to DG 1d2d
@tridao this seems to be getting closer to being done. I'm getting similar perf to DG (around 98-100% of DG). I'll post proper benchmarks tomorrow. If the perf is good enough I'll move on to implementing the actual fwd + bwd in sonicmoe, though I still have some ideas I want to try out, like adding swizzle to SFA and overlapping the scale FMA with WGMMA.


I'm mainly doing this because I'm interested in mxfp8 sonicmoe. Here are some of my thoughts:
swap_AB. We can quant + transpose the activations in a single kernel