Rmsnorm bwd deterministic fused reduce by santoshmo · Pull Request #109 · Dao-AILab/quack

santoshmo · 2026-04-19T01:42:26Z

Eliminates the separate .sum(dim=0) kernel for dw_partial reduction by
fusing a deterministic last-CTA-reduces pattern into the backward kernel.

Each CTA writes its partial to dw_partial[bidx, :] as before, then does
a threadfence + atomic increment of a global counter. The last CTA to
arrive loads all partials in fixed order 0..sm_count-1 and accumulates
into dw_final, ensuring deterministic results across runs.
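The pattern described above can be sketched on the host as a small simulation. This is not the kernel code from the PR: it is a minimal illustration in plain Python, where each thread stands in for one CTA, a lock-protected counter models the threadfence + atomic increment, and the names `fused_last_cta_reduce`, `counter`, and `reducer` are hypothetical. It shows why the result does not depend on which CTA arrives last: whoever reduces always walks rows 0..num_ctas-1 in the same order.

```python
import random
import threading


def fused_last_cta_reduce(partials_in, seed):
    """Simulate the last-CTA-reduces pattern with one thread per CTA.

    Hypothetical host-side model; returns (dw_final, index of reducing CTA).
    """
    num_ctas = len(partials_in)           # stands in for sm_count
    n = len(partials_in[0])
    dw_partial = [[0.0] * n for _ in range(num_ctas)]
    dw_final = [0.0] * n
    counter = [0]                         # stands in for the global counter
    lock = threading.Lock()               # models the atomic increment
    reducer = []                          # records which CTA did the reduce

    def cta(bidx):
        # 1. Write this CTA's partial to its own row, as before.
        dw_partial[bidx][:] = partials_in[bidx]
        # 2. Model of "threadfence + atomic increment": the row write above
        #    happens before the ticket is taken under the lock.
        with lock:
            ticket = counter[0]
            counter[0] += 1
        # 3. Only the last arrival reduces, always in row order 0..num_ctas-1.
        if ticket == num_ctas - 1:
            reducer.append(bidx)
            for col in range(n):
                acc = 0.0
                for row in range(num_ctas):
                    acc += dw_partial[row][col]
                dw_final[col] = acc

    order = list(range(num_ctas))
    random.Random(seed).shuffle(order)    # vary the launch order per run
    threads = [threading.Thread(target=cta, args=(b,)) for b in order]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dw_final, reducer[0]
```

Running this with different seeds changes which simulated CTA performs the reduction, but the accumulated values stay bitwise identical because the summation order is fixed. An atomic-add accumulation, by contrast, would sum in arrival order and could round differently from run to run.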

Only enabled for N <= 8192 (cluster_n == 1). For larger N, falls back
to the existing host-side .sum(dim=0) reduction.
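The dispatch rule above might look roughly like the following sketch. The threshold 8192 and the `cluster_n == 1` condition come from the description; everything else (`FUSED_MAX_N`, `column_sum`, `finalize_dw`, and the use of nested lists in place of tensors) is an assumption made for illustration and not the PR's actual interface.

```python
FUSED_MAX_N = 8192  # threshold quoted in the PR description


def column_sum(dw_partial):
    """Pure-Python stand-in for the host-side dw_partial.sum(dim=0)."""
    n = len(dw_partial[0])
    out = [0.0] * n
    for row in dw_partial:
        for col in range(n):
            out[col] += row[col]
    return out


def finalize_dw(dw_partial, dw_final, n, cluster_n=1):
    """Pick the reduction path (hypothetical helper).

    Returns (dw, path), where path names the branch taken.
    """
    if n <= FUSED_MAX_N and cluster_n == 1:
        # Fused path: the last CTA already wrote the reduced result.
        return dw_final, "fused"
    # Fallback: reduce the per-CTA partials in a separate step.
    return column_sum(dw_partial), "host_sum"
```

Keeping the fallback means larger N behaves exactly as before the change, while the fused path saves one kernel launch in the small-N case.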

Based on the approach discussed in #101.

Co-authored-by: Aaron Wang <aaronwang04@users.noreply.github.com>


santoshmo and others added 5 commits April 7, 2026 12:34

add _get_mma_inst_tile_k() virtual method to GemmSm100

e497d04

Update gemm_sm100.py

8608556

Update quack/gemm_sm100.py

e884250

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

tile_shape_mnk is exposed as API. determine mma_inst_tile_k based on …

bf60a14

…its shape. update doc strings.

santoshmo had a problem deploying to gpu-ci April 19, 2026 01:42 — with GitHub Actions Error

santoshmo closed this Apr 19, 2026

santoshmo wants to merge 5 commits into Dao-AILab:main from santoshmo:rmsnorm-bwd-deterministic-fused-reduce
