Deterministic fused cross-CTA dW reduction in RMSNorm backward by santoshmo · Pull Request #110 · Dao-AILab/quack

santoshmo · 2026-04-19T02:05:51Z

Eliminates the separate .sum(dim=0) kernel for dw_partial reduction by fusing a deterministic last-CTA-reduces pattern into the backward kernel.

Each CTA writes its partial to dw_partial[bidx, :] as before, then issues a threadfence and atomically increments a global counter. The last CTA to arrive loads all partials in the fixed order 0..sm_count-1 and accumulates them into dw_final, so the result is deterministic across runs.
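To illustrate the last-CTA-reduces pattern described above, here is a minimal host-side simulation in plain Python. It is not the kernel code from this PR. Threads stand in for CTAs, a lock stands in for the atomic increment (and the ordering the threadfence provides), and the names `fused_dw_reduction`, `arrival_order_sum`, and `arrival_order` are hypothetical. `arrival_order_sum` exists only for contrast, showing why summing in arrival order would not be reproducible; it is not something the PR describes.

```python
import threading


def fused_dw_reduction(partials, arrival_order):
    """Simulate last-CTA-reduces: one thread per CTA, fixed-order final sum."""
    num_ctas = len(partials)
    n = len(partials[0])
    dw_partial = [None] * num_ctas   # stands in for the global dw_partial buffer
    dw_final = [0.0] * n             # stands in for the dw_final output
    counter = [0]                    # stands in for the global arrival counter
    lock = threading.Lock()          # models the atomic increment + fence

    def cta(bidx):
        # 1. Write this CTA's partial row.
        dw_partial[bidx] = list(partials[bidx])
        # 2. "Atomically" take a ticket; the write above happens before this.
        with lock:
            ticket = counter[0]
            counter[0] += 1
        # 3. Only the last CTA to arrive performs the reduction,
        #    always walking rows 0..num_ctas-1 in the same order.
        if ticket == num_ctas - 1:
            for row in range(num_ctas):
                for col in range(n):
                    dw_final[col] += dw_partial[row][col]

    # The start order is only a hint; actual arrival is up to the scheduler.
    threads = [threading.Thread(target=cta, args=(b,)) for b in arrival_order]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dw_final


def arrival_order_sum(partials, arrival_order):
    """Contrast case (assumption, not from the PR): sum rows as they arrive."""
    total = [0.0] * len(partials[0])
    for bidx in arrival_order:
        for col, value in enumerate(partials[bidx]):
            total[col] += value
    return total


if __name__ == "__main__":
    # Values chosen so that floating-point addition order changes the result.
    partials = [[1e17], [1.0], [-1e17]]
    for order in ([0, 1, 2], [0, 2, 1], [2, 1, 0]):
        print(order, fused_dw_reduction(partials, order),
              arrival_order_sum(partials, order))
```

Because the final accumulation always walks rows 0..num_ctas-1, the fused result does not depend on which CTA happens to arrive last, whereas the arrival-order sum can change with scheduling.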

Only enabled for N <= 8192 (cluster_n == 1). For larger N, falls back to the existing host-side .sum(dim=0) reduction.

Based on the approach discussed in #101 by @AaronWang04

Co-authored-by: Aaron Wang <aaronwang04@users.noreply.github.com>

santoshmo had a problem deploying to gpu-ci with GitHub Actions (Error), April 19, 2026 02:05

santoshmo marked this pull request as draft April 19, 2026 02:07

santoshmo closed this Apr 19, 2026

santoshmo wants to merge 1 commit into Dao-AILab:main from santoshmo:main
