[Cute,Sm100] allow for zero length sequences in hdim 256 kernels by jayhshah · Pull Request #2568 · Dao-AILab/flash-attention

jayhshah · 2026-05-15T02:55:26Z

We add support for zero-length Q and KV sequences for varlen mode in the sm100 hdim 256 kernels. Changes are as follows:

Forward: softmax now executes one dummy iteration so the kernel does not hang when K has zero length. We also check the row sum for zero or NaN so that the output is not NaN in this case.
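To illustrate the row-sum guard, here is a minimal pure-Python reference for a single query row. It is a sketch of the idea only, not the kernel code: the function name `attend_row` and its arguments are hypothetical, and returning zeros for a guarded row is an assumption about the intended result.

```python
import math

def attend_row(scores, values, head_dim):
    """Softmax-weighted sum of `values` for one query row.

    `scores` and `values` may be empty (zero-length K). Hypothetical helper,
    not part of flash-attention.
    """
    # With no keys the running max stays at -inf.
    row_max = max(scores, default=-math.inf)
    acc = [0.0] * head_dim
    row_sum = 0.0
    for s, v in zip(scores, values):
        # If every score is -inf, s - row_max is NaN and so is p.
        p = math.exp(s - row_max)
        row_sum += p
        for d in range(head_dim):
            acc[d] += p * v[d]
    # Guard: a zero or NaN row sum means no valid key contributed,
    # so emit zeros instead of dividing.
    if row_sum == 0.0 or math.isnan(row_sum):
        return [0.0] * head_dim
    return [a / row_sum for a in acc]
```

Without the guard, the empty-K case would divide by a zero row sum and the fully masked case would propagate NaN into the output.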

Backward dQ: the per-iteration work-tile guard now also checks that the trip count is non-zero. If it is zero, the kernel writes zeros as the output.
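The trip-count guard can be sketched on the host side as follows. This is an illustrative model under stated assumptions, not the kernel's tile scheduler: `plan_dq_tiles`, the block sizes of 128, and the action labels are all hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def plan_dq_tiles(cu_seqlens_q, cu_seqlens_k, block_m=128, block_n=128):
    """List (batch, m_block, trip_count, action) for every dQ work tile.

    Hypothetical model: block_m and block_n are assumed tile sizes.
    """
    plan = []
    for b in range(len(cu_seqlens_q) - 1):
        seqlen_q = cu_seqlens_q[b + 1] - cu_seqlens_q[b]
        seqlen_k = cu_seqlens_k[b + 1] - cu_seqlens_k[b]
        # Number of KV blocks the dQ main loop would iterate over.
        trip_count = ceil_div(seqlen_k, block_n)
        # A zero-length Q segment produces no tiles at all.
        for m_block in range(ceil_div(seqlen_q, block_m)):
            # With no KV blocks there is nothing to accumulate,
            # so the tile must be written as zeros explicitly.
            action = "accumulate" if trip_count > 0 else "write_zero"
            plan.append((b, m_block, trip_count, action))
    return plan
```

For example, with `cu_seqlens_q = [0, 128, 128, 300]` and `cu_seqlens_k = [0, 256, 384, 384]`, the second segment has no Q rows and yields no tiles, while the third has Q rows but no K rows, so both of its tiles are marked `write_zero`.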

Backward dKV: zero-length Q and K were nominally supported, but the write-zero logic was broken and caused an illegal memory access (IMA); this PR fixes it.

Johnsonms · 2026-05-16T22:03:43Z

PR fixes the bug.

Before patch (b11 baseline):

forward OK → out.backward(g) → cudaErrorIllegalAddress in _flash_attn_bwd (sm100 hd=256 kernel)

After patch (jshah/hdim256-varlen-zero-lengths, commit 75db52f):

forward OK
backward OK
grads finite, non-zero, and structurally correct: dk[0:2538] (the K rows paired with the zero-length Q segment) is exactly zero as expected, while dk[2538:] carries signal.
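The structural check described above can be expressed as a small helper. This is a sketch, not the reviewer's actual test: the function names are hypothetical, and apart from the 2538 boundary the `cu_seqlens` values in the usage note are made up for illustration.

```python
def zero_grad_k_ranges(cu_seqlens_q, cu_seqlens_k):
    """Return [start, end) K row ranges whose dK and dV must be exactly zero.

    A K segment paired with a zero-length Q segment receives no gradient,
    because the sum over query rows is empty.
    """
    ranges = []
    for b in range(len(cu_seqlens_q) - 1):
        seqlen_q = cu_seqlens_q[b + 1] - cu_seqlens_q[b]
        k_start, k_end = cu_seqlens_k[b], cu_seqlens_k[b + 1]
        if seqlen_q == 0 and k_end > k_start:
            ranges.append((k_start, k_end))
    return ranges

def rows_all_zero(grad, start, end):
    """True if every element of grad[start:end] (a list of rows) is zero."""
    return all(x == 0.0 for row in grad[start:end] for x in row)
```

With the assumed inputs `cu_seqlens_q = [0, 0, 512]` and `cu_seqlens_k = [0, 2538, 4096]`, `zero_grad_k_ranges` returns `[(0, 2538)]`, which corresponds to the `dk[0:2538]` slice checked in the comment.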

allow for zero length sequences in hdim 256 sm100 kernels

75db52f

jayhshah mentioned this pull request May 15, 2026

[Bug] [FA4] [hdim 256] CUDA error 700 in backward pass when using zero-length q sequences with hdim=256 in varlen mode #2562

Closed

jayhshah requested a review from Johnsonms May 15, 2026 03:01

Johnsonms approved these changes May 16, 2026

Johnsonms merged commit 8a8b2f1 into main May 16, 2026
2 of 3 checks passed

umiswing mentioned this pull request May 19, 2026

[WIP] Fa4 d256 varlen zero seq PaddlePaddle/flash-attention#149

Draft

ussoewwin added a commit to ussoewwin/flash-attention that referenced this pull request May 21, 2026

Merge upstream/main: split-kv blocksparse (Dao-AILab#2536), hdim256 z…

b88e772

…ero-len (Dao-AILab#2568), varlen batch search (Dao-AILab#2556)

Johnsonms merged 1 commit into main from jshah/hdim256-varlen-zero-lengths