[CuTe, sm90, sm100] Feat: enable chunked causal attention for low-latency streaming workloads by gpzlx1 · Pull Request #2546 · Dao-AILab/flash-attention

gpzlx1 · 2026-05-07T11:53:15Z

Chunk Attention

Any suggestions and feedback are welcome.

Overview

Chunk attention is a chunk-causal attention variant where query tokens are
partitioned into fixed-size chunks. Each query token can attend to all keys up to
the end of its current chunk.

It keeps the exact softmax attention computation, but changes the causal mask
from "each token can only see its past" to "each token can see previous chunks
and the full current chunk". In other words, the attention boundary advances by
chunks instead of by individual tokens.
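
One way to write the two masks side by side, as a sketch rather than the PR's own notation. The symbols are assumptions introduced here: `i` is a 0-based query position, `j` a 0-based key position, `C` the chunk size, and `N` the sequence length. It also assumes queries and keys share the same positions, and that a partial final chunk is simply cut off at `N`.

```latex
% Standard causal mask: query i sees keys up to and including itself.
\[
  M^{\text{causal}}_{ij} =
  \begin{cases}
    1 & \text{if } j \le i, \\
    0 & \text{otherwise.}
  \end{cases}
\]

% Chunk-causal mask: query i sees keys up to the end of its own chunk.
\[
  M^{\text{chunk}}_{ij} =
  \begin{cases}
    1 & \text{if } j < \min\bigl((\lfloor i / C \rfloor + 1)\, C,\; N\bigr), \\
    0 & \text{otherwise.}
  \end{cases}
\]
```

With `C = 1` the second condition becomes `j < i + 1`, which is the standard causal mask.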

This is useful for streaming workloads where token-level causality is too strict,
but waiting for the full sequence adds too much latency or cost. Common examples
include streaming speech recognition, real-time speech-to-speech models, and
full-duplex voice assistants.

Why Chunk Attention Is Needed

Standard causal attention exposes context one token at a time. This works well
for autoregressive text generation, but speech systems often benefit from a small
amount of bounded right context. For example, nearby future frames can help
resolve phoneme boundaries, short silences, prosody, and word endings.

streaming speech:

audio frames:  f0 f1 f2 f3 | f4 f5 f6 f7 | f8 f9 f10 f11 | ...
               chunk 0     | chunk 1     | chunk 2       |

standard causal at f1: sees f0..f1
chunk attention at f1: sees f0..f3

standard causal at f5: sees f0..f5
chunk attention at f5: sees f0..f7

The latency is still bounded: the model only waits until the current chunk is
available, instead of waiting for the full sequence.
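
The frame example above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the PR: the helper names `causal_last_visible`, `chunk_last_visible`, and `lookahead` are hypothetical, and it assumes 0-based frame indices with a final chunk that may be shorter than `chunk_size`.

```python
def causal_last_visible(q: int) -> int:
    """Last frame index visible to frame q under standard causal masking."""
    return q


def chunk_last_visible(q: int, chunk_size: int, num_frames: int) -> int:
    """Last frame index visible to frame q under chunk-causal masking.

    Assumption: a partial final chunk ends at the last available frame.
    """
    chunk_end = (q // chunk_size + 1) * chunk_size - 1
    return min(chunk_end, num_frames - 1)


def lookahead(q: int, chunk_size: int, num_frames: int) -> int:
    """Number of future frames that frame q can see (its bounded right context)."""
    return chunk_last_visible(q, chunk_size, num_frames) - q


# Same setting as the diagram: chunks of 4 frames, 12 frames in total.
for q in (1, 5):
    print(
        f"f{q}: causal sees f0..f{causal_last_visible(q)}, "
        f"chunk sees f0..f{chunk_last_visible(q, 4, 12)}"
    )
```

The lookahead never exceeds `chunk_size - 1` frames, which is the sense in which the latency stays bounded.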

This pattern appears in streaming speech literature under names such as
chunk-wise attention, chunk-aware attention, shifted chunk attention, and
streaming chunked attention. Related work includes Streaming Chunk-Aware
Multihead Attention for online ASR, Chunked Attention-based Encoder-Decoder
models for streaming speech recognition, SSCFormer, and Shifted Chunk Encoder
for streaming Transformer/Conformer ASR.

In short, chunk attention is useful when token-level causality is too restrictive
but full-sequence attention introduces too much latency. It is especially useful
for streaming speech, low-latency speech-to-speech generation, turn-taking,
interruption handling, endpointing, and full-duplex interaction.

Core Idea

For each batch item, chunk attention takes a chunk_size. Query positions are
partitioned into consecutive chunks:

query positions:

0        7 8       15 16      23 24      31
| chunk 0 | chunk 1 | chunk 2 | chunk 3 |

With chunk_size = 8, all query tokens in chunk 0 can attend to keys 0..7,
all query tokens in chunk 1 can attend to keys 0..15, all query tokens in
chunk 2 can attend to keys 0..23, and so on.
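
The chunk layout described above can be tabulated directly. The sketch below is illustrative only: `chunk_table` is a hypothetical helper, not part of the PR, and it assumes that queries and keys share the same 0-based positions and that a trailing partial chunk stops at the end of the sequence.

```python
def chunk_table(seqlen: int, chunk_size: int):
    """List (chunk index, first query, last query, last visible key) per chunk."""
    rows = []
    for chunk_idx, q_start in enumerate(range(0, seqlen, chunk_size)):
        # Inclusive end of this chunk, clipped to the sequence length.
        q_end = min(q_start + chunk_size, seqlen) - 1
        # Every query in the chunk shares the chunk end as its right boundary.
        rows.append((chunk_idx, q_start, q_end, q_end))
    return rows


# Hypothetical batch: each item has its own sequence length and chunk_size.
batch = [(32, 8), (20, 8)]
for item, (seqlen, chunk_size) in enumerate(batch):
    for chunk_idx, q_start, q_end, k_end in chunk_table(seqlen, chunk_size):
        print(f"item {item}: chunk {chunk_idx}, "
              f"queries {q_start}..{q_end} attend to keys 0..{k_end}")
```

For the first item this gives keys 0..7, 0..15, 0..23, and 0..31 for chunks 0 through 3, matching the description above.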

Compared with standard causal attention:

standard causal:

q=0   sees k <= 0
q=1   sees k <= 1
q=2   sees k <= 2
...
q=15  sees k <= 15

Chunk attention makes every query in the same chunk share the same right
boundary:

chunk causal, chunk_size = 8:

q=0   sees k < 8
q=1   sees k < 8
...
q=7   sees k < 8
q=8   sees k < 16
...
q=15  sees k < 16

So the mask advances by chunks rather than by individual query tokens. The
attention computation itself remains unchanged; only the visibility mask is
different.
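
To make the "only the mask changes" point concrete, here is a small pure-Python reference. It is a sketch under stated assumptions, not the CuTe kernel from this PR: the function names are hypothetical, queries and keys are assumed to share the same positions, and a partial final chunk is clipped to the sequence length. The softmax routine is identical for both masks.

```python
import math


def causal_mask(seqlen):
    """mask[q][k] is True when query q may attend to key k (k <= q)."""
    return [[k <= q for k in range(seqlen)] for q in range(seqlen)]


def chunk_causal_mask(seqlen, chunk_size):
    """mask[q][k] is True when key k lies before the end of q's chunk."""
    mask = []
    for q in range(seqlen):
        right = min((q // chunk_size + 1) * chunk_size, seqlen)  # exclusive
        mask.append([k < right for k in range(seqlen)])
    return mask


def masked_attention(q, k, v, mask):
    """Exact softmax attention; the mask only decides which scores are kept."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        scores = [
            scale * sum(a * b for a, b in zip(q_i, k_j)) if mask[i][j] else -math.inf
            for j, k_j in enumerate(k)
        ]
        top = max(scores)  # finite: every query sees at least its own position
        weights = [math.exp(s - top) for s in scores]  # masked keys get weight 0
        total = sum(weights)
        out.append([
            sum(w * v_j[c] for w, v_j in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


# Toy input: 4 positions, head dimension 2, used as q, k, and v at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_causal = masked_attention(x, x, x, causal_mask(4))
out_chunk = masked_attention(x, x, x, chunk_causal_mask(4, 2))
```

Two sanity checks follow from the definition: `chunk_size = 1` reproduces the standard causal mask, and a `chunk_size` at least as large as the sequence makes every key visible to every query.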

Related Work

Chunked or chunk-wise attention is a common design in low-latency speech models.

Recent real-time and full-duplex speech-to-speech systems also motivate this
type of bounded-context attention, because they must process incoming speech
incrementally while generating responses with low latency.

Feat: support chunk attention in FA4 sm90/sm100

bf01d07

gpzlx1 changed the title ~~[CuTe, sm90, sm100] Feat: enable chunked causal attention for speech and duplex workloads~~ [CuTe, sm90, sm100] Feat: enable chunked causal attention for chunk bi-directional workload May 7, 2026

gpzlx1 changed the title ~~[CuTe, sm90, sm100] Feat: enable chunked causal attention for chunk bi-directional workload~~ [CuTe, sm90, sm100] Feat: enable chunked causal attention for low-latency streaming workloads May 7, 2026

gpzlx1 wants to merge 1 commit into Dao-AILab:main from gpzlx1:gp/chunk-attn
