[CuTe, sm90, sm100] Feat: enable chunked causal attention for low-latency streaming workloads #2546

Open
gpzlx1 wants to merge 1 commit into Dao-AILab:main from gpzlx1:gp/chunk-attn

gpzlx1 commented on May 7, 2026

Chunk Attention

Any suggestions and feedback are welcome.

Overview

Chunk attention is a chunk-causal attention variant where query tokens are
partitioned into fixed-size chunks. Each query token can attend to all keys up to
the end of its current chunk.

It keeps the exact softmax attention computation, but changes the causal mask
from "each token can see only itself and its past" to "each token can see all
previous chunks and the full current chunk". In other words, the attention
boundary advances by chunks instead of by individual tokens.

This is useful for streaming workloads where token-level causality is too strict,
but full-sequence attention adds too much latency or cost. Common examples include
streaming speech recognition, real-time speech-to-speech models, and full-duplex
voice assistants.

Why Chunk Attention Is Needed

Standard causal attention exposes context one token at a time. This works well
for autoregressive text generation, but speech systems often benefit from a small
amount of bounded right context. For example, nearby future frames can help
resolve phoneme boundaries, short silences, prosody, and word endings.

streaming speech:

audio frames:  f0 f1 f2 f3 | f4 f5 f6 f7 | f8 f9 f10 f11 | ...
               chunk 0     | chunk 1     | chunk 2       |

standard causal at f1: sees f0..f1
chunk attention at f1: sees f0..f3

standard causal at f5: sees f0..f5
chunk attention at f5: sees f0..f7

The latency is still bounded: the model only waits until the current chunk is
available, instead of waiting for the full sequence.
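
The frame example above can be reproduced with a few lines of Python. This is only a sketch of the visibility rule, not the kernel code in this PR; the helper names `chunk_right_boundary` and `last_visible_frame` are made up for illustration, and it assumes query and key frames share the same zero-based index.

```python
def chunk_right_boundary(q: int, chunk_size: int) -> int:
    """Exclusive right boundary of the keys visible to query position q."""
    if q < 0 or chunk_size <= 0:
        raise ValueError("q must be >= 0 and chunk_size must be > 0")
    # Round q up to the end of the chunk that contains it.
    return (q // chunk_size + 1) * chunk_size


def last_visible_frame(q: int, chunk_size: int, chunked: bool = True) -> int:
    """Index of the last frame that query frame q can attend to."""
    if chunked:
        return chunk_right_boundary(q, chunk_size) - 1
    return q  # standard causal: a frame sees itself and everything before it


for q in (1, 5):
    causal = last_visible_frame(q, chunk_size=4, chunked=False)
    chunk = last_visible_frame(q, chunk_size=4)
    print(f"f{q}: standard causal sees f0..f{causal}, chunk attention sees f0..f{chunk}")
```

With `chunk_size=4` this prints the same ranges as the diagram: f1 sees f0..f1 versus f0..f3, and f5 sees f0..f5 versus f0..f7.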

This pattern appears in streaming speech literature under names such as
chunk-wise attention, chunk-aware attention, shifted chunk attention, and
streaming chunked attention. Related work includes Streaming Chunk-Aware
Multihead Attention for online ASR, Chunked Attention-based Encoder-Decoder
models for streaming speech recognition, SSCFormer, and Shifted Chunk Encoder
for streaming Transformer/Conformer ASR.

In short, chunk attention is useful when token-level causality is too restrictive
but full-sequence attention introduces too much latency. It is especially useful
for streaming speech, low-latency speech-to-speech generation, turn-taking,
interruption handling, endpointing, and full-duplex interaction.

Core Idea

For each batch item, chunk attention takes a chunk_size. Query positions are
partitioned into consecutive chunks:

query positions:

0        7 8       15 16      23 24      31
| chunk 0 | chunk 1 | chunk 2 | chunk 3 |

With chunk_size = 8, all query tokens in chunk 0 can attend to keys 0..7,
all query tokens in chunk 1 can attend to keys 0..15, all query tokens in
chunk 2 can attend to keys 0..23, and so on.
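
The same rule can be written as a mask formula. The symbols here are notation introduced for this note, not taken from the PR: $C$ is chunk_size, $q$ and $k$ are zero-based query and key positions that are assumed to share the same index origin, and $M_{q,k} = 1$ means key $k$ is visible to query $q$.

```latex
M_{q,k} =
\begin{cases}
1 & \text{if } k < \left(\left\lfloor q / C \right\rfloor + 1\right) C, \\
0 & \text{otherwise.}
\end{cases}
```

For example, with $C = 8$ and $q = 10$, $\lfloor 10/8 \rfloor = 1$, so the boundary is $(1 + 1) \cdot 8 = 16$ and keys 0..15 are visible, which matches the chunk 1 case above.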

Compared with standard causal attention:

standard causal:

q=0   sees k <= 0
q=1   sees k <= 1
q=2   sees k <= 2
...
q=15  sees k <= 15

Chunk attention makes every query in the same chunk share the same right
boundary:

chunk causal, chunk_size = 8:

q=0   sees k < 8
q=1   sees k < 8
...
q=7   sees k < 8
q=8   sees k < 16
...
q=15  sees k < 16

So the mask advances by chunks rather than by individual query tokens. The
attention computation itself remains unchanged; only the visibility mask is
different.
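
To make "only the visibility mask is different" concrete, here is a small pure-Python reference. It is a sketch under stated assumptions, not the CuTe implementation in this PR: `chunk_causal_mask` and `masked_attention` are hypothetical names, there is a single head and no batching, and query and key positions are assumed to share the same zero-based index.

```python
import math


def chunk_causal_mask(seqlen_q: int, seqlen_k: int, chunk_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query q may attend to key k."""
    return [
        [k < (q // chunk_size + 1) * chunk_size for k in range(seqlen_k)]
        for q in range(seqlen_q)
    ]


def masked_attention(Q, K, V, mask):
    """Plain softmax(Q K^T / sqrt(d)) V, with masked-out keys skipped."""
    d = len(Q[0])
    out = []
    for q_vec, mask_row in zip(Q, mask):
        # Scaled dot-product scores for the visible keys only.
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
            for k_vec, visible in zip(K, mask_row) if visible
        ]
        values = [v_vec for v_vec, visible in zip(V, mask_row) if visible]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return out


mask = chunk_causal_mask(seqlen_q=16, seqlen_k=16, chunk_size=8)
# Every query in a chunk shares the same row of the mask.
assert mask[0] == mask[7] and mask[8] == mask[15]
# The chunk mask never hides anything that standard causal attention shows.
assert all(mask[q][k] for q in range(16) for k in range(q + 1))
```

Swapping `chunk_causal_mask` for a standard causal mask (`k <= q`) leaves `masked_attention` untouched, which is the point: the softmax path is shared and only the boundary changes.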

Related Work

Chunked or chunk-wise attention is a common design in low-latency speech models.

Recent real-time and full-duplex speech-to-speech systems also motivate this
type of bounded-context attention, because they must process incoming speech
incrementally while generating responses with low latency.

gpzlx1 changed the title from "[CuTe, sm90, sm100] Feat: enable chunked causal attention for speech and duplex workloads" to "[CuTe, sm90, sm100] Feat: enable chunked causal attention for chunk bi-directional workload" on May 7, 2026
gpzlx1 changed the title from "[CuTe, sm90, sm100] Feat: enable chunked causal attention for chunk bi-directional workload" to "[CuTe, sm90, sm100] Feat: enable chunked causal attention for low-latency streaming workloads" on May 7, 2026