[CuTe, sm90, sm100] Feat: enable chunked causal attention for low-latency streaming workloads #2546
Open
gpzlx1 wants to merge 1 commit into
Chunk Attention
Any suggestions and feedback are welcome.
Overview
Chunk attention is a chunk-causal attention variant where query tokens are
partitioned into fixed-size chunks. Each query token can attend to all keys up to
the end of its current chunk.
It keeps the exact softmax attention computation, but changes the causal mask
from "each token can only see its past" to "each token can see previous chunks
and the full current chunk". In other words, the attention boundary advances by
chunks instead of by individual tokens.
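To make "same softmax, different mask" concrete, here is a minimal pure-Python sketch for a single query row. It is not the kernel from this PR: the helper names (`masked_softmax`, `causal_row`, `chunk_row`) are hypothetical, and it assumes queries and keys share the same positions starting at 0.

```python
import math


def masked_softmax(scores, visible):
    """Softmax over the visible scores; hidden keys get weight 0."""
    kept = [s for s, v in zip(scores, visible) if v]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if v else 0.0 for s, v in zip(scores, visible)]
    total = sum(exps)
    return [e / total for e in exps]


def causal_row(q_idx, seqlen):
    """Standard causal visibility: keys at or before the query position."""
    return [k <= q_idx for k in range(seqlen)]


def chunk_row(q_idx, seqlen, chunk_size):
    """Chunk-causal visibility: keys up to the end of the query's chunk."""
    chunk_end = (q_idx // chunk_size + 1) * chunk_size  # exclusive bound
    return [k < chunk_end for k in range(seqlen)]


# Same scores and same softmax; only the visibility row changes.
scores = [0.0, 0.0, 0.0, 0.0]
causal_weights = masked_softmax(scores, causal_row(1, 4))
chunk_weights = masked_softmax(scores, chunk_row(1, 4, chunk_size=4))
print(causal_weights)  # [0.5, 0.5, 0.0, 0.0]
print(chunk_weights)   # [0.25, 0.25, 0.25, 0.25]
```

With uniform scores, the query at position 1 spreads its weight over two keys under the causal mask and over all four keys of its chunk under the chunk mask.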
This is useful for streaming workloads where token-level causality is too strict,
but full-sequence attention adds too much latency or cost. Common examples include
streaming speech recognition, real-time speech-to-speech models, and full-duplex
voice assistants.
Why Chunk Attention Is Needed
Standard causal attention exposes context one token at a time. This works well
for autoregressive text generation, but speech systems often benefit from a small
amount of bounded right context. For example, nearby future frames can help
resolve phoneme boundaries, short silences, prosody, and word endings.
The latency is still bounded: the model only waits until the current chunk is
available, instead of waiting for the full sequence.
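The latency bound can be checked with a little arithmetic. The sketch below counts how many future positions each query waits for under the chunk rule. The function names are hypothetical, and the 40 ms frame duration is only an illustrative assumption, not a figure from this PR.

```python
def lookahead(q_idx, chunk_size):
    """Future positions visible to (and awaited by) the query at q_idx."""
    chunk_end = (q_idx // chunk_size + 1) * chunk_size - 1  # last index in chunk
    return chunk_end - q_idx


def max_lookahead_ms(chunk_size, frame_ms):
    """Worst-case wait: the first query in a chunk waits for the rest of it."""
    return (chunk_size - 1) * frame_ms


# Right context shrinks toward the end of each chunk, then resets.
print([lookahead(i, 4) for i in range(8)])  # [3, 2, 1, 0, 3, 2, 1, 0]

# Assumed 40 ms frames with chunk_size = 8: at most 7 future frames.
print(max_lookahead_ms(8, 40))  # 280
```

The worst-case wait depends only on chunk_size, not on the sequence length, which is what keeps the latency bounded.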
This pattern appears in streaming speech literature under names such as
chunk-wise attention, chunk-aware attention, shifted chunk attention, and
streaming chunked attention. Related work includes Streaming Chunk-Aware
Multihead Attention for online ASR, Chunked Attention-based Encoder-Decoder
models for streaming speech recognition, SSCFormer, and Shifted Chunk Encoder
for streaming Transformer/Conformer ASR.
In short, chunk attention is useful when token-level causality is too restrictive
but full-sequence attention introduces too much latency. It is especially useful
for streaming speech, low-latency speech-to-speech generation, turn-taking,
interruption handling, endpointing, and full-duplex interaction.
Core Idea
For each batch item, chunk attention takes a chunk_size. Query positions are
partitioned into consecutive chunks of chunk_size tokens, so the query at
position i belongs to chunk floor(i / chunk_size).
With chunk_size = 8, all query tokens in chunk 0 can attend to keys 0..7,
all query tokens in chunk 1 can attend to keys 0..15, all query tokens in
chunk 2 can attend to keys 0..23, and so on.
Standard causal attention lets the query at position i attend only to keys 0..i.
Chunk attention instead makes every query in the same chunk share the same right
boundary: the last key position of that chunk, (c + 1) * chunk_size - 1 for chunk c.
So the mask advances by chunks rather than by individual query tokens. The
attention computation itself remains unchanged; only the visibility mask is
different.
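The index arithmetic behind the mask can be sketched in a few lines of plain Python. This is an illustrative reference, not the CuTe kernel from this PR: all function names are hypothetical, and it assumes equal query and key lengths with positions aligned from 0.

```python
def chunk_right_boundary(q_idx: int, chunk_size: int) -> int:
    """Index of the last key visible to the query at position q_idx."""
    chunk_id = q_idx // chunk_size
    return (chunk_id + 1) * chunk_size - 1


def causal_visible(q_idx: int, k_idx: int) -> bool:
    """Standard causal rule: keys at or before the query position."""
    return k_idx <= q_idx


def chunk_visible(q_idx: int, k_idx: int, chunk_size: int) -> bool:
    """Chunk-causal rule: keys up to the end of the query's own chunk."""
    return k_idx <= chunk_right_boundary(q_idx, chunk_size)


def build_mask(seqlen: int, chunk_size: int) -> list[list[bool]]:
    """Dense visibility mask: rows are queries, columns are keys."""
    return [
        [chunk_visible(q, k, chunk_size) for k in range(seqlen)]
        for q in range(seqlen)
    ]


if __name__ == "__main__":
    # Queries 0..3 share boundary 3; queries 4..7 share boundary 7.
    for row in build_mask(seqlen=8, chunk_size=4):
        print("".join("x" if v else "." for v in row))
    # xxxx....   (printed four times, for queries 0..3)
    # xxxxxxxx   (printed four times, for queries 4..7)
```

The mask is a staircase with steps of width chunk_size, where the causal mask has steps of width 1. A fused kernel would typically evaluate the same predicate per tile rather than materialize the dense matrix.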
Related Work
Chunked or chunk-wise attention is a common design in low-latency speech models, as in the streaming ASR work named above.
Recent real-time and full-duplex speech-to-speech systems also motivate this
type of bounded-context attention, because they must process incoming speech
incrementally while generating responses with low latency.