feat(trtllm-MHA): support mixed prefill/decode batches #176
Signed-off-by: rjzhb <rjzhb222@163.com>
…atch # Conflicts: # python/tokenspeed/runtime/layers/attention/backends/trtllm.py
Drops the split-kernel routing (decode rows through the decode kernel) in favor of feeding the whole ragged batch through `trtllm_batch_context_with_kv_cache`.
Bench on B200: ~3.6x faster on small batches, ~1.06x on TP=2 MiniMax-M2.5 at 16k.
Numerical: bf16 reduction-order drift vs. split path (typical kernel-pair noise, not a correctness issue).
Signed-off-by: rjzhb <rjzhb222@163.com>
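As a rough illustration of the "whole ragged batch" idea, here is a minimal sketch (not the code from this PR) of how a mixed batch can be described as one ragged batch: each decode row is treated as a query of length 1, and cumulative offsets mark the row boundaries in the flattened token dimension. The helper name `ragged_offsets` and the example lengths are hypothetical, and the actual arguments expected by `trtllm_batch_context_with_kv_cache` are not shown.

```python
from itertools import accumulate


def ragged_offsets(q_lens):
    """Return cumulative start offsets for a ragged batch (length = rows + 1)."""
    if any(n <= 0 for n in q_lens):
        raise ValueError("every row needs at least one query token")
    return [0, *accumulate(q_lens)]


# Hypothetical mixed batch: two prefill rows (5 and 3 new tokens)
# followed by two decode rows (1 new token each).
q_lens = [5, 3, 1, 1]
offsets = ragged_offsets(q_lens)
total_tokens = offsets[-1]

# Row i owns the flattened token range [offsets[i], offsets[i + 1]).
row_ranges = list(zip(offsets[:-1], offsets[1:]))
```

Under this view a decode row is just a very short prefill row, which is why a single context kernel can in principle serve both kinds of rows without routing them separately.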
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2df0ce0e4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Performance & Quality
Setup: MiniMax-M2.5 bf16, B200 TP=2, trtllm backend. Two server instances on disjoint GPU pairs, A with
Workload profiles:
Tables format:
Prefill-heavy results:
Decode-heavy results:
Findings:
…d_kernel Signed-off-by: rjzhb <rjzhb222@163.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e3c054e5db
…ix-cached Signed-off-by: rjzhb <rjzhb222@163.com>
Summary
Adds `ForwardMode.MIXED` handling to `TRTLLMMHAAttnBackend` so `--enable-mixed-batch` can co-schedule prefill and decode rows in a single step on MHA models. Pairs with the DSv4 implementation in #122.
CUDA-graph capture
Not added in this PR: the capture entrypoint raises `NotImplementedError` for `is_extend_or_mixed()`, so MIXED runs eagerly (same as EXTEND post-#164).
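The guard described above can be sketched as follows. This is a simplified stand-in, not the backend's actual code: the `ForwardMode` enum here is a toy re-declaration, and `init_capture_metadata` is a hypothetical name for the capture entrypoint.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    """Toy stand-in for the runtime's forward-mode enum (assumed shape)."""

    EXTEND = auto()
    DECODE = auto()
    MIXED = auto()

    def is_extend_or_mixed(self) -> bool:
        return self in (ForwardMode.EXTEND, ForwardMode.MIXED)


def init_capture_metadata(mode: ForwardMode) -> str:
    """Hypothetical capture entrypoint: refuse modes that must run eagerly."""
    if mode.is_extend_or_mixed():
        raise NotImplementedError(
            f"CUDA-graph capture is not supported for {mode.name}"
        )
    return "captured"
```

Failing loudly at capture time keeps an unsupported mode from being silently recorded into a graph; the caller falls back to eager execution instead.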
Test Plan
- `_init_mixed_metadata` three-slot layout and `_forward_mixed_split_kernel` slicing + sentinel restore + concat
- `ts serve` with sweep workload
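For readers unfamiliar with the slicing + concat pattern named in the test plan, here is a minimal sketch under stated assumptions: prefill tokens are laid out before decode tokens in the flattened batch, and the two kernels are represented by plain Python callables. The function `forward_split` and both stand-in kernels are hypothetical; the three-slot metadata layout and the sentinel-restore step are not modeled because their details are not given here.

```python
def forward_split(tokens, num_prefill_tokens, prefill_fn, decode_fn):
    """Route the prefill slice and the decode slice separately, then concat."""
    if not 0 <= num_prefill_tokens <= len(tokens):
        raise ValueError("num_prefill_tokens is out of range")
    prefill_out = prefill_fn(tokens[:num_prefill_tokens])
    decode_out = decode_fn(tokens[num_prefill_tokens:])
    # Concatenate in the original order so output row i matches input row i.
    return list(prefill_out) + list(decode_out)


# Stand-in "kernels" that tag each token with the path it took.
def fake_prefill(xs):
    return [("prefill", x) for x in xs]


def fake_decode(xs):
    return [("decode", x) for x in xs]


# Hypothetical batch: 8 prefill tokens followed by 2 decode tokens.
out = forward_split(list(range(10)), 8, fake_prefill, fake_decode)
```

A unit test for this kind of routing would typically check that the output length and token order match the input, and that each token went through the intended path.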