feat(trtllm-MHA): support mixed prefill/decode batches by rjzhb · Pull Request #176 · lightseekorg/tokenspeed

rjzhb · 2026-05-18T23:03:49Z

Summary

Adds ForwardMode.MIXED handling to TRTLLMMHAAttnBackend so
--enable-mixed-batch can co-schedule prefill and decode rows in a
single step on MHA models. Pairs with the DSv4 implementation in #122.

CUDA-graph capture

Not added in this PR — capture entrypoint raises NotImplementedError
for is_extend_or_mixed(), so MIXED runs eagerly (same as EXTEND
post-#164).
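
A minimal sketch of the capture guard described above, for readers unfamiliar with the pattern. The `ForwardMode` enum here is a stand-in that models only the modes named in this PR, and `init_capture_metadata` / `choose_execution` are hypothetical names, not the backend's actual entrypoints.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    # Stand-in enum: only the modes mentioned in this PR are modeled.
    DECODE = auto()
    EXTEND = auto()
    MIXED = auto()

    def is_extend_or_mixed(self) -> bool:
        return self in (ForwardMode.EXTEND, ForwardMode.MIXED)


def init_capture_metadata(mode: ForwardMode) -> str:
    """Hypothetical capture entrypoint: rejects EXTEND and MIXED."""
    if mode.is_extend_or_mixed():
        raise NotImplementedError(
            f"CUDA-graph capture is not supported for {mode.name}"
        )
    return "captured"


def choose_execution(mode: ForwardMode) -> str:
    """Hypothetical caller: modes that cannot be captured run eagerly."""
    if mode.is_extend_or_mixed():
        return "eager"
    return init_capture_metadata(mode)
```

Under these assumptions, `choose_execution(ForwardMode.MIXED)` takes the eager path, while `ForwardMode.DECODE` still goes through capture.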

Test Plan

Unit tests for _init_mixed_metadata three-slot layout and _forward_mixed_split_kernel slicing + sentinel restore + concat
e2e on Qwen3-4B TP=1 — single-prompt and multi-prompt MIXED
e2e on MiniMax-M2.5 TP=2 ts serve with sweep workload
Existing EXTEND-only / DECODE-only paths unaffected
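
For orientation, the slicing + concat step that the unit tests exercise could look roughly like the sketch below. It assumes prefill tokens are laid out before decode tokens in the flattened batch; that layout, the name `forward_mixed`, and the toy kernels are illustrative assumptions. The three-slot metadata and the sentinel restore are not modeled here.

```python
from typing import Callable, List

# A "kernel" here is any function mapping per-token inputs to per-token outputs.
Kernel = Callable[[List[float]], List[float]]


def forward_mixed(
    tokens: List[float],
    num_prefill_tokens: int,
    context_kernel: Kernel,
    decode_kernel: Kernel,
) -> List[float]:
    """Slice a flattened mixed batch, run each slice, and concatenate.

    Assumes prefill tokens occupy tokens[:num_prefill_tokens] and decode
    tokens (one per decode row) occupy the remainder.
    """
    prefill_part = tokens[:num_prefill_tokens]
    decode_part = tokens[num_prefill_tokens:]

    out: List[float] = []
    if prefill_part:  # skip the context kernel when there is nothing to prefill
        out.extend(context_kernel(prefill_part))
    if decode_part:  # skip the decode kernel when there are no decode rows
        out.extend(decode_kernel(decode_part))
    return out
```

The output keeps the input token order, so downstream layers see the same row layout they passed in.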

Signed-off-by: rjzhb <rjzhb222@163.com>

Drops the split-kernel routing (decode rows through the decode kernel) in favor of feeding the whole ragged batch through trtllm_batch_context_with_kv_cache.
Bench on B200: ~3.6x faster on small batches, ~1.06x on TP=2 MiniMax-M2.5 at 16k.
Numerical: bf16 reduction-order drift vs. the split path (typical kernel-pair noise, not a correctness issue).

Signed-off-by: rjzhb <rjzhb222@163.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2df0ce0e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

rjzhb · 2026-05-20T18:35:32Z

Performance & Quality

Setup: MiniMax-M2.5 bf16, B200 TP=2, trtllm backend. Two server instances on disjoint GPU pairs, A with --enable-mixed-batch, B without. Both modes use the same prefill-priority scheduler; the flag only controls whether decode requests may co-execute in the same step as in-progress prefills (A: yes; B: defer all decode to a later step whenever any prefill op is scheduled). Closed-loop HTTP load (Poisson), same seed/workload for both.
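
To make the A/B difference concrete, here is a small sketch of the gating rule as described above. `Request`, `select_step`, and the `enable_mixed_batch` argument are hypothetical names for illustration; the real scheduler is not shown in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    rid: str
    is_prefill: bool  # True while the request still has prefill work left


def select_step(pending: List[Request], enable_mixed_batch: bool) -> List[Request]:
    """Pick the requests that run in the next step (prefill-priority)."""
    prefills = [r for r in pending if r.is_prefill]
    decodes = [r for r in pending if not r.is_prefill]

    if not prefills:
        # No prefill scheduled: both modes run a decode-only step.
        return decodes
    if enable_mixed_batch:
        # Instance A: decode rows co-execute with the in-progress prefills.
        return prefills + decodes
    # Instance B: all decode rows are deferred to a later step.
    return prefills
```

With one prefill and one decode pending, instance A runs both in the same step and instance B runs only the prefill.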

Workload profiles:

Prefill-heavy — long prompts (~12k token avg, p99 ≥ 100k), ~100:1 prefill:decode token ratio, QPS sweep 0.1–0.5.
Decode-heavy — short prompts (~5k token avg), many concurrent decoders, QPS sweep 2–10.

Table format: baseline → mixed. Latency columns give p50/p90 in ms. A positive Δ on throughput is a win; a negative Δ on latency is a win.

Prefill-heavy results:

QPS	gen_tps	prefill_tps	TTFT p50/p90 (ms)	E2E p50/p90 (ms)
0.1	92.7 → 91.3 (-1.5%)	2310 → 2270 (-1.7%)	273/1183 → 271/1257	4232/12215 → 4242/12286
0.2	75.9 → 76.1 (≈)	8007 → 8133 (+1.6%)	412/2291 → 442/2318	5214/13038 → 5214/13095
0.3	61.3 → 60.0 (-2.1%)	15202 → 15162 (≈)	550/4319 → 602/4303	7607/24879 → 7933/25095
0.4	49.5 → 48.1 (-2.8%)	20390 → 20604 (+1.1%)	865/6819 → 1007/7079	10176/31784 → 10467/33303
0.5	37.3 → 39.4 (+5.6%)	22965 → 22760 (≈)	3601/38416 → 2756/23551	19975/57752 → 16802/47124

Decode-heavy results:

QPS	gen_tps	prefill_tps	TTFT p50/p90 (ms)	E2E p50/p90 (ms)
2	79.0 → 77.5 (-1.9%)	1240 → 1182 (-4.7%)	101/165 → 113/181	1191/2502 → 1213/2579
4	50.1 → 50.2 (≈)	2574 → 2584 (≈)	138/207 → 157/244	1874/4108 → 1884/4073
6	26.9 → 29.0 (+7.8%)	3869 → 3898 (≈)	175/284 → 205/330	3369/7638 → 3143/7095
8	15.4 → 16.6 (+7.8%)	4592 → 4718 (+2.7%)	235/351 → 298/429	6349/14836 → 5880/13492
10	14.1 → 14.5 (+2.8%)	4787 → 4896 (+2.3%)	243/365 → 307/440	6758/15758 → 6619/15144

Findings:

MIXED delivers throughput and end-to-end latency wins when the baseline defers decode behind active prefills: prefill-heavy @ qps=0.5 (+5.6% gen_tps, -15.9% E2E p50, -18.4% E2E p90, -38.7% TTFT p90), decode-heavy @ qps=6–8 (+7.8% gen_tps, roughly -7% to -9% E2E p50/p90).
TTFT regresses 7–27% at p50 under moderate load — the split-kernel path runs the context kernel and the decode kernel sequentially on the same CUDA stream, extending the step that completes the new request's last prefill chunk. Tail TTFT (p90) actually improves under high load on prefill-heavy because decode no longer queues up behind long prefill bursts.
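
The Δ figures quoted above are plain relative changes against the baseline. A short sketch of the arithmetic, using the prefill-heavy qps=0.5 row from the table (the helper name `pct_delta` is just for illustration):

```python
def pct_delta(baseline: float, mixed: float) -> float:
    """Relative change of mixed vs. baseline, in percent."""
    return (mixed - baseline) / baseline * 100.0


# Prefill-heavy, qps=0.5, values copied from the results table (baseline, mixed).
qps_0_5 = {
    "gen_tps": (37.3, 39.4),
    "e2e_p50_ms": (19975, 16802),
    "e2e_p90_ms": (57752, 47124),
    "ttft_p90_ms": (38416, 23551),
}

for name, (baseline, mixed) in qps_0_5.items():
    print(f"{name}: {pct_delta(baseline, mixed):+.1f}%")
# gen_tps: +5.6%
# e2e_p50_ms: -15.9%
# e2e_p90_ms: -18.4%
# ttft_p90_ms: -38.7%
```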

Signed-off-by: rjzhb <rjzhb222@163.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3c054e5db

rjzhb added 2 commits May 18, 2026 00:55

feat(trtllm): support mixed prefill/decode batches

70d384b

Signed-off-by: rjzhb <rjzhb222@163.com>

chore(trtllm): drop mixed-batch context-kernel path

f934b45

Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb requested a review from a team as a code owner May 18, 2026 23:03

rjzhb added 2 commits May 18, 2026 23:27

Merge remote-tracking branch 'upstream/main' into feat/trtllm-mixed-b…

7ca2222

…atch # Conflicts: # python/tokenspeed/runtime/layers/attention/backends/trtllm.py

LorrinWWW reviewed May 19, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py

LorrinWWW reviewed May 19, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated

Merge branch 'main' into feat/trtllm-mixed-batch

b2df0ce

chatgpt-codex-connector Bot reviewed May 19, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated

refactor(trtllm): fold MIXED into prefill forward path

5f2f670

Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb and others added 3 commits May 21, 2026 18:39

refactor(trtllm): use split context+decode kernels for MIXED

70dd645

Signed-off-by: rjzhb <rjzhb222@163.com>

refactor(trtllm): rename _forward_mixed_split_kernel to _forward_mixe…

3ccf63c

…d_kernel Signed-off-by: rjzhb <rjzhb222@163.com>

Merge branch 'main' into feat/trtllm-mixed-batch

e3c054e

chatgpt-codex-connector Bot reviewed May 21, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated

fix(trtllm): skip context kernel when all MIXED prefill rows are pref…

80ffe57

…ix-cached Signed-off-by: rjzhb <rjzhb222@163.com>

LorrinWWW self-assigned this May 25, 2026

rjzhb wants to merge 10 commits into lightseekorg:main from rjzhb:feat/trtllm-mixed-batch
