feat(trtllm-MHA): support mixed prefill/decode batches #176

Open
rjzhb wants to merge 10 commits into lightseekorg:main from rjzhb:feat/trtllm-mixed-batch

@rjzhb rjzhb commented May 18, 2026

Summary

Adds ForwardMode.MIXED handling to TRTLLMMHAAttnBackend so
--enable-mixed-batch can co-schedule prefill and decode rows in a
single step on MHA models. Pairs with the DSv4 implementation in #122.

CUDA-graph capture

Not added in this PR — capture entrypoint raises NotImplementedError
for is_extend_or_mixed(), so MIXED runs eagerly (same as EXTEND
post-#164).
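
A minimal sketch of the capture guard described above. The `ForwardMode` enum here is a simplified stand-in, and `init_capture_metadata` is a hypothetical name for the capture entrypoint; the real backend's signature is not shown in this description.

```python
from enum import Enum, auto


class ForwardMode(Enum):
    # Simplified stand-in: only the modes named in this PR are modeled.
    EXTEND = auto()
    DECODE = auto()
    MIXED = auto()

    def is_extend_or_mixed(self) -> bool:
        return self in (ForwardMode.EXTEND, ForwardMode.MIXED)


def init_capture_metadata(mode: ForwardMode) -> str:
    """Hypothetical capture entrypoint: reject modes that must run eagerly."""
    if mode.is_extend_or_mixed():
        raise NotImplementedError(
            f"CUDA-graph capture is not supported for {mode.name}; run eagerly"
        )
    # Placeholder for the real capture-time metadata setup.
    return "captured"
```

Under these assumptions only DECODE reaches the capture path; EXTEND and MIXED fail fast and fall back to eager execution at the call site.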

Test Plan

  • Unit tests for _init_mixed_metadata three-slot layout and
    _forward_mixed_split_kernel slicing + sentinel restore + concat
  • e2e on Qwen3-4B TP=1 — single-prompt and multi-prompt MIXED
  • e2e on MiniMax-M2.5 TP=2 ts serve with sweep workload
  • Existing EXTEND-only / DECODE-only paths unaffected

rjzhb added 2 commits May 18, 2026 00:55
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
@rjzhb rjzhb requested a review from a team as a code owner May 18, 2026 23:03
rjzhb added 2 commits May 18, 2026 23:27
…atch

# Conflicts:
#	python/tokenspeed/runtime/layers/attention/backends/trtllm.py
Drops the split-kernel routing (decode rows through the decode kernel) in
favor of feeding the whole ragged batch through trtllm_batch_context_with_kv_cache.
Bench on B200: ~3.6x faster on small batches, ~1.06x on TP=2 MiniMax-M2.5
at 16k. Numerical: bf16 reduction-order drift vs split path (typical
kernel-pair noise, not a correctness issue).

Signed-off-by: rjzhb <rjzhb222@163.com>
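
To illustrate the idea in the commit message above, the sketch below treats each decode row as a prefill row with a single query token and builds one ragged layout for the whole batch. The helper name `build_ragged_layout` and its return shape are assumptions made for illustration; it does not call `trtllm_batch_context_with_kv_cache`, whose arguments are not shown in this thread.

```python
from itertools import accumulate


def build_ragged_layout(prefix_lens, extend_lens):
    """Build a ragged-batch layout for mixed prefill/decode rows.

    prefix_lens[i]: tokens of row i already in the KV cache.
    extend_lens[i]: new query tokens of row i (1 for a decode row).
    """
    # Cumulative query offsets: row i owns q[cu_q[i]:cu_q[i + 1]].
    cu_q = [0] + list(accumulate(extend_lens))
    # Each row attends over its cached prefix plus its new tokens.
    kv_lens = [p + e for p, e in zip(prefix_lens, extend_lens)]
    return cu_q, kv_lens, max(extend_lens), max(kv_lens)


# Two prefill rows followed by two decode rows.
cu_q, kv_lens, max_q_len, max_kv_len = build_ragged_layout(
    prefix_lens=[0, 8, 20, 7],
    extend_lens=[5, 3, 1, 1],
)
print(cu_q, kv_lens, max_q_len, max_kv_len)
```

With this layout a single context-style kernel launch can cover every row, which is what removes the need for a second decode-kernel launch.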
Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py
Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2df0ce0e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated
Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb commented May 20, 2026

Performance & Quality

Setup: MiniMax-M2.5 bf16, B200 TP=2, trtllm backend. Two server instances on disjoint GPU pairs, A with --enable-mixed-batch, B without. Both modes use the same prefill-priority scheduler; the flag only controls whether decode requests may co-execute in the same step as in-progress prefills (A: yes; B: defer all decode to a later step whenever any prefill op is scheduled). Closed-loop HTTP load (Poisson), same seed/workload for both.
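
The difference between the two server configurations can be sketched as a toy per-step policy. `schedule_step` and its queue arguments are hypothetical; the real scheduler also handles token budgets and prefill chunking, which are ignored here.

```python
def schedule_step(prefill_queue, decode_queue, enable_mixed_batch):
    """Return (prefill_rows, decode_rows) for one step under prefill priority."""
    prefill_rows = list(prefill_queue)
    if prefill_rows and not enable_mixed_batch:
        # Baseline (B): any scheduled prefill defers all decode to a later step.
        return prefill_rows, []
    # Mixed (A), or no prefill pending: decode rows run in this step.
    return prefill_rows, list(decode_queue)


print(schedule_step(["p0"], ["d0", "d1"], enable_mixed_batch=True))
print(schedule_step(["p0"], ["d0", "d1"], enable_mixed_batch=False))
```

In this toy model the flag only matters on steps where at least one prefill is scheduled, which matches the description of the setup above.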

Workload profiles:

  • Prefill-heavy — long prompts (~12k token avg, p99 ≥ 100k), ~100:1 prefill:decode token ratio, QPS sweep 0.1–0.5.
  • Decode-heavy — short prompts (~5k token avg), many concurrent decoders, QPS sweep 2–10.

Table format: each cell is baseline → mixed. Latency columns are p50/p90 in ms; a positive Δ on throughput is a win, and a negative Δ on latency is a win.

Prefill-heavy results:

| QPS | gen_tps | prefill_tps | TTFT p50/p90 (ms) | E2E p50/p90 (ms) |
|---|---|---|---|---|
| 0.1 | 92.7 → 91.3 (-1.5%) | 2310 → 2270 (-1.7%) | 273/1183 → 271/1257 | 4232/12215 → 4242/12286 |
| 0.2 | 75.9 → 76.1 (≈) | 8007 → 8133 (+1.6%) | 412/2291 → 442/2318 | 5214/13038 → 5214/13095 |
| 0.3 | 61.3 → 60.0 (-2.1%) | 15202 → 15162 (≈) | 550/4319 → 602/4303 | 7607/24879 → 7933/25095 |
| 0.4 | 49.5 → 48.1 (-2.8%) | 20390 → 20604 (+1.1%) | 865/6819 → 1007/7079 | 10176/31784 → 10467/33303 |
| 0.5 | 37.3 → 39.4 (+5.6%) | 22965 → 22760 (≈) | 3601/38416 → 2756/23551 | 19975/57752 → 16802/47124 |

Decode-heavy results:

| QPS | gen_tps | prefill_tps | TTFT p50/p90 (ms) | E2E p50/p90 (ms) |
|---|---|---|---|---|
| 2 | 79.0 → 77.5 (-1.9%) | 1240 → 1182 (-4.7%) | 101/165 → 113/181 | 1191/2502 → 1213/2579 |
| 4 | 50.1 → 50.2 (≈) | 2574 → 2584 (≈) | 138/207 → 157/244 | 1874/4108 → 1884/4073 |
| 6 | 26.9 → 29.0 (+7.8%) | 3869 → 3898 (≈) | 175/284 → 205/330 | 3369/7638 → 3143/7095 |
| 8 | 15.4 → 16.6 (+7.8%) | 4592 → 4718 (+2.7%) | 235/351 → 298/429 | 6349/14836 → 5880/13492 |
| 10 | 14.1 → 14.5 (+2.8%) | 4787 → 4896 (+2.3%) | 243/365 → 307/440 | 6758/15758 → 6619/15144 |

Findings:

  • MIXED delivers throughput and end-to-end latency wins when the baseline defers decode behind active prefills: prefill-heavy @ qps=0.5 (+5.6% gen_tps, -15.9% E2E p50, -18% E2E p90, -39% TTFT p90), decode-heavy @ qps=6–8 (+7.8% gen_tps, -6.7% to -9.1% E2E p50/p90).
  • TTFT regresses 7–27% at p50 under moderate load — the split-kernel path runs the context kernel and the decode kernel sequentially on the same CUDA stream, extending the step that completes the new request's last prefill chunk. Tail TTFT (p90) actually improves under high load on prefill-heavy because decode no longer queues up behind long prefill bursts.
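
The percentages quoted in the findings can be re-derived from the table cells. The sketch below uses a hypothetical helper, `rel_delta_pct`; the choice of which rows count as "moderate load" for the TTFT p50 range (every row except prefill-heavy qps=0.1 and 0.5) is an assumption.

```python
def rel_delta_pct(baseline: float, mixed: float) -> float:
    """Relative change of mixed vs. baseline, in percent."""
    return (mixed - baseline) / baseline * 100.0


# Prefill-heavy, qps=0.5 (values copied from the results table).
gen_tps_delta = rel_delta_pct(37.3, 39.4)
e2e_p50_delta = rel_delta_pct(19975, 16802)
e2e_p90_delta = rel_delta_pct(57752, 47124)
ttft_p90_delta = rel_delta_pct(38416, 23551)

# TTFT p50 (baseline, mixed) pairs: prefill-heavy qps 0.2-0.4,
# then decode-heavy qps 2-10.
ttft_p50_pairs = [
    (412, 442), (550, 602), (865, 1007),
    (101, 113), (138, 157), (175, 205), (235, 298), (243, 307),
]
ttft_p50_regressions = [rel_delta_pct(b, m) for b, m in ttft_p50_pairs]

print(f"gen_tps {gen_tps_delta:+.1f}%, E2E p50 {e2e_p50_delta:+.1f}%, "
      f"E2E p90 {e2e_p90_delta:+.1f}%, TTFT p90 {ttft_p90_delta:+.1f}%")
print(f"TTFT p50 regression range: {min(ttft_p50_regressions):.1f}% "
      f"to {max(ttft_p50_regressions):.1f}%")
```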

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3c054e5db

Comment thread python/tokenspeed/runtime/layers/attention/backends/trtllm.py Outdated
…ix-cached

Signed-off-by: rjzhb <rjzhb222@163.com>
@LorrinWWW LorrinWWW self-assigned this May 25, 2026
2 participants