fix(deepseek-v4): close MTP acceptance gap by Xiangyi1996 · Pull Request #207 · lightseekorg/tokenspeed

Xiangyi1996 · 2026-05-21T11:31:23Z

Summary

This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.

Root cause:

- The remaining gap was not from the compressed KV / CSA indexer cache.
- It came from MTP draft decode using stale/incorrect V4 paged KV cache metadata.
- V4 has multiple cache tables; the SWA compact table could be observed with the wrong request context during draft/target-verify transitions.

Fix:

- Make target-verify/draft-extend forward modes explicit.
- Refresh paged cache group metadata for MTP draft/target paths.
- Carry V4 SWA/compressed KV/CSA metadata consistently through draft decode.
- Preserve the correct target-verify logits and hidden states for speculative decoding.
- Add tests for V4 SWA slot sanitization / paged metadata behavior.
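
A minimal sketch of the idea behind the first two items, under stated assumptions: the names `ForwardMode`, `PagedCacheGroupMeta`, and `prepare_step` are hypothetical and are not taken from the TokenSpeed code. It only illustrates why rebuilding every cache group's metadata on a speculative step keeps one table from being read with another step's request context.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ForwardMode(Enum):
    # Hypothetical mode names; the PR's real enum may differ.
    DECODE = auto()
    TARGET_VERIFY = auto()
    DRAFT_EXTEND = auto()


@dataclass
class PagedCacheGroupMeta:
    # One instance per cache table (for example SWA, compressed KV, CSA indexer).
    request_ids: list = field(default_factory=list)
    seq_lens: list = field(default_factory=list)


def prepare_step(mode, groups, request_ids, seq_lens):
    """Return per-group metadata for this step.

    Assumption: speculative modes rebuild every group from the current
    batch, so no group keeps the previous step's request context.
    """
    if mode in (ForwardMode.TARGET_VERIFY, ForwardMode.DRAFT_EXTEND):
        return {
            name: PagedCacheGroupMeta(list(request_ids), list(seq_lens))
            for name in groups
        }
    return groups


# Metadata left over from an earlier step with a different request.
stale = {
    "swa": PagedCacheGroupMeta([1], [10]),
    "compressed": PagedCacheGroupMeta([1], [10]),
}
fresh = prepare_step(ForwardMode.TARGET_VERIFY, stale, [2, 3], [5, 7])
```

After the call, every group in `fresh` describes requests 2 and 3, while `stale` is left untouched for comparison.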

Validation

pre-commit run --all-files: passed
py_compile on touched runtime/test files: passed
Acceptance rerun after rebase:
- Decoded Tok/Iter = 2.8447
- Spec Accept Rate = 0.6485
- Decoded Tok/Iter is within the TRTLLM range of 2.8-2.9
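
As a quick sanity check on the two reported numbers, the sketch below assumes (this is not stated in the PR) that each iteration emits exactly one target token plus the accepted draft tokens, and that the accept rate is the share of decoded tokens that came from accepted drafts. Under that assumption the two figures agree with each other.

```python
# Values copied from the acceptance rerun above.
decoded_tok_per_iter = 2.8447
spec_accept_rate = 0.6485

# Assumption: one token per iteration comes from the target model itself,
# and the rest are accepted draft tokens.
accepted_drafts_per_iter = decoded_tok_per_iter - 1.0
implied_rate = accepted_drafts_per_iter / decoded_tok_per_iter

# The reported value sits inside the quoted TRTLLM range.
in_trtllm_range = 2.8 <= decoded_tok_per_iter <= 2.9

print(f"implied accept rate: {implied_rate:.4f}")
print(f"within 2.8-2.9: {in_trtllm_range}")
```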

chatgpt-codex-connector

💡 Codex Review

tokenspeed/python/tokenspeed/runtime/layers/attention/registry.py

Lines 433 to 436 in 63e22c5

    
    draft_cache_cell_size = (
        draft_attn_config.cache_cell_size()
        * draft_model_config.num_attention_layers
    )

Use V4 grouped draft cache size in page-budget profiling

When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(). That is the generic MLA estimate, which does not include the V4 grouped caches (SWA/compressed/indexer/state). As a result, profile_deepseek_v4_max_num_pages overestimates the KV pages available for target+draft, so deployments can admit a token/page budget that exceeds real GPU memory and fail with OOM under load. This should use the V4-specific draft size (draft_profile_cache_cell_size / layout-based sizing) instead.
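
The effect described in this review comment can be illustrated with a small sketch. The function `max_num_pages` and all byte counts below are made up for illustration and do not come from the TokenSpeed profiler; the point is only that an undersized draft cell makes the page budget larger than memory allows.

```python
def max_num_pages(free_bytes, page_size, target_cell_bytes, draft_cell_bytes):
    """Pages that fit when each token slot stores target and draft cache cells."""
    bytes_per_page = page_size * (target_cell_bytes + draft_cell_bytes)
    return free_bytes // bytes_per_page


# Illustrative numbers only (assumptions, not measurements).
free_bytes = 40 * 1024**3
page_size = 64
target_cell = 40_000
generic_draft_cell = 1_200   # generic estimate that ignores the grouped caches
grouped_draft_cell = 2_000   # estimate that includes the grouped caches

optimistic = max_num_pages(free_bytes, page_size, target_cell, generic_draft_cell)
realistic = max_num_pages(free_bytes, page_size, target_cell, grouped_draft_cell)

# Memory actually needed if the optimistic page count is admitted.
needed = optimistic * page_size * (target_cell + grouped_draft_cell)
print(optimistic, realistic, needed > free_bytes)
```

With these numbers the optimistic budget admits more pages than the realistic one, and serving that many pages would need more bytes than are free, which is the OOM scenario the review describes.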

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c01d2a08dd

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f8108eed5

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2494dc30ac

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb37d86925

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4223a006a

lightseek-bot · 2026-05-22T08:59:13Z

@Xiangyi1996 please fix the conflicts thanks!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c08927dc81

…ken_id

R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules (stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa allocates an NVFP4-packed buffer because the fused prefix is not in exclude_modules. Detect component-level exclusion and pass through quant_config=None to fall back to bf16.

Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py).

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5abff4a8cd

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fa924b9da

lightseek-bot · 2026-05-25T09:03:04Z

+        return False
+    if disaggregation_mode == "prefill":
+        return False
+    if speculative_algorithm is not None and paged_cache_groups:


Why don’t we support overlap for DeepSeek V4 MTP?

Xiangyi1996 · 2026-05-25

The current PR is focused on closing the V4 MTP acceptance gap and fixing the paged-cache metadata propagation for target-verify / draft-extend paths. Overlap further changes the scheduling and metadata lifetime, so I would prefer not to enable it in this PR before it has separate validation.

We can land the correctness/acceptance fix first, then validate V4 MTP + overlap in a follow-up perf/serving PR.
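
For readers following the thread, here is a hedged sketch of what such a gate could look like. It is reconstructed only from the two conditions visible in the quoted diff; the function name `overlap_supported`, the `return False` under the second condition, and the final `return True` are assumptions, and the earlier condition that is cut off in the diff is deliberately not reproduced.

```python
def overlap_supported(disaggregation_mode, speculative_algorithm, paged_cache_groups):
    """Hypothetical gate: decide whether overlap scheduling may be enabled."""
    if disaggregation_mode == "prefill":
        return False
    if speculative_algorithm is not None and paged_cache_groups:
        # Assumed outcome, based on the reply above: speculative decoding
        # over paged cache groups (the V4 MTP case) stays without overlap
        # until it has been validated separately.
        return False
    # Assumed default for every other configuration.
    return True


# A V4-style MTP setup with grouped caches stays on the non-overlap path.
v4_mtp = overlap_supported(None, "MTP", ["swa", "compressed"])
# A setup without speculative decoding is not affected by this check.
plain = overlap_supported(None, None, ["swa", "compressed"])
```

The argument values `"MTP"` and the group names are placeholders chosen for the example.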

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c285d95 to 63e22c5 May 22, 2026 05:34

Xiangyi1996 marked this pull request as ready for review May 22, 2026 05:37

Xiangyi1996 requested a review from a team as a code owner May 22, 2026 05:37

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/forward_batch_info.py Outdated

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from f77d47a to e4223a0 May 22, 2026 07:20

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated

lightseek-bot requested review from SimonCqk and dongjiyingdjy May 22, 2026 08:40

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from e4223a0 to c08927d May 22, 2026 13:48

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py

lightseek-bot mentioned this pull request May 22, 2026

[Draft]feat(deepseek-v4): support MTP speculative decoding #123

Closed

yechank-nvidia and others added 8 commits May 24, 2026 18:45

feat(deepseek-v4): support mtp speculative decoding

2c29647

Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

fix(deepseek-v4): refresh mtp draft cache metadata

97c618c

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(deepseek-v4): profile grouped draft cache size

edbad86

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): gate target-verify mode to v4

3103584

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): preserve non-v4 draft backend mode

087f9c2

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): align idle and replay metadata modes

b3e1e36

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(spec): preserve draft seq-lens alias

5abff4a

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c08927d to 5abff4a May 25, 2026 02:09

chatgpt-codex-connector Bot reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated

Xiangyi1996 added 3 commits May 24, 2026 19:21

fix(deepseek-v4): restore speculative metadata token count

c4b710b

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(deepseek-v4): use cache metadata directly

5fa924b

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(deepseek-v4): prefer current draft metadata shape

be183bf

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

chatgpt-codex-connector Bot reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated

Xiangyi1996 added 2 commits May 24, 2026 19:55

fix(deepseek-v4): remove undefined decode guard

28751f1

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

fix(deepseek-v4): treat spec verify indexer as decode

31a7575

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

lightseek-bot reviewed May 25, 2026

3 participants