fix(deepseek-v4): close MTP acceptance gap#207

Open
Xiangyi1996 wants to merge 13 commits into
lightseekorg:mainfrom
Xiangyi1996:xiangyi/v4-mtp-gap-rebased

Conversation

@Xiangyi1996

Summary

This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.

Root cause:

  • The remaining gap was not from compressed KV / CSA indexer cache.
  • It came from MTP draft decode using stale/incorrect V4 paged KV cache metadata.
  • V4 has multiple cache tables; the SWA compact table could be observed with the wrong request context during draft/target-verify transitions.

Fix:

  • Make target-verify/draft-extend forward modes explicit.
  • Refresh paged cache group metadata for MTP draft/target paths.
  • Carry V4 SWA/compressed KV/CSA metadata consistently through draft decode.
  • Preserve target-verify logits and hidden states for speculative decoding.
  • Add tests for V4 SWA slot sanitization / paged metadata behavior.
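
To illustrate the first two Fix items, here is a minimal sketch, not the code in this PR. `ForwardMode`, `PagedCacheGroupMeta`, and `refresh_group_metadata` are hypothetical names, and the group name `"swa"` is only a label. The idea shown is that every cache group's block table is rebuilt from the current request IDs whenever the mode is target-verify or draft-extend, so no group keeps a table that belongs to an earlier request context.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List


class ForwardMode(Enum):
    # Hypothetical modes; the enum in the repository may differ.
    DECODE = auto()
    TARGET_VERIFY = auto()
    DRAFT_EXTEND = auto()

    def needs_metadata_refresh(self) -> bool:
        return self in (ForwardMode.TARGET_VERIFY, ForwardMode.DRAFT_EXTEND)


@dataclass
class PagedCacheGroupMeta:
    # One instance per cache group (for example SWA or compressed KV).
    name: str
    request_ids: List[int] = field(default_factory=list)
    block_tables: List[List[int]] = field(default_factory=list)


def refresh_group_metadata(
    mode: ForwardMode,
    groups: List[PagedCacheGroupMeta],
    request_ids: List[int],
    page_tables: Dict[str, Dict[int, List[int]]],
) -> List[PagedCacheGroupMeta]:
    """Rebuild each group's block table for the current batch.

    page_tables maps group name -> {request_id: [page ids]}.
    """
    if not mode.needs_metadata_refresh():
        return groups
    for group in groups:
        per_request = page_tables[group.name]
        group.request_ids = list(request_ids)
        group.block_tables = [list(per_request[rid]) for rid in request_ids]
    return groups


# A group that still holds metadata from an unrelated request (id 7).
groups = [PagedCacheGroupMeta("swa", request_ids=[7], block_tables=[[99]])]
tables = {"swa": {1: [0, 1], 2: [2]}}
refresh_group_metadata(ForwardMode.TARGET_VERIFY, groups, [1, 2], tables)
```

After the call, the group describes requests 1 and 2 only; in the hypothetical `DECODE` mode the helper leaves the metadata untouched.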

Validation

  • pre-commit run --all-files: passed
  • py_compile on touched runtime/test files: passed
  • Acceptance rerun after rebase:
    • Decoded Tok/Iter = 2.8447
    • Spec Accept Rate = 0.6485
    • Decoded Tok/Iter is within the TRTLLM range of 2.8-2.9
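
As a sanity check, the two reported numbers agree with each other if the accept rate is read as accepted draft tokens divided by all decoded tokens, with exactly one non-draft token emitted per iteration. That definition is an assumption inferred from the numbers, not something stated in this PR, and the helper name is hypothetical.

```python
def accept_rate_from_tok_per_iter(tok_per_iter: float) -> float:
    # Assumed definition: each iteration emits 1 target token plus the
    # accepted draft tokens, so accepted / total = (t - 1) / t.
    return (tok_per_iter - 1.0) / tok_per_iter


tok_per_iter = 2.8447  # Decoded Tok/Iter reported above
rate = accept_rate_from_tok_per_iter(tok_per_iter)
in_trtllm_range = 2.8 <= tok_per_iter <= 2.9

print(round(rate, 4), in_trtllm_range)  # prints: 0.6485 True
```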

@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c285d95 to 63e22c5 on May 22, 2026 05:34
@Xiangyi1996 Xiangyi1996 marked this pull request as ready for review May 22, 2026 05:37
@Xiangyi1996 Xiangyi1996 requested a review from a team as a code owner May 22, 2026 05:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

draft_cache_cell_size = (
    draft_attn_config.cache_cell_size()
    * draft_model_config.num_attention_layers
)

P1: Use V4 grouped draft cache size in page-budget profiling

When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(). That is the generic MLA estimate, which does not include the V4 grouped caches (SWA, compressed, indexer, state). As a result, profile_deepseek_v4_max_num_pages overestimates the KV pages available for target plus draft, so a deployment can admit a token/page budget that exceeds real GPU memory and fail with OOM under load. This should use the V4-specific draft size (draft_profile_cache_cell_size, or layout-based sizing) instead.
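
The effect described in this comment can be shown with a small sketch. All sizes below are made-up numbers, and `max_num_pages` is a hypothetical helper, not `profile_deepseek_v4_max_num_pages`. It only demonstrates that leaving the grouped caches out of the draft per-token size yields a page budget larger than memory can actually hold.

```python
def max_num_pages(free_bytes: int, page_size: int,
                  target_cell_bytes: int, draft_cell_bytes: int) -> int:
    # Bytes needed to hold one page of tokens in both target and draft caches.
    bytes_per_page = page_size * (target_cell_bytes + draft_cell_bytes)
    return free_bytes // bytes_per_page


FREE_BYTES = 8 * 1024**3  # hypothetical free GPU memory
PAGE_SIZE = 64            # hypothetical tokens per page
TARGET_CELL = 40_000      # hypothetical bytes per token, target model
DRAFT_GENERIC = 1_000     # hypothetical generic estimate, grouped caches omitted
DRAFT_GROUPED = 2_500     # hypothetical size including the grouped caches

optimistic = max_num_pages(FREE_BYTES, PAGE_SIZE, TARGET_CELL, DRAFT_GENERIC)
realistic = max_num_pages(FREE_BYTES, PAGE_SIZE, TARGET_CELL, DRAFT_GROUPED)

# Memory the optimistic budget would need beyond what is actually free.
real_bytes_per_page = PAGE_SIZE * (TARGET_CELL + DRAFT_GROUPED)
overcommit_bytes = optimistic * real_bytes_per_page - FREE_BYTES
```

With any inputs where the grouped size exceeds the generic one, `optimistic` is at least as large as `realistic`, and a positive `overcommit_bytes` is the portion that would surface as OOM once the pages fill up.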

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c01d2a08dd

Comment thread python/tokenspeed/runtime/execution/forward_batch_info.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f8108eed5

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2494dc30ac

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated
Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb37d86925

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated
@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from f77d47a to e4223a0 on May 22, 2026 07:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4223a006a

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated
@lightseek-bot
Contributor

@Xiangyi1996 please fix the conflicts, thanks!

@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from e4223a0 to c08927d on May 22, 2026 13:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c08927dc81

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py
yechank-nvidia and others added 8 commits May 24, 2026 18:45
Co-authored-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
…ken_id

R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules
(stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa
allocates an NVFP4-packed buffer because the fused prefix is not in
exclude_modules. Detect component-level exclusion and pass through
quant_config=None to fall back to bf16.

Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN
to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py).
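
A minimal sketch of the component-level exclusion check this commit message describes. The function name, the `layer_prefix` naming scheme, the exact-match rule, and the choice to require all components to be excluded are assumptions made for illustration; only `exclude_modules`, `q_a_proj`, `kv_a_proj_with_mqa`, and the `quant_config=None` fallback come from the message.

```python
from typing import Iterable, Optional, Sequence


def resolve_fused_quant_config(
    quant_config: Optional[dict],
    exclude_modules: Iterable[str],
    layer_prefix: str,
    components: Sequence[str] = ("q_a_proj", "kv_a_proj_with_mqa"),
) -> Optional[dict]:
    """Return None (bf16 fallback) when the fused layer's parts are excluded."""
    if quant_config is None:
        return None
    excluded = set(exclude_modules)
    # The fused prefix itself is not listed, so check each component name.
    component_names = [f"{layer_prefix}.{name}" for name in components]
    if all(name in excluded for name in component_names):
        return None
    return quant_config


# Hypothetical layer names and config, for illustration only.
cfg = {"quant_algo": "NVFP4"}
excluded = [
    "model.layers.0.self_attn.q_a_proj",
    "model.layers.0.self_attn.kv_a_proj_with_mqa",
]
layer0 = resolve_fused_quant_config(cfg, excluded, "model.layers.0.self_attn")
layer1 = resolve_fused_quant_config(cfg, excluded, "model.layers.1.self_attn")
```

Here `layer0` falls back to `None` because both of its components are excluded, while `layer1` keeps the quantized config.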

Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
@Xiangyi1996 Xiangyi1996 force-pushed the xiangyi/v4-mtp-gap-rebased branch from c08927d to 5abff4a on May 25, 2026 02:09

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5abff4a8cd

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated
Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fa924b9da

Comment thread python/tokenspeed/runtime/models/deepseek_v4.py Outdated
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
        return False
    if disaggregation_mode == "prefill":
        return False
    if speculative_algorithm is not None and paged_cache_groups:
Contributor

Why don’t we support overlap for DeepSeek V4 MTP?

Author

The current PR is focused on closing the V4 MTP acceptance gap and fixing the paged-cache metadata propagation for target-verify / draft-extend paths. Overlap changes the scheduling and metadata lifetime further, so I would prefer not to enable it in the same PR before we have separate validation.

We may first land the correctness/acceptance fix, then validate V4 MTP + overlap in a follow-up perf/serving PR.
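
The guard under discussion can be pictured as a predicate like the one below. This is a sketch built from the quoted lines, not the function in the PR: the name `overlap_allowed`, the `enable_overlap` flag, and the assumption that the final branch returns `False` are all hypothetical.

```python
from typing import Optional, Sequence


def overlap_allowed(
    enable_overlap: bool,
    disaggregation_mode: Optional[str],
    speculative_algorithm: Optional[str],
    paged_cache_groups: Sequence[str],
) -> bool:
    if not enable_overlap:
        return False
    if disaggregation_mode == "prefill":
        return False
    # Assumed: speculative decoding combined with grouped paged caches
    # (the V4 MTP case) opts out of overlap until validated separately.
    if speculative_algorithm is not None and paged_cache_groups:
        return False
    return True


v4_mtp = overlap_allowed(True, None, "MTP", ["swa"])
v4_no_spec = overlap_allowed(True, None, None, ["swa"])
```

Keeping the opt-out in one predicate would make the follow-up small: once V4 MTP with overlap is validated, only the last branch needs to change.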
