
[Bug] DeepSeek-V4-Pro multi_node_tep renders invalid TP=16 command for FP8 weights #497

@observerw

Problem

The current DeepSeek-V4-Pro recipe renders an invalid multi_node_tep command for a 2-node H200 deployment.

On the recipes site, selecting:

  • model: deepseek-ai/DeepSeek-V4-Pro
  • hardware: H200
  • nodes: 2
  • strategy: multi_node_tep

renders a command with:

--tensor-parallel-size 16

For the default FP8 checkpoint, this is not a valid configuration.

Why this is wrong

DeepSeek V4 Pro uses moe_intermediate_size = 3072, and the default FP8 weight format requires the partitioned input dimension to be compatible with the FP8 block size.

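With TP=16, each rank's shard of that dimension is:

  • 3072 / 16 = 192

A 192-wide shard is not compatible with the block size of the default FP8 weights, so this configuration fails. The related upstream vLLM issue is: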

There is also an upstream PR addressing cross-node TP=16 FP8 serving:

Until that lands and is available in a released vLLM version, the recipe should not render TP=16 as the default multi-node TEP configuration for this model.
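The arithmetic above can be checked with a small sketch. It assumes an FP8 block size of 128, which is not stated in this issue and should be confirmed against the checkpoint's quantization config. The helper name `partition_is_block_aligned` is hypothetical and is not a vLLM function.

```python
# Sketch only: BLOCK_SIZE = 128 is an assumption, not taken from this issue.
BLOCK_SIZE = 128
MOE_INTERMEDIATE_SIZE = 3072  # value quoted in the issue


def partition_is_block_aligned(dim: int, tp_size: int, block_size: int = BLOCK_SIZE) -> bool:
    """Return True if `dim` splits evenly across `tp_size` ranks and
    each rank's shard is a whole number of FP8 blocks."""
    if dim % tp_size != 0:
        return False
    shard = dim // tp_size
    return shard % block_size == 0


if __name__ == "__main__":
    for tp in (8, 16):
        shard = MOE_INTERMEDIATE_SIZE // tp
        ok = partition_is_block_aligned(MOE_INTERMEDIATE_SIZE, tp)
        print(f"TP={tp}: shard={shard}, block-aligned={ok}")
```

Under that assumption, TP=8 gives a shard of 384 (three full blocks), while TP=16 gives 192 (one and a half blocks), which matches the failure described above.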

Suggested fix

For DeepSeek-V4-Pro, the multi_node_tep recipe should be overridden to use:

--tensor-parallel-size 8

for the current default FP8 weights, instead of deriving TP from total GPUs across nodes.
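One way to express this override is sketched below. The table `TP_OVERRIDES`, the function `resolve_tp_size`, and the assumption of 8 GPUs per H200 node are all hypothetical illustrations; the recipes repository's actual schema and rendering code may look quite different.

```python
# Hypothetical sketch of a per-model override; not the recipes repo's real schema.
from typing import Dict, Tuple

# (model, strategy) -> tensor-parallel size to render instead of the derived default.
TP_OVERRIDES: Dict[Tuple[str, str], int] = {
    ("deepseek-ai/DeepSeek-V4-Pro", "multi_node_tep"): 8,
}


def resolve_tp_size(model: str, strategy: str, nodes: int, gpus_per_node: int) -> int:
    """Return the value to render for --tensor-parallel-size.

    The default mirrors the behavior described in the issue (TP derived from
    the total GPU count across nodes); a matching override takes precedence.
    """
    derived_tp = nodes * gpus_per_node
    return TP_OVERRIDES.get((model, strategy), derived_tp)


if __name__ == "__main__":
    # Assumes 8 GPUs per node for the 2-node H200 case from the issue.
    tp = resolve_tp_size("deepseek-ai/DeepSeek-V4-Pro", "multi_node_tep", nodes=2, gpus_per_node=8)
    print(f"--tensor-parallel-size {tp}")
```

Keeping the override keyed on both model and strategy leaves other models and strategies on the derived default, so the change stays scoped to the configuration reported here.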
