Problem
The current DeepSeek-V4-Pro recipe renders an invalid multi_node_tep command for a 2-node H200 deployment.
On the recipes site, selecting:
- model: deepseek-ai/DeepSeek-V4-Pro
- hardware: H200
- nodes: 2
- strategy: multi_node_tep
renders a command with:
--tensor-parallel-size 16
For the default FP8 checkpoint, this is not a valid configuration.
Why this is wrong
DeepSeek V4 Pro uses moe_intermediate_size = 3072, and the default FP8 weight format requires the partitioned input dimension to be compatible with the FP8 block size.
With TP=16:
3072 / 16 = 192
This fails for the default FP8 weights. The related upstream vLLM issue is:
There is also an upstream PR addressing cross-node TP=16 FP8 serving:
Until that lands and is available in a released vLLM version, the recipe should not render TP=16 as the default multi-node TEP configuration for this model.
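The arithmetic above can be checked with a minimal sketch. It assumes the FP8 block size is 128 along the partitioned dimension, which this issue does not state and which should be confirmed against the checkpoint's quantization config. The helper names are hypothetical and are not part of vLLM or the recipes code.

```python
# Assumption: FP8 block size of 128 along the partitioned dimension.
# This value is not given in the issue; check the checkpoint's quantization config.
ASSUMED_FP8_BLOCK_SIZE = 128

MOE_INTERMEDIATE_SIZE = 3072  # from the issue text


def partition_size(intermediate_size: int, tp_size: int) -> int:
    """Per-rank slice of the MoE intermediate dimension under tensor parallelism."""
    if intermediate_size % tp_size != 0:
        raise ValueError(
            f"{intermediate_size} is not evenly divisible by TP={tp_size}"
        )
    return intermediate_size // tp_size


def is_fp8_block_compatible(
    intermediate_size: int,
    tp_size: int,
    block_size: int = ASSUMED_FP8_BLOCK_SIZE,
) -> bool:
    """True if the per-rank slice is a whole number of FP8 blocks."""
    return partition_size(intermediate_size, tp_size) % block_size == 0


if __name__ == "__main__":
    for tp in (8, 16):
        size = partition_size(MOE_INTERMEDIATE_SIZE, tp)
        ok = is_fp8_block_compatible(MOE_INTERMEDIATE_SIZE, tp)
        print(f"TP={tp}: partition={size}, block-compatible={ok}")
```

Under the assumed block size, the TP=16 partition of 192 is not a whole number of blocks, which is consistent with the failure described here.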
Suggested fix
For DeepSeek-V4-Pro, the multi_node_tep recipe should be overridden to use:
for the current default FP8 weights, instead of deriving TP from total GPUs across nodes.
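One way a recipe generator could avoid deriving TP purely from the total GPU count is sketched below. This is a hypothetical illustration, not the recipes site's actual logic: the function name, the power-of-two restriction on TP, and the FP8 block size of 128 are all assumptions, and the value it returns is not necessarily the override this issue proposes.

```python
def largest_compatible_tp(
    total_gpus: int,
    intermediate_size: int,
    block_size: int = 128,  # assumed FP8 block size, not stated in the issue
) -> int:
    """Return the largest power-of-two TP size <= total_gpus whose per-rank
    slice of intermediate_size is a whole number of FP8 blocks.

    Hypothetical helper: restricting TP to powers of two is an assumption.
    """
    if total_gpus < 1:
        raise ValueError("total_gpus must be at least 1")

    # Start from the largest power of two that fits in the GPU count.
    tp = 1
    while tp * 2 <= total_gpus:
        tp *= 2

    # Walk down until the partition is evenly divisible and block-aligned.
    while tp >= 1:
        if (
            intermediate_size % tp == 0
            and (intermediate_size // tp) % block_size == 0
        ):
            return tp
        tp //= 2

    raise ValueError("no compatible tensor-parallel size found")


if __name__ == "__main__":
    # 2 nodes of 8 GPUs each, moe_intermediate_size = 3072 from the issue.
    print(largest_compatible_tp(total_gpus=16, intermediate_size=3072))
```

A per-model override table would be a simpler alternative to computing the value; the sketch only illustrates the constraint that a derived TP would need to satisfy.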