[Bug] DeepSeek-V4-Pro multi_node_tep renders invalid TP=16 command for FP8 weights #497

### Problem

The current `DeepSeek-V4-Pro` recipe renders an invalid `multi_node_tep` command for a 2-node H200 deployment.

On the recipes site, selecting:

- model: `deepseek-ai/DeepSeek-V4-Pro`
- hardware: `H200`
- nodes: `2`
- strategy: `multi_node_tep`

renders a command with:

```bash
--tensor-parallel-size 16
```

For the default FP8 checkpoint, this is not a valid configuration.

### Why this is wrong

DeepSeek V4 Pro uses `moe_intermediate_size = 3072`, and the default FP8 weight format requires the partitioned input dimension to be compatible with the FP8 block size.

With `TP=16`:

- `3072 / 16 = 192`

The resulting per-rank size of 192 is not compatible with the FP8 block size, so this configuration fails for the default FP8 weights. The related upstream vLLM issue is:

- https://github.com/vllm-project/vllm/issues/40955

There is also an upstream PR addressing cross-node TP=16 FP8 serving:

- https://github.com/vllm-project/vllm/pull/41312

Until that lands and is available in a released vLLM version, the recipe should not render `TP=16` as the default multi-node TEP configuration for this model.
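
The arithmetic above can be checked with a short sketch. It assumes an FP8 block size of 128, as used by DeepSeek's earlier block-quantized FP8 checkpoints; that value is not stated in this issue, and the helper name `partition_is_block_aligned` is hypothetical, not a vLLM function.

```python
# Assumption: the FP8 weights are block-quantized with a block size of 128.
# This value is not taken from the issue; adjust it to match the checkpoint config.
FP8_BLOCK_SIZE = 128

# From the issue text.
MOE_INTERMEDIATE_SIZE = 3072


def partition_is_block_aligned(dim: int, tp: int, block: int = FP8_BLOCK_SIZE) -> bool:
    """Return True if `dim` splits evenly across `tp` ranks and each
    per-rank shard is a whole number of FP8 blocks."""
    if dim % tp != 0:
        return False
    return (dim // tp) % block == 0


for tp in (8, 16):
    shard = MOE_INTERMEDIATE_SIZE // tp
    ok = partition_is_block_aligned(MOE_INTERMEDIATE_SIZE, tp)
    print(f"TP={tp}: per-rank size {shard}, block-aligned: {ok}")
```

Under that assumption, `TP=8` gives a per-rank size of 384 (three full blocks), while `TP=16` gives 192, which leaves a partial block.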

### Suggested fix

For `DeepSeek-V4-Pro`, the `multi_node_tep` recipe should be overridden to use:

```bash
--tensor-parallel-size 8
```

for the current default FP8 weights, instead of deriving TP from total GPUs across nodes.
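
One possible shape for such an override is sketched below. The table `TP_OVERRIDES` and the function `resolve_tp` are hypothetical names for illustration only; they do not reflect the recipes repository's actual schema or code.

```python
# Hypothetical per-model override table, keyed by (model, strategy).
# Not the recipes repo's real data structure.
TP_OVERRIDES = {
    ("deepseek-ai/DeepSeek-V4-Pro", "multi_node_tep"): 8,
}


def resolve_tp(model: str, strategy: str, nodes: int, gpus_per_node: int) -> int:
    """Use an explicit override when one exists; otherwise fall back to
    deriving TP from the total GPU count across nodes."""
    derived = nodes * gpus_per_node
    return TP_OVERRIDES.get((model, strategy), derived)


tp = resolve_tp("deepseek-ai/DeepSeek-V4-Pro", "multi_node_tep", nodes=2, gpus_per_node=8)
print(f"--tensor-parallel-size {tp}")
```

Keeping the override keyed on both model and strategy leaves the derived value in place for every other recipe, so the entry can be removed once the upstream fix ships in a released vLLM version.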

