fix: round-robin DP rank assignment for service discovery mode #181

Open
raviguptaamd wants to merge 1 commit into vllm-project:main from raviguptaamd:ravgupta/discovery-dp-rank-roundrobin

Summary

When using ZMQ service discovery with --intra-node-data-parallel-size > 1, worker addresses arrive without @rank suffixes. The router left prefill_dp_rank as None, so:

  • The prefill request lacked the X-data-parallel-rank header
  • The decode kv_transfer_params omitted remote_dp_rank

This caused the MoRI-IO WRITE mode handshake to always target DP rank 0, deadlocking all other DP ranks.

Fix

Add an atomic round-robin counter (discovery_dp_rank_counter) to VllmPDRouter that assigns DP ranks when extract_base_http_and_dp_rank returns None and DP size > 1. The assigned rank flows into:

  1. Prefill request's X-data-parallel-rank header
  2. Decode request's remote_dp_rank in kv_transfer_params

This matches the behaviour of the toy proxy, which already assigns DP ranks round-robin.
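The propagation described above can be sketched as follows. This is a simplified illustration rather than the router's actual code: `OutgoingSketch` and `apply_dp_rank` are hypothetical names, and both the headers and kv_transfer_params are modelled as plain string maps (the real kv_transfer_params is presumably a JSON object). Only the header name X-data-parallel-rank and the key remote_dp_rank come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of the two outgoing requests.
/// The real router builds HTTP requests; string maps stand in for them here.
struct OutgoingSketch {
    prefill_headers: HashMap<String, String>,
    decode_kv_transfer_params: HashMap<String, String>,
}

/// Copy the resolved DP rank into both places that need it.
/// With `None`, neither field is set, which is the pre-fix behaviour
/// that left the handshake targeting DP rank 0.
fn apply_dp_rank(dp_rank: Option<usize>) -> OutgoingSketch {
    let mut prefill_headers = HashMap::new();
    let mut decode_kv_transfer_params = HashMap::new();

    if let Some(rank) = dp_rank {
        // 1. The prefill request is pinned to a specific DP rank.
        prefill_headers.insert("X-data-parallel-rank".to_string(), rank.to_string());
        // 2. The decode side is told which prefill rank holds the KV cache.
        decode_kv_transfer_params.insert("remote_dp_rank".to_string(), rank.to_string());
    }

    OutgoingSketch {
        prefill_headers,
        decode_kv_transfer_params,
    }
}
```

The point of the sketch is that both values derive from the same `Option<usize>`, so the prefill and decode sides cannot disagree about the rank.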

Changes

  • src/routers/http/vllm_pd_router.rs:
    • Add discovery_dp_rank_counter: AtomicUsize field to VllmPDRouter
    • Initialize to 0 in both constructor paths (discovery and non-discovery)
    • In process_vllm_two_stage_request_discovered(), generate a round-robin DP rank when prefill_dp_rank.is_none() and intra_node_data_parallel_size > 1
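
A minimal sketch of the counter logic listed above, for readers who do not have the diff open. `RouterSketch` and `resolve_dp_rank` are hypothetical names, and the `Ordering::Relaxed` choice is an assumption; only the field names discovery_dp_rank_counter and intra_node_data_parallel_size are taken from this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for VllmPDRouter, reduced to the two relevant fields.
struct RouterSketch {
    intra_node_data_parallel_size: usize,
    discovery_dp_rank_counter: AtomicUsize,
}

impl RouterSketch {
    /// The counter starts at 0, as in both constructor paths of the PR.
    fn new(intra_node_data_parallel_size: usize) -> Self {
        RouterSketch {
            intra_node_data_parallel_size,
            discovery_dp_rank_counter: AtomicUsize::new(0),
        }
    }

    /// Keep a rank parsed from an `@rank` suffix if there is one.
    /// Otherwise assign ranks round-robin, but only when DP size > 1.
    fn resolve_dp_rank(&self, parsed: Option<usize>) -> Option<usize> {
        match parsed {
            Some(rank) => Some(rank),
            None if self.intra_node_data_parallel_size > 1 => {
                // fetch_add returns the previous value, so the first caller sees 0.
                // Relaxed ordering is assumed to be enough because the counter
                // does not guard any other memory.
                let n = self.discovery_dp_rank_counter.fetch_add(1, Ordering::Relaxed);
                Some(n % self.intra_node_data_parallel_size)
            }
            None => None,
        }
    }
}
```

With a DP size of 4, six consecutive calls to `resolve_dp_rank(None)` yield ranks 0, 1, 2, 3, 0, 1. An address that carries an explicit rank passes through unchanged, and a DP size of 1 still yields `None`.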

Test plan

  • 1P/1D DeepSeek-V3-5layer on MI300X (8 DP ranks per node, MoRI EP, WRITE mode): warmup and benchmark pass with correct per-rank MoRI-IO handshakes
  • Unit tests for round-robin assignment with DP size > 1
  • Verify no regression for non-discovery (static URL) mode where @rank suffixes are present

Builds on top of #157 (MoRI WRITE mode concurrent dispatch).

When using ZMQ service discovery with intra_node_data_parallel_size > 1,
worker addresses arrive without @rank suffixes. The router left
prefill_dp_rank as None, so the prefill request lacked
X-data-parallel-rank and the decode kv_transfer_params omitted
remote_dp_rank. This caused the MoRI-IO handshake to always target
DP rank 0, deadlocking all other ranks.

Add an atomic round-robin counter (discovery_dp_rank_counter) that
assigns DP ranks when extract_base_http_and_dp_rank returns None and
DP size > 1. The assigned rank flows into the prefill header and the
decode's remote_dp_rank, matching the behaviour of the toy proxy.

Tested with 1P/1D DeepSeek-V3-5layer on MI300X (8 DP ranks per node).

Co-authored-by: Cursor <cursoragent@cursor.com>