Skip to content

Add custom Ray node placement#106

Merged
yubofredwang merged 1 commit into
lightseekorg:mainfrom
zhenglw02:custom-ray-node-placement
May 22, 2026
Merged

Add custom Ray node placement#106
yubofredwang merged 1 commit into
lightseekorg:mainfrom
zhenglw02:custom-ray-node-placement

Conversation

@zhenglw02
Copy link
Copy Markdown
Contributor

@zhenglw02 zhenglw02 commented May 21, 2026

Summary

This PR adds custom Ray node placement support for TorchSpec training and inference workloads.

It addresses #25, where users need a way to explicitly control which Ray nodes are used for training and which nodes are used for inference. This is important for multi-node deployments where the default Ray PACK placement can assign actors to undesired nodes, causing unstable role-to-node mapping, suboptimal network locality, or incorrect multi-node inference node_rank ordering.

What Changed

  • Add explicit node placement via:
    • training.training_node_ips
    • inference.inference_node_ips
  • Add Ray label selector placement via:
    • training.training_node_selectors
    • inference.inference_node_selectors
  • Add training.placement_strategy: custom to enable custom placement fields.
  • Preserve unified placement group semantics for non-colocated training/inference placement.
    • Training and inference bundles are still reserved together in one placement group.
    • The result is then sliced into training and inference bundle ranges.
  • Preserve configured node order.
    • This controls multi-node inference actor order.
    • It also controls the node_rank passed to SGLang or vLLM.
  • Validate custom placement usage:
    • Custom fields require training.placement_strategy: custom.
    • A role cannot set both *_node_ips and *_node_selectors.
    • Configured node count must match the expected number of nodes.
  • Improve Ray initialization behavior:
    • If RAY_ADDRESS is explicitly set, connection failure no longer silently falls back to local Ray.
  • Add focused placement group tests.
  • Update config and Ray documentation.

Issue Coverage

This PR supports the requirement from #25 by allowing users to bind training and inference roles to specific Ray nodes.

Example:

training:
  placement_strategy: custom
  training_num_nodes: 1
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.3

This ensures training actors are placed on the configured training node(s), while inference actors are placed on the configured inference node(s), with node order preserved for distributed inference.

Tests

pytest -q tests/test_placement_group.py
python3 -m py_compile torchspec/ray/placement_group.py torchspec/config/train_config.py torchspec/config/inference_config.py tests/test_placement_group.py

@zhenglw02 zhenglw02 force-pushed the custom-ray-node-placement branch from 6757f22 to 420fbb6 Compare May 21, 2026 11:59
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6757f2256e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread torchspec/ray/placement_group.py Outdated

def _create_custom_unified_placement_group(args, num_training_gpus: int, num_inference_gpus: int):
training_constraints = _normalize_node_constraints(args, "training", required=True)
inference_constraints = _normalize_node_constraints(args, "inference", required=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Permit empty inference constraints when inference GPUs are zero

In custom unified mode, _create_custom_unified_placement_group() always calls _normalize_node_constraints(..., required=True) for inference, so configs with inference_num_gpus=0 fail with a ValueError asking for inference_node_* even though no inference bundles are needed. This regresses valid zero-inference runs that previously produced an empty inference slice and is inconsistent with _expected_node_count() already handling total_gpus == 0.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This was a real edge case in the custom unified path.

I updated _create_custom_unified_placement_group() so node constraints are only required when the corresponding role actually needs GPUs:

  • training constraints are required only when num_training_gpus > 0
  • inference constraints are required only when num_inference_gpus > 0

This allows inference_num_gpus=0 to produce an empty inference bundle slice without requiring inference_node_* fields. I also added a regression test covering custom unified
placement with zero inference GPUs.

Comment thread torchspec/ray/placement_group.py Outdated
Comment on lines +519 to +520
_wait_for_gpu_resources(_get_expected_gpu_count(args))
_validate_custom_strategy_usage(args)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate custom placement config before waiting for GPUs

create_placement_groups() waits for cluster GPUs before checking whether custom placement fields are used with placement_strategy=custom. If the config is invalid and the cluster is underprovisioned, users get a long timeout/error from _wait_for_gpu_resources instead of the immediate configuration error, which makes misconfiguration hard to diagnose and can add a 300s startup delay.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The custom placement strategy validation is a pure config check and should run before waiting for cluster GPU resources.

I moved _validate_custom_strategy_usage(args) before _wait_for_gpu_resources(...) in create_placement_groups(), so invalid custom placement configs fail immediately instead of
potentially waiting for the GPU resource timeout first.

I also updated the existing test to assert that _wait_for_gpu_resources() is not called when custom placement fields are used without placement_strategy=custom.

@zhenglw02 zhenglw02 force-pushed the custom-ray-node-placement branch from 420fbb6 to be34ab5 Compare May 21, 2026 12:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be34ab5307

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +523 to 524
_validate_custom_strategy_usage(args)
_wait_for_gpu_resources(_get_expected_gpu_count(args))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate custom constraints before waiting for GPUs

In create_placement_groups, GPU waiting happens before custom-node constraint validation, but most custom validation (*_node_ips vs *_node_selectors, required fields, and expected node counts) is deferred to _normalize_node_constraints/_build_custom_bundles in later branches. With placement_strategy="custom" and an invalid custom config, a resource-constrained cluster can hit the full wait timeout and raise a GPU availability error before surfacing the actual config error, which significantly delays diagnosis.

Useful? React with 👍 / 👎.

Comment thread torchspec/ray/placement_group.py Outdated
Comment on lines +494 to +496
gpus_per_node = (
args.training_num_gpus_per_node if role == "training" else args.inference_num_gpus_per_node
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use training topology for custom colocate node validation

In custom colocate mode, num_gpus is computed from training settings, but when only inference constraints are supplied this code validates them against inference_num_gpus_per_node. If training and inference per-node values differ, valid colocate constraints are rejected with a false node-count mismatch. This breaks the documented "set either training_node_* or inference_node_*" flow unless users manually keep both per-node knobs identical.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. _validate_custom_strategy_usage() only caught one class of invalid custom placement configs, while the rest of the custom constraint validation was still deferred until
after GPU resource waiting.

I added _validate_custom_placement_constraints(args) and call it before _wait_for_gpu_resources(...). This helper validates the relevant custom placement branch without creating
any Ray placement group, including mutually exclusive IP/selector fields, required role constraints, expected node counts, and per-node GPU settings.

I also added a regression test to ensure invalid custom placement constraints fail before _wait_for_gpu_resources() is called.

Good catch. In colocate mode the shared placement group size is derived from the training topology, so node-count validation must use training_num_gpus_per_node even when the user
supplies only inference_node_* constraints.

I updated the custom colocate path to always validate/build bundles with the training topology while still allowing either training or inference constraints to select the nodes. I
also added a regression test where only inference constraints are provided and training_num_gpus_per_node differs from inference_num_gpus_per_node.

Signed-off-by: zhengliwei <liweizheng02@gmail.com>
@zhenglw02 zhenglw02 force-pushed the custom-ray-node-placement branch from be34ab5 to d83a179 Compare May 21, 2026 12:24
Copy link
Copy Markdown
Collaborator

@yubofredwang yubofredwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the great contribution

@yubofredwang yubofredwang merged commit 8a3c3b4 into lightseekorg:main May 22, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants