Add custom Ray node placement by zhenglw02 · Pull Request #106 · lightseekorg/TorchSpec

zhenglw02 · 2026-05-21T11:55:30Z

Summary

This PR adds custom Ray node placement support for TorchSpec training and inference workloads.

It addresses #25, where users need a way to explicitly control which Ray nodes are used for training and which nodes are used for inference. This is important for multi-node deployments where the default Ray PACK placement can assign actors to undesired nodes, causing unstable role-to-node mapping, suboptimal network locality, or incorrect multi-node inference node_rank ordering.

What Changed

- Add explicit node placement via:
  - training.training_node_ips
  - inference.inference_node_ips
- Add Ray label selector placement via:
  - training.training_node_selectors
  - inference.inference_node_selectors
- Add training.placement_strategy: custom to enable custom placement fields.
- Preserve unified placement group semantics for non-colocated training/inference placement.
  - Training and inference bundles are still reserved together in one placement group.
  - The result is then sliced into training and inference bundle ranges.
- Preserve configured node order.
  - This controls multi-node inference actor order.
  - It also controls the node_rank passed to SGLang or vLLM.
- Validate custom placement usage:
  - Custom fields require training.placement_strategy: custom.
  - A role cannot set both *_node_ips and *_node_selectors.
  - Configured node count must match the expected number of nodes.
- Improve Ray initialization behavior:
  - If RAY_ADDRESS is explicitly set, connection failure no longer silently falls back to local Ray.
- Add focused placement group tests.
- Update config and Ray documentation.
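
The three validation rules listed above can be illustrated with a minimal sketch. It is not the code from this PR: RolePlacement, expected_node_count, and validate_role are hypothetical names, and the sketch assumes the non-colocated case, where each role is checked against its own GPU settings.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional


@dataclass
class RolePlacement:
    """Hypothetical container for one role's placement settings."""
    total_gpus: int
    gpus_per_node: int
    node_ips: Optional[List[str]] = None
    node_selectors: Optional[List[dict]] = None


def expected_node_count(total_gpus: int, gpus_per_node: int) -> int:
    """Nodes needed to host total_gpus; a role with zero GPUs needs none."""
    if total_gpus == 0:
        return 0
    return ceil(total_gpus / gpus_per_node)


def validate_role(role: str, cfg: RolePlacement, strategy: str) -> list:
    """Return the role's node constraints in configured order, or raise ValueError."""
    has_ips, has_selectors = bool(cfg.node_ips), bool(cfg.node_selectors)
    if not (has_ips or has_selectors):
        return []  # no custom fields set, nothing to validate for this role
    if strategy != "custom":
        # Rule 1: custom fields require the custom strategy.
        raise ValueError(f"{role}_node_* requires training.placement_strategy: custom")
    if has_ips and has_selectors:
        # Rule 2: IPs and selectors are mutually exclusive per role.
        raise ValueError(f"set {role}_node_ips or {role}_node_selectors, not both")
    constraints = list(cfg.node_ips or cfg.node_selectors)
    expected = expected_node_count(cfg.total_gpus, cfg.gpus_per_node)
    if len(constraints) != expected:
        # Rule 3: one constraint per expected node.
        raise ValueError(f"{role}: expected {expected} nodes, got {len(constraints)}")
    return constraints
```

With 16 inference GPUs at 8 per node, two node entries are expected, so a list with one IP or three IPs would be rejected under these assumptions.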

Issue Coverage

This PR supports the requirement from #25 by allowing users to bind training and inference roles to specific Ray nodes.

Example:

training:
  placement_strategy: custom
  training_num_nodes: 1
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.3

This ensures training actors are placed on the configured training node(s), while inference actors are placed on the configured inference node(s), with node order preserved for distributed inference.
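
To make the bundle layout for this example concrete, here is a rough sketch of how one unified bundle list could be built and then sliced into training and inference ranges. The function name plan_unified_bundles is hypothetical. Pinning a bundle by requesting a small fraction of Ray's automatically registered node:<ip> resource is one common technique and is an assumption here; the PR text does not say which mechanism the implementation uses.

```python
def plan_unified_bundles(training_ips, inference_ips,
                         training_gpus_per_node, inference_gpus_per_node):
    """Build one bundle list for both roles, then slice it by role."""

    def node_bundles(ips, gpus_per_node):
        # One single-GPU bundle per GPU, grouped by node in configured order.
        # "node:<ip>" is the per-node resource Ray registers automatically;
        # requesting a tiny amount of it ties the bundle to that node.
        return [
            {"GPU": 1, f"node:{ip}": 0.001}
            for ip in ips
            for _ in range(gpus_per_node)
        ]

    training = node_bundles(training_ips, training_gpus_per_node)
    inference = node_bundles(inference_ips, inference_gpus_per_node)
    bundles = training + inference

    # Training bundles come first; inference bundles follow.
    training_range = range(0, len(training))
    inference_range = range(len(training), len(bundles))

    # The position in the configured list becomes the node rank.
    node_rank = {ip: rank for rank, ip in enumerate(inference_ips)}
    return bundles, training_range, inference_range, node_rank


bundles, train_range, infer_range, node_rank = plan_unified_bundles(
    ["10.0.0.1"], ["10.0.0.2", "10.0.0.3"], 8, 8
)
```

For the config above this yields 24 bundles: indices 0 to 7 for training on 10.0.0.1 and indices 8 to 23 for inference, with 10.0.0.2 as node rank 0 and 10.0.0.3 as node rank 1.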

Tests

pytest -q tests/test_placement_group.py
python3 -m py_compile torchspec/ray/placement_group.py torchspec/config/train_config.py torchspec/config/inference_config.py tests/test_placement_group.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6757f2256e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-21T12:01:44Z

+
+def _create_custom_unified_placement_group(args, num_training_gpus: int, num_inference_gpus: int):
+    training_constraints = _normalize_node_constraints(args, "training", required=True)
+    inference_constraints = _normalize_node_constraints(args, "inference", required=True)


Permit empty inference constraints when inference GPUs are zero

In custom unified mode, _create_custom_unified_placement_group() always calls _normalize_node_constraints(..., required=True) for inference, so configs with inference_num_gpus=0 fail with a ValueError asking for inference_node_* even though no inference bundles are needed. This regresses valid zero-inference runs that previously produced an empty inference slice and is inconsistent with _expected_node_count() already handling total_gpus == 0.


Good catch. This was a real edge case in the custom unified path.

I updated _create_custom_unified_placement_group() so node constraints are only required when the corresponding role actually needs GPUs:

- Training constraints are required only when num_training_gpus > 0.
- Inference constraints are required only when num_inference_gpus > 0.

This allows inference_num_gpus=0 to produce an empty inference bundle slice without requiring inference_node_* fields. I also added a regression test covering custom unified placement with zero inference GPUs.
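
The shape of that fix might look roughly like the following. This is a simplified stand-in, not the code from the PR: the helper names mirror the ones in the diff without the leading underscore, the flat attribute names on args are assumed, and only the "required" logic is modeled.

```python
from types import SimpleNamespace


def normalize_node_constraints(args, role, required):
    """Return the role's node constraints as a list (hypothetical reimplementation)."""
    ips = getattr(args, f"{role}_node_ips", None)
    selectors = getattr(args, f"{role}_node_selectors", None)
    if ips and selectors:
        raise ValueError(f"set {role}_node_ips or {role}_node_selectors, not both")
    constraints = list(ips or selectors or [])
    if required and not constraints:
        raise ValueError(f"{role}_node_ips or {role}_node_selectors is required")
    return constraints


def custom_unified_constraints(args, num_training_gpus, num_inference_gpus):
    # A role's constraints are required only when it actually needs GPUs.
    training = normalize_node_constraints(
        args, "training", required=num_training_gpus > 0
    )
    inference = normalize_node_constraints(
        args, "inference", required=num_inference_gpus > 0
    )
    return training, inference


# Zero inference GPUs: no inference_node_* fields are needed.
args = SimpleNamespace(training_node_ips=["10.0.0.1"])
training, inference = custom_unified_constraints(args, 8, 0)
```

Here inference comes back as an empty list, which corresponds to the empty inference bundle slice described above.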

chatgpt-codex-connector · 2026-05-21T12:01:44Z

    _wait_for_gpu_resources(_get_expected_gpu_count(args))
+    _validate_custom_strategy_usage(args)


Validate custom placement config before waiting for GPUs

create_placement_groups() waits for cluster GPUs before checking whether custom placement fields are used with placement_strategy=custom. If the config is invalid and the cluster is underprovisioned, users get a long timeout/error from _wait_for_gpu_resources instead of the immediate configuration error, which makes misconfiguration hard to diagnose and can add a 300s startup delay.


Agreed. The custom placement strategy validation is a pure config check and should run before waiting for cluster GPU resources.

I moved _validate_custom_strategy_usage(args) before _wait_for_gpu_resources(...) in create_placement_groups(), so invalid custom placement configs fail immediately instead of potentially waiting for the GPU resource timeout first.

I also updated the existing test to assert that _wait_for_gpu_resources() is not called when custom placement fields are used without placement_strategy=custom.
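
The ordering change and its test can be sketched as below. This is an illustration under assumptions, not the repository's code: the wait step is passed in as a callable so the example runs without a Ray cluster, and the flat attribute names on args (including placement_strategy) are hypothetical.

```python
from types import SimpleNamespace

CUSTOM_FIELDS = (
    "training_node_ips",
    "training_node_selectors",
    "inference_node_ips",
    "inference_node_selectors",
)


def validate_custom_strategy_usage(args):
    """Pure config check: custom fields are only valid with the custom strategy."""
    used = [name for name in CUSTOM_FIELDS if getattr(args, name, None)]
    if used and getattr(args, "placement_strategy", None) != "custom":
        raise ValueError(f"{', '.join(used)} require placement_strategy: custom")


def create_placement_groups(args, wait_for_gpus):
    # Validate first so a bad config fails immediately.
    validate_custom_strategy_usage(args)
    # Only then block on cluster resources, which may take until a timeout.
    wait_for_gpus(args)
    return "ready"


# Record calls to the wait step instead of touching a real cluster.
wait_calls = []
bad_args = SimpleNamespace(placement_strategy="PACK", training_node_ips=["10.0.0.1"])
try:
    create_placement_groups(bad_args, wait_calls.append)
except ValueError as err:
    print(f"rejected before waiting: {err}")
```

After the rejected call, wait_calls is still empty, which is the property the updated test asserts.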

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be34ab5307


chatgpt-codex-connector · 2026-05-21T12:16:11Z

+    _validate_custom_strategy_usage(args)
    _wait_for_gpu_resources(_get_expected_gpu_count(args))


Validate custom constraints before waiting for GPUs

In create_placement_groups, GPU waiting happens before custom-node constraint validation, but most custom validation (*_node_ips vs *_node_selectors, required fields, and expected node counts) is deferred to _normalize_node_constraints/_build_custom_bundles in later branches. With placement_strategy="custom" and an invalid custom config, a resource-constrained cluster can hit the full wait timeout and raise a GPU availability error before surfacing the actual config error, which significantly delays diagnosis.


chatgpt-codex-connector · 2026-05-21T12:16:11Z

+    gpus_per_node = (
+        args.training_num_gpus_per_node if role == "training" else args.inference_num_gpus_per_node
+    )


Use training topology for custom colocate node validation

In custom colocate mode, num_gpus is computed from training settings, but when only inference constraints are supplied this code validates them against inference_num_gpus_per_node. If training and inference per-node values differ, valid colocate constraints are rejected with a false node-count mismatch. This breaks the documented "set either training_node_* or inference_node_*" flow unless users manually keep both per-node knobs identical.


Agreed. _validate_custom_strategy_usage() only caught one class of invalid custom placement configs, while the rest of the custom constraint validation was still deferred until after GPU resource waiting.

I added _validate_custom_placement_constraints(args) and call it before _wait_for_gpu_resources(...). This helper validates the relevant custom placement branch without creating any Ray placement group, including mutually exclusive IP/selector fields, required role constraints, expected node counts, and per-node GPU settings.

I also added a regression test to ensure invalid custom placement constraints fail before _wait_for_gpu_resources() is called.

Good catch. In colocate mode the shared placement group size is derived from the training topology, so node-count validation must use training_num_gpus_per_node even when the user supplies only inference_node_* constraints.

I updated the custom colocate path to always validate/build bundles with the training topology while still allowing either training or inference constraints to select the nodes. I also added a regression test where only inference constraints are provided and training_num_gpus_per_node differs from inference_num_gpus_per_node.
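
A small sketch of the colocate rule described above, assuming the total GPU count is training_num_nodes times training_num_gpus_per_node. The function name colocate_node_constraints is hypothetical, only the *_node_ips fields are modeled, and preferring training constraints when both roles set them is an assumption of this sketch.

```python
from math import ceil
from types import SimpleNamespace


def colocate_node_constraints(args):
    """Pick node constraints from either role, validated against training topology."""
    constraints = (
        getattr(args, "training_node_ips", None)
        or getattr(args, "inference_node_ips", None)
        or []
    )
    # In colocate mode the shared group is sized from training settings only.
    num_gpus = args.training_num_nodes * args.training_num_gpus_per_node
    expected = ceil(num_gpus / args.training_num_gpus_per_node)
    if len(constraints) != expected:
        raise ValueError(f"expected {expected} nodes, got {len(constraints)}")
    return list(constraints)


# Only inference constraints are set, and the per-node values differ.
args = SimpleNamespace(
    training_num_nodes=2,
    training_num_gpus_per_node=4,
    inference_num_gpus_per_node=8,
    inference_node_ips=["10.0.0.2", "10.0.0.3"],
)
nodes = colocate_node_constraints(args)
```

Dividing the 8 GPUs by inference_num_gpus_per_node instead would give an expected count of 1 and reject this two-node config, which is the false mismatch the review comment describes.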

Signed-off-by: zhengliwei <liweizheng02@gmail.com>

yubofredwang

LGTM! Thanks for the great contribution

zhenglw02 force-pushed the custom-ray-node-placement branch from 6757f22 to 420fbb6 on May 21, 2026, 11:59

chatgpt-codex-connector Bot reviewed May 21, 2026


zhenglw02 force-pushed the custom-ray-node-placement branch from 420fbb6 to be34ab5 on May 21, 2026, 12:12

chatgpt-codex-connector Bot reviewed May 21, 2026


Add custom Ray node placement

d83a179

Signed-off-by: zhengliwei <liweizheng02@gmail.com>

zhenglw02 force-pushed the custom-ray-node-placement branch from be34ab5 to d83a179 on May 21, 2026, 12:24

yubofredwang approved these changes May 22, 2026


yubofredwang merged commit 8a3c3b4 into lightseekorg:main May 22, 2026
1 of 2 checks passed
