Add custom Ray node placement#106
Conversation
6757f22 to
420fbb6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6757f2256e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| def _create_custom_unified_placement_group(args, num_training_gpus: int, num_inference_gpus: int): | ||
| training_constraints = _normalize_node_constraints(args, "training", required=True) | ||
| inference_constraints = _normalize_node_constraints(args, "inference", required=True) |
There was a problem hiding this comment.
Permit empty inference constraints when inference GPUs are zero
In custom unified mode, _create_custom_unified_placement_group() always calls _normalize_node_constraints(..., required=True) for inference, so configs with inference_num_gpus=0 fail with a ValueError asking for inference_node_* even though no inference bundles are needed. This regresses valid zero-inference runs that previously produced an empty inference slice and is inconsistent with _expected_node_count() already handling total_gpus == 0.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. This was a real edge case in the custom unified path.
I updated _create_custom_unified_placement_group() so node constraints are only required when the corresponding role actually needs GPUs:
- training constraints are required only when
num_training_gpus > 0 - inference constraints are required only when
num_inference_gpus > 0
This allows inference_num_gpus=0 to produce an empty inference bundle slice without requiring inference_node_* fields. I also added a regression test covering custom unified
placement with zero inference GPUs.
| _wait_for_gpu_resources(_get_expected_gpu_count(args)) | ||
| _validate_custom_strategy_usage(args) |
There was a problem hiding this comment.
Validate custom placement config before waiting for GPUs
create_placement_groups() waits for cluster GPUs before checking whether custom placement fields are used with placement_strategy=custom. If the config is invalid and the cluster is underprovisioned, users get a long timeout/error from _wait_for_gpu_resources instead of the immediate configuration error, which makes misconfiguration hard to diagnose and can add a 300s startup delay.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed. The custom placement strategy validation is a pure config check and should run before waiting for cluster GPU resources.
I moved _validate_custom_strategy_usage(args) before _wait_for_gpu_resources(...) in create_placement_groups(), so invalid custom placement configs fail immediately instead of
potentially waiting for the GPU resource timeout first.
I also updated the existing test to assert that _wait_for_gpu_resources() is not called when custom placement fields are used without placement_strategy=custom.
420fbb6 to
be34ab5
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be34ab5307
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| _validate_custom_strategy_usage(args) | ||
| _wait_for_gpu_resources(_get_expected_gpu_count(args)) |
There was a problem hiding this comment.
Validate custom constraints before waiting for GPUs
In create_placement_groups, GPU waiting happens before custom-node constraint validation, but most custom validation (*_node_ips vs *_node_selectors, required fields, and expected node counts) is deferred to _normalize_node_constraints/_build_custom_bundles in later branches. With placement_strategy="custom" and an invalid custom config, a resource-constrained cluster can hit the full wait timeout and raise a GPU availability error before surfacing the actual config error, which significantly delays diagnosis.
Useful? React with 👍 / 👎.
| gpus_per_node = ( | ||
| args.training_num_gpus_per_node if role == "training" else args.inference_num_gpus_per_node | ||
| ) |
There was a problem hiding this comment.
Use training topology for custom colocate node validation
In custom colocate mode, num_gpus is computed from training settings, but when only inference constraints are supplied this code validates them against inference_num_gpus_per_node. If training and inference per-node values differ, valid colocate constraints are rejected with a false node-count mismatch. This breaks the documented "set either training_node_* or inference_node_*" flow unless users manually keep both per-node knobs identical.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed. _validate_custom_strategy_usage() only caught one class of invalid custom placement configs, while the rest of the custom constraint validation was still deferred until
after GPU resource waiting.
I added _validate_custom_placement_constraints(args) and call it before _wait_for_gpu_resources(...). This helper validates the relevant custom placement branch without creating
any Ray placement group, including mutually exclusive IP/selector fields, required role constraints, expected node counts, and per-node GPU settings.
I also added a regression test to ensure invalid custom placement constraints fail before _wait_for_gpu_resources() is called.
Good catch. In colocate mode the shared placement group size is derived from the training topology, so node-count validation must use training_num_gpus_per_node even when the user
supplies only inference_node_* constraints.
I updated the custom colocate path to always validate/build bundles with the training topology while still allowing either training or inference constraints to select the nodes. I
also added a regression test where only inference constraints are provided and training_num_gpus_per_node differs from inference_num_gpus_per_node.
Signed-off-by: zhengliwei <liweizheng02@gmail.com>
be34ab5 to
d83a179
Compare
yubofredwang
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the great contribution
Summary
This PR adds custom Ray node placement support for TorchSpec training and inference workloads.
It addresses #25, where users need a way to explicitly control which Ray nodes are used for training and which nodes are used for inference. This is important for multi-node deployments where the default Ray
PACKplacement can assign actors to undesired nodes, causing unstable role-to-node mapping, suboptimal network locality, or incorrect multi-node inferencenode_rankordering.What Changed
training.training_node_ipsinference.inference_node_ipstraining.training_node_selectorsinference.inference_node_selectorstraining.placement_strategy: customto enable custom placement fields.node_rankpassed to SGLang or vLLM.training.placement_strategy: custom.*_node_ipsand*_node_selectors.RAY_ADDRESSis explicitly set, connection failure no longer silently falls back to local Ray.Issue Coverage
This PR supports the requirement from #25 by allowing users to bind training and inference roles to specific Ray nodes.
Example:
This ensures training actors are placed on the configured training node(s), while inference actors are placed on the configured inference node(s), with node order preserved for distributed inference.
Tests
pytest -q tests/test_placement_group.py
python3 -m py_compile torchspec/ray/placement_group.py torchspec/config/train_config.py torchspec/config/inference_config.py tests/test_placement_group.py