Merged
45 changes: 45 additions & 0 deletions configs/README.md
@@ -37,6 +37,51 @@ python -m torchspec.train_entry --config configs/sglang_qwen3_8b.yaml training.l
| `inference.sglang` | `tp_size`, `mem_fraction_static`, `extra_args` | SGLang engine settings (nested under inference) |
| `mooncake` | `protocol`, `device_name` | Mooncake transfer engine settings |

## Custom Ray placement

Use `training.placement_strategy: custom` when training and inference must run
on explicitly chosen Ray nodes. This is useful when the default `PACK` placement
would put actors on nodes with the wrong network locality, cache state, or GPU
partition.

IP-based placement uses Ray's built-in `node:<ip>` resource and does not require
custom Ray labels:

```yaml
training:
  placement_strategy: custom
  training_num_nodes: 1
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.3
```
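As a rough illustration of what IP pinning can look like at the Ray level, the sketch below builds one GPU bundle per device and attaches a tiny amount of the node's `node:<ip>` resource to each. The helper name `build_ip_pinned_bundles` and the `0.001` pin amount are assumptions made for this example; the diff does not show how TorchSpec constructs its bundles.

```python
def build_ip_pinned_bundles(node_ips, gpus_per_node, pin_amount=0.001):
    """Return one bundle per GPU, each tied to a node through `node:<ip>`.

    Hypothetical helper: Ray exposes a `node:<ip>` resource on every node,
    so requesting a small fraction of it restricts a bundle to that node.
    """
    bundles = []
    for ip in node_ips:  # configured order is kept
        for _ in range(gpus_per_node):
            bundles.append({"GPU": 1, f"node:{ip}": pin_amount})
    return bundles


# Mirrors the YAML above: 1 training node and 2 inference nodes, 8 GPUs each.
training_bundles = build_ip_pinned_bundles(["10.0.0.1"], 8)
inference_bundles = build_ip_pinned_bundles(["10.0.0.2", "10.0.0.3"], 8)
all_bundles = training_bundles + inference_bundles
```

A list like `all_bundles` could then be passed to `ray.util.placement_group(all_bundles, strategy="PACK")` to reserve both roles in one placement group.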

Ray label selectors are also supported when your Ray version supports placement
group `bundle_label_selector`:

```yaml
training:
  placement_strategy: custom
  training_node_selectors:
    - {"torchspec/node": "trainer-0"}

inference:
  inference_node_selectors:
    - {"torchspec/node": "infer-0"}
    - {"torchspec/node": "infer-1"}
```

For each role, set either `*_node_ips` or `*_node_selectors`, not both. The
configured node order is preserved; for multi-node inference it determines the
engine actor order and therefore the `node_rank` passed to SGLang or vLLM.
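The "either IPs or selectors, not both" rule can be checked before any Ray call is made. The following is a minimal sketch of such a check, assuming the role config has been loaded into a plain dict; `resolve_node_targets` is a hypothetical name, not a TorchSpec function.

```python
def resolve_node_targets(role, cfg):
    """Return ("ip", [...]) or ("selector", [...]) for one role.

    Hypothetical validator: `role` is "training" or "inference" and `cfg`
    is that role's config section as a dict.
    """
    ips = cfg.get(f"{role}_node_ips")
    selectors = cfg.get(f"{role}_node_selectors")
    if ips and selectors:
        raise ValueError(
            f"set either {role}_node_ips or {role}_node_selectors, not both"
        )
    if ips:
        return "ip", list(ips)  # list() keeps the configured order
    if selectors:
        return "selector", [dict(s) for s in selectors]
    raise ValueError(
        f"custom placement needs {role}_node_ips or {role}_node_selectors"
    )


kind, targets = resolve_node_targets(
    "inference", {"inference_node_ips": ["10.0.0.2", "10.0.0.3"]}
)
```

Returning the targets as an ordered list keeps the configured node order available for whatever later assigns `node_rank`.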

## SGLang engine configuration

SGLang settings live under `inference.sglang` and are split into two tiers:
7 changes: 5 additions & 2 deletions docs/code_architecture.md
@@ -12,7 +12,7 @@ torchspec/
├── ray/ # Ray infrastructure (shared across all packages)
│ ├── ray_actor.py # RayActor base class (GPU setup, network utils)
│ ├── train_group.py # RayTrainGroup (manages training actor group)
│ └── placement_group.py # Placement group creation, GPU resource management
│ └── placement_group.py # Placement group creation, GPU resource management, custom node placement
├── controller/ # Async pipeline orchestration
│ ├── training_controller.py # AsyncTrainingController (Ray actor)
│ ├── inference_manager.py # AsyncInferenceManager (Ray actor)
@@ -209,11 +209,14 @@ training:
  ttt_length: 7 # Speculative depth
  train_backend: fsdp
  fsdp_strategy: REPLICATE
  placement_strategy: training_first # or inference_first/custom
  training_node_ips: null # custom placement only

inference:
  inference_engine_type: hf # or "sgl"
  inference_batch_size: 1
  inference_num_gpus: 4
  inference_node_ips: null # custom placement only
  sglang: # nested under inference
    tp_size: 8
    extra_args: # power-user passthrough to sgl.Engine
@@ -258,7 +261,7 @@ python train.py --config base.yaml --config experiment.yaml training.learning_ra
|--------|---------|
| `torchspec/ray/ray_actor.py` | `RayActor` base class (GPU setup, IP/port utils, master addr negotiation) |
| `torchspec/ray/train_group.py` | `RayTrainGroup` - Manages a group of training actors |
| `torchspec/ray/placement_group.py` | Placement group creation, GPU resource waiting, `create_placement_groups()`, `create_train_group()` |
| `torchspec/ray/placement_group.py` | Placement group creation, GPU resource waiting, custom node placement, `create_placement_groups()`, `create_train_group()` |

### Controller

64 changes: 62 additions & 2 deletions docs/ray.md
@@ -35,12 +35,13 @@ Placement groups reserve GPUs for training and inference as a unit and place the

| Mode | Training GPUs | Inference GPUs | Use case |
|------|--------------|----------------|----------|
| Default (separate) | Dedicated PG | Dedicated PG | Production: no GPU contention |
| Default | Sliced from unified PG | Sliced from unified PG | Production: deterministic node-to-role assignment |
| `custom` | Sliced from custom unified PG | Sliced from custom unified PG | Production: explicit node choice with the same unified reservation semantics |
| `colocate` | Shared PG | Shared PG | Dev: share GPUs between train & inference |
| `debug_train_only` | Dedicated PG | Empty | Debug training without inference |
| `debug_inference_only` | Empty | Dedicated PG | Debug inference without training |

Each placement group probes bundles with a temporary `InfoActor` to discover the actual (node IP, GPU ID) mapping, then sorts by (node, GPU ID) for deterministic ordering.
Each placement group probes bundles with a temporary `InfoActor` to discover the actual (node IP, GPU ID) mapping, then sorts by (node, GPU ID) for deterministic ordering. In `custom` mode, TorchSpec sorts by the configured node order first and by physical GPU ID within each selected node.

## Ray Cluster Setup

Expand Down Expand Up @@ -134,6 +135,65 @@ The PACK placement strategy spreads them across nodes automatically.
| `training.training_num_nodes` | 1 | Number of training nodes |
| `training.training_num_gpus_per_node` | 1 | GPUs per training node |

### Custom node placement

By default, TorchSpec creates a unified placement group with Ray's `PACK`
strategy, probes the resulting bundles, and assigns the ordered bundles to
training or inference according to `training.placement_strategy`
(`training_first` or `inference_first`). Set
`training.placement_strategy: custom` to explicitly choose the nodes for each
role while still reserving the non-colocated training and inference bundles in a
single unified placement group.

IP-based placement uses Ray's built-in per-node `node:<ip>` resource and does
not require custom Ray labels:

```yaml
training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1
    - 10.0.0.3

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.4
```

Ray label selectors are also supported when the installed Ray version supports
placement group `bundle_label_selector`. Start Ray nodes with labels, then use
matching selectors in the config:

```yaml
training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_selectors:
    - {"torchspec/node": "trainer-0"}
    - {"torchspec/node": "trainer-1"}

inference:
  inference_node_selectors:
    - {"torchspec/node": "infer-0"}
    - {"torchspec/node": "infer-1"}
```
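Because the config lists one selector per node while a placement group has one bundle per GPU, the selectors presumably need to be expanded to one entry per bundle. The sketch below shows that expansion under the assumption that the selector list runs parallel to the bundle list; `expand_selectors` is a hypothetical helper and is not taken from the TorchSpec source.

```python
def expand_selectors(node_selectors, gpus_per_node):
    """Expand per-node selectors into parallel per-bundle lists.

    Hypothetical helper. Assumes one selector dict per bundle, with every
    GPU bundle on a node sharing that node's selector.
    """
    bundles, selectors = [], []
    for selector in node_selectors:  # configured order is kept
        for _ in range(gpus_per_node):
            bundles.append({"GPU": 1})
            selectors.append(dict(selector))
    return bundles, selectors


bundles, bundle_label_selector = expand_selectors(
    [{"torchspec/node": "trainer-0"}, {"torchspec/node": "trainer-1"}],
    gpus_per_node=8,
)
```

On a Ray version that accepts it, the two lists would be passed together, for example `ray.util.placement_group(bundles, strategy="PACK", bundle_label_selector=bundle_label_selector)`.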

The configured node order is preserved. For multi-node inference, this order
determines the order of inference engine actors and therefore the `node_rank`
passed to SGLang or vLLM. Within each selected node, bundles are ordered by the
actual GPU ID discovered by `InfoActor`.
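The ordering rule can be illustrated with a small sort: rank each probed bundle by the position of its node in the configured list, then by GPU ID. This is only a sketch; the tuple layout `(bundle_index, node_ip, gpu_id)` and the name `order_bundles` are assumptions for the example, not the actual `InfoActor` output format.

```python
def order_bundles(probed, configured_nodes):
    """Sort probed bundles by configured node order, then by GPU ID.

    `probed` holds (bundle_index, node_ip, gpu_id) tuples (assumed layout).
    """
    node_rank = {ip: rank for rank, ip in enumerate(configured_nodes)}
    return sorted(probed, key=lambda b: (node_rank[b[1]], b[2]))


probed = [
    (0, "10.0.0.4", 1),
    (1, "10.0.0.2", 3),
    (2, "10.0.0.2", 0),
    (3, "10.0.0.4", 0),
]
# 10.0.0.4 is listed first in the config, so it becomes node_rank 0 even
# though a plain sort by IP would put 10.0.0.2 first.
ordered = order_bundles(probed, ["10.0.0.4", "10.0.0.2"])
bundle_indices = [b[0] for b in ordered]
```

Here `bundle_indices` comes out as `[3, 0, 2, 1]`: both bundles on `10.0.0.4` first (GPU 0, then GPU 1), followed by those on `10.0.0.2`.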

The number of configured training nodes must equal
`training.training_num_nodes`. The number of configured inference nodes must
match `ceil(inference.inference_num_gpus / inference.inference_num_gpus_per_node)`.
For each role, set only one of `*_node_ips` or `*_node_selectors`.

### Inference across nodes (SglEngine multi-node TP)

When a single model is too large for one node, SglEngine supports multi-node