experimental(slurm): add eval_ray_cluster multi-node bootstrap for agentic Gym evals#1016

Open
agronskiy wants to merge 2 commits into main from agronskiy/experimental/eval-ray-cluster

@agronskiy agronskiy commented May 15, 2026

Agentic Gym evals like GDPVal run with deployment: none (no model server to spin up) but their Stirrup actors are Ray actors and want a multi-node cluster co-located with the eval client. NEL only had vllm_ray's pre_cmd path for that, which is tied to a model deployment — so eval-only multi-node was unreachable.

This adds an opt-in bootstrap that mirrors vllm_ray but inside the eval container: per-node srun starts Ray head + N-1 workers, head-only workload srun runs the eval in the same container so actors share the driver's Python/venv (no pickle ABI skew).

New cfg.execution knobs, default off: eval_ray_cluster, eval_ray_port, eval_ray_dashboard_port, eval_ray_ready_timeout, eval_per_node_pre_cmd, eval_ray_pre_start_cmd, eval_ray_head_workload_cmd. Combining with aux deployments is rejected at config load.
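
The config-load rejection could look roughly like the following. This is a minimal sketch, not the launcher's actual validation code: the function name `validate_eval_ray_cluster` and the `aux_deployments` key are hypothetical, and only `execution.eval_ray_cluster` comes from this PR.

```python
from typing import Any, Mapping


def validate_eval_ray_cluster(cfg: Mapping[str, Any]) -> None:
    """Reject eval_ray_cluster combined with aux deployments (hypothetical helper)."""
    execution = cfg.get("execution", {})
    # The flag defaults to off, so a missing key means "disabled".
    if not execution.get("eval_ray_cluster", False):
        return
    # `aux_deployments` is an assumed key name used only for illustration.
    if cfg.get("aux_deployments"):
        raise ValueError(
            "eval_ray_cluster cannot be combined with aux deployments"
        )
```

Failing at load time, rather than at sbatch generation, means a bad combination is reported before anything is submitted to Slurm.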

Second commit emits eval_reexport_cmd before the bootstrap srun so pyxis --container-env propagation sees task-scoped vars (PERSIST_DELIVERABLES_DIR etc.). Without it, GDPVal's Stirrup wrapper saw an empty path and tripped is_absolute() → ValidationError.

Example: packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml.

Slurm unit tests: 180/180 pass; new TestEvalRayCluster covers srun emission, skip paths, env propagation, cleanup, aux-combo rejection. Full multi-node GDPVal run on Slurm still in progress.

agronskiy added 2 commits May 15, 2026 08:31
…ployment: none agentic Gym evals

Adds an opt-in NEL feature that mirrors the canonical vllm_ray pre_cmd
pattern for the CPU eval-only case: a single per-node bootstrap srun
brings up a Ray cluster (head + N-1 workers) with the eval container
image, and a head-only workload srun (eval_ray_head_workload_cmd) runs
ng_e2e_collect_rollouts inside the head's bootstrap container so Stirrup
actors share the driver's Python/venv set (no pickle ABI skew).

New cfg.execution knobs (all default off):
- eval_ray_cluster: bool — opt-in master switch
- eval_ray_port / eval_ray_dashboard_port — Ray GCS + dashboard ports
- eval_ray_ready_timeout: int — TCP wait timeout
- eval_per_node_pre_cmd: str|null — per-node ephemeral pre_cmd
- eval_ray_pre_start_cmd: str|null — runs inside each bootstrap
  container before `ray start` (apt installs, venv source,
  install_on_the_fly git checkouts, etc.)
- eval_ray_head_workload_cmd: str|null — replaces `sleep infinity` in
  the head's inner_cmd; use it to run (via bash) a Lustre-rendezvous
  deployment script built by the eval client's command:

Internal fixes carried by this branch (after 16 iterations):
- Single-quote escape `'` → `'\''` in _srun_in_eval_container so
  inner_cmds with awk patterns don't break the outer `bash -c '...'`
- Inline TCP /dev/tcp wait handler (no extra srun on PRIMARY_NODE) to
  avoid pyxis container-extraction deadlock
- Per-node host-local /tmp/ray-$SLURM_JOB_ID bind into eval-container
  sruns (Ray session files; pyxis isolates /tmp per container)
- Host-side mkdir + bracket-regex pkill cleanup (avoids `pkill -f
  raylet` matching its own cmdline and killing itself)
- Fixed Ray component ports matching canonical (--node-manager-port,
  --object-manager-port, etc.)
- export NEL_INVOCATION_ID in batch script (bootstrap sruns saw empty
  value otherwise)
- cfg.execution.mounts.mount_home propagated to bootstrap so
  --no-container-mount-home keeps the baked uv-managed Python visible
- cfg.evaluation.pre_cmd propagated to bootstrap inner_cmds (env/pkg
  symmetry between driver and Ray actors)
- ray_cluster_is_unsafe folded into is_potentially_unsafe aggregation
- Rejects eval_ray_cluster + aux deployment combos at config load
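
The single-quote escape from the first bullet can be sketched as below. This is
an illustration only: `escape_single_quotes` and `wrap_bash_c` are hypothetical
names, not the body of _srun_in_eval_container.

```python
import shlex


def escape_single_quotes(inner_cmd: str) -> str:
    """Close the quote, emit an escaped quote, reopen: ' becomes '\\''."""
    return inner_cmd.replace("'", "'\\''")


def wrap_bash_c(inner_cmd: str) -> str:
    """Wrap a command in bash -c '...' so embedded single quotes survive."""
    return "bash -c '" + escape_single_quotes(inner_cmd) + "'"


if __name__ == "__main__":
    cmd = "awk '{print $1}' /etc/hosts"
    wrapped = wrap_bash_c(cmd)
    print(wrapped)
    # A POSIX shell parse of the wrapped string recovers the original command.
    assert shlex.split(wrapped) == ["bash", "-c", cmd]
```

Without the escape, the first `'` inside the awk pattern would terminate the
outer quoted string and the rest of the command would be parsed by the outer
shell.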

180/180 slurm unit tests pass. New TestEvalRayCluster class covers
the flag-off path, head+worker srun emission, single-node skip,
deployment!=none skip, pre_cmd safety, ADAPTER_HOST export, eval-srun
env propagation, inline TCP wait handler, cleanup pkill, ray-stop
idempotency, and aux-combo rejection.
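
For readers unfamiliar with the TCP readiness wait mentioned above, here is a
Python sketch of the same idea. The branch does this inline in bash via
/dev/tcp; the function below is a stand-in with a hypothetical name and
parameters, bounded by a timeout in the spirit of eval_ray_ready_timeout.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll host:port until a TCP connect succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Never sleep past the deadline.
            time.sleep(min(interval_s, remaining))
```

A connect-only probe checks that the port accepts connections, not that the
Ray cluster is fully formed, so it is a readiness hint rather than a guarantee.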

Example yaml at packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml
demonstrates the canonical use_absolute_ip + policy_base_url +
ray_head_node_address pattern.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
…srun

NEL's .secrets.env defines task-scoped env vars (e.g.
`PERSIST_DELIVERABLES_DIR_<sha>_NEMO_GYM_0`) so multiple tasks in one
sbatch don't clobber each other's vars. The unsuffixed names are
re-exported just before each task's srun via `eval_reexport_cmd` (line
~1062 of executor.py, in the main sbatch builder).
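
A rough sketch of what such a re-export amounts to, assuming a simple
`export NAME="${NAME_<suffix>}"` form. The helper name `build_reexport_cmd`,
its signature, and the `abc123` sha are placeholders; the real
`eval_reexport_cmd` in executor.py may be built differently.

```python
from typing import Iterable


def build_reexport_cmd(names: Iterable[str], task_suffix: str) -> str:
    """Return shell lines that copy task-scoped vars onto their unsuffixed names."""
    lines = []
    for name in names:
        scoped = f"{name}_{task_suffix}"
        # Doubled braces produce a literal ${...} shell expansion.
        lines.append(f'export {name}="${{{scoped}}}"')
    return "\n".join(lines)


if __name__ == "__main__":
    # "abc123" stands in for the <sha> part of the real suffix.
    print(build_reexport_cmd(["PERSIST_DELIVERABLES_DIR"], "abc123_NEMO_GYM_0"))
```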

In the eval_ray_cluster path the bootstrap/ray-head/ray-worker sruns
fire BEFORE that re-export, so pyxis's `--container-env PERSIST_DELIVERABLES_DIR,...`
propagation reads the unsuffixed var from the sbatch shell and finds it
unset. Result: the bootstrap container starts with `PERSIST_DELIVERABLES_DIR=""`.
OmegaConf's `${oc.env:PERSIST_DELIVERABLES_DIR,output/gdpval/deliverables}`
in benchmarks/gdpval/config.yaml then resolves to the relative default,
which trips StirrupAgentWrapper.model_post_init's `is_absolute()` check
→ pydantic ValidationError → `Process gdpval_stirrup_agent finished
unexpectedly!` → ng_e2e_collect_rollouts dies.
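
The last step of that chain is easy to reproduce with the standard library.
The sketch below is a stand-in for the `is_absolute()` check, not the actual
StirrupAgentWrapper code; `require_absolute` is a hypothetical name, and the
absolute path is made up.

```python
from pathlib import PurePosixPath


def require_absolute(value: str) -> PurePosixPath:
    """Raise if the deliverables dir is not an absolute path."""
    path = PurePosixPath(value)
    if not path.is_absolute():
        raise ValueError(f"deliverables dir must be absolute, got {value!r}")
    return path


if __name__ == "__main__":
    # Both the empty value and the relative default fail the check.
    for candidate in ("", "output/gdpval/deliverables", "/lustre/run/deliverables"):
        try:
            print("ok:", require_absolute(candidate))
        except ValueError as exc:
            print("rejected:", exc)
```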

Fix: emit eval_reexport_cmd right before the bootstrap srun, so the
unsuffixed names exist in the sbatch shell when pyxis evaluates them.
The eval-client srun's later re-export (line ~1062) is now redundant
but idempotent — leave it for clarity.

Live reproduction:
- probe1 (no PERSIST_DELIVERABLES_DIR set) → StirrupAgentWrapper dies.
- probe2 (PERSIST_DELIVERABLES_DIR=<abs>) → stirrup_agent starts cleanly.

180/180 slurm unit tests pass.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>