experimental(slurm): add eval_ray_cluster multi-node bootstrap for agentic Gym evals by agronskiy · Pull Request #1016 · NVIDIA-NeMo/Evaluator

agronskiy · 2026-05-15T14:17:02Z

Agentic Gym evals like GDPVal run with deployment: none (no model server to spin up) but their Stirrup actors are Ray actors and want a multi-node cluster co-located with the eval client. NEL only had vllm_ray's pre_cmd path for that, which is tied to a model deployment — so eval-only multi-node was unreachable.

This adds an opt-in bootstrap that mirrors vllm_ray but runs inside the eval container: a per-node srun starts the Ray head plus N-1 workers, and a head-only workload srun runs the eval in the same container, so actors share the driver's Python/venv (no pickle ABI skew).
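For readers unfamiliar with the head + N-1 workers layout, here is a minimal sketch of the per-node commands such a bootstrap boils down to. It is illustrative only: `ray_start_commands` is a hypothetical helper, the port values are placeholders, and the commands NEL actually emits also carry the container, mount, and component-port flags described in the commits.

```python
from typing import List


def ray_start_commands(
    nodes: List[str], port: int = 6379, dashboard_port: int = 8265
) -> List[str]:
    """Return one `ray start` command per node: head first, then workers.

    Hypothetical helper for illustration; not part of NEL.
    """
    if not nodes:
        raise ValueError("need at least one node")
    head = nodes[0]
    # The first node hosts the Ray head (GCS + dashboard).
    cmds = [
        f"ray start --head --port={port} --dashboard-port={dashboard_port} --block"
    ]
    # Every remaining node joins the head as a worker.
    cmds += [f"ray start --address={head}:{port} --block" for _ in nodes[1:]]
    return cmds


if __name__ == "__main__":
    for node, cmd in zip(["node0", "node1", "node2"], ray_start_commands(["node0", "node1", "node2"])):
        print(f"{node}: {cmd}")
```

With three nodes this yields one head command and two worker commands, all pointing at the first node's address.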

New cfg.execution knobs, default off: eval_ray_cluster, eval_ray_port, eval_ray_dashboard_port, eval_ray_ready_timeout, eval_per_node_pre_cmd, eval_ray_pre_start_cmd, eval_ray_head_workload_cmd. Combining with aux deployments is rejected at config load.
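A rough sketch of how the knobs and the aux-deployment rejection could be modelled follows. The field names come from this PR; the `ExecutionCfg` dataclass, the `validate_execution` helper, and the port/timeout values are assumptions for illustration and do not reflect NEL's real schema or defaults.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ExecutionCfg:
    # Knob names are from the PR; numeric values are placeholders,
    # not NEL's actual defaults.
    eval_ray_cluster: bool = False
    eval_ray_port: int = 6379
    eval_ray_dashboard_port: int = 8265
    eval_ray_ready_timeout: int = 300
    eval_per_node_pre_cmd: Optional[str] = None
    eval_ray_pre_start_cmd: Optional[str] = None
    eval_ray_head_workload_cmd: Optional[str] = None


def validate_execution(cfg: ExecutionCfg, aux_deployments: Sequence[str]) -> None:
    """Reject eval_ray_cluster combined with aux deployments (hypothetical check)."""
    if cfg.eval_ray_cluster and aux_deployments:
        raise ValueError(
            "eval_ray_cluster cannot be combined with aux deployments: "
            + ", ".join(aux_deployments)
        )


if __name__ == "__main__":
    validate_execution(ExecutionCfg(eval_ray_cluster=True), [])  # accepted
    try:
        validate_execution(ExecutionCfg(eval_ray_cluster=True), ["judge"])
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Failing at config load rather than at job submission keeps a bad combination from ever reaching the sbatch script.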

Second commit emits eval_reexport_cmd before the bootstrap srun so pyxis --container-env propagation sees task-scoped vars (PERSIST_DELIVERABLES_DIR etc.). Without it, GDPVal's Stirrup wrapper saw an empty path and tripped is_absolute() → ValidationError.

Example: packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml.

Slurm unit tests: 180/180 pass; new TestEvalRayCluster covers srun emission, skip paths, env propagation, cleanup, aux-combo rejection. Full multi-node GDPVal run on Slurm still in progress.

…ployment: none agentic Gym evals

Adds an opt-in NEL feature that mirrors the canonical vllm_ray pre_cmd pattern for the CPU eval-only case: a single per-node bootstrap srun brings up a Ray cluster (head + N-1 workers) with the eval container image, and a head-only workload srun (eval_ray_head_workload_cmd) runs ng_e2e_collect_rollouts inside the head's bootstrap container so Stirrup actors share the driver's Python/venv set (no pickle ABI skew).

New cfg.execution knobs (all default off):
- eval_ray_cluster (bool): opt-in master switch
- eval_ray_port / eval_ray_dashboard_port: Ray GCS + dashboard ports
- eval_ray_ready_timeout (int): TCP wait timeout
- eval_per_node_pre_cmd (str|null): per-node ephemeral pre_cmd
- eval_ray_pre_start_cmd (str|null): runs inside each bootstrap container before `ray start` (apt installs, venv source, install_on_the_fly git checkouts, etc.)
- eval_ray_head_workload_cmd (str|null): replaces `sleep infinity` in head's inner_cmd; use to bash a lustre-rendezvous deployment script built by the eval client's command:

Internal fixes carried by this branch (16 tryout iterations):
- Single-quote escape `'` → `'\''` in _srun_in_eval_container so inner_cmds with awk patterns don't break the outer `bash -c '...'`
- Inline TCP /dev/tcp wait handler (no extra srun on PRIMARY_NODE) to avoid pyxis container-extraction deadlock
- Per-node host-local /tmp/ray-$SLURM_JOB_ID bind into eval-container sruns (Ray session files; pyxis isolates /tmp per container)
- Host-side mkdir + bracket-regex pkill cleanup (avoids `pkill -f raylet` matching its own cmdline and killing itself)
- Fixed Ray component ports matching the canonical pattern (--node-manager-port, --object-manager-port, etc.)
- export NEL_INVOCATION_ID in batch script (bootstrap sruns saw empty value otherwise)
- cfg.execution.mounts.mount_home propagated to bootstrap so --no-container-mount-home keeps the baked uv-managed Python visible
- cfg.evaluation.pre_cmd propagated to bootstrap inner_cmds (env/pkg symmetry between driver and Ray actors)
- ray_cluster_is_unsafe folded into is_potentially_unsafe aggregation
- Rejects eval_ray_cluster + aux deployment combos at config load

180/180 slurm unit tests pass. New TestEvalRayCluster class covers the flag-off path, head+worker srun emission, single-node skip, deployment!=none skip, pre_cmd safety, ADAPTER_HOST export, eval-srun env propagation, inline TCP wait handler, cleanup pkill, ray-stop idempotency, and aux-combo rejection.

Example yaml at packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml demonstrates the canonical use_absolute_ip + policy_base_url + ray_head_node_address pattern.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
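The single-quote escape listed among the internal fixes is the standard POSIX shell idiom: close the quoted string, emit an escaped quote, reopen the string. A minimal sketch, assuming hypothetical helper names (`escape_single_quotes`, `wrap_in_bash_c`) rather than the actual body of _srun_in_eval_container:

```python
import shlex


def escape_single_quotes(inner_cmd: str) -> str:
    """Replace each ' with '\\'' so the text survives inside '...' quoting."""
    return inner_cmd.replace("'", "'\\''")


def wrap_in_bash_c(inner_cmd: str) -> str:
    """Wrap a command in bash -c '...' (hypothetical helper, not NEL code)."""
    return "bash -c '" + escape_single_quotes(inner_cmd) + "'"


if __name__ == "__main__":
    inner = "awk '{print $1}' hosts.txt"
    wrapped = wrap_in_bash_c(inner)
    print(wrapped)
    # shlex follows POSIX quoting rules, so it recovers the original command
    # as the single argument that bash -c would receive.
    print(shlex.split(wrapped))
```

Without the escape, the first quote inside the awk pattern would terminate the outer string and the rest of the command would be parsed by the wrong shell.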

…srun

NEL's .secrets.env defines task-scoped env vars (e.g. `PERSIST_DELIVERABLES_DIR_<sha>_NEMO_GYM_0`) so multiple tasks in one sbatch don't clobber each other's vars. The unsuffixed names are re-exported just before each task's srun via `eval_reexport_cmd` (line ~1062 of executor.py, in the main sbatch builder).

In the eval_ray_cluster path the bootstrap/ray-head/ray-worker sruns fire BEFORE that re-export, so pyxis's `--container-env PERSIST_DELIVERABLES_DIR,...` propagation reads the unsuffixed var from the sbatch shell and finds it unset. Result: the bootstrap container starts with `PERSIST_DELIVERABLES_DIR=""`. OmegaConf's `${oc.env:PERSIST_DELIVERABLES_DIR,output/gdpval/deliverables}` in benchmarks/gdpval/config.yaml then resolves to the relative default, which trips StirrupAgentWrapper.model_post_init's `is_absolute()` check → pydantic ValidationError → `Process gdpval_stirrup_agent finished unexpectedly!` → ng_e2e_collect_rollouts dies.

Fix: emit eval_reexport_cmd right before the bootstrap srun, so the unsuffixed names exist in the sbatch shell when pyxis evaluates them. The eval-client srun's later re-export (line ~1062) is now redundant but idempotent; leave it for clarity.

Live reproduction:
- probe1 (no PERSIST_DELIVERABLES_DIR set) → StirrupAgentWrapper dies.
- probe2 (PERSIST_DELIVERABLES_DIR=<abs>) → stirrup_agent starts cleanly.

180/180 slurm unit tests pass.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
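The failure chain in this commit can be reproduced in miniature. The sketch below models only the behaviour the commit message describes (an unset or empty variable falls back to the relative default, which then fails an absolute-path check); `resolve_deliverables_dir` and `check_absolute` are hypothetical stand-ins, not the OmegaConf resolver or the StirrupAgentWrapper validator, and `/lustre/run/deliverables` is a made-up path.

```python
from pathlib import PurePosixPath
from typing import Mapping

# Relative default quoted from benchmarks/gdpval/config.yaml in the commit message.
DEFAULT_DELIVERABLES_DIR = "output/gdpval/deliverables"


def resolve_deliverables_dir(env: Mapping[str, str]) -> PurePosixPath:
    """Fall back to the relative default when the variable is unset or empty.

    Assumption: this mirrors the outcome described in the commit, not
    OmegaConf's actual resolver logic.
    """
    value = env.get("PERSIST_DELIVERABLES_DIR") or DEFAULT_DELIVERABLES_DIR
    return PurePosixPath(value)


def check_absolute(path: PurePosixPath) -> PurePosixPath:
    """Stand-in for the wrapper's is_absolute() validation."""
    if not path.is_absolute():
        raise ValueError(f"deliverables dir must be absolute, got {path}")
    return path


if __name__ == "__main__":
    # probe2: re-export happened before the bootstrap srun.
    print(check_absolute(resolve_deliverables_dir(
        {"PERSIST_DELIVERABLES_DIR": "/lustre/run/deliverables"})))
    # probe1: variable empty inside the bootstrap container.
    try:
        check_absolute(resolve_deliverables_dir({"PERSIST_DELIVERABLES_DIR": ""}))
    except ValueError as exc:
        print(f"probe1 fails: {exc}")
```

This is why ordering matters: the variable has to exist in the sbatch shell before the first srun that forwards it into a container.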

agronskiy added 2 commits May 15, 2026 08:31

agronskiy requested review from a team as code owners May 15, 2026 14:17

github-actions Bot added nemo-evaluator-launcher tests labels May 15, 2026

agronskiy wants to merge 2 commits into main from agronskiy/experimental/eval-ray-cluster
