experimental(slurm): add eval_ray_cluster multi-node bootstrap for agentic Gym evals #1016
Open
agronskiy wants to merge 2 commits into
…ployment: none agentic Gym evals

Adds an opt-in NEL feature that mirrors the canonical vllm_ray pre_cmd
pattern for the CPU eval-only case: a single per-node bootstrap srun
brings up a Ray cluster (head + N-1 workers) with the eval container
image, and a head-only workload srun (eval_ray_head_workload_cmd) runs
ng_e2e_collect_rollouts inside the head's bootstrap container so Stirrup
actors share the driver's Python/venv set (no pickle ABI skew).

New cfg.execution knobs (all default off):
- eval_ray_cluster (bool): opt-in master switch
- eval_ray_port / eval_ray_dashboard_port: Ray GCS + dashboard ports
- eval_ray_ready_timeout (int): TCP wait timeout
- eval_per_node_pre_cmd (str|null): per-node ephemeral pre_cmd
- eval_ray_pre_start_cmd (str|null): runs inside each bootstrap
  container before `ray start` (apt installs, venv source,
  install_on_the_fly git checkouts, etc.)
- eval_ray_head_workload_cmd (str|null): replaces `sleep infinity` in
  the head's inner_cmd; use it to bash a lustre-rendezvous deployment
  script built by the eval client's `command:`

Internal fixes carried by this branch (after 16 trial iterations):
- Single-quote escape `'` → `'\''` in _srun_in_eval_container so
  inner_cmds with awk patterns don't break the outer `bash -c '...'`
- Inline TCP /dev/tcp wait handler (no extra srun on PRIMARY_NODE) to
  avoid pyxis container-extraction deadlock
- Per-node host-local /tmp/ray-$SLURM_JOB_ID bind into eval-container
  sruns (Ray session files; pyxis isolates /tmp per container)
- Host-side mkdir + bracket-regex pkill cleanup (avoids self-suicide
  from `pkill -f raylet` matching its own cmdline)
- Fixed Ray component ports matching canonical (--node-manager-port,
  --object-manager-port, etc.)
- export NEL_INVOCATION_ID in batch script (bootstrap sruns saw an empty
  value otherwise)
- cfg.execution.mounts.mount_home propagated to bootstrap so
  --no-container-mount-home keeps the baked uv-managed Python visible
- cfg.evaluation.pre_cmd propagated to bootstrap inner_cmds (env/pkg
  symmetry between driver and Ray actors)
- ray_cluster_is_unsafe folded into is_potentially_unsafe aggregation
- Rejects eval_ray_cluster + aux deployment combos at config load

180/180 slurm unit tests pass. New TestEvalRayCluster class covers the
flag-off path, head+worker srun emission, single-node skip,
deployment!=none skip, pre_cmd safety, ADAPTER_HOST export, eval-srun
env propagation, inline TCP wait handler, cleanup pkill, ray-stop
idempotency, and aux-combo rejection.

Example yaml at
packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml
demonstrates the canonical use_absolute_ip + policy_base_url +
ray_head_node_address pattern.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
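
Two of the internal fixes listed above, the single-quote escape and the bracket-regex pkill pattern, are easy to check in isolation. The following is a minimal sketch, not the code in executor.py: the helper names `escape_single_quotes`, `wrap_for_bash_c`, and `bracket_pattern` are hypothetical, and Python's `re` stands in for pkill's regex matching, which should agree for a pattern this simple.

```python
import re
import shlex


def escape_single_quotes(inner_cmd: str) -> str:
    # Close the quote, emit an escaped quote, reopen the quote.
    return inner_cmd.replace("'", "'\\''")


def wrap_for_bash_c(inner_cmd: str) -> str:
    # Hypothetical wrapper: embed the escaped command in bash -c '...'.
    return "bash -c '" + escape_single_quotes(inner_cmd) + "'"


def bracket_pattern(name: str) -> str:
    # 'raylet' -> '[r]aylet'. The regex still matches 'raylet', but the
    # pattern text itself no longer contains that literal substring.
    # Assumes the first character is not a regex metacharacter.
    return "[" + name[0] + "]" + name[1:]


# An inner command with single quotes, as awk programs usually have.
awk_cmd = "awk '{print $1}' /etc/hostname"
wrapped = wrap_for_bash_c(awk_cmd)

# shlex follows POSIX shell quoting, so it should recover the inner
# command as one argument if the escaping is right.
recovered = shlex.split(wrapped)

pattern = bracket_pattern("raylet")
# Matches a real raylet command line...
hits_raylet = re.search(pattern, "/usr/bin/raylet --node-manager-port=1") is not None
# ...but not a command line that merely carries the pattern as text.
hits_self = re.search(pattern, "bash -c pkill -f " + pattern) is not None
```

With the plain pattern `raylet`, any process whose command line contains that text (for example a wrapping shell) would also match, which is the self-match the bracket form avoids.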
…srun
NEL's .secrets.env defines task-scoped env vars (e.g.
`PERSIST_DELIVERABLES_DIR_<sha>_NEMO_GYM_0`) so multiple tasks in one
sbatch don't clobber each other's vars. The unsuffixed names are
re-exported just before each task's srun via `eval_reexport_cmd` (line
~1062 of executor.py, in the main sbatch builder).
In the eval_ray_cluster path the bootstrap/ray-head/ray-worker sruns
fire BEFORE that re-export, so pyxis's `--container-env PERSIST_DELIVERABLES_DIR,...`
propagation reads the unsuffixed var from the sbatch shell and finds it
unset. Result: the bootstrap container starts with `PERSIST_DELIVERABLES_DIR=""`.
OmegaConf's `${oc.env:PERSIST_DELIVERABLES_DIR,output/gdpval/deliverables}`
in benchmarks/gdpval/config.yaml then resolves to the relative default,
which trips StirrupAgentWrapper.model_post_init's `is_absolute()` check
→ pydantic ValidationError → `Process gdpval_stirrup_agent finished
unexpectedly!` → ng_e2e_collect_rollouts dies.
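
The failure chain can be illustrated with a small stand-alone sketch. It models only the behavior reported above (an empty or missing variable ends up as the relative default, which then fails an absolute-path check); it does not use OmegaConf or pydantic, and `resolve_deliverables_dir`, `require_absolute`, `try_start`, and the `/lustre/run/deliverables` path are hypothetical stand-ins.

```python
from pathlib import PurePosixPath

# Default taken from the oc.env expression quoted in this message.
DEFAULT_DIR = "output/gdpval/deliverables"


def resolve_deliverables_dir(env: dict) -> str:
    # Stand-in for the config lookup: an unset or empty variable falls
    # back to the relative default, as the message reports.
    return env.get("PERSIST_DELIVERABLES_DIR") or DEFAULT_DIR


def require_absolute(path: str) -> str:
    # Stand-in for the is_absolute() check in model_post_init.
    if not PurePosixPath(path).is_absolute():
        raise ValueError(f"deliverables dir must be absolute, got {path!r}")
    return path


def try_start(env: dict) -> str:
    try:
        return "ok: " + require_absolute(resolve_deliverables_dir(env))
    except ValueError as exc:
        return "failed: " + str(exc)


unset_case = try_start({})
empty_case = try_start({"PERSIST_DELIVERABLES_DIR": ""})
absolute_case = try_start({"PERSIST_DELIVERABLES_DIR": "/lustre/run/deliverables"})
```

Both the unset and the empty case fail the check, while an absolute path passes it.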
Fix: emit eval_reexport_cmd right before the bootstrap srun, so the
unsuffixed names exist in the sbatch shell when pyxis evaluates them.
The eval-client srun's later re-export (line ~1062) is now redundant
but idempotent — leave it for clarity.
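
The ordering change can be sketched as follows. This is an illustration under assumptions rather than the sbatch builder itself: `build_reexport_cmd` and `build_batch_section` are hypothetical helpers, `abc123` stands in for the `<sha>` part of the suffix, and the srun lines are placeholders.

```python
def build_reexport_cmd(names, task_suffix):
    # For each unsuffixed name, copy the task-scoped value into it.
    lines = []
    for name in names:
        scoped = f"{name}_{task_suffix}"
        lines.append(f'export {name}="${{{scoped}}}"')
    return "\n".join(lines)


def build_batch_section(reexport_cmd, bootstrap_srun, eval_srun):
    # Re-export first so the unsuffixed names exist when the bootstrap
    # srun is evaluated; the second re-export is redundant but harmless.
    return "\n".join([reexport_cmd, bootstrap_srun, reexport_cmd, eval_srun])


reexport = build_reexport_cmd(["PERSIST_DELIVERABLES_DIR"], "abc123_NEMO_GYM_0")
section = build_batch_section(
    reexport,
    "srun <bootstrap args>",    # placeholder, not a real command line
    "srun <eval-client args>",  # placeholder, not a real command line
)
section_lines = section.split("\n")
```

Because each export only reassigns a variable from its task-scoped copy, running it twice leaves the shell in the same state, which is why the later re-export can stay.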
Live reproduction:
- probe1 (no PERSIST_DELIVERABLES_DIR set) → StirrupAgentWrapper dies.
- probe2 (PERSIST_DELIVERABLES_DIR=<abs>) → stirrup_agent starts cleanly.
180/180 slurm unit tests pass.
Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
Agentic Gym evals like GDPVal run with `deployment: none` (no model server to spin up), but their Stirrup actors are Ray actors and want a multi-node cluster co-located with the eval client. NEL only had `vllm_ray`'s pre_cmd path for that, which is tied to a model deployment, so eval-only multi-node was unreachable.

This adds an opt-in bootstrap that mirrors `vllm_ray` but inside the eval container: a per-node srun starts the Ray head + N-1 workers, and a head-only workload srun runs the eval in the same container so actors share the driver's Python/venv (no pickle ABI skew).

New `cfg.execution` knobs, default off: `eval_ray_cluster`, `eval_ray_port`, `eval_ray_dashboard_port`, `eval_ray_ready_timeout`, `eval_per_node_pre_cmd`, `eval_ray_pre_start_cmd`, `eval_ray_head_workload_cmd`. Combining with aux deployments is rejected at config load.

Second commit emits `eval_reexport_cmd` before the bootstrap srun so pyxis `--container-env` propagation sees task-scoped vars (`PERSIST_DELIVERABLES_DIR` etc.). Without it, GDPVal's Stirrup wrapper saw an empty path and tripped `is_absolute()` → ValidationError.

Example: `packages/nemo-evaluator-launcher/examples/slurm_eval_only_ray_cluster.yaml`.

Slurm unit tests: 180/180 pass; new `TestEvalRayCluster` covers srun emission, skip paths, env propagation, cleanup, aux-combo rejection. Full multi-node GDPVal run on Slurm still in progress.