GPU scheduler #237

Open
sdubagun-amd wants to merge 36 commits into main from gwiab-scheduler

@sdubagun-amd
Collaborator

Summary

  • GPU lease scheduler: New centralized GPUManager that decouples agent thread count from GPU count, with lease-based tracking, a background reaper for expired leases, CPU pressure gating, split queue/exec timeouts, and LLM concurrency caps to prevent TPM/RPM blowout under oversubscription
  • JSONL event logging & counters: Per-run structured event log (event_log_path) with per-outcome counters for observability (Phase 2)
  • v3 preprocessor & sandboxing fixes: Sandbox bash/editor tools to rewrite paths into worktrees, isolate preflight in temp worktrees, fix baseline file generation, harden preprocess context, and stage only patch files during apply
  • COMMANDMENT parity: Shared section builders for Path A / Path B, fixing flag substitution, harness_path propagation, and kernel-profile wrapping
  • Docker/registry: Allowlist subagents/ in .dockerignore, fix registry discovery

Test plan

  • New test suites: test_gpu_manager.py, test_section_builders.py, test_task_parser.py, test_preprocess_v3_bugfixes.py
  • Run existing test suite (test_finalize_apply.py, test_mini.py) — note test_prune_old_runs.py was removed
  • End-to-end: run a multi-kernel parallel job with gpu_oversubscribe enabled, confirm lease acquisition/release and event log output
  • Verify v3 Path A sandboxing: agent writes stay in worktree, no mutations to original repo
  • The title is 63 chars, so it fits the 70-char limit. The summary groups the ~40 non-merge commits into the five logical themes that define this branch.

sdubagun-amd and others added 23 commits May 19, 2026 13:35
Introduce GPUManager that owns the GPU pool and schedules GpuJobs on
demand. Sub-agent count (M) and GPU count (N) are now independent —
M defaults to ceil(N * gpu_oversubscribe). Each GPU-bound operation
(test runs, profiling) acquires a GPU from the manager for the
subprocess duration only, then releases it.

Phase 1: GPUManager core + wiring through pipeline/dispatch/agents,
save_and_test test execution routed through GpuJob when manager exists.
Phase 2: Per-patch profiling os.environ race fixed — _run_patch_profile
now routes through run_profiler_with_handle (clean subprocess) via
GpuJob instead of mutating os.environ in-process.
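
A minimal sketch of the sizing rule and the acquire-for-the-subprocess-only pattern described above. `TinyGpuPool` and `default_agent_count` are hypothetical names for illustration and do not reproduce the real GPUManager, which also tracks leases, timeouts, and CPU pressure.

```python
import math
import queue
from contextlib import contextmanager


def default_agent_count(num_gpus: int, gpu_oversubscribe: float) -> int:
    """M = ceil(N * gpu_oversubscribe), with at least one agent."""
    return max(1, math.ceil(num_gpus * gpu_oversubscribe))


class TinyGpuPool:
    """Hypothetical stand-in for GPUManager: hands out GPU ids on demand."""

    def __init__(self, gpu_ids):
        self._free = queue.Queue()
        for gpu_id in gpu_ids:
            self._free.put(gpu_id)

    @contextmanager
    def acquire(self, timeout=None):
        # Blocks until a GPU is free; raises queue.Empty if the timeout expires.
        gpu_id = self._free.get(timeout=timeout)
        try:
            yield gpu_id
        finally:
            # Always hand the GPU back, even if the job raised.
            self._free.put(gpu_id)

    def free_count(self) -> int:
        return self._free.qsize()
```

An agent would hold the GPU only inside the `with pool.acquire() as gpu_id:` block that wraps the test or profiling subprocess, so idle agents never pin a device.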

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…bscription

When M agents > N GPUs, all M agents can fire LLM requests simultaneously
during their non-GPU phases. This adds a threading.Semaphore around
model.query() calls in both DefaultAgent and OptimizationAgent, capped
at max_concurrent_llm (defaults to num_parallel). Threaded from mini.py
through PipelineContext -> dispatch -> parallel_agent -> run_pool -> agent.
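
The concurrency cap can be sketched as a semaphore held for the duration of each query. `ThrottledModel` and `FakeModel` are hypothetical illustration classes; the commit itself places the semaphore inside DefaultAgent and OptimizationAgent rather than in a wrapper.

```python
import threading
import time


class ThrottledModel:
    """Hypothetical wrapper: at most max_concurrent_llm query() calls run at once."""

    def __init__(self, model, max_concurrent_llm: int):
        self._model = model
        self._sem = threading.Semaphore(max_concurrent_llm)

    def query(self, *args, **kwargs):
        # Callers beyond the cap block here until a slot frees up.
        with self._sem:
            return self._model.query(*args, **kwargs)


class FakeModel:
    """Test double that records the peak number of overlapping query() calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def query(self, prompt):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # simulate request latency
        with self._lock:
            self.active -= 1
        return f"echo: {prompt}"
```

Sharing one semaphore across all agent threads is what bounds the aggregate request rate; a per-agent semaphore would not.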

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…urrent_llm

Users can now set these knobs from task prompts (e.g., "use 2x
oversubscription", "cap LLM concurrency at 8"). The LLM extraction
prompt, normalization, config merge, and interactive editor all handle
the new fields. Task-extracted values override YAML defaults with
prompt > CLI > YAML priority.
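
One way to picture the prompt > CLI > YAML priority is a layered merge where later layers win and unset values are skipped. `merge_config` is a hypothetical helper, not the function used in the branch.

```python
def merge_config(yaml_cfg: dict, cli_cfg: dict, prompt_cfg: dict) -> dict:
    """Merge three config layers; priority is prompt > CLI > YAML."""
    merged = dict(yaml_cfg)
    # Apply lower-priority layers first so higher-priority ones overwrite them.
    for layer in (cli_cfg, prompt_cfg):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```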

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ee slot

Three fixes from real-run feedback:

1. parallel_helpers.py: Remove static GPU pinning (gpu_queue) from
   execute_task(). Agents no longer receive HIP_VISIBLE_DEVICES or
   GEAK_GPU_DEVICE in their env — GPU assignment happens per-job at
   benchmark time via GPUManager in save_and_test. max_workers changed
   from n_slots (GPU count) to n_tasks so all agents run concurrently.
   Worktree slot uses task_id directly instead of task_id % n_slots to
   avoid collisions when M > N.

2. gpu_manager.py: Change stats logging from INFO to DEBUG to reduce
   log noise (was emitting utilization every 30s even at 0%).

3. adapter.py: Resolve relative kernel_url against repo path, and pass
   repo to legacy resolver. Fixes "Kernel file not found" when
   kernel_url is relative and CWD differs from repo directory.
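
The worktree collision in item 1 follows directly from the modulo: with more tasks than slots, two tasks map to the same slot. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def modulo_slots(n_tasks: int, n_slots: int) -> list:
    """Old scheme: slot = task_id % n_slots."""
    return [task_id % n_slots for task_id in range(n_tasks)]


def direct_slots(n_tasks: int) -> list:
    """New scheme: slot = task_id."""
    return list(range(n_tasks))


def has_collision(slots: list) -> bool:
    """True when two tasks would share a worktree slot."""
    return len(set(slots)) != len(slots)
```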

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The SubagentRegistry's _default_root() walked up from the installed
package path looking for pyproject.toml + subagents/, which doesn't
exist in pip-installed environments. The harness-generator and
harness-verifier YAML specs were never found in containers.

Two fixes:
1. Dockerfile: COPY subagents/ into /workspace/subagents/
2. registry.py: Add /workspace/subagents/preprocess as a fallback
   discovery path for Docker containers, plus a GEAK_SUBAGENTS_ROOT
   env var for explicit override.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Agents in worktrees could write to the original repo when the LLM
passed an absolute path under GEAK_REPO_ROOT to the editor tool.
This silently corrupted the baseline and caused patch-apply failures
in the postprocessor. Redirect such paths to GEAK_WORK_DIR instead.
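
A minimal sketch of the redirect, assuming POSIX paths and purely lexical matching (no symlink or `..` resolution). `sandbox_path` here is a hypothetical free function, not the editor tool's actual method.

```python
from pathlib import PurePosixPath


def sandbox_path(path: str, repo_root: str, work_dir: str) -> str:
    """Redirect an absolute path under repo_root into work_dir."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return path  # relative paths already resolve against the worktree
    try:
        # Paths already inside the worktree are left alone, which matters
        # when the worktree itself lives under the repo root.
        p.relative_to(work_dir)
        return path
    except ValueError:
        pass
    try:
        rel = p.relative_to(repo_root)
    except ValueError:
        return path  # outside the repo: not this function's concern
    return str(PurePosixPath(work_dir) / rel)
```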

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The str_replace_editor sandbox alone was insufficient — agents
discovered they could bypass it by using bash commands (cat >, cp,
sed -i) to write directly to the original repo. Task_3 in the topk
run explicitly copied its optimized file to /sgl-workspace/aiter
after reasoning that "the test runs from the main repo."

Rewrite all GEAK_REPO_ROOT occurrences in bash commands to
GEAK_WORK_DIR. This is safe because agent bash commands (cd, python,
cp) work identically with the worktree path, and PYTHONPATH/run.sh
scripts read $GEAK_REPO_ROOT via shell expansion, not from the
command string.
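
The bash-side rewrite can be sketched as a literal substitution on the command string. `rewrite_repo_root` is a hypothetical helper, and the trailing boundary check is an added assumption to avoid touching sibling directories that merely share a prefix.

```python
import re


def rewrite_repo_root(command: str, repo_root: str, work_dir: str) -> str:
    """Replace literal occurrences of repo_root in a bash command with work_dir."""
    root = repo_root.rstrip("/")
    # The lookahead keeps "/repo" from matching inside "/repo2" or "/repo-old".
    pattern = re.compile(re.escape(root) + r"(?![\w.-])")
    # A callable replacement avoids backslash-escape processing of work_dir.
    return pattern.sub(lambda _match: work_dir, command)
```

A command that references `$GEAK_REPO_ROOT` by name is left unchanged, because only the literal path text is matched; the shell expands the variable later.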

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
… to worktree

The v3 preprocessor's `commandment_from_user_command` (Path A) had SETUP
set to `true`, skipping run.sh creation entirely. This caused agents to
run user commands without PYTHONPATH isolation, writing to the original
repo instead of their worktrees.

Match the legacy `generate_commandment_from_commands` contract:
- SETUP creates run.sh with PYTHONPATH and HIP_VISIBLE_DEVICES
- Section bodies prepend `cd ${GEAK_WORK_DIR} &&`
- Hardcoded repo-root paths in user commands rewritten to ${GEAK_WORK_DIR}

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…process time)

GEAK_REPO_ROOT is only set for agent subprocesses in parallel_helpers.py,
not during preprocessing. Use agent.config.repo instead, which is resolved
from the adapter's repo_root parameter and available when the commandment
is generated.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ewritten

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…acts

Replace `git add -A` with targeted staging of exactly the files the patch
touches (parsed from diff headers after artifact stripping). Falls back to
`git add -u` if the patch can't be parsed.

Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/)
from being accidentally committed alongside the patch, while still
handling new files created by translation (e.g. Triton → HIP).
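
Targeted staging could look roughly like this: read the `diff --git` headers, stage exactly those paths, and fall back to `git add -u` when nothing parses. The function names are hypothetical, and paths containing the text ` b/` are not handled by this simplified regex.

```python
import re

_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def patch_file_paths(patch_text: str) -> list:
    """Return the post-image paths named in a unified diff, in order, deduplicated."""
    paths = []
    for line in patch_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match and match.group(2) not in paths:
            paths.append(match.group(2))
    return paths


def staging_command(patch_text: str) -> list:
    """Build the git invocation that stages only what the patch touches."""
    paths = patch_file_paths(patch_text)
    if not paths:
        return ["git", "add", "-u"]  # unparseable patch: tracked files only
    return ["git", "add", "--", *paths]
```

Because new files appear in the diff headers too, they are staged explicitly, while untracked artifacts such as run.sh never are.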

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Two bugs in the Path-A short-circuit caused broken evaluation feedback:

1. When the LLM put all modes in modes_covered (instead of
   inferred_modes), the Benchmark/Profile/Full Benchmark sections
   got the raw --correctness command with no flag substitution.
   Fix: _substitute_mode_flag() deterministically replaces any
   mismatched harness flag regardless of mode categorization.

2. finish_preprocess unconditionally cleared harness_path on Path A,
   so between-rounds Metrix profiling was always skipped.
   Fix: _extract_harness_from_command() recovers the harness path
   from the user's command when it contains a standard harness flag,
   preserving it for the evaluation phase.
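
Both fixes can be sketched with two small regex helpers. The flag list is taken from the modes named in this PR; the function names and exact patterns are assumptions, not the branch's `_substitute_mode_flag` or `_extract_harness_from_command`.

```python
import re

# Whole-token match for the four standard harness mode flags.
_MODE_FLAG = re.compile(
    r"(?<!\S)--(?:correctness|full-benchmark|benchmark|profile)(?!\S)"
)
# "python <path>.py" or "python3 <path>.py" as whole tokens.
_HARNESS = re.compile(r"(?<!\S)python3?\s+(\S+\.py)(?!\S)")


def substitute_mode_flag(command: str, target_flag: str) -> str:
    """Replace any standard mode flag with target_flag; never append one."""
    return _MODE_FLAG.sub(lambda _match: target_flag, command)


def extract_harness(command: str):
    """Recover the harness path when the command carries a standard mode flag."""
    if not _MODE_FLAG.search(command):
        return None
    match = _HARNESS.search(command)
    return match.group(1) if match else None
```

Leaving flagless commands untouched mirrors the later decision in this PR to revert the flag-append fallback.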

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…n.sh wrapping

The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't
told the harness supports all four modes, causing it to miscategorize
modes and produce wrong flags, and (2) commands used bare `cd && python`
instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.

- Enrich harness hint in adapter.py to tell the LLM the harness is
  pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated
  harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ent_baseline

preflight_commandment_contract now runs in a disposable git worktree so
SETUP side-effects (run.sh, JIT caches) never dirty the original repo.
recapture_commandment_baseline calls removed — the preprocessor baseline
is the single source of truth.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…file collection

Three fixes for the verified-speedup evaluation pipeline:

1. Path A now calls collect_baseline and collect_profile when a standard
   harness is available — previously skipped entirely, breaking downstream
   verified-speedup computation.

2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt
   from BaselineMetrics.raw_outputs — previously hardcoded to None.

3. Full-benchmark baseline uses --full-benchmark stdout (via
   capture_full_benchmark_stdout) so the config set matches the
   postprocess evaluator's FULL_BENCHMARK run.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Path A's commandment_from_user_command generated a bare `--profile`
flag substitution, missing the warmup + kernel-profile wrapper that
Path B uses. Without the wrapper, profile.json is never written and
post-round evaluation cannot access hardware counter data.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Extract build_correctness_body, build_profile_body, build_benchmark_body,
build_full_benchmark_body, warmup_block, and strip_mode_flags into a new
zero-dependency module (run/section_builders.py). Both Path A
(commandment_from_user_command) and Path B (_generate_simple,
_generate_inner_kernel) now call these shared builders, eliminating four
divergences: missing GEAK_BENCHMARK_EXTRA_ARGS, --benchmark vs
--full-benchmark mismatch, hardcoded warmup count, and hardcoded
profile replays.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Resolve conflicts in finalize_apply.py, adapter.py, and tools.py:
- finalize_apply: take gwiab's refactored structure, drop redundant _extract_patch_file_paths call
- adapter: keep gwiab-scheduler's separate benchmark/full_benchmark baseline paths, remove dead _write_benchmark_baseline
- tools: keep gwiab-scheduler's section_builders approach, adopt gwiab's bash -lc exec in run.sh

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Addresses B1-B3, B7-B9 in the GPU scheduler:
- B1: _run_job checks fut.cancelled() before executing, wraps set_result/set_exception with InvalidStateError handling
- B2: Split timeout into execution timeout (lease deadline) and queue_timeout (caller wait). Queue wait no longer eats into execution time.
- B3: Lease-based release with double-release detection replaces raw GPU list release
- B7: CPU pressure gate checks os.getloadavg() before dispatching GPU jobs
- B8: Lease reaper thread kills hung subprocesses and reclaims GPUs from expired leases
- B9: Dispatcher try/finally drains remaining queue items on every exit path

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add observability to GPU manager: lifecycle events (queued, leased,
started, completed/failed, released) written to optional JSONL file
and logger.debug. Per-outcome counters (succeeded, failed, reaped,
cancelled) replace opaque total_completed in stats.
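
The event log and counters might be sketched as below. The record fields (`ts`, `event`, `job_id`) and the `EventLog` class are assumptions for illustration; only the event and outcome names come from the commit message.

```python
import json
import threading
import time
from collections import Counter

OUTCOMES = ("succeeded", "failed", "reaped", "cancelled")


class EventLog:
    """Append lifecycle events to an optional JSONL file and count outcomes."""

    def __init__(self, path=None):
        self._path = path
        self._lock = threading.Lock()
        self.counters = Counter()

    def emit(self, event: str, **fields) -> dict:
        record = {"ts": time.time(), "event": event, **fields}
        with self._lock:
            if event in OUTCOMES:
                self.counters[event] += 1
            if self._path is not None:
                # One JSON object per line keeps the file greppable and appendable.
                with open(self._path, "a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")
        return record
```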

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
GPUManager now writes gpu_events.jsonl to the run output directory and
uses a GPU-count-proportional CPU pressure threshold instead of the
default 0.8 * cpu_count (which was ~307 on 384-core machines).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
# Conflicts:
#	src/minisweagent/run/preprocess_v3/adapter.py
#	src/minisweagent/run/preprocess_v3/tools.py
@sdubagun-amd sdubagun-amd requested review from Umangatamd and yueliu14 and removed request for Umangatamd May 20, 2026 16:12
Umangatamd and others added 3 commits May 20, 2026 16:42
…ernel_url drops the extension

parse_task_info's LLM extractor occasionally returns a bare basename
(e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check
clears kernel_url when the bare name doesn't resolve on disk, but
kernel_name (also bare) leaks back through mini.py's
`kernel_target = ... or parsed_config.get("kernel_name")` fallback,
re-entering _resolve_kernel_and_repo as a bare-name kernel_url.

Without this fix, _resolve_kernel_and_repo's repo-relative candidate
(repo/kernel_url) is not a file, and the legacy URL resolver also
doesn't try extensions, so the run dies with
"resolve-kernel-url failed: Kernel file not found: <repo>/<bare>".

When the candidate has no suffix, probe each extension from
_KERNEL_TYPE_TO_EXT under the repo and promote if exactly one
matches. Refuse to guess when multiple match.
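
The probing rule can be sketched as follows. `KERNEL_EXTS` is a placeholder tuple; the actual extensions come from `_KERNEL_TYPE_TO_EXT`, whose contents are not shown here, and `probe_extension` is a hypothetical name.

```python
from pathlib import Path

# Placeholder values; the real list is derived from _KERNEL_TYPE_TO_EXT.
KERNEL_EXTS = (".py", ".hip", ".cu")


def probe_extension(repo, bare_name, exts=KERNEL_EXTS):
    """Promote a suffix-less kernel name to a file only if exactly one extension matches."""
    candidate = Path(repo) / bare_name
    if candidate.suffix:
        return candidate if candidate.is_file() else None
    matches = [candidate.with_name(candidate.name + ext) for ext in exts]
    matches = [m for m in matches if m.is_file()]
    # Zero matches: nothing to promote. Several matches: refuse to guess.
    return matches[0] if len(matches) == 1 else None
```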

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit bb36462)
…r from LLM-hallucinated argument names

Observed failure: on a silu run, the orchestrator LLM called
`commandment_from_user_command` six times with hallucinated keyword
names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`,
`kernel_path`) before giving up, burning ~28 minutes of preprocess
budget. The tool's TypeError reply included only `<type>: <message>`,
which named the *bad* argument but never the correct one.

Three changes:

1. tools.py — `_schema_commandment_from_user_command`: add a STRICT
   ARGUMENT NAMING block to both the tool description and the per-property
   descriptions, explicitly listing the synonyms the LLM was inventing
   and naming the canonical arg (`run_command`, `out_path`).

2. orchestrator.py Case A prompt: add a fenced example of the exact
   keyword-arg signature with the same do-NOT list.

3. orchestrator.py `_dispatch_tool`: on any tool exception, return a
   structured error containing the canonical schema's
   `expected_arguments` and `required_arguments`, the names actually
   passed, and the traceback tail. For TypeError specifically, add an
   explicit hint reminding the LLM that the schema is authoritative.
   This gives the LLM enough signal to self-correct on the next turn
   instead of cycling through more synonyms.
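
A rough sketch of the structured error from item 3. Here the expected and required names are derived from the Python signature via `inspect`, which is an assumption; the commit describes taking them from the canonical tool schema. `dispatch_tool` and the hint wording are illustrative.

```python
import inspect
import traceback


def dispatch_tool(func, kwargs: dict) -> dict:
    """Call a tool; on failure, report which argument names it actually accepts."""
    try:
        return {"ok": True, "result": func(**kwargs)}
    except Exception as exc:
        params = inspect.signature(func).parameters
        variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
        error = {
            "ok": False,
            "error": f"{type(exc).__name__}: {exc}",
            "expected_arguments": list(params),
            "required_arguments": [
                name
                for name, p in params.items()
                if p.default is inspect.Parameter.empty and p.kind not in variadic
            ],
            "passed_arguments": sorted(kwargs),
            "traceback_tail": traceback.format_exc().splitlines()[-3:],
        }
        if isinstance(exc, TypeError):
            error["hint"] = "The schema is authoritative: use only expected_arguments."
        return error
```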

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 689d602)
…th A

Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working
on v3 by surfacing existing legacy modules at v3 call sites. No code
duplicated; only call sites added.

1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
   When the user's run_command is a compound shell pipeline (e.g.
   `python3 scripts/task_runner.py compile && correctness && performance`)
   without any GEAK harness flag, mirror legacy
   `resolve_shell_eval_commands` (rsplit on last &&) and call
   `eval_contract_adapter.materialize_shell_contract_harness` to write
   `_geak_shell_contract_harness.py` exposing the standard 4-mode CLI.
   This unblocks the legacy AKA prompt that previously died with
   "v3 preprocess failed: No harness_path available".

2) Static `validate_harness` gate at two call sites
   - `commandment_from_user_command`: validate user-supplied or synthesized
     harness; reject malformed paths so finish_preprocess doesn't silently
     thread bogus paths downstream.
   - `adapter._recover_harness_path`: validate the path picked by
     legacy `extract_harness_path` so a greedy match on `task_runner.py`
     is rejected instead of breaking profile/benchmark.

3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
   `collect_baseline_metrics` now runs `--correctness` once with a short
   timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before
   the expensive benchmark loop. Broken kernels fail in seconds instead of
   minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.

4) Compile-command extraction for synthesized harness
   `_try_synthesize_shell_contract_harness` calls legacy
   `contract_normalize.infer_compile_command_from_eval` to extract the
   build prefix and re-prepend it to the performance shell so a standalone
   `--benchmark` invocation rebuilds when needed.

5) `build_baseline_metrics` enrichment in `_project_baseline`
   When a profile result is also available, project legacy
   `build_baseline_metrics(include_all=True)` keys (`bottleneck`,
   `top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`)
   into the baseline_metrics dict. Restores fields that
   `inject_pipeline_context` consumes downstream which were silently empty
   on v3.

The legacy modules involved (`eval_contract_adapter`, `harness_utils`,
`contract_normalize`, `baseline.build_baseline_metrics`) are already in
tree; v3 just wasn't calling them.
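
The correctness gate from item 3 could be sketched like this, using the two environment variables named above. `correctness_gate` is a hypothetical function, and the `env` parameter is only a lookup table for the knobs; the child process still inherits the real environment.

```python
import os
import subprocess


def correctness_gate(command, env=None) -> bool:
    """Run the --correctness command once, with a short timeout, before benchmarking."""
    env = os.environ if env is None else env
    if env.get("GEAK_SKIP_CORRECTNESS_GATE") == "1":
        return True  # explicit bypass
    timeout_s = float(env.get("GEAK_CORRECTNESS_GATE_TIMEOUT", "120"))
    try:
        proc = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return False  # a hung kernel counts as a failed gate
    return proc.returncode == 0
```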

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f)
(cherry picked from commit 64f3fa4)
Base automatically changed from gwiab to main May 21, 2026 11:21
sdubagun-amd and others added 2 commits May 21, 2026 11:29
Resolve conflicts in geak.yaml, dispatch.py, adapter.py, orchestrator.py,
and tools.py — mostly formatting from ruff lint pass on main. Keep
gwiab-scheduler's section_builders approach and GPU scheduler config.

Remove unused top-level shlex import in tools.py (inline import remains).
Remove duplicate _write_benchmark_baseline dead code in adapter.py.
Update test_path_a_commandment to accept section_builders output format.

All tests pass (1036 passed, 39 skipped, 2 xfailed).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- gpu_manager.py: remove unused `field` import, apply ruff formatting
- str_replace_editor.py: remove duplicate `_sandbox_path` staticmethod
  (instance method kept — it supports `_env_override`)
- test_gpu_manager.py: replace time.sleep synchronization with
  threading.Event barriers to eliminate race conditions under
  pytest-xdist on CI; increase future timeouts from 10s to 30s

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
sdubagun-amd and others added 8 commits May 27, 2026 11:07
_extract_harness_from_command only matches `python|python3 <path>` and
returns just the path. The run.sh template execs its args via
`bash -lc "$*"`, which would try to exec the .py directly (Permission
denied, rc=126) without an interpreter prefix.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Compare loadavg / cpu_count against the threshold instead of raw
loadavg, so the gate behaves the same on a 4-core dev box and a
256-core server. Raw loadavg is meaningless without normalizing by
cpu_count: a load of 23 means oversubscribed on 4 cores, ~9% utilized
on 256.

The previous mini.py default of max(num_gpus * 4, 8) silently wedged
single-GPU runs on large hosts where baseline loadavg comfortably
exceeds 8. Drop the YAML knob; the threshold is now an internal
constant (0.8 = ~80% saturated). Warn if a config still sets it.
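
The normalized gate amounts to dividing the 1-minute load average by the core count before comparing. The function names are hypothetical; `os.getloadavg()` is only consulted when no value is passed and is unavailable on Windows.

```python
import os

CPU_PRESSURE_THRESHOLD = 0.8  # roughly 80% saturated


def cpu_pressure(loadavg_1m: float, cpu_count: int) -> float:
    """Load average normalized by core count."""
    return loadavg_1m / max(1, cpu_count)


def should_dispatch(loadavg_1m=None, cpu_count=None,
                    threshold=CPU_PRESSURE_THRESHOLD) -> bool:
    """True when normalized CPU pressure is below the threshold."""
    if loadavg_1m is None:
        loadavg_1m = os.getloadavg()[0]
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return cpu_pressure(loadavg_1m, cpu_count) < threshold
```

With a load of 23, the gate blocks on 4 cores (pressure 5.75) and passes on 256 cores (pressure about 0.09), matching the example in the message.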

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
build_pool() previously received len(ctx.gpu_ids), so a run with
--num-parallel=4 --gpu-ids=0 passed num_gpus=1 and tripped the
single-slot early-return — the LLM planner was skipped entirely and
the dispatcher replicated one canonical task across all four
subagents with no strategy diversity. Pass n_workers instead and
rename num_gpus to num_slots end-to-end so the parameter name no
longer lies about what it counts.

task_planner.build_pool now always appends canonical_fixed to the
candidate pool when the LLM planner runs, so pool.fixed is never
empty. selector.py's pad branch is simplified to assert this
invariant and dispatch fixed[0].body verbatim — removing the
prior task="" fallback that silently sent empty bodies to subagents.

Verified by single-GPU + 4-subagent smoke: round 1 produced 2
planned + 1 canonical + 1 canonical-fill (byte-identical body),
zero fixed-pad-* tasks.
Three tightenings in the task-generation prompt: (1) require reads
of COMMANDMENT.md when its path is provided instead of marking it
"Optionally", (2) replace "must be rejected" with explicit zero-
exception revert language in the COMMANDMENT-adherence rule, (3)
append the same revert-and-report-as-failure consequence to the
per-task verification checklist. Pure prompt-text changes; no
behavioural logic touched.
Auto-fix from `ruff format` to satisfy the pre-commit / CI check.
No semantic change.
… bodies

When the user-provided run_command is a compound shell pipeline using a
subcommand-style runner (e.g. AKA's hip2hip pattern
`task_runner.py compile && correctness && performance`),
`_extract_harness_from_command` returns None because the command has no
GEAK mode flag to anchor on. The Path-A tool then correctly synthesizes
a 4-mode shell-contract wrapper via the legacy
`materialize_shell_contract_harness` — but the section-body builder
re-extracted from the same `cmd`, got None again, and fell through to
`strip_mode_flags(cmd)` + `build_correctness_body`, which appends
`--correctness` to the entire compound. The result was a COMMANDMENT
whose Correctness section ran `task_runner.py performance --correctness`,
which the bare subcommand-style runner rejected with
"unrecognized arguments: --correctness", killing preflight before any
sub-agent could run.

Fix: track the synthesized wrapper across the section-builder branch
and route the COMMANDMENT body through it (`run.sh python3 <wrapper>`)
when it exists. The wrapper accepts the 4-mode flags and internally
dispatches the user's subcommand-style runner, so preflight + every
later eval call now exit 0.

Also: make the wrapper honor `GEAK_WORK_DIR` (falling back to the
synthesis-time-baked `REPO_ROOT`) so preflight worktree isolation and
agent-side worktrees are preserved instead of always executing in the
original repo dir.

Verified by single-GPU + 4-subagent HIP smoke (hip2hip/others/knn):
preflight_commandment_contract: PASS, Round 1/5 (mode=mixed,
workers=4) launched cleanly.
Route all Python model-name fallbacks through get_model_name so
GEAK_MODEL is honored consistently. Previously pipeline_helpers /
harness_utils pre-resolved GEAK_MODEL into input_model_name, which
inverted get_model_name's priority chain and let env override YAML pins.
Now the helpers pass model_name straight through to get_model and let
get_model_name decide: explicit arg > config dict > GEAK_MODEL >
MSWEA_MODEL_NAME > fallback.
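
The priority chain reads naturally as "first non-empty candidate wins". This is a sketch only: `resolve_model_name` and the `model_name` config key are assumptions, not the signature of `get_model_name`.

```python
import os


def resolve_model_name(explicit=None, config=None, env=None,
                       fallback="claude-opus-4.7") -> str:
    """explicit arg > config dict > GEAK_MODEL > MSWEA_MODEL_NAME > fallback."""
    env = os.environ if env is None else env
    candidates = (
        explicit,
        (config or {}).get("model_name"),  # hypothetical config key
        env.get("GEAK_MODEL"),
        env.get("MSWEA_MODEL_NAME"),
        fallback,
    )
    # The fallback is always non-empty, so next() cannot exhaust the generator.
    return next(c for c in candidates if c)
```

Pre-resolving `GEAK_MODEL` into the explicit argument, as the helpers did before, moves the environment variable to the front of this chain, which is the inversion the commit removes.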

Also bump the hardcoded default and the six main-agent YAML pins from
claude-opus-4.6 to claude-opus-4.7. Pipeline-worker subagent configs
(translator, kernel_analysis, harness_builder, swebench, reverse_kl,
pytorch_translation) keep their intentional pins.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
3 participants