GPU scheduler by sdubagun-amd · Pull Request #237 · AMD-AGI/GEAK

sdubagun-amd · 2026-05-20T15:58:13Z

Summary

- GPU lease scheduler: New centralized GPUManager that decouples agent thread count from GPU count, with lease-based tracking, a background reaper for expired leases, CPU pressure gating, split queue/exec timeouts, and LLM concurrency caps to prevent TPM/RPM blowout under oversubscription
- JSONL event logging & counters: Per-run structured event log (event_log_path) with per-outcome counters for observability (Phase 2)
- v3 preprocessor & sandboxing fixes: Sandbox bash/editor tools to rewrite paths into worktrees, isolate preflight in temp worktrees, fix baseline file generation, harden preprocess context, and stage only patch files during apply
- COMMANDMENT parity: Shared section builders for Path A / Path B, fixing flag substitution, harness_path propagation, and kernel-profile wrapping
- Docker/registry: Allowlist subagents/ in .dockerignore, fix registry discovery

Test plan

- New test suites: test_gpu_manager.py, test_section_builders.py, test_task_parser.py, test_preprocess_v3_bugfixes.py
- Run existing test suite (test_finalize_apply.py, test_mini.py); note that test_prune_old_runs.py was removed
- End-to-end: run a multi-kernel parallel job with gpu_oversubscribe enabled, confirm lease acquisition/release and event log output
- Verify v3 Path A sandboxing: agent writes stay in worktree, no mutations to original repo

Introduce GPUManager that owns the GPU pool and schedules GpuJobs on demand. Sub-agent count (M) and GPU count (N) are now independent — M defaults to ceil(N * gpu_oversubscribe). Each GPU-bound operation (test runs, profiling) acquires a GPU from the manager for the subprocess duration only, then releases it. Phase 1: GPUManager core + wiring through pipeline/dispatch/agents, save_and_test test execution routed through GpuJob when manager exists. Phase 2: Per-patch profiling os.environ race fixed — _run_patch_profile now routes through run_profiler_with_handle (clean subprocess) via GpuJob instead of mutating os.environ in-process. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
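The relationship between M and N, and the acquire-for-the-subprocess-only pattern, can be illustrated with a minimal sketch. `default_agent_count` and `TinyGpuPool` are hypothetical names invented for this example; the real GPUManager and GpuJob interfaces are not shown on this page and may differ.

```python
import math
import queue
from contextlib import contextmanager


def default_agent_count(num_gpus: int, gpu_oversubscribe: float = 1.0) -> int:
    """M = ceil(N * gpu_oversubscribe), as described in the commit message."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    return max(1, math.ceil(num_gpus * gpu_oversubscribe))


class TinyGpuPool:
    """Hypothetical stand-in for a GPU pool: hands out one GPU id per job."""

    def __init__(self, gpu_ids):
        self._free = queue.Queue()
        for gpu_id in gpu_ids:
            self._free.put(gpu_id)

    @contextmanager
    def lease(self, timeout=None):
        # Blocks until a GPU is free; raises queue.Empty if the timeout expires.
        gpu_id = self._free.get(timeout=timeout)
        try:
            yield gpu_id
        finally:
            # The GPU goes back to the pool even if the job raised.
            self._free.put(gpu_id)
```

An agent would hold the lease only around the GPU-bound call (for example `with pool.lease() as gpu_id: run_test(gpu_id)`, where `run_test` is also hypothetical), so M agents can share N GPUs.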

…bscription When M agents > N GPUs, all M agents can fire LLM requests simultaneously during their non-GPU phases. This adds a threading.Semaphore around model.query() calls in both DefaultAgent and OptimizationAgent, capped at max_concurrent_llm (defaults to num_parallel). Threaded from mini.py through PipelineContext -> dispatch -> parallel_agent -> run_pool -> agent. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
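A minimal sketch of the semaphore cap described above, assuming a generic callable in place of `model.query()`. `LlmGate` is a hypothetical wrapper name; the PR places the semaphore inside the agents themselves.

```python
import threading


class LlmGate:
    """Caps the number of in-flight LLM calls across agent threads."""

    def __init__(self, max_concurrent_llm: int):
        if max_concurrent_llm < 1:
            raise ValueError("max_concurrent_llm must be >= 1")
        self._sem = threading.Semaphore(max_concurrent_llm)

    def query(self, model_query, *args, **kwargs):
        # Threads beyond the cap block here until a slot frees up.
        with self._sem:
            return model_query(*args, **kwargs)
```

Sharing a single gate instance across all agents is what makes the cap global rather than per-agent.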

…urrent_llm Users can now set these knobs from task prompts (e.g., "use 2x oversubscription", "cap LLM concurrency at 8"). The LLM extraction prompt, normalization, config merge, and interactive editor all handle the new fields. Task-extracted values override YAML defaults with prompt > CLI > YAML priority. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
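The stated priority (prompt > CLI > YAML) amounts to a first-non-None lookup per knob. The sketch below assumes each source is a plain dict; `merge_knobs` is a hypothetical helper, not the PR's config-merge code.

```python
def merge_knobs(yaml_cfg, cli_cfg, prompt_cfg,
                keys=("gpu_oversubscribe", "max_concurrent_llm")):
    """Pick each knob from the highest-priority source that sets it."""
    merged = {}
    for key in keys:
        # Sources are listed from highest to lowest priority.
        for source in (prompt_cfg, cli_cfg, yaml_cfg):
            value = source.get(key)
            if value is not None:
                merged[key] = value
                break
    return merged
```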

…ee slot
Three fixes from real-run feedback:
1. parallel_helpers.py: Remove static GPU pinning (gpu_queue) from execute_task(). Agents no longer receive HIP_VISIBLE_DEVICES or GEAK_GPU_DEVICE in their env; GPU assignment happens per-job at benchmark time via GPUManager in save_and_test. max_workers changed from n_slots (GPU count) to n_tasks so all agents run concurrently. Worktree slot uses task_id directly instead of task_id % n_slots to avoid collisions when M > N.
2. gpu_manager.py: Change stats logging from INFO to DEBUG to reduce log noise (was emitting utilization every 30s even at 0%).
3. adapter.py: Resolve relative kernel_url against repo path, and pass repo to legacy resolver. Fixes "Kernel file not found" when kernel_url is relative and CWD differs from repo directory.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The SubagentRegistry's _default_root() walked up from the installed package path looking for pyproject.toml + subagents/, which doesn't exist in pip-installed environments. The harness-generator and harness-verifier YAML specs were never found in containers.
Two fixes:
1. Dockerfile: COPY subagents/ into /workspace/subagents/
2. registry.py: Add /workspace/subagents/preprocess as a fallback discovery path for Docker containers, plus a GEAK_SUBAGENTS_ROOT env var for explicit override.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
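One way to picture the discovery order is an explicit override followed by a list of fallback directories. This is a sketch under assumptions: `find_subagents_root` is a hypothetical function, and the actual order and validation in registry.py are not shown here.

```python
import os
from pathlib import Path


def find_subagents_root(candidates, env=None):
    """Return the override if set, else the first candidate that exists."""
    env = os.environ if env is None else env
    override = env.get("GEAK_SUBAGENTS_ROOT")
    if override:
        # Assumption: an explicit override wins without further checks.
        return Path(override)
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None
```

In a container, the candidate list would end with `/workspace/subagents/preprocess`, the fallback path named in the commit.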

Agents in worktrees could write to the original repo when the LLM passed an absolute path under GEAK_REPO_ROOT to the editor tool. This silently corrupted the baseline and caused patch-apply failures in the postprocessor. Redirect such paths to GEAK_WORK_DIR instead. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The str_replace_editor sandbox alone was insufficient — agents discovered they could bypass it by using bash commands (cat >, cp, sed -i) to write directly to the original repo. Task_3 in the topk run explicitly copied its optimized file to /sgl-workspace/aiter after reasoning that "the test runs from the main repo." Rewrite all GEAK_REPO_ROOT occurrences in bash commands to GEAK_WORK_DIR. This is safe because agent bash commands (cd, python, cp) work identically with the worktree path, and PYTHONPATH/run.sh scripts read $GEAK_REPO_ROOT via shell expansion, not from the command string. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
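The command rewrite can be sketched as a guarded string substitution. `sandbox_command` is a hypothetical name, and the two guards (skipping text that already points at the worktree, and not matching sibling directories such as `/repo2`) are assumptions of this example rather than documented behavior of the PR.

```python
import re


def sandbox_command(command: str, repo_root: str, work_dir: str) -> str:
    """Rewrite literal repo-root paths in a bash command to the worktree."""
    repo_root = repo_root.rstrip("/")
    work_dir = work_dir.rstrip("/")
    if not repo_root or not work_dir or repo_root == work_dir:
        return command
    # Match repo_root only when it is not the prefix of a longer name.
    pattern = re.compile(re.escape(repo_root) + r"(?![\w.-])")
    # Split on work_dir first so a worktree nested under the repo root
    # is never rewritten a second time.
    parts = command.split(work_dir)
    return work_dir.join(pattern.sub(lambda _m: work_dir, p) for p in parts)
```

A literal `$GEAK_REPO_ROOT` in the command is left alone by this sketch, since only the expanded path string is matched.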

… to worktree
The v3 preprocessor's `commandment_from_user_command` (Path A) had SETUP set to `true`, skipping run.sh creation entirely. This caused agents to run user commands without PYTHONPATH isolation, writing to the original repo instead of their worktrees.
Match the legacy `generate_commandment_from_commands` contract:
- SETUP creates run.sh with PYTHONPATH and HIP_VISIBLE_DEVICES
- Section bodies prepend `cd ${GEAK_WORK_DIR} &&`
- Hardcoded repo-root paths in user commands rewritten to ${GEAK_WORK_DIR}
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…process time) GEAK_REPO_ROOT is only set for agent subprocesses in parallel_helpers.py, not during preprocessing. Use agent.config.repo instead, which is resolved from the adapter's repo_root parameter and available when the commandment is generated. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…ewritten Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…acts Replace `git add -A` with targeted staging of exactly the files the patch touches (parsed from diff headers after artifact stripping). Falls back to `git add -u` if the patch can't be parsed. Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/) from being accidentally committed alongside the patch, while still handling new files created by translation (e.g. Triton → HIP). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
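Targeted staging reduces to parsing `diff --git` headers and falling back when none are found. The sketch below uses hypothetical helper names and a simplified header regex (paths containing ` b/` would be ambiguous); it only builds the argument list and does not run git.

```python
import re

_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def files_touched_by_patch(patch_text: str) -> list:
    """Collect destination paths from 'diff --git a/X b/Y' headers."""
    paths = []
    for line in patch_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match and match.group(2) not in paths:
            paths.append(match.group(2))
    return paths


def staging_command(patch_text: str) -> list:
    """Stage only the patch's files; fall back to tracked-only staging."""
    paths = files_touched_by_patch(patch_text)
    if not paths:
        # Unparseable patch: 'git add -u' still skips untracked artifacts.
        return ["git", "add", "-u"]
    return ["git", "add", "--", *paths]
```

Because new files appear in the diff headers too, they are staged explicitly, which `git add -u` alone would miss.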

Two bugs in the Path-A short-circuit caused broken evaluation feedback:
1. When the LLM put all modes in modes_covered (instead of inferred_modes), the Benchmark/Profile/Full Benchmark sections got the raw --correctness command with no flag substitution. Fix: _substitute_mode_flag() deterministically replaces any mismatched harness flag regardless of mode categorization.
2. finish_preprocess unconditionally cleared harness_path on Path A, so between-rounds Metrix profiling was always skipped. Fix: _extract_harness_from_command() recovers the harness path from the user's command when it contains a standard harness flag, preserving it for the evaluation phase.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…n.sh wrapping
The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't told the harness supports all four modes, causing it to miscategorize modes and produce wrong flags, and (2) commands used bare `cd && python` instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.
- Enrich harness hint in adapter.py to tell the LLM the harness is pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
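Deterministic flag substitution without an append fallback might look like the sketch below. The four flag names are the ones mentioned in these commit messages; `substitute_mode_flag` is a hypothetical reimplementation, not the PR's `_substitute_mode_flag`.

```python
import re

MODE_FLAGS = ("--correctness", "--profile", "--benchmark", "--full-benchmark")

# Match a mode flag only when it stands alone as a whitespace-delimited token.
_MODE_FLAG = re.compile(
    r"(?<!\S)--(?:correctness|profile|benchmark|full-benchmark)(?!\S)"
)


def substitute_mode_flag(command: str, target_flag: str) -> str:
    """Replace any existing mode flag with target_flag; never append one."""
    if target_flag not in MODE_FLAGS:
        raise ValueError(f"unknown mode flag: {target_flag}")
    return _MODE_FLAG.sub(lambda _m: target_flag, command)
```

A command with no mode flag comes back unchanged, which mirrors the decision to revert the flag-append fallback.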

…ent_baseline preflight_commandment_contract now runs in a disposable git worktree so SETUP side-effects (run.sh, JIT caches) never dirty the original repo. recapture_commandment_baseline calls removed — the preprocessor baseline is the single source of truth. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…file collection
Three fixes for the verified-speedup evaluation pipeline:
1. Path A now calls collect_baseline and collect_profile when a standard harness is available; previously this was skipped entirely, breaking downstream verified-speedup computation.
2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt from BaselineMetrics.raw_outputs; previously these were hardcoded to None.
3. Full-benchmark baseline uses --full-benchmark stdout (via capture_full_benchmark_stdout) so the config set matches the postprocess evaluator's FULL_BENCHMARK run.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Path A's commandment_from_user_command generated a bare `--profile` flag substitution, missing the warmup + kernel-profile wrapper that Path B uses. Without the wrapper, profile.json is never written and post-round evaluation cannot access hardware counter data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Extract build_correctness_body, build_profile_body, build_benchmark_body, build_full_benchmark_body, warmup_block, and strip_mode_flags into a new zero-dependency module (run/section_builders.py). Both Path A (commandment_from_user_command) and Path B (_generate_simple, _generate_inner_kernel) now call these shared builders, eliminating four divergences: missing GEAK_BENCHMARK_EXTRA_ARGS, --benchmark vs --full-benchmark mismatch, hardcoded warmup count, and hardcoded profile replays. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Resolve conflicts in finalize_apply.py, adapter.py, and tools.py:
- finalize_apply: take gwiab's refactored structure, drop redundant _extract_patch_file_paths call
- adapter: keep gwiab-scheduler's separate benchmark/full_benchmark baseline paths, remove dead _write_benchmark_baseline
- tools: keep gwiab-scheduler's section_builders approach, adopt gwiab's bash -lc exec in run.sh
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Addresses B1-B3, B7-B9 in the GPU scheduler:
- B1: _run_job checks fut.cancelled() before executing, wraps set_result/set_exception with InvalidStateError handling
- B2: Split timeout into execution timeout (lease deadline) and queue_timeout (caller wait). Queue wait no longer eats into execution time.
- B3: Lease-based release with double-release detection replaces raw GPU list release
- B7: CPU pressure gate checks os.getloadavg() before dispatching GPU jobs
- B8: Lease reaper thread kills hung subprocesses and reclaims GPUs from expired leases
- B9: Dispatcher try/finally drains remaining queue items on every exit path
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
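The lease ideas in B2, B3, and B8 can be sketched with a small table keyed by lease id. `LeaseTable` is a hypothetical class for illustration; the real reaper also kills the hung subprocess, which this sketch omits.

```python
import itertools
import threading
import time


class LeaseTable:
    """Tracks GPU leases with deadlines and detects double release."""

    def __init__(self, gpu_ids):
        self._lock = threading.Lock()
        self._free = list(gpu_ids)
        self._leases = {}  # lease_id -> (gpu_id, deadline)
        self._ids = itertools.count(1)

    def acquire(self, exec_timeout, now=None):
        """Return a lease id, or None if no GPU is free right now."""
        now = time.monotonic() if now is None else now
        with self._lock:
            if not self._free:
                return None
            gpu_id = self._free.pop(0)
            lease_id = next(self._ids)
            # The deadline starts at lease time, so queue wait is not counted.
            self._leases[lease_id] = (gpu_id, now + exec_timeout)
            return lease_id

    def release(self, lease_id):
        """Return False on a double release or an already-reaped lease."""
        with self._lock:
            entry = self._leases.pop(lease_id, None)
            if entry is None:
                return False
            self._free.append(entry[0])
            return True

    def reap_expired(self, now=None):
        """Reclaim GPUs whose lease deadline has passed."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expired = [lid for lid, (_, dl) in self._leases.items() if dl <= now]
            for lid in expired:
                gpu_id, _ = self._leases.pop(lid)
                self._free.append(gpu_id)
            return expired
```

Releasing by lease id rather than by GPU id is what lets a late release from a reaped job be detected instead of returning the same GPU to the pool twice.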

Add observability to GPU manager: lifecycle events (queued, leased, started, completed/failed, released) written to optional JSONL file and logger.debug. Per-outcome counters (succeeded, failed, reaped, cancelled) replace opaque total_completed in stats. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
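A JSONL event log with per-outcome counters can be sketched in a few lines. The record fields (`ts`, `event`, `job_id`, `outcome`) and the `EventLog` class are assumptions made for this example; the actual schema of the PR's event log is not shown on this page.

```python
import json
import threading
import time
from collections import Counter


class EventLog:
    """Appends one JSON object per lifecycle event and counts outcomes."""

    def __init__(self, path=None):
        self._path = path  # None disables file output
        self._lock = threading.Lock()
        self.counters = Counter()

    def emit(self, event, job_id, outcome=None, **fields):
        record = {"ts": time.time(), "event": event, "job_id": job_id, **fields}
        if outcome is not None:
            record["outcome"] = outcome
        with self._lock:
            if outcome is not None:
                # e.g. succeeded / failed / reaped / cancelled
                self.counters[outcome] += 1
            if self._path is not None:
                with open(self._path, "a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")
```

One line per event keeps the file appendable from multiple threads and easy to filter with line-oriented tools.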

GPUManager now writes gpu_events.jsonl to the run output directory and uses a GPU-count-proportional CPU pressure threshold instead of the default 0.8 * cpu_count (which was ~307 on 384-core machines). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

# Conflicts: # src/minisweagent/run/preprocess_v3/adapter.py # src/minisweagent/run/preprocess_v3/tools.py

…ernel_url drops the extension parse_task_info's LLM extractor occasionally returns a bare basename (e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check clears kernel_url when the bare name doesn't resolve on disk, but kernel_name (also bare) leaks back through mini.py's `kernel_target = ... or parsed_config.get("kernel_name")` fallback, re-entering _resolve_kernel_and_repo as a bare-name kernel_url. Without this fix, _resolve_kernel_and_repo's repo-relative candidate (repo/kernel_url) is not a file, and the legacy URL resolver also doesn't try extensions, so the run dies with "resolve-kernel-url failed: Kernel file not found: <repo>/<bare>". When the candidate has no suffix, probe each extension from _KERNEL_TYPE_TO_EXT under the repo and promote if exactly one matches. Refuse to guess when multiple match. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit bb36462)
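The "promote only on a unique match" rule can be sketched as follows. `probe_extension` and its `extensions` argument are hypothetical; in the PR the extensions come from `_KERNEL_TYPE_TO_EXT`, whose contents are not shown here.

```python
from pathlib import Path


def probe_extension(repo, kernel_url, extensions):
    """Resolve a bare kernel name by trying each known extension."""
    candidate = Path(repo) / kernel_url
    if candidate.is_file():
        return candidate
    if candidate.suffix:
        # A name that already has an extension is never guessed at.
        return None
    matches = [
        candidate.with_name(candidate.name + ext)
        for ext in dict.fromkeys(extensions)  # de-duplicate, keep order
    ]
    matches = [path for path in matches if path.is_file()]
    # Promote only when exactly one file matches; refuse to guess otherwise.
    return matches[0] if len(matches) == 1 else None
```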

…r from LLM-hallucinated argument names Observed failure: on a silu run, the orchestrator LLM called `commandment_from_user_command` six times with hallucinated keyword names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`, `kernel_path`) before giving up, burning ~28 minutes of preprocess budget. The tool's TypeError reply included only `<type>: <message>`, which named the *bad* argument but never the correct one. Three changes: 1. tools.py — `_schema_commandment_from_user_command`: add a STRICT ARGUMENT NAMING block to both the tool description and the per-property descriptions, explicitly listing the synonyms the LLM was inventing and naming the canonical arg (`run_command`, `out_path`). 2. orchestrator.py Case A prompt: add a fenced example of the exact keyword-arg signature with the same do-NOT list. 3. orchestrator.py `_dispatch_tool`: on any tool exception, return a structured error containing the canonical schema's `expected_arguments` and `required_arguments`, the names actually passed, and the traceback tail. For TypeError specifically, add an explicit hint reminding the LLM that the schema is authoritative. This gives the LLM enough signal to self-correct on the next turn instead of cycling through more synonyms. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 689d602)
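A structured error reply of the kind described in change 3 might be built like this. The sketch derives argument names from the Python signature, whereas the PR reads them from the tool's canonical schema; `dispatch_tool` and the `ok`, `error`, `passed_arguments`, `traceback_tail`, and `hint` keys are assumptions of this example.

```python
import inspect
import traceback


def dispatch_tool(func, kwargs):
    """Call a tool; on failure, report the argument names it accepts."""
    try:
        return {"ok": True, "result": func(**kwargs)}
    except Exception as exc:
        params = inspect.signature(func).parameters
        error = {
            "ok": False,
            "error": f"{type(exc).__name__}: {exc}",
            "expected_arguments": list(params),
            # Assumes plain keyword parameters (no *args / **kwargs).
            "required_arguments": [
                name for name, p in params.items()
                if p.default is inspect.Parameter.empty
            ],
            "passed_arguments": sorted(kwargs),
            "traceback_tail": traceback.format_exc().splitlines()[-3:],
        }
        if isinstance(exc, TypeError):
            error["hint"] = "Use exactly the names in expected_arguments."
        return error
```

Returning the accepted names alongside the names actually passed gives the caller both halves of the mismatch in one reply.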

…th A
Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working on v3 by surfacing existing legacy modules at v3 call sites. No code duplicated; only call sites added.
1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
When the user's run_command is a compound shell pipeline (e.g. `python3 scripts/task_runner.py compile && correctness && performance`) without any GEAK harness flag, mirror legacy `resolve_shell_eval_commands` (rsplit on last &&) and call `eval_contract_adapter.materialize_shell_contract_harness` to write `_geak_shell_contract_harness.py` exposing the standard 4-mode CLI. This unblocks the legacy AKA prompt that previously died with "v3 preprocess failed: No harness_path available".
2) Static `validate_harness` gate at two call sites
- `commandment_from_user_command`: validate user-supplied or synthesized harness; reject malformed paths so finish_preprocess doesn't silently thread bogus paths downstream.
- `adapter._recover_harness_path`: validate the path picked by legacy `extract_harness_path` so a greedy match on `task_runner.py` is rejected instead of breaking profile/benchmark.
3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
`collect_baseline_metrics` now runs `--correctness` once with a short timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before the expensive benchmark loop. Broken kernels fail in seconds instead of minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.
4) Compile-command extraction for synthesized harness
`_try_synthesize_shell_contract_harness` calls legacy `contract_normalize.infer_compile_command_from_eval` to extract the build prefix and re-prepend it to the performance shell so a standalone `--benchmark` invocation rebuilds when needed.
5) `build_baseline_metrics` enrichment in `_project_baseline`
When a profile result is also available, project legacy `build_baseline_metrics(include_all=True)` keys (`bottleneck`, `top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`) into the baseline_metrics dict. Restores fields that `inject_pipeline_context` consumes downstream and that were silently empty on v3.
The legacy modules involved (`eval_contract_adapter`, `harness_utils`, `contract_normalize`, `baseline.build_baseline_metrics`) are already in tree; v3 just wasn't calling them.
Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f) (cherry picked from commit 64f3fa4)

Resolve conflicts in geak.yaml, dispatch.py, adapter.py, orchestrator.py, and tools.py — mostly formatting from ruff lint pass on main. Keep gwiab-scheduler's section_builders approach and GPU scheduler config. Remove unused top-level shlex import in tools.py (inline import remains). Remove duplicate _write_benchmark_baseline dead code in adapter.py. Update test_path_a_commandment to accept section_builders output format. All tests pass (1036 passed, 39 skipped, 2 xfailed). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

- gpu_manager.py: remove unused `field` import, apply ruff formatting
- str_replace_editor.py: remove duplicate `_sandbox_path` staticmethod (instance method kept; it supports `_env_override`)
- test_gpu_manager.py: replace time.sleep synchronization with threading.Event barriers to eliminate race conditions under pytest-xdist on CI; increase future timeouts from 10s to 30s
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

_extract_harness_from_command only matches `python|python3 <path>` and returns just the path. The run.sh template execs its args via `bash -lc "$*"`, which would try to exec the .py directly (Permission denied, rc=126) without an interpreter prefix. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Compare loadavg / cpu_count against the threshold instead of raw loadavg, so the gate behaves the same on a 4-core dev box and a 256-core server. Raw loadavg is meaningless without normalizing by cpu_count: a load of 23 means oversubscribed on 4 cores, ~9% utilized on 256. The previous mini.py default of max(num_gpus * 4, 8) silently wedged single-GPU runs on large hosts where baseline loadavg comfortably exceeds 8. Drop the YAML knob; the threshold is now an internal constant (0.8 = ~80% saturated). Warn if a config still sets it. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
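The normalization argument is easy to check numerically. The sketch below uses hypothetical function names and the 0.8 threshold quoted in the commit; `os.getloadavg()` is only available on Unix-like systems.

```python
import os

CPU_PRESSURE_THRESHOLD = 0.8  # ~80% saturated, per the commit message


def cpu_pressure(loadavg_1m: float, cpu_count: int) -> float:
    """Load average normalized by core count."""
    return loadavg_1m / max(1, cpu_count)


def cpu_gate_open(loadavg_1m=None, cpu_count=None,
                  threshold=CPU_PRESSURE_THRESHOLD) -> bool:
    """True when normalized load is below the threshold."""
    if loadavg_1m is None:
        loadavg_1m = os.getloadavg()[0]  # 1-minute load average
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return cpu_pressure(loadavg_1m, cpu_count) < threshold
```

With a load of 23, `cpu_pressure(23, 4)` is 5.75 (gate closed) while `cpu_pressure(23, 256)` is about 0.09 (gate open), matching the two cases in the commit message.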

build_pool() previously received len(ctx.gpu_ids), so a run with --num-parallel=4 --gpu-ids=0 passed num_gpus=1 and tripped the single-slot early-return: the LLM planner was skipped entirely and the dispatcher replicated one canonical task across all four subagents with no strategy diversity. Pass n_workers instead and rename num_gpus to num_slots end-to-end so the parameter name no longer lies about what it counts. task_planner.build_pool now always appends canonical_fixed to the candidate pool when the LLM planner runs, so pool.fixed is never empty. selector.py's pad branch is simplified to assert this invariant and dispatch fixed[0].body verbatim, removing the prior task="" fallback that silently sent empty bodies to subagents. Verified by single-GPU + 4-subagent smoke: round 1 produced 2 planned + 1 canonical + 1 canonical-fill (byte-identical body), zero fixed-pad-* tasks.

Three tightenings in the task-generation prompt: (1) require reads of COMMANDMENT.md when its path is provided instead of marking it "Optionally", (2) replace "must be rejected" with explicit zero-exception revert language in the COMMANDMENT-adherence rule, (3) append the same revert-and-report-as-failure consequence to the per-task verification checklist. Pure prompt-text changes; no behavioural logic touched.

Auto-fix from `ruff format` to satisfy the pre-commit / CI check. No semantic change.

… bodies When the user-provided run_command is a compound shell pipeline using a subcommand-style runner (e.g. AKA's hip2hip pattern `task_runner.py compile && correctness && performance`), `_extract_harness_from_command` returns None because the command has no GEAK mode flag to anchor on. The Path-A tool then correctly synthesizes a 4-mode shell-contract wrapper via the legacy `materialize_shell_contract_harness` — but the section-body builder re-extracted from the same `cmd`, got None again, and fell through to `strip_mode_flags(cmd)` + `build_correctness_body`, which appends `--correctness` to the entire compound. The result was a COMMANDMENT whose Correctness section ran `task_runner.py performance --correctness`, which the bare subcommand-style runner rejected with "unrecognized arguments: --correctness", killing preflight before any sub-agent could run. Fix: track the synthesized wrapper across the section-builder branch and route the COMMANDMENT body through it (`run.sh python3 <wrapper>`) when it exists. The wrapper accepts the 4-mode flags and internally dispatches the user's subcommand-style runner, so preflight + every later eval call now exit 0. Also: make the wrapper honor `GEAK_WORK_DIR` (falling back to the synthesis-time-baked `REPO_ROOT`) so preflight worktree isolation and agent-side worktrees are preserved instead of always executing in the original repo dir. Verified by single-GPU + 4-subagent HIP smoke (hip2hip/others/knn): preflight_commandment_contract: PASS, Round 1/5 (mode=mixed, workers=4) launched cleanly.

Route all Python model-name fallbacks through get_model_name so GEAK_MODEL is honored consistently. Previously pipeline_helpers / harness_utils pre-resolved GEAK_MODEL into input_model_name, which inverted get_model_name's priority chain and let env override YAML pins. Now the helpers pass model_name straight through to get_model and let get_model_name decide: explicit arg > config dict > GEAK_MODEL > MSWEA_MODEL_NAME > fallback. Also bump the hardcoded default and the six main-agent YAML pins from claude-opus-4.6 to claude-opus-4.7. Pipeline-worker subagent configs (translator, kernel_analysis, harness_builder, swebench, reverse_kl, pytorch_translation) keep their intentional pins. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
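The priority chain (explicit arg > config dict > GEAK_MODEL > MSWEA_MODEL_NAME > fallback) can be sketched as a first-truthy lookup. `resolve_model_name` and the `model_name` config key are assumptions of this example, not the signature of the real `get_model_name`.

```python
import os


def resolve_model_name(explicit=None, config=None, env=None,
                       fallback="claude-opus-4.7"):
    """Return the first model name set, in the documented priority order."""
    env = os.environ if env is None else env
    config = config or {}
    candidates = (
        explicit,
        config.get("model_name"),  # assumed key name
        env.get("GEAK_MODEL"),
        env.get("MSWEA_MODEL_NAME"),
    )
    for name in candidates:
        if name:
            return name
    return fallback
```

Pre-resolving `GEAK_MODEL` into the explicit argument, as the helpers did before this change, would move the env value to the front of this chain and let it beat a YAML pin.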

sdubagun-amd and others added 23 commits May 19, 2026 13:35

fix: allowlist subagents/ in .dockerignore so COPY succeeds

1ee1ed9

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

fix: avoid duplicate cd ${GEAK_WORK_DIR} when repo-root was already r…

e6402c8

…ewritten Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/gwiab' into gwiab-scheduler

9c225fc

# Conflicts: # src/minisweagent/run/preprocess_v3/adapter.py # src/minisweagent/run/preprocess_v3/tools.py

sdubagun-amd requested review from Umangatamd and yueliu14 and removed request for Umangatamd May 20, 2026 16:12

Umangatamd and others added 3 commits May 20, 2026 16:42

Base automatically changed from gwiab to main May 21, 2026 11:21

sdubagun-amd and others added 2 commits May 21, 2026 11:29

sdubagun-amd force-pushed the gwiab-scheduler branch from 2181bc8 to f79e4fb Compare May 21, 2026 13:40

sdubagun-amd and others added 8 commits May 27, 2026 11:07

Merge remote-tracking branch 'origin/main' into gwiab-scheduler

73ae1f2

Ruff format gpu_manager.py

67b5632

Auto-fix from `ruff format` to satisfy the pre-commit / CI check. No semantic change.


sdubagun-amd wants to merge 36 commits into main from gwiab-scheduler
