GPU scheduler#237
Open
sdubagun-amd wants to merge 36 commits into
Open
Conversation
Introduce GPUManager that owns the GPU pool and schedules GpuJobs on demand. Sub-agent count (M) and GPU count (N) are now independent — M defaults to ceil(N * gpu_oversubscribe). Each GPU-bound operation (test runs, profiling) acquires a GPU from the manager for the subprocess duration only, then releases it. Phase 1: GPUManager core + wiring through pipeline/dispatch/agents, save_and_test test execution routed through GpuJob when manager exists. Phase 2: Per-patch profiling os.environ race fixed — _run_patch_profile now routes through run_profiler_with_handle (clean subprocess) via GpuJob instead of mutating os.environ in-process. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…bscription When M agents > N GPUs, all M agents can fire LLM requests simultaneously during their non-GPU phases. This adds a threading.Semaphore around model.query() calls in both DefaultAgent and OptimizationAgent, capped at max_concurrent_llm (defaults to num_parallel). Threaded from mini.py through PipelineContext -> dispatch -> parallel_agent -> run_pool -> agent. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…urrent_llm Users can now set these knobs from task prompts (e.g., "use 2x oversubscription", "cap LLM concurrency at 8"). The LLM extraction prompt, normalization, config merge, and interactive editor all handle the new fields. Task-extracted values override YAML defaults with prompt > CLI > YAML priority. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ee slot Three fixes from real-run feedback: 1. parallel_helpers.py: Remove static GPU pinning (gpu_queue) from execute_task(). Agents no longer receive HIP_VISIBLE_DEVICES or GEAK_GPU_DEVICE in their env — GPU assignment happens per-job at benchmark time via GPUManager in save_and_test. max_workers changed from n_slots (GPU count) to n_tasks so all agents run concurrently. Worktree slot uses task_id directly instead of task_id % n_slots to avoid collisions when M > N. 2. gpu_manager.py: Change stats logging from INFO to DEBUG to reduce log noise (was emitting utilization every 30s even at 0%). 3. adapter.py: Resolve relative kernel_url against repo path, and pass repo to legacy resolver. Fixes "Kernel file not found" when kernel_url is relative and CWD differs from repo directory. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The SubagentRegistry's _default_root() walked up from the installed package path looking for pyproject.toml + subagents/, which doesn't exist in pip-installed environments. The harness-generator and harness-verifier YAML specs were never found in containers. Two fixes: 1. Dockerfile: COPY subagents/ into /workspace/subagents/ 2. registry.py: Add /workspace/subagents/preprocess as a fallback discovery path for Docker containers, plus a GEAK_SUBAGENTS_ROOT env var for explicit override. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Agents in worktrees could write to the original repo when the LLM passed an absolute path under GEAK_REPO_ROOT to the editor tool. This silently corrupted the baseline and caused patch-apply failures in the postprocessor. Redirect such paths to GEAK_WORK_DIR instead. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The str_replace_editor sandbox alone was insufficient — agents discovered they could bypass it by using bash commands (cat >, cp, sed -i) to write directly to the original repo. Task_3 in the topk run explicitly copied its optimized file to /sgl-workspace/aiter after reasoning that "the test runs from the main repo." Rewrite all GEAK_REPO_ROOT occurrences in bash commands to GEAK_WORK_DIR. This is safe because agent bash commands (cd, python, cp) work identically with the worktree path, and PYTHONPATH/run.sh scripts read $GEAK_REPO_ROOT via shell expansion, not from the command string. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
… to worktree
The v3 preprocessor's `commandment_from_user_command` (Path A) had SETUP
set to `true`, skipping run.sh creation entirely. This caused agents to
run user commands without PYTHONPATH isolation, writing to the original
repo instead of their worktrees.
Match the legacy `generate_commandment_from_commands` contract:
- SETUP creates run.sh with PYTHONPATH and HIP_VISIBLE_DEVICES
- Section bodies prepend `cd ${GEAK_WORK_DIR} &&`
- Hardcoded repo-root paths in user commands rewritten to ${GEAK_WORK_DIR}
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…process time) GEAK_REPO_ROOT is only set for agent subprocesses in parallel_helpers.py, not during preprocessing. Use agent.config.repo instead, which is resolved from the adapter's repo_root parameter and available when the commandment is generated. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ewritten Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…acts Replace `git add -A` with targeted staging of exactly the files the patch touches (parsed from diff headers after artifact stripping). Falls back to `git add -u` if the patch can't be parsed. Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/) from being accidentally committed alongside the patch, while still handling new files created by translation (e.g. Triton → HIP). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Two bugs in the Path-A short-circuit caused broken evaluation feedback: 1. When the LLM put all modes in modes_covered (instead of inferred_modes), the Benchmark/Profile/Full Benchmark sections got the raw --correctness command with no flag substitution. Fix: _substitute_mode_flag() deterministically replaces any mismatched harness flag regardless of mode categorization. 2. finish_preprocess unconditionally cleared harness_path on Path A, so between-rounds Metrix profiling was always skipped. Fix: _extract_harness_from_command() recovers the harness path from the user's command when it contains a standard harness flag, preserving it for the evaluation phase. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…n.sh wrapping
The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't
told the harness supports all four modes, causing it to miscategorize
modes and produce wrong flags, and (2) commands used bare `cd && python`
instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.
- Enrich harness hint in adapter.py to tell the LLM the harness is
pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated
harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…ent_baseline preflight_commandment_contract now runs in a disposable git worktree so SETUP side-effects (run.sh, JIT caches) never dirty the original repo. recapture_commandment_baseline calls removed — the preprocessor baseline is the single source of truth. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…file collection Three fixes for the verified-speedup evaluation pipeline: 1. Path A now calls collect_baseline and collect_profile when a standard harness is available — previously skipped entirely, breaking downstream verified-speedup computation. 2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt from BaselineMetrics.raw_outputs — previously hardcoded to None. 3. Full-benchmark baseline uses --full-benchmark stdout (via capture_full_benchmark_stdout) so the config set matches the postprocess evaluator's FULL_BENCHMARK run. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Path A's commandment_from_user_command generated a bare `--profile` flag substitution, missing the warmup + kernel-profile wrapper that Path B uses. Without the wrapper, profile.json is never written and post-round evaluation cannot access hardware counter data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Extract build_correctness_body, build_profile_body, build_benchmark_body, build_full_benchmark_body, warmup_block, and strip_mode_flags into a new zero-dependency module (run/section_builders.py). Both Path A (commandment_from_user_command) and Path B (_generate_simple, _generate_inner_kernel) now call these shared builders, eliminating four divergences: missing GEAK_BENCHMARK_EXTRA_ARGS, --benchmark vs --full-benchmark mismatch, hardcoded warmup count, and hardcoded profile replays. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Resolve conflicts in finalize_apply.py, adapter.py, and tools.py: - finalize_apply: take gwiab's refactored structure, drop redundant _extract_patch_file_paths call - adapter: keep gwiab-scheduler's separate benchmark/full_benchmark baseline paths, remove dead _write_benchmark_baseline - tools: keep gwiab-scheduler's section_builders approach, adopt gwiab's bash -lc exec in run.sh Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Addresses B1-B3, B7-B9 in the GPU scheduler: - B1: _run_job checks fut.cancelled() before executing, wraps set_result/set_exception with InvalidStateError handling - B2: Split timeout into execution timeout (lease deadline) and queue_timeout (caller wait). Queue wait no longer eats into execution time. - B3: Lease-based release with double-release detection replaces raw GPU list release - B7: CPU pressure gate checks os.getloadavg() before dispatching GPU jobs - B8: Lease reaper thread kills hung subprocesses and reclaims GPUs from expired leases - B9: Dispatcher try/finally drains remaining queue items on every exit path Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add observability to GPU manager: lifecycle events (queued, leased, started, completed/failed, released) written to optional JSONL file and logger.debug. Per-outcome counters (succeeded, failed, reaped, cancelled) replace opaque total_completed in stats. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
GPUManager now writes gpu_events.jsonl to the run output directory and uses a GPU-count-proportional CPU pressure threshold instead of the default 0.8 * cpu_count (which was ~307 on 384-core machines). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
# Conflicts: # src/minisweagent/run/preprocess_v3/adapter.py # src/minisweagent/run/preprocess_v3/tools.py
…ernel_url drops the extension
parse_task_info's LLM extractor occasionally returns a bare basename
(e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check
clears kernel_url when the bare name doesn't resolve on disk, but
kernel_name (also bare) leaks back through mini.py's
`kernel_target = ... or parsed_config.get("kernel_name")` fallback,
re-entering _resolve_kernel_and_repo as a bare-name kernel_url.
Without this fix, _resolve_kernel_and_repo's repo-relative candidate
(repo/kernel_url) is not a file, and the legacy URL resolver also
doesn't try extensions, so the run dies with
"resolve-kernel-url failed: Kernel file not found: <repo>/<bare>".
When the candidate has no suffix, probe each extension from
_KERNEL_TYPE_TO_EXT under the repo and promote if exactly one
matches. Refuse to guess when multiple match.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit bb36462)
…r from LLM-hallucinated argument names Observed failure: on a silu run, the orchestrator LLM called `commandment_from_user_command` six times with hallucinated keyword names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`, `kernel_path`) before giving up, burning ~28 minutes of preprocess budget. The tool's TypeError reply included only `<type>: <message>`, which named the *bad* argument but never the correct one. Three changes: 1. tools.py — `_schema_commandment_from_user_command`: add a STRICT ARGUMENT NAMING block to both the tool description and the per-property descriptions, explicitly listing the synonyms the LLM was inventing and naming the canonical arg (`run_command`, `out_path`). 2. orchestrator.py Case A prompt: add a fenced example of the exact keyword-arg signature with the same do-NOT list. 3. orchestrator.py `_dispatch_tool`: on any tool exception, return a structured error containing the canonical schema's `expected_arguments` and `required_arguments`, the names actually passed, and the traceback tail. For TypeError specifically, add an explicit hint reminding the LLM that the schema is authoritative. This gives the LLM enough signal to self-correct on the next turn instead of cycling through more synonyms. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 689d602)
…th A
Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working
on v3 by surfacing existing legacy modules at v3 call sites. No code
duplicated; only call sites added.
1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
When the user's run_command is a compound shell pipeline (e.g.
`python3 scripts/task_runner.py compile && correctness && performance`)
without any GEAK harness flag, mirror legacy
`resolve_shell_eval_commands` (rsplit on last &&) and call
`eval_contract_adapter.materialize_shell_contract_harness` to write
`_geak_shell_contract_harness.py` exposing the standard 4-mode CLI.
This unblocks the legacy AKA prompt that previously died with
"v3 preprocess failed: No harness_path available".
2) Static `validate_harness` gate at two call sites
- `commandment_from_user_command`: validate user-supplied or synthesized
harness; reject malformed paths so finish_preprocess doesn't silently
thread bogus paths downstream.
- `adapter._recover_harness_path`: validate the path picked by
legacy `extract_harness_path` so a greedy match on `task_runner.py`
is rejected instead of breaking profile/benchmark.
3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
`collect_baseline_metrics` now runs `--correctness` once with a short
timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before
the expensive benchmark loop. Broken kernels fail in seconds instead of
minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.
4) Compile-command extraction for synthesized harness
`_try_synthesize_shell_contract_harness` calls legacy
`contract_normalize.infer_compile_command_from_eval` to extract the
build prefix and re-prepend it to the performance shell so a standalone
`--benchmark` invocation rebuilds when needed.
5) `build_baseline_metrics` enrichment in `_project_baseline`
When a profile result is also available, project legacy
`build_baseline_metrics(include_all=True)` keys (`bottleneck`,
`top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`)
into the baseline_metrics dict. Restores fields that
`inject_pipeline_context` consumes downstream which were silently empty
on v3.
All five legacy modules (`eval_contract_adapter`, `harness_utils`,
`contract_normalize`, `baseline.build_baseline_metrics`) are already in
tree; v3 just wasn't calling them.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f)
(cherry picked from commit 64f3fa4)
Resolve conflicts in geak.yaml, dispatch.py, adapter.py, orchestrator.py, and tools.py — mostly formatting from ruff lint pass on main. Keep gwiab-scheduler's section_builders approach and GPU scheduler config. Remove unused top-level shlex import in tools.py (inline import remains). Remove duplicate _write_benchmark_baseline dead code in adapter.py. Update test_path_a_commandment to accept section_builders output format. All tests pass (1036 passed, 39 skipped, 2 xfailed). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- gpu_manager.py: remove unused `field` import, apply ruff formatting - str_replace_editor.py: remove duplicate `_sandbox_path` staticmethod (instance method kept — it supports `_env_override`) - test_gpu_manager.py: replace time.sleep synchronization with threading.Event barriers to eliminate race conditions under pytest-xdist on CI; increase future timeouts from 10s to 30s Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2181bc8 to
f79e4fb
Compare
_extract_harness_from_command only matches `python|python3 <path>` and returns just the path. The run.sh template execs its args via `bash -lc "$*"`, which would try to exec the .py directly (Permission denied, rc=126) without an interpreter prefix. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Compare loadavg / cpu_count against the threshold instead of raw loadavg, so the gate behaves the same on a 4-core dev box and a 256-core server. Raw loadavg is meaningless without normalizing by cpu_count: a load of 23 means oversubscribed on 4 cores, ~9% utilized on 256. The previous mini.py default of max(num_gpus * 4, 8) silently wedged single-GPU runs on large hosts where baseline loadavg comfortably exceeds 8. Drop the YAML knob; the threshold is now an internal constant (0.8 = ~80% saturated). Warn if a config still sets it. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
build_pool() previously received len(ctx.gpu_ids), so a run with --num-parallel=4 --gpu-ids=0 passed num_gpus=1 and tripped the single-slot early-return — the LLM planner was skipped entirely and the dispatcher replicated one canonical task across all four subagents with no strategy diversity. Pass n_workers instead and rename num_gpus to num_slots end-to-end so the parameter name no longer lies about what it counts. task_planner.build_pool now always appends canonical_fixed to the candidate pool when the LLM planner runs, so pool.fixed is never empty. selector.py's pad branch is simplified to assert this invariant and dispatch fixed[0].body verbatim — removing the prior task=\"\" fallback that silently sent empty bodies to subagents. Verified by single-GPU + 4-subagent smoke: round 1 produced 2 planned + 1 canonical + 1 canonical-fill (byte-identical body), zero fixed-pad-* tasks.
Three tightenings in the task-generation prompt: (1) require reads of COMMANDMENT.md when its path is provided instead of marking it "Optionally", (2) replace "must be rejected" with explicit zero- exception revert language in the COMMANDMENT-adherence rule, (3) append the same revert-and-report-as-failure consequence to the per-task verification checklist. Pure prompt-text changes; no behavioural logic touched.
Auto-fix from `ruff format` to satisfy the pre-commit / CI check. No semantic change.
… bodies When the user-provided run_command is a compound shell pipeline using a subcommand-style runner (e.g. AKA's hip2hip pattern `task_runner.py compile && correctness && performance`), `_extract_harness_from_command` returns None because the command has no GEAK mode flag to anchor on. The Path-A tool then correctly synthesizes a 4-mode shell-contract wrapper via the legacy `materialize_shell_contract_harness` — but the section-body builder re-extracted from the same `cmd`, got None again, and fell through to `strip_mode_flags(cmd)` + `build_correctness_body`, which appends `--correctness` to the entire compound. The result was a COMMANDMENT whose Correctness section ran `task_runner.py performance --correctness`, which the bare subcommand-style runner rejected with "unrecognized arguments: --correctness", killing preflight before any sub-agent could run. Fix: track the synthesized wrapper across the section-builder branch and route the COMMANDMENT body through it (`run.sh python3 <wrapper>`) when it exists. The wrapper accepts the 4-mode flags and internally dispatches the user's subcommand-style runner, so preflight + every later eval call now exit 0. Also: make the wrapper honor `GEAK_WORK_DIR` (falling back to the synthesis-time-baked `REPO_ROOT`) so preflight worktree isolation and agent-side worktrees are preserved instead of always executing in the original repo dir. Verified by single-GPU + 4-subagent HIP smoke (hip2hip/others/knn): preflight_commandment_contract: PASS, Round 1/5 (mode=mixed, workers=4) launched cleanly.
Route all Python model-name fallbacks through get_model_name so GEAK_MODEL is honored consistently. Previously pipeline_helpers / harness_utils pre-resolved GEAK_MODEL into input_model_name, which inverted get_model_name's priority chain and let env override YAML pins. Now the helpers pass model_name straight through to get_model and let get_model_name decide: explicit arg > config dict > GEAK_MODEL > MSWEA_MODEL_NAME > fallback. Also bump the hardcoded default and the six main-agent YAML pins from claude-opus-4.6 to claude-opus-4.7. Pipeline-worker subagent configs (translator, kernel_analysis, harness_builder, swebench, reverse_kl, pytorch_translation) keep their intentional pins. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Test plan