Make GEAK evaluate worktree (patched) kernels, not the baseline by chao-xu-spec · Pull Request #249 · AMD-AGI/GEAK

chao-xu-spec · 2026-05-29T04:06:25Z

Close #246

Problem

GEAK copies a repo into a worktree, applies the candidate patch, then runs
harness.py. In several setups the harness silently imported the BASELINE
kernel instead of the worktree, so every speedup measured baseline-vs-baseline
(~1.00x). Three independent failure modes were found:

- aiter (JIT C++/HIP): .so modules were never rebuilt from worktree
  source (AITER_REBUILD unset; module_aiter_core on a hardcoded
  rebuild-allowlist), so edits never entered the runtime binary.
- sgl-kernel / monorepos: only the FIRST installable sub-project was
  editable-installed, so import sgl_kernel resolved to the site-packages
  wheel, which is the baseline again.
- Harness hardcoded sys.path.insert(0, "/sgl-workspace/sglang/python"):
  a literal absolute path at sys.path[0] shadowed both the worktree on
  PYTHONPATH and the editable install (observed in
  rotary_embedding_kernel_202605290819).
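
The third failure mode comes down to import search order. The sketch below is a minimal, self-contained illustration, not GEAK code: `demo_kernel`, `make_pkg`, and `resolve_origin` are hypothetical names, and two temporary directories stand in for the baseline checkout and the worktree.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_pkg(root: Path, tag: str) -> Path:
    """Create root/demo_kernel.py exposing ORIGIN = tag (hypothetical module)."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "demo_kernel.py").write_text(f"ORIGIN = {tag!r}\n")
    return root


def resolve_origin(search_path: list) -> str:
    """Import demo_kernel with the given entries prepended to sys.path."""
    saved = list(sys.path)
    sys.modules.pop("demo_kernel", None)
    try:
        sys.path[:0] = search_path
        importlib.invalidate_caches()
        return importlib.import_module("demo_kernel").ORIGIN
    finally:
        # Restore the interpreter state so repeated calls stay independent.
        sys.path[:] = saved
        sys.modules.pop("demo_kernel", None)


with tempfile.TemporaryDirectory() as tmp:
    baseline = make_pkg(Path(tmp) / "baseline", "baseline")
    worktree = make_pkg(Path(tmp) / "worktree", "worktree")
    # A hardcoded insert at index 0 puts the baseline ahead of the worktree.
    shadowed = resolve_origin([str(baseline), str(worktree)])
    # With the worktree first, the patched copy is the one that gets imported.
    correct = resolve_origin([str(worktree), str(baseline)])
```

Whichever directory comes first on `sys.path` wins, which is why a literal absolute path at index 0 defeats both PYTHONPATH and an editable install.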

vLLM adds a fourth constraint: it ships as a wheel-only install (no setup.py,
multi-GB .so), so neither git worktree nor pip install -e applies.

Changes

1. Force JIT kernels to rebuild from worktree source

- _compile_bootstrap/{__init__,sitecustomize}.py (new): stdlib-only bootstrap
  auto-loaded via PYTHONPATH; sets AITER_REBUILD=2 and installs an import
  hook clearing aiter's rebuilded_list (covers module_aiter_core).
- run/preprocess/run_harness.py, tools/save_and_test.py,
  run/preprocess_v3/baseline.py: inject the bootstrap dir + AITER_REBUILD
  into every harness subprocess (setdefault, so explicit overrides win).
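
The env injection can be sketched as below. This is an illustration under stated assumptions rather than the code in this PR: `harness_env` and its parameters are hypothetical names. It relies on two documented Python behaviours: `sitecustomize` is imported automatically at interpreter startup when it is importable, and `dict.setdefault` leaves an existing value untouched.

```python
import os
from typing import Dict, Optional


def harness_env(bootstrap_dir: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment for a harness subprocess (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # setdefault: a caller who already exported AITER_REBUILD keeps their value.
    env.setdefault("AITER_REBUILD", "2")
    # Prepend the bootstrap dir so its sitecustomize.py is found at startup.
    parts = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if bootstrap_dir not in parts:
        parts.insert(0, bootstrap_dir)
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env
```

A caller would pass the result as `env=` to `subprocess.run`, so the parent process environment is never mutated.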

2. Editable-install every installable sub-project in the worktree

- run/preprocess/worktree_install.py (new): bounded recursive walk installs
  every setup.py/pyproject.toml sub-project; two-tier strategy
  (pip install -e → setup_rocm.py develop fallback); snapshots & restores
  the original wheels.
- run/mini.py: geak --cleanup restores the original wheel-installed
  packages (best-effort).
- run/utils/generated_artifacts.py: keep editable-install build side-effects
  (hipify-modified files, generated metadata) out of captured patches.
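
A bounded walk of this kind might look like the sketch below. The depth limit, the skip list, and the name `find_installable_projects` are assumptions made for illustration; the PR does not state the actual bounds.

```python
import os
from pathlib import Path
from typing import List

MARKERS = ("setup.py", "pyproject.toml")
# Assumed skip list: directories that usually hold build output, not sources.
SKIP_DIRS = {".git", "build", "dist", "node_modules", "__pycache__"}


def find_installable_projects(root: Path, max_depth: int = 3) -> List[Path]:
    """Return every directory under root (up to max_depth) holding a marker file."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        depth = len(current.relative_to(root).parts)
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        if depth >= max_depth:
            dirnames[:] = []
        if any(marker in filenames for marker in MARKERS):
            found.append(current)
    return found
```

Each returned directory would then be handed to the installer, instead of stopping at the first match.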

3. vLLM wheel-only support via shadow worktree

- kernel_packages/{__init__,profile,shadow_worktree,vllm_profile}.py (new):
  PackageProfile registry; shadow tree copies .py (writable) and symlinks
  .so (immutable baseline), git-inits for clean git diff; vLLM profile
  detects wheel-only layout, skips editable install, injects runtime env.
- run/task_file.py: create_worktree dispatches to a profile's
  make_worktree when one matches.
- agents/parallel_agent.py: skip git init for shadow-tree profiles (never
  pollute system site-packages).
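
The copy-versus-symlink split can be sketched as follows. `build_shadow_tree` is a hypothetical name; this sketch skips files that are neither `.py` nor `.so`, leaves out the git-init step, and assumes a POSIX filesystem where symlinks are available.

```python
import os
import shutil
from pathlib import Path


def build_shadow_tree(installed_pkg: Path, shadow_root: Path) -> Path:
    """Mirror installed_pkg under shadow_root: copy .py files, symlink .so files."""
    installed_pkg = Path(installed_pkg).resolve()
    dest_root = Path(shadow_root) / installed_pkg.name
    for dirpath, _dirnames, filenames in os.walk(installed_pkg):
        src_dir = Path(dirpath)
        dst_dir = dest_root / src_dir.relative_to(installed_pkg)
        dst_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src, dst = src_dir / name, dst_dir / name
            if name.endswith(".py"):
                # Real copy: edits in the shadow tree never touch site-packages.
                shutil.copy2(src, dst)
            elif name.endswith(".so"):
                # Symlink: the multi-GB binary stays shared and is not duplicated.
                os.symlink(src, dst)
    return dest_root
```

Editing a `.py` file in the shadow tree then changes only the copy, while compiled extensions still resolve to the installed baseline.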

4. Reject harnesses that bypass the worktree (the 1.00x bug)

- run/preprocess/unit_test_agent.py: forbid hardcoded
  sys.path.insert(0, '/abs') in the triton/asm/unknown/cuda guidance as well
  (previously only hip/ck had it); steer toward GEAK_WORK_DIR.
- kernel_languages/contract.py: find_hardcoded_syspath_inserts +
  validate_harness raises ContractViolation → HarnessBuilder regenerates.
- run/preprocess/harness_utils.py: same detection → valid=False, so
  phases and the Path-A short-circuit drop and regenerate the harness.
- pipeline_workers/preprocess/harness_builder.py,
  run/preprocess_v3/tools.py: wire stricter validation into Path-A.
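
For reference, a minimal detector built on the regex quoted in the review hunk of this PR could look like this. Only the pattern is taken from the PR; the function name `hardcoded_inserts` and its return shape are assumptions.

```python
import re
from typing import List

# Pattern as quoted in the review hunk: sys.path.insert(<int>, "<absolute path>").
HARDCODED_SYSPATH_RE = re.compile(
    r"""sys\.path\.insert\(\s*\d+\s*,\s*(['"])(?P<path>/[^'"]*)\1\s*\)"""
)


def hardcoded_inserts(source: str) -> List[str]:
    """Return the absolute paths a harness source pins via sys.path.insert."""
    return [m.group("path") for m in HARDCODED_SYSPATH_RE.finditer(source)]
```

An empty result means the harness passes this gate; a non-empty one would be grounds to reject and regenerate it. As a reviewer notes on this PR, the pattern only covers `insert`, not `append` or `extend`.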

5. Atomic editor writes

tools/editor_tool.py: write via temp file + os.replace so edits never
corrupt a hard-linked/shared baseline inode.
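
The temp-file-plus-`os.replace` pattern is sketched below. `atomic_write_text` is a hypothetical name, and the sketch does not preserve the original file's permission bits, which a real editor tool would probably want to do.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to path by creating a sibling temp file and renaming it into place."""
    path = Path(path)
    # Same directory as the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        # The rename swaps in a new inode; other hard links keep the old content.
        os.replace(tmp_name, path)
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Opening the target with mode `"w"` would instead truncate the shared inode in place, so every hard link to it, including the baseline, would see the edit.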

6. Bash tool firewall + per-slot env resolution

- tools/bash_command.py: L1 wall-clock timeout + L2 filesystem-scan firewall
  (reject find /, NFS roots); expand $VAR/${VAR} from the injected env so
  GEAK_WORK_DIR resolves per parallel slot.
- subagents/**/SYSTEM_PROMPT.md, subagents/_common/search_scope_hint.md,
  run/preprocess_v3/subagent.py: tell subagents the allowed search scope.
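
Expanding variables from the injected env rather than from `os.environ` is what makes the result differ per slot. A minimal sketch, with `expand_from_env` as a hypothetical name and unknown variables deliberately left untouched:

```python
import re
from typing import Dict

# Matches $NAME and ${NAME}; NAME follows the usual shell identifier rules.
_VAR_RE = re.compile(
    r"\$(?:\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)\}|(?P<bare>[A-Za-z_][A-Za-z0-9_]*))"
)


def expand_from_env(token: str, env: Dict[str, str]) -> str:
    """Expand $VAR / ${VAR} in token using only the given env mapping."""

    def repl(match):
        name = match.group("braced") or match.group("bare")
        # Unknown names are kept verbatim rather than replaced by an empty string.
        return env.get(name, match.group(0))

    return _VAR_RE.sub(repl, token)
```

Two parallel slots that pass different `GEAK_WORK_DIR` values therefore get different expanded paths for the same command text.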

Cleanup

run/preprocess/commandment.py, run/utils/generated_artifacts.py: remove
the superseded .aiter_jit / AITER_JIT_DIR scratch-dir mechanism (replaced
by AITER_REBUILD + sitecustomize hook); unify cpp/python SETUP template.

Tests

- tests/kernel_packages/ (new): compile bootstrap, profile detection,
  shadow_worktree layout, end-to-end dispatch.
- tests/run/test_worktree_install.py (new): recursive editable install.
- tests/run/test_harness_workdir_and_tolerance.py: hardcoded-sys.path gate
  (both validators), workdir injection, tolerance cap.
- tests/tools/test_bash_command_safety.py: firewall + env expansion.

Made with Cursor

* feat(rdna): add RDNA GPU architecture detection and profiling support

When tasks are submitted via the GEAK API, the top-level model.api_key field is always serialised as an empty string regardless of what is set in model_kwargs.api_key. This means _get_api_key() always falls through to the AMD_LLM_API_KEY env-var lookup, which is also unset in the task container, causing every amd_llm task to fail immediately with: ValueError: API key not provided.

Fix: add model_kwargs.get("api_key") as a fallback between self.config.api_key and the env-var lookups so that keys passed through the API's model_kwargs dict are honoured.

Similarly, amd_claude._init_client() ignored model_kwargs["api_base"] when constructing the Anthropic client base_url, falling back to the default llm-api.amd.com endpoint instead of the caller-specified one. Add model_kwargs.get("api_base") as a fallback there too.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Wrap long line in _init_client to satisfy ruff format check. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…base-url fix(amd_llm): read api_key and api_base from model_kwargs as fallback

Fix/unified toolruntime

Bound an entire GEAK run end-to-end (1h quick / 2h full), covering preprocess + heterogeneous/homogeneous optimization. Mode is also extracted from natural-language task content.

Four enforcement layers:
- Polled Deadline at every loop boundary (run/budget.py)
- threading.Timer watchdog flips SoftStop ~5min before deadline
- ProcessRegistry (run/state.py): SIGTERM -> 5s -> SIGKILL escalation via os.killpg for tracked Popen/mp.Process/Future
- SIGINT handler + try/finally cleanup; 2nd Ctrl-C force-terminates

Preprocess: 15min soft cap, may borrow up to 50% of total. Stage-aware soft-stop handler classifies by current_stage + benchmark_baseline.txt presence. Profiler wrapped in mp.Process with os.setsid() so GPU grandchildren can be killpg'd.

Mode controls only total_s, finalize_grace_s, preprocess caps, and max_rounds. Step/cost limits stay user-controlled to avoid silent overrides. Documented precedence: CLI > mode > env > default.

Plumbed deadline + soft_stop + registry through round loop, run_llm_steps, dispatch_tool_call, parallel_helpers (run_pool, run_parallel_heterogeneous), and DefaultAgent.query. Replaced as_completed with poll loops so soft_stop is observed mid-dispatch.

39 new tests (budget, state, mode presets, integration). Repo-wide ruff cleanup. Final: ruff + pylint -E clean; 527 passed, 39 skipped.

Made-with: Cursor
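
The SIGTERM -> grace period -> SIGKILL escalation can be sketched for a single tracked `Popen` as below. `terminate_group` is a hypothetical helper, not the ProcessRegistry implementation, and it assumes a POSIX system with the child started via `start_new_session=True` so that its process-group id equals its pid.

```python
import os
import signal
import subprocess
import sys


def terminate_group(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """Send SIGTERM to proc's process group, then SIGKILL if it outlives grace_s."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        # The group is already gone; just reap the child.
        return proc.wait()
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,  # new session => pgid == pid, so killpg reaches grandchildren
    )
    print("exit code:", terminate_group(child))
```

Signalling the group instead of the single pid is what lets grandchildren, such as GPU worker processes, be stopped together with their parent.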

…01e4)

Restore the load-bearing ``output = output.resolve()`` that 94c01e4 added to ``_derive_output_dir`` and that 03b5b9e silently removed under its "remove dead code" line item. Without it, a relative ``--output`` (or task-extracted ``output_dir``) on a merged HIP/CUDA kernel breaks the pipeline:

RuntimeError: Deterministic harness file not found: /repo_root/.../outputs/silu/test_silu_harness.py

because the merged-file split helper writes the new harness to ``output_dir / "test_<stem>_harness.py"`` and ``_resolve_deterministic_harness`` treats that relative path as repo-root-relative. Triton runs hide the bug because non-merged kernels skip the split path and use the user-provided absolute ``test_command``.

The unit tests for ``_derive_output_dir`` only exercised ``tmp_path`` (always absolute), so ``.resolve()`` was a no-op and the assertions held either way -- which is exactly how the original 03b5b9e regression slipped through CI. Add three regression tests using relative inputs (relative directory, relative file path, bare relative name) that all assert ``out_dir.is_absolute()``. Confirmed these would have failed against the regressed code.

Also add an inline comment in ``_derive_output_dir`` naming both 94c01e4 (intent) and 03b5b9e (the regression) so the next refactor pass doesn't repeat the mistake.

Made-with: Cursor

03b5b9e dropped the tty guard, so a CI/scripted run without an API key env would block on prompt() inside interactive setup() instead of failing fast. Restoring the original guard. Made-with: Cursor

Closes the homogeneous-mode gap where a stuck subprocess.run inside a sub-agent let --mode quick run past the 1h budget.

A) New BudgetSpec.kill_buffer_s (default 60s) and a second optimization watchdog at opt_deadline + kill_buffer_s that calls registry.terminate_all() then os._exit(124). Wired from mini.py and geak-orchestrate.

B) New _tracked_subprocess_run() helper in tools/save_and_test.py: a drop-in for subprocess.run that registers its Popen with the run-level ProcessRegistry (start_new_session=True). Only the long-running test_command call is converted; short git ops keep subprocess.run. Registry threaded via DefaultAgent._registry, set alongside _soft_stop in parallel_helpers / parallel_agent.

Tests: +4 hard-kill (test_budget.py), +5 tracked-subprocess (new test_save_and_test_registry.py). Total now 561 passed.

Made-with: Cursor

ProcessRegistry.register_future() now appends + adds a done-callback that removes the future on completion. Lock switched to RLock so the already-done synchronous-callback case can't deadlock the submitter. Stops the misleading 'futures=N' in terminate_all's SIGTERM-wave log on clean run finalization. +3 tests. Made-with: Cursor
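
Why the lock has to be re-entrant is easiest to see in a small sketch. `FutureRegistry` below is a hypothetical stand-in for the future-tracking part of ProcessRegistry: `Future.add_done_callback` runs the callback immediately, in the calling thread, when the future is already done, so the callback re-enters the lock that `register_future` still holds.

```python
import threading
from concurrent.futures import Future


class FutureRegistry:
    """Tracks in-flight futures and forgets each one as soon as it completes."""

    def __init__(self) -> None:
        # RLock: the done-callback may fire synchronously inside register_future.
        self._lock = threading.RLock()
        self._futures = []

    def register_future(self, fut: Future) -> None:
        with self._lock:
            self._futures.append(fut)
            fut.add_done_callback(self._discard)

    def _discard(self, fut: Future) -> None:
        with self._lock:
            try:
                self._futures.remove(fut)
            except ValueError:
                pass  # already removed

    def pending(self) -> int:
        with self._lock:
            return len(self._futures)
```

With a plain `threading.Lock`, registering an already-completed future would block forever on the second acquire.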

…runs

1. task_parser: log raw LLM response (truncated to 500 chars) on JSONDecodeError. Without this the only diagnostic was 'char 0', indistinguishable across all non-JSON failure modes.
2. task_parser: promote a directory kernel_url to a kernel file inside it (kernel_name.<ext> -> kernel.{py,hip,cu,flydsl} -> single matching file by kernel_type). Users frequently say 'the kernel is in <DIR>' and the LLM extractor echoes the directory verbatim. Also tighten PARSE_TASK_INFO prompt so the LLM is less likely to do this.
3. preprocessor: filter automated_test_discovery results to repo_root so harnesses from sibling kernel directories (e.g. .../L3/fused_rms_fp8/test_kernel_harness.py while optimizing .../L3/gemm_a16wfp4/kernel.py) are dropped before reaching UTA.

+14 tests. 578 passed total.

Made-with: Cursor

When the LLM bails on JSON ('let me check the directory...') we lost the 'quick mode' cue and silently fell back to YAML's 'full' default, producing 5 rounds instead of 2.

- task_parser._infer_mode_from_text(): regex backstop covering 'quick mode', '1 hour', '--mode full', etc. Fires only when LLM left mode as None; LLM result wins on conflict.
- JSON_EXTRACTION_SYSTEM_PROMPT: explicit 'never return prose, never investigate the filesystem, guess if uncertain' so the LLM stops treating ambiguous prompts as research tasks.

+13 tests. 591 passed total.

Made-with: Cursor

…on failure

Guarantees every preprocessor-produced harness exposes a `--iterations N` argparse flag and refuses to silently progress past contract failures.

- C-like wrapper and Triton-split harnesses now inject `--iterations` (with `GEAK_BENCHMARK_ITERATIONS` env fallback); UTA prompt requires it.
- `REQUIRED_HARNESS_FLAGS` enforces `--iterations` so missing-flag harnesses fail static validation and trigger UTA's regenerate loop.
- Preprocessor raises `PreprocessAborted` when no harness mode produced a baseline (escape hatch: `GEAK_ALLOW_BROKEN_HARNESS=1`).
- New `CommandmentExecutionError` raised from `run_correctness_and_benchmark` / `run_profile` on subprocess failure or contract-broken stderr signatures (`unrecognized arguments`, `Harness file not found`, etc.); kernel-level correctness failures keep the legacy `correctness_failed` round status.
- New `preflight_commandment_contract` smoke-tests SETUP+CORRECTNESS once with `--iterations 1` before sub-agent fan-out; opt-out via `GEAK_SKIP_COMMANDMENT_PREFLIGHT=1`.

Made-with: Cursor

…workers

When SoftStop fires while sub-agents are mid-LLM-call (which doesn't observe _soft_stop), the dispatcher's 'with ThreadPoolExecutor as ex:' blocks at exit on shutdown(wait=True), preventing the orchestrator from reaching its deadline-finalize path and forcing the hard-kill watchdog to os._exit(124) without writing final_report.json.

Replace the with-block with manual try/finally in run_pool, run_parallel_heterogeneous, and ParallelAgent.run_parallel (homogeneous branch). On SoftStop call shutdown(wait=False, cancel_futures=True) so the dispatcher returns immediately, letting the orchestrator finalize within finalize_grace_s instead of being forcibly killed.

+3 tests verifying detach is fast (<2s) and normal drain still works. 605 passed total.

Co-authored-by: Cursor <cursoragent@cursor.com>
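
The detach behaviour can be sketched as follows. This `run_pool` is a simplified, hypothetical version with assumed parameters, not the helper in parallel_helpers; `cancel_futures` needs Python 3.9 or newer.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_pool(jobs, soft_stop: threading.Event, max_workers: int = 2, poll_s: float = 0.05):
    """Run jobs in a pool; return early with finished results if soft_stop is set."""
    ex = ThreadPoolExecutor(max_workers=max_workers)
    futures = [ex.submit(job) for job in jobs]
    detached = False
    try:
        # Poll instead of blocking so soft_stop is observed mid-dispatch.
        while not all(f.done() for f in futures):
            if soft_stop.is_set():
                detached = True
                break
            time.sleep(poll_s)
        return [f.result() for f in futures if f.done() and not f.cancelled()]
    finally:
        # Normal drain waits for workers; a soft stop drops queued jobs and returns at once.
        ex.shutdown(wait=not detached, cancel_futures=detached)
```

Jobs that are already running are not interrupted by `shutdown(wait=False)`; they keep going in their worker threads, so the caller only regains control and still needs a separate mechanism to stop long-running work.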

The Lint & Format workflow's `ruff format --check src/` step has been failing on this branch since e56b227. Re-running `ruff format src/` in-place reformats four files that all share two mechanical patterns:

- `src/minisweagent/run/postprocess/evaluation.py`: collapse multi-line `raise CommandmentExecutionError(...)` calls and one implicit-string-concat `logger.error(...)` back onto a single line that fits the 120-col line limit.
- `src/minisweagent/run/preprocess/harness_utils.py`: rewrite the `_GEAK_ITERATIONS_SHIM` heredoc from `'''...'''` to `"""..."""` per pyproject.toml's `[tool.ruff.format] quote-style = "double"`.
- `src/minisweagent/run/preprocess/preprocessor.py`: collapse one implicit-string-concat `logger.error(...)`.
- `src/minisweagent/run/utils/task_parser.py`: collapse two implicit-string-concat `logger.info(...)` calls.

Formatter-only; no semantic change. `ruff format --check src/` and `ruff check src/` both clean post-change. Verified with the same `ruff 0.15.2` CI installs via `uv pip install ruff` (unpinned).

Made-with: Cursor

Deprecates geak-orchestrate and addresses must-fix and select should-fix items from the PR #205 deep review.

Deprecation:
- Remove geak-orchestrate console script and standalone CLI from run/orchestrator.py. Its --mode total_s semantics diverged from geak (preprocess elapsed always assumed 0, so --mode quick gave a fresh 1h optimization budget instead of the 1h preprocess+optimization budget the same flag gives via geak). run_orchestrator() and _probe_preprocess_dir() are kept for in-process callers and tests.

Must-fix bugs:
- tool_dispatch_tasks: remove dead inner loop in the improvement-skip block; collapse the task_paths[:1] over-iteration into a single next().
- _stage_found_improvement: replace __import__() trick with a normal top-level import and log JSON parse failures at WARNING.
- _compute_verified_speedup: use 'is None' / '<= 0' so a real 0.0 candidate latency is rejected as a broken measurement instead of looking identical to "couldn't parse"; set failure_reason and propagate it through RoundEvaluation.full_benchmark.failure_reason.
- _promote_kernel_url_dir_to_file: cap iterdir() at 32 entries to avoid walking large/shared directories on stale paths; refuse to promote unknown kernel_type without a name hint.
- mini.py SIGINT handler: restore the original handler before calling terminate_all() so a third Ctrl-C lands on the default handler instead of recursing back into the SIGINT handler mid-escalation.
- PreprocessState: add set_stage(stage) helper that bundles "advance + raise on hard_fail"; replace the four direct state.current_stage = assignments in run_preprocessor that duplicated the guard.
- tool_collect_results: actually short-circuit on SoftStop -- narrow the scan to a single-round walk instead of the full cross-round summary.
- default.py _setup_save_and_test_context: document the idempotent re-init coupling that parallel/heterogeneous helpers rely on.

Should-fix:
- Deadline.cap() returns 0.0 once SoftStop is set so callers using cap() to size new subprocess timeouts refuse new long-running work without needing a separate soft_stop.is_set() poll.
- _filter_discovery_to_repo_root: also catch RecursionError from Path.resolve() (NFS/macOS symlink loops).
- mini.py hard-kill watchdog: best-effort write a stub final_report.json ({status: hard_kill, exit_code: 124, elapsed_s, reason}) before os._exit(124) so the operator has something on disk.
- apply_mode_presets: log the actual config delta (+key=val, key: before->after) instead of walking the preset tree, which misrepresented dict-replaces-scalar merges.
- validate_harness: strip Python comments via tokenize before checking required-flag presence so "# --iterations N not yet supported" doesn't satisfy the validator. Mirror the helper between the duplicated copies in pipeline_helpers and preprocess/harness_utils.

Tests:
- New test for Deadline.cap() returning 0 under SoftStop.
- New test that comment-only --iterations is rejected by validate_harness.

Full unit test suite: 607 passed (was 605), 39 skipped, 2 xfailed. ruff check / ruff format --check on src/ clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

…strip helper Pylint flagged ``src/minisweagent/run/preprocess/harness_utils.py:1144: E1101: Module 'tokenize' has no 'TokenizeError' member (no-member)``. The exception class is ``tokenize.TokenError`` (singular, no "ize"); the typo would have raised ``AttributeError`` if the ``except`` clause ever fired on a tokenize failure. Tests still pass because the source we feed ``_strip_python_comments`` is well-formed; pylint catches what runtime coverage didn't. Co-authored-by: Cursor <cursoragent@cursor.com>

A harness that doesn't declare --iterations is no longer a hard-fail at validation. GEAK warns once, never rewrites the harness, and silently strips --iterations N from any subprocess invocation of that harness so argparse doesn't crash with "unrecognized arguments". Iteration counts still flow via GEAK_BENCHMARK_ITERATIONS for harnesses that read the env var; harnesses that don't run at their hardcoded default. This replaces the old "your harness must declare --iterations or set GEAK_ALLOW_BROKEN_HARNESS=1" UX with "we'll route around it".

Mechanism:
- New ``harness_supports_iterations(path)`` detector in ``run/preprocess/harness_utils.py`` (memoized via lru_cache, strips comments before checking, returns False on unreadable paths). Paired with ``reset_harness_support_cache()`` and a small ``_strip_iterations_tokens`` argv-string scrubber.
- ``REQUIRED_HARNESS_FLAGS`` keeps the four mode flags (--profile, --correctness, --benchmark, --full-benchmark); a new ``RECOMMENDED_HARNESS_FLAGS = ("--iterations",)`` carries the downgraded contract. Both copies of ``validate_harness`` (preprocess + run/) emit a WARNING on missing --iterations and return valid=True with the warning surfaced in the messages list.
- Three EXTRA_ARGS construction sites are gated on ``harness_supports_iterations(harness_path)``: ``build_eval_env`` (postprocess/evaluation.py), ``preflight_commandment_contract`` (postprocess/evaluation.py), and ``run_task_batch`` (run/dispatch.py). All three now also seed ``GEAK_BENCHMARK_ITERATIONS`` as the canonical fallback channel.
- Belt-and-suspenders gate in ``run_harness._run_single``: even if a future construction site forgets to gate, we scrub --iterations N out of EXTRA_ARGS before extending argv when the harness lacks the flag.
- ``_strip_python_comments`` becomes single-source in ``preprocess/harness_utils.py``; ``run/pipeline_helpers.py`` imports it instead of duplicating (closes review nit 4.1, partial).
- UnitTestAgent prompt softened from MANDATORY to STRONGLY RECOMMENDED while keeping the wire-both-channels example so generated harnesses stay clean.

Tests:
- 5 tests for ``harness_supports_iterations`` (declared / absent / comment-only / unreadable / cache-invalidation).
- 3 tests for ``_strip_iterations_tokens``.
- 2 tests for ``build_eval_env`` gating.
- 3 end-to-end tests via ``run_harness._run_single`` with a real stub harness that echoes its argv (passes --iterations when supported, strips when not, preserves other tokens).
- ``validate_harness`` tests rewritten: warns-but-passes when missing --iterations / when only in a comment; new ``test_rejects_missing_required_flag`` locks in that the four mode flags are still required.

Verification: 621 unit tests pass (was 607). ruff check, ruff format --check, and pylint --errors-only all clean on src/.

Co-authored-by: Cursor <cursoragent@cursor.com>
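
An argv-string scrubber of the kind described could be sketched like this. `strip_iterations` is a hypothetical name, and handling the `--iterations=N` spelling is an assumption; the commit message only mentions `--iterations N`.

```python
import shlex


def strip_iterations(extra_args: str) -> str:
    """Drop '--iterations N' (and '--iterations=N') from a shell-style argument string."""
    kept = []
    skip_next = False
    for tok in shlex.split(extra_args):
        if skip_next:
            # This token is the value that followed a bare --iterations.
            skip_next = False
            continue
        if tok == "--iterations":
            skip_next = True
            continue
        if tok.startswith("--iterations="):
            continue
        kept.append(tok)
    return shlex.join(kept)
```

All other tokens pass through in their original order, so a harness without the flag still receives its mode flag and any remaining options.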

feat(budget): add --mode quick|full wall-clock budget for runs

feat(tools): add translation tool profile for TranslationAgent

Consolidate flydsl-optimization, flydsl-debug-kernel, and flydsl-tile-programming into a single skills/flydsl/ skill following the pytorch2flydsl-translation pattern (summary SKILL.md + docs/).

skills/flydsl/
  SKILL.md - unified summary covering the full kernel lifecycle: write (tile programming), optimize (performance), debug (correctness)
  docs/
    flydsl_optimization.md - optimization workflow and strategies
    flydsl_debug_kernel.md - correctness debugging (NaN, zeros, mismatch, compilation, hangs)
    flydsl_tile_programming.md - tile programming guide (skeletons, compute, LDS, MFMA)

The previous skills/flydsl-optimization/ is absorbed into this unified structure. Tests verify the new layout.

update contribution guidelines

…calls

Previously only AmdClaudeModel and AmdGeminiModel sent a "user" request header, and the value resolved to "unknown" inside the Docker container because os.getlogin() raises under `docker exec` and $USER was not forwarded. As a result the gateway could not attribute most requests back to the originating host user.

- Add a module-level `get_amd_llm_user()` helper in `amd_base.py` that prefers `$GEAK_USER`, then `$USER`, then `os.getlogin()`, falling back to "unknown". `_get_user` now delegates to it.
- Forward `-e USER` and `-e GEAK_USER` from the host in `scripts/run-docker.sh` (existing containers must be `--rebuild`ed).
- Send the `"user"` header from `AmdOpenAIModel`, the `LitellmModel` completion path (via `extra_headers`, preserving any explicit override), and the standalone test-discovery MCP server.
- Add unit tests for the resolver and for the header construction in each backend (Claude, OpenAI, Gemini via importorskip, LiteLLM).

Co-authored-by: Cursor <cursoragent@cursor.com>
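
The lookup order described for the resolver can be sketched as below. `resolve_llm_user` and its optional `env` parameter are hypothetical, added so the precedence is easy to exercise; this is not the body of `get_amd_llm_user()`.

```python
import os
from typing import Mapping, Optional


def resolve_llm_user(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the requesting user: $GEAK_USER, then $USER, then os.getlogin()."""
    env = os.environ if env is None else env
    for key in ("GEAK_USER", "USER"):
        value = env.get(key, "").strip()
        if value:
            return value
    try:
        # Raises OSError when there is no controlling terminal (e.g. docker exec).
        return os.getlogin()
    except OSError:
        return "unknown"
```

Checking the environment first means a value forwarded into the container takes effect even when there is no login session for `os.getlogin()` to query.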

Compliance with AMD's guidelines on user request with LLM calls

Port the refactor-test pipeline foundation and subagent-framework registry onto current main as one integrated branch. This keeps existing budget/runtime safeguards while routing through the unified pipeline, YAML subagent registry, language bundles, preprocessing phases, dispatch plan, and renamed pipeline workers. Co-authored-by: Cursor <cursoragent@cursor.com>

Preserve registry agent_name through dispatch, make the subagent CLI executable as a module, and keep harness-only preprocessing from falling back into the legacy monolith. Co-authored-by: Cursor <cursoragent@cursor.com>

… long path name

support multi-gpu usage and minor fixes

This reverts commit 7dceffa.

Refactor mini CLI, add kernel auto-discovery and budget-timeout patch selection

…acts Replace `git add -A` with targeted staging of exactly the files the patch touches (parsed from diff headers after artifact stripping). Falls back to `git add -u` if the patch can't be parsed. Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/) from being accidentally committed alongside the patch, while still handling new files created by translation (e.g. Triton → HIP). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 65334f6)
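
Deriving the staging list from diff headers can be sketched as follows. Both function names are hypothetical, and the regex assumes paths that contain no " b/" sequence and are not quoted by git, which holds for typical source paths but not for all of them.

```python
import re
from typing import List

# Matches the header line git writes for each file in a patch.
_DIFF_HEADER_RE = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def files_touched_by_patch(patch_text: str) -> List[str]:
    """Return the unique paths named in 'diff --git' headers, in first-seen order."""
    seen = {}
    for old, new in _DIFF_HEADER_RE.findall(patch_text):
        for path in (old, new):
            seen.setdefault(path, None)
    return list(seen)


def staging_command(patch_text: str) -> List[str]:
    """Build the git command: explicit paths when parsable, else 'git add -u'."""
    paths = files_touched_by_patch(patch_text)
    if not paths:
        return ["git", "add", "-u"]
    return ["git", "add", "--", *paths]
```

Because only the listed paths are staged, untracked files that the patch never mentions stay out of the commit.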

Two bugs in the Path-A short-circuit caused broken evaluation feedback:

1. When the LLM put all modes in modes_covered (instead of inferred_modes), the Benchmark/Profile/Full Benchmark sections got the raw --correctness command with no flag substitution. Fix: _substitute_mode_flag() deterministically replaces any mismatched harness flag regardless of mode categorization.
2. finish_preprocess unconditionally cleared harness_path on Path A, so between-rounds Metrix profiling was always skipped. Fix: _extract_harness_from_command() recovers the harness path from the user's command when it contains a standard harness flag, preserving it for the evaluation phase.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 0d49d20)

…n.sh wrapping

The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't told the harness supports all four modes, causing it to miscategorize modes and produce wrong flags, and (2) commands used bare `cd && python` instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.

- Enrich harness hint in adapter.py to tell the LLM the harness is pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 812ecf8)

…ent_baseline preflight_commandment_contract now runs in a disposable git worktree so SETUP side-effects (run.sh, JIT caches) never dirty the original repo. recapture_commandment_baseline calls removed — the preprocessor baseline is the single source of truth. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 7e39b61)

…file collection

Three fixes for the verified-speedup evaluation pipeline:

1. Path A now calls collect_baseline and collect_profile when a standard harness is available; previously this was skipped entirely, breaking downstream verified-speedup computation.
2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt from BaselineMetrics.raw_outputs; previously these were hardcoded to None.
3. Full-benchmark baseline uses --full-benchmark stdout (via capture_full_benchmark_stdout) so the config set matches the postprocess evaluator's FULL_BENCHMARK run.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 20fa341)

Path A's commandment_from_user_command generated a bare `--profile` flag substitution, missing the warmup + kernel-profile wrapper that Path B uses. Without the wrapper, profile.json is never written and post-round evaluation cannot access hardware counter data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit c17ac40)

…ernel_url drops the extension parse_task_info's LLM extractor occasionally returns a bare basename (e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check clears kernel_url when the bare name doesn't resolve on disk, but kernel_name (also bare) leaks back through mini.py's `kernel_target = ... or parsed_config.get("kernel_name")` fallback, re-entering _resolve_kernel_and_repo as a bare-name kernel_url. Without this fix, _resolve_kernel_and_repo's repo-relative candidate (repo/kernel_url) is not a file, and the legacy URL resolver also doesn't try extensions, so the run dies with "resolve-kernel-url failed: Kernel file not found: <repo>/<bare>". When the candidate has no suffix, probe each extension from _KERNEL_TYPE_TO_EXT under the repo and promote if exactly one matches. Refuse to guess when multiple match. Co-authored-by: Cursor <cursoragent@cursor.com>

…r from LLM-hallucinated argument names

Observed failure: on a silu run, the orchestrator LLM called `commandment_from_user_command` six times with hallucinated keyword names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`, `kernel_path`) before giving up, burning ~28 minutes of preprocess budget. The tool's TypeError reply included only `<type>: <message>`, which named the *bad* argument but never the correct one.

Three changes:

1. tools.py, `_schema_commandment_from_user_command`: add a STRICT ARGUMENT NAMING block to both the tool description and the per-property descriptions, explicitly listing the synonyms the LLM was inventing and naming the canonical arg (`run_command`, `out_path`).
2. orchestrator.py Case A prompt: add a fenced example of the exact keyword-arg signature with the same do-NOT list.
3. orchestrator.py `_dispatch_tool`: on any tool exception, return a structured error containing the canonical schema's `expected_arguments` and `required_arguments`, the names actually passed, and the traceback tail. For TypeError specifically, add an explicit hint reminding the LLM that the schema is authoritative.

This gives the LLM enough signal to self-correct on the next turn instead of cycling through more synonyms.

Co-authored-by: Cursor <cursoragent@cursor.com>

Fix tool unavailable error, set default pipeline mode, edit readme

…th A

Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working on v3 by surfacing existing legacy modules at v3 call sites. No code duplicated; only call sites added.

1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
   When the user's run_command is a compound shell pipeline (e.g. `python3 scripts/task_runner.py compile && correctness && performance`) without any GEAK harness flag, mirror legacy `resolve_shell_eval_commands` (rsplit on last &&) and call `eval_contract_adapter.materialize_shell_contract_harness` to write `_geak_shell_contract_harness.py` exposing the standard 4-mode CLI. This unblocks the legacy AKA prompt that previously died with "v3 preprocess failed: No harness_path available".

2) Static `validate_harness` gate at two call sites
   - `commandment_from_user_command`: validate user-supplied or synthesized harness; reject malformed paths so finish_preprocess doesn't silently thread bogus paths downstream.
   - `adapter._recover_harness_path`: validate the path picked by legacy `extract_harness_path` so a greedy match on `task_runner.py` is rejected instead of breaking profile/benchmark.

3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
   `collect_baseline_metrics` now runs `--correctness` once with a short timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before the expensive benchmark loop. Broken kernels fail in seconds instead of minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.

4) Compile-command extraction for synthesized harness
   `_try_synthesize_shell_contract_harness` calls legacy `contract_normalize.infer_compile_command_from_eval` to extract the build prefix and re-prepend it to the performance shell so a standalone `--benchmark` invocation rebuilds when needed.

5) `build_baseline_metrics` enrichment in `_project_baseline`
   When a profile result is also available, project legacy `build_baseline_metrics(include_all=True)` keys (`bottleneck`, `top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`) into the baseline_metrics dict. Restores fields that `inject_pipeline_context` consumes downstream which were silently empty on v3.

All five legacy modules (`eval_contract_adapter`, `harness_utils`, `contract_normalize`, `baseline.build_baseline_metrics`) are already in tree; v3 just wasn't calling them.

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f)

add auto finalize before hard kill

- Remove unused imports (F401) in selector.py, writer.py, unified.py
- Fix import sorting (I001) in evaluation.py, selector.py
- Replace unnecessary key check with dict.get (RUF019) in adapter.py
- Remove quoted type annotation (UP037) and unused variable (F841) in unified.py
- Fix test_legacy_context_recovers_harness_path_from_promoted_command: add missing success/full_benchmark_stdout attrs, valid harness content, and update assertions to match current path-based baseline API
- Apply ruff format to all PR-affected files

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Merge gwiab to main

fix the tools call list in litellm

sdubagun-amd

Here are some comments.

sdubagun-amd · 2026-05-29T09:20:28Z

Could you remove and rotate these keys immediately?

sdubagun-amd · 2026-05-29T09:57:39Z

    # Budget
-    max_cost: float = 0.50
-    max_steps: int = 100
+    max_cost: float = 0.0


Would this affect working memory in any way? Why are the costs being set to 0?

sdubagun-amd · 2026-05-29T09:59:03Z

+    parallel-agent flow doesn't differ on HOME and the bug we're fixing
+    is specifically about ``GEAK_*``.
+    """
+    expanded = _expand_env_vars(os.path.expanduser(tok), extra_env)


Could you please address the CI issues?

sdubagun-amd · 2026-05-29T10:16:37Z

+# ``sys.path.insert(0, "/sgl-workspace/sglang/python")`` pinned every run
+# to the baseline sglang checkout.
+_HARDCODED_SYSPATH_RE = re.compile(
+    r"""sys\.path\.insert\(\s*\d+\s*,\s*(['"])(?P<path>/[^'"]*)\1\s*\)"""


sys.path.append or extend wouldn't work with this.
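To make the review point concrete, here is a rough sketch of how the check could be broadened beyond `sys.path.insert`. It assumes a two-step scan (find the call, then pull absolute string literals out of its arguments); the names `_SYSPATH_CALL_RE`, `_ABS_LITERAL_RE`, and `find_hardcoded_sys_paths` are hypothetical and are not part of this PR. It is a heuristic: a path literal containing `)` or `]` would be truncated.

```python
import re

# Hypothetical broadened pattern: matches the opening of
#   sys.path.insert(<int>, ...), sys.path.append(...), sys.path.extend([...])
# and captures the argument text up to the first ')' or ']'.
_SYSPATH_CALL_RE = re.compile(
    r"""sys\.path\.(?:insert\(\s*\d+\s*,|append\(|extend\(\s*[\[(])(?P<args>[^)\]]*)"""
)

# Absolute-path string literal: a quoted string whose first character is '/'.
_ABS_LITERAL_RE = re.compile(r"""(['"])(?P<path>/[^'"]*)\1""")


def find_hardcoded_sys_paths(source: str) -> list[str]:
    """Return every literal absolute path passed to a sys.path mutation."""
    paths: list[str] = []
    for call in _SYSPATH_CALL_RE.finditer(source):
        for literal in _ABS_LITERAL_RE.finditer(call.group("args")):
            paths.append(literal.group("path"))
    return paths
```

Computed paths such as `sys.path.append(os.path.dirname(__file__))` contain no absolute string literal and are deliberately left alone, since they can still resolve inside the worktree.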

mehdi-saeedi and others added 30 commits May 5, 2026 10:23

feat(rdna): enable RDNA in preprocessing and profiling pipeline (#199)

78e94ed

* feat(rdna): add RDNA GPU architecture detection and profiling support

Merge branch 'main' into fix/unified_toolruntime

171d1b3

merge main

a30cc53

style: fix ruff format violation in amd_claude.py

d7841c2

Wrap long line in _init_client to satisfy ruff format check. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Merge pull request #194 from jatseng-ai/fix/model-kwargs-api-key-and-…

714db16

…base-url fix(amd_llm): read api_key and api_base from model_kwargs as fallback

Merge pull request #152 from AMD-AGI/fix/unified_toolruntime

b366369

Fix/unified toolruntime

fix(cli): restore stdin.isatty() guard around configure_if_first_time

6cba3cb

03b5b9e dropped the tty guard, so a CI/scripted run without an API key env would block on prompt() inside interactive setup() instead of failing fast. Restoring the original guard. Made-with: Cursor
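The guard this commit restores can be sketched roughly as below. It is an assumption-laden illustration, not the repository's code: `maybe_run_interactive_setup` and its parameters are hypothetical names, and only the idea (check `stdin.isatty()` before prompting, fail fast otherwise) comes from the commit message.

```python
import sys
from typing import Callable, Optional, TextIO


def maybe_run_interactive_setup(
    has_api_key: bool,
    setup: Callable[[], None],
    stdin: Optional[TextIO] = None,
) -> bool:
    """Run interactive setup only when a human can answer the prompt.

    Returns True if setup ran, False if it was not needed.
    Raises RuntimeError in a non-interactive run with no API key,
    instead of blocking forever on a prompt.
    """
    stream = sys.stdin if stdin is None else stdin
    if has_api_key:
        return False
    if not stream.isatty():
        # CI or scripted run: fail fast rather than wait on input.
        raise RuntimeError("no API key configured and stdin is not a TTY")
    setup()
    return True
```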

Merge pull request #205 from AMD-AGI/feat/run-time-budget

71d781c

feat(budget): add --mode quick|full wall-clock budget for runs

Merge pull request #198 from AMD-AGI/feature/translation-tool-profile

ba80f84

feat(tools): add translation tool profile for TranslationAgent

Merge pull request #211 from AMD-AGI/doc/readme_framework

086875d

update contribution guidelines

Merge pull request #213 from AMD-AGI/feat/llm-gateway-user-header

6a0e912

Compliance with AMD's guidelines on user request with LLM calls

fix: harden integrated GEAK v3 smokes

f893614

Preserve registry agent_name through dispatch, make the subagent CLI executable as a module, and keep harness-only preprocessing from falling back into the legacy monolith. Co-authored-by: Cursor <cursoragent@cursor.com>

add group GPU support for homogenous mode and fix task task_parse for…

b46d56b

… long path name

Merge pull request #216 from AMD-AGI/feature/gpu_group

04f68c3

support multi-gpu usage and minor fixs

yueliu14 and others added 27 commits May 20, 2026 05:36

refacte mini entrance

7dceffa

Revert "refacte mini entrance"

0466f4b

This reverts commit 7dceffa.

modify mini entrance

7e1acf1

Merge pull request #236 from AMD-AGI/gwiab-hip

8c0dacf

Refactor mini CLI, add kernel auto-discovery and budget-timeout patch selection

fix tool unavailable error

db242ef

move model and env to geak.yaml

68d6e5e

default mixed pipeline mode, refine readme

a0053d2

Merge pull request #238 from AMD-AGI/gwiab-hip

10124d1

Fix tool unabailable error, set default pipeline mode, edit readme

add auto finalize before hard kill

f0ebb0b

Merge pull request #239 from AMD-AGI/gwiab-hip

c9c91f7

add auto finalize before hard kill

Merge pull request #240 from AMD-AGI/gwiab

745dd59

Merge gwiab to main

fix the tools call list in litellm

61b071a

update pytest file for litellm_model.py

f458b14

Merge pull request #242 from AMD-AGI/fix/litellm_tools

f32e749

fix the tools call list in litellm

fix harness compile

126ee2f

update bash command: use env. worktree to reject find in / or NFS

859b0f4

fix bug and ruff

ff1355b

sdubagun-amd reviewed May 29, 2026


sdubagun-amd requested changes May 29, 2026


chao-xu-spec force-pushed the fix/harness_compile branch from 5fae218 to ff1355b Compare May 29, 2026 12:32

chao-xu-spec wants to merge 177 commits into main from fix/harness_compile


10 participants
