Make GEAK evaluate worktree (patched) kernels, not the baseline#249
Open
chao-xu-spec wants to merge 177 commits into
Open
Make GEAK evaluate worktree (patched) kernels, not the baseline#249chao-xu-spec wants to merge 177 commits into
chao-xu-spec wants to merge 177 commits into
Conversation
* feat(rdna): add RDNA GPU architecture detection and profiling support
When tasks are submitted via the GEAK API, the top-level model.api_key
field is always serialised as an empty string regardless of what is set
in model_kwargs.api_key. This means _get_api_key() always falls through
to the AMD_LLM_API_KEY env-var lookup, which is also unset in the task
container, causing every amd_llm task to fail immediately with:
ValueError: API key not provided.
Fix: add model_kwargs.get("api_key") as a fallback between
self.config.api_key and the env-var lookups so that keys passed through
the API's model_kwargs dict are honoured.
Similarly, amd_claude._init_client() ignored model_kwargs["api_base"]
when constructing the Anthropic client base_url, falling back to the
default llm-api.amd.com endpoint instead of the caller-specified one.
Add model_kwargs.get("api_base") as a fallback there too.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Wrap long line in _init_client to satisfy ruff format check. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…base-url fix(amd_llm): read api_key and api_base from model_kwargs as fallback
Fix/unified toolruntime
Bound an entire GEAK run end-to-end (1h quick / 2h full), covering preprocess + heterogeneous/homogeneous optimization. Mode is also extracted from natural-language task content. Four enforcement layers: - Polled Deadline at every loop boundary (run/budget.py) - threading.Timer watchdog flips SoftStop ~5min before deadline - ProcessRegistry (run/state.py): SIGTERM -> 5s -> SIGKILL escalation via os.killpg for tracked Popen/mp.Process/Future - SIGINT handler + try/finally cleanup; 2nd Ctrl-C force-terminates Preprocess: 15min soft cap, may borrow up to 50% of total. Stage-aware soft-stop handler classifies by current_stage + benchmark_baseline.txt presence. Profiler wrapped in mp.Process with os.setsid() so GPU grandchildren can be killpg'd. Mode controls only total_s, finalize_grace_s, preprocess caps, and max_rounds. Step/cost limits stay user-controlled to avoid silent overrides. Documented precedence: CLI > mode > env > default. Plumbed deadline + soft_stop + registry through round loop, run_llm_steps, dispatch_tool_call, parallel_helpers (run_pool, run_parallel_heterogeneous), and DefaultAgent.query. Replaced as_completed with poll loops so soft_stop is observed mid-dispatch. 39 new tests (budget, state, mode presets, integration). Repo-wide ruff cleanup. Final: ruff + pylint -E clean; 527 passed, 39 skipped. Made-with: Cursor
…01e4) Restore the load-bearing ``output = output.resolve()`` that 94c01e4 added to ``_derive_output_dir`` and that 03b5b9e silently removed under its "remove dead code" line item. Without it, a relative ``--output`` (or task-extracted ``output_dir``) on a merged HIP/CUDA kernel breaks the pipeline: RuntimeError: Deterministic harness file not found: /repo_root/.../outputs/silu/test_silu_harness.py because the merged-file split helper writes the new harness to ``output_dir / "test_<stem>_harness.py"`` and ``_resolve_deterministic_harness`` treats that relative path as repo-root-relative. Triton runs hide the bug because non-merged kernels skip the split path and use the user-provided absolute ``test_command``. The unit tests for ``_derive_output_dir`` only exercised ``tmp_path`` (always absolute), so ``.resolve()`` was a no-op and the assertions held either way -- which is exactly how the original 03b5b9e regression slipped through CI. Add three regression tests using relative inputs (relative directory, relative file path, bare relative name) that all assert ``out_dir.is_absolute()``. Confirmed these would have failed against the regressed code. Also add an inline comment in ``_derive_output_dir`` naming both 94c01e4 (intent) and 03b5b9e (the regression) so the next refactor pass doesn't repeat the mistake. Made-with: Cursor
03b5b9e dropped the tty guard, so a CI/scripted run without an API key env would block on prompt() inside interactive setup() instead of failing fast. Restoring the original guard. Made-with: Cursor
Closes the homogeneous-mode gap where a stuck subprocess.run inside a sub-agent let --mode quick run past the 1h budget. A) New BudgetSpec.kill_buffer_s (default 60s) and a second optimization watchdog at opt_deadline + kill_buffer_s that calls registry.terminate_all() then os._exit(124). Wired from mini.py and geak-orchestrate. B) New _tracked_subprocess_run() helper in tools/save_and_test.py: a drop-in for subprocess.run that registers its Popen with the run-level ProcessRegistry (start_new_session=True). Only the long-running test_command call is converted; short git ops keep subprocess.run. Registry threaded via DefaultAgent._registry, set alongside _soft_stop in parallel_helpers / parallel_agent. Tests: +4 hard-kill (test_budget.py), +5 tracked-subprocess (new test_save_and_test_registry.py). Total now 561 passed. Made-with: Cursor
ProcessRegistry.register_future() now appends + adds a done-callback that removes the future on completion. Lock switched to RLock so the already-done synchronous-callback case can't deadlock the submitter. Stops the misleading 'futures=N' in terminate_all's SIGTERM-wave log on clean run finalization. +3 tests. Made-with: Cursor
…runs
1. task_parser: log raw LLM response (truncated to 500 chars) on
JSONDecodeError. Without this the only diagnostic was 'char 0',
indistinguishable across all non-JSON failure modes.
2. task_parser: promote a directory kernel_url to a kernel file inside
it (kernel_name.<ext> -> kernel.{py,hip,cu,flydsl} -> single matching
file by kernel_type). Users frequently say 'the kernel is in <DIR>'
and the LLM extractor echoes the directory verbatim. Also tighten
PARSE_TASK_INFO prompt so the LLM is less likely to do this.
3. preprocessor: filter automated_test_discovery results to repo_root
so harnesses from sibling kernel directories
(e.g. .../L3/fused_rms_fp8/test_kernel_harness.py while optimizing
.../L3/gemm_a16wfp4/kernel.py) are dropped before reaching UTA.
+14 tests. 578 passed total.
Made-with: Cursor
When the LLM bails on JSON ('let me check the directory...') we lost the
'quick mode' cue and silently fell back to YAML's 'full' default,
producing 5 rounds instead of 2.
- task_parser._infer_mode_from_text(): regex backstop covering 'quick
mode', '1 hour', '--mode full', etc. Fires only when LLM left mode
as None; LLM result wins on conflict.
- JSON_EXTRACTION_SYSTEM_PROMPT: explicit 'never return prose, never
investigate the filesystem, guess if uncertain' so the LLM stops
treating ambiguous prompts as research tasks.
+13 tests. 591 passed total.
Made-with: Cursor
…on failure Guarantees every preprocessor-produced harness exposes a `--iterations N` argparse flag and refuses to silently progress past contract failures. - C-like wrapper and Triton-split harnesses now inject `--iterations` (with `GEAK_BENCHMARK_ITERATIONS` env fallback); UTA prompt requires it. - `REQUIRED_HARNESS_FLAGS` enforces `--iterations` so missing-flag harnesses fail static validation and trigger UTA's regenerate loop. - Preprocessor raises `PreprocessAborted` when no harness mode produced a baseline (escape hatch: `GEAK_ALLOW_BROKEN_HARNESS=1`). - New `CommandmentExecutionError` raised from `run_correctness_and_benchmark` / `run_profile` on subprocess failure or contract-broken stderr signatures (`unrecognized arguments`, `Harness file not found`, etc.); kernel-level correctness failures keep the legacy `correctness_failed` round status. - New `preflight_commandment_contract` smoke-tests SETUP+CORRECTNESS once with `--iterations 1` before sub-agent fan-out; opt-out via `GEAK_SKIP_COMMANDMENT_PREFLIGHT=1`. Made-with: Cursor
…workers When SoftStop fires while sub-agents are mid-LLM-call (which doesn't observe _soft_stop), the dispatcher's 'with ThreadPoolExecutor as ex:' blocks at exit on shutdown(wait=True), preventing the orchestrator from reaching its deadline-finalize path and forcing the hard-kill watchdog to os._exit(124) without writing final_report.json. Replace the with-block with manual try/finally in run_pool, run_parallel_heterogeneous, and ParallelAgent.run_parallel (homogeneous branch). On SoftStop call shutdown(wait=False, cancel_futures=True) so the dispatcher returns immediately, letting the orchestrator finalize within finalize_grace_s instead of being forcibly killed. +3 tests verifying detach is fast (<2s) and normal drain still works. 605 passed total. Co-authored-by: Cursor <cursoragent@cursor.com>
The Lint & Format workflow's `ruff format --check src/` step has been failing on this branch since e56b227. Re-running `ruff format src/` in-place reformats four files that all share two mechanical patterns: - `src/minisweagent/run/postprocess/evaluation.py`: collapse multi-line `raise CommandmentExecutionError(...)` calls and one implicit-string-concat `logger.error(...)` back onto a single line that fits the 120-col line limit. - `src/minisweagent/run/preprocess/harness_utils.py`: rewrite the `_GEAK_ITERATIONS_SHIM` heredoc from `'''...'''` to `"""..."""` per pyproject.toml's `[tool.ruff.format] quote-style = "double"`. - `src/minisweagent/run/preprocess/preprocessor.py`: collapse one implicit-string-concat `logger.error(...)`. - `src/minisweagent/run/utils/task_parser.py`: collapse two implicit-string-concat `logger.info(...)` calls. Formatter-only; no semantic change. `ruff format --check src/` and `ruff check src/` both clean post-change. Verified with the same `ruff 0.15.2` CI installs via `uv pip install ruff` (unpinned). Made-with: Cursor
Deprecates geak-orchestrate and addresses must-fix and select should-fix items from the PR #205 deep review. Deprecation: - Remove geak-orchestrate console script and standalone CLI from run/orchestrator.py. Its --mode total_s semantics diverged from geak (preprocess elapsed always assumed 0, so --mode quick gave a fresh 1h optimization budget instead of the 1h preprocess+optimization budget the same flag gives via geak). run_orchestrator() and _probe_preprocess_dir() are kept for in-process callers and tests. Must-fix bugs: - tool_dispatch_tasks: remove dead inner loop in the improvement-skip block; collapse the task_paths[:1] over-iteration into a single next(). - _stage_found_improvement: replace __import__() trick with a normal top-level import and log JSON parse failures at WARNING. - _compute_verified_speedup: use 'is None' / '<= 0' so a real 0.0 candidate latency is rejected as a broken measurement instead of looking identical to "couldn't parse"; set failure_reason and propagate it through RoundEvaluation.full_benchmark.failure_reason. - _promote_kernel_url_dir_to_file: cap iterdir() at 32 entries to avoid walking large/shared directories on stale paths; refuse to promote unknown kernel_type without a name hint. - mini.py SIGINT handler: restore the original handler before calling terminate_all() so a third Ctrl-C lands on the default handler instead of recursing back into the SIGINT handler mid-escalation. - PreprocessState: add set_stage(stage) helper that bundles "advance + raise on hard_fail"; replace the four direct state.current_stage = assignments in run_preprocessor that duplicated the guard. - tool_collect_results: actually short-circuit on SoftStop -- narrow the scan to a single-round walk instead of the full cross-round summary. - default.py _setup_save_and_test_context: document the idempotent re-init coupling that parallel/heterogeneous helpers rely on. Should-fix: - Deadline.cap() returns 0.0 once SoftStop is set so callers using cap() to size new subprocess timeouts refuse new long-running work without needing a separate soft_stop.is_set() poll. - _filter_discovery_to_repo_root: also catch RecursionError from Path.resolve() (NFS/macOS symlink loops). - mini.py hard-kill watchdog: best-effort write a stub final_report.json ({status: hard_kill, exit_code: 124, elapsed_s, reason}) before os._exit(124) so the operator has something on disk. - apply_mode_presets: log the actual config delta (+key=val, key: before->after) instead of walking the preset tree, which misrepresented dict-replaces-scalar merges. - validate_harness: strip Python comments via tokenize before checking required-flag presence so "# --iterations N not yet supported" doesn't satisfy the validator. Mirror the helper between the duplicated copies in pipeline_helpers and preprocess/harness_utils. Tests: - New test for Deadline.cap() returning 0 under SoftStop. - New test that comment-only --iterations is rejected by validate_harness. Full unit test suite: 607 passed (was 605), 39 skipped, 2 xfailed. ruff check / ruff format --check on src/ clean. Co-authored-by: Cursor <cursoragent@cursor.com>
…strip helper Pylint flagged ``src/minisweagent/run/preprocess/harness_utils.py:1144: E1101: Module 'tokenize' has no 'TokenizeError' member (no-member)``. The exception class is ``tokenize.TokenError`` (singular, no "ize"); the typo would have raised ``AttributeError`` if the ``except`` clause ever fired on a tokenize failure. Tests still pass because the source we feed ``_strip_python_comments`` is well-formed; pylint catches what runtime coverage didn't. Co-authored-by: Cursor <cursoragent@cursor.com>
A harness that doesn't declare --iterations is no longer a hard-fail at
validation. GEAK warns once, never rewrites the harness, and silently
strips --iterations N from any subprocess invocation of that harness so
argparse doesn't crash with "unrecognized arguments". Iteration counts
still flow via GEAK_BENCHMARK_ITERATIONS for harnesses that read the
env var; harnesses that don't run at their hardcoded default. This
replaces the old "your harness must declare --iterations or set
GEAK_ALLOW_BROKEN_HARNESS=1" UX with "we'll route around it".
Mechanism:
- New ``harness_supports_iterations(path)`` detector in
``run/preprocess/harness_utils.py`` (memoized via lru_cache, strips
comments before checking, returns False on unreadable paths). Paired
with ``reset_harness_support_cache()`` and a small
``_strip_iterations_tokens`` argv-string scrubber.
- ``REQUIRED_HARNESS_FLAGS`` keeps the four mode flags (--profile,
--correctness, --benchmark, --full-benchmark); a new
``RECOMMENDED_HARNESS_FLAGS = ("--iterations",)`` carries the
downgraded contract. Both copies of ``validate_harness`` (preprocess +
run/) emit a WARNING on missing --iterations and return valid=True
with the warning surfaced in the messages list.
- Three EXTRA_ARGS construction sites are gated on
``harness_supports_iterations(harness_path)``:
``build_eval_env`` (postprocess/evaluation.py),
``preflight_commandment_contract`` (postprocess/evaluation.py),
and ``run_task_batch`` (run/dispatch.py). All three now also seed
``GEAK_BENCHMARK_ITERATIONS`` as the canonical fallback channel.
- Belt-and-suspenders gate in ``run_harness._run_single``: even if a
future construction site forgets to gate, we scrub --iterations N
out of EXTRA_ARGS before extending argv when the harness lacks the
flag.
- ``_strip_python_comments`` becomes single-source in
``preprocess/harness_utils.py``; ``run/pipeline_helpers.py`` imports
it instead of duplicating (closes review nit 4.1, partial).
- UnitTestAgent prompt softened from MANDATORY to STRONGLY RECOMMENDED
while keeping the wire-both-channels example so generated harnesses
stay clean.
Tests:
- 5 tests for ``harness_supports_iterations`` (declared / absent /
comment-only / unreadable / cache-invalidation).
- 3 tests for ``_strip_iterations_tokens``.
- 2 tests for ``build_eval_env`` gating.
- 3 end-to-end tests via ``run_harness._run_single`` with a real stub
harness that echoes its argv (passes --iterations when supported,
strips when not, preserves other tokens).
- ``validate_harness`` tests rewritten: warns-but-passes when missing
--iterations / when only in a comment; new
``test_rejects_missing_required_flag`` locks in that the four mode
flags are still required.
Verification: 621 unit tests pass (was 607). ruff check, ruff format
--check, and pylint --errors-only all clean on src/.
Co-authored-by: Cursor <cursoragent@cursor.com>
feat(budget): add --mode quick|full wall-clock budget for runs
feat(tools): add translation tool profile for TranslationAgent
Consolidate flydsl-optimization, flydsl-debug-kernel, and
flydsl-tile-programming into a single skills/flydsl/ skill following
the pytorch2flydsl-translation pattern (summary SKILL.md + docs/).
skills/flydsl/
SKILL.md - unified summary covering the full kernel lifecycle:
write (tile programming), optimize (performance), debug (correctness)
docs/
flydsl_optimization.md - optimization workflow and strategies
flydsl_debug_kernel.md - correctness debugging (NaN, zeros, mismatch, compilation, hangs)
flydsl_tile_programming.md - tile programming guide (skeletons, compute, LDS, MFMA)
The previous skills/flydsl-optimization/ is absorbed into this unified
structure. Tests verify the new layout.
update contribution guidelines
…calls Previously only AmdClaudeModel and AmdGeminiModel sent a "user" request header, and the value resolved to "unknown" inside the Docker container because os.getlogin() raises under `docker exec` and $USER was not forwarded. As a result the gateway could not attribute most requests back to the originating host user. - Add a module-level `get_amd_llm_user()` helper in `amd_base.py` that prefers `$GEAK_USER`, then `$USER`, then `os.getlogin()`, falling back to "unknown". `_get_user` now delegates to it. - Forward `-e USER` and `-e GEAK_USER` from the host in `scripts/run-docker.sh` (existing containers must be `--rebuild`ed). - Send the `"user"` header from `AmdOpenAIModel`, the `LitellmModel` completion path (via `extra_headers`, preserving any explicit override), and the standalone test-discovery MCP server. - Add unit tests for the resolver and for the header construction in each backend (Claude, OpenAI, Gemini via importorskip, LiteLLM). Co-authored-by: Cursor <cursoragent@cursor.com>
Compliance with AMD's guidelines on user request with LLM calls
Port the refactor-test pipeline foundation and subagent-framework registry onto current main as one integrated branch. This keeps existing budget/runtime safeguards while routing through the unified pipeline, YAML subagent registry, language bundles, preprocessing phases, dispatch plan, and renamed pipeline workers. Co-authored-by: Cursor <cursoragent@cursor.com>
Preserve registry agent_name through dispatch, make the subagent CLI executable as a module, and keep harness-only preprocessing from falling back into the legacy monolith. Co-authored-by: Cursor <cursoragent@cursor.com>
support multi-gpu usage and minor fixs
This reverts commit 7dceffa.
Refactor mini CLI, add kernel auto-discovery and budget-timeout patch selection
…acts Replace `git add -A` with targeted staging of exactly the files the patch touches (parsed from diff headers after artifact stripping). Falls back to `git add -u` if the patch can't be parsed. Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/) from being accidentally committed alongside the patch, while still handling new files created by translation (e.g. Triton → HIP). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 65334f6)
Two bugs in the Path-A short-circuit caused broken evaluation feedback: 1. When the LLM put all modes in modes_covered (instead of inferred_modes), the Benchmark/Profile/Full Benchmark sections got the raw --correctness command with no flag substitution. Fix: _substitute_mode_flag() deterministically replaces any mismatched harness flag regardless of mode categorization. 2. finish_preprocess unconditionally cleared harness_path on Path A, so between-rounds Metrix profiling was always skipped. Fix: _extract_harness_from_command() recovers the harness path from the user's command when it contains a standard harness flag, preserving it for the evaluation phase. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 0d49d20)
…n.sh wrapping
The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't
told the harness supports all four modes, causing it to miscategorize
modes and produce wrong flags, and (2) commands used bare `cd && python`
instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.
- Enrich harness hint in adapter.py to tell the LLM the harness is
pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated
harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 812ecf8)
…ent_baseline preflight_commandment_contract now runs in a disposable git worktree so SETUP side-effects (run.sh, JIT caches) never dirty the original repo. recapture_commandment_baseline calls removed — the preprocessor baseline is the single source of truth. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 7e39b61)
…file collection Three fixes for the verified-speedup evaluation pipeline: 1. Path A now calls collect_baseline and collect_profile when a standard harness is available — previously skipped entirely, breaking downstream verified-speedup computation. 2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt from BaselineMetrics.raw_outputs — previously hardcoded to None. 3. Full-benchmark baseline uses --full-benchmark stdout (via capture_full_benchmark_stdout) so the config set matches the postprocess evaluator's FULL_BENCHMARK run. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit 20fa341)
Path A's commandment_from_user_command generated a bare `--profile` flag substitution, missing the warmup + kernel-profile wrapper that Path B uses. Without the wrapper, profile.json is never written and post-round evaluation cannot access hardware counter data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> (cherry picked from commit c17ac40)
…ernel_url drops the extension
parse_task_info's LLM extractor occasionally returns a bare basename
(e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check
clears kernel_url when the bare name doesn't resolve on disk, but
kernel_name (also bare) leaks back through mini.py's
`kernel_target = ... or parsed_config.get("kernel_name")` fallback,
re-entering _resolve_kernel_and_repo as a bare-name kernel_url.
Without this fix, _resolve_kernel_and_repo's repo-relative candidate
(repo/kernel_url) is not a file, and the legacy URL resolver also
doesn't try extensions, so the run dies with
"resolve-kernel-url failed: Kernel file not found: <repo>/<bare>".
When the candidate has no suffix, probe each extension from
_KERNEL_TYPE_TO_EXT under the repo and promote if exactly one
matches. Refuse to guess when multiple match.
Co-authored-by: Cursor <cursoragent@cursor.com>
…r from LLM-hallucinated argument names Observed failure: on a silu run, the orchestrator LLM called `commandment_from_user_command` six times with hallucinated keyword names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`, `kernel_path`) before giving up, burning ~28 minutes of preprocess budget. The tool's TypeError reply included only `<type>: <message>`, which named the *bad* argument but never the correct one. Three changes: 1. tools.py — `_schema_commandment_from_user_command`: add a STRICT ARGUMENT NAMING block to both the tool description and the per-property descriptions, explicitly listing the synonyms the LLM was inventing and naming the canonical arg (`run_command`, `out_path`). 2. orchestrator.py Case A prompt: add a fenced example of the exact keyword-arg signature with the same do-NOT list. 3. orchestrator.py `_dispatch_tool`: on any tool exception, return a structured error containing the canonical schema's `expected_arguments` and `required_arguments`, the names actually passed, and the traceback tail. For TypeError specifically, add an explicit hint reminding the LLM that the schema is authoritative. This gives the LLM enough signal to self-correct on the next turn instead of cycling through more synonyms. Co-authored-by: Cursor <cursoragent@cursor.com>
Fix tool unabailable error, set default pipeline mode, edit readme
…th A
Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working
on v3 by surfacing existing legacy modules at v3 call sites. No code
duplicated; only call sites added.
1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
When the user's run_command is a compound shell pipeline (e.g.
`python3 scripts/task_runner.py compile && correctness && performance`)
without any GEAK harness flag, mirror legacy
`resolve_shell_eval_commands` (rsplit on last &&) and call
`eval_contract_adapter.materialize_shell_contract_harness` to write
`_geak_shell_contract_harness.py` exposing the standard 4-mode CLI.
This unblocks the legacy AKA prompt that previously died with
"v3 preprocess failed: No harness_path available".
2) Static `validate_harness` gate at two call sites
- `commandment_from_user_command`: validate user-supplied or synthesized
harness; reject malformed paths so finish_preprocess doesn't silently
thread bogus paths downstream.
- `adapter._recover_harness_path`: validate the path picked by
legacy `extract_harness_path` so a greedy match on `task_runner.py`
is rejected instead of breaking profile/benchmark.
3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
`collect_baseline_metrics` now runs `--correctness` once with a short
timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before
the expensive benchmark loop. Broken kernels fail in seconds instead of
minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.
4) Compile-command extraction for synthesized harness
`_try_synthesize_shell_contract_harness` calls legacy
`contract_normalize.infer_compile_command_from_eval` to extract the
build prefix and re-prepend it to the performance shell so a standalone
`--benchmark` invocation rebuilds when needed.
5) `build_baseline_metrics` enrichment in `_project_baseline`
When a profile result is also available, project legacy
`build_baseline_metrics(include_all=True)` keys (`bottleneck`,
`top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`)
into the baseline_metrics dict. Restores fields that
`inject_pipeline_context` consumes downstream which were silently empty
on v3.
All five legacy modules (`eval_contract_adapter`, `harness_utils`,
`contract_normalize`, `baseline.build_baseline_metrics`) are already in
tree; v3 just wasn't calling them.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f)
add auto finalize before hard kill
- Remove unused imports (F401) in selector.py, writer.py, unified.py - Fix import sorting (I001) in evaluation.py, selector.py - Replace unnecessary key check with dict.get (RUF019) in adapter.py - Remove quoted type annotation (UP037) and unused variable (F841) in unified.py - Fix test_legacy_context_recovers_harness_path_from_promoted_command: add missing success/full_benchmark_stdout attrs, valid harness content, and update assertions to match current path-based baseline API - Apply ruff format to all PR-affected files Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Merge gwiab to main
fix the tools call list in litellm
Collaborator
sdubagun-amd
left a comment
There was a problem hiding this comment.
Here are some comments.
Collaborator
There was a problem hiding this comment.
Could you remove and rotate these keys immediately?
| # Budget | ||
| max_cost: float = 0.50 | ||
| max_steps: int = 100 | ||
| max_cost: float = 0.0 |
Collaborator
There was a problem hiding this comment.
Would this affect working memory in any way? Why are the costs being set to 0?
| parallel-agent flow doesn't differ on HOME and the bug we're fixing | ||
| is specifically about ``GEAK_*``. | ||
| """ | ||
| expanded = _expand_env_vars(os.path.expanduser(tok), extra_env) |
Collaborator
There was a problem hiding this comment.
Could you please address the CI issues?
| # ``sys.path.insert(0, "/sgl-workspace/sglang/python")`` pinned every run | ||
| # to the baseline sglang checkout. | ||
| _HARDCODED_SYSPATH_RE = re.compile( | ||
| r"""sys\.path\.insert\(\s*\d+\s*,\s*(['"])(?P<path>/[^'"]*)\1\s*\)""" |
Collaborator
There was a problem hiding this comment.
sys.path.append or extend wouldn't work with this.
sdubagun-amd
requested changes
May 29, 2026
5fae218 to
ff1355b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Close #246
Problem
GEAK copies a repo into a worktree, applies the candidate patch, then runs
harness.py. In several setups the harness silently imported the BASELINEkernel instead of the worktree, so every speedup measured baseline-vs-baseline
(~1.00x). Three independent failure modes were found:
.somodules were never rebuilt from worktreesource (
AITER_REBUILDunset;module_aiter_coreon a hardcodedrebuild-allowlist), so edits never entered the runtime binary.
editable-installed, so
import sgl_kernelresolved to the site-packageswheel — baseline again.
sys.path.insert(0, "/sgl-workspace/sglang/python"):a literal absolute path at sys.path[0] shadowed both the worktree on
PYTHONPATH and the editable install (observed in
rotary_embedding_kernel_202605290819).vLLM adds a fourth constraint: it ships as a wheel-only install (no
setup.py,multi-GB
.so), so neithergit worktreenorpip install -eapplies.Changes
1. Force JIT kernels to rebuild from worktree source
_compile_bootstrap/{__init__,sitecustomize}.py(new): stdlib-only bootstrapauto-loaded via PYTHONPATH; sets
AITER_REBUILD=2and installs an importhook clearing aiter's
rebuilded_list(coversmodule_aiter_core).run/preprocess/run_harness.py,tools/save_and_test.py,run/preprocess_v3/baseline.py: inject the bootstrap dir +AITER_REBUILDinto every harness subprocess (setdefault — explicit overrides win).
2. Editable-install every installable sub-project in the worktree
run/preprocess/worktree_install.py(new): bounded recursive walk installsevery
setup.py/pyproject.tomlsub-project; two-tier strategy(
pip install -e→setup_rocm.py developfallback); snapshots & restoresthe original wheels.
run/mini.py:geak --cleanuprestores the original wheel-installedpackages (best-effort).
run/utils/generated_artifacts.py: keep editable-install build side-effects(hipify-modified files, generated metadata) out of captured patches.
3. vLLM wheel-only support via shadow worktree
kernel_packages/{__init__,profile,shadow_worktree,vllm_profile}.py(new):PackageProfileregistry; shadow tree copies.py(writable) and symlinks.so(immutable baseline), git-inits for cleangit diff; vLLM profiledetects wheel-only layout, skips editable install, injects runtime env.
run/task_file.py:create_worktreedispatches to a profile'smake_worktreewhen one matches.agents/parallel_agent.py: skipgit initfor shadow-tree profiles (neverpollute system site-packages).
4. Reject harnesses that bypass the worktree (the 1.00x bug)
run/preprocess/unit_test_agent.py: forbid hardcodedsys.path.insert(0, '/abs')in triton/asm/unknown/cuda guidance (onlyhip/ck had it); steer toward
GEAK_WORK_DIR.kernel_languages/contract.py:find_hardcoded_syspath_inserts+validate_harnessraisesContractViolation→ HarnessBuilder regenerates.run/preprocess/harness_utils.py: same detection →valid=False, sophases and the Path-A short-circuit drop and regenerate the harness.
pipeline_workers/preprocess/harness_builder.py,run/preprocess_v3/tools.py: wire stricter validation into Path-A.5. Atomic editor writes
tools/editor_tool.py: write via temp file +os.replaceso edits nevercorrupt a hard-linked/shared baseline inode.
6. Bash tool firewall + per-slot env resolution
tools/bash_command.py: L1 wall-clock timeout + L2 filesystem-scan firewall(reject
find /, NFS roots); expand$VAR/${VAR}from the injected env soGEAK_WORK_DIRresolves per parallel slot.subagents/**/SYSTEM_PROMPT.md,subagents/_common/search_scope_hint.md,run/preprocess_v3/subagent.py: tell subagents the allowed search scope.Cleanup
run/preprocess/commandment.py,run/utils/generated_artifacts.py: removethe superseded
.aiter_jit/AITER_JIT_DIRscratch-dir mechanism (replacedby AITER_REBUILD + sitecustomize hook); unify cpp/python SETUP template.
Tests
tests/kernel_packages/(new): compile bootstrap, profile detection,shadow_worktree layout, end-to-end dispatch.
tests/run/test_worktree_install.py(new): recursive editable install.tests/run/test_harness_workdir_and_tolerance.py: hardcoded-sys.path gate(both validators), workdir injection, tolerance cap.
tests/tools/test_bash_command_safety.py: firewall + env expansion.Made with Cursor