
Make GEAK evaluate worktree (patched) kernels, not the baseline #249

Open
chao-xu-spec wants to merge 177 commits into main from fix/harness_compile

Conversation

@chao-xu-spec
Collaborator

Close #246

Problem

GEAK copies a repo into a worktree, applies the candidate patch, then runs
harness.py. In several setups the harness silently imported the BASELINE
kernel instead of the worktree, so every speedup measured baseline-vs-baseline
(~1.00x). Three independent failure modes were found:

  1. aiter (JIT C++/HIP): .so modules were never rebuilt from worktree
    source (AITER_REBUILD unset; module_aiter_core on a hardcoded
    rebuild-allowlist), so edits never entered the runtime binary.
  2. sgl-kernel / monorepos: only the FIRST installable sub-project was
    editable-installed, so import sgl_kernel resolved to the site-packages
    wheel — baseline again.
  3. Harness hardcoded sys.path.insert(0, "/sgl-workspace/sglang/python"):
    a literal absolute path at sys.path[0] shadowed both the worktree on
    PYTHONPATH and the editable install (observed in
    rotary_embedding_kernel_202605290819).

vLLM adds a fourth constraint: it ships as a wheel-only install (no setup.py,
multi-GB .so), so neither git worktree nor pip install -e applies.

Changes

1. Force JIT kernels to rebuild from worktree source

  • _compile_bootstrap/{__init__,sitecustomize}.py (new): stdlib-only bootstrap
    auto-loaded via PYTHONPATH; sets AITER_REBUILD=2 and installs an import
    hook clearing aiter's rebuilded_list (covers module_aiter_core).
  • run/preprocess/run_harness.py, tools/save_and_test.py,
    run/preprocess_v3/baseline.py: inject the bootstrap dir + AITER_REBUILD
    into every harness subprocess (setdefault — explicit overrides win).
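
A minimal sketch of the env injection described in this section. The helper name `build_harness_env` and its signature are hypothetical and only illustrate the `setdefault` behaviour; the actual code in run_harness.py may look different.

```python
import os


def build_harness_env(base_env, bootstrap_dir):
    """Return a copy of base_env prepared for a harness subprocess.

    Hypothetical helper: the name and signature are illustrative only.
    """
    env = dict(base_env)
    # setdefault keeps an explicit caller override (e.g. AITER_REBUILD=0).
    env.setdefault("AITER_REBUILD", "2")
    parts = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if bootstrap_dir not in parts:
        # Prepend so Python discovers the bootstrap's sitecustomize.py first.
        parts.insert(0, bootstrap_dir)
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env
```

The caller would pass the result as `env=` to `subprocess.run`, leaving `os.environ` of the parent untouched.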

2. Editable-install every installable sub-project in the worktree

  • run/preprocess/worktree_install.py (new): bounded recursive walk installs
    every setup.py/pyproject.toml sub-project; two-tier strategy
    (pip install -e, then a setup_rocm.py develop fallback); snapshots & restores
    the original wheels.
  • run/mini.py: geak --cleanup restores the original wheel-installed
    packages (best-effort).
  • run/utils/generated_artifacts.py: keep editable-install build side-effects
    (hipify-modified files, generated metadata) out of captured patches.
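
The bounded walk could be sketched roughly as follows. The function name, the depth limit of 3, and the skipped directory names are assumptions for illustration; they are not taken from worktree_install.py.

```python
import os
from pathlib import Path

MARKERS = ("setup.py", "pyproject.toml")
# Assumed skip list; the real implementation may prune differently.
SKIP_DIRS = {".git", "build", "node_modules", "__pycache__"}


def find_installable_projects(root, max_depth=3):
    """Return directories under root that contain a packaging marker file."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if depth >= max_depth:
            dirnames[:] = []  # bound the walk: do not descend further
        else:
            dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        if any(marker in filenames for marker in MARKERS):
            found.append(Path(dirpath))
    return found
```

Each returned directory would then be a candidate for an editable install, so that a monorepo with several sub-projects no longer stops at the first one.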

3. vLLM wheel-only support via shadow worktree

  • kernel_packages/{__init__,profile,shadow_worktree,vllm_profile}.py (new):
    PackageProfile registry; shadow tree copies .py (writable) and symlinks
    .so (immutable baseline), git-inits for clean git diff; vLLM profile
    detects wheel-only layout, skips editable install, injects runtime env.
  • run/task_file.py: create_worktree dispatches to a profile's
    make_worktree when one matches.
  • agents/parallel_agent.py: skip git init for shadow-tree profiles (never
    pollute system site-packages).
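
A rough sketch of the shadow-tree layout, assuming a POSIX filesystem with symlink support. `make_shadow_tree` is a hypothetical name; this version ignores files that are neither `.py` nor `.so` and omits the git-init step.

```python
import os
import shutil
from pathlib import Path


def make_shadow_tree(package_dir, shadow_dir):
    """Mirror package_dir into shadow_dir: copy .py files, symlink .so files."""
    src_root = Path(package_dir)
    dst_root = Path(shadow_dir)
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = Path(dirpath).relative_to(src_root)
        (dst_root / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src = Path(dirpath) / name
            dst = dst_root / rel / name
            if name.endswith(".py"):
                shutil.copy2(src, dst)  # writable copy the agent may edit
            elif name.endswith(".so"):
                os.symlink(src, dst)  # shared baseline binary, never rewritten
    return dst_root
```

Edits to the copied `.py` files stay inside the shadow tree, while the multi-GB binaries are shared rather than duplicated.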

4. Reject harnesses that bypass the worktree (the 1.00x bug)

  • run/preprocess/unit_test_agent.py: forbid hardcoded
    sys.path.insert(0, '/abs') in triton/asm/unknown/cuda guidance (only
    hip/ck had it); steer toward GEAK_WORK_DIR.
  • kernel_languages/contract.py: find_hardcoded_syspath_inserts +
    validate_harness raises ContractViolation → HarnessBuilder regenerates.
  • run/preprocess/harness_utils.py: same detection → valid=False, so
    phases and the Path-A short-circuit drop and regenerate the harness.
  • pipeline_workers/preprocess/harness_builder.py,
    run/preprocess_v3/tools.py: wire stricter validation into Path-A.
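
To illustrate the detection, here is a small sketch built on the pattern quoted in this PR's review thread. The return type of `find_hardcoded_syspath_inserts` (a list of path strings) is an assumption; the real function in contract.py may return something richer.

```python
import re

# Pattern as quoted in the review thread of this PR.
_HARDCODED_SYSPATH_RE = re.compile(
    r"""sys\.path\.insert\(\s*\d+\s*,\s*(['"])(?P<path>/[^'"]*)\1\s*\)"""
)


def find_hardcoded_syspath_inserts(source):
    """Return absolute paths passed as string literals to sys.path.insert."""
    return [m.group("path") for m in _HARDCODED_SYSPATH_RE.finditer(source)]
```

Only literal absolute paths are flagged; an insert driven by `GEAK_WORK_DIR` passes. The pattern matches `sys.path.insert` alone, so `sys.path.append` and `sys.path.extend` are not caught by it.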

5. Atomic editor writes

  • tools/editor_tool.py: write via temp file + os.replace so edits never
    corrupt a hard-linked/shared baseline inode.
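
A minimal sketch of the temp-file-plus-`os.replace` pattern. `atomic_write_text` is a hypothetical name, not necessarily what editor_tool.py uses.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path via a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace points the directory entry at a new inode, so any other
        # hard link to the old inode keeps the old contents.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing in place with `open(path, "w")` would instead truncate the shared inode, which is how an edit in the worktree could leak into a hard-linked baseline file.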

6. Bash tool firewall + per-slot env resolution

  • tools/bash_command.py: L1 wall-clock timeout + L2 filesystem-scan firewall
    (reject find /, NFS roots); expand $VAR/${VAR} from the injected env so
    GEAK_WORK_DIR resolves per parallel slot.
  • subagents/**/SYSTEM_PROMPT.md, subagents/_common/search_scope_hint.md,
    run/preprocess_v3/subagent.py: tell subagents the allowed search scope.
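
The per-slot expansion might look roughly like this. `expand_from_env` is a hypothetical stand-in for the helper in bash_command.py; it looks names up only in the injected mapping and leaves unknown variables untouched, which is an assumed policy.

```python
import re

_VAR_RE = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def expand_from_env(text, env):
    """Expand $VAR and ${VAR} from env only; unknown names stay as written."""

    def _substitute(match):
        name = match.group(1) or match.group(2)
        return env.get(name, match.group(0))

    return _VAR_RE.sub(_substitute, text)
```

Because the lookup uses the slot's own env dict rather than `os.environ`, two parallel slots can resolve `$GEAK_WORK_DIR` to different worktrees. The sketch does not handle shell quoting or escaped dollar signs.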

Cleanup

  • run/preprocess/commandment.py, run/utils/generated_artifacts.py: remove
    the superseded .aiter_jit / AITER_JIT_DIR scratch-dir mechanism (replaced
    by AITER_REBUILD + sitecustomize hook); unify cpp/python SETUP template.

Tests

  • tests/kernel_packages/ (new): compile bootstrap, profile detection,
    shadow_worktree layout, end-to-end dispatch.
  • tests/run/test_worktree_install.py (new): recursive editable install.
  • tests/run/test_harness_workdir_and_tolerance.py: hardcoded-sys.path gate
    (both validators), workdir injection, tolerance cap.
  • tests/tools/test_bash_command_safety.py: firewall + env expansion.

Made with Cursor

mehdi-saeedi and others added 30 commits May 5, 2026 10:23
* feat(rdna): add RDNA GPU architecture detection and profiling support
When tasks are submitted via the GEAK API, the top-level model.api_key
field is always serialised as an empty string regardless of what is set
in model_kwargs.api_key. This means _get_api_key() always falls through
to the AMD_LLM_API_KEY env-var lookup, which is also unset in the task
container, causing every amd_llm task to fail immediately with:

  ValueError: API key not provided.

Fix: add model_kwargs.get("api_key") as a fallback between
self.config.api_key and the env-var lookups so that keys passed through
the API's model_kwargs dict are honoured.

Similarly, amd_claude._init_client() ignored model_kwargs["api_base"]
when constructing the Anthropic client base_url, falling back to the
default llm-api.amd.com endpoint instead of the caller-specified one.
Add model_kwargs.get("api_base") as a fallback there too.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Wrap long line in _init_client to satisfy ruff format check.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…base-url

fix(amd_llm): read api_key and api_base from model_kwargs as fallback
Bound an entire GEAK run end-to-end (1h quick / 2h full), covering
preprocess + heterogeneous/homogeneous optimization. Mode is also
extracted from natural-language task content.

Four enforcement layers:
- Polled Deadline at every loop boundary (run/budget.py)
- threading.Timer watchdog flips SoftStop ~5min before deadline
- ProcessRegistry (run/state.py): SIGTERM -> 5s -> SIGKILL escalation
  via os.killpg for tracked Popen/mp.Process/Future
- SIGINT handler + try/finally cleanup; 2nd Ctrl-C force-terminates
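
The SIGTERM -> grace -> SIGKILL escalation can be sketched as below (POSIX only). `terminate_group` is a hypothetical helper, and it assumes the child was started with `start_new_session=True` so that its process-group id equals its pid.

```python
import os
import signal
import subprocess
import sys


def terminate_group(proc, grace_s=5.0):
    """SIGTERM the child's process group; SIGKILL it if it outlives grace_s."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.poll()  # group already gone
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    print(terminate_group(child))
```

Signalling the group rather than the single pid is what lets grandchildren (for example GPU worker processes) be cleaned up together with their parent.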

Preprocess: 15min soft cap, may borrow up to 50% of total. Stage-aware
soft-stop handler classifies by current_stage + benchmark_baseline.txt
presence. Profiler wrapped in mp.Process with os.setsid() so GPU
grandchildren can be killpg'd.

Mode controls only total_s, finalize_grace_s, preprocess caps, and
max_rounds. Step/cost limits stay user-controlled to avoid silent
overrides. Documented precedence: CLI > mode > env > default.

Plumbed deadline + soft_stop + registry through round loop, run_llm_steps,
dispatch_tool_call, parallel_helpers (run_pool, run_parallel_heterogeneous),
and DefaultAgent.query. Replaced as_completed with poll loops so
soft_stop is observed mid-dispatch.

39 new tests (budget, state, mode presets, integration). Repo-wide ruff
cleanup. Final: ruff + pylint -E clean; 527 passed, 39 skipped.

Made-with: Cursor
…01e4)

Restore the load-bearing ``output = output.resolve()`` that 94c01e4 added
to ``_derive_output_dir`` and that 03b5b9e silently removed under its
"remove dead code" line item. Without it, a relative ``--output``
(or task-extracted ``output_dir``) on a merged HIP/CUDA kernel breaks
the pipeline:

  RuntimeError: Deterministic harness file not found:
    /repo_root/.../outputs/silu/test_silu_harness.py

because the merged-file split helper writes the new harness to
``output_dir / "test_<stem>_harness.py"`` and ``_resolve_deterministic_harness``
treats that relative path as repo-root-relative.

Triton runs hide the bug because non-merged kernels skip the split path
and use the user-provided absolute ``test_command``. The unit tests for
``_derive_output_dir`` only exercised ``tmp_path`` (always absolute), so
``.resolve()`` was a no-op and the assertions held either way -- which
is exactly how the original 03b5b9e regression slipped through CI.

Add three regression tests using relative inputs (relative directory,
relative file path, bare relative name) that all assert
``out_dir.is_absolute()``. Confirmed these would have failed against
the regressed code.

Also add an inline comment in ``_derive_output_dir`` naming both
94c01e4 (intent) and 03b5b9e (the regression) so the next refactor
pass doesn't repeat the mistake.

Made-with: Cursor
03b5b9e dropped the tty guard, so a CI/scripted run without an API key
env would block on prompt() inside interactive setup() instead of
failing fast. Restoring the original guard.

Made-with: Cursor
Closes the homogeneous-mode gap where a stuck subprocess.run inside a
sub-agent let --mode quick run past the 1h budget.

A) New BudgetSpec.kill_buffer_s (default 60s) and a second optimization
   watchdog at opt_deadline + kill_buffer_s that calls
   registry.terminate_all() then os._exit(124). Wired from mini.py and
   geak-orchestrate.

B) New _tracked_subprocess_run() helper in tools/save_and_test.py: a
   drop-in for subprocess.run that registers its Popen with the
   run-level ProcessRegistry (start_new_session=True). Only the
   long-running test_command call is converted; short git ops keep
   subprocess.run. Registry threaded via DefaultAgent._registry, set
   alongside _soft_stop in parallel_helpers / parallel_agent.

Tests: +4 hard-kill (test_budget.py), +5 tracked-subprocess (new
test_save_and_test_registry.py). Total now 561 passed.

Made-with: Cursor
ProcessRegistry.register_future() now appends + adds a done-callback
that removes the future on completion. Lock switched to RLock so the
already-done synchronous-callback case can't deadlock the submitter.
Stops the misleading 'futures=N' in terminate_all's SIGTERM-wave log
on clean run finalization. +3 tests.
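
A small sketch of why the lock has to be re-entrant. `FutureRegistry` is a simplified, hypothetical stand-in for the future-tracking part of ProcessRegistry.

```python
import threading
from concurrent.futures import Future


class FutureRegistry:
    """Track in-flight futures and drop each one when it completes."""

    def __init__(self):
        # RLock: add_done_callback runs the callback synchronously on the
        # calling thread when the future is already done, i.e. while
        # register_future still holds the lock.
        self._lock = threading.RLock()
        self._futures = []

    def register_future(self, future):
        with self._lock:
            self._futures.append(future)
            future.add_done_callback(self._discard)

    def _discard(self, future):
        with self._lock:
            if future in self._futures:
                self._futures.remove(future)

    def pending(self):
        with self._lock:
            return len(self._futures)
```

With a plain `threading.Lock`, registering an already-completed future would block forever inside `_discard`.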

Made-with: Cursor
…runs

1. task_parser: log raw LLM response (truncated to 500 chars) on
   JSONDecodeError. Without this the only diagnostic was 'char 0',
   indistinguishable across all non-JSON failure modes.

2. task_parser: promote a directory kernel_url to a kernel file inside
   it (kernel_name.<ext> -> kernel.{py,hip,cu,flydsl} -> single matching
   file by kernel_type). Users frequently say 'the kernel is in <DIR>'
   and the LLM extractor echoes the directory verbatim. Also tighten
   PARSE_TASK_INFO prompt so the LLM is less likely to do this.

3. preprocessor: filter automated_test_discovery results to repo_root
   so harnesses from sibling kernel directories
   (e.g. .../L3/fused_rms_fp8/test_kernel_harness.py while optimizing
   .../L3/gemm_a16wfp4/kernel.py) are dropped before reaching UTA.

+14 tests. 578 passed total.

Made-with: Cursor
When the LLM bails on JSON ('let me check the directory...') we lost the
'quick mode' cue and silently fell back to YAML's 'full' default,
producing 5 rounds instead of 2.

- task_parser._infer_mode_from_text(): regex backstop covering 'quick
  mode', '1 hour', '--mode full', etc. Fires only when LLM left mode
  as None; LLM result wins on conflict.
- JSON_EXTRACTION_SYSTEM_PROMPT: explicit 'never return prose, never
  investigate the filesystem, guess if uncertain' so the LLM stops
  treating ambiguous prompts as research tasks.
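
A sketch of such a regex backstop. The patterns below are assumptions covering only the cues named in this commit message plus the 1h/2h budgets; the real `_infer_mode_from_text` may recognise more phrasings.

```python
import re

# Assumed cue patterns, for illustration only.
_QUICK_RE = re.compile(
    r"--mode[ =]quick\b|\bquick\s+mode\b|\b1\s*(?:h|hr|hour)\b", re.IGNORECASE
)
_FULL_RE = re.compile(
    r"--mode[ =]full\b|\bfull\s+mode\b|\b2\s*(?:h|hrs|hours)\b", re.IGNORECASE
)


def infer_mode_from_text(text, llm_mode=None):
    """Return 'quick', 'full', or None; an LLM-provided mode always wins."""
    if llm_mode is not None:
        return llm_mode
    quick = bool(_QUICK_RE.search(text))
    full = bool(_FULL_RE.search(text))
    if quick == full:
        return None  # no cue, or conflicting cues
    return "quick" if quick else "full"
```

Returning `None` on conflicting cues leaves the decision to the configured default instead of guessing.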

+13 tests. 591 passed total.

Made-with: Cursor
…on failure

Guarantees every preprocessor-produced harness exposes a `--iterations N`
argparse flag and refuses to silently progress past contract failures.

- C-like wrapper and Triton-split harnesses now inject `--iterations`
  (with `GEAK_BENCHMARK_ITERATIONS` env fallback); UTA prompt requires it.
- `REQUIRED_HARNESS_FLAGS` enforces `--iterations` so missing-flag harnesses
  fail static validation and trigger UTA's regenerate loop.
- Preprocessor raises `PreprocessAborted` when no harness mode produced a
  baseline (escape hatch: `GEAK_ALLOW_BROKEN_HARNESS=1`).
- New `CommandmentExecutionError` raised from `run_correctness_and_benchmark`
  / `run_profile` on subprocess failure or contract-broken stderr signatures
  (`unrecognized arguments`, `Harness file not found`, etc.); kernel-level
  correctness failures keep the legacy `correctness_failed` round status.
- New `preflight_commandment_contract` smoke-tests SETUP+CORRECTNESS once
  with `--iterations 1` before sub-agent fan-out; opt-out via
  `GEAK_SKIP_COMMANDMENT_PREFLIGHT=1`.

Made-with: Cursor
…workers

When SoftStop fires while sub-agents are mid-LLM-call (which doesn't
observe _soft_stop), the dispatcher's 'with ThreadPoolExecutor as ex:'
blocks at exit on shutdown(wait=True), preventing the orchestrator from
reaching its deadline-finalize path and forcing the hard-kill watchdog
to os._exit(124) without writing final_report.json.

Replace the with-block with manual try/finally in run_pool,
run_parallel_heterogeneous, and ParallelAgent.run_parallel (homogeneous
branch). On SoftStop call shutdown(wait=False, cancel_futures=True) so
the dispatcher returns immediately, letting the orchestrator finalize
within finalize_grace_s instead of being forcibly killed.
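
The detach-on-SoftStop pattern could be sketched like this (requires Python 3.9+ for `cancel_futures`). This `run_pool` is a simplified illustration with an assumed signature, not the function in parallel_helpers.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_pool(tasks, soft_stop, max_workers=2, poll_s=0.05):
    """Run tasks; when soft_stop is set, detach instead of draining the pool."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(task) for task in tasks]
    detached = False
    try:
        while not all(f.done() for f in futures):
            if soft_stop.is_set():
                detached = True
                break
            time.sleep(poll_s)
    finally:
        # wait=False returns immediately; cancel_futures drops queued work.
        executor.shutdown(wait=not detached, cancel_futures=detached)
    return [f.result() for f in futures if f.done() and not f.cancelled()]
```

Tasks that are already running are not interrupted by the detach; they keep running in their worker threads, so something else still has to bound them.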

+3 tests verifying detach is fast (<2s) and normal drain still works.
605 passed total.

Co-authored-by: Cursor <cursoragent@cursor.com>
The Lint & Format workflow's `ruff format --check src/` step has been
failing on this branch since e56b227. Re-running `ruff format src/`
in-place reformats four files that all share two mechanical patterns:

- `src/minisweagent/run/postprocess/evaluation.py`: collapse multi-line
  `raise CommandmentExecutionError(...)` calls and one
  implicit-string-concat `logger.error(...)` back onto a single line
  that fits the 120-col line limit.
- `src/minisweagent/run/preprocess/harness_utils.py`: rewrite the
  `_GEAK_ITERATIONS_SHIM` heredoc from `'''...'''` to `"""..."""` per
  pyproject.toml's `[tool.ruff.format] quote-style = "double"`.
- `src/minisweagent/run/preprocess/preprocessor.py`: collapse one
  implicit-string-concat `logger.error(...)`.
- `src/minisweagent/run/utils/task_parser.py`: collapse two
  implicit-string-concat `logger.info(...)` calls.

Formatter-only; no semantic change. `ruff format --check src/` and
`ruff check src/` both clean post-change. Verified with the same
`ruff 0.15.2` CI installs via `uv pip install ruff` (unpinned).

Made-with: Cursor
Deprecates geak-orchestrate and addresses must-fix and select should-fix
items from the PR #205 deep review.

Deprecation:
- Remove geak-orchestrate console script and standalone CLI from
  run/orchestrator.py. Its --mode total_s semantics diverged from geak
  (preprocess elapsed always assumed 0, so --mode quick gave a fresh 1h
  optimization budget instead of the 1h preprocess+optimization budget
  the same flag gives via geak). run_orchestrator() and
  _probe_preprocess_dir() are kept for in-process callers and tests.

Must-fix bugs:
- tool_dispatch_tasks: remove dead inner loop in the improvement-skip
  block; collapse the task_paths[:1] over-iteration into a single next().
- _stage_found_improvement: replace __import__() trick with a normal
  top-level import and log JSON parse failures at WARNING.
- _compute_verified_speedup: use 'is None' / '<= 0' so a real 0.0 candidate
  latency is rejected as a broken measurement instead of looking
  identical to "couldn't parse"; set failure_reason and propagate it
  through RoundEvaluation.full_benchmark.failure_reason.
- _promote_kernel_url_dir_to_file: cap iterdir() at 32 entries to avoid
  walking large/shared directories on stale paths; refuse to promote
  unknown kernel_type without a name hint.
- mini.py SIGINT handler: restore the original handler before calling
  terminate_all() so a third Ctrl-C lands on the default handler instead
  of recursing back into the SIGINT handler mid-escalation.
- PreprocessState: add set_stage(stage) helper that bundles "advance +
  raise on hard_fail"; replace the four direct state.current_stage =
  assignments in run_preprocessor that duplicated the guard.
- tool_collect_results: actually short-circuit on SoftStop -- narrow the
  scan to a single-round walk instead of the full cross-round summary.
- default.py _setup_save_and_test_context: document the idempotent
  re-init coupling that parallel/heterogeneous helpers rely on.

Should-fix:
- Deadline.cap() returns 0.0 once SoftStop is set so callers using cap()
  to size new subprocess timeouts refuse new long-running work without
  needing a separate soft_stop.is_set() poll.
- _filter_discovery_to_repo_root: also catch RecursionError from
  Path.resolve() (NFS/macOS symlink loops).
- mini.py hard-kill watchdog: best-effort write a stub final_report.json
  ({status: hard_kill, exit_code: 124, elapsed_s, reason}) before
  os._exit(124) so the operator has something on disk.
- apply_mode_presets: log the actual config delta (+key=val,
  key: before->after) instead of walking the preset tree, which
  misrepresented dict-replaces-scalar merges.
- validate_harness: strip Python comments via tokenize before checking
  required-flag presence so "# --iterations N not yet supported" doesn't
  satisfy the validator. Mirror the helper between the duplicated copies
  in pipeline_helpers and preprocess/harness_utils.

Tests:
- New test for Deadline.cap() returning 0 under SoftStop.
- New test that comment-only --iterations is rejected by validate_harness.

Full unit test suite: 607 passed (was 605), 39 skipped, 2 xfailed.
ruff check / ruff format --check on src/ clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
…strip helper

Pylint flagged ``src/minisweagent/run/preprocess/harness_utils.py:1144:
E1101: Module 'tokenize' has no 'TokenizeError' member (no-member)``.

The exception class is ``tokenize.TokenError`` (singular, no "ize"); the
typo would have raised ``AttributeError`` if the ``except`` clause ever
fired on a tokenize failure. Tests still pass because the source we feed
``_strip_python_comments`` is well-formed; pylint catches what runtime
coverage didn't.
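
For reference, a minimal sketch of a tokenize-based comment stripper that catches the correctly named exception. The fallback of returning the source unchanged on a tokenize failure is an assumption about the helper's behaviour.

```python
import io
import tokenize


def strip_python_comments(source):
    """Return source without comment tokens; return it unchanged on error."""
    try:
        tokens = [
            tok
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)
            if tok.type != tokenize.COMMENT
        ]
        return tokenize.untokenize(tokens)
    except (tokenize.TokenError, SyntaxError):
        # tokenize.TokenError is the real class; there is no TokenizeError.
        return source
```

Because string literals are separate tokens, a flag name inside `add_argument("--profile")` survives while the same text in a comment is dropped.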

Co-authored-by: Cursor <cursoragent@cursor.com>
A harness that doesn't declare --iterations is no longer a hard-fail at
validation. GEAK warns once, never rewrites the harness, and silently
strips --iterations N from any subprocess invocation of that harness so
argparse doesn't crash with "unrecognized arguments". Iteration counts
still flow via GEAK_BENCHMARK_ITERATIONS for harnesses that read the
env var; harnesses that don't run at their hardcoded default. This
replaces the old "your harness must declare --iterations or set
GEAK_ALLOW_BROKEN_HARNESS=1" UX with "we'll route around it".

Mechanism:

- New ``harness_supports_iterations(path)`` detector in
  ``run/preprocess/harness_utils.py`` (memoized via lru_cache, strips
  comments before checking, returns False on unreadable paths). Paired
  with ``reset_harness_support_cache()`` and a small
  ``_strip_iterations_tokens`` argv-string scrubber.
- ``REQUIRED_HARNESS_FLAGS`` keeps the four mode flags (--profile,
  --correctness, --benchmark, --full-benchmark); a new
  ``RECOMMENDED_HARNESS_FLAGS = ("--iterations",)`` carries the
  downgraded contract. Both copies of ``validate_harness`` (preprocess +
  run/) emit a WARNING on missing --iterations and return valid=True
  with the warning surfaced in the messages list.
- Three EXTRA_ARGS construction sites are gated on
  ``harness_supports_iterations(harness_path)``:
  ``build_eval_env`` (postprocess/evaluation.py),
  ``preflight_commandment_contract`` (postprocess/evaluation.py),
  and ``run_task_batch`` (run/dispatch.py). All three now also seed
  ``GEAK_BENCHMARK_ITERATIONS`` as the canonical fallback channel.
- Belt-and-suspenders gate in ``run_harness._run_single``: even if a
  future construction site forgets to gate, we scrub --iterations N
  out of EXTRA_ARGS before extending argv when the harness lacks the
  flag.
- ``_strip_python_comments`` becomes single-source in
  ``preprocess/harness_utils.py``; ``run/pipeline_helpers.py`` imports
  it instead of duplicating (closes review nit 4.1, partial).
- UnitTestAgent prompt softened from MANDATORY to STRONGLY RECOMMENDED
  while keeping the wire-both-channels example so generated harnesses
  stay clean.
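
The argv-string scrubber could be sketched as follows. `strip_iterations_tokens` here is a hypothetical reimplementation; handling the `--iterations=N` form is an assumption beyond what the commit message states.

```python
import shlex


def strip_iterations_tokens(extra_args):
    """Drop `--iterations N` and `--iterations=N` from an argv string."""
    kept = []
    skip_next = False
    for tok in shlex.split(extra_args):
        if skip_next:
            skip_next = False
            continue
        if tok == "--iterations":
            skip_next = True  # also drop the value that follows
            continue
        if tok.startswith("--iterations="):
            continue
        kept.append(tok)
    return shlex.join(kept)
```

All other tokens pass through in their original order, so a harness without the flag still receives its mode flag and remaining arguments.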

Tests:

- 5 tests for ``harness_supports_iterations`` (declared / absent /
  comment-only / unreadable / cache-invalidation).
- 3 tests for ``_strip_iterations_tokens``.
- 2 tests for ``build_eval_env`` gating.
- 3 end-to-end tests via ``run_harness._run_single`` with a real stub
  harness that echoes its argv (passes --iterations when supported,
  strips when not, preserves other tokens).
- ``validate_harness`` tests rewritten: warns-but-passes when missing
  --iterations / when only in a comment; new
  ``test_rejects_missing_required_flag`` locks in that the four mode
  flags are still required.

Verification: 621 unit tests pass (was 607). ruff check, ruff format
--check, and pylint --errors-only all clean on src/.

Co-authored-by: Cursor <cursoragent@cursor.com>
feat(budget): add --mode quick|full wall-clock budget for runs
feat(tools): add translation tool profile for TranslationAgent
Consolidate flydsl-optimization, flydsl-debug-kernel, and
flydsl-tile-programming into a single skills/flydsl/ skill following
the pytorch2flydsl-translation pattern (summary SKILL.md + docs/).

skills/flydsl/
  SKILL.md - unified summary covering the full kernel lifecycle:
    write (tile programming), optimize (performance), debug (correctness)
  docs/
    flydsl_optimization.md      - optimization workflow and strategies
    flydsl_debug_kernel.md      - correctness debugging (NaN, zeros, mismatch, compilation, hangs)
    flydsl_tile_programming.md  - tile programming guide (skeletons, compute, LDS, MFMA)

The previous skills/flydsl-optimization/ is absorbed into this unified
structure. Tests verify the new layout.
…calls

Previously only AmdClaudeModel and AmdGeminiModel sent a "user" request
header, and the value resolved to "unknown" inside the Docker container
because os.getlogin() raises under `docker exec` and $USER was not
forwarded. As a result the gateway could not attribute most requests
back to the originating host user.

- Add a module-level `get_amd_llm_user()` helper in `amd_base.py` that
  prefers `$GEAK_USER`, then `$USER`, then `os.getlogin()`, falling back
  to "unknown". `_get_user` now delegates to it.
- Forward `-e USER` and `-e GEAK_USER` from the host in
  `scripts/run-docker.sh` (existing containers must be `--rebuild`ed).
- Send the `"user"` header from `AmdOpenAIModel`, the `LitellmModel`
  completion path (via `extra_headers`, preserving any explicit
  override), and the standalone test-discovery MCP server.
- Add unit tests for the resolver and for the header construction in
  each backend (Claude, OpenAI, Gemini via importorskip, LiteLLM).
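
The resolution order can be sketched as below. `resolve_llm_user` is a hypothetical name; the optional `environ` parameter is added only to make the sketch easy to test.

```python
import os


def resolve_llm_user(environ=None):
    """Prefer $GEAK_USER, then $USER, then os.getlogin(), else 'unknown'."""
    env = os.environ if environ is None else environ
    for key in ("GEAK_USER", "USER"):
        value = env.get(key)
        if value:
            return value
    try:
        return os.getlogin()
    except OSError:
        # No controlling terminal, e.g. under `docker exec`.
        return "unknown"
```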

Co-authored-by: Cursor <cursoragent@cursor.com>
Compliance with AMD's guidelines on user request with LLM calls
Port the refactor-test pipeline foundation and subagent-framework registry onto current main as one integrated branch. This keeps existing budget/runtime safeguards while routing through the unified pipeline, YAML subagent registry, language bundles, preprocessing phases, dispatch plan, and renamed pipeline workers.

Co-authored-by: Cursor <cursoragent@cursor.com>
Preserve registry agent_name through dispatch, make the subagent CLI executable as a module, and keep harness-only preprocessing from falling back into the legacy monolith.

Co-authored-by: Cursor <cursoragent@cursor.com>
support multi-gpu usage and minor fixes
yueliu14 and others added 27 commits May 20, 2026 05:36
Refactor mini CLI, add kernel auto-discovery and budget-timeout patch selection
…acts

Replace `git add -A` with targeted staging of exactly the files the patch
touches (parsed from diff headers after artifact stripping). Falls back to
`git add -u` if the patch can't be parsed.

Prevents untracked runtime artifacts (run.sh, JIT caches, flydsl_cache/)
from being accidentally committed alongside the patch, while still
handling new files created by translation (e.g. Triton → HIP).
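
A sketch of extracting the touched paths from diff headers, assuming git-style `diff --git a/... b/...` lines. The function name is hypothetical, and paths containing the literal text ` b/` would need more careful parsing.

```python
import re

_DIFF_HEADER_RE = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def paths_touched_by_patch(patch_text):
    """Return the post-image paths named in the patch's diff headers."""
    seen = []
    for match in _DIFF_HEADER_RE.finditer(patch_text):
        path = match.group(2)
        if path not in seen:
            seen.append(path)
    return seen
```

The caller would stage exactly these paths with `git add -- <paths>` and fall back to `git add -u` when the list comes back empty.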

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 65334f6)
Two bugs in the Path-A short-circuit caused broken evaluation feedback:

1. When the LLM put all modes in modes_covered (instead of
   inferred_modes), the Benchmark/Profile/Full Benchmark sections
   got the raw --correctness command with no flag substitution.
   Fix: _substitute_mode_flag() deterministically replaces any
   mismatched harness flag regardless of mode categorization.

2. finish_preprocess unconditionally cleared harness_path on Path A,
   so between-rounds Metrix profiling was always skipped.
   Fix: _extract_harness_from_command() recovers the harness path
   from the user's command when it contains a standard harness flag,
   preserving it for the evaluation phase.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 0d49d20)
…n.sh wrapping

The Path-A COMMANDMENT had two issues: (1) the orchestrator LLM wasn't
told the harness supports all four modes, causing it to miscategorize
modes and produce wrong flags, and (2) commands used bare `cd && python`
instead of `${GEAK_WORK_DIR}/run.sh` wrapping, skipping env setup.

- Enrich harness hint in adapter.py to tell the LLM the harness is
  pre-validated with all four standard CLI modes
- Add Case A exception in orchestrator system prompt for pre-validated
  harnesses
- Fix run.sh body to include cd + exec python3 (matching Path-B)
- Use run.sh wrapping for promoted harness commands
- Revert unsafe flag-append fallback in _substitute_mode_flag

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 812ecf8)
…ent_baseline

preflight_commandment_contract now runs in a disposable git worktree so
SETUP side-effects (run.sh, JIT caches) never dirty the original repo.
recapture_commandment_baseline calls removed — the preprocessor baseline
is the single source of truth.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 7e39b61)
…file collection

Three fixes for the verified-speedup evaluation pipeline:

1. Path A now calls collect_baseline and collect_profile when a standard
   harness is available — previously skipped entirely, breaking downstream
   verified-speedup computation.

2. The v3 adapter writes benchmark_baseline.txt and full_benchmark_baseline.txt
   from BaselineMetrics.raw_outputs — previously hardcoded to None.

3. Full-benchmark baseline uses --full-benchmark stdout (via
   capture_full_benchmark_stdout) so the config set matches the
   postprocess evaluator's FULL_BENCHMARK run.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit 20fa341)
Path A's commandment_from_user_command generated a bare `--profile`
flag substitution, missing the warmup + kernel-profile wrapper that
Path B uses. Without the wrapper, profile.json is never written and
post-round evaluation cannot access hardware counter data.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
(cherry picked from commit c17ac40)
…ernel_url drops the extension

parse_task_info's LLM extractor occasionally returns a bare basename
(e.g. "silu" for a prompt mentioning "silu.hip"). The existence-check
clears kernel_url when the bare name doesn't resolve on disk, but
kernel_name (also bare) leaks back through mini.py's
`kernel_target = ... or parsed_config.get("kernel_name")` fallback,
re-entering _resolve_kernel_and_repo as a bare-name kernel_url.

Without this fix, _resolve_kernel_and_repo's repo-relative candidate
(repo/kernel_url) is not a file, and the legacy URL resolver also
doesn't try extensions, so the run dies with
"resolve-kernel-url failed: Kernel file not found: <repo>/<bare>".

When the candidate has no suffix, probe each extension from
_KERNEL_TYPE_TO_EXT under the repo and promote if exactly one
matches. Refuse to guess when multiple match.
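
The promote-only-if-unambiguous rule might look like this. The extension tuple is an illustrative subset, since the contents of `_KERNEL_TYPE_TO_EXT` are not shown here, and the function name is hypothetical.

```python
from pathlib import Path

# Illustrative subset; the real mapping is not shown on this page.
KERNEL_EXTS = (".py", ".hip", ".cu")


def promote_bare_kernel_name(repo, name, exts=KERNEL_EXTS):
    """Resolve a suffix-less kernel name to a file, or None if ambiguous."""
    candidate = Path(repo) / name
    if candidate.is_file():
        return candidate
    if candidate.suffix:
        return None  # an extension was given but the file does not exist
    matches = [
        path
        for path in (candidate.with_name(candidate.name + ext) for ext in exts)
        if path.is_file()
    ]
    # Promote only on exactly one hit; refuse to guess otherwise.
    return matches[0] if len(matches) == 1 else None
```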

Co-authored-by: Cursor <cursoragent@cursor.com>
…r from LLM-hallucinated argument names

Observed failure: on a silu run, the orchestrator LLM called
`commandment_from_user_command` six times with hallucinated keyword
names (`user_command`, `command`, `cmd`, `raw_command`, `harness_command`,
`kernel_path`) before giving up, burning ~28 minutes of preprocess
budget. The tool's TypeError reply included only `<type>: <message>`,
which named the *bad* argument but never the correct one.

Three changes:

1. tools.py — `_schema_commandment_from_user_command`: add a STRICT
   ARGUMENT NAMING block to both the tool description and the per-property
   descriptions, explicitly listing the synonyms the LLM was inventing
   and naming the canonical arg (`run_command`, `out_path`).

2. orchestrator.py Case A prompt: add a fenced example of the exact
   keyword-arg signature with the same do-NOT list.

3. orchestrator.py `_dispatch_tool`: on any tool exception, return a
   structured error containing the canonical schema's
   `expected_arguments` and `required_arguments`, the names actually
   passed, and the traceback tail. For TypeError specifically, add an
   explicit hint reminding the LLM that the schema is authoritative.
   This gives the LLM enough signal to self-correct on the next turn
   instead of cycling through more synonyms.

Co-authored-by: Cursor <cursoragent@cursor.com>
Fix tool unavailable error, set default pipeline mode, edit readme
…th A

Brings the legacy AKA `batch_test_hip_kernel.sh` workflow back to working
on v3 by surfacing existing legacy modules at v3 call sites. No code
duplicated; only call sites added.

1) Shell-contract harness synthesis (`tools._try_synthesize_shell_contract_harness`)
   When the user's run_command is a compound shell pipeline (e.g.
   `python3 scripts/task_runner.py compile && correctness && performance`)
   without any GEAK harness flag, mirror legacy
   `resolve_shell_eval_commands` (rsplit on last &&) and call
   `eval_contract_adapter.materialize_shell_contract_harness` to write
   `_geak_shell_contract_harness.py` exposing the standard 4-mode CLI.
   This unblocks the legacy AKA prompt that previously died with
   "v3 preprocess failed: No harness_path available".

2) Static `validate_harness` gate at two call sites
   - `commandment_from_user_command`: validate user-supplied or synthesized
     harness; reject malformed paths so finish_preprocess doesn't silently
     thread bogus paths downstream.
   - `adapter._recover_harness_path`: validate the path picked by
     legacy `extract_harness_path` so a greedy match on `task_runner.py`
     is rejected instead of breaking profile/benchmark.

3) Correctness gate before baseline (`baseline._CORRECTNESS_GATE_TIMEOUT_S`)
   `collect_baseline_metrics` now runs `--correctness` once with a short
   timeout (default 120s, override via GEAK_CORRECTNESS_GATE_TIMEOUT) before
   the expensive benchmark loop. Broken kernels fail in seconds instead of
   minutes. Bypass via GEAK_SKIP_CORRECTNESS_GATE=1.

4) Compile-command extraction for synthesized harness
   `_try_synthesize_shell_contract_harness` calls legacy
   `contract_normalize.infer_compile_command_from_eval` to extract the
   build prefix and re-prepend it to the performance shell so a standalone
   `--benchmark` invocation rebuilds when needed.

5) `build_baseline_metrics` enrichment in `_project_baseline`
   When a profile result is also available, project legacy
   `build_baseline_metrics(include_all=True)` keys (`bottleneck`,
   `top_kernels`, `kernel_name`, `kernel_names`, `metrics`, `observations`)
   into the baseline_metrics dict. Restores fields that
   `inject_pipeline_context` consumes downstream which were silently empty
   on v3.

All of the legacy modules involved (`eval_contract_adapter`, `harness_utils`,
`contract_normalize`, `baseline.build_baseline_metrics`) are already in
tree; v3 just wasn't calling them.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e1eadad242c52b727ddd6a662dd75b789e7f39f)
add auto finalize before hard kill
- Remove unused imports (F401) in selector.py, writer.py, unified.py
- Fix import sorting (I001) in evaluation.py, selector.py
- Replace unnecessary key check with dict.get (RUF019) in adapter.py
- Remove quoted type annotation (UP037) and unused variable (F841) in unified.py
- Fix test_legacy_context_recovers_harness_path_from_promoted_command:
  add missing success/full_benchmark_stdout attrs, valid harness content,
  and update assertions to match current path-based baseline API
- Apply ruff format to all PR-affected files

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
fix the tools call list in litellm
Collaborator

@sdubagun-amd sdubagun-amd left a comment

Here are some comments.

Collaborator

Could you remove and rotate these keys immediately?

# Budget
max_cost: float = 0.50
max_steps: int = 100
max_cost: float = 0.0
Collaborator

Would this affect working memory in any way? Why are the costs being set to 0?

parallel-agent flow doesn't differ on HOME and the bug we're fixing
is specifically about ``GEAK_*``.
"""
expanded = _expand_env_vars(os.path.expanduser(tok), extra_env)
Collaborator

Could you please address the CI issues?

# ``sys.path.insert(0, "/sgl-workspace/sglang/python")`` pinned every run
# to the baseline sglang checkout.
_HARDCODED_SYSPATH_RE = re.compile(
r"""sys\.path\.insert\(\s*\d+\s*,\s*(['"])(?P<path>/[^'"]*)\1\s*\)"""
Collaborator

sys.path.append or extend wouldn't work with this.

@chao-xu-spec chao-xu-spec force-pushed the fix/harness_compile branch from 5fae218 to ff1355b Compare May 29, 2026 12:32
10 participants