
Symlink gitignored build artifacts into worktrees#192

Open
fxmarty-amd wants to merge 445 commits into
AMD-AGI:mainfrom
fxmarty-amd:felmarty/symlink-gitignored-files-in-worktree

Conversation

@fxmarty-amd
Contributor

Motivation

Worktrees created by git worktree add don't include gitignored files. For projects like vllm that have compiled
extensions (.so), generated version files (_version.py), and build artifacts, this means the worktree can't run
without a full rebuild. Symlinking these files from the original repo avoids that cost.

For example, I was getting errors such as:

WARNING 04-23 16:39:18 [rocm.py:43] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")

and others caused by missing objects, and GEAK would then search for them endlessly and unnecessarily.

Testing

python -m pytest tests/run/test_worktree_symlink.py -v

8 passed — covers .so symlinking, _version.py symlinking, output-directory exclusion, existing-file preservation, and log
output.
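
As a rough illustration of the behaviour those tests cover, here is a minimal sketch, not the code in this PR. The function names, the `exclude_dirs` parameter, and the log message are assumptions made for the example; `git ls-files --others --ignored --exclude-standard` is the standard way to list gitignored files.

```python
import logging
import os
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


def list_ignored_files(repo: Path) -> list[str]:
    """Return repo-relative paths of untracked files matched by .gitignore."""
    out = subprocess.run(
        ["git", "ls-files", "--others", "--ignored", "--exclude-standard", "-z"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.split("\0") if p]


def symlink_into_worktree(repo: Path, worktree: Path, rel_paths, exclude_dirs=()) -> list[Path]:
    """Symlink each listed file from repo into worktree; return the links created."""
    created = []
    for rel in rel_paths:
        rel_path = Path(rel)
        # Skip anything under an excluded directory (e.g. the tool's output dir).
        if any(rel_path.parts[: len(Path(d).parts)] == Path(d).parts for d in exclude_dirs):
            continue
        src, dst = repo / rel_path, worktree / rel_path
        # Never overwrite a file (or link) that already exists in the worktree.
        if dst.exists() or dst.is_symlink():
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(src.resolve(), dst)
        created.append(dst)
    logger.info("Symlinked %d gitignored file(s) into %s", len(created), worktree)
    return created
```

A caller would typically pass `list_ignored_files(repo)` as `rel_paths` right after `git worktree add`.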

AI assistance was used (Claude).

yueliu14 and others added 30 commits April 3, 2026 07:31
When --num-parallel >=2, multiple agent threads shared the same
module-level MCPToolBridge singletons. Since each bridge wraps a
single subprocess with one asyncio StreamReader, concurrent
readline() calls from different threads triggered:

  readuntil() called while another coroutine is already waiting
  for incoming data

Fix: each ToolRuntime now creates its own set of MCPToolBridge
instances via _create_own_bridges(), giving every parallel agent
its own subprocess and stdio pipes. The module-level bridges are
used only for schema discovery at import time and then shut down.
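
The ownership change can be pictured with a small sketch. `Bridge` is a hypothetical stand-in for MCPToolBridge (no subprocess is actually started) and the server names are invented; only `ToolRuntime` and `_create_own_bridges()` are names taken from the message above.

```python
import itertools


class Bridge:
    """Hypothetical stand-in for MCPToolBridge: one subprocess, one reader."""

    _ids = itertools.count()

    def __init__(self, server_name: str):
        self.server_name = server_name
        # Stands in for a dedicated subprocess and its stdio pipes.
        self.pipe_id = next(Bridge._ids)


SERVER_NAMES = ("profiler", "search")  # hypothetical server names


class ToolRuntime:
    def __init__(self):
        # Each runtime owns its bridges, so no two agent threads share a reader.
        self.bridges = self._create_own_bridges()

    @staticmethod
    def _create_own_bridges() -> dict[str, Bridge]:
        return {name: Bridge(name) for name in SERVER_NAMES}
```

Two `ToolRuntime()` instances built this way never hold the same `Bridge` object, which is the property that removes the concurrent `readline()` conflict.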

Fixes: AMD-AGI#100
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The explicit _shutdown_loop() calls at module init killed the event
loop, then the atexit handler tried to shut down the same bridges
again on a dead loop, causing the process to hang after tests.

Let atexit handle cleanup once, matching the behavior on main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n-cleanup

Fix/postprocess evaluation cleanup
…ract

Four correctness fixes to prevent agents from evaluating patches against
the wrong (original) code:

Issue 1 – kernel defs in harness: detect @triton.jit/@triton.autotune
decorated functions embedded inside a harness file and split them into
kernel_extracted.py, rewriting the harness to import from it. Called at
all three harness selection points (deterministic, discovery, UTA).

Issue 2 – relative imports bypass PYTHONPATH: _rewrite_relative_imports()
converts `from .. import foo` style imports to absolute imports anchored
at repo_root, so PYTHONPATH ordering (GEAK_WORK_DIR first) is respected.
Plugged into _rewrite_materialized_harness_source().

Issue 3 – COMMANDMENT.md hardcodes harness path: replace ${GEAK_HARNESS}
variable references in _generate_simple() and _generate_inner_kernel()
with the literal resolved harness path, so agents cannot accidentally
override it.

Issue 4 – null repo_root: add _infer_repo_root() that walks up from the
kernel file looking for .git/pyproject.toml/setup.py markers. Used as
fallback when resolve_kernel_url returns no local_repo_path (local file
path specs). Hard assertion ensures repo_root is never empty downstream.
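
A minimal sketch of the Issue 4 walk-up, under assumptions: the function name drops the leading underscore, `setup.cfg` is included as a fourth marker, and the nearest marked ancestor wins. The real `_infer_repo_root()` may apply a different precedence between markers.

```python
from pathlib import Path
from typing import Optional

# Marker files/directories that suggest "this directory is a repo root".
MARKERS = (".git", "pyproject.toml", "setup.py", "setup.cfg")


def infer_repo_root(kernel_file: Path) -> Optional[Path]:
    """Walk up from the kernel file and return the first directory holding a marker."""
    for directory in kernel_file.resolve().parents:
        if any((directory / marker).exists() for marker in MARKERS):
            return directory
    return None  # caller decides how to handle "no root found"
```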

Adds 11 unit tests covering all four fixes (no GPU/LLM required).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nel defs

Issue 1 implementation was inverted: the correct split is to pull test
functions out into a new test_<stem>_harness.py and leave the original
file as a clean kernel. The old code did the opposite (extracted @triton.jit
defs, rewrote harness to import them), which would break patch evaluation.

New detect_and_split_kernel_from_harness algorithm:
- Find test-root seeds: run_*/test_* names, pytest/unittest decorators,
  GEAK CLI flag usage (--correctness etc.), functions called from __main__
- BFS from seeds collecting all reachable functions, skipping @triton.jit
- Strip collected test functions + __main__ block from original file
- Write test_<stem>_harness.py with all imports, sys.path bootstrap,
  `from <stem> import *`, all test functions, and __main__ block

Update _ensure_harness_has_no_kernel_defs to match new return signature
(new_harness_path, kernel_path) and always set ctx["kernel_path"].

Update TestDetectAndSplitKernelFromHarness: 3 tests covering the
corrected split direction and BFS exclusion of @triton.jit functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously detect_and_split_kernel_from_harness was only called when a
file was being evaluated as a harness candidate. Merged files (like
naive_softmax.py) passed directly via --kernel-path were never split,
so agents saw and patched a file containing both kernel defs and test
infrastructure.

Now immediately after Step 1 (resolve-kernel-url), if the resolved
kernel file contains both @triton.jit defs and test roots, we split
it: test logic goes to test_<stem>_harness.py in output_dir, the
original becomes a clean kernel file. The split harness is also
surfaced as the harness hint for downstream discovery/UTA so they
build on it rather than starting from scratch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… hint

The pytest-extracted harness from detect_and_split_kernel_from_harness
lacks --correctness/--profile/--benchmark/--full-benchmark flags, so
setting it as the harness hint caused deterministic validation to crash.

Instead, let UTA consume the cleaned kernel and generate a proper GEAK
harness normally. The split still fires (kernel is cleaned), UTA gets
the right input, end-to-end smoke test confirms preprocessing completes
successfully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ract

Write clean kernel copy to output_dir instead of mutating the original
repo file. The geak framework uses git apply/revert to manage patches
against the repo; in-place modification of the kernel file breaks those
git operations and causes patches to fail in subsequent rounds.

Instead:
- Clean kernel (test logic stripped) is written to output_dir/<name>.py
- Original repo file is left untouched
- PYTHONPATH already includes GEAK_WORK_DIR (output_dir) first, so
  `from <stem> import *` in the harness resolves to the clean copy
- Agents patch the output_dir copy; git state of the repo is clean

Update tests to assert original file is untouched and clean kernel copy
lives in output_dir.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fn_map was dict[name -> single node], so duplicate function definitions
(e.g. two `def test_vecmat` in test_batched_vecmat.py) would only store
the last definition. The first definition was never added to the strip
set, leaving it in the clean kernel output.

Two fixes:
1. fn_map is now dict[name -> list[nodes]] — all definitions captured
2. Strip phase scans tree.body directly (not fn_map) to catch every
   definition whose name is in test_fns, regardless of duplicates

Validated against all 31 rocmbench kernel files: 31/31 clean splits,
no test functions leaking into kernels, originals untouched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue 4 (_infer_repo_root):
- Add test_finds_setup_py and test_finds_setup_cfg to verify all 4
  marker types are detected
- Add test_git_takes_precedence_over_inner_pyproject to verify walk-up
  handles nested marker files without crashing

Issue 2 (_rewrite_relative_imports):
- Add test_rewrites_same_package_relative_import (level=1, from .sibling)
- Add test_rewrites_multiple_relative_imports_in_one_file — multiple
  relative imports of different levels in one source
- Add test_rewritten_import_resolves_to_patched_not_original — the core
  correctness guarantee: absolute import resolves to the GEAK_WORK_DIR
  copy (prepended to PYTHONPATH) not the original repo file
- Retain existing absolute/outside-repo no-op tests

Issue 3 (COMMANDMENT hardcodes harness):
- Add test_all_four_sections_contain_literal_harness_path — each of
  CORRECTNESS, PROFILE, BENCHMARK, FULL_BENCHMARK must embed the
  literal absolute path and must not contain ${GEAK_HARNESS}
- Add test_harness_path_is_absolute_not_relative — path must start with
  / so agents can find it from any working directory

68 tests total, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l file path

When detect_and_split_kernel_from_harness writes the new harness to
output_dir, the harness is outside the repo — so a subsequent call to
_rewrite_relative_imports(new_harness, repo_root) would fail because it
cannot determine the package path from outside the repo tree.

Fix: collect and rewrite relative imports inside the split function
itself, while harness_path (the original file's location inside the repo)
is still available as the reference for computing the package hierarchy.
Walk up from harness_path to find repo_root independently so the split
function is self-contained.

Smoke tested with a synthetic package:
  myrepo/ops/kernels/naive_add.py  (merged, has from ..helpers import set_seed)
  myrepo/ops/helpers.py

After split:
  test_naive_add_harness.py -> from ops.helpers import set_seed, make_input
  naive_add.py              -> @triton.jit add_kernel (no test fns)
  original file             -> untouched (still has from ..helpers)
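
For reference, the package arithmetic behind that rewrite can be sketched like this. It is an assumption-laden illustration rather than the real `_rewrite_relative_imports`: the function name is simplified, and `ast.unparse` is used to emit the result, so comments and formatting are not preserved.

```python
import ast
from pathlib import Path


def rewrite_relative_imports(source: str, file_path: Path, repo_root: Path) -> str:
    """Turn `from ..x import y` into an absolute import anchored at repo_root."""
    # Package of the file, e.g. repo/ops/kernels/naive_add.py -> ["ops", "kernels"].
    package = list(file_path.resolve().relative_to(repo_root.resolve()).parent.parts)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.level > 0:
            # level=1 is the current package; each extra dot climbs one package.
            keep = len(package) - (node.level - 1)
            if keep < 0:
                continue  # import escapes the repo root; leave it unchanged
            parts = package[:keep] + ([node.module] if node.module else [])
            if not parts:
                continue  # nothing to anchor on (bare "from . import x" at the root)
            node.module, node.level = ".".join(parts), 0
    return ast.unparse(tree)
```

With the synthetic layout above, `from ..helpers import set_seed, make_input` in `ops/kernels/naive_add.py` becomes `from ops.helpers import set_seed, make_input`.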

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend the merged-file split path to generate a GEAK-compatible Python wrapper for HIP/CUDA harnesses. This lets real mixed-source HIP kernels run through correctness, profiling, baseline capture, and commandment generation without changing the existing preprocess contract.

Made-with: Cursor
- Skip swerex docker tests when docker binary is unavailable (container
  CI environment has no docker daemon)
- Fix test_env_var_fallback: GEAK_MODEL env var takes precedence over
  MSWEA_MODEL_NAME; explicitly unset it in the patch.dict context so
  the MSWEA_MODEL_NAME fallback path is correctly exercised

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_rewritten_import_resolves_to_patched_not_original was clearing any
module whose name contained "utils" from sys.modules. This inadvertently
evicted minisweagent.run.utils.* (including task_parser) from the module
cache. In xdist workers, subsequent tests that patched
minisweagent.run.utils.task_parser.datetime received a freshly re-imported
module instance, while the test's tp reference still pointed to the old
one — causing the patch to be silently ignored and datetime.now() to
return the real timestamp instead of the mock.

Fix: use an exact prefix match (key == "ops" or key.startswith("ops."))
so only the temporary ops package created by the test is evicted.
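
A sketch of the exact-prefix eviction, with a hypothetical helper name and an optional `modules` argument added so it can be exercised without touching the real `sys.modules`:

```python
import sys


def evict_package(package: str, modules=None) -> list[str]:
    """Remove `package` and its submodules from the module cache; return evicted keys."""
    modules = sys.modules if modules is None else modules
    # Match the package itself or dotted children only, never a substring.
    doomed = [key for key in modules if key == package or key.startswith(package + ".")]
    for key in doomed:
        del modules[key]
    return doomed
```

Called as `evict_package("ops")`, this leaves entries such as `minisweagent.run.utils.task_parser` untouched, unlike a substring test on "utils".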

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sor artifacts from patches

Two bugs caused degraded optimization runs when a relative output
directory was specified:

1. _derive_output_dir_and_traj did not resolve explicit output paths to
   absolute, causing COMMANDMENT and other artifact paths in task YAML to
   be relative. read_task_file then resolved them against the task file's
   directory instead of the workspace root, producing bogus paths. Sub-agents
   silently fell back to a raw test_command without SETUP/CORRECTNESS/BENCHMARK
   sections.

2. The preprocessor writes baseline_metrics.json and profile.json to the
   kernel repo root (introduced in 3178b58). These leaked into patches via
   git diff, causing "Failed to apply starting patch" in subsequent rounds
   when the files already existed in worktrees. Add both files to the
   generated-artifacts exclusion list so they are stripped from patches and
   excluded from diffs.

Made-with: Cursor
…egrity-main-20260401

Fix/preprocess harness integrity main 20260401
Two changes:
- tools_runtime: collect_mcp_tools() was called at module import time,
  spawning 3 MCP server subprocesses on every `import minisweagent`.
  This caused `geak --help` (and any sub-agent that probed the geak CLI)
  to hang indefinitely waiting for MCP handshakes.
  Fixed by deferring to _ensure_mcp_collected(), called lazily on the
  first ToolRuntime instantiation instead.

- mini.py: guard configure_if_first_time() with sys.stdin.isatty() so
  the interactive setup wizard does not block piped / non-TTY invocations
  that lack the MSWEA_CONFIGURED env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Align model tool filtering with mini.py disabled_tools: when
AmdLlmModelConfig.profiling is false, remove both the built-in
profiling tool and the MCP profile_kernel tool.

Centralize logic in filter_tools_for_amd_config for AmdLlmModelBase
(__init__ and set_tools) and reuse from LitellmModel.

Made-with: Cursor
parse_speedup_report only recognized the `Overall: Xx (Yms -> Zms)`
format from save_and_test. When the harness only prints
`GEAK_RESULT_LATENCY_MS=<number>`, candidate_ms was extracted but
overall_speedup and baseline_ms stayed null — breaking trajectory
tracking, strategy scoring, and dead-end detection in working memory.

Fix: record_result now reads the stored baseline from the notebook's
event log and passes it to parse_speedup_report, which computes
`overall_speedup = baseline_ms / candidate_ms` when the Overall line
is absent. Also fixes the GEAK_RESULT_LATENCY_MS regex to handle
scientific notation.
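
The fallback computation might look like the sketch below. The function name and the exact regex are assumptions; only the `GEAK_RESULT_LATENCY_MS=<number>` format and the `baseline_ms / candidate_ms` formula come from the message above.

```python
import re
from typing import Optional

# Accepts plain decimals and scientific notation such as 5e-1 or 1.2E+03.
_LATENCY_RE = re.compile(
    r"GEAK_RESULT_LATENCY_MS=([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
)


def speedup_from_latency(output: str, baseline_ms: Optional[float]) -> Optional[float]:
    """Compute baseline_ms / candidate_ms when only the latency line is printed."""
    match = _LATENCY_RE.search(output)
    if match is None or not baseline_ms:
        return None  # no candidate latency, or no stored baseline to compare with
    candidate_ms = float(match.group(1))
    if candidate_ms <= 0:
        return None  # avoid division by zero and nonsensical negatives
    return baseline_ms / candidate_ms
```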

Made-with: Cursor
Working memory baseline was set from the METRIX profiler's kernel-level
duration_us, not the harness's GEAK_RESULT_LATENCY_MS. When the harness
measures a different metric (e.g. full model forward pass vs single
kernel dispatch), speedup computation was wrong — agents saw "regression"
when they actually improved the target metric.

Extract load_baseline_from_artifacts() into WorkingMemory so both the
heterogeneous orchestrator and parallel_helpers use the same logic:
read baseline_metrics.json first (profiler), then override with
benchmark_baseline.txt (harness GEAK_RESULT_LATENCY_MS) if available.

Made-with: Cursor
Add missing blank line after inline import per ruff E303.

Made-with: Cursor
…ssing

benchmark_baseline.txt is not always written by the preprocessor (depends
on the code path). When it's absent, load_baseline_from_artifacts now
checks harness_results.json for the benchmark entry's GEAK_RESULT_LATENCY_MS.

This fixes the e2e harness scenario where the preprocessor writes harness
results but not the separate benchmark_baseline.txt file, causing working
memory to use the profiler's kernel-level baseline instead of the harness's
end-to-end measurement.

Priority chain:
1. benchmark_baseline.txt (GEAK_RESULT_LATENCY_MS)
2. harness_results.json benchmark entry (GEAK_RESULT_LATENCY_MS)
3. baseline_metrics.json (profiler duration_us)
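
The priority chain amounts to "first source that yields a value wins". A generic sketch, assuming each loader hides its own file parsing and unit conversion and returns milliseconds or `None` (the on-disk formats are not modelled, and `load_baseline` is a hypothetical name):

```python
from typing import Callable, Iterable, Optional, Tuple

Loader = Callable[[], Optional[float]]


def load_baseline(sources: Iterable[Tuple[str, Loader]]) -> Optional[Tuple[str, float]]:
    """Return (source_name, baseline_ms) from the first source that yields a value."""
    for name, loader in sources:
        value = loader()
        if value is not None:
            return name, value
    return None


# Made-up values for illustration: the first file is absent in this scenario.
chain = [
    ("benchmark_baseline.txt", lambda: None),
    ("harness_results.json", lambda: 12.5),
    ("baseline_metrics.json", lambda: 0.8),
]
```

Here `load_baseline(chain)` picks the `harness_results.json` value because the higher-priority source returned nothing.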

Made-with: Cursor
fix: prevent geak --help from hanging by deferring MCP server startup
Resolves pylint E0102 (function-redefined) from a merge that left two
set_tools definitions; keep the implementation that uses
filter_tools_for_amd_config.

Made-with: Cursor
fix: compute overall_speedup from stored baseline in working memory
iraj465 and others added 28 commits April 20, 2026 07:11
Adds memory/cross_session/*.json to [tool.setuptools.package-data] so
the KB ships with the wheel even if include-package-data is ever
disabled or the project moves to a build system without file-tree
auto-inclusion.

Current behavior: knowledge_base.json already ships because
include-package-data=true picks up git-tracked files (verified with
wheel builds both inside and outside git). This change is defensive
— makes the intent explicit and survives config changes.

Without it, a non-editable install into an environment where the KB
is not already present (e.g., pip install of a source tarball in a
non-docker/non-editable setup) would leave _seed_from_knowledge_base
in backends/local.py silently no-op because Path(__file__).parent.parent
/ knowledge_base.json would not exist in site-packages.

Made-with: Cursor
…-fallback

fix(postprocess): fallback to 3-way apply when eval worktree lacks parent blob
…er + raised budgets

Root cause analysis (with 100% hard evidence from peak vs current logs) showed
that the 3 L3 kernels regressed because:

1. Retrieval scoring PENALIZED same-kernel KB entries whose stored bottleneck
   (e.g. "memory" from prior run's classifier) differed from runtime's current
   bottleneck ("latency"). The -0.15 penalty + success_boost caused cross-kernel
   2.46x entries to outrank byte-identical same-kernel 2.23x entries.

   Peak retrieval log:  top=fused_rms_fp8(0.706)
   Current (pre-fix):   top=fused_mxfp4_quant_moe_sort(0.721)

2. Context budget was slashed 60K -> 20K, per-patch 8K -> 4K. The 2.23x
   fused_rms_fp8 winning patch is 38KB -- it cannot fit in today's budget
   even if retrieval ranks it #1, so the agent cannot read the code to
   reproduce.

3. FIRST-MOVE / EXACT-CODE-MATCH banner was removed in favor of purely
   "informed cross-reference" framing. The agent had no prior signal telling
   it a verbatim-applicable patch was in the context, so it defaulted to
   authoring minimal own-changes (R1=1.00x every round today vs R2=1.99x in
   peak runs).

Fixes:

Retriever (`retriever.py`):
  * New `_code_similarity(target, kb)`: whitespace-normalized line-set
    Jaccard with a sha256 short-circuit for byte-identical detection.
  * `_stage2_code_similarity` replaces `_stage2_text_similarity`. Scoring:
      total = code_sim * 1.0 + scaled_speedup_boost (cap 0.30) + stem_boost (cap 0.10)
    Byte-identical entries always outrank non-identical (1.0 + 0 > 0.99 + 0.40).
    Tie-break by `best_speedup` ensures 2.23x fused_rms_fp8 surfaces above
    1.80x fused_rms_fp8 when all share code_sim=1.0.
  * Removed: category boost, bottleneck match +/-0.25 (the -0.15 penalty
    was the primary regression trigger), language boost, diversity penalty,
    text-based `_build_query_terms`/`_experience_text`/`_text_similarity`
    /`_extract_source_terms`/`_NOISE_WORDS`. ~230 lines of dead scoring
    machinery deleted.
  * Relevance gate simplified: emit context iff best code_sim-based score
    >= 0.02. Prevents unrelated entries from being shown when none apply.

Formatter (`formatter.py`):
  * `_MAX_CONTEXT_FULL`:        20_000 -> 40_000
  * `_MAX_CONTEXT_COMPACT`:      4_000 ->  8_000
  * `_MAX_BEST_PATCH_CHARS`:     4_000 ->  8_000   (fits full winning patch)
  * `_TOP_IMPROVED_STRATEGIES`:      3 ->      5   (more working patterns visible)
  * `_TOP_REGRESSED_STRATEGIES`:     2 ->      3   (more anti-patterns to avoid)
  * `_MAX_REGRESSION_PATCH_CHARS`: 1_500 -> 2_000
  * `_MAX_BASELINE_BENCHMARK_CHARS`: 1_500 -> 2_000
  * New `_find_exact_code_match(experiences, target_code)` picks highest-speedup
    byte-identical entry (whitespace-normalized).
  * New `_build_exact_match_banner(exp)` emits a targeted EXACT-CODE-MATCH
    banner at the top of the injected context. Banner fires ONLY when at
    least one retrieved entry is byte-identical; silent for cross-kernel
    transfers (zero false positives, zero risk of directing agent to
    copy an inapplicable patch).
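
For readers unfamiliar with the metric, the `_code_similarity` idea from the retriever notes (whitespace-normalized line-set Jaccard with a sha256 short-circuit) can be sketched as below. This is an illustrative reimplementation with assumed names, not the code in `retriever.py`.

```python
import hashlib


def _normalized_lines(code: str) -> set[str]:
    """Collapse runs of whitespace per line and drop blank lines."""
    return {" ".join(line.split()) for line in code.splitlines() if line.strip()}


def code_similarity(target: str, kb_code: str) -> float:
    """Jaccard overlap of normalized line sets, in [0.0, 1.0]."""
    if not target or not kb_code:
        return 0.0
    # Short-circuit: identical bytes hash identically, so skip the set work.
    if hashlib.sha256(target.encode()).digest() == hashlib.sha256(kb_code.encode()).digest():
        return 1.0
    a, b = _normalized_lines(target), _normalized_lines(kb_code)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two sources that differ only in spacing still score 1.0 through the Jaccard path, while an empty target scores 0.0 against everything.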

Simulation validation (per-kernel) on the current KB of 23 entries:

  fused_rms_fp8: target matches 11 KB entries byte-for-byte; top-8 are all
    same-kernel sorted by speedup (2.23x, 2.17x, 1.80x, 1.77x, ...).
    EXACT-CODE-MATCH banner fires naming the 2.23x patch.

  gemm_a16wfp4:  no same-kernel entry exists; top is gemm_a16w16_atomic
    (3.92x, stem-matched, line-level code_sim=0.131) as expected cross-
    kernel transfer seed. Banner does not fire.

  llama_ff_triton: exact match on our own 5.24x entry; banner fires;
    subsequent cross-kernel entries (gemm_a16w16_atomic, knn, three_nn)
    follow as expected secondary options.

Made-with: Cursor
… is sufficient

The banner was redundant. Every KB entry already emits a ``Code
fingerprint`` line (``Nbytes, sha256=...``) and the top of the context
shows ``Your kernel fingerprint`` for the current kernel. When those
match byte-for-byte the agent sees it directly in the evidence and can
act on it without a prescriptive "apply verbatim" banner at the top.

This returns to the "informed cross-reference, not directive" principle
we already committed to. The code-similarity-based ranking from the
previous commit already surfaces byte-identical matches to the top of
the retrieved set -- no extra banner needed for the agent to notice.

Removed:
  * ``_find_exact_code_match`` (58 lines)
  * ``_build_exact_match_banner`` (18 lines)
  * top-of-context banner emission

Kept:
  * per-entry ``Code fingerprint`` line -- the signal the agent compares
  * top-of-context ``Your kernel fingerprint`` -- the reference value
  * code-similarity primary ranking (from previous commit)
  * raised context budgets (from previous commit)

Made-with: Cursor
… I/O

The retriever no longer accepts a ``kernel_path`` and never reads the
filesystem. Callers pass the kernel's raw source as ``target_code``; the
integration wrapper at ``memory.cross_session.retrieve()`` keeps a
``kernel_path`` kwarg as a backward-compat shim that reads the file
once before forwarding.

Why this matters:
  sub-agent dispatch contexts were passing a ``kernel_path`` that
  pointed at a not-yet-materialised working-dir location
  (``/workspace/outputs/<k>/tasks/round_N/outputs/<k>/kernel.py``), so
  ``_read_target_code`` silently returned ``""`` and every KB entry's
  code_sim dropped to 0. The ranking then collapsed to success_boost
  only, which surfaced unrelated high-speedup kernels
  (e.g. ``three_nn``/5.38x) as top for ``fused_qkv_rope``/
  ``fast_rms_layernorm`` sub-agent retrievals. The orchestrator-level
  retrieval was correct because its path resolved; sub-agents regressed.

Making the retriever purely a (target_code, candidates) -> ranking
function removes the failure mode by construction -- the caller either
has the code and supplies it, or retrieval declines (no misleading
partial context). This is semantically cleaner too: code identity
matches should be computed on code, not inferred from path strings.

Concrete changes:
  * ``retrieve_context(kernel_path=...)`` -> ``retrieve_context(target_code=...)``
  * Removed ``_read_target_code``, ``_infer_category``, ``_infer_language``,
    ``_kernel_stem_overlap`` (path-based helpers, no longer used).
  * Scoring collapses to ``code_sim + capped success_boost``:
    - stem_boost removed: name strings are a path-derived proxy for
      code similarity; if code similarity is weak, the stem heuristic
      was overweighting tenuous cross-family transfers.
    - category/bottleneck/language boosts already gone in the previous
      commit for the same reason.
  * ``format_landscape_context`` takes ``target_code`` instead of
    ``target_kernel_path``; drops its own Path-read logic.
  * ``memory.cross_session.retrieve()`` accepts ``target_code`` (preferred)
    and still accepts legacy ``kernel_path`` (reads the file once before
    forwarding the raw code).

Semantic validation (re-simulated on current KB of 23 entries):

  fused_rms_fp8 target (49,666 B):
    top = fused_rms_fp8 sp=2.231x total=1.100 (code_sim strong=yes)
    -- identical to the outgoing behaviour, correctly surfaces the
    same-kernel 2.23x entry at rank 1.

  fused_qkv_rope target (25,074 B at the real AKA path; 11,491 B at
  the GEAK working-dir stripped version):
    * AKA path    -> code_sim=1.000 (byte-identical), same-kernel wins.
    * Working dir -> code_sim~=0.45 (Jaccard over shared lines),
      same-kernel still wins because 0.45 + 0.05 > any cross-kernel
      0 + 0.20 (three_nn's success_boost).

  Sub-agent path regression (target_code="" failure mode) now impossible:
  if a caller passes ``target_code=""`` the retriever declines via the
  ``best_score < 0.02`` relevance gate; no unrelated entries surface.

Made-with: Cursor
The agent now sees the Jaccard code-overlap percentage between its
current kernel and every retrieved KB entry, plus a top-level
KB-relevance tier. Without this signal the agent sometimes anchored on
a weak cross-family entry (e.g. gemm_a16w16_atomic ~ 13% overlap with
gemm_a16wfp4) and defaulted to generic-GEMM strategies across all 5
rounds instead of recognising early that the KB had no close match and
pivoting to kernel-specific analysis (MXFP4 quant ops etc.).

The framing stays non-prescriptive: we report the NUMBER, explain
what it means, and let the agent weigh.

Retriever:
  * Compute per-entry code_sim for the top-k selected entries (reuses
    _code_similarity, no extra pass over all candidates).
  * Thread ``per_entry_code_sim`` into ``format_landscape_context``.

Formatter:
  * Top-of-context ``KB relevance`` tier:
      STRONG   (code_sim ≥ 99%)        "patch applies verbatim"
      PARTIAL  (25% ≤ code_sim < 99%)  "adapt techniques, validate against
                                        your hot paths"
      LOW      (code_sim < 25%)        "distant cross-family references;
                                        analyse YOUR kernel + profiler for
                                        kernel-specific optimisations"
  * Per-entry ``**Code similarity to your kernel**: NN.N%`` line with
    the same qualitative tier. Agent can now sort/weigh entries by
    their actual code overlap rather than rank position alone.
  * Reasoning guidance updated to call out: "if no entry matches well
    and early rounds of KB-inspired strategies don't improve, pivot to
    analysing the current kernel's profile and propose kernel-specific
    optimizations."

Validated against the current 23-entry KB:

  fused_rms_fp8 target     → KB relevance: STRONG (100% top match)
  fused_qkv_rope target    → KB relevance: STRONG (100% top match)
  fast_rms_layernorm       → KB relevance: PARTIAL (26% top match)
  gemm_a16wfp4 target      → KB relevance: LOW    (13% top match)

The LOW-relevance case is exactly where the previous run got stuck at
1.03x: the agent anchored on a 13% same-category (GEMM) entry and
rolled out 5 rounds of generic GEMM strategies, never trying MXFP4-
specific optimisations (fuse-wrapper-ops, precompute-quant-separate,
log2/exp2 bitops) that the historical 1.43x peak exploited. With the
explicit "LOW / use as weak hints / focus on YOUR kernel's quant ops"
framing the agent should recognise the mismatch earlier and pivot.

Made-with: Cursor
…m number

Earlier we added "KB relevance: STRONG / PARTIAL / LOW" banners and
per-entry "STRONG/WEAK: treat as generic hint only" labels with
directive-flavoured text like "prioritise analysing YOUR kernel's
profiler output". Reverting this -- the agent already has:

  * Full current kernel source (task body)
  * Current baseline_metrics / profiler output (injected separately)
  * Each KB entry's stored code_fingerprint, code_sim %, baseline→
    best latency, bottleneck, strategies with diffs, key params,
    round trajectory, regressions

Adding a pre-digested "this is WEAK, pivot to kernel-specific" tier is
us interpreting on the agent's behalf. Given the same raw evidence the
agent can (and should) form its own judgement.

Kept from the previous commit:
  * Per-entry raw Jaccard percentage -- a signal the agent can't derive
    from fingerprint alone ("your fingerprint is A; the KB entry's is B"
    tells you they differ but not by how much).
  * Per-entry code_fingerprint (so the agent can confirm byte-identity
    exactly when it matters).
  * Per-entry full evidence block (hardware, performance, diffs, etc.).

Removed:
  * Top-of-context "KB relevance: STRONG/PARTIAL/LOW" banner.
  * Per-entry qualitative tier suffix ("STRONG: byte-identical source,
    patch applies verbatim", "WEAK: distant cross-family, treat as
    generic hint only", etc.).
  * Reasoning-guidance directives ("pivot to analysing the current
    kernel's profile", "prioritise analysing YOUR kernel's profiler
    output", "if early rounds of KB-inspired strategies don't
    improve...").

Guidance text is now minimal and descriptive:
"*Below: evidence from past optimization runs... You also have your
current kernel's full source and profiler metrics from the main task.
Use both inputs to form your own plan -- the KB informs your decision,
it does not make it for you.*"

Made-with: Cursor
`write_task_file` was writing path-valued frontmatter fields verbatim
when no `relative_to=` anchor was passed. Callers (`tools.py::tool_generate_tasks`)
sometimes pass CWD-relative strings like `outputs/fused_qkv_rope/kernel.py`,
sometimes absolute paths -- depending on how the orchestrator was
initialised. The downstream reader (`read_task_file`) then resolves the
relative string against the *task file's own directory*, producing
nonsense paths like
`<output>/<kernel>/tasks/round_2/outputs/<kernel>/kernel.py`
that don't exist.

Visible symptom: the cross-session memory retriever, called from
`dispatch.task_file_to_agent_task` for sub-agent injection, gets
`kernel_path` pointing at a non-existent file, fails the read silently
(OSError -> empty string), and logs `Retriever: target_code=0B
bottleneck=unknown`. Code-similarity scoring then returns zero for all
entries, so unrelated KB entries get promoted (e.g. `fast_rms_layernorm`
6.50x getting injected into `fused_qkv_rope` sub-agents).

Empirically verified against today's MI355X runs: slot1 (`gemm_a16wfp4`)
had absolute paths in its task files and 0/28 retrieval calls had
target_code=0B; slot2 (`fused_qkv_rope`) had relative paths and 32/37
(86%) retrieval calls had target_code=0B and surfaced wrong-kernel KB
entries.

Fix: in `write_task_file`, resolve path-valued fields to absolute paths
against the writer's CWD before serialising. The read-side resolution
in `read_task_file` reduces to a no-op for absolute paths, so the file
opens correctly regardless of who later reads it (orchestrator
sub-agent dispatch, parallel agent worker, standalone CLI).

Smoke-tested end-to-end: a relative `outputs/foo/kernel.py` written
from CWD `/tmp/.../workspace` is stored as the absolute path and read
back correctly from a different CWD (`/tmp`). Behaviour with
`relative_to=` set is unchanged.
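
The write-side resolution can be sketched as follows. The field names in `PATH_FIELDS` and the helper name are hypothetical; the real `write_task_file` may select path-valued fields differently.

```python
from pathlib import Path

PATH_FIELDS = ("kernel_path", "harness_path")  # hypothetical field names


def absolutize_path_fields(frontmatter: dict, cwd: Path) -> dict:
    """Return a copy of the frontmatter with path-valued fields made absolute."""
    resolved = dict(frontmatter)
    for key in PATH_FIELDS:
        value = resolved.get(key)
        if value:
            path = Path(value)
            # Absolute inputs pass through unchanged; relative ones are anchored at cwd.
            resolved[key] = str(path if path.is_absolute() else cwd / path)
    return resolved
```

Passing `Path.cwd()` as `cwd` at write time pins the path to the writer's working directory, so a reader in any other directory sees the same file.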

Made-with: Cursor
update skill to support docs and scripts within skill folder
… postprocessor

- Replace site.main() with sys.path.insert() in mini.py for more targeted
  path refresh after rag-mcp auto-install (PR AMD-AGI#90 review followup)
- Pass api_key from agent model config to RAG postprocessor so it can use
  yaml-configured api_key instead of relying solely on env vars (Issue AMD-AGI#169)
…-install

Made-with: Cursor

# Conflicts:
#	README.md
#	mcp_tools/README.md
#	pyproject.toml
…ocess

fix(preprocess): use full Metrix profile for baseline runs
…tall

fix(packaging): make full extras pip-installable again
…-rag-integration

feat(memory): enhance cross-session memory retrieval + RAG
Removed the unnecessary top-level knowledge-base directory copy that causes docker failure.
Removed from install / install-full / install-dev. Replaced with an
opt-in 'make index' target. mini.py already lazy-builds on first RAG
use when tools.rag is enabled (off by default), so the eager run only
slowed every 'make install' (and docker build) by several minutes
without benefit.

Made-with: Cursor
…efore-install

fix(docker): copy scripts/ before make install for RAG index build
fix: replace site.main() with sys.path.insert and pass api_key to RAG postprocessor
fix(docker): include scripts/ in image so make index works at runtime
Co-authored-by: Claude <noreply@anthropic.com>

9 participants