Geak triton common benchmark by iraj465 · Pull Request #28 · AMD-AGI/AgentKernelArena

iraj465 · 2026-04-06T08:23:40Z

Adds triton tasks to Arena.

Made-with: Cursor

- Add geak_v3_triton agent: full geak-preprocess + geak-orchestrate pipeline with patch application from worktree evaluation
- Add 8 Triton eval kernels (L1/L2/L3) with harnesses and configs including compile_command for AKA evaluator compatibility
- Add run_geak_triton.sh for dual-stream parallel execution
- Add config files for Triton and HIP benchmark runs
- Switch geak_v3 HIP agent from 'mini' to 'geak' entrypoint
- Fix GPU baseline measurement: set HIP_VISIBLE_DEVICES during compilation and performance measurement
- Register geak_v3_triton in module_registration.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- geak_v3_triton now calls `geak --kernel-url --harness` (same entrypoint as HIP/geak_v3) instead of separate preprocessor + orchestrator calls
- Both Triton and HIP agents use the unified geak CLI
- Update README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- geak_v3 (HIP): now calls `geak --kernel-url <kernel> --eval "<commands>"` instead of `geak -t <task_prompt.md>` (Path B)
- geak_v3_triton: uses `--eval <harness>` instead of `--harness`
- Both agents use the same unified geak CLI with --eval auto-detection (file path → harness mode, shell commands → command mode)
- Updated README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
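The `--eval` auto-detection described above can be illustrated with a minimal sketch. The PR does not show how the geak CLI actually decides, so the rule used here (an argument that names an existing file is a harness, anything else is a shell command) and the function name `detect_eval_mode` are assumptions for illustration only.

```python
from pathlib import Path


def detect_eval_mode(eval_arg: str) -> str:
    """Classify an --eval argument (hypothetical helper, not the geak CLI's code).

    Assumption: a value that points at an existing file is treated as a
    harness script; any other value is treated as shell commands.
    """
    return "harness" if Path(eval_arg).is_file() else "command"
```

Under this assumption, passing a path such as a task's `test_kernel_harness.py` selects harness mode, while a string like `make && ./bench` selects command mode.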

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The geak CLI writes results directly to logs_dir/ (not logs_dir/preprocess/). The launcher was looking in logs_dir/preprocess/ which doesn't exist with the new unified CLI, causing _apply_best_patch to never run and all AKA speedups to show 0.0x despite real GEAK optimizations. Fix: check for final_report.json in logs_dir first, fall back to preprocess subdir for backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
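The lookup order described in this fix can be sketched as follows. The helper name `find_final_report` is hypothetical; only the file name `final_report.json` and the two candidate directories come from the commit message.

```python
from pathlib import Path
from typing import Optional


def find_final_report(logs_dir: Path) -> Optional[Path]:
    """Return the first final_report.json found, or None.

    Checks logs_dir itself first (unified CLI layout), then the legacy
    logs_dir/preprocess/ subdirectory for backward compatibility.
    """
    candidates = (
        logs_dir / "final_report.json",
        logs_dir / "preprocess" / "final_report.json",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` rather than raising lets the caller decide whether a missing report is an error or simply means no patch should be applied.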

… score launch_agent.py: read full_benchmark.verified_speedup from round evaluation JSONs instead of benchmark_speedup. The select_agent score can be inflated (e.g. 2.53x) while the actual FULL_BENCHMARK verification shows regression (0.96x). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…arch Two fixes for _apply_best_patch():

1. Worktree search: Use rglob() to find kernel.py recursively under slot_* dirs (files are nested at tasks/triton2triton/geak_eval/.../kernel.py)
2. Patch strip: Try -p1 through -p8 since GEAK patches have nested paths like a/tasks/triton2triton/geak_eval/L2/topk/kernel.py (-p6 needed)

Previously all patches failed with "can't find file to patch" because -p1 only stripped the git a/ prefix, leaving the full tasks/... path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
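Why `-p6` is the right strip count for the example path can be shown with a small sketch. The commit tries each `-pN` with `patch` itself; the code below is a different, simplified approach that predicts the level by checking which stripped path exists in the workspace. The function name `find_strip_level` is hypothetical.

```python
from pathlib import Path, PurePosixPath
from typing import Optional


def find_strip_level(patch_path: str, workspace: Path, max_strip: int = 8) -> Optional[int]:
    """Return the smallest N such that dropping N leading path components
    from patch_path yields a file that exists under workspace."""
    parts = PurePosixPath(patch_path).parts
    for n in range(1, max_strip + 1):
        if n >= len(parts):
            break  # stripping this many components would leave nothing
        if workspace.joinpath(*parts[n:]).is_file():
            return n
    return None
```

For `a/tasks/triton2triton/geak_eval/L2/topk/kernel.py` and a workspace that holds only `kernel.py` at its top level, six components (`a`, `tasks`, `triton2triton`, `geak_eval`, `L2`, `topk`) must be dropped, so the function returns 6.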

The .to('cuda') call on a requires_grad tensor creates a non-leaf tensor, so .grad is never populated during backward(). Fixed by creating directly on device='cuda'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When AKA's performance_command parser fails to extract test cases from harness output, fall back to reading GEAK's final_report.json which contains already-verified baseline_ms, candidate_ms, and verified_speedup. This ensures speedup_ratio in task_result.yaml reflects GEAK's actual verified results instead of always being 0.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reports benchmark_speedup, best_task, best_round, and full round_history from GEAK's final_report.json so AKA captures both the task-local benchmark speedup and the verified FULL_BENCHMARK speedup per round. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jit

pid_grid was a plain function called from @jit kernel — needs @triton.jit. EVEN_M_N heuristic was defined but never used in kernel body, and caused KeyError on newer Triton versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Consistent with GEAK's results.py — use whichever measurement is higher to avoid undercounting on noisy tiny-kernel benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Consistent with GEAK — FULL_BENCHMARK verified_speedup is the independently reproducible ground truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add 8 new kernel tasks from AIG-Eval (sdubagun/fix-kernel-harness-parity):
- L1: fused_append_shared_experts, mla_decode
- L2: rope
- L3: gemm, gemm_a16w16_atomic, fused_qk_rope_cache_mla, fused_mxfp4_quant_moe_sort, fused_moe_mxfp4

New tasks use aiter_commit field in config.yaml for reproducible benchmarks. When aiter_commit is present, AKA evaluator automatically runs harness commands inside Docker via docker exec, with the correct aiter version checked out.

Framework changes:
- evaluator_utils.py: add docker_container param to run_command(), add checkout_aiter() for pinned aiter versions
- evaluator.py: thread docker_container through evaluate functions
- performance.py: thread docker_container + GEAK_RESULT_LATENCY_MS parsing
- main.py: detect aiter_commit, checkout aiter, pass docker_container

Backward compatible: existing 8 kernels run on host unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Removed: mla_prefill_reduce, refk_mla_decode, refk_moe, rope, nsa_forward. Updated all config files to remove stale references. Made-with: Cursor

launch_agent.py:
- Reorder patch strategies: try .patch file from best verified round first, then final_report.json patch, then worktree kernel copy. Previously worktree copy ran before patch file application.
- Add change verification: after applying any patch or copying a worktree kernel, verify the file actually changed vs original. Skip candidates that produce identical output.
- Use verified speedup preferentially over agent-reported benchmark speedup for round selection (verified if > 0, else benchmark).
- Read original kernel text once at the top instead of re-reading in multiple places.

agent_config.yaml:
- Increase timeout from 3600s to 10800s for HIP kernel runs that require longer compilation and multi-shape benchmarking.

Made-with: Cursor
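The round-selection rule above ("verified if > 0, else benchmark") can be sketched as below. The flat dictionary keys `verified_speedup` and `benchmark_speedup` and both function names are assumptions for illustration; the real round evaluation JSONs may nest these fields differently (an earlier commit mentions `full_benchmark.verified_speedup`).

```python
from typing import Dict, List, Optional


def effective_speedup(round_info: Dict) -> float:
    """Prefer the verified speedup; fall back to the agent-reported
    benchmark speedup only when no positive verified value exists."""
    verified = round_info.get("verified_speedup") or 0.0
    if verified > 0:
        return verified
    return round_info.get("benchmark_speedup") or 0.0


def pick_best_round(rounds: List[Dict]) -> Optional[Dict]:
    """Return the round with the highest effective speedup, or None if empty."""
    return max(rounds, key=effective_speedup, default=None)
```

With this rule, a round reporting a 2.53x benchmark speedup but a 0.96x verified speedup ranks below a round verified at 1.08x, which matches the inflated-score case described in the earlier launch_agent.py fix.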

… support
- Add RDNA4_architecture.md with gfx1201 hardware specs (Wave32, WGP, WMMA, GDDR6 bandwidth constraints)
- Add hip_rdna_cheatsheet.md as standalone RDNA HIP best practices
- Wire RDNA4 into default_cheatsheet.yaml with knowledge_override
- Set AMDGPU_TARGETS and GPU_TARGETS in preprocessing for CMake builds
- Support knowledge_override in prompt_builder for per-arch cheatsheets
- Remove hardcoded MI300X prompt from points_in_boxes task config

Made-with: Cursor

- Add scaled_fp4_quant to sgl_kernel mock in fused_append_shared_experts harness to fix ImportError from upstream sglang changes
- Add per-config error threshold for mla_decode correctness check to handle known numerical divergence at nhead=128, dq=2, ctx<=21
- Replace old vLLM/ROCmBench README with standard 18-kernel task list

Made-with: Cursor

…a-hip-enablement

simple_prompt_builder() bypasses src/prompt_builder.py, so the architecture cheatsheets + arch-precheck directive wired up in 14d240f were never reaching the agent. Load them in launch_agent.py after the simple prompt is built, mirroring src/prompt_builder.py's section layout (precheck + architecture context + language-specific knowledge). Made-with: Cursor

…rocessing support
- Add RDNA4_architecture.md with gfx1201 hardware specs (Wave32, WGP, WMMA, GDDR6 bandwidth constraints)
- Add hip_rdna_cheatsheet.md as standalone RDNA HIP best practices (needed for hip2hip and torch2hip tasks on this branch)
- Add triton_rdna_cheatsheet.md as standalone RDNA Triton best practices (Wave32 implications, WMMA vs MFMA, gfx1201 tl.dot dtype support)
- Wire RDNA4 into default_cheatsheet.yaml with per-language knowledge_override (hip + triton)
- Set AMDGPU_TARGETS and GPU_TARGETS in preprocessing for CMake builds
- Support knowledge_override in prompt_builder for per-arch cheatsheets
- Remove hardcoded MI300X prompt from points_in_boxes task config

Mirrors AgentKernelArena HIP-branch commit 14d240f, with an added triton-specific cheatsheet and knowledge_override wiring.

Made-with: Cursor

Enable RDNA4 (gfx1201) HIP support: arch cheatsheets, prompt builder, preprocessing

Adds a uniform Task Contract section to every hip2hip task prompt with the four constraints raised in the PR #35 review:

1. Preserve kernel function names and signatures (named per-task from `target_kernel_functions` so the agent sees exactly which symbols are part of the contract).
2. Keep the launch / configuration interface compatible (grid/block dims, stream usage, host-side launch helpers / Python bindings).
3. Output must remain directly compilable and runnable with the existing `compile_command` / `correctness_command` / `performance_command`, with no edits to the test runner.
4. Handle shared-memory launch sizing correctly when `__shared__` / dynamic LDS allocations are introduced.

Hosts the contract in `src/prompts/task_type.py::hip2hip_task_contract` and injects it from `src/prompt_builder.py` (regular path) and `agents/geak_v3/launch_agent.py` (GEAK-v3 simple_prompt_builder path, mirroring the cheatsheet injection added in PR #35) so the contract reaches every agent uniformly.

Why framework-level (not per-task `prompt.instructions`):
- Applies uniformly to all 36 current hip2hip configs (24 gpumode/* with populated `instructions` + 12 others/* with `instructions: null`) without per-task duplication or drift.
- Single source of truth: future edits update one function, not 36 YAMLs.
- Architecture-neutral by construction (no MI300X-specific hardware text, which was the original problem in `tasks/hip2hip/others/points_in_boxes/config.yaml` that PR #35 cleaned up).

Scope:
- Only `task_type == 'hip2hip'`. `triton2triton`, `torch2hip`, `cuda2hip`, `repository`, `instruction2triton` unchanged.
- No task YAML edits.
- No new dependencies.

Verification:
- All 36 hip2hip configs render with the contract present and the correct kernel name(s) listed.
- 5/5 sampled triton2triton configs do NOT receive the contract (correctly scoped).
- Both injection paths exercised (regular `prompt_builder` and GEAK-v3 `launch_agent`).

Follow-up to PR #35 review thread (irvineoy's request). Co-authored-by: Cursor <cursoragent@cursor.com>

… path too setup_rocm_env() previously set all three of PYTORCH_ROCM_ARCH, AMDGPU_TARGETS, and GPU_TARGETS only on the fallback path. On the common rocminfo-success path it set just PYTORCH_ROCM_ARCH and returned, leaving the two CMake env vars unset. Reproduced on MI300: rocminfo detected gfx942 but AMDGPU_TARGETS / GPU_TARGETS remained None, so CMake-based HIP builds did not pick up the selected arch. Refactor the function to resolve gfx_arch first (rocminfo -> cheatsheet fallback) and converge on a single export block driven by a module- level tuple. Both paths now export all three vars uniformly, and any future CMake env var can be added to the tuple in one place. Addresses irvineoy's review on #37. Co-authored-by: Cursor <cursoragent@cursor.com>
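The refactor described above (resolve the architecture once, then export every variable from a single tuple) might look roughly like the sketch below. The three variable names come from the commit message; the function name `export_arch`, the `detect` callable standing in for rocminfo detection, and the explicit `fallback` argument standing in for the cheatsheet lookup are assumptions.

```python
import os
from typing import Callable, MutableMapping, Optional

# Module-level tuple: adding a future CMake variable means editing one place.
ARCH_ENV_VARS = ("PYTORCH_ROCM_ARCH", "AMDGPU_TARGETS", "GPU_TARGETS")


def export_arch(
    detect: Callable[[], Optional[str]],
    fallback: str,
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Resolve gfx_arch first (detection, then fallback) and export it
    under every name in ARCH_ENV_VARS, regardless of which path resolved it."""
    gfx_arch = detect() or fallback
    for name in ARCH_ENV_VARS:
        env[name] = gfx_arch
    return gfx_arch
```

Because both resolution paths converge on the same loop, the detection-success path can no longer skip `AMDGPU_TARGETS` and `GPU_TARGETS`, which was the bug reproduced on MI300.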

feat(rdna): add RDNA4 Triton + HIP cheatsheets, arch config, and preprocessing support

…ints feat(hip2hip): inject generic task contract at framework level

Conflict resolutions:
- .gitignore: union of both ignore lists
- main.py: integrated main's torch2hip baseline-skip branch into our aiter checkout + GPU pinning (HIP_VISIBLE_DEVICES) wrapper
- src/evaluator.py: combined main's configurable compile/correctness timeouts with our docker_container kwarg
- src/performance.py: same pattern: perf_timeout + --baseline_only + docker_container kwarg merged into the run_command call
- src/evaluator_utils.py: in run_command, docker_container path skips normalize_python_command (host's python isn't in the container); non-docker path keeps main's interpreter normalization. checkout_aiter function preserved at end of file
- src/preprocessing.py: took main's enhanced _ensure_repo_cloned (tuple return, shallow clone, non-git target cleanup) and the new post_clone_install helpers. Our RDNA4 / AMDGPU_TARGETS additions were outside the conflict markers and remained intact
- src/prompt_builder.py: took main's repository_language resolution (task_type == 'repository' branch), kept our knowledge_override lookup for RDNA4 arch entries
- tasks/repository/rocprim/*/config.yaml (4 files): took main's superset (task_type, repository_language, post_clone_install)

Verified end-to-end with refk_identity smoke test on MI300 GPU 0 (1 GEAK round): compile/correctness/perf eval all pass, final report written.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Replace the proliferation of ad-hoc per-slot/per-batch/per-kernel configs with a single canonical config covering all 18 benchmark kernels, plus a tiny smoke config for sanity checks.

The two-slot 4+4 GPU parallel run convention is achieved by passing the all18 config to scripts/run_geak_triton.sh, which already splits tasks odd/even across GPUs 0-3 and 4-7. Because the config groups tasks by level (6 L1, 4 L2, 8 L3), the split is balanced 3/2/4 per slot.

- Add config_geak_triton_all18.yaml and config_geak_triton_smoke.yaml
- Delete 13 stale slot/batch/rerun/partial configs
- Update .gitignore whitelist to track the two surviving configs
- Update README and agents/geak_v3_triton/README to document the single-config + run_geak_triton.sh workflow
- Fix stale defaults/help text in scripts/run_geak_triton.sh

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
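The 3/2/4 balance claimed above follows from each level containing an even number of tasks, and can be checked with a short sketch. The task names are placeholders, and splitting by list-index parity is an assumption about how run_geak_triton.sh implements its odd/even split.

```python
from collections import Counter
from typing import List, Tuple


def split_odd_even(tasks: List[str]) -> Tuple[List[str], List[str]]:
    """Alternate tasks between two slots by list position."""
    return tasks[0::2], tasks[1::2]


def level_counts(tasks: List[str]) -> Counter:
    """Count tasks per level, taking the level from the 'L1/...' prefix."""
    return Counter(task.split("/")[0] for task in tasks)


# Placeholder task list grouped by level: 6 L1, 4 L2, 8 L3 (18 total).
tasks = (
    [f"L1/kernel_{i}" for i in range(6)]
    + [f"L2/kernel_{i}" for i in range(4)]
    + [f"L3/kernel_{i}" for i in range(8)]
)
slot_a, slot_b = split_odd_even(tasks)
print(level_counts(slot_a))  # 3 L1, 2 L2, 4 L3
print(level_counts(slot_b))  # 3 L1, 2 L2, 4 L3
```

The balance depends on the grouping: if a level had an odd number of tasks, the two slots would differ by one task at that level and the parity of every later level's starting index would flip.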

Make kernel.py and test_kernel_harness.py fully self-contained so GEAK's per-task workspace can see (and optimize) all relevant code. Inlines _fused_moe_kernel_mxfp4, helpers (get_num_xcds, torch_to_triton_dtype, get_scaled_dot_format_string, remap_xcd, pid_grid, _write_zeros_to_output), the gfx950 MOE config table, arch detection, mxfp4 quant/dequant helpers, and the einsum reference. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Make kernel.py and test_kernel_harness.py fully self-contained so GEAK's per-task workspace can see (and optimize) all kernels and helpers. Inlines the fused rms/mxfp4-quant/moe-sort triton kernels, activation helpers, and reference paths previously imported from aiter.ops.triton. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Reorganize the inlined fused QK RoPE + KV-cache kernel and rotation helpers so the workspace has zero aiter dependency, with explicit source attribution for each inlined block. Harness updated to match. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

sharareh-y

Thanks for your contribution. Overall the code looks good; here are the things I'd like addressed:

- fused_append_shared_experts is wired so the agent edits kernel.py but the harness only loads fused_moe_triton_kernels.py (see file-level comment). Speedup for this kernel will always be ~1.0×.
- User-specific hardcoded paths are present in the code: the sapmajum username and the /home/sapmajum/... GEAK_SRC fallback in scripts/run_geak_triton.sh, /workspace/src in agents/mini_swe_triton/launch_agent.py, and the /sgl-workspace/aiter default in src/evaluator_utils.py.
- In src/evaluator.py, GEAK results overwrite the report if AKA fails. We don't want this behavior; the final evaluation for all agents should come from AKA runs.
- agents/geak_v3_triton/README.md and the agent_config.yaml disagree on defaults (MAX_ROUNDS 5 vs 3, MODEL_ENSEMBLE listed but not set), and the "All 18 kernels" table doesn't match config_geak_triton_all18.yaml (ff_backward listed at L1, mystery rope row, etc.).

sharareh-y · 2026-05-25T18:14:57Z

Blocking. This harness is hardcoded to load fused_moe_triton_kernels.py, but the task config (source_file_path: kernel.py) and both launchers (geak_v3_triton, mini_swe_triton) only edit kernel.py. After the agent runs, the two files diverge and the harness keeps importing the unmodified duplicate — so --correctness always passes and --full-benchmark always reports the baseline. Effective speedup will be 1.0× regardless of the agent.

Two options:
a) Drop fused_moe_triton_kernels.py (it's a byte-identical 977-line copy of kernel.py) and change _KERNEL_FILENAME to kernel.py. [Preferred]
b) Keep the sglang-style filename and update source_file_path + the launchers to operate on fused_moe_triton_kernels.py.

Either is fine, but the current setup is broken end-to-end.

sharareh-y · 2026-05-25T18:26:32Z

Blocking. Please remove the user-specific fallbacks:
CONTAINER="geak-agent-${USER:-sapmajum}"
export GEAK_SRC="${GEAK_SRC:-/home/sapmajum/GEAK-agent-filtering-and-cli-unification/src}"

Both leak one developer's environment into shared tooling. Either hard-fail when GEAK_SRC / $USER isn't set, or document the required env vars at the top and use : "${GEAK_SRC:?GEAK_SRC must be set}".

Also: the .tmp_config_stream_{a,b}.yaml temp files are written into the repo root and only cleaned up on the happy path. With set -euo pipefail and the || FAIL=1 swallow on wait, a crash leaves dirty files in the worktree. Please wrap cleanup in a trap.

sharareh-y · 2026-05-25T18:37:58Z

+
+    # Step 3b: If performance measurement failed, read GEAK's final_report.json
+    if results['best_optimized_execution_time'] == 0.0:
+        geak_results = _read_geak_final_report(workspace, log)
+        if geak_results:
+            results['best_optimized_execution_time'] = geak_results['candidate_ms']
+            results['average_speedup'] = geak_results['verified_speedup']
+            results['valid_optimized_cases'] = 1
+            results['valid_baseline_cases'] = 1
+            results['geak_baseline_ms'] = geak_results['baseline_ms']
+            results['geak_benchmark_speedup'] = geak_results.get('benchmark_speedup')
+            results['geak_best_task'] = geak_results.get('best_task')
+            results['geak_best_round'] = geak_results.get('best_round')
+            results['geak_round_history'] = geak_results.get('round_history', [])
+            log.info(
+                f"Using GEAK verified results: {geak_results['verified_speedup']:.4f}x "
+                f"(baseline={geak_results['baseline_ms']:.4f}ms, "
+                f"candidate={geak_results['candidate_ms']:.4f}ms, "
+                f"benchmark={geak_results.get('benchmark_speedup', 'N/A')}x, "
+                f"task={geak_results.get('best_task', 'N/A')})"
+            )
+


Blocking.
Please remove this part; we want the final evaluation to come from AgentKernelArena.

sharareh-y · 2026-05-25T18:42:24Z

This README is out of sync with the code it's documenting:

Defaults table says GEAK_MAX_ROUNDS=5 and GEAK_MODEL_ENSEMBLE=gpt-5.2,claude-opus-4.6, but agent_config.yaml ships MAX_ROUNDS=3 and has no MODEL_ENSEMBLE set.

"All 18 Triton Kernels" table lists ff_backward at L1, but config_geak_triton_all18.yaml lists it at L2.

Same table has a rope kernel at L2 (# 11) — there's no rope task in config_geak_triton_all18.yaml. Did you mean fused_qk_rope_cache_mla / fused_qkv_rope?

L1 row count is 7 in the README, 6 in the config.

…-common-benchmark

# Conflicts:
# .gitignore
# src/preprocessing.py
# src/prompts/cheatsheet/default_cheatsheet.yaml
# tasks/repository/rocprim/block_radix_rank/config.yaml
# tasks/repository/rocprim/device_binary_search/config.yaml
# tasks/repository/rocprim/device_merge_sort/config.yaml
# tasks/repository/rocprim/device_search_n/config.yaml

Introduce num_parallel knob so GEAK's gwiab-scheduler can run >1 subagent per GPU. Resolution priority: GEAK_NUM_PARALLEL env > eval config num_parallel > agent config num_parallel > len(gpu_ids) (historical default). Wired into both geak_v3 (regex-overrides --num-parallel/--gpu-ids in run.configs) and geak_v3_triton (helper call replaces hardcoded len(gpu_ids)). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
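The resolution priority stated above can be sketched as a single helper. The `GEAK_NUM_PARALLEL` variable and the `num_parallel` key come from the commit message; the function name `resolve_num_parallel` and the plain-dict config shapes are assumptions.

```python
import os
from typing import Mapping, Sequence


def resolve_num_parallel(
    eval_cfg: Mapping,
    agent_cfg: Mapping,
    gpu_ids: Sequence[int],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Priority: GEAK_NUM_PARALLEL env > eval config > agent config > len(gpu_ids)."""
    env_val = env.get("GEAK_NUM_PARALLEL")
    if env_val:
        return int(env_val)
    for cfg in (eval_cfg, agent_cfg):
        val = cfg.get("num_parallel")
        if val is not None:
            return int(val)
    # Historical default: one subagent per visible GPU.
    return len(gpu_ids)
```

Keeping `len(gpu_ids)` as the last fallback means existing configs that never set `num_parallel` behave exactly as before.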

agents/geak_v3/geak.yaml was missing the run.budgets.{quick,full} block that GEAK now requires. Aligning with upstream GEAK's src/minisweagent/config/geak.yaml — same quick/full budget shape used by geak_with_modes.yaml — so HIP runs launch without "No run.budgets.full block in config". config_smoke_hip.yaml is a one-task hip2hip smoke config used to validate the GEAK gwiab-scheduler single-GPU + multi-subagent path on the HIP code path. Allowlisted in .gitignore alongside the existing triton smoke/all18 configs.

yueliu14 and others added 30 commits March 8, 2026 21:52

add geak benchmark and repo tasks

c41742c

modify task_runner

3b49279

change rocprim task_runner.py

ba8cb08

change rocprim task_runner.py

87b2508

modify geak_benchmark parallel

6de72f6

modify geak_benchmark parallel

1db99a0

add task configs

5b25da6

add task configs

76362bb

add readme

df491fa

Update README.md

2ca9a6d

Update README.md

a8674b1

change agent name from geak_benchamrk to geak_v3

b00392e

change name geak_benchmark to geak_v3

0bdb055

Merge main into geak_benchmark: add repo_url support

a1e872a

Made-with: Cursor

Add repo_url support for rocprim tasks

c06d30c

change to geak_v3

7621596

Add README for geak_v3_triton agent

497b027

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use max of benchmark and verified speedup in AKA evaluator

658ec8c

Consistent with GEAK's results.py — use whichever measurement is higher to avoid undercounting on noisy tiny-kernel benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use verified_speedup as canonical, revert max() fallback

a549cff

Consistent with GEAK — FULL_BENCHMARK verified_speedup is the independently reproducible ground truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

iraj465 and others added 16 commits April 10, 2026 07:20

remove 5 non-benchmark kernels, keep 18 target kernels

7f4e9ed

Removed: mla_prefill_reduce, refk_mla_decode, refk_moe, rope, nsa_forward. Updated all config files to remove stale references. Made-with: Cursor

fix best patch apply error, change best_results.json->final_report.json

42f94a8

Merge remote-tracking branch 'origin/main' into geak_benchmark

34612d7

Merge remote-tracking branch 'origin/geak_benchmark' into feature/rdn…

22c0439

…a-hip-enablement

GEMM kernel fixes

23583fa

Merge pull request #35 from AMD-AGI/feature/rdna-hip-enablement

2284ab6

Enable RDNA4 (gfx1201) HIP support: arch cheatsheets, prompt builder, preprocessing

Merge pull request #37 from AMD-AGI/feature/rdna-triton-enablement

3a79d46

feat(rdna): add RDNA4 Triton + HIP cheatsheets, arch config, and preprocessing support

Merge pull request #38 from AMD-AGI/feature/hip2hip-framework-constra…

0d3fc5e

…ints feat(hip2hip): inject generic task contract at framework level

sdubagun-amd requested review from chushic, ppalanga, sharareh-y, sharonzhou, sinarafati-amd and yueliu14 as code owners May 24, 2026 17:56

sdubagun-amd and others added 4 commits May 24, 2026 13:17

sharareh-y requested changes May 25, 2026


sdubagun-amd and others added 3 commits May 27, 2026 05:58

iraj465 wants to merge 110 commits into main from geak-triton-common-benchmark


7 participants
