Skip to content

Geak triton common benchmark#28

Open
iraj465 wants to merge 110 commits into
mainfrom
geak-triton-common-benchmark
Open

Geak triton common benchmark#28
iraj465 wants to merge 110 commits into
mainfrom
geak-triton-common-benchmark

Conversation

@iraj465
Copy link
Copy Markdown
Collaborator

@iraj465 iraj465 commented Apr 6, 2026

Adds triton tasks to Arena.

yueliu14 and others added 30 commits March 8, 2026 21:52
- Add geak_v3_triton agent: full geak-preprocess + geak-orchestrate
  pipeline with patch application from worktree evaluation
- Add 8 Triton eval kernels (L1/L2/L3) with harnesses and configs
  including compile_command for AKA evaluator compatibility
- Add run_geak_triton.sh for dual-stream parallel execution
- Add config files for Triton and HIP benchmark runs
- Switch geak_v3 HIP agent from 'mini' to 'geak' entrypoint
- Fix GPU baseline measurement: set HIP_VISIBLE_DEVICES during
  compilation and performance measurement
- Register geak_v3_triton in module_registration.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3_triton now calls `geak --kernel-url --harness` (same entrypoint
  as HIP/geak_v3) instead of separate preprocessor + orchestrator calls
- Both Triton and HIP agents use the unified geak CLI
- Update README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3 (HIP): now calls `geak --kernel-url <kernel> --eval "<commands>"`
  instead of `geak -t <task_prompt.md>` (Path B)
- geak_v3_triton: uses `--eval <harness>` instead of `--harness`
- Both agents use the same unified geak CLI with --eval auto-detection
  (file path → harness mode, shell commands → command mode)
- Updated README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The geak CLI writes results directly to logs_dir/ (not logs_dir/preprocess/).
The launcher was looking in logs_dir/preprocess/ which doesn't exist with the
new unified CLI, causing _apply_best_patch to never run and all AKA speedups
to show 0.0x despite real GEAK optimizations.

Fix: check for final_report.json in logs_dir first, fall back to preprocess
subdir for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… score

launch_agent.py: read full_benchmark.verified_speedup from round
evaluation JSONs instead of benchmark_speedup. The select_agent
score can be inflated (e.g. 2.53x) while the actual FULL_BENCHMARK
verification shows regression (0.96x).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arch

Two fixes for _apply_best_patch():
1. Worktree search: Use rglob() to find kernel.py recursively under
   slot_* dirs (files are nested at tasks/triton2triton/geak_eval/.../kernel.py)
2. Patch strip: Try -p1 through -p8 since GEAK patches have nested paths
   like a/tasks/triton2triton/geak_eval/L2/topk/kernel.py (-p6 needed)

Previously all patches failed with "can't find file to patch" because
-p1 only stripped the git a/ prefix, leaving the full tasks/... path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .to('cuda') call on a requires_grad tensor creates a non-leaf tensor,
so .grad is never populated during backward(). Fixed by creating directly
on device='cuda'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When AKA's performance_command parser fails to extract test cases from
harness output, fall back to reading GEAK's final_report.json which
contains already-verified baseline_ms, candidate_ms, and verified_speedup.

This ensures speedup_ratio in task_result.yaml reflects GEAK's actual
verified results instead of always being 0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reports benchmark_speedup, best_task, best_round, and full round_history
from GEAK's final_report.json so AKA captures both the task-local
benchmark speedup and the verified FULL_BENCHMARK speedup per round.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pid_grid was a plain function called from @jit kernel — needs @triton.jit.
EVEN_M_N heuristic was defined but never used in kernel body, and caused
KeyError on newer Triton versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK's results.py — use whichever measurement is
higher to avoid undercounting on noisy tiny-kernel benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK — FULL_BENCHMARK verified_speedup is the
independently reproducible ground truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 8 new kernel tasks from AIG-Eval (sdubagun/fix-kernel-harness-parity):
- L1: fused_append_shared_experts, mla_decode
- L2: rope
- L3: gemm, gemm_a16w16_atomic, fused_qk_rope_cache_mla,
      fused_mxfp4_quant_moe_sort, fused_moe_mxfp4

New tasks use aiter_commit field in config.yaml for reproducible
benchmarks. When aiter_commit is present, AKA evaluator automatically
runs harness commands inside Docker via docker exec, with the correct
aiter version checked out.

Framework changes:
- evaluator_utils.py: add docker_container param to run_command(),
  add checkout_aiter() for pinned aiter versions
- evaluator.py: thread docker_container through evaluate functions
- performance.py: thread docker_container + GEAK_RESULT_LATENCY_MS parsing
- main.py: detect aiter_commit, checkout aiter, pass docker_container

Backward compatible: existing 8 kernels run on host unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iraj465 and others added 16 commits April 10, 2026 07:20
Removed: mla_prefill_reduce, refk_mla_decode, refk_moe, rope, nsa_forward.
Updated all config files to remove stale references.

Made-with: Cursor
launch_agent.py:
- Reorder patch strategies: try .patch file from best verified round
  first, then final_report.json patch, then worktree kernel copy.
  Previously worktree copy ran before patch file application.
- Add change verification: after applying any patch or copying a
  worktree kernel, verify the file actually changed vs original.
  Skip candidates that produce identical output.
- Use verified speedup preferentially over agent-reported benchmark
  speedup for round selection (verified if > 0, else benchmark).
- Read original kernel text once at the top instead of re-reading
  in multiple places.

agent_config.yaml:
- Increase timeout from 3600s to 10800s for HIP kernel runs that
  require longer compilation and multi-shape benchmarking.

Made-with: Cursor
… support

- Add RDNA4_architecture.md with gfx1201 hardware specs (Wave32, WGP,
  WMMA, GDDR6 bandwidth constraints)
- Add hip_rdna_cheatsheet.md as standalone RDNA HIP best practices
- Wire RDNA4 into default_cheatsheet.yaml with knowledge_override
- Set AMDGPU_TARGETS and GPU_TARGETS in preprocessing for CMake builds
- Support knowledge_override in prompt_builder for per-arch cheatsheets
- Remove hardcoded MI300X prompt from points_in_boxes task config

Made-with: Cursor
- Add scaled_fp4_quant to sgl_kernel mock in fused_append_shared_experts
  harness to fix ImportError from upstream sglang changes
- Add per-config error threshold for mla_decode correctness check to
  handle known numerical divergence at nhead=128, dq=2, ctx<=21
- Replace old vLLM/ROCmBench README with standard 18-kernel task list

Made-with: Cursor
simple_prompt_builder() bypasses src/prompt_builder.py, so the
architecture cheatsheets + arch-precheck directive wired up in 14d240f
were never reaching the agent. Load them in launch_agent.py after the
simple prompt is built, mirroring src/prompt_builder.py's section
layout (precheck + architecture context + language-specific knowledge).

Made-with: Cursor
…rocessing support

- Add RDNA4_architecture.md with gfx1201 hardware specs (Wave32, WGP,
  WMMA, GDDR6 bandwidth constraints)
- Add hip_rdna_cheatsheet.md as standalone RDNA HIP best practices
  (needed for hip2hip and torch2hip tasks on this branch)
- Add triton_rdna_cheatsheet.md as standalone RDNA Triton best practices
  (Wave32 implications, WMMA vs MFMA, gfx1201 tl.dot dtype support)
- Wire RDNA4 into default_cheatsheet.yaml with per-language
  knowledge_override (hip + triton)
- Set AMDGPU_TARGETS and GPU_TARGETS in preprocessing for CMake builds
- Support knowledge_override in prompt_builder for per-arch cheatsheets
- Remove hardcoded MI300X prompt from points_in_boxes task config

Mirrors AgentKernelArena HIP-branch commit 14d240f, with an added
triton-specific cheatsheet and knowledge_override wiring.

Made-with: Cursor
Enable RDNA4 (gfx1201) HIP support: arch cheatsheets, prompt builder, preprocessing
Adds a uniform Task Contract section to every hip2hip task prompt with
the four constraints raised in the PR #35 review:
  1. Preserve kernel function names and signatures (named per-task from
     `target_kernel_functions` so the agent sees exactly which symbols
     are part of the contract).
  2. Keep the launch / configuration interface compatible (grid/block
     dims, stream usage, host-side launch helpers / Python bindings).
  3. Output must remain directly compilable and runnable with the
     existing `compile_command` / `correctness_command` /
     `performance_command` — no edits to the test runner.
  4. Handle shared-memory launch sizing correctly when `__shared__` /
     dynamic LDS allocations are introduced.

Hosts the contract in `src/prompts/task_type.py::hip2hip_task_contract`
and injects it from `src/prompt_builder.py` (regular path) and
`agents/geak_v3/launch_agent.py` (GEAK-v3 simple_prompt_builder path,
mirroring the cheatsheet injection added in PR #35) so the contract
reaches every agent uniformly.

Why framework-level (not per-task `prompt.instructions`):
  - Applies uniformly to all 36 current hip2hip configs (24 gpumode/*
    with populated `instructions` + 12 others/* with `instructions:
    null`) without per-task duplication or drift.
  - Single source of truth: future edits update one function, not 36
    YAMLs.
  - Architecture-neutral by construction (no MI300X-specific hardware
    text, which was the original problem in
    `tasks/hip2hip/others/points_in_boxes/config.yaml` that PR #35
    cleaned up).

Scope:
  - Only `task_type == 'hip2hip'`. `triton2triton`, `torch2hip`,
    `cuda2hip`, `repository`, `instruction2triton` unchanged.
  - No task YAML edits.
  - No new dependencies.

Verification:
  - All 36 hip2hip configs render with the contract present and the
    correct kernel name(s) listed.
  - 5/5 sampled triton2triton configs do NOT receive the contract
    (correctly scoped).
  - Both injection paths exercised (regular `prompt_builder` and
    GEAK-v3 `launch_agent`).

Follow-up to PR #35 review thread (irvineoy's request).

Co-authored-by: Cursor <cursoragent@cursor.com>
… path too

setup_rocm_env() previously set all three of PYTORCH_ROCM_ARCH,
AMDGPU_TARGETS, and GPU_TARGETS only on the fallback path. On the
common rocminfo-success path it set just PYTORCH_ROCM_ARCH and
returned, leaving the two CMake env vars unset. Reproduced on MI300:
rocminfo detected gfx942 but AMDGPU_TARGETS / GPU_TARGETS remained
None, so CMake-based HIP builds did not pick up the selected arch.

Refactor the function to resolve gfx_arch first (rocminfo -> cheatsheet
fallback) and converge on a single export block driven by a module-
level tuple. Both paths now export all three vars uniformly, and any
future CMake env var can be added to the tuple in one place.

Addresses irvineoy's review on #37.

Co-authored-by: Cursor <cursoragent@cursor.com>
feat(rdna): add RDNA4 Triton + HIP cheatsheets, arch config, and preprocessing support
…ints

feat(hip2hip): inject generic task contract at framework level
Conflict resolutions:
- .gitignore: union of both ignore lists
- main.py: integrated main's torch2hip baseline-skip branch into our
  aiter checkout + GPU pinning (HIP_VISIBLE_DEVICES) wrapper
- src/evaluator.py: combined main's configurable compile/correctness
  timeouts with our docker_container kwarg
- src/performance.py: same pattern — perf_timeout + --baseline_only +
  docker_container kwarg merged into the run_command call
- src/evaluator_utils.py: in run_command, docker_container path skips
  normalize_python_command (host's python isn't in the container);
  non-docker path keeps main's interpreter normalization. checkout_aiter
  function preserved at end of file
- src/preprocessing.py: took main's enhanced _ensure_repo_cloned
  (tuple return, shallow clone, non-git target cleanup) and the new
  post_clone_install helpers. Our RDNA4 / AMDGPU_TARGETS additions
  were outside the conflict markers and remained intact
- src/prompt_builder.py: took main's repository_language resolution
  (task_type == 'repository' branch), kept our knowledge_override
  lookup for RDNA4 arch entries
- tasks/repository/rocprim/*/config.yaml (4 files): took main's superset
  (task_type, repository_language, post_clone_install)

Verified end-to-end with refk_identity smoke test on MI300 GPU 0
(1 GEAK round): compile/correctness/perf eval all pass, final report
written.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
sdubagun-amd and others added 4 commits May 24, 2026 13:17
Replace the proliferation of ad-hoc per-slot/per-batch/per-kernel configs
with a single canonical config covering all 18 benchmark kernels, plus a
tiny smoke config for sanity checks. The two-slot 4+4 GPU parallel run
convention is achieved by passing the all18 config to
scripts/run_geak_triton.sh, which already splits tasks odd/even across
GPUs 0-3 and 4-7. Because the config groups tasks by level
(6 L1, 4 L2, 8 L3), the split is balanced 3/2/4 per slot.

- Add config_geak_triton_all18.yaml and config_geak_triton_smoke.yaml
- Delete 13 stale slot/batch/rerun/partial configs
- Update .gitignore whitelist to track the two surviving configs
- Update README and agents/geak_v3_triton/README to document the
  single-config + run_geak_triton.sh workflow
- Fix stale defaults/help text in scripts/run_geak_triton.sh

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Make kernel.py and test_kernel_harness.py fully self-contained so
GEAK's per-task workspace can see (and optimize) all relevant code.
Inlines _fused_moe_kernel_mxfp4, helpers (get_num_xcds,
torch_to_triton_dtype, get_scaled_dot_format_string, remap_xcd,
pid_grid, _write_zeros_to_output), the gfx950 MOE config table, arch
detection, mxfp4 quant/dequant helpers, and the einsum reference.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Make kernel.py and test_kernel_harness.py fully self-contained so
GEAK's per-task workspace can see (and optimize) all kernels and
helpers. Inlines the fused rms/mxfp4-quant/moe-sort triton kernels,
activation helpers, and reference paths previously imported from
aiter.ops.triton.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Reorganize the inlined fused QK RoPE + KV-cache kernel and rotation
helpers so the workspace has zero aiter dependency, with explicit
source attribution for each inlined block. Harness updated to match.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@sharareh-y sharareh-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution, overall code looks good, here I things I'd like addressed:

  1. fused_append_shared_experts is wired so the agent edits kernel.py but the harness only loads fused_moe_triton_kernels.py — see file-level comment. Speedup for this kernel will always be ~1.0×.
  2. User-specific hardcoded paths are present in code: sapmajum username and /home/sapmajum/... GEAK_SRC fallback in scripts/run_geak_triton.sh, /workspace/src in agents/mini_swe_triton/launch_agent.py, /sgl-workspace/aiter default in src/evaluator_utils.py.
  3. In src/evaluator.py GEAK results overwrites report if AKA fails. We don't want this behavior, final evaluation for all agents should come from AKA runs.
  4. agents/geak_v3_triton/README.md and the agent_config.yaml disagree on defaults (MAX_ROUNDS 5 vs 3, MODEL_ENSEMBLE listed but not set) and the "All 18 kernels" table doesn't match config_geak_triton_all18.yaml (ff_backward listed at L1, mystery rope row, etc.).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking. This harness is hardcoded to load fused_moe_triton_kernels.py, but the task config (source_file_path: kernel.py) and both launchers (geak_v3_triton, mini_swe_triton) only edit kernel.py. After the agent runs, the two files diverge and the harness keeps importing the unmodified duplicate — so --correctness always passes and --full-benchmark always reports the baseline. Effective speedup will be 1.0× regardless of the agent.

Two options:
a) Drop fused_moe_triton_kernels.py (it's a byte-identical 977-line copy of kernel.py) and change _KERNEL_FILENAME to kernel.py. [Prefered]
b) Keep the sglang-style filename and update source_file_path + the launchers to operate on fused_moe_triton_kernels.py.

Either is fine, but the current setup is broken end-to-end.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking. Please remove the user-specific fallbacks:
CONTAINER="geak-agent-${USER:-sapmajum}"
export GEAK_SRC="${GEAK_SRC:-/home/sapmajum/GEAK-agent-filtering-and-cli-unification/src}"

Both leak one developer's environment into shared tooling. Either hard-fail when GEAK_SRC / $USER isn't set, or document the required env vars at the top and use : "${GEAK_SRC:?GEAK_SRC must be set}".

Also: the .tmp_config_stream_{a,b}.yaml temp files are written into the repo root and only cleaned up on the happy path. With set -euo pipefail and the || FAIL=1 swallow on wait, a crash leaves dirty files in the worktree. Please wrap cleanup in a trap.

Comment thread src/evaluator.py
Comment on lines +405 to +426

# Step 3b: If performance measurement failed, read GEAK's final_report.json
if results['best_optimized_execution_time'] == 0.0:
geak_results = _read_geak_final_report(workspace, log)
if geak_results:
results['best_optimized_execution_time'] = geak_results['candidate_ms']
results['average_speedup'] = geak_results['verified_speedup']
results['valid_optimized_cases'] = 1
results['valid_baseline_cases'] = 1
results['geak_baseline_ms'] = geak_results['baseline_ms']
results['geak_benchmark_speedup'] = geak_results.get('benchmark_speedup')
results['geak_best_task'] = geak_results.get('best_task')
results['geak_best_round'] = geak_results.get('best_round')
results['geak_round_history'] = geak_results.get('round_history', [])
log.info(
f"Using GEAK verified results: {geak_results['verified_speedup']:.4f}x "
f"(baseline={geak_results['baseline_ms']:.4f}ms, "
f"candidate={geak_results['candidate_ms']:.4f}ms, "
f"benchmark={geak_results.get('benchmark_speedup', 'N/A')}x, "
f"task={geak_results.get('best_task', 'N/A')})"
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking.
Please remove this part, we want final evaluation to come from AgentKernelArena.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This README is out of sync with the code it's documenting:

  • Defaults table says GEAK_MAX_ROUNDS=5 and GEAK_MODEL_ENSEMBLE=gpt-5.2,claude-opus-4.6, but agent_config.yaml ships MAX_ROUNDS=3 and has no MODEL_ENSEMBLE set.
  • "All 18 Triton Kernels" table lists ff_backward at L1, but config_geak_triton_all18.yaml lists it at L2.
  • Same table has a rope kernel at L2 (# 11) — there's no rope task in config_geak_triton_all18.yaml. Did you mean fused_qk_rope_cache_mla / fused_qkv_rope?
  • L1 row count is 7 in the README, 6 in the config.

sdubagun-amd and others added 3 commits May 27, 2026 05:58
…-common-benchmark

# Conflicts:
#	.gitignore
#	src/preprocessing.py
#	src/prompts/cheatsheet/default_cheatsheet.yaml
#	tasks/repository/rocprim/block_radix_rank/config.yaml
#	tasks/repository/rocprim/device_binary_search/config.yaml
#	tasks/repository/rocprim/device_merge_sort/config.yaml
#	tasks/repository/rocprim/device_search_n/config.yaml
Introduce num_parallel knob so GEAK's gwiab-scheduler can run >1
subagent per GPU. Resolution priority:
GEAK_NUM_PARALLEL env > eval config num_parallel > agent config
num_parallel > len(gpu_ids) (historical default).

Wired into both geak_v3 (regex-overrides --num-parallel/--gpu-ids in
run.configs) and geak_v3_triton (helper call replaces hardcoded
len(gpu_ids)).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
agents/geak_v3/geak.yaml was missing the run.budgets.{quick,full}
block that GEAK now requires. Aligning with upstream GEAK's
src/minisweagent/config/geak.yaml — same quick/full budget shape
used by geak_with_modes.yaml — so HIP runs launch without
"No run.budgets.full block in config".

config_smoke_hip.yaml is a one-task hip2hip smoke config used to
validate the GEAK gwiab-scheduler single-GPU + multi-subagent path
on the HIP code path. Allowlisted in .gitignore alongside the
existing triton smoke/all18 configs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants