A 10-task scientific-compute benchmark for Apple Silicon Metal kernels, paired with a lightweight evolutionary harness for LLM-driven kernel search.
Each task ships a Metal seed kernel, a CPU reference, a roofline-anchored
fitness function over three in-distribution problem sizes, and one
held-out size the agent never sees during search. The held-out gate
catches candidates that overfit the search sizes: silent correctness
violations and silent performance regressions.
This repo accompanies the paper Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon.
Top: the six optimization regimes (R1–R6), each stressing a structurally
distinct GPU/memory bottleneck whose canonical recipe does not transfer
to its neighbours. Bottom: the harness loop. A frozen LLM proposes
candidate kernels; the harness compiles, times, and scores them,
promoting the incumbent only on strict improvement.
- **Harness** (`metal_kernels/harness.py`): runtime-compiles `.metal` source via `MTLDevice.newLibraryWithSource` (no offline `xcrun metal` toolchain), dispatches with `MTLCommandBuffer` GPU timestamps (3 warmup, 10 timed, median reported), and reads results back through unified-memory `MTLBuffer.contents()`. Compile errors are returned as structured strings to the LLM.
- **Hardware** (`metal_kernels/hardware.py`): detects the chip (M1/M2/M3/M4 family) from `sysctl` and looks up peak FP32 GFLOPS + DRAM bandwidth for the per-size roofline ceiling.
- **Task abstraction** (`metal_kernels/task.py`): each task owns input generation, dispatch, CPU reference, tolerance, in-distribution sizes $\Sigma_\mathcal{T}$, held-out size $\sigma^\star_\mathcal{T}$, and a per-size roofline. The in-distribution score $S_\mathcal{T}$ is the geometric mean of `achieved / ceiling` across $\Sigma_\mathcal{T}$, hard-gated on correctness (any tolerance failure forces score $=0$).
- **Tasks** (six optimization regimes, R1–R6, plus a smoke test):

  | Regime | Task | Optimization lever | In-dist sizes | Held-out |
  |---|---|---|---|---|
  | R1 stencil | `heat2d` | halo, temporal blocking | $\lbrace 256,512,1024\rbrace^2$ | $768^2$ |
  | R1 stencil | `wave3d` | 2.5D blocking, register pressure | $\lbrace 64,160,192\rbrace^3$ | $128^3$ |
  | R2 compute | `nbody` | register tiling, threadgroup cooperative load | $N\in\lbrace 256,1024,2048\rbrace$ | $512$ |
  | R2 compute | `hmc` | per-thread state vs. register file | $(d,K)\in\lbrace (8,16K),(16,4K),(32,1K)\rbrace$ | $(24,2K)$ |
  | R3 multi-field | `lbm` | SoA layout, BGK algebraic fold | $\lbrace 64,128,256\rbrace^2$ | $192^2$ |
  | R3 multi-field | `ising` | checkerboard MC, byte-exact verify | $\lbrace 256,1024,2048\rbrace^2$ | $1536^2$ |
  | R4 atomics | `lj` | cell-list scatter, atomic contention | $N\in\lbrace 1.7,4.1,10.6\rbrace\mathrm{K}$ | $2744$ |
  | R5 multi-kernel | `gradshaf` | in-kernel reduction + var-coef stencil | $\lbrace 65,257,513\rbrace^2$ | $129^2$ |
  | R6 butterfly | `fft3d` | TG bank conflicts, mixed-radix, `simd_shuffle` | $\lbrace 32,64,128\rbrace^3$ | $256^3$ |
  | (smoke) | `saxpy` | DRAM saturation | $\lbrace 1,16,64\rbrace\mathrm{M}$ | $4\mathrm{M}$ |

- **LLM bridge** (`metal_kernels/llm.py`): a single `call_llm` entry point that dispatches to Claude (via `claude_agent_sdk`), Gemini (via `google-genai`), or OpenAI (via the `openai` SDK, including reasoning models like `gpt-5.5`).
- **Evolution loop** (`metal_kernels/evolve.py`): seed → iterate; each iteration sees the previous candidate, the incumbent best, and a short history. Strict $(1{+}1)$ promotion (replace the incumbent only on strict $S_\mathcal{T}$ improvement). Persists prompts, responses, sources, and JSON results.
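The scoring rule above can be sketched in a few lines of Python (a minimal sketch; the actual implementation in `metal_kernels/task.py` also carries per-size metadata):

```python
import math

def in_dist_score(per_size):
    """per_size: list of (achieved, ceiling, passed) tuples over Sigma_T.
    Geometric mean of achieved/ceiling, hard-gated on correctness."""
    if not all(ok for _, _, ok in per_size):
        return 0.0  # any tolerance failure forces the score to zero
    fracs = [a / c for a, c, _ in per_size]
    return math.exp(sum(math.log(f) for f in fracs) / len(fracs))

print(round(in_dist_score([(50.0, 100.0, True), (200.0, 400.0, True)]), 6))   # -> 0.5
print(in_dist_score([(50.0, 100.0, True), (200.0, 400.0, False)]))            # -> 0.0
```

The geometric mean keeps one saturated size from masking a regression at another.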
Verify that the seed kernels compile, pass correctness, and report timings:
uv run run_benchmark.py --task saxpy --evaluate-seed-only
uv run run_benchmark.py --task heat2d --evaluate-seed-only
uv run run_benchmark.py --task wave3d --evaluate-seed-only
uv run run_benchmark.py --task nbody --evaluate-seed-only
uv run run_benchmark.py --task hmc --evaluate-seed-only
uv run run_benchmark.py --task lbm --evaluate-seed-only
uv run run_benchmark.py --task ising --evaluate-seed-only
uv run run_benchmark.py --task lj --evaluate-seed-only
uv run run_benchmark.py --task gradshaf --evaluate-seed-only
uv run run_benchmark.py --task fft3d --evaluate-seed-only

Run an evolution loop with Claude, Gemini, or GPT:
# Claude via the Agent SDK (requires ANTHROPIC_API_KEY)
uv run run_benchmark.py --task hmc --model claude-opus-4-7 --iterations 10
# Gemini (requires GEMINI_API_KEY or GOOGLE_API_KEY)
uv run run_benchmark.py --task gradshaf --model gemini-3.1-pro-preview --iterations 10
# OpenAI reasoning model (requires OPENAI_API_KEY)
uv run run_benchmark.py --task fft3d --model gpt-5.5 --iterations 10

Per run, an output directory is created under results/ containing:
00_seed.metal # the unchanged seed
01_prompt.md # the user prompt sent to the LLM at iteration 1
01_response.md # raw LLM response
01_reasoning.md # extended-thinking tokens (when available)
01_candidate.metal # extracted Metal source
01_result.json # per-size correctness + timing + fraction-of-ceiling
...
best.metal # incumbent at end of run
best_result.json
history.json # per-iteration record
summary.json
The held-out evaluation (`results/_run_logs/eval_held_out*.py`) runs on the run's incumbent at the held-out size $\sigma^\star_\mathcal{T}$, after the search has finished and outside anything the agent sees.
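In spirit, the held-out check reduces to one guarded ratio per run. A hypothetical helper (names illustrative; the real logic lives in the `eval_held_out*.py` script):

```python
def held_out_cell(seed_ms, best_ms, passes_tolerance):
    """One results-table cell: incumbent speedup over the seed at the
    held-out size, or FAIL if it breaks tolerance there."""
    if not passes_tolerance:
        return "FAIL"  # correctness gate fires even for a fast kernel
    return f"{seed_ms / best_ms:.2f}x"

print(held_out_cell(10.0, 2.0, True))   # -> 5.00x
print(held_out_cell(10.0, 2.0, False))  # -> FAIL
```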
Three matched single-model sweeps over the 10 tasks at the same per-task
iteration budget (10 each, except `lbm` at 25 and `wave3d` at 15) give
the following speedups over the seed:
| Task | In-dist. × Opus | In-dist. × Gemini | In-dist. × GPT | Held-out × Opus | Held-out × Gemini | Held-out × GPT | Outcome |
|---|---|---|---|---|---|---|---|
| `saxpy` | 1.25 | 1.00 | 1.01 | 1.17 | 0.98 | 0.98 | saturated |
| `heat2d` | 1.00 | 1.03 | 1.00 | 0.86 | 1.01 | 0.82 | saturated |
| `wave3d` | 1.26 | 1.00 | 1.00 | 1.00 | 0.90 | 0.99 | saturated |
| `ising` | 1.13 | 1.00 | 1.09 | 0.94 | 0.99 | 0.88 | flat |
| `fft3d` | 1.03 | 1.19 | 2.95 | 1.12 | 1.20 | 0.23 | GPT silent regression |
| `nbody` | 2.83 | 2.00 | 2.19 | 1.24 | 1.50 | 1.37 | generalizes |
| `gradshaf` | 1.89 | 2.89 | 1.93 | 2.05 | 2.91 | 1.86 | generalizes |
| `lj` | 1.77 | 1.98 | 1.62 | 1.24 | 1.87 | 1.34 | generalizes |
| `lbm` | 1.46 | 1.06 | 1.33 | 0.97 | 1.16 | 1.01 | tied at $\sigma^\star$ |
| `hmc` | 10.6 | 10.7 | 7.19 | FAIL | 17.6 | 18.6 | Opus wrong at $\sigma^\star$ |
(Opus = claude-opus-4-7, Gemini = gemini-3.1-pro-preview,
GPT = gpt-5.5.)
The two diagnostic cells are the central evidence for the held-out gate:
- **Silent correctness violation** (`hmc`, Opus). The incumbent dispatches `if (d==8) run<8>() ... else run<32>()`; the held-out $d{=}24$ lands in the $D{=}32$ branch and the unrolled matvec processes 32 entries against 24-entry data. In-distribution looks $10.6\times$ faster; the sample covariance is $\sim 10\sigma$ off target.
- **Silent performance regression** (`fft3d`, GPT). The incumbent hand-codes `fft_line_{32,64,128}` with a 64-entry constant-memory twiddle table; for any $N\notin\lbrace 32,64,128\rbrace$ it falls into a textbook $O(N^2)$ direct DFT. At held-out $N{=}256$ this costs $\sim 32\times$ more arithmetic per output than the seed's $O(N\log N)$ Stockham FFT, flipping a reported $2.95\times$ in-dist. win into a $0.23\times$ deployment-grade slowdown.
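The `hmc` failure mode is easy to reproduce off-GPU. A NumPy sketch (hypothetical layout, standing in for the Metal buffers) of a fully unrolled $D{=}32$ matvec applied to $d{=}24$ data: the stride-32 indexing walks into the wrong rows and past the real data, yet nothing crashes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 24, 32
A = rng.standard_normal((d, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

# Device buffers as the task lays them out (row-major, stride d), with
# random values standing in for whatever sits past the end of the data.
A_buf = np.concatenate([A.ravel(), rng.standard_normal(D * D - d * d).astype(np.float32)])
q_buf = np.concatenate([q, rng.standard_normal(D - d).astype(np.float32)])

# run<32u>: unrolled matvec assuming an i*32 + j layout
f_wrong = np.array([A_buf[i * D:(i + 1) * D] @ q_buf for i in range(d)])
f_right = A @ q

print(np.allclose(f_wrong, f_right))  # -> False: silently wrong, no crash
```

Only a held-out size exercises this path, which is exactly what the gate is for.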
A rough generation-time profile at high reasoning budgets: Opus
One declaration makes the difference: with `D` a compile-time constant,
the per-thread `q`/`p`/`f` arrays are sized exactly and the inner loop
fully unrolls, versus iteration 5's runtime `d` (a `float4` loop against
a fixed `D_MAX=32` layout, plus a 4-way horizontal sum):
```metal
// iter 5: runtime d, D_MAX=32 layout
for (uint i = 0; i < d; ++i) {
    float4 acc = Arow[0] * q4[0];
    acc = fma(Arow[1], q4[1], acc);
    /* ... 6 more, always 8 ... */
    acc = fma(Arow[7], q4[7], acc);
    f[i] = acc.x + acc.y + acc.z + acc.w;
}
// d=8: 121 GFLOPS (2.7% of peak)

// iter 6: template<uint D> with runtime dispatch on d
template <uint D>
inline void run(...) {
    float q[D], p[D], f[D];
    #pragma unroll
    for (uint i = 0; i < D; ++i) {
        float acc = 0.0f;
        #pragma unroll
        for (uint j = 0; j < D; ++j) acc = fma(A[i*D + j], q[j], acc);
        f[i] = acc;
    }
}

if (d == 8u)       run<8u> (...);
else if (d == 16u) run<16u>(...);
else               run<32u>(...); // ← held-out d=24 lands here, silently
// d=8: 970 GFLOPS (22% of peak)
```

Both Opus and Gemini independently arrive at this lever. Opus enumerates
only the in-distribution $d\in\lbrace 8,16,32\rbrace$, so the held-out
$d{=}24$ silently falls into `run<32u>`.
Source: 10_candidate.metal (= best.metal)
The iter-10 best wins the in-distribution gmean at $2.95\times$, but every fast path is size-specialized:
```metal
// dispatch in fft3d_x kernel
threadgroup float2 buf0[128];
threadgroup float2 buf1[128];
if (N == 32u)       fft_line_32_io(...);
else if (N == 64u)  fft_line_64_io(..., buf0);
else if (N == 128u) fft_line_128_io(..., buf0, buf1);
else                fft_line_direct_fallback_io(...); // ← held-out N=256

// fft_line_direct_fallback_io: O(N²) direct DFT
float2 acc = float2(0.0f, 0.0f);
float theta = -TWO_PI * float(tid) / float(N);
float2 wstep = float2(cos(theta), sin(theta));
float2 w = float2(1.0f, 0.0f);
for (uint n = 0u; n < N; ++n) {
    float2 v = in_data[in_base + n * in_stride];
    acc += cmul(v, w);
    w = cmul(w, wstep);
}
out_data[out_base + tid * out_stride] = acc;
```

The fast paths reuse a 64-entry `constant float2 W128[]` twiddle table
whose stride indexing only covers $N \le 128$.
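At the held-out $N{=}256$, the extra arithmetic of the direct-DFT fallback relative to an $O(N\log N)$ FFT is a one-line back-of-envelope count per output element:

```python
import math

N = 256
direct_dft = N                # O(N^2) total: N complex MACs per output element
stockham = math.log2(N)       # O(N log N) total: log2(N) butterfly stages per point
print(direct_dft / stockham)  # -> 32.0, matching the ~32x figure above
```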
Sources: Opus iter-23 best, Gemini iter-13 best
The Opus–Gemini split correlates with the type of optimization lever:
tune the same algorithm tighter (Opus) vs. find a different algorithm
(Gemini). On `lbm`, Opus algebraically folds the BGK collision and pins
`[[max_total_threads_per_threadgroup(64)]]` to align the threadgroup
to the simdgroup width:
```metal
// Opus iter-23 best: BGK fold + pinned geometry
[[max_total_threads_per_threadgroup(64)]]
kernel void lbm_step(...) {
    // pull-stream f0..f8, moments rho, ux, uy ...
    float A = fma(-1.5f, usq, 1.0f);
    float orWD = (omega * W_D) * rho;
    // k=5: cu = ux + uy
    {
        float cu = ux + uy;
        float t = A + cu * fma(4.5f, cu, 3.0f);
        q5[idx] = fma(one_m_w, f5, orWD * t);
    }
    // 8 more directions, same shape ...
}

// Gemini iter-13 best: textbook BGK in unrolled loop
kernel void lbm_step(...) {
    float usq = ux*ux + uy*uy;
    float inv_tau = 1.0f / tau;
    #pragma unroll
    for (int k = 0; k < 9; ++k) {
        float cu = CX[k]*ux + CY[k]*uy;
        float feq = W[k] * rho *
            (1.0f + 3.0f*cu + 4.5f*cu*cu - 1.5f*usq);
        f_out[k*N + idx] = f[k] - inv_tau * (f[k] - feq);
    }
}
```

In-distribution gmean: Opus 0.576 vs Gemini 0.553.
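The two BGK forms are algebraically identical, which a quick numeric check confirms (illustrative weight and relaxation values; with $A = 1 - 1.5u^2$ hoisted, the per-direction polynomial $A + cu\,(4.5cu + 3)$ expands to the textbook $1 + 3cu + 4.5cu^2 - 1.5u^2$, and $\omega = 1/\tau$ makes the updates match):

```python
import random

random.seed(1)
W5, omega = 1.0 / 36.0, 1.7  # D2Q9 diagonal weight, omega = 1/tau (illustrative)
for _ in range(1000):
    rho = random.uniform(0.9, 1.1)
    ux, uy = random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)
    f5 = random.uniform(0.0, 0.1)
    usq = ux * ux + uy * uy
    cu = ux + uy  # k = 5 direction: c = (1, 1)

    # textbook form: build feq, then relax toward it
    feq = W5 * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq)
    textbook = f5 - omega * (f5 - feq)

    # folded form: hoist A, refactor the polynomial, pre-scale omega*W*rho
    A = 1.0 - 1.5 * usq
    t = A + cu * (4.5 * cu + 3.0)
    folded = (1.0 - omega) * f5 + (omega * W5 * rho) * t

    assert abs(folded - textbook) < 1e-12
print("folded and textbook BGK agree")
```

The win is fewer multiplies and better `fma` fusion per direction, not a different update.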
Sources: Opus iter-10 best, Gemini iter-10 best
On `fft3d` the diff is two different algorithms. Apple's simdgroup width
is exactly 32, so for the first five Cooley–Tukey stages the butterfly
partner of lane $i$ sits in the same simdgroup: the exchange is a
`simd_shuffle_xor` with no shared memory and no barrier. Only stages
$s \ge 6$ need threadgroup memory:
```metal
// Opus iter-10 best: Stockham radix-4, all in tg memory
for (uint s = 0u; s < stages4; ++s) {
    float2 x0 = cur[b + 0u*Nq];
    float2 x1 = cur[b + 1u*Nq];
    float2 x2 = cur[b + 2u*Nq];
    float2 x3 = cur[b + 3u*Nq];
    if (s != 0u) { /* twiddle multiplies */ }
    // 4-point DFT, write to nxt[]
    threadgroup_barrier(mem_flags::mem_threadgroup);
    swap(cur, nxt);
}

// Gemini iter-10 best: simd_shuffle_xor for stages 1..5
// (no shared mem, no barrier)
if (logN >= 1u) {
    float2 v = simd_shuffle_xor(u, 1u);
    u = ((i & 1u) == 0u) ? (u + v) : (v - u);
}
// stages 2..4 analogous (xor 2, 4, 8)
if (logN >= 5u) {
    float2 v = simd_shuffle_xor(u, 16u);
    // twiddle, butterfly ...
    u = ((i & 16u) == 0u) ? (u + t) : (v - t);
}
// stages s>=6: fall back to tg memory
for (uint s = 6u; s <= logN; ++s) { /* ... */ }
```

In-distribution gmean: Gemini 0.282 vs Opus 0.167.
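Why five stages exactly: for a radix-2 stage with butterfly mask $2^s$, lane $i$'s partner is $i \oplus 2^s$, which stays inside a 32-lane simdgroup precisely while $2^s < 32$. A quick check:

```python
SIMD_WIDTH = 32  # Apple GPU simdgroup width

def stage_fits_in_simdgroup(s, n_lanes=256):
    """True iff every lane's butterfly partner (i XOR 2^s) shares its simdgroup."""
    return all((i ^ (1 << s)) // SIMD_WIDTH == i // SIMD_WIDTH
               for i in range(n_lanes))

print([stage_fits_in_simdgroup(s) for s in range(8)])
# -> [True, True, True, True, True, False, False, False]
```

Masks 1, 2, 4, 8, 16 stay in-lane-group (five barrier-free stages); mask 32 and up must cross simdgroups, hence the threadgroup-memory fallback.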
Source: 06_candidate.metal (= best.metal)
Where Opus enumerates only the in-distribution sizes, GPT-5.5 adds an explicit $d{=}24$ branch and a runtime-$d$ safety net:
```metal
// hmc_step dispatch (GPT-5.5 iter-3 best)
load_A_transpose_tg(A, AT, d, tid, tpg);
threadgroup_barrier(mem_flags::mem_threadgroup);
if (d == 8u)       run_hmc_fixed_chunk8<8u >(...);
else if (d == 16u) run_hmc_d16(...);               // specialized d=16
else if (d == 24u) run_hmc_fixed_chunk8<24u>(...); // ← covers held-out d
else if (d == 32u) run_hmc_fixed_chunk8<32u>(...);
else run_hmc_dynamic(..., d, ...);                 // runtime-d safety net
```

In-distribution gmean comes in lower than Opus or Gemini (0.0634 vs
0.0932, 0.0870), the cost of emitting four template instantiations
plus a runtime fallback, but the held-out $d{=}24$ hits a correct
specialized path and posts the best held-out speedup of the three
($18.6\times$).
Drop a `seeds/<name>.metal` and a `metal_kernels/tasks/<name>.py`
implementing `Task.evaluate_size`, declaring `in_distribution_sizes` and
`held_out_sizes`, and providing a CPU reference + tolerance. Decorate
with `@register_task("<name>")` and add the import to
`metal_kernels/tasks/__init__.py`. The harness plumbing (compile,
multi-size loop, scoring, correctness gate, held-out evaluation) comes
for free.
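The registration pattern looks roughly like this (a self-contained sketch of the decorator mechanism; the real `register_task` and `Task` live in `metal_kernels/task.py`, and the class body here is illustrative):

```python
TASK_REGISTRY = {}

def register_task(name):
    """Stand-in for the repo's decorator: map a task name to its class."""
    def deco(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return deco

@register_task("mytask")
class MyTask:
    in_distribution_sizes = [256, 512, 1024]
    held_out_sizes = [768]
    tolerance = 1e-5

    def evaluate_size(self, size):
        # generate inputs, dispatch seeds/mytask.metal, compare against
        # the CPU reference within self.tolerance
        raise NotImplementedError

print("mytask" in TASK_REGISTRY)  # -> True
```

Importing the module in `metal_kernels/tasks/__init__.py` is what triggers the decorator, which is why the import line is required.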