trimul: add outgoing forward submission by chokevin · Pull Request #7 · chokevin/swordfish

chokevin · 2026-05-06T17:54:48Z

What

Adds a GPUMODE TriMul outgoing forward submission at repo root, plus local correctness tests.

The implementation avoids per-call module construction, stacks the five input projections into one linear call, skips mask multiplication for the official no-mask dtype shape, and uses BF16 for the CUDA triangle multiplication path before returning FP32 output.
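The stacked projection can be illustrated without any GPU code. The sketch below is a minimal pure-Python illustration, not the code in submission.py: the projection names (`left_proj`, `right_proj`, `left_gate`, `right_gate`, `out_gate`), the tiny sizes, and the `matvec` helper are all assumptions chosen for the example. It shows that concatenating the five weight matrices once and running a single projection yields the same values as five separate projections.

```python
import random


def matvec(weight, x):
    """Multiply a [rows, cols] weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


random.seed(0)
dim, hidden = 4, 3  # toy sizes, far smaller than the real dim=128 / hiddendim=128

# Hypothetical names for the five input projections, each mapping dim -> hidden.
names = ["left_proj", "right_proj", "left_gate", "right_gate", "out_gate"]
weights = {
    name: [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(hidden)]
    for name in names
}
x = [random.uniform(-1.0, 1.0) for _ in range(dim)]

# Baseline: five separate projection calls.
separate = {name: matvec(weights[name], x) for name in names}

# Stacked: concatenate the weight rows once (this can be cached across calls),
# run a single projection, then slice the result back into five chunks.
stacked_weight = [row for name in names for row in weights[name]]
stacked_out = matvec(stacked_weight, x)
split = {
    name: stacked_out[i * hidden:(i + 1) * hidden]
    for i, name in enumerate(names)
}

print(split == separate)
```

In a tensor framework the same idea would replace five small GEMM launches with one larger one, which is presumably where the benefit comes from.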

Why

TriMul is the next BioML kernel target after stabilizing the A100/H200 profiling loop. This gives us a correct baseline with real H200 smoke/profiling artifacts before deeper Triton/CUDA optimization.

Non-goals

Does not add a custom Triton/CUDA TriMul kernel yet.
Does not replace the official challenge harness.
Does not attempt gradient support.

Testing

uv run ruff format --check submission.py tests/test_trimul_submission.py
uv run ruff check submission.py tests/test_trimul_submission.py
uv run pytest -q tests/test_trimul_submission.py
H200 smoke sf-trimul-check-174907-h200: matches_reference=True, max abs error 0.007030963897705078, mean 1.926944 ms for bs=2, seqlen=256, dim=128, hiddendim=128, nomask normal.
H200 cauchy smoke sf-trimul-cauchy-175307-h200: matches_reference=True, max abs error 0.004771828651428223.
H200 masked dim=384 smoke sf-trimul-mask384-175327-h200: matches_reference=True, max abs error 0.003912881016731262.
H200 NCU trace sf-trimul-ncu-174943-h200 captured at runs/traces/sf-trimul-ncu-174943-h200-hermes.tar.gz.
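For readers unfamiliar with the `matches_reference` and max-abs-error figures in the smoke results above, the following is a minimal sketch of how such a check is commonly computed. The function names and the `atol`/`rtol` defaults are assumptions for illustration; the tolerances actually used by the harness are not stated in this PR.

```python
def max_abs_error(actual, expected):
    """Largest elementwise absolute difference between two flat sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def matches_reference(actual, expected, atol=2e-2, rtol=2e-2):
    """Elementwise closeness test; atol/rtol defaults are hypothetical."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Toy values standing in for flattened FP32 outputs.
reference = [1.0, -2.0, 0.5]
candidate = [1.005, -2.0, 0.497]

err = max_abs_error(candidate, reference)
ok = matches_reference(candidate, reference)
print(err, ok)
```

A BF16 compute path is expected to show small nonzero errors like the ones reported above, so the pass/fail decision rests on the tolerance rather than on exact equality.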

Full make test is currently blocked on main by missing checked-in NCU CSV fixtures under runs/airun/week1; the focused TriMul tests pass.

Risk

Low for repo behavior, because this PR adds only an isolated challenge submission file and its tests. The performance risk is that the current implementation is still built from PyTorch ops; the NCU trace should guide the next custom-kernel pass.

Implement the GPUMODE TriMul outgoing forward pass with a stacked projection, CUDA BF16 triangle multiply, local correctness tests, and an optional H200 benchmark harness. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add an optional Triton implementation for the triangle contraction with CLI/env tile knobs so H200/A100/H100 sweeps can iterate without editing the submission file. Keep the PyTorch backend as the default because the first per-hidden-channel Triton sweep is correct but slower than einsum. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
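One way to expose tile knobs through both the CLI and the environment is sketched below. This is an assumption about the general pattern only: the variable names (`TRIMUL_BACKEND`, `TRIMUL_BLOCK_M`, `TRIMUL_BLOCK_N`, `TRIMUL_NUM_WARPS`), the flag names, and the defaults are hypothetical and are not taken from the submission.

```python
import argparse
import os


def tile_config(argv=None, env=None):
    """Resolve tile knobs with precedence: CLI flag > environment > default.

    All knob names and default values here are hypothetical.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="TriMul sweep knobs (sketch)")
    # Environment values become the argparse defaults, so a CLI flag wins.
    parser.add_argument("--backend", default=env.get("TRIMUL_BACKEND", "torch"))
    parser.add_argument(
        "--block-m", type=int, default=int(env.get("TRIMUL_BLOCK_M", "64"))
    )
    parser.add_argument(
        "--block-n", type=int, default=int(env.get("TRIMUL_BLOCK_N", "64"))
    )
    parser.add_argument(
        "--num-warps", type=int, default=int(env.get("TRIMUL_NUM_WARPS", "4"))
    )
    args = parser.parse_args(argv)
    return {
        "backend": args.backend,
        "block_m": args.block_m,
        "block_n": args.block_n,
        "num_warps": args.num_warps,
    }


# Example: the environment sets a tile size, the CLI overrides it.
config = tile_config(["--block-m", "32"], env={"TRIMUL_BLOCK_M": "128"})
print(config)
```

Keeping the PyTorch backend as the default value means a sweep script has to opt in to Triton explicitly, which matches the default described above.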

Add explicit batched-GEMM and packed Triton triangle backends. The packed Triton path rearranges the contraction to [B*H, N, N] BF16 matrices so Triton sees contiguous tiles, then auto-selects it only for N<=256 where cluster measurements beat the PyTorch einsum baseline; larger shapes keep the PyTorch backend. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
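The packed layout can be written out as follows. This is a sketch of the standard outgoing triangle contraction, with symbols chosen here for illustration: x and y denote the left and right gated projections, s the batch index, and h the hidden channel.

```latex
% Outgoing triangle contraction over the shared index k
o_{s,i,j,h} = \sum_{k=1}^{N} x_{s,i,k,h}\, y_{s,j,k,h},
\qquad s \in \{1,\dots,B\},\; h \in \{1,\dots,H\}

% Packed form: one N x N matrix per (s, h) pair
X^{(s,h)}_{ik} = x_{s,i,k,h}, \quad
Y^{(s,h)}_{jk} = y_{s,j,k,h}, \quad
O^{(s,h)} = X^{(s,h)} \bigl(Y^{(s,h)}\bigr)^{\top}
```

Read this way, the contraction is B*H independent N-by-N matrix products, which is why moving the channel axis next to the batch axis gives each product a contiguous tile.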

Add an optional op-level CUDA timing mode to identify TriMul phase bottlenecks from the same Rune harness. Add cached BF16 projection/output weight paths and promote the safe auto policy: BF16 stacked projection for N<=256 C=128, full BF16 linears for N<=256 C=384, and torch linears for larger shapes where full BF16 exceeded tolerance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
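The shape-based auto policy described in these commit messages can be summarized as a small dispatch function. The sketch below is an assumption about its structure: the function names and returned labels are hypothetical, C is taken to mean the input channel width `dim`, and shapes the text does not mention fall back to the plain torch path.

```python
def select_projection_path(seq_len, dim):
    """Pick the projection path from the shape (labels are hypothetical)."""
    if seq_len <= 256 and dim == 128:
        return "bf16_stacked_projection"
    if seq_len <= 256 and dim == 384:
        return "bf16_full_linears"
    # Larger shapes keep torch linears because full BF16 exceeded tolerance.
    # Falling back here for unlisted channel widths is an assumption.
    return "torch_linears"


def select_triangle_backend(seq_len):
    """Packed Triton only where it was measured faster than einsum."""
    return "packed_triton" if seq_len <= 256 else "torch_einsum"


for n, c in [(256, 128), (256, 384), (512, 128)]:
    print(n, c, select_projection_path(n, c), select_triangle_backend(n))
```

Keeping the policy in one pure function makes the thresholds easy to revisit when new cluster measurements arrive.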

Add split timing for the packed triangle path, keep the regressing gate-pack Triton prototype opt-in, and promote the measured Triton tail norm/gate fusion for N<=256. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Record the fused-tail TriMul results, the gate-pack regression, and the next fresh-eyes target in the repo benchmark docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Ubuntu and others added 6 commits May 6, 2026 17:54

trimul: add outgoing forward submission

fe5569b

trimul: fuse tail norm gate

26c543b

docs: add trimul tuning handoff

cd607f5

chokevin wants to merge 6 commits into main from chokevin/trimul-outgoing-20260506
