trimul: add outgoing forward submission #7

Open
chokevin wants to merge 6 commits into main from chokevin/trimul-outgoing-20260506
@chokevin
Owner

@chokevin chokevin commented May 6, 2026

What

Adds a GPUMODE TriMul outgoing forward submission at repo root, plus local correctness tests.

The implementation avoids per-call module construction, stacks the five input projections into one linear call, skips mask multiplication for the official no-mask dtype shape, and uses BF16 for the CUDA triangle multiplication path before returning FP32 output.
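As a rough illustration of the stacked-projection idea, the sketch below concatenates five projection weights once and replaces five linear calls with a single one. The names `left_proj`, `right_proj`, `left_gate`, `right_gate`, `out_gate`, the bias-free layout, and the tiny sizes are assumptions for illustration; this is not the code in `submission.py`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, C, H = 2, 8, 16, 4  # tiny illustrative sizes, not the benchmark shapes

x = torch.randn(B, N, N, C)

# Five bias-free projection weights of shape [H, C] (hypothetical names).
names = ["left_proj", "right_proj", "left_gate", "right_gate", "out_gate"]
weights = {name: torch.randn(H, C) for name in names}


def project_separately(x, weights):
    """Baseline: one linear call per projection."""
    return [F.linear(x, weights[name]) for name in names]


# Build the stacked weight once, outside the per-call path.
stacked_weight = torch.cat([weights[name] for name in names], dim=0)  # [5H, C]


def project_stacked(x, stacked_weight):
    """One linear call producing all five projections, split along the last dim."""
    y = F.linear(x, stacked_weight)  # [..., 5H]
    return list(y.chunk(5, dim=-1))


separate = project_separately(x, weights)
stacked = project_stacked(x, stacked_weight)
```

Each chunk of the stacked output should match the corresponding separate projection up to floating-point rounding, since stacking rows of the weight matrix only stacks output features.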

Why

TriMul is the next BioML kernel target after stabilizing the A100/H200 profiling loop. This gives us a correct baseline with real H200 smoke/profiling artifacts before deeper Triton/CUDA optimization.

Non-goals

  • Does not add a custom Triton/CUDA TriMul kernel yet.
  • Does not replace the official challenge harness.
  • Does not attempt gradient support.

Testing

  • uv run ruff format --check submission.py tests/test_trimul_submission.py
  • uv run ruff check submission.py tests/test_trimul_submission.py
  • uv run pytest -q tests/test_trimul_submission.py
  • H200 smoke sf-trimul-check-174907-h200: matches_reference=True, max abs error 0.007030963897705078, mean runtime 1.926944 ms for bs=2, seqlen=256, dim=128, hiddendim=128, nomask, normal distribution.
  • H200 cauchy smoke sf-trimul-cauchy-175307-h200: matches_reference=True, max abs error 0.004771828651428223.
  • H200 masked dim=384 smoke sf-trimul-mask384-175327-h200: matches_reference=True, max abs error 0.003912881016731262.
  • H200 NCU trace sf-trimul-ncu-174943-h200 captured at runs/traces/sf-trimul-ncu-174943-h200-hermes.tar.gz.
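The `matches_reference` and max-abs-error figures above come from comparing the submission against a reference output. A minimal sketch of such a check follows; the helper name `compare_to_reference`, the tolerances `rtol=atol=2e-2`, and the small scaled inputs are assumptions, not values taken from this PR or the official harness.

```python
import torch


def compare_to_reference(out, ref, rtol=2e-2, atol=2e-2):
    """Return (matches, max_abs_error) for two same-shaped tensors.

    The tolerances are illustrative defaults, not the harness settings.
    """
    out32, ref32 = out.float(), ref.float()
    max_abs_error = (out32 - ref32).abs().max().item()
    matches = torch.allclose(out32, ref32, rtol=rtol, atol=atol)
    return matches, max_abs_error


torch.manual_seed(0)
B, N, H = 2, 32, 8
# Small-magnitude inputs, standing in for gated projection outputs.
left = torch.randn(B, N, N, H) * 0.1
right = torch.randn(B, N, N, H) * 0.1

# FP32 reference for the outgoing triangle contraction.
ref = torch.einsum("bikh,bjkh->bijh", left, right)

# Same contraction with BF16 operands, cast back to FP32 on return.
out = torch.einsum(
    "bikh,bjkh->bijh", left.to(torch.bfloat16), right.to(torch.bfloat16)
).float()

matches_reference, max_abs_error = compare_to_reference(out, ref)
```

The error magnitude depends on the input scale and contraction length, so numbers from this toy setup are not comparable to the H200 results listed above.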

Full make test is currently blocked on main by missing checked-in NCU CSV fixtures under runs/airun/week1; the focused TriMul tests pass.

Risk

Low for repo behavior because this is an isolated challenge submission file and tests. Performance risk is that the current implementation is still PyTorch-op based; the NCU trace should guide the next custom-kernel pass.

Ubuntu and others added 6 commits May 6, 2026 17:54
Implement the GPUMODE TriMul outgoing forward pass with a stacked projection, CUDA BF16 triangle multiply, local correctness tests, and an optional H200 benchmark harness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an optional Triton implementation for the triangle contraction with CLI/env tile knobs so H200/A100/H100 sweeps can iterate without editing the submission file. Keep the PyTorch backend as the default because the first per-hidden-channel Triton sweep is correct but slower than einsum.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add explicit batched-GEMM and packed Triton triangle backends. The packed Triton path rearranges the contraction to [B*H, N, N] BF16 matrices so Triton sees contiguous tiles, then auto-selects it only for N<=256 where cluster measurements beat the PyTorch einsum baseline; larger shapes keep the PyTorch backend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
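The packed layout described in this commit can be illustrated in plain PyTorch: moving the hidden channel next to the batch dimension turns the contraction into a batch of independent N x N matrix products. This is a sketch of the rearrangement only, using `torch.bmm` in FP32 in place of the Triton kernel; the function names are hypothetical.

```python
import torch

torch.manual_seed(0)
B, N, H = 2, 16, 4  # tiny illustrative sizes

left = torch.randn(B, N, N, H)   # indexed [b, i, k, h]
right = torch.randn(B, N, N, H)  # indexed [b, j, k, h]


def triangle_einsum(left, right):
    """Outgoing triangle contraction: out[b,i,j,h] = sum_k left[b,i,k,h] * right[b,j,k,h]."""
    return torch.einsum("bikh,bjkh->bijh", left, right)


def triangle_packed(left, right):
    """Same contraction via [B*H, N, N] matrices and a batched GEMM."""
    b, n, _, h = left.shape
    # [B, N, N, H] -> [B, H, N, N] -> [B*H, N, N]; reshape materializes a contiguous copy.
    left_p = left.permute(0, 3, 1, 2).reshape(b * h, n, n)
    right_p = right.permute(0, 3, 1, 2).reshape(b * h, n, n)
    # For each (b, h): out = left @ right^T, which sums over the shared k index.
    out_p = torch.bmm(left_p, right_p.transpose(1, 2))
    # [B*H, N, N] -> [B, H, N, N] -> [B, N, N, H]
    return out_p.reshape(b, h, n, n).permute(0, 2, 3, 1)


out_einsum = triangle_einsum(left, right)
out_packed = triangle_packed(left, right)
```

The packing costs two extra copies, so whether it pays off depends on the shape, which is consistent with the commit gating it on N<=256.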
Add an optional op-level CUDA timing mode to identify TriMul phase bottlenecks from the same Rune harness. Add cached BF16 projection/output weight paths and promote the safe auto policy: BF16 stacked projection for N<=256 C=128, full BF16 linears for N<=256 C=384, and torch linears for larger shapes where full BF16 exceeded tolerance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
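The auto policy stated in the commits so far can be written as a small lookup. The sketch below encodes only the thresholds named in the commit messages; the function name, the backend labels, and the fallback for N<=256 with a channel count other than 128 or 384 are assumptions.

```python
def select_backends(n: int, c: int) -> dict:
    """Pick hypothetical backend labels from sequence length n and channel dim c."""
    if n <= 256:
        # Packed Triton triangle path is auto-selected only for N<=256.
        triangle = "packed_triton"
        if c == 128:
            linear = "bf16_stacked_projection"
        elif c == 384:
            linear = "bf16_full_linears"
        else:
            # Not covered by the commit message; conservative fallback (assumption).
            linear = "torch_linears"
    else:
        # Larger shapes keep the PyTorch paths, where full BF16 exceeded tolerance.
        triangle = "pytorch_einsum"
        linear = "torch_linears"
    return {"triangle": triangle, "linear": linear}


small_c128 = select_backends(256, 128)
small_c384 = select_backends(256, 384)
large = select_backends(512, 128)
```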
Add split timing for the packed triangle path, keep the regressing gate-pack Triton prototype opt-in, and promote the measured Triton tail norm/gate fusion for N<=256.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
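For orientation, the tail that this commit fuses can be written in PyTorch as a layer norm over the hidden dimension followed by an elementwise gate. The sketch below compares the two-op form against a hand-written single expression of the kind a fused kernel would compute per row. It assumes the tail is `LayerNorm(out) * out_gate` with an affine norm and `eps=1e-5`; it is not the Triton kernel from the PR.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, H = 2, 8, 16  # tiny illustrative sizes
eps = 1e-5          # assumed epsilon (the torch.nn.LayerNorm default)

tri = torch.randn(B, N, N, H)                   # triangle contraction output
gate = torch.sigmoid(torch.randn(B, N, N, H))   # output gate in (0, 1)
norm_weight = torch.randn(H)
norm_bias = torch.randn(H)


def tail_unfused(tri, gate):
    """Two separate ops: layer norm, then gate multiply."""
    return F.layer_norm(tri, (H,), norm_weight, norm_bias, eps) * gate


def tail_manual(tri, gate):
    """The same math as one expression over each length-H row."""
    mean = tri.mean(dim=-1, keepdim=True)
    var = tri.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as in layer norm
    normed = (tri - mean) * torch.rsqrt(var + eps)
    return (normed * norm_weight + norm_bias) * gate


y_unfused = tail_unfused(tri, gate)
y_manual = tail_manual(tri, gate)
```

A fused kernel would read each row once and write the gated result directly, avoiding the intermediate normalized tensor that the two-op form materializes.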
Record the fused-tail TriMul results, the gate-pack regression, and the next fresh-eyes target in the repo benchmark docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>