trimul: add outgoing forward submission #7
Open
chokevin wants to merge 6 commits into
Implement the GPUMODE TriMul outgoing forward pass with a stacked projection, CUDA BF16 triangle multiply, local correctness tests, and an optional H200 benchmark harness. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an optional Triton implementation for the triangle contraction with CLI/env tile knobs so H200/A100/H100 sweeps can iterate without editing the submission file. Keep the PyTorch backend as the default because the first per-hidden-channel Triton sweep is correct but slower than einsum. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add explicit batched-GEMM and packed Triton triangle backends. The packed Triton path rearranges the contraction to [B*H, N, N] BF16 matrices so Triton sees contiguous tiles, then auto-selects it only for N<=256 where cluster measurements beat the PyTorch einsum baseline; larger shapes keep the PyTorch backend. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an optional op-level CUDA timing mode to identify TriMul phase bottlenecks from the same Rune harness. Add cached BF16 projection/output weight paths and promote the safe auto policy: BF16 stacked projection for N<=256 C=128, full BF16 linears for N<=256 C=384, and torch linears for larger shapes where full BF16 exceeded tolerance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add split timing for the packed triangle path, keep the regressing gate-pack Triton prototype opt-in, and promote the measured Triton tail norm/gate fusion for N<=256. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Record the fused-tail TriMul results, the gate-pack regression, and the next fresh-eyes target in the repo benchmark docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
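The packed layout mentioned in the commits above can be illustrated with a small pure-Python sketch. It assumes the usual outgoing contraction, out[b][i][j][h] = sum over k of left[b][i][k][h] * right[b][j][k][h], and shows that packing each (batch, channel) pair into its own N x N matrix turns the contraction into B*H independent `L @ R^T` products. The function names are hypothetical and nested lists stand in for tensors; this is not the submission's Triton code.

```python
import random


def random_tensor(B, N, H, seed):
    """Build a [B][N][N][H] nested list of small integers (exact arithmetic)."""
    rng = random.Random(seed)
    return [[[[rng.randint(-3, 3) for _ in range(H)]
              for _ in range(N)]
             for _ in range(N)]
            for _ in range(B)]


def trimul_outgoing_naive(left, right):
    """out[b][i][j][h] = sum_k left[b][i][k][h] * right[b][j][k][h]."""
    B, N, H = len(left), len(left[0]), len(left[0][0][0])
    return [[[[sum(left[b][i][k][h] * right[b][j][k][h] for k in range(N))
               for h in range(H)]
              for j in range(N)]
             for i in range(N)]
            for b in range(B)]


def pack(x):
    """[B][N][N][H] -> [B*H][N][N]: one matrix per (batch, channel) pair."""
    B, N, H = len(x), len(x[0]), len(x[0][0][0])
    return [[[x[b][i][k][h] for k in range(N)] for i in range(N)]
            for b in range(B) for h in range(H)]


def matmul_nt(a, b):
    """Compute a @ b^T for square nested-list matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trimul_outgoing_packed(left, right):
    """Same contraction, expressed as B*H independent N x N matmuls."""
    B, N, H = len(left), len(left[0]), len(left[0][0][0])
    packed = [matmul_nt(l, r) for l, r in zip(pack(left), pack(right))]
    # Unpack [B*H][N][N] back to [B][N][N][H]; matrix index is b * H + h.
    return [[[[packed[b * H + h][i][j] for h in range(H)]
              for j in range(N)]
             for i in range(N)]
            for b in range(B)]
```

In a real kernel the packed matrices would be contiguous BF16 tiles, which is the property the commit message attributes to the packed Triton path; the sketch only checks that the two formulations agree.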
What
Adds a GPUMODE TriMul outgoing forward submission at repo root, plus local correctness tests.
The implementation avoids per-call module construction, stacks the five input projections into one linear call, skips mask multiplication for the official no-mask dtype shape, and uses BF16 for the CUDA triangle multiplication path before returning FP32 output.
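The stacked-projection idea can be sketched in pure Python. This is a minimal illustration, assuming bias-free projections that all read the same input: concatenating the five weight matrices along the output dimension, running one linear call, and slicing the result gives the same values as five separate calls. All names are hypothetical, and nested lists stand in for tensors.

```python
def linear(x, w):
    """y = x @ w^T, with x: [T][D] and w: [O][D] (bias-free, an assumption)."""
    return [[sum(xi * wi for xi, wi in zip(row, wrow)) for wrow in w]
            for row in x]


def project_separately(x, weights):
    """Baseline: one linear call per projection."""
    return [linear(x, w) for w in weights]


def project_stacked(x, weights):
    """Stack the weights along the output dim, run one linear, then split."""
    out_dim = len(weights[0])
    stacked = [wrow for w in weights for wrow in w]  # [P*O][D]
    y = linear(x, stacked)                           # [T][P*O]
    # Slice the wide output back into P chunks of width O.
    return [[row[p * out_dim:(p + 1) * out_dim] for row in y]
            for p in range(len(weights))]


# Tiny usage example with integer data so equality is exact.
x = [[1, 2], [3, 4], [5, 6]]
weights = [[[p, 1], [2, p + 1]] for p in range(5)]  # five [2][2] matrices
assert project_stacked(x, weights) == project_separately(x, weights)
```

On a GPU the benefit is fewer kernel launches and one larger GEMM instead of five small ones; the stacked weight would typically be built once and cached rather than concatenated on every call.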
Why
TriMul is the next BioML kernel target after stabilizing the A100/H200 profiling loop. This gives us a correct baseline with real H200 smoke/profiling artifacts before deeper Triton/CUDA optimization.
Non-goals
Testing
uv run ruff format --check submission.py tests/test_trimul_submission.py
uv run ruff check submission.py tests/test_trimul_submission.py
uv run pytest -q tests/test_trimul_submission.py
sf-trimul-check-174907-h200: matches_reference=True, max abs error 0.007030963897705078, mean 1.926944 ms for bs=2, seqlen=256, dim=128, hiddendim=128, nomask normal.
sf-trimul-cauchy-175307-h200: matches_reference=True, max abs error 0.004771828651428223.
sf-trimul-mask384-175327-h200: matches_reference=True, max abs error 0.003912881016731262.
sf-trimul-ncu-174943-h200 captured at runs/traces/sf-trimul-ncu-174943-h200-hermes.tar.gz.
Full make test is currently blocked on main by missing checked-in NCU CSV fixtures under runs/airun/week1; the focused TriMul tests pass.
Risk
Low for repo behavior because this is an isolated challenge submission file and tests. Performance risk is that the current implementation is still PyTorch-op based; the NCU trace should guide the next custom-kernel pass.