Add split-K mid-M SDPA Triton kernel for EAGLE-3 target_verify at long context by digantdesai · Pull Request #20344 · pytorch/executorch

digantdesai · 2026-06-17T19:02:28Z

EAGLE-3 verify is a target forward over M = chain+1 query rows. On gemma4-31B's
full-attention (global) layers, the standard SDPA scans the whole max_seq_len KV
buffer on a (B, H) grid: one CTA per head loops over the key range serially. At
long context the verify attention is therefore occupancy-starved and its cost
grows ~linearly with context, dominating the round and turning speculative
decoding into a net loss. The M query rows themselves ride along for free on the
same K/V read.
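
As a rough illustration of the occupancy problem described above, the sketch
below counts CTAs and serial key tiles for a (B, H) grid. The head count and
tile width are hypothetical placeholders, not values taken from gemma4-31B or
from sdpa.py; only the A100's 108 SMs is a known figure. It is a back-of-envelope
model, not a measurement.

```python
import math

NUM_SMS = 108          # SM count of an A100
BATCH, HEADS = 1, 32   # hypothetical; not gemma4-31B's real head count
BLOCK_N = 64           # hypothetical key-tile width


def baseline_launch(max_seq_len):
    """CTAs launched and serial key tiles per CTA for a (B, H) grid."""
    ctas = BATCH * HEADS                              # one CTA per (batch, head)
    serial_tiles = math.ceil(max_seq_len / BLOCK_N)   # each CTA walks every tile
    return ctas, serial_tiles


for max_seq_len in (8192, 131072):
    ctas, tiles = baseline_launch(max_seq_len)
    print(f"max_seq_len={max_seq_len}: {ctas} CTAs on {NUM_SMS} SMs, "
          f"{tiles} serial key tiles per CTA")
```

Under these assumed numbers the CTA count stays fixed below the SM count while
the serial tile count scales with the buffer length.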

This adds a length-bounded split-K mid-M SDPA path for that case. The Triton
kernel (backends/cuda/triton/kernels/sdpa_midm.py) bounds the key range to the
valid length and partitions it across CTAs, using a split-K online softmax plus
a cross-split reduce (the flash-decoding trick), with sdpa.py-style guards for
tiles that a row's causal mask leaves empty. Gemma4_31B gains opt-in dispatch
(set_midm_sdpa): full-attention layers route verify windows with M in
[2, MIDM_MAX_M] through the kernel, while sliding-window layers, prefill,
decode, and other models stay on F.sdpa. The valid KV length reaches the kernel
as the length of a new target_verify input, kv_window (a backed SymInt). Export
wires it up behind --no-midm-sdpa, and the runner feeds it each round. Verify
global attention then stays ~flat with context instead of growing.
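
The split-K online softmax and cross-split reduce can be sketched in plain
Python for a single head. This is a reference-style illustration, not the
Triton kernel in sdpa_midm.py: every function name here is hypothetical, and it
assumes the M query rows occupy the last M positions of the valid window, so
row i may attend to keys 0..kv_len-M+i.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, k, v, kv_len, scale):
    """Dense causal softmax attention over the first kv_len keys, one head."""
    m_rows, head_dim = len(q), len(v[0])
    out = []
    for i, q_row in enumerate(q):
        last_key = kv_len - m_rows + i  # assumed causal layout (see lead-in)
        scores = [scale * dot(q_row, k[j]) for j in range(last_key + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / denom
                    for d in range(head_dim)])
    return out


def split_partial(q_row, k, v, start, stop, last_key, scale, block_n):
    """Online softmax over keys [start, stop): returns (max, sum, accumulator)."""
    head_dim = len(v[0])
    run_max, run_sum, acc = -math.inf, 0.0, [0.0] * head_dim
    for tile_start in range(start, stop, block_n):
        tile_stop = min(tile_start + block_n, stop, last_key + 1)
        if tile_stop <= tile_start:
            continue  # guard: the causal mask empties this tile for this row
        scores = [scale * dot(q_row, k[j]) for j in range(tile_start, tile_stop)]
        new_max = max(run_max, max(scores))
        rescale = math.exp(run_max - new_max)  # 0.0 on the first visible tile
        run_sum *= rescale
        acc = [a * rescale for a in acc]
        for j, s in zip(range(tile_start, tile_stop), scores):
            w = math.exp(s - new_max)
            run_sum += w
            for d in range(head_dim):
                acc[d] += w * v[j][d]
        run_max = new_max
    return run_max, run_sum, acc


def combine_splits(partials, head_dim):
    """Cross-split reduce: rescale each split to the global max, then normalize."""
    global_max = max(run_max for run_max, _, _ in partials)
    total, out = 0.0, [0.0] * head_dim
    for run_max, run_sum, acc in partials:
        if run_sum == 0.0:
            continue  # this split saw no visible keys for the row
        weight = math.exp(run_max - global_max)
        total += weight * run_sum
        out = [o + weight * a for o, a in zip(out, acc)]
    return [o / total for o in out]


def split_k_attention(q, k, v, kv_len, scale, num_splits, block_n):
    """Bound the key range to kv_len, split it, and reduce across splits."""
    m_rows, head_dim = len(q), len(v[0])
    split_len = math.ceil(kv_len / num_splits)
    out = []
    for i, q_row in enumerate(q):
        last_key = kv_len - m_rows + i
        partials = [
            split_partial(q_row, k, v, s * split_len,
                          min((s + 1) * split_len, kv_len), last_key, scale, block_n)
            for s in range(num_splits)  # each split would map to its own CTA
        ]
        out.append(combine_splits(partials, head_dim))
    return out


rng = random.Random(0)
MAX_SEQ_LEN, KV_LEN, M, HEAD_DIM = 40, 23, 4, 4  # toy sizes for illustration
k = [[rng.uniform(-1, 1) for _ in range(HEAD_DIM)] for _ in range(MAX_SEQ_LEN)]
v = [[rng.uniform(-1, 1) for _ in range(HEAD_DIM)] for _ in range(MAX_SEQ_LEN)]
q = [[rng.uniform(-1, 1) for _ in range(HEAD_DIM)] for _ in range(M)]
scale = 1.0 / math.sqrt(HEAD_DIM)

expected = reference_attention(q, k, v, KV_LEN, scale)
actual = split_k_attention(q, k, v, KV_LEN, scale, num_splits=4, block_n=4)
max_err = max(abs(a - e) for ra, re in zip(actual, expected) for a, e in zip(ra, re))
print("max abs error vs dense reference:", max_err)
```

In this sketch the splits run in a Python loop; on a GPU each split would be an
independent CTA writing its (max, sum, accumulator) triple for a small second
pass to combine. Entries of k and v beyond KV_LEN are never read, which is the
length-bounding part.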

Because kv_window's shape changes every round, target_verify can no longer be
captured as a CUDA graph, so the runner's --cuda_graph now defaults off.

Lossless: output is byte-identical to baseline greedy decoding except for rare
near-tie argmax flips. These come from FP non-associativity between the
M=chain+1 verify and M=1 decode paths, and the same prompts flip without this
kernel. Unit coverage is in backends/cuda/tests/test_sdpa_midm.py. Benchmarks
need the 31B checkpoints, an A100, and a long-context export, so they run
outside CI and are not included in this message.
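
The near-tie flips mentioned above follow from floating-point addition not
being associative. A minimal illustration with made-up values (not logits from
any model): summing the same three terms in two orders gives results one ulp
apart, which is enough to change an argmax when a competing value sits exactly
on one of them.

```python
def argmax(values):
    # Ties resolve to the first index, as max() returns the first maximum.
    return max(range(len(values)), key=values.__getitem__)


terms = [0.1, 0.2, 0.3]                      # made-up contributions to one logit
left = (terms[0] + terms[1]) + terms[2]      # one reduction order
right = terms[0] + (terms[1] + terms[2])     # another reduction order

competitor = 0.6                             # made-up logit of a second token
pick_left = argmax([competitor, left])
pick_right = argmax([competitor, right])

print(left == right, pick_left, pick_right)
```

The two orders disagree in the last bit, so the selected index differs even
though both sums are equally valid roundings of the same real number.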

Authored with assistance from Claude Code.

[ghstack-poisoned]

digantdesai · 2026-06-17T19:02:29Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-17T19:02:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20344

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Cancelled Job, 1 Unrelated Failure, 5 Unclassified Failures

As of commit 37fc965 with merge base dc55469:

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_models_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 82c511ea1ed1c0203b83be4ba3cc3de87b8447c35e2fc415947fdd4a6a321f79 /exec failed with exit code 1
pull / test-lora-linux / linux-job (gh)
RuntimeError: Command docker exec -t 4adb13835836ce372cc996a6993d6d88b66213baa611c3acc9e38047bd2a9ec2 /exec failed with exit code 1
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh)
Process completed with exit code 1.

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/extension/llm/tokenizers/third-party/sentencepiece/src/sentencepiece_processor.cc:129:10: error: no declaration matches ‘uint32_t sentencepiece::ImmutableSentencePieceText_ImmutableSentencePiece::end() const’
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64
Test CUDA Builds / export-model-cuda-artifact (SocialLocalMobile, gemma-4-31B-it-HQQ-INT4, quantized-int4-tile-packed) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Test CUDA Builds / export-model-cuda-artifact (SocialLocalMobile, Qwen3.5-35B-A3B-HQQ-INT4, quantized-int4-tile-packed) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Test CUDA Builds / unittest-cuda / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
backends/cuda/tests/test_int4_dispatch.py::TestLimitConsistency::test_shim_gemm_max_m_matches_cuh

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-moshi-linux / linux-job (gh) (matched linux rule in flaky-rules.json)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/m/mesa/mesa-vdpau-drivers_23.2.1-1ubuntu3.1%7e22.04.3_amd64.deb 404 Not Found [IP: 104.20.28.246 80]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Update

37fc965

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label on Jun 17, 2026. (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)

digantdesai wants to merge 1 commit into gh/digantdesai/62/head from gh/digantdesai/63/head
