Metal backend: Add gated delta rule kernel for linear attention by manuelcandales · Pull Request #18878 · pytorch/executorch

manuelcandales · 2026-04-14T16:25:34Z

Adds Metal kernel for the gated delta rule recurrence used by Qwen 3.5
MoE's GatedDeltaNet linear attention layers. Ported from the MLX delegate
PR (#18785) Metal shader. The kernel processes the full sequence
sequentially within a single GPU dispatch, keeping recurrent state in
per-thread registers.

Grid: [32, Dv, B*Hv], Threadgroup: [32, 4, 1]. Each simdgroup of 32
threads spans the key dimension, with each thread handling Dk/32
elements and a SIMD reduction completing the dot products.
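
To make the dispatch shape concrete, here is a minimal Python sketch of the arithmetic implied by the grid and threadgroup sizes above. The helper name `dispatch_shape` is hypothetical, and the sketch assumes the grid counts threads (not threadgroups) and that B=1; neither is confirmed by the PR text.

```python
import math

SIMD_WIDTH = 32  # threads per simdgroup, as stated in the PR description


def dispatch_shape(B, Hv, Dv, Dk, threadgroup=(32, 4, 1)):
    """Return the grid, the threadgroup count per axis, and key elements per thread."""
    if Dk % SIMD_WIDTH != 0:
        raise ValueError("Dk must be a multiple of the simdgroup width")
    grid = (SIMD_WIDTH, Dv, B * Hv)
    # Assumption: the grid is measured in threads, so each axis needs
    # ceil(grid / threadgroup) threadgroups.
    groups = tuple(math.ceil(g / t) for g, t in zip(grid, threadgroup))
    return {
        "grid": grid,
        "threadgroups": groups,
        "keys_per_thread": Dk // SIMD_WIDTH,
    }


# The two instantiations named in the description, with an assumed batch of 1.
real = dispatch_shape(B=1, Hv=32, Dv=128, Dk=128)
tiny = dispatch_shape(B=1, Hv=4, Dv=64, Dk=64)
print(real)
print(tiny)
```

Under these assumptions the real-model configuration gives each thread 4 key elements and the tiny test configuration gives each thread 2.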

The op mutates the recurrent state buffer in-place (mutates_args).
Instantiated for both real model (Dk=128, Dv=128, Hk=32, Hv=32) and
tiny test (Dk=64, Dv=64, Hk=4, Hv=4) dimensions.
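
For readers unfamiliar with the recurrence, the following is a rough pure-Python sketch of one time step for a single (batch, head) pair, with the state mutated in place. It is not the PR's reference implementation: the function name `gated_delta_step` is hypothetical, and the `beta` write-strength term is assumed from the usual gated delta rule formulation rather than taken from this PR. The gate `g` is treated as already exponentiated.

```python
def gated_delta_step(state, q, k, v, g, beta):
    """One gated delta rule step for a single (batch, head) pair.

    state is a Dv x Dk list of lists and is mutated in place.
    q and k have length Dk, v has length Dv, g and beta are scalars.
    Returns the output vector of length Dv.
    """
    out = []
    # Each value row is independent, which is why Dv can map to a grid axis.
    for row, v_i in zip(state, v):
        for j in range(len(row)):
            row[j] *= g  # decay the old state (g is already exp'd)
        # Dot product over the key dimension (a SIMD reduction on the GPU).
        kv_mem = sum(s * k_j for s, k_j in zip(row, k))
        delta = (v_i - kv_mem) * beta  # correction toward the new value
        for j, k_j in enumerate(k):
            row[j] += delta * k_j  # rank-1 update of this row
        out.append(sum(s * q_j for s, q_j in zip(row, q)))
    return out


state = [[0.0] * 3 for _ in range(2)]  # Dv=2, Dk=3, zero initial state
k = [1.0, 0.0, 0.0]
y1 = gated_delta_step(state, q=k, k=k, v=[2.0, 3.0], g=1.0, beta=1.0)
y2 = gated_delta_step(state, q=k, k=k, v=[2.0, 3.0], g=0.5, beta=1.0)
print(y1, y2, state)
```

In the second call the state is first halved by the gate, and the delta term then writes the difference back, so querying with the same key returns the stored value again.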

Includes: Metal shader + C++ host dispatch, Python custom op definition
(metal::gated_delta_rule) with reference CPU impl and Meta impl, C shim
dict, fallback kernel registration, CMakeLists entry, and test module.

Authored with Claude.

[ghstack-poisoned]

manuelcandales · 2026-04-14T16:25:35Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-04-14T16:26:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18878

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

metascroy · 2026-04-17T00:06:03Z

@claude review this code

claude · 2026-04-17T00:06:43Z

Claude finished @metascroy's task in 2m 37s

Read all changed files
Analyze Metal shader and C++ host dispatch
Analyze Python custom op and reference implementation
Analyze tests
Post review

Review: Metal gated delta rule kernel

Overall this is a well-structured addition that follows existing Metal backend patterns. The kernel logic is correct and the in-place state mutation approach is sound. A few issues worth addressing:

Bug: Missing error check on output tensor creation (op_gated_delta_rule.mm:253-270)

The aoti_torch_create_tensor_from_blob_v2 return value is not checked. Compare with op_gather_qmv.mm:492-500 which checks both the return code and handle:

// op_gather_qmv.mm pattern:
AOTITorchError create_result = aoti_torch_create_tensor_from_blob_v2(
    out_contents_ptr, 2, output_sizes.data(), output_strides.data(),
    0, dtype, 13, 0, &out_tensor_handle, 0, nullptr, 0);

if (create_result != Error::Ok || !out_tensor_handle) {
  ET_LOG(Error, "...: Failed to create output tensor");
  aoti_torch_mps_free(out_contents_ptr);
  return Error::Internal;
}

In op_gated_delta_rule.mm:261-268, the return value is discarded and only the handle is null-checked. If aoti_torch_create_tensor_from_blob_v2 returns a non-Ok error but still sets a non-null handle, the error is silently ignored. More importantly, even the null-handle path calls aoti_torch_mps_free without logging or returning the proper error code, so the flow falls through to memory_to_n_tensor[y_ptr] = 1 for the freed pointer.

Bug: Python reference impl silently assumes Hk == Hv (gated_delta_rule.py:46)

The reference implementation broadcasts k_t[:, :, None, :] as [B, Hk, 1, Dk] against state [B, Hv, Dv, Dk]. This broadcasting only works when Hk == Hv. The Metal kernel correctly handles Hk != Hv via hk_idx = hv_idx / (Hv / Hk), but the Python reference would produce wrong results (or crash) if someone ever instantiates with Hk != Hv.

Currently all instantiated kernels use Hk == Hv so this isn't a live bug, but it creates a correctness trap. Consider either:

Adding an assert: assert Hk == Hv, "Reference impl requires Hk == Hv"
Or implementing proper GQA key expansion in the reference: k_t = k_t.repeat_interleave(Hv // Hk, dim=1) before the state operations
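
As an illustration of the second option, the sketch below shows in plain Python that expanding key heads by consecutive repetition (the behavior of `repeat_interleave` along the head dimension) matches the kernel's `hk_idx = hv_idx / (Hv / Hk)` indexing. The helper name `expand_key_heads` and the string placeholders standing in for per-head key tensors are hypothetical.

```python
def expand_key_heads(k_heads, Hv):
    """Expand Hk key heads to Hv heads using the kernel's index mapping."""
    Hk = len(k_heads)
    if Hv % Hk != 0:
        raise ValueError("Hv must be a multiple of Hk")
    rep = Hv // Hk
    # Same mapping as the Metal kernel: hk_idx = hv_idx / (Hv / Hk).
    return [k_heads[hv_idx // rep] for hv_idx in range(Hv)]


k_heads = ["k0", "k1"]  # placeholders for two key heads (Hk=2)
expanded = expand_key_heads(k_heads, Hv=4)

# Consecutive repetition of each head, which is what repeat_interleave does.
interleaved = [h for h in k_heads for _ in range(4 // len(k_heads))]
print(expanded, interleaved)
```

With the keys expanded this way, the `[B, Hk, 1, Dk]` broadcast against `[B, Hv, Dv, Dk]` becomes a broadcast between equal head counts.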

Nit: Stale comment in instantiation block (op_gated_delta_rule.mm:124)

Line 124 says "Qwen 3.5 MoE dimensions: Dk=128, Dv=128, Hk=16, Hv=32" but line 143 instantiates Hk=32, Hv=32. The comment on line 142 explains this ("Hk=16 after repeat_interleave → 32") but the macro comment on line 124 is potentially confusing since the template instantiation itself uses the post-expansion values.

Nit: Tolerance relaxation for gather_qmv bfloat16 (test_modules.py:736-737)

The bfloat16 tolerances for gather_qmv were relaxed from atol=1e-1, rtol=1e-1 to atol=5.0, rtol=2e-1. An absolute tolerance of 5.0 is extremely loose — it means outputs can differ by up to 5.0 in absolute value. If this was needed to make tests pass, it may indicate a real precision issue in the gather_qmv kernel for bfloat16. Was this change intentional for this PR, or is it papering over a separate issue? It seems unrelated to the gated delta rule addition.
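
To show how loose the new bound is, here is a small sketch using the closeness criterion that `torch.allclose` and `torch.testing.assert_close` apply, `|actual - expected| <= atol + rtol * |expected|`. Whether the test harness in this PR uses exactly that check is an assumption, and the sample values are made up for illustration.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Closeness test of the form |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative values only: an output that is off by 4.5 from a small expected value.
expected, actual = 0.5, 5.0

old_ok = within_tolerance(actual, expected, atol=1e-1, rtol=1e-1)
new_ok = within_tolerance(actual, expected, atol=5.0, rtol=2e-1)
print(old_ok, new_ok)
```

The old tolerances reject this pair, while the relaxed ones accept an output ten times larger than the expected value.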

Minor observations (non-blocking):

Per the comment on line 93, the Metal kernel applies the decay gate g_[hv_idx] directly (no exp()). The PR description pseudocode says exp(g_t), but the op signature says "already exp'd". The kernel is consistent with the signature, but this is worth noting for documentation clarity: callers must pre-apply exp() to the gate.
The kernel's in-place state mutation (passing state_tensor as both arg 5 and arg 7) is thread-safe because each (b, hv, dv, dk) element is owned by exactly one thread — no cross-thread data races.
Test dimensions (Hk=4, Hv=4, Dk=64, Dv=64) only cover the Hk == Hv case. If GQA support (Hk < Hv) is intended to be supported by the Metal kernel, it would be valuable to add a test configuration for it (though this would also require fixing the Python reference).
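
The ownership claim in the second observation can be checked with a short sketch. The mapping below is a guess at how state elements are assigned to threads (contiguous blocks of Dk/32 key elements per SIMD lane); the function name `owner` and that layout are assumptions, not taken from the shader source.

```python
from itertools import product


def owner(b, hv, dv, dk, Hv, n_per_t):
    """Hypothetical thread position (x, y, z) that owns state element (b, hv, dv, dk)."""
    return (dk // n_per_t, dv, b * Hv + hv)


B, Hv, Dv, Dk = 1, 4, 64, 64  # tiny test dimensions with an assumed batch of 1
n_per_t = Dk // 32

# Count how many state elements land on each thread position.
elements_per_thread = {}
for b, hv, dv, dk in product(range(B), range(Hv), range(Dv), range(Dk)):
    pos = owner(b, hv, dv, dk, Hv, n_per_t)
    elements_per_thread[pos] = elements_per_thread.get(pos, 0) + 1

print(len(elements_per_thread), set(elements_per_thread.values()))
```

Under this assumed layout every thread in the [32, Dv, B*Hv] grid owns exactly Dk/32 elements and each element has a single owner, so in-place updates never touch another thread's data.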

[ghstack-poisoned]

github-actions · 2026-04-21T17:55:51Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


manuelcandales added 5 commits April 14, 2026 12:25

Update

a3a42e4

[ghstack-poisoned]

Update

1c965c6

[ghstack-poisoned]

Update

1be53ab

[ghstack-poisoned]

Update

47cbe76

[ghstack-poisoned]

Update

805a09d

[ghstack-poisoned]

manuelcandales requested review from kirklandsign, larryliu0820 and shoumikhin as code owners April 14, 2026 16:25

meta-cla Bot added the CLA Signed label Apr 14, 2026

manuelcandales marked this pull request as draft April 14, 2026 16:27

manuelcandales added 2 commits April 14, 2026 18:23

Update

958712e

[ghstack-poisoned]

Update

eba74c4

[ghstack-poisoned]

manuelcandales marked this pull request as ready for review April 14, 2026 22:24

manuelcandales added 5 commits April 14, 2026 18:44

Update

e7a7acc

[ghstack-poisoned]

Update

5530242

[ghstack-poisoned]

Update

59f88db

[ghstack-poisoned]

Update

1fbb94f

[ghstack-poisoned]

Update

60ca500

[ghstack-poisoned]

manuelcandales removed request for kirklandsign, larryliu0820 and shoumikhin April 15, 2026 15:14

manuelcandales requested review from mergennachin and metascroy April 15, 2026 15:14

manuelcandales mentioned this pull request Apr 16, 2026

Qwen 3.5 MoE Metal: Use max-sized prefill example for dynamic inputs #18956

Merged

manuelcandales added 3 commits April 20, 2026 14:12

Update

4632a83

[ghstack-poisoned]

Update

98d2f81

[ghstack-poisoned]

Update

95fb7f9

[ghstack-poisoned]

metascroy approved these changes Apr 20, 2026


manuelcandales added 13 commits April 20, 2026 15:01

Update

f4f616e

[ghstack-poisoned]

Update

b8e1201

[ghstack-poisoned]

Update

248115a

[ghstack-poisoned]

Update

ee865c3

[ghstack-poisoned]

Update

9000488

[ghstack-poisoned]

Update

a060d19

[ghstack-poisoned]

Update

01c3ce5

[ghstack-poisoned]

Update

0c1a88b

[ghstack-poisoned]

Update

933122c

[ghstack-poisoned]

Update

9def0ed

[ghstack-poisoned]

Update

01ecf6a

[ghstack-poisoned]

Update

7423226

[ghstack-poisoned]

Update

4b791ea

[ghstack-poisoned]

Base automatically changed from gh/manuelcandales/172/head to main April 21, 2026 17:53

Update

f8ebcfb

[ghstack-poisoned]

manuelcandales merged commit d408a10 into main Apr 21, 2026
175 of 181 checks passed

manuelcandales deleted the gh/manuelcandales/173/head branch April 21, 2026 18:01

manuelcandales merged 29 commits into main from gh/manuelcandales/173/head
