|
2 | 2 |
|
3 | 3 | Run ExecuTorch models on the GPU via [WebGPU](https://www.w3.org/TR/webgpu/). The backend compiles delegated subgraphs into WGSL compute shaders executed natively through [wgpu-native](https://github.com/gfx-rs/wgpu-native) (Metal on macOS, Vulkan on Linux/Windows). |
4 | 4 |
|
5 | | -> **Status: Prototype.** The backend supports a single operator today and is under active development. See [TODO.md](TODO.md) for the roadmap. |
| 5 | +> **Status: Prototype.** The backend supports `add` and `rms_norm` today and is under active development. See [Progress](#progress) for shipped milestones. |
| 6 | +
|
| 7 | +## Progress |
| 8 | + |
| 9 | +Milestones landed on `main`: |
| 10 | + |
| 11 | +| Date | Milestone | Pull Request | |
| 12 | +|---|---|---| |
| 13 | +| 2026-04 | Initial backend: runs ExecuTorch models on the GPU through WebGPU, including the runtime delegate that builds the GPU graph (buffers, pipelines, bind groups) and executes the model on Metal and Vulkan | [#18808](https://github.com/pytorch/executorch/pull/18808) | |
| 14 | +| 2026-06 | Model support beyond element-wise operators: added the root-mean-square normalization operator (`rms_norm`) and named-data weight loading | [#19963](https://github.com/pytorch/executorch/pull/19963) | |
| 15 | +| 2026-06 | Automated testing: added WebGPU to ExecuTorch's standard backend test suite, running on Linux/x86 in CI | [#19964](https://github.com/pytorch/executorch/pull/19964) | |
| 16 | +| 2026-06 | Removed manual shader upkeep: the embedded `*_wgsl.h` shader headers are now generated automatically from the WGSL sources, and a build-time check fails the build on shader/source drift | [#19981](https://github.com/pytorch/executorch/pull/19981) | |
| 17 | +| 2026-06 | GPU execution in tests: added operator-allowlist delegation (unsupported operations fall back to the CPU) and a process-wide GPU device context, so models execute on the GPU during testing | [#20036](https://github.com/pytorch/executorch/pull/20036) | |
| 18 | + |
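The operator-allowlist delegation listed above can be pictured with a small sketch. This is an illustration only, not the backend's partitioner: it treats a model as a flat list of operator names, whereas a real partitioner works on a graph. `SUPPORTED_OPS` and `partition` are hypothetical names, and the two allowlisted operators are simply the ones this README lists as supported.

```python
# Hypothetical allowlist: the two operators this README lists as supported.
SUPPORTED_OPS = frozenset({"aten.add.Tensor", "et_vk.rms_norm.default"})


def partition(ops, supported=SUPPORTED_OPS):
    """Group a linear operator sequence into contiguous (target, ops) segments.

    Operators on the allowlist are assigned to "webgpu"; everything else
    falls back to "cpu". Adjacent operators with the same target are merged
    into one segment, so each segment could be handed off as a unit.
    """
    segments = []
    for op in ops:
        target = "webgpu" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)  # extend the current segment
        else:
            segments.append((target, [op]))  # start a new segment
    return segments


if __name__ == "__main__":
    model = [
        "aten.add.Tensor",
        "aten.add.Tensor",
        "aten.mul.Tensor",  # not on the allowlist, so it stays on the CPU
        "et_vk.rms_norm.default",
    ]
    for target, segment in partition(model):
        print(target, segment)
```

Merging adjacent operators into one segment reflects the usual motivation for delegation: fewer hand-offs between CPU and GPU. Whether the actual backend groups operators this way is not stated here.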
| 19 | +In review: |
| 20 | + |
| 21 | +| Milestone | Pull Request | |
| 22 | +|---|---| |
| 23 | +| Aligns testing more closely with the WebGPU standard: switches the tests to Tint, the WGSL shader compiler from Google's Dawn, running on SwiftShader for headless GPU execution | [#20079](https://github.com/pytorch/executorch/pull/20079) | |
| 24 | +| Strengthens correctness for models that run in several GPU passes — adds dispatch-ordering and scratch-buffer (temporary GPU memory) tests | [#20080](https://github.com/pytorch/executorch/pull/20080) | |
6 | 25 |
|
7 | 26 | ## Architecture |
8 | 27 |
|
@@ -36,8 +55,9 @@ Key design choices: |
36 | 55 | | Operator | WGSL Shader | Notes | |
37 | 56 | |---|---|---| |
38 | 57 | | `aten.add.Tensor` | `binary_add.wgsl` | Element-wise with alpha: `out = in1 + alpha * in2` | |
| 58 | +| `et_vk.rms_norm.default` | `rms_norm.wgsl` | Root-mean-square normalization | |
39 | 59 |
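For readers checking shader output by hand, the two supported operators can be written as plain-Python references. This is a minimal sketch, not the shader code: the `add` formula is the one in the table above, while the `rms_norm` form (a per-element `weight` and an `eps` term inside the square root) is the common textbook definition and is assumed here rather than taken from `rms_norm.wgsl`. `add_ref` and `rms_norm_ref` are hypothetical names.

```python
import math


def add_ref(in1, in2, alpha=1.0):
    """Element-wise add with alpha: out = in1 + alpha * in2."""
    return [a + alpha * b for a, b in zip(in1, in2)]


def rms_norm_ref(x, weight, eps=1e-6):
    """Root-mean-square normalization over one row.

    Assumed form: x / sqrt(mean(x^2) + eps) * weight. The default eps
    is an illustrative value, not one read from the backend.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(add_ref([1.0, 2.0], [3.0, 4.0], alpha=0.5))
    print(rms_norm_ref([3.0, 4.0], [1.0, 1.0]))
```

A reference like this is mainly useful as the expected value in a tolerance-based comparison against GPU output, since floating-point results from a shader will rarely match bit for bit.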
|
40 | | -**Planned:** `sub`, `mul`, `relu`, `linear` (matmul), `softmax`, `layer_norm` |
| 60 | +**Planned:** scaled-dot-product attention (KV cache), quantized linear (4-bit weight-only and 8da4w post-training quantization), quantized embedding, RoPE, `mul`, `sigmoid`, and shape ops (`view`, `permute`, `slice`, `select`, `cat`, `squeeze`/`unsqueeze`). |
41 | 61 |
|
42 | 62 | ## Quick Start |
43 | 63 |
|
@@ -83,27 +103,37 @@ This runs Python export tests, exports a .pte, builds the native runtime, and va |
83 | 103 | backends/webgpu/ |
84 | 104 | ├── CMakeLists.txt |
85 | 105 | ├── README.md |
86 | | -├── TODO.md |
87 | 106 | ├── runtime/ |
88 | 107 | │ ├── WebGPUBackend.h/cpp # BackendInterface (init/execute) |
89 | 108 | │ ├── WebGPUGraph.h/cpp # GPU graph: buffers, pipelines, dispatch |
90 | 109 | │ ├── WebGPUDelegateHeader.h/cpp # VH00 header parser |
91 | 110 | │ ├── WebGPUDevice.h/cpp # wgpu-native device abstraction |
| 111 | +│ ├── WebGPUUtils.h # Workgroup-size helpers |
92 | 112 | │ └── ops/ |
93 | 113 | │ ├── OperatorRegistry.h/cpp # Op dispatch table |
94 | | -│ └── add/ |
95 | | -│ ├── BinaryOp.cpp # aten.add.Tensor implementation |
96 | | -│ ├── binary_add.wgsl # WGSL shader source |
97 | | -│ └── binary_add_wgsl.h # Shader as C++ string constant |
| 114 | +│ ├── add/ |
| 115 | +│ │ ├── BinaryOp.cpp # aten.add.Tensor implementation |
| 116 | +│ │ ├── binary_add.wgsl # WGSL shader source |
| 117 | +│ │ └── binary_add_wgsl.h # Shader as C++ string constant |
| 118 | +│ └── rms_norm/ |
| 119 | +│ ├── RmsNorm.cpp # et_vk.rms_norm.default implementation |
| 120 | +│ ├── rms_norm.wgsl # WGSL shader source |
| 121 | +│ └── rms_norm_wgsl.h # Shader as C++ string constant |
98 | 122 | ├── scripts/ |
99 | | -│ └── setup-wgpu-native.sh # Download wgpu-native binaries |
| 123 | +│ ├── setup-wgpu-native.sh # Download wgpu-native binaries |
| 124 | +│ └── gen_wgsl_headers.py # Generate the embedded *_wgsl.h shader headers |
100 | 125 | └── test/ |
101 | 126 | ├── conftest.py |
| 127 | + ├── tester.py # Partitioner stages + supported-op list |
102 | 128 | ├── test_build_webgpu.sh # End-to-end build + test |
103 | 129 | ├── test_webgpu_native.cpp # C++ native test runner |
104 | | - └── ops/ |
105 | | - └── add/ |
106 | | - └── test_add.py # Python export tests |
| 130 | + ├── test_wgsl_codegen.py # Shader codegen check |
| 131 | + ├── native/ # C++ operator tests |
| 132 | + └── ops/ # Python export tests |
| 133 | + ├── add/ |
| 134 | + │ └── test_add.py # add export tests |
| 135 | + └── rms_norm/ |
| 136 | + └── test_rms_norm.py # rms_norm export tests |
107 | 137 | ``` |
108 | 138 |
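The tree above pairs each `.wgsl` source with a `*_wgsl.h` header and lists a generator script and a codegen check. The idea of a drift check can be sketched as follows. This is an assumption-laden illustration, not the contents of `gen_wgsl_headers.py` or `test_wgsl_codegen.py`: the header layout, the symbol name, and the functions `render_header` and `has_drifted` are all hypothetical.

```python
def render_header(wgsl_source, symbol):
    """Render WGSL text as a C++ raw-string constant (hypothetical layout).

    Assumes the shader text never contains the sequence `)wgsl"`, which
    would terminate the raw string early.
    """
    return (
        "// Generated file; do not edit.\n"
        "#pragma once\n"
        f'inline constexpr const char* {symbol} = R"wgsl(\n'
        f"{wgsl_source}"
        ')wgsl";\n'
    )


def has_drifted(wgsl_source, committed_header, symbol):
    """Return True when the committed header no longer matches the source."""
    return render_header(wgsl_source, symbol) != committed_header


if __name__ == "__main__":
    source = "@compute @workgroup_size(64)\nfn main() {}\n"
    header = render_header(source, "kBinaryAddWGSL")
    print(has_drifted(source, header, "kBinaryAddWGSL"))
    print(has_drifted(source + "// edited\n", header, "kBinaryAddWGSL"))
```

Regenerating the header and comparing it byte for byte keeps the check simple: any edit to the shader that is not followed by regeneration shows up as a mismatch, which a build step can turn into a failure.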
|
109 | 139 | ## Requirements |
|