Enable backend test suite + x86 CI (#19964) #19986
Summary: The Vulkan serializer that the WebGPU backend reuses stores every non-empty constant in the PTE's named-data map with `offset == UINT64_MAX` and a `named_key`, rather than inline in the VK00 blob. `WebGPUGraph::build` previously handled only inline constants, so a delegated op's constant weights were never uploaded and the op produced all zeros. `build` now also fetches named-data constants via `NamedDataMap::get_data`, mirroring the path `VulkanBackend` already uses. `aten.add` was unaffected since it has no constant tensors; the first consumer is the `rms_norm` op in the child diff.

ghstack-source-id: 389182397
exported-using-ghexport
Reviewed By: SS-JIA
Differential Revision: D107288998
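For reviewers, the inline-versus-named-data branch described above can be sketched as follows. This is a minimal illustration, not the code in `WebGPUGraph::build`: `ConstantRef`, `FakeNamedDataMap`, and `resolve_constant` are hypothetical names, and a plain `std::map` stands in for the runtime's named-data map so the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical descriptor for one serialized constant.
struct ConstantRef {
  uint64_t offset;        // UINT64_MAX marks "stored in the named-data map"
  uint64_t length;        // byte length when stored inline
  std::string named_key;  // lookup key when stored in the named-data map
};

// Stand-in for the runtime's named-data map (assumption, not the real type).
using FakeNamedDataMap = std::map<std::string, Bytes>;

// Returns the bytes of a constant, taking the named-data path when the
// sentinel offset is set and the inline path otherwise.
inline Bytes resolve_constant(
    const ConstantRef& ref,
    const Bytes& inline_blob,
    const FakeNamedDataMap& named) {
  if (ref.offset == std::numeric_limits<uint64_t>::max()) {
    const auto it = named.find(ref.named_key);
    if (it == named.end()) {
      throw std::runtime_error("missing named constant: " + ref.named_key);
    }
    return it->second;
  }
  // Inline path: bounds-check before slicing the blob.
  if (ref.offset > inline_blob.size() ||
      ref.length > inline_blob.size() - ref.offset) {
    throw std::out_of_range("inline constant exceeds blob size");
  }
  const auto first =
      inline_blob.begin() + static_cast<std::ptrdiff_t>(ref.offset);
  Bytes out(first, first + static_cast<std::ptrdiff_t>(ref.length));
  assert(out.size() == ref.length);
  return out;
}
```

Handling only the second branch reproduces the bug described above: a constant with the sentinel offset would never be read, so its buffer would stay zero-filled.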
Summary: Adds the `et_vk.rms_norm.default` operator to the WebGPU backend: a WGSL compute shader using a cooperative tree reduction, one workgroup per row. The shader mirrors the Vulkan implementation (`backends/vulkan/runtime/graph/ops/impl/RmsNorm.cpp`, `backends/vulkan/runtime/graph/ops/glsl/rms_norm_buffer.glsl`); indexing assumes contiguous fp32 inputs. The handler fails loudly (it throws, mirroring Vulkan's `VK_CHECK_COND`) on invalid shape, dtype, or dispatch-limit conditions, and defaults `eps` to the float32 machine epsilon. The weight constant is uploaded via the named-data path added in the parent diff.

ghstack-source-id: 389206169
exported-using-ghexport
Reviewed By: SS-JIA
Differential Revision: D106887028
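As a reading aid, here is a CPU-side sketch of the reduction pattern described above. It is not the WGSL shader: it emulates one workgroup per row in plain C++, assumes the standard RMSNorm definition `y = x * weight / sqrt(mean(x^2) + eps)`, and uses hypothetical names (`row_sum_of_squares`, `rms_norm_reference`) and a hypothetical default workgroup size of 64.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Emulates one workgroup reducing a row: each lane accumulates a strided
// partial sum of squares, then the partials are combined by a halving tree.
// wg_size must be a power of two so the halving loop covers every lane.
inline float row_sum_of_squares(
    const float* row, std::size_t width, std::size_t wg_size) {
  assert(wg_size > 0 && (wg_size & (wg_size - 1)) == 0);
  std::vector<float> shared(wg_size, 0.0f);  // stands in for workgroup memory
  for (std::size_t lane = 0; lane < wg_size; ++lane) {
    for (std::size_t i = lane; i < width; i += wg_size) {
      shared[lane] += row[i] * row[i];
    }
  }
  for (std::size_t stride = wg_size / 2; stride > 0; stride /= 2) {
    for (std::size_t lane = 0; lane < stride; ++lane) {
      shared[lane] += shared[lane + stride];
    }
  }
  return shared[0];
}

// Reference RMSNorm over a contiguous fp32 [rows, width] tensor.
// eps defaults to the float32 machine epsilon, as in the summary above.
inline std::vector<float> rms_norm_reference(
    const std::vector<float>& x,
    const std::vector<float>& weight,
    std::size_t rows,
    std::size_t width,
    float eps = std::numeric_limits<float>::epsilon(),
    std::size_t wg_size = 64) {
  assert(x.size() == rows * width && weight.size() == width);
  std::vector<float> out(x.size());
  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = x.data() + r * width;
    const float mean_sq =
        row_sum_of_squares(row, width, wg_size) / static_cast<float>(width);
    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);
    for (std::size_t i = 0; i < width; ++i) {
      out[r * width + i] = row[i] * inv_rms * weight[i];
    }
  }
  return out;
}
```

A reference like this is mainly useful for comparing against the delegate's output within a small tolerance, since the summation order on the GPU differs from a sequential loop.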
Summary: Wires the WebGPU backend into the standard ExecuTorch backend test suite and adds an x86 Linux CI job, mirroring the Vulkan delegate: `backends/test/suite/flows/webgpu.py` plus a `WebGPUTester`, run by `oss/.github/workflows/test-backend-webgpu.yml` on SwiftShader (a software Vulkan adapter, via `wgpu-native`, minimal dependencies, no GPU).

Two fixes were needed for SwiftShader's downlevel limits: request the adapter's full `requiredLimits` at device creation (software adapters default storage-buffer limits to 0), and make the `add` op's workgroup size dynamic instead of a hardcoded constant. The WGSL now declares a pipeline-overridable `override wg_size: u32 = 256` and the host clamps it to the device's `maxComputeInvocationsPerWorkgroup` (256 on real GPUs and lavapipe, 128 on SwiftShader), so SwiftShader's 128-invocation cap no longer forces a smaller workgroup size on real hardware. This mirrors the dynamic-workgroup-sizing approach in D107259348 and opens the door to selecting device/algorithm-optimal sizes later.

The `add` op also validates its 1D dispatch count before allocating any GPU objects, against the device's queried `maxComputeWorkgroupsPerDimension` (falling back to the WebGPU spec-default floor of 65535 only when the limit query fails). Per Stephen's review, the workgroup-size clamp and the dispatch-count computation are factored into reusable `inline` helpers in `runtime/WebGPUUtils.h` (`clamp_workgroup_size` and `compute_1d_workgroup_count`, mirroring the Vulkan delegate's `utils::div_up`) so the other ops can share them rather than re-inlining the logic.

The editable CMake build additionally marks the `vulkan_schema` subdirectory `EXCLUDE_FROM_ALL` so the WebGPU `ALL` build does not pull in targets that need glslc.

ghstack-source-id: 389222646
exported-using-ghexport
Differential Revision: D107288999
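To make the clamp and dispatch-count logic above concrete, here is a rough sketch of what such helpers might look like. The helper names come from the summary, but the signatures, the `max_workgroups_or_default` helper, and the convention that a failed limit query is reported as 0 are assumptions for illustration; the actual code in `runtime/WebGPUUtils.h` may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// WebGPU spec-default floor for maxComputeWorkgroupsPerDimension.
constexpr uint32_t kDefaultMaxWorkgroupsPerDim = 65535;

// Assumption: a failed limit query is reported as 0 (hypothetical helper).
inline uint32_t max_workgroups_or_default(uint32_t queried) {
  return queried != 0 ? queried : kDefaultMaxWorkgroupsPerDim;
}

// Clamps the requested workgroup size (e.g. the shader default of 256) to
// the device's maxComputeInvocationsPerWorkgroup.
inline uint32_t clamp_workgroup_size(uint32_t requested, uint32_t device_max) {
  assert(requested > 0 && device_max > 0);
  return std::min(requested, device_max);
}

// Ceil-divides the element count by the workgroup size and fails loudly if
// the result exceeds the per-dimension dispatch limit.
inline uint32_t compute_1d_workgroup_count(
    uint64_t numel, uint32_t wg_size, uint32_t max_workgroups) {
  if (wg_size == 0) {
    throw std::invalid_argument("workgroup size must be nonzero");
  }
  const uint64_t count = (numel + wg_size - 1) / wg_size;  // div_up
  if (count > max_workgroups) {
    throw std::runtime_error("1D dispatch exceeds workgroup-count limit");
  }
  return static_cast<uint32_t>(count);
}
```

With these definitions, a 128-invocation device would turn a request of 256 into 128, and a 1000-element `add` would dispatch 8 workgroups at that size. The value returned by the clamp is what the host would pass as the `wg_size` override when creating the pipeline.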