Add rms_norm op (#19893)#19893
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19893
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 158a749 with merge base 3b3f621 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
|
@JulianCloudNTH has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106887028. |
This PR needs a
|
1abcd20 to
38e33e5
Compare
Summary: Adds the `et_vk.rms_norm.default` operator to the WebGPU backend (a WGSL compute shader using a cooperative tree reduction, one workgroup per row), fixes constant upload so the op's weight loads correctly, and wires the backend into CI. The Vulkan serializer that the WebGPU backend reuses stores every non-empty constant (e.g. the rms_norm weight) in the PTE's named-data map with `offset == UINT64_MAX` and a `named_key`, rather than inline in the VK00 blob. `WebGPUGraph::build` previously handled only inline constants, so the weight was never uploaded and the op returned all zeros. `build` now also fetches named-data constants via `NamedDataMap::get_data`, mirroring the path `VulkanBackend` already uses. `aten.add` was unaffected since it has no constant tensors. The shader mirrors the Vulkan implementation (`backends/vulkan/runtime/graph/ops/impl/RmsNorm.cpp`, `backends/vulkan/runtime/graph/ops/glsl/rms_norm_buffer.glsl`); indexing assumes contiguous fp32 inputs. The handler fails loud (throws, mirroring Vulkan's `VK_CHECK_COND`) on invalid shape/dtype/dispatch-limit conditions, and defaults `eps` to the float32 machine epsilon. Also adds a simple x86 Linux CI job, mirroring the Vulkan delegate: `backends/test/suite/flows/webgpu.py` plus a `WebGPUTester`, run by `oss/.github/workflows/test-backend-webgpu.yml` on SwiftShader (a software Vulkan adapter, via `wgpu-native`, minimal dependencies, no GPU). Two fixes were needed for SwiftShader's downlevel limits: request the adapter's full `requiredLimits` at device creation (software adapters default storage-buffer limits to 0), and lower the `add` op `workgroup_size` from 256 to 64 (256 exceeded SwiftShader's 128-invocation cap; the Vulkan delegate uses 64). Differential Revision: D106887028
38e33e5 to
d6f278e
Compare
SS-JIA
left a comment
There was a problem hiding this comment.
Review automatically exported from Phabricator review in Meta.
Summary: The Vulkan serializer that the WebGPU backend reuses stores every non-empty constant in the PTE's named-data map with `offset == UINT64_MAX` and a `named_key`, rather than inline in the VK00 blob. `WebGPUGraph::build` previously handled only inline constants, so a delegated op's constant weights were never uploaded and the op produced all zeros. `build` now also fetches named-data constants via `NamedDataMap::get_data`, mirroring the path `VulkanBackend` already uses. `aten.add` was unaffected since it has no constant tensors; the first consumer is the `rms_norm` op in the child diff. Differential Revision: D107288998
ef77d1d to
cc65e02
Compare
SS-JIA
left a comment
There was a problem hiding this comment.
Review automatically exported from Phabricator review in Meta.
cc65e02 to
ef77d1d
Compare
Summary: Pull Request resolved: pytorch#19893 Adds the `et_vk.rms_norm.default` operator to the WebGPU backend: a WGSL compute shader using a cooperative tree reduction, one workgroup per row. The shader mirrors the Vulkan implementation (`backends/vulkan/runtime/graph/ops/impl/RmsNorm.cpp`, `backends/vulkan/runtime/graph/ops/glsl/rms_norm_buffer.glsl`); indexing assumes contiguous fp32 inputs. The handler fails loud (throws, mirroring Vulkan's `VK_CHECK_COND`) on invalid shape/dtype/dispatch-limit conditions, and defaults `eps` to the float32 machine epsilon. The weight constant is uploaded via the named-data path added in the parent diff. Differential Revision: D106887028
ef77d1d to
158a749
Compare
| uniform_desc.size = sizeof(RmsNormParams); | ||
| uniform_desc.usage = WGPUBufferUsage_Uniform | WGPUBufferUsage_CopyDst; | ||
| uniform_desc.mappedAtCreation = true; | ||
| WGPUBuffer uniform_buffer = wgpuDeviceCreateBuffer(device, &uniform_desc); |
There was a problem hiding this comment.
would this buffer be freed eventually ? When should this be released ?
| kRmsNormWorkgroupSize == 64, | ||
| "must match @workgroup_size and WG_SIZE in rms_norm.wgsl"); | ||
| if (num_rows > 65535u) { | ||
| throw std::runtime_error( |
There was a problem hiding this comment.
if we are throwing exception here, should we release / free all objects that have been allocated thus far such as (uniform buffer, shader module, etc...)
| if (buf.ok() && buf->size() >= tensor.nbytes) { | ||
| wgpuQueueWriteBuffer( | ||
| queue_, tensor.buffer, 0, buf->data(), tensor.nbytes); | ||
| buf->Free(); |
There was a problem hiding this comment.
Should we free this here ? does this step "named_data_map->get_data" allocate new memory at Ln174 ?
| queue_, tensor.buffer, 0, buf->data(), tensor.nbytes); | ||
| buf->Free(); | ||
| } else { | ||
| throw std::runtime_error( |
There was a problem hiding this comment.
nits: The error message conflates "key not found" and "buffer undersized"
into one string. Splitting into two branches with sizes/error codes would make debugging ?
Summary:
Adds the
et_vk.rms_norm.defaultoperator to the WebGPU backend: a WGSL compute shader using a cooperative tree reduction, one workgroup per row. The shader mirrors the Vulkan implementation (backends/vulkan/runtime/graph/ops/impl/RmsNorm.cpp,backends/vulkan/runtime/graph/ops/glsl/rms_norm_buffer.glsl); indexing assumes contiguous fp32 inputs. The handler fails loud (throws, mirroring Vulkan'sVK_CHECK_COND) on invalid shape/dtype/dispatch-limit conditions, and defaultsepsto the float32 machine epsilon.The weight constant is uploaded via the named-data path added in the parent diff.
Differential Revision: D106887028