[ET-VK] Tuning local workgroup size calculation for conv2d pw to improve performance. #11135

Merged

facebook-github-bot merged 6 commits into gh/trivedivivek/95/base from gh/trivedivivek/95/head on May 28, 2025

trviv commented May 27, 2025

Contributor

Stack from ghstack (oldest at bottom):

This diff adjusts the local workgroup size (local_wg_size) based on batch count (stored in wg_size[1]), to improve conv2d pw performance.

If wg_size[1] is a multiple of 8, local_wg_size_y is set to 8.
If wg_size[1] is a multiple of 4, local_wg_size_y is set to 4.
If wg_size[1] is a multiple of 2, local_wg_size_y is set to 2.
Otherwise, we default to local_wg_size_y = 1.

The dispatch size in 2 dimensions is then calculated as {64 / local_wg_size_y, local_wg_size_y, 1}, which keeps 64 invocations per workgroup in every case.
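
The selection rule above can be sketched in C++ roughly as follows. This is a minimal illustration, not the code from this diff: the function name `pick_conv2d_pw_local_wg_size` and the use of `std::array` are assumptions made for the example, and only `wg_size[1]` and `local_wg_size_y` come from the description.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (name and signature are assumptions, not the
// ExecuTorch API). Given the global workgroup size, it returns a local
// workgroup size following the rule described in this PR.
std::array<uint32_t, 3> pick_conv2d_pw_local_wg_size(
    const std::array<uint32_t, 3>& wg_size) {
  // Pick the largest of 8, 4, 2 that evenly divides wg_size[1];
  // fall back to 1 when wg_size[1] is odd.
  uint32_t local_wg_size_y = 1;
  if (wg_size[1] % 8 == 0) {
    local_wg_size_y = 8;
  } else if (wg_size[1] % 4 == 0) {
    local_wg_size_y = 4;
  } else if (wg_size[1] % 2 == 0) {
    local_wg_size_y = 2;
  }

  const uint32_t local_wg_size_x = 64 / local_wg_size_y;
  // The X and Y extents always multiply to 64 invocations per workgroup.
  assert(local_wg_size_x * local_wg_size_y == 64);
  return {local_wg_size_x, local_wg_size_y, 1};
}
```

For example, under this sketch `wg_size[1] = 12` is not a multiple of 8 but is a multiple of 4, so the result is `{16, 4, 1}`; an odd value such as 7 gives `{64, 1, 1}`.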

Differential Revision: D75420517


          [ET-VK] Tuning local workgroup size calculation for conv2d pw to impr…

f0f92a2

…ove performance.

This diff adjusts the local workgroup size (`local_wg_size`) based on batch count (stored in  `wg_size[1]`), to improve conv2d pw performance.

* If `wg_size[1]` is a multiple of 8, `local_wg_size_y` is set to 8.
* If `wg_size[1]` is a multiple of 4, `local_wg_size_y` is set to 4.
* If `wg_size[1]` is a multiple of 2, `local_wg_size_y` is set to 2.
* Otherwise, we default to `local_wg_size_y` = 1.

The dispatch size in 2 dimensions is then calculated based on `{64 / local_wg_size_y, local_wg_size_y, 1}`.

Differential Revision: [D75420517](https://our.internmc.facebook.com/intern/diff/D75420517/)

[ghstack-poisoned]

trviv requested a review from SS-JIA as a code owner

May 27, 2025 04:40

pytorch-bot Bot commented May 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11135

✅ You can merge normally! (2 Unrelated Failures)

As of commit ea2e9d5 with merge base 380eb5f:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / linux / linux-job (gh) (trunk failure)
examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_has_expected_ops_and_op_counts
pull / unittest-editable / linux / linux-job (gh) (trunk failure)
examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_has_expected_ops_and_op_counts

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

This was referenced May 24, 2025

[ET-VK] De vectorise conv2d pw shader to improve perf. #11108

Merged

[ET-VK] Remove the use of shared memory in conv2d pw to improve perf. #11110

Merged

[ET-VK] Tuning conv 2d pw op tile size to improve perf. #11112

Merged

[ET-VK] Minor tuning for conv2d pw op to improve performance. #11113

Merged

[ET-VK] De vectorise positions in conv2d pw shader to improve perf. #11122

Merged

[ET-VK] Minor unroll tuning to improve conv2d pw perf. #11134

Merged

[ET-VK] De vectorise all vectors in conv2d pw shader to improve perf. #11136

Merged

[ET-VK] Creating specialized version of conv2d pw shader for X and Y stride = 1 and padding = 0. #11137

Merged

[ET-VK] Storing positions in uint16 to instead of int in conv2d pw shader. #11138

Merged

[ET-VK] Reducing precision of some in members in conv2d pw to improved performance. #11139

Merged

facebook-github-bot commented May 27, 2025

Contributor

This pull request was exported from Phabricator. Differential Revision: D75420517

facebook-github-bot added the fb-exported label


          Update on "[ET-VK] Tuning local workgroup size calculation for conv2d…

8899cb1

… pw to improve performance."

trviv mentioned this pull request

[ET-VK] Applying bias after sum calculation in conv2d pw shader to improve performance. #11150

Merged

SS-JIA approved these changes

trviv added the topic: not user facing label


          Update on "[ET-VK] Tuning local workgroup size calculation for conv2d…

84f5d9c

… pw to improve performance."

trviv added the release notes: none label


          Update on "[ET-VK] Tuning local workgroup size calculation for conv2d…

02ba09f

… pw to improve performance."

This was referenced May 28, 2025

[ET-VK] Modifying should_squeeze function in SqueezeUnsqueezeInputs to not squeeze if significant axis are all 1 and trailing axis are all > 1. #11177

Merged

[ET-VK] Removed shared memory usage and simplied conv2d dw op shader to improve performance. #11178

Merged


          Update on "[ET-VK] Tuning local workgroup size calculation for conv2d…

8b1b647

… pw to improve performance."


          Update on "[ET-VK] Tuning local workgroup size calculation for conv2d…

ea2e9d5

… pw to improve performance."

facebook-github-bot merged commit f8572ef into gh/trivedivivek/95/base

96 of 98 checks passed

facebook-github-bot deleted the gh/trivedivivek/95/head branch

May 28, 2025 15:53

facebook-github-bot temporarily deployed to cherry-pick-bot

May 28, 2025 15:53

— with

GitHub Actions Inactive

pytorchbot mentioned this pull request

[ET-VK] Tuning local workgroup size calculation for conv2d pw to improve performance. #11188

Merged

trviv added a commit that referenced this pull request


          [ET-VK] Tuning local workgroup size calculation for conv2d pw to impr…

11f8f4a

…ove performance. (#11188)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11135 by
@trivedivivek
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/head
Merge bot PR base:
https://github.com/pytorch/executorch/tree/gh/trivedivivek/94/orig
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/orig
@diff-train-skip-merge

---------

Co-authored-by: Vivek Trivedi <5340687+trivedivivek@users.noreply.github.com>

Labels

CLA Signed fb-exported release notes: none topic: not user facing