[ET-VK] Tuning local workgroup size calculation for conv2d pw to improve performance. #11135
This diff adjusts the local workgroup size (`local_wg_size`) based on the batch count (stored in `wg_size[1]`) to improve conv2d pw performance. The checks are applied in order, so the largest matching divisor wins:
* If `wg_size[1]` is a multiple of 8, `local_wg_size_y` is set to 8.
* Otherwise, if `wg_size[1]` is a multiple of 4, `local_wg_size_y` is set to 4.
* Otherwise, if `wg_size[1]` is a multiple of 2, `local_wg_size_y` is set to 2.
* Otherwise, `local_wg_size_y` defaults to 1.
The dispatch size in 2 dimensions is then calculated based on `{64 / local_wg_size_y, local_wg_size_y, 1}`, which keeps 64 invocations per workgroup for every choice of `local_wg_size_y`.
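The selection rule above can be sketched in C++ as follows. This is an illustration only, not the code from the diff: the function names `pick_local_wg_size_y` and `conv2d_pw_local_wg_size` are hypothetical, and the use of `std::array<uint32_t, 3>` for the sizes is an assumption made to keep the example self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (name not from the diff): choose local_wg_size_y
// from wg_size[1] by testing divisibility by 8, then 4, then 2.
inline uint32_t pick_local_wg_size_y(uint32_t wg_size_y) {
  if (wg_size_y % 8 == 0) {
    return 8;
  }
  if (wg_size_y % 4 == 0) {
    return 4;
  }
  if (wg_size_y % 2 == 0) {
    return 2;
  }
  return 1;  // odd values fall back to 1
}

// Hypothetical helper: build {64 / local_wg_size_y, local_wg_size_y, 1}.
inline std::array<uint32_t, 3> conv2d_pw_local_wg_size(
    const std::array<uint32_t, 3>& wg_size) {
  const uint32_t y = pick_local_wg_size_y(wg_size[1]);
  // y is always one of 1, 2, 4, 8, so it divides 64 exactly.
  assert(y != 0 && 64 % y == 0);
  return {64 / y, y, 1};
}
```

For example, under these assumptions a `wg_size` of `{100, 12, 1}` would select `local_wg_size_y = 4`, since 12 is a multiple of 4 but not of 8, giving `{16, 4, 1}`.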
Differential Revision: [D75420517](https://our.internmc.facebook.com/intern/diff/D75420517/)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11135
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit ea2e9d5 with merge base 380eb5f:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D75420517
Merged commit f8572ef into gh/trivedivivek/95/base
…ove performance. (#11188)
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #11135 by @trivedivivek
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/trivedivivek/94/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/trivedivivek/95/orig
@diff-train-skip-merge
---------
Co-authored-by: Vivek Trivedi <5340687+trivedivivek@users.noreply.github.com>