Commit f5c972c
ssjia
Update on "[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection"
Profiling showed that depthwise conv2d is 5-15x slower on Mali GPUs than on
Adreno, due to register pressure from the 4x2 output tile (17 vec4 registers
per thread). Benchmarking confirmed that reducing the tile to 1x1 (7 vec4
registers) gives a 4-15x speedup on Mali with no regression on Adreno.
This change extracts depthwise conv2d dispatch logic from Convolution.cpp into a
new Conv2dDW.cpp (following the Conv2dPW.cpp pattern), and adds device-based
tile size selection: b1x1 on Mali, b4x2 (current default) on Adreno.
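As a rough illustration of the device-based selection described above, the sketch below maps a device name to a tile size. All names here (`GpuVendor`, `TileSize`, `classify_device`, `select_dw_tile`) are hypothetical and are not the identifiers used in Conv2dDW.cpp. The substring match on the device name, the width/height reading of "4x2", and the fallback to b4x2 for unlisted GPUs are assumptions made for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical types for illustration only; not the ExecuTorch Vulkan API.
enum class GpuVendor { Adreno, Mali, Other };

struct TileSize {
  int width;   // assumed: first number in "b4x2"
  int height;  // assumed: second number in "b4x2"
};

// Classify a device by a case-insensitive substring match on its name.
// Assumption: the real code may query the device type differently.
inline GpuVendor classify_device(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (name.find("mali") != std::string::npos) {
    return GpuVendor::Mali;
  }
  if (name.find("adreno") != std::string::npos) {
    return GpuVendor::Adreno;
  }
  return GpuVendor::Other;
}

// b1x1 on Mali to reduce register pressure; b4x2 (the current default)
// on Adreno. Assumption: any other GPU also keeps the b4x2 default.
inline TileSize select_dw_tile(GpuVendor vendor) {
  if (vendor == GpuVendor::Mali) {
    return TileSize{1, 1};
  }
  return TileSize{4, 2};
}
```

A caller would compute `select_dw_tile(classify_device(device_name))` once at dispatch time and pick the matching shader variant from the result.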
Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)
[ghstack-poisoned]