Commit f5c972c

ssjia committed
Update on "[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection"
Profiling showed depthwise conv2d is 5-15x slower on Mali GPUs vs Adreno due to register pressure from the 4x2 output tile (17 vec4 registers per thread). Benchmarking confirmed that reducing the tile to 1x1 (7 vec4 registers) gives 4-15x speedup on Mali with no regression on Adreno.

This change extracts depthwise conv2d dispatch logic from Convolution.cpp into a new Conv2dDW.cpp (following the Conv2dPW.cpp pattern), and adds device-based tile size selection: b1x1 on Mali, b4x2 (current default) on Adreno.

Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)

[ghstack-poisoned]
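
The device-based tile selection described above could look roughly like the following. This is a minimal sketch, not the code in Conv2dDW.cpp: `GpuFamily`, `TileSize`, `classify_device`, `select_dw_tile`, and `dw_shader_suffix` are hypothetical names, matching on the device name string is an assumed heuristic, and treating every non-Mali device as b4x2 is an assumption based on b4x2 being described as the current default.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical GPU family tag; not the actual ExecuTorch Vulkan type.
enum class GpuFamily { Adreno, Mali, Other };

// Output tile computed by one thread, in output texels.
struct TileSize {
  uint32_t width;
  uint32_t height;
};

// Assumed heuristic: classify the device by substring match on its name.
inline GpuFamily classify_device(const std::string& device_name) {
  if (device_name.find("Mali") != std::string::npos) {
    return GpuFamily::Mali;
  }
  if (device_name.find("Adreno") != std::string::npos) {
    return GpuFamily::Adreno;
  }
  return GpuFamily::Other;
}

// Pick the depthwise conv2d output tile for a GPU family.
// Mali gets 1x1 to reduce register pressure; everything else keeps
// the 4x2 default (assumption for families other than Adreno).
inline TileSize select_dw_tile(GpuFamily family) {
  switch (family) {
    case GpuFamily::Mali:
      return {1, 1};
    case GpuFamily::Adreno:
    case GpuFamily::Other:
      return {4, 2};
  }
  return {4, 2};  // Unreachable; keeps compilers quiet about missing return.
}

// Build a shader-variant suffix such as "b1x1" or "b4x2" from the tile.
inline std::string dw_shader_suffix(TileSize tile) {
  assert(tile.width > 0 && tile.height > 0);
  return "b" + std::to_string(tile.width) + "x" + std::to_string(tile.height);
}
```

Keeping the selection in one small function means the dispatch code only has to append the returned suffix to the shader name, and a new device family can be added without touching the rest of the dispatch path.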
2 parents 09c6b4b + d04d87d commit f5c972c

0 file changed
