Commit f5c972c

ssjia committed

Update on "[ET-VK][conv2d_dw] Extract depthwise dispatch into Conv2dDW.cpp with device-based tile selection"

Profiling showed that depthwise conv2d is 5-15x slower on Mali GPUs than on Adreno, due to register pressure from the 4x2 output tile (17 vec4 registers per thread). Benchmarking confirmed that reducing the tile to 1x1 (7 vec4 registers) gives a 4-15x speedup on Mali with no regression on Adreno.

This change extracts the depthwise conv2d dispatch logic from Convolution.cpp into a new Conv2dDW.cpp (following the Conv2dPW.cpp pattern) and adds device-based tile size selection: b1x1 on Mali, b4x2 (the current default) on Adreno.

Differential Revision: [D97058158](https://our.internmc.facebook.com/intern/diff/D97058158/)

[ghstack-poisoned]
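For illustration only, the device-based tile selection described above could look roughly like the sketch below. This is not the code in Conv2dDW.cpp: `GpuFamily`, `TileSize`, `classify_gpu`, `select_dw_tile`, and `dw_variant_suffix` are hypothetical names, classifying the device by a substring of its name is an assumption, and so is falling back to the 4x2 tile on devices that are neither Mali nor Adreno.

```cpp
#include <cassert>
#include <string>

// Hypothetical device families; the real backend may classify devices differently.
enum class GpuFamily { Adreno, Mali, Other };

// Output tile (in texels) computed by a single shader invocation.
struct TileSize {
  int width;
  int height;
};

// Assumption: classify the GPU by a substring of its reported device name.
inline GpuFamily classify_gpu(const std::string& device_name) {
  if (device_name.find("Mali") != std::string::npos) {
    return GpuFamily::Mali;
  }
  if (device_name.find("Adreno") != std::string::npos) {
    return GpuFamily::Adreno;
  }
  return GpuFamily::Other;
}

// 1x1 on Mali to reduce register pressure; 4x2 (the existing default) otherwise.
// Using 4x2 for GpuFamily::Other is an assumption, not stated in the commit.
inline TileSize select_dw_tile(GpuFamily family) {
  if (family == GpuFamily::Mali) {
    return {1, 1};
  }
  return {4, 2};
}

// Build a variant suffix such as "b1x1" or "b4x2" from the tile size.
inline std::string dw_variant_suffix(TileSize tile) {
  assert(tile.width > 0 && tile.height > 0);
  return "b" + std::to_string(tile.width) + "x" + std::to_string(tile.height);
}
```

A dispatch function would then append the suffix to the base shader name when picking the kernel, so the choice of tile stays in one place and the rest of the dispatch logic does not depend on the device.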

2 parents 09c6b4b + d04d87d, commit f5c972c

0 files changed
