Commit 855646e

ssjia committed

Update on "[ET-VK][matmul] Re-implement fp32/fp16 matmul and linear with tiled compute and blocked weight packing"

Replace all existing matmul/linear operator implementations with new ones built from the ground up using a tiled compute approach. Delete all legacy implementations (MatMulLegacy.cpp, LinearLegacy.cpp, addmm_optimized.glsl, addmm_naive_*.glsl).

New matmul (mm/bmm/addmm):
- Single matmul.glsl shader handles mm, bmm, and addmm using FPInputTile, FPWeightTile, FPOutTile infrastructure from SDPA
- Adaptive tile size selection (TILE_M=4/2/1) based on GPU occupancy
- When mat2 is a constant tensor, automatically routes through the linear path for blocked weight packing

New linear:
- Custom 4OC×4IC blocked weight prepacking via pack_fp_linear_weight.glsl for optimal cache line utilization during tiled matmul
- Supports both transposed [N,K] and non-transposed [K,N] weights with batch dimension support
- Separate texture2d weight storage with automatic buffer fallback for large dimensions

Performance on Adreno 750 (fp16, vs legacy):
- Linear [4096,1024]x[256,1024]: 1.33x faster (texture)
- Linear [4096,64]x[128,64]: 2.67x faster (texture)
- BMM [1,4096,256]x[1,256,1024]: 1.63x faster (texture)

Differential Revision: [D96488384](https://our.internmc.facebook.com/intern/diff/D96488384/)

[ghstack-poisoned]
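To make the 4OC×4IC blocked packing and the TILE_M-row tiled compute described in the commit message more concrete, here is a minimal CPU-side sketch in Python. It is an illustration under stated assumptions, not the contents of pack_fp_linear_weight.glsl or matmul.glsl: the function names (`pack_weight_blocked`, `linear_tiled`, `linear_reference`), the order of values inside a block, the order of blocks in memory, and the zero padding of partial blocks are all hypothetical choices made for this example.

```python
BLOCK = 4  # 4 output channels x 4 input channels per block (from the commit message)


def pack_weight_blocked(weight, n, k):
    """Pack a transposed [N, K] weight into contiguous 4x4 blocks.

    Assumed layout (hypothetical): blocks are ordered by K-block, then N-block;
    inside a block, values are ordered by input channel, then output channel.
    Partial blocks at the edges are zero padded.
    """
    n_blocks = (n + BLOCK - 1) // BLOCK
    k_blocks = (k + BLOCK - 1) // BLOCK
    packed = [0.0] * (n_blocks * k_blocks * BLOCK * BLOCK)
    for oc in range(n):
        for ic in range(k):
            nb, oi = divmod(oc, BLOCK)
            kb, ii = divmod(ic, BLOCK)
            base = (kb * n_blocks + nb) * BLOCK * BLOCK
            packed[base + ii * BLOCK + oi] = weight[oc][ic]
    return packed, n_blocks, k_blocks


def linear_tiled(x, packed, m, n, k, n_blocks, k_blocks, tile_m=4):
    """Compute out[m][n] = sum_k x[m][k] * W[n][k] from the packed weight.

    Each (m0, nb) pair plays the role of one work item that accumulates a
    tile_m x 4 output tile, reading one 16-value weight block per K step.
    """
    out = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):
        rows = range(m0, min(m0 + tile_m, m))
        for nb in range(n_blocks):
            acc = [[0.0] * BLOCK for _ in rows]  # per-tile accumulators
            for kb in range(k_blocks):
                base = (kb * n_blocks + nb) * BLOCK * BLOCK
                for r, mi in enumerate(rows):
                    for ii in range(BLOCK):
                        ic = kb * BLOCK + ii
                        if ic >= k:
                            break  # skip padded input channels
                        xv = x[mi][ic]
                        for oi in range(BLOCK):
                            acc[r][oi] += xv * packed[base + ii * BLOCK + oi]
            for r, mi in enumerate(rows):
                for oi in range(BLOCK):
                    oc = nb * BLOCK + oi
                    if oc < n:  # drop padded output channels
                        out[mi][oc] = acc[r][oi]
    return out


def linear_reference(x, weight):
    """Plain row-by-row reference for x @ weight^T, used only for checking."""
    return [[sum(a * b for a, b in zip(row, w)) for w in weight] for row in x]
```

In this sketch, changing `tile_m` among 4, 2, and 1 only changes how many output rows one work item covers; the result is the same. How the real shader picks TILE_M from GPU occupancy is not specified in the commit message and is not modeled here.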

2 parents 0c7eecd + 35ffc5d commit 855646e

0 file changed
