Commit 855646e
ssjia
Update on "[ET-VK][matmul] Re-implement fp32/fp16 matmul and linear with tiled compute and blocked weight packing"
Replace all existing matmul/linear operator implementations with new ones built
from the ground up using a tiled compute approach. Delete all legacy
implementations (MatMulLegacy.cpp, LinearLegacy.cpp, addmm_optimized.glsl,
addmm_naive_*.glsl).
New matmul (mm/bmm/addmm):
- Single matmul.glsl shader handles mm, bmm, and addmm using FPInputTile,
FPWeightTile, FPOutTile infrastructure from SDPA
- Adaptive tile size selection (TILE_M=4/2/1) based on GPU occupancy
- When mat2 is a constant tensor, automatically routes through the linear
path for blocked weight packing
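
The tiled compute and tile-size selection described above can be sketched in plain Python. This is only an illustration, not the shader: `pick_tile_m`, its `min_tiles` threshold, and the fixed `tile_n=4` are hypothetical stand-ins for the occupancy heuristic, which the commit message does not spell out. Each output tile is accumulated locally over the full K dimension before being written back, which is the general idea behind tiled matmul.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def pick_tile_m(m, n, tile_n=4, min_tiles=64):
    """Hypothetical heuristic (an assumption, not the shader's logic):
    take the largest TILE_M in (4, 2, 1) that still yields at least
    `min_tiles` output tiles, so small problems keep enough threads busy."""
    for tile_m in (4, 2, 1):
        if ceil_div(m, tile_m) * ceil_div(n, tile_n) >= min_tiles:
            return tile_m
    return 1


def tiled_matmul(a, b, tile_m=4, tile_n=4):
    """Compute a @ b one output tile at a time.
    a is M x K and b is K x N, both given as lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            rows = range(m0, min(m0 + tile_m, m))  # clamp at the matrix edge
            cols = range(n0, min(n0 + tile_n, n))
            # Local accumulator for this tile (plays the role of registers).
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for kk in range(k):
                for i in rows:
                    a_ik = a[i][kk]  # loaded once, reused across the tile row
                    for j in cols:
                        acc[(i, j)] += a_ik * b[kk][j]
            for (i, j), value in acc.items():
                out[i][j] = value
    return out
```

A caller would pick the tile height first and pass it through, e.g. `tiled_matmul(a, b, tile_m=pick_tile_m(len(a), len(b[0])))`. Larger tiles reuse each loaded input more often, while smaller tiles produce more independent work items for small outputs.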
New linear:
- Custom 4OC×4IC blocked weight prepacking via pack_fp_linear_weight.glsl
for optimal cache line utilization during tiled matmul
- Handles both transposed [N,K] and non-transposed [K,N] weights, with
batch dimension support
- Separate texture2d weight storage with automatic buffer fallback for
large dimensions
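
As a rough sketch of what 4OC×4IC blocked packing means, the Python below rearranges a weight matrix into contiguous 4x4 blocks and then computes a linear layer while reading the weight only through those blocks. The block ordering (k-block major, row-major inside each block), the zero padding, and the names `pack_weight_4x4` and `linear_from_packed` are assumptions made for illustration; the actual layout produced by pack_fp_linear_weight.glsl is not described here.

```python
BLOCK = 4  # 4 output channels (OC) x 4 input channels (IC) per block


def pack_weight_4x4(w, transposed=True):
    """Pack a weight matrix into a flat list of zero-padded 4OC x 4IC blocks.

    transposed=True:  w is [N, K] (rows are output channels).
    transposed=False: w is [K, N].
    Returns (packed, n_blocks, k_blocks). The ordering is illustrative only.
    """
    if transposed:
        n, k = len(w), len(w[0])
    else:
        k, n = len(w), len(w[0])

    def at(oc, ic):
        # Read one element regardless of the source layout.
        return w[oc][ic] if transposed else w[ic][oc]

    n_blocks = (n + BLOCK - 1) // BLOCK
    k_blocks = (k + BLOCK - 1) // BLOCK
    packed = []
    for kb in range(k_blocks):
        for nb in range(n_blocks):
            for i in range(BLOCK):
                for j in range(BLOCK):
                    oc, ic = nb * BLOCK + i, kb * BLOCK + j
                    packed.append(at(oc, ic) if oc < n and ic < k else 0.0)
    return packed, n_blocks, k_blocks


def linear_from_packed(x, packed, n, n_blocks, k_blocks):
    """y[m][oc] = sum over ic of x[m][ic] * W[oc][ic], using only packed blocks."""
    k = len(x[0])
    out = [[0.0] * n for _ in x]
    for m, row in enumerate(x):
        for kb in range(k_blocks):
            for nb in range(n_blocks):
                base = (kb * n_blocks + nb) * BLOCK * BLOCK  # start of this block
                for i in range(BLOCK):
                    oc = nb * BLOCK + i
                    if oc >= n:
                        break  # remaining rows of the block are padding
                    for j in range(BLOCK):
                        ic = kb * BLOCK + j
                        if ic < k:
                            out[m][oc] += row[ic] * packed[base + i * BLOCK + j]
    return out
```

Because both source layouts are normalized during packing, the compute side never needs to know whether the original weight was [N,K] or [K,N]; it reads the 16 values of a block from one contiguous run.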
Performance on Adreno 750 (fp16, vs legacy):
- Linear [4096,1024]x[256,1024]: 1.33x faster (texture)
- Linear [4096,64]x[128,64]: 2.67x faster (texture)
- BMM [1,4096,256]x[1,256,1024]: 1.63x faster (texture)
Differential Revision: [D96488384](https://our.internmc.facebook.com/intern/diff/D96488384/)
[ghstack-poisoned]