[ET-VK][patterns] Fuse torchao 4-bit quantized embedding to embedding_q4gsw #20381
SS-JIA wants to merge 1 commit into
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20381

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 3 Unrelated Failures

As of commit 7783e2f with merge base 23f9021:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Stack from ghstack (oldest at bottom):
TISO and other torchao-quantized models emit a `torchao.dequantize_affine -> aten.embedding` subgraph for their weight-only int4 quantized embedding. The existing `QuantizedEmbeddingMatch` only matches the `quantized_decomposed.embedding_4bit.dtype` fused op, so the torchao embedding never fused: its `dequantize_affine` const-folded to an fp32 weight, the resulting `aten.embedding` exceeded the buffer-element limit and fell back to CPU, and the fp32 constant bloated the serialized model.

This adds a separate `TorchAOQuantizedEmbeddingMatch` matcher that recognizes the torchao int4 `dequantize_affine -> aten.embedding` shape (qmin=-8/qmax=7, per-row group `block_size` `[1, G]`) and rewrites it to the existing `et_vk.embedding_q4gsw.default` op, repacking the unpacked int8 weight into the packed 4-bit layout. It asserts symmetric quantization (`zero_point == 0`, which the shader assumes) and guards against repacking a shared/tied weight more than once via an `et_vk_embedding_q4gsw_packed` meta flag. It is kept as a separate class from `QuantizedEmbeddingMatch` because the two dialects produce different graph shapes (one fused op vs a split dequant+gather), so a single class would only co-locate two disjoint parse paths.

On the en_US TISO backbone the embedding now delegates to Vulkan instead of falling back to CPU, and the serialized `.pte` drops from 418 MiB to 348 MiB.

This change was authored with Claude.
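
For readers unfamiliar with the repacking step, the following is a minimal, self-contained sketch of the idea, not the code in this PR. It assumes a nibble layout (value + 8 stored unsigned, even index in the low nibble, odd index in the high nibble) that may differ from what the `embedding_q4gsw` shader actually expects. The helper names `pack_int4_row`, `unpack_int4_row`, and `repack_once`, and the dict used as a stand-in for a graph node, are hypothetical. Only the qmin/qmax range, the `zero_point == 0` check, and the `et_vk_embedding_q4gsw_packed` flag name come from the description above.

```python
from typing import Dict, List

# Signed int4 range named in the PR description.
QMIN, QMAX = -8, 7


def pack_int4_row(row: List[int]) -> List[int]:
    """Pack one int4 value per element into bytes, two values per byte.

    ASSUMED layout (illustrative only): each value is biased by +8 into an
    unsigned nibble; even indices go in the low nibble, odd in the high.
    """
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    packed = []
    for lo, hi in zip(row[0::2], row[1::2]):
        if not (QMIN <= lo <= QMAX and QMIN <= hi <= QMAX):
            raise ValueError("value outside the int4 range")
        packed.append(((hi + 8) << 4) | (lo + 8))
    return packed


def unpack_int4_row(packed: List[int]) -> List[int]:
    """Inverse of pack_int4_row under the same assumed layout."""
    out = []
    for byte in packed:
        out.append((byte & 0x0F) - 8)
        out.append((byte >> 4) - 8)
    return out


def repack_once(weight_node: Dict, zero_points: List[int]) -> Dict:
    """Repack a weight at most once, even if several embeddings share it.

    weight_node is a hypothetical stand-in: {"data": rows, "meta": {}}.
    """
    # Symmetric quantization only: every zero point must be 0.
    if any(zp != 0 for zp in zero_points):
        raise AssertionError("only symmetric quantization is supported")
    meta = weight_node["meta"]
    # A tied weight reached a second time is already packed; leave it alone.
    if meta.get("et_vk_embedding_q4gsw_packed"):
        return weight_node
    weight_node["data"] = [pack_int4_row(r) for r in weight_node["data"]]
    meta["et_vk_embedding_q4gsw_packed"] = True
    return weight_node
```

Calling `repack_once` twice on the same stand-in node leaves the data packed exactly once, which is the property the meta flag is meant to protect for shared/tied weights.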
Differential Revision: D108457797