UPSTREAM PR #21431: vulkan: Tweak Xe2 warptile configuration#1341
Open
loci-dev wants to merge 1 commit into
On native float matmul shaders, the existing warptile configuration for Xe2 ended up spilling quite a few registers. Tweaking the warptile configuration drives spills to zero, which gives a substantial speedup on BF16 models and a small one on others.

Test setup: the Mesa anv driver on Mesa 26.0.3, with the load-combining and LICM fix from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162 and the spill-reduction improvements from https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40796.

On a single Arc Pro B60:

* gpt-oss 20B MXFP4 MoE
  * pp512: 1356.08 ± 34.83 -> 1378.53 ± 15.17
  * pp2048: 1311.92 ± 1.20 -> 1331.65 ± 4.11 (+2%)
  * matmul_f16_l: spill 75 -> 0, cycles 237414 -> 97336
  * tg128: 52.01 ± 0.01 -> 51.88 ± 0.23
* qwen35moe 35B.A3B Q4_K - Medium
  * pp512: 899.38 ± 16.84 -> 903.65 ± 14.92
  * pp2048: 897.72 ± 1.91 -> 900.93 ± 1.83
  * matmul_f32_f32_aligned_l: spill 66 -> 0, cycles 159052 -> 58102
  * matmul_f16_aligned_l: spill 68 -> 0, cycles 158332 -> 55054
  * matmul_f16_f32_f16acc_aligned_l: spill 0 -> 0, cycles 80040 -> 54872
  * tg128: 49.31 ± 0.02 -> 49.50 ± 0.01
* qwen35 9B BF16
  * pp512: 509.34 ± 79.17 -> 844.24 ± 64.5 (+66%)
  * pp2048: 564.64 ± 0.95 -> 949.35 ± 1.39 (+68%)
  * matmul_bf16_aligned_l: spill 47 -> 0, cycles 127438 -> 39124
  * tg128: 22.12 ± 0.02 -> 22.12 ± 0.02

Across four Arc Pro B60s:

* qwen35moe 122B.A10B Q5_K - Small
  * pp512: 268.06 ± 8.07 -> 269.08 ± 7.45
  * pp2048: 318.88 ± 4.69 -> 320.80 ± 1.98
  * matmul_f32_f32_aligned_l: spill 66 -> 0, cycles 159052 -> 58102
  * matmul_f16_aligned_l: spill 68 -> 0, cycles 158332 -> 55054
  * matmul_f16_f32_f16acc_aligned_l: spill 0 -> 0, cycles 80040 -> 54872
  * tg128: 26.20 ± 0.01 -> 26.40 ± 0.01
* gemma4 31B BF16
  * pp512: 141.92 ± 4.77 -> 222.61 ± 4.58 (+57%)
  * pp2048: 162.35 ± 1.42 -> 268.07 ± 6.41 (+65%)
  * matmul_bf16_aligned_l: spill 48 -> 0, cycles 116834 -> 39124
  * tg128: 6.40 ± 0.00 -> 6.41 ± 0.00
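As a quick arithmetic check of the percentages quoted in the description, the sketch below recomputes the relative change as `after / before - 1` from the mean values, and the shader cycle ratio as `before / after`. The helper names `relative_change` and `cycle_ratio` are hypothetical (not part of llama.cpp), and it assumes the pp figures are throughput values where higher is better; the numbers themselves are copied from the results above.

```python
# Hypothetical helpers (not part of llama.cpp) for sanity-checking the reported deltas.

def relative_change(before: float, after: float) -> float:
    """Return the relative change (after / before - 1) as a percentage."""
    return (after / before - 1.0) * 100.0


def cycle_ratio(before: int, after: int) -> float:
    """Return how many times fewer cycles a shader needs after the change."""
    return before / after


# Mean pp throughput values copied from the PR description (before, after).
results = {
    "gpt-oss 20B MXFP4 MoE pp2048": (1311.92, 1331.65),
    "qwen35 9B BF16 pp512": (509.34, 844.24),
    "qwen35 9B BF16 pp2048": (564.64, 949.35),
    "gemma4 31B BF16 pp512": (141.92, 222.61),
    "gemma4 31B BF16 pp2048": (162.35, 268.07),
}

for name, (before, after) in results.items():
    # Rounded to whole percent, matching the precision used in the description.
    print(f"{name}: {relative_change(before, after):+.0f}%")

# Cycle counts for matmul_bf16_aligned_l on qwen35 9B BF16 (before, after).
print(f"matmul_bf16_aligned_l: {cycle_ratio(127438, 39124):.2f}x fewer cycles")
```

The recomputed values line up with the quoted +2%, +66%, +68%, +57% and +65%, and the BF16 shader runs in roughly a third of its previous cycle count. Note that the pp512 deltas carry large standard deviations (for example ± 79.17 on qwen35 9B BF16), so the pp2048 rows are the more stable comparison.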
No meaningful performance changes were detected across 125488 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.
245e873 to d101579
7638ab4 to f1b46d5
Note: Source pull request: ggml-org/llama.cpp#21431