UPSTREAM PR #21431: vulkan: Tweak Xe2 warptile configuration #1341

Open
loci-dev wants to merge 1 commit into main from loci/pr-21431-xe2-warptile-tuning

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21431

    vulkan: Tweak Xe2 warptile configuration

    On native float matmul shaders, the existing warptile configuration
    for Xe2 spilled a large number of registers. Tweaking the
    warptile config drives spills to zero and gives a
    substantial speedup in BF16 models, and a small one in others.

    Measured with the Mesa anv driver on Mesa 26.0.3, with the load
    combining and LICM fix from
    https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162 and the
    spill-reduction improvements from
    https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40796
    applied.

    On a single Arc Pro B60:
     * gpt-oss 20B MXFP4 MoE
       pp512: 1356.08 ± 34.83 -> 1378.53 ± 15.17
       pp2048: 1311.92 ± 1.20 -> 1331.65 ± 4.11 (+2%)
       matmul_f16_l spill 75 -> 0, cycles 237414 -> 97336
       tg128: 52.01 ± 0.01 -> 51.88 ± 0.23
     * qwen35moe 35B.A3B Q4_K - Medium:
       pp512: 899.38 ± 16.84 -> 903.65 ± 14.92
       pp2048: 897.72 ± 1.91 -> 900.93 ± 1.83
       matmul_f32_f32_aligned_l spill 66 -> 0, cycles 159052 -> 58102
       matmul_f16_aligned_l spill 68 -> 0, cycles 158332 -> 55054
       matmul_f16_f32_f16acc_aligned_l spill 0 -> 0, cycles 80040 -> 54872
       tg128: 49.31 ± 0.02 -> 49.50 ± 0.01
     * qwen35 9B BF16:
       pp512: 509.34 ± 79.17 -> 844.24 ± 64.5 (+66%)
       pp2048: 564.64 ± 0.95 -> 949.35 ± 1.39 (+68%)
       matmul_bf16_aligned_l spill 47 -> 0, cycles 127438 -> 39124
       tg128: 22.12 ± 0.02 -> 22.12 ± 0.02

    Across four Arc Pro B60s:
     * qwen35moe 122B.A10B Q5_K - Small
       pp512: 268.06 ± 8.07 -> 269.08 ± 7.45
       pp2048: 318.88 ± 4.69 -> 320.80 ± 1.98
       matmul_f32_f32_aligned_l spill 66 -> 0, cycles 159052 -> 58102
       matmul_f16_aligned_l spill 68 -> 0, cycles 158332 -> 55054
       matmul_f16_f32_f16acc_aligned_l spill 0 -> 0, cycles 80040 -> 54872
       tg128: 26.20 ± 0.01 -> 26.40 ± 0.01
     * gemma4 31B BF16
       pp512: 141.92 ± 4.77 -> 222.61 ± 4.58 (+57%)
       pp2048: 162.35 ± 1.42 -> 268.07 ± 6.41 (+65%)
       matmul_bf16_aligned_l spill 48 -> 0, cycles 116834 -> 39124
       tg128: 6.40 ± 0.00 -> 6.41 ± 0.00
  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude suggested this as a part of a larger re-tuning of the parameters, but after doing my own benchmarks Claude's larger suggestions weren't as good as the more minimal fix here, and its first fix was invalid anyway.
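
For readers who want to check the percentages quoted next to the pp results, here is a minimal Python sketch that recomputes them from the mean tokens/s values listed above. The helper name `percent_change` and the dictionary layout are illustrative only and are not part of llama.cpp or llama-bench; the numbers are copied from the description.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Mean tokens/s (before, after) copied from the benchmark results above.
# Keys are free-form labels chosen for this example.
results = {
    "qwen35 9B BF16 pp512": (509.34, 844.24),
    "qwen35 9B BF16 pp2048": (564.64, 949.35),
    "gemma4 31B BF16 pp512": (141.92, 222.61),
    "gemma4 31B BF16 pp2048": (162.35, 268.07),
    "gpt-oss 20B MXFP4 MoE pp2048": (1311.92, 1331.65),
}

for name, (before, after) in results.items():
    # Round to whole percent, matching the precision used in the description.
    print(f"{name}: {percent_change(before, after):+.0f}%")
```

This only compares the means. For several of the non-BF16 runs the difference between the means is smaller than the reported ± spread (for example gpt-oss pp512), so a percentage for those rows would say little.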

@loci-review

loci-review Bot commented Apr 10, 2026

No meaningful performance changes were detected across 125488 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 (April 17, 2026 02:18)
loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 (April 20, 2026 02:19)