UPSTREAM PR #21431: vulkan: Tweak Xe2 warptile configuration by loci-dev · Pull Request #1341 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-10T02:18:23Z

Note

Source pull request: ggml-org/llama.cpp#21431

    vulkan: Tweak Xe2 warptile configuration

    On native float matmul shaders, the existing warptile configuration
    for Xe2 ended up spilling a considerable number of registers. By
    tweaking the warptile config we can drive spills to zero, which
    gives a substantial speedup in BF16 models and a small one in others.

    Benchmarked using the Mesa anv driver with the load combining and
    LICM fix from
    https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162 and the
    spill-reduction improvements from
    https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40796, on
    Mesa 26.0.3.

    On a single Arc Pro B60:
     * gpt-oss 20B MXFP4 MoE
       pp512: 1356.08 ± 34.83 -> 1378.53 ± 15.17
       pp2048: 1311.92 ± 1.20 -> 1331.65 ± 4.11 (+2%)
       matmul_f16_l spill 75 -> 0, cycles 237414 -> 97336
       tg128: 52.01 ± 0.01 -> 51.88 ± 0.23
     * qwen35moe 35B.A3B Q4_K - Medium:
       pp512: 899.38 ± 16.84 -> 903.65 ± 14.92
       pp2048: 897.72 ± 1.91 -> 900.93 ± 1.83
       matmul_f32_f32_aligned_l spill 66 -> 0, cycles 159052 -> 58102
       matmul_f16_aligned_l spill 68 -> 0, cycles 158332 -> 55054
       matmul_f16_f32_f16acc_aligned_l spill 0 -> 0, cycles 80040 -> 54872
       tg128: 49.31 ± 0.02 -> 49.50 ± 0.01
     * qwen35 9B BF16:
       pp512: 509.34 ± 79.17 -> 844.24 ± 64.5 (+66%)
       pp2048: 564.64 ± 0.95 -> 949.35 ± 1.39 (+68%)
       matmul_bf16_aligned_l spill 47 -> 0, cycles 127438 -> 39124
       tg128: 22.12 ± 0.02 -> 22.12 ± 0.02

    Across four Arc Pro B60s:
     * qwen35moe 122B.A10B Q5_K - Small
       pp512: 268.06 ± 8.07 -> 269.08 ± 7.45
       pp2048: 318.88 ± 4.69 -> 320.80 ± 1.98
       matmul_f32_f32_aligned_l spill 66 -> 0, cycles 159052 -> 58102
       matmul_f16_aligned_l spill 68 -> 0, cycles 158332 -> 55054
       matmul_f16_f32_f16acc_aligned_l spill 0 -> 0, cycles 80040 -> 54872
       tg128: 26.20 ± 0.01 -> 26.40 ± 0.01
     * gemma4 31B BF16
       pp512: 141.92 ± 4.77 -> 222.61 ± 4.58 (+57%)
       pp2048: 162.35 ± 1.42 -> 268.07 ± 6.41 (+65%)
       matmul_bf16_aligned_l spill 48 -> 0, cycles 116834 -> 39124
       tg128: 6.40 ± 0.00 -> 6.41 ± 0.00
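
The percentages quoted above can be re-derived from the before/after figures. The following is a minimal sketch, not part of the pull request: the helper name `pct_change` and the row labels are illustrative assumptions, and the numbers are copied from the results listed above. It also prints the before/after ratio for some of the reported cycle counts, which the commit message lists without a percentage.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (label, before, after) throughput figures copied from the results above.
throughput = [
    ("gpt-oss 20B pp2048", 1311.92, 1331.65),
    ("qwen35 9B BF16 pp512", 509.34, 844.24),
    ("qwen35 9B BF16 pp2048", 564.64, 949.35),
    ("gemma4 31B BF16 pp512", 141.92, 222.61),
    ("gemma4 31B BF16 pp2048", 162.35, 268.07),
]

# (shader, before, after) cycle figures copied from the results above.
cycles = [
    ("matmul_f16_l", 237414, 97336),
    ("matmul_f32_f32_aligned_l", 159052, 58102),
    ("matmul_bf16_aligned_l (qwen35 9B)", 127438, 39124),
]

for label, before, after in throughput:
    print(f"{label}: {pct_change(before, after):+.1f}%")

for label, before, after in cycles:
    # A ratio above 1 means the new configuration reports fewer cycles.
    print(f"{label}: before/after cycle ratio {before / after:.2f}")
```

Rounding each throughput change to a whole percent should reproduce the values given in parentheses above. The runs without a quoted percentage can be added to the `throughput` list in the same way.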

I have read and agree with the contributing guidelines
AI usage disclosure: YES - Claude suggested this as a part of a larger re-tuning of the parameters, but after doing my own benchmarks Claude's larger suggestions weren't as good as the more minimal fix here, and its first fix was invalid anyway.

loci-review · 2026-04-10T03:15:54Z

No meaningful performance changes were detected across 125488 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

💬 Questions? Tag @loci-dev

loci-dev temporarily deployed to PROD__AL_DEMO April 10, 2026 02:18 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 (April 17, 2026 02:18)

loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 (April 20, 2026 02:19)

loci-dev wants to merge 1 commit into main from loci/pr-21431-xe2-warptile-tuning
