UPSTREAM PR #22066: sycl: Battlemage (BMG) optimizations — AOT, Q5_K reorder, PAD stride fix, new ops, oneMKL routing#1357

Open
loci-dev wants to merge 4 commits into main from loci/pr-22066-sycl-bmg-upstream-pr

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22066

Overview

A set of four independent SYCL improvements developed on Intel® GPUs: Arc™ Battlemage (Xe2-HPG) parts such as B70, B60, B50, and B580, as well as PVC and A770. Together they fix AOT compilation for BMG targets, add missing op support, correct a stride bug in PAD, and improve small-matmul dispatch.

Authors

Commits

  1. sycl: Battlemage AOT + reorder MMVQ/dequant + async mem-op

    • Fix AOT (ahead-of-time) compilation for Battlemage by switching to -fsycl-targets=spir64_gen with -device instead of --offload-arch. The existing --offload-arch flag silently fell back to JIT on newer GPUs.
    • Skip the -ze-intel-greater-than-4GB-buffer-required linker flag when targeting spir64_gen (incompatible with offline compilation).
    • Add Q5_K reorder support for MMVQ and dequantize paths (reorder kernel, vec_dot, dequantize).
    • Decouple g_ggml_sycl_use_async_mem_op from the graph flag so async USM alloc/free can be used on the non-graph reorder staging path (controlled via GGML_SYCL_USE_ASYNC_MEM_OP env var, default on).
  2. sycl: support non-contiguous input in PAD op

    • Replace the existing ggml_is_contiguous(src0) assertion with proper strided addressing (nb0/nb1/nb2/nb3), so PAD works on views produced by reshape / permute without a preceding contiguous copy.
  3. sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

    • Implement six missing SYCL ops required by Mamba / SSM and newer model architectures:
      • GGML_OP_FILL — fill tensor with a constant
      • GGML_OP_CUMSUM — inclusive prefix-sum (work-group scan + fixup)
      • GGML_OP_DIAG — extract / create diagonal
      • GGML_OP_SOLVE_TRI — triangular solve (column-major forward substitution)
      • GGML_OP_SSM_SCAN — selective state-space scan (Mamba)
      • GGML_OP_GATED_DELTA_NET — already present; add missing include guard
    • Register all ops in ggml_backend_sycl_device_supports_op and the dispatch table.
  4. sycl: route small f32 matmuls to oneMKL, bypass oneDNN

    • For f32×f32 GEMMs below 256³ FLOPs, call oneapi::mkl::blas::gemm directly instead of going through oneDNN's DnnlGemmWrapper. oneDNN's planning overhead dominates at small sizes; the direct MKL path avoids it.
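The size-based routing in commit 4 can be pictured as a small predicate in front of the GEMM call. The sketch below is plain C++ with no SYCL or oneMKL dependency; `gemm_path`, `SMALL_GEMM_LIMIT`, and `choose_f32_gemm_path` are hypothetical names, and treating "below 256³ FLOPs" as `m * n * k < 256³` is an assumption about how the PR measures problem size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the two back ends named in the commit message.
enum class gemm_path { onemkl_direct, onednn };

// Assumed threshold: the product m*n*k compared against 256^3.
constexpr int64_t SMALL_GEMM_LIMIT = 256LL * 256 * 256;

// Pick a path for an f32 x f32 GEMM of shape (m x k) * (k x n).
// Small problems go straight to the BLAS call so that the planning
// cost of the heavier wrapper is not paid; everything else keeps
// the existing route.
static gemm_path choose_f32_gemm_path(int64_t m, int64_t n, int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    const int64_t work = m * n * k;
    return work < SMALL_GEMM_LIMIT ? gemm_path::onemkl_direct
                                   : gemm_path::onednn;
}
```

With this predicate a 64×64×64 product takes the direct path, while 256×256×256 sits exactly on the limit and stays on the oneDNN route.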

Testing

Benchmark Results

Hardware: Intel® Arc™ Pro B70 (BMG-G31, 32 GB GDDR6, Xe2-HPG, 256 EUs)
Container: intel/llm-scaler-vllm:0.14.0-b7.1 (oneAPI 2025.2.2)
Quantization: Q4_K_M unless noted. All runs: -r 3 -ngl 99 -fa 1.

All models — pp512 prefill + tg128 decode (tok/s)

| Model | AOT+F16 pp512 | Optimized pp512 | Speedup | AOT+F16 tg128 | Optimized tg128 | Speedup |
|---|---|---|---|---|---|---|
| Llama-3.2-3B | 5586 | 7118 | 1.27× | 141 | 165 | 1.17× |
| Llama-3.1-8B | 2813 | 3400 | 1.21× | 80 | 88 | 1.10× |
| Qwen3.5-9B | 353 | 947 | 2.68× | 50 | 72 | 1.43× |
| Gemma-2-9B | 2089 | 2680 | 1.28× | 58 | 68 | 1.18× |
| Mistral-Nemo-12B | 1843 | 2180 | 1.18× | 54 | 60 | 1.13× |
| Qwen2.5-14B | 1462 | 1728 | 1.18× | 43 | 48 | 1.13× |

Baseline ("AOT+F16") is an unpatched build with -DGGML_SYCL_F16=ON -DGGML_SYCL_DEVICE_ARCH=bmg-g31. "Optimized" applies all patches; the four in this PR account for the majority of the gain — the PAD fix (commit 2) drives the Qwen3.5-9B outlier, new ops (commit 3) deliver ~20% decode uplift, and oneMKL routing (commit 4) adds ~20% prefill.

Llama-3.1-8B Q4_K_M — Context scaling (prefill tok/s)

| Context | AOT+F16 | Optimized | Speedup |
|---|---|---|---|
| pp512 | 2810 | 3402 | 1.21× |
| pp2048 | 2392 | 2843 | 1.19× |
| pp4096 | 1960 | 2331 | 1.19× |
| pp8192 | 1462 | 1709 | 1.17× |

Where gains come from

| Source | Prefill | Decode |
|---|---|---|
| PAD stride fix (eliminates CPU fallbacks) | up to 2.68× | |
| New ops (eliminate CPU↔GPU transfers) | | ~1.20× |
| oneMKL small matmul routing | ~1.20× | |
| Q5_K reorder + MMVQ SWAR | | +5–9% |
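To make the PAD stride fix concrete: commit 2 describes replacing a contiguity assertion with addressing through the byte strides nb0/nb1/nb2/nb3. A minimal CPU-side sketch of that idea follows, in plain C++ rather than SYCL. `strided_view` and `pad_f32` are hypothetical names for illustration, and the sketch assumes f32 data, trailing zero padding only, and a contiguous destination; it is not the PR's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor view: extents in elements,
// strides in bytes, so permuted or sliced layouts are expressible.
struct strided_view {
    const char * data;
    int64_t      ne[4];
    size_t       nb[4];
};

// Copy src into the low corner of a zero-filled, contiguous dst.
// Every source element is located through its byte strides, so no
// contiguous copy of src is needed beforehand.
static std::vector<float> pad_f32(const strided_view & src, const int64_t dst_ne[4]) {
    for (int d = 0; d < 4; ++d) {
        assert(dst_ne[d] >= src.ne[d]);
    }
    const size_t total = static_cast<size_t>(dst_ne[0] * dst_ne[1] * dst_ne[2] * dst_ne[3]);
    std::vector<float> dst(total, 0.0f);
    for (int64_t i3 = 0; i3 < src.ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < src.ne[2]; ++i2) {
            for (int64_t i1 = 0; i1 < src.ne[1]; ++i1) {
                for (int64_t i0 = 0; i0 < src.ne[0]; ++i0) {
                    const char * p = src.data + i0 * src.nb[0] + i1 * src.nb[1]
                                              + i2 * src.nb[2] + i3 * src.nb[3];
                    float v;
                    std::memcpy(&v, p, sizeof(v));
                    const int64_t o = ((i3 * dst_ne[2] + i2) * dst_ne[1] + i1) * dst_ne[0] + i0;
                    dst[static_cast<size_t>(o)] = v;
                }
            }
        }
    }
    return dst;
}
```

Swapping `nb[0]` and `nb[1]` of a contiguous 2-D buffer gives a transposed view; `pad_f32` reads it correctly without materializing the transpose, which is the property the fix restores on the device.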

Validation

  • test-backend-ops passes for all affected ops (MUL_MAT, PAD, FILL, CUMSUM, DIAG, SSM_SCAN, GATED_DELTA_NET)
  • Batch-1 generation correctness verified
  • No regression on existing quant types (Q4_0, Q4_K, Q6_K, Q8_0)
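For CUMSUM, commit 3 describes an inclusive prefix sum built from a work-group scan plus a fixup pass. A sequential CPU reference with the same two-phase shape can be useful when comparing device output by hand. The sketch below is an illustration only: `cumsum_blocked` is a hypothetical name, the `block` parameter stands in for the work-group size, and the loops run serially where a device would run blocks in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum in two phases:
//   1. scan each block independently and record its total,
//   2. fixup: add the sum of all earlier blocks to each later block.
static std::vector<float> cumsum_blocked(const std::vector<float> & x, size_t block) {
    assert(block > 0);
    std::vector<float> out(x.size());
    std::vector<float> block_total;

    // Phase 1: local inclusive scan per block.
    for (size_t start = 0; start < x.size(); start += block) {
        const size_t end = std::min(start + block, x.size());
        float acc = 0.0f;
        for (size_t i = start; i < end; ++i) {
            acc += x[i];
            out[i] = acc;
        }
        block_total.push_back(acc);
    }

    // Phase 2: carry the running total of preceding blocks forward.
    float carry = 0.0f;
    for (size_t b = 0; b < block_total.size(); ++b) {
        const size_t start = b * block;
        const size_t end   = std::min(start + block, x.size());
        if (b > 0) {
            for (size_t i = start; i < end; ++i) {
                out[i] += carry;
            }
        }
        carry += block_total[b];
    }
    return out;
}
```

The result should not depend on `block`, which gives a quick self-check: running the same input with two different block sizes must produce identical output for exactly representable values.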

To Reproduce

Please follow the detailed reproduction steps in the README of our public fork.

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. This work was partially produced with an agentic engineering approach: agents surface issues and explore experiments while engineers identify and reject candidates using domain knowledge. Human feedback involved.

@loci-review

loci-review Bot commented Apr 18, 2026

No meaningful performance changes were detected across 46869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19