UPSTREAM PR #22066: sycl: Battlemage (BMG) optimizations - AOT, Q5_K reorder, PAD stride fix, new ops, oneMKL routing #1357
Open
loci-dev wants to merge 4 commits into
No meaningful performance changes were detected across 46869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize.
Note
Source pull request: ggml-org/llama.cpp#22066
Overview
A set of four independent SYCL improvements developed on Intel® Arc™ (Xe2-HPG / Battlemage) GPUs, e.g. B70, B60, B50, PVC, B580, A770. Together they fix AOT compilation for BMG targets, add missing op support, correct a stride bug in PAD, and improve small-matmul dispatch.
Authors
Commits
sycl: Battlemage AOT + reorder MMVQ/dequant + async mem-op
- Use -fsycl-targets=spir64_gen with -device instead of --offload-arch. The existing --offload-arch flag silently fell back to JIT on newer GPUs.
- Drop the -ze-intel-greater-than-4GB-buffer-required linker flag when targeting spir64_gen (incompatible with offline compilation).
- Decouple g_ggml_sycl_use_async_mem_op from the graph flag so async USM alloc/free can be used on the non-graph reorder staging path (controlled via the GGML_SYCL_USE_ASYNC_MEM_OP env var, default on).
sycl: support non-contiguous input in PAD op
- Replace the ggml_is_contiguous(src0) assertion with proper strided addressing (nb0/nb1/nb2/nb3), so PAD works on views produced by reshape / permute without a preceding contiguous copy.
sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET
- GGML_OP_SET_ROWS_FILL: fill tensor with a constant
- GGML_OP_CUMSUM: inclusive prefix-sum (work-group scan + fixup)
- GGML_OP_DIAG: extract / create diagonal
- GGML_OP_SOLVE_TRI: triangular solve (column-major forward substitution)
- GGML_OP_SSM_SCAN: selective state-space scan (Mamba)
- GGML_OP_GATED_DELTA_NET: already present; add missing include guard
- Register the ops in ggml_backend_sycl_device_supports_op and the dispatch table.
sycl: route small f32 matmuls to oneMKL, bypass oneDNN
- Small f32 matmuls call oneapi::mkl::blas::gemm directly instead of going through oneDNN's DnnlGemmWrapper. oneDNN's planning overhead dominates at small sizes; the direct MKL path avoids it.
Testing
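The strided addressing described for the PAD commit can be illustrated with a small host-side sketch. This is not the SYCL kernel from the PR: View4D and pad_strided are hypothetical names introduced here, the sketch assumes f32 data and zero padding appended at the end of each dimension, and it runs on the CPU only. It shows how byte strides (nb0..nb3) locate each source element, so a permuted view can be padded without first making a contiguous copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 4-D tensor view (not ggml's own struct).
// ne[d] is the element count along dimension d; nb[d] is the stride in BYTES.
struct View4D {
    const std::uint8_t* data;
    std::int64_t ne[4];
    std::size_t  nb[4];
};

// Copy a possibly non-contiguous f32 view into a contiguous, zero-padded
// destination of shape dst_ne. Assumes padding is appended at the end of
// each dimension.
std::vector<float> pad_strided(const View4D& src, const std::int64_t (&dst_ne)[4]) {
    for (int d = 0; d < 4; ++d) {
        assert(dst_ne[d] >= src.ne[d]);  // destination must be at least as large
    }
    const std::size_t total =
        static_cast<std::size_t>(dst_ne[0] * dst_ne[1] * dst_ne[2] * dst_ne[3]);
    std::vector<float> dst(total, 0.0f);  // padded region stays zero

    for (std::int64_t i3 = 0; i3 < src.ne[3]; ++i3) {
        for (std::int64_t i2 = 0; i2 < src.ne[2]; ++i2) {
            for (std::int64_t i1 = 0; i1 < src.ne[1]; ++i1) {
                for (std::int64_t i0 = 0; i0 < src.ne[0]; ++i0) {
                    // Source address comes from byte strides, not from an
                    // assumed contiguous layout.
                    const std::uint8_t* p = src.data
                        + static_cast<std::size_t>(i0) * src.nb[0]
                        + static_cast<std::size_t>(i1) * src.nb[1]
                        + static_cast<std::size_t>(i2) * src.nb[2]
                        + static_cast<std::size_t>(i3) * src.nb[3];
                    float v;
                    std::memcpy(&v, p, sizeof v);
                    // Destination is contiguous, so a plain row-major index works.
                    const std::size_t di = static_cast<std::size_t>(
                        ((i3 * dst_ne[2] + i2) * dst_ne[1] + i1) * dst_ne[0] + i0);
                    dst[di] = v;
                }
            }
        }
    }
    return dst;
}
```

For example, a 3x2 buffer viewed as its transpose (ne = {2, 3, 1, 1}, nb = {3 * sizeof(float), sizeof(float), ...}) can be passed directly; a contiguity assertion would have rejected that view.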
Benchmark Results
Hardware: Intel® Arc™ Pro B70 (BMG-G31, 32 GB GDDR6, Xe2-HPG, 256 EUs)
Container: intel/llm-scaler-vllm:0.14.0-b7.1 (oneAPI 2025.2.2)
Quantization: Q4_K_M unless noted. All runs: -r 3 -ngl 99 -fa 1.
All models: pp512 prefill + tg128 decode (tok/s)
Baseline ("AOT+F16") is an unpatched build with -DGGML_SYCL_F16=ON -DGGML_SYCL_DEVICE_ARCH=bmg-g31. "Optimized" applies all patches; the four in this PR account for the majority of the gain: the PAD fix (commit 2) drives the Qwen3.5-9B outlier, new ops (commit 3) deliver ~20% decode uplift, and oneMKL routing (commit 4) adds ~20% prefill.
Llama-3.1-8B Q4_K_M: Context scaling (prefill tok/s)
Where gains come from
Validation
test-backend-ops passes for all affected ops (MUL_MAT, PAD, FILL, CUMSUM, DIAG, SSM_SCAN, GATED_DELTA_NET)
To Reproduce
Please follow the detailed reproduction steps in our public fork README.