UPSTREAM PR #22066: sycl: Battlemage (BMG) optimizations — AOT, Q5_K reorder, PAD stride fix, new ops, oneMKL routing by loci-dev · Pull Request #1357 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-18T02:17:52Z

Note

Source pull request: ggml-org/llama.cpp#22066

Overview

A set of four independent SYCL improvements developed on Intel® Arc™ (Xe2-HPG / Battlemage) GPUs and related Intel hardware (e.g. B70, B60, B50, PVC, B580, A770). Together they fix AOT compilation for BMG targets, add missing op support, correct a stride bug in PAD, and improve small-matmul dispatch.

Authors

Commits

sycl: Battlemage AOT + reorder MMVQ/dequant + async mem-op
- Fix AOT (ahead-of-time) compilation for Battlemage by switching to -fsycl-targets=spir64_gen with -device instead of --offload-arch. The existing --offload-arch flag silently fell back to JIT on newer GPUs.
- Skip the -ze-intel-greater-than-4GB-buffer-required linker flag when targeting spir64_gen (incompatible with offline compilation).
- Add Q5_K reorder support for MMVQ and dequantize paths (reorder kernel, vec_dot, dequantize).
- Decouple g_ggml_sycl_use_async_mem_op from the graph flag so async USM alloc/free can be used on the non-graph reorder staging path (controlled via GGML_SYCL_USE_ASYNC_MEM_OP env var, default on).
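
As a rough illustration of the decoupling in the last bullet, the sketch below reads the GGML_SYCL_USE_ASYNC_MEM_OP environment variable with a default of on and keeps the result independent of the graph flag. The helper `env_int_or`, the struct `sycl_runtime_flags`, and the `device_supports_async_alloc` parameter are hypothetical names chosen for this example, not the identifiers used in the patch.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse an integer environment variable, returning
// `fallback` when the variable is unset, empty, or not a number.
static int env_int_or(const char * name, int fallback) {
    assert(name != nullptr);
    const char * raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    return end == raw ? fallback : static_cast<int>(value);
}

// Hypothetical container for the two flags described in the commit message.
struct sycl_runtime_flags {
    bool use_graph;
    bool use_async_mem_op;
};

// Assumption: async alloc/free only needs device support and the env var,
// so it stays available even when graphs are disabled.
static sycl_runtime_flags make_flags(bool graph_enabled, bool device_supports_async_alloc) {
    sycl_runtime_flags flags{};
    flags.use_graph        = graph_enabled;
    flags.use_async_mem_op = device_supports_async_alloc &&
                             env_int_or("GGML_SYCL_USE_ASYNC_MEM_OP", 1) != 0;
    return flags;
}
```

With the variable unset, `make_flags(false, true)` leaves async mem-ops enabled even though graphs are off; setting GGML_SYCL_USE_ASYNC_MEM_OP=0 disables them regardless of the graph setting.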
sycl: support non-contiguous input in PAD op
- Replace the existing ggml_is_contiguous(src0) assertion with proper strided addressing (nb0/nb1/nb2/nb3), so PAD works on views produced by reshape / permute without a preceding contiguous copy.
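
The strided addressing can be pictured with a CPU reference, shown here as a minimal sketch rather than the SYCL kernel from the patch. It assumes f32 data, zero padding at the end of each dimension, and a dense destination; `tensor_view` and `pad_strided` are hypothetical names, with `ne` holding element counts and `nb` holding byte strides in the ggml sense.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified view of a 4-D f32 tensor:
// ne[d] = number of elements in dimension d, nb[d] = byte stride of dimension d.
struct tensor_view {
    const char * data;
    int64_t      ne[4];
    size_t       nb[4];
};

// Zero-pad `src` at the end of each dimension into a dense buffer of shape dst_ne.
// Source elements are located through the byte strides, so permuted or otherwise
// non-contiguous views work without a prior contiguous copy.
static std::vector<float> pad_strided(const tensor_view & src, const int64_t dst_ne[4]) {
    for (int d = 0; d < 4; ++d) {
        assert(dst_ne[d] >= src.ne[d]);
    }
    const size_t total = static_cast<size_t>(dst_ne[0] * dst_ne[1] * dst_ne[2] * dst_ne[3]);
    std::vector<float> dst(total, 0.0f);  // padding region stays zero

    for (int64_t i3 = 0; i3 < src.ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < src.ne[2]; ++i2) {
            for (int64_t i1 = 0; i1 < src.ne[1]; ++i1) {
                for (int64_t i0 = 0; i0 < src.ne[0]; ++i0) {
                    const char * p = src.data
                        + static_cast<size_t>(i0) * src.nb[0]
                        + static_cast<size_t>(i1) * src.nb[1]
                        + static_cast<size_t>(i2) * src.nb[2]
                        + static_cast<size_t>(i3) * src.nb[3];
                    float value;
                    std::memcpy(&value, p, sizeof(value));
                    const int64_t idx = ((i3 * dst_ne[2] + i2) * dst_ne[1] + i1) * dst_ne[0] + i0;
                    dst[static_cast<size_t>(idx)] = value;
                }
            }
        }
    }
    return dst;
}
```

A transposed view of a 2×3 matrix is one case where this matters: the view swaps `nb[0]` and `nb[1]`, so indexing it as if it were contiguous would read the wrong elements.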
sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET
- Cover six SYCL ops required by Mamba / SSM and newer model architectures (five new implementations plus one fix to an existing op):
  - GGML_OP_FILL: fill tensor with a constant
  - GGML_OP_CUMSUM: inclusive prefix-sum (work-group scan + fixup)
  - GGML_OP_DIAG: extract / create diagonal
  - GGML_OP_SOLVE_TRI: triangular solve (column-major forward substitution)
  - GGML_OP_SSM_SCAN: selective state-space scan (Mamba)
  - GGML_OP_GATED_DELTA_NET: already present; add missing include guard
- Register all ops in ggml_backend_sycl_device_supports_op and the dispatch table.
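
For readers unfamiliar with two of these ops, the following CPU sketch illustrates the techniques named above: an inclusive prefix sum done as independent per-block scans followed by a fixup pass, and a column-oriented forward substitution for a lower-triangular system. It is an illustration under assumptions, not the SYCL kernels from the commit: `cumsum_blocked` and `solve_lower` are hypothetical names, the block size stands in for a work-group, and the row-major storage of `L` is chosen only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum in two passes. Pass 1 scans each block independently
// (what one work-group would do); pass 2 adds the total of all preceding
// blocks to every element of a block (the "fixup").
static std::vector<float> cumsum_blocked(const std::vector<float> & x, size_t block) {
    assert(block > 0);
    std::vector<float> out(x.size());
    std::vector<float> block_total;

    for (size_t begin = 0; begin < x.size(); begin += block) {
        const size_t end = std::min(begin + block, x.size());
        float acc = 0.0f;
        for (size_t i = begin; i < end; ++i) {
            acc += x[i];
            out[i] = acc;
        }
        block_total.push_back(acc);
    }

    float carry = 0.0f;  // sum of all blocks before the current one
    for (size_t b = 0; b < block_total.size(); ++b) {
        const size_t begin = b * block;
        const size_t end   = std::min(begin + block, x.size());
        for (size_t i = begin; i < end; ++i) {
            out[i] += carry;
        }
        carry += block_total[b];
    }
    return out;
}

// Solve L * x = b for lower-triangular L (n x n, stored row-major here as an
// assumption). Column-oriented sweep: once x[j] is known, its contribution is
// removed from every remaining right-hand-side entry.
static std::vector<float> solve_lower(const std::vector<float> & L, std::vector<float> b, size_t n) {
    assert(L.size() == n * n && b.size() == n);
    for (size_t j = 0; j < n; ++j) {
        b[j] /= L[j * n + j];
        for (size_t i = j + 1; i < n; ++i) {
            b[i] -= L[i * n + j] * b[j];
        }
    }
    return b;  // b now holds x
}
```

The two-pass structure is what makes the scan parallelisable: the per-block scans have no dependencies on each other, and only the small array of block totals has to be combined serially.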
sycl: route small f32 matmuls to oneMKL, bypass oneDNN
- For f32×f32 GEMMs below 256³ FLOPs, call oneapi::mkl::blas::gemm directly instead of going through oneDNN's DnnlGemmWrapper. oneDNN's planning overhead dominates at small sizes; the direct MKL path avoids it.
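
The dispatch rule can be sketched as a size predicate. This is a minimal sketch that assumes the threshold is applied to the product M·N·K compared against 256³; the description above does not spell out the exact comparison, and `gemm_path`, `k_small_gemm_limit`, and `choose_f32_gemm_path` are hypothetical names. The actual oneMKL and oneDNN calls are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the two back ends an f32 x f32 GEMM can be sent to.
enum class gemm_path { onemkl_direct, onednn };

// Assumed threshold: 256^3, compared against the problem volume M * N * K.
constexpr int64_t k_small_gemm_limit = int64_t(256) * 256 * 256;

// Pick the direct oneMKL path for small problems, where planning overhead
// would dominate, and keep oneDNN for everything else.
static gemm_path choose_f32_gemm_path(int64_t m, int64_t n, int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return (m * n * k < k_small_gemm_limit) ? gemm_path::onemkl_direct
                                            : gemm_path::onednn;
}
```

Under this assumption a 64×64×64 product takes the direct path, while 256×256×256 and anything larger stays on oneDNN.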

Testing

Benchmark Results

Hardware: Intel® Arc™ Pro B70 (BMG-G31, 32 GB GDDR6, Xe2-HPG, 256 EUs)
Container: intel/llm-scaler-vllm:0.14.0-b7.1 (oneAPI 2025.2.2)
Quantization: Q4_K_M unless noted. All runs: -r 3 -ngl 99 -fa 1.

All models — pp512 prefill + tg128 decode (tok/s)

Model	AOT+F16 pp512	Optimized pp512	Speedup	AOT+F16 tg128	Optimized tg128	Speedup
Llama-3.2-3B	5586	7118	1.27×	141	165	1.17×
Llama-3.1-8B	2813	3400	1.21×	80	88	1.10×
Qwen3.5-9B	353	947	2.68×	50	72	1.43×
Gemma-2-9B	2089	2680	1.28×	58	68	1.18×
Mistral-Nemo-12B	1843	2180	1.18×	54	60	1.13×
Qwen2.5-14B	1462	1728	1.18×	43	48	1.13×

Baseline ("AOT+F16") is an unpatched build with -DGGML_SYCL_F16=ON -DGGML_SYCL_DEVICE_ARCH=bmg-g31. "Optimized" applies all patches; the four in this PR account for the majority of the gain — the PAD fix (commit 2) drives the Qwen3.5-9B outlier, new ops (commit 3) deliver ~20% decode uplift, and oneMKL routing (commit 4) adds ~20% prefill.

Llama-3.1-8B Q4_K_M — Context scaling (prefill tok/s)

Context	AOT+F16	Optimized	Speedup
pp512	2810	3402	1.21×
pp2048	2392	2843	1.19×
pp4096	1960	2331	1.19×
pp8192	1462	1709	1.17×

Where gains come from

Source	Prefill	Decode
PAD stride fix (eliminates CPU fallbacks)	up to 2.68×	—
New ops (eliminate CPU↔GPU transfers)	—	~1.20×
oneMKL small matmul routing	~1.20×	—
Q5_K reorder + MMVQ SWAR	—	+5–9%

Validation

test-backend-ops passes for all affected ops (MUL_MAT, PAD, FILL, CUMSUM, DIAG, SSM_SCAN, GATED_DELTA_NET)
Batch-1 generation correctness verified
No regression on existing quant types (Q4_0, Q4_K, Q6_K, Q8_0)

To Reproduce

Please follow the detailed reproduction steps in the README of our public fork.

I have read and agree with the contributing guidelines
AI usage disclosure: Yes. This work was partially produced with an agentic engineering approach: agents surface issues and explore experiments while engineers identify and reject candidates using domain knowledge. Human feedback involved.

loci-review · 2026-04-18T03:08:38Z

No meaningful performance changes were detected across 46869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize.


aicss-genai and others added 4 commits April 17, 2026 17:12

sycl: Battlemage AOT + reorder MMVQ/dequant + async mem-op

0fa3a0b

sycl: support non-contiguous input in PAD op

08a946e

sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

493785d

sycl: route small f32 matmuls to oneMKL, bypass oneDNN

204151c

loci-dev temporarily deployed to PROD__AL_DEMO April 18, 2026 02:17 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19

loci-dev wants to merge 4 commits into main from loci/pr-22066-sycl-bmg-upstream-pr
