References

Primary Papers

  • Olenik et al. 2024, Towards a platform-portable linear algebra backend for OpenFOAM, Meccanica, doi:10.1007/s11012-024-01806-1 → Defines the OGL design and the KIT recommendation "2× MPI subdomains per GPU". We tested with ranksPerGPU 8 (single GPU, 8 ranks) per this guidance.

  • Tsai et al. 2023, Providing performance portable numerics for Intel GPUs, Wiley CCPE, doi:10.1002/cpe.7400 → Documents the ParIC / ParILU / ParICT / ISAI work on DPC++. Earlier versions of this repo claimed a discrepancy with the paper; that claim was wrong. Per Ginkgo team feedback (issue #2013), ParIc/ParIlu factorization does work on SYCL. The gap we hit on Battlemage is on the apply side: the lower_trs / upper_trs kernels are missing in dpcpp/solver/, and ParIct::add_candidates aborts with SIGABRT. The classic Ic/Ilu (sparselib-based) is genuinely not available in SYCL. See findings/05 for the corrected mapping.

  • Anzt et al. 2022, Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing, ACM TOMS, doi:10.1145/3480935 → Architecture / executor model that OGL builds on.

OGL / Ginkgo Upstream

Related Battlemage Pioneer Work

  • PMZFX/intel-arc-pro-b70-benchmarks https://github.com/PMZFX/intel-arc-pro-b70-benchmarks → Independent B70 Pro pioneer for LLM inference. Upstreamed Q8_0 SYCL fix (PRs #21527 / #21638 in llama.cpp), achieving 3.1× speedup. → Validates our broader observation that Battlemage SYCL kernels need targeted fixes per workload — not a generic driver/compiler issue.

  • llama.cpp Issue #21517 ggml-org/llama.cpp#21517 → "Update from CR 26.05 to 26.09 did not improve performance — issue is in kernel code, not driver." Same pattern as our findings/13: driver updates alone do not solve the per-workload software-stack problems.

Phoronix Hardware Reviews

Related Hardware/Software Documentation


Hardware Diagnostic Run — 2026-05-10

Standalone cross-stack SpMV/CG diagnostic on Intel Arc Pro B70 (BMG-G31), Ubuntu 26.04 LTS, oneAPI 2025.3.3 / 2026.0, comparing oneMKL Sparse, PETSc aijkokkos, and Ginkgo dpcpp on an identical 1M-row Poisson 5-point reference matrix (4.996M nnz).
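The 4.996M nnz figure follows from the stencil: an N×N grid gives N² rows with five entries each, minus one missing neighbour for each of the 4N boundary-edge points, so nnz = 5N² − 4N. The sketch below checks that count and writes the matrix in MatrixMarket coordinate format. It is a hypothetical stand-in, not the repository's actual generator: the function names are invented here, and the entry values (4 on the diagonal, −1 off the diagonal) are an assumption, since the text above does not state them.

```python
import io


def poisson5_nnz(n):
    """Nonzeros of the 5-point Laplacian on an n x n grid.

    Five entries per row, minus one missing neighbour for each of the
    4n boundary-edge points.
    """
    return 5 * n * n - 4 * n


def write_poisson5(out, n):
    """Write the matrix in MatrixMarket coordinate format (1-based indices).

    ASSUMPTION: diagonal entries are 4.0 and off-diagonal entries are -1.0.
    Returns the number of entries written.
    """
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{n * n} {n * n} {poisson5_nnz(n)}\n")
    written = 0
    for i in range(n):
        for j in range(n):
            row = i * n + j
            entries = [(row, 4.0)]
            if i > 0:
                entries.append((row - n, -1.0))  # neighbour in previous grid row
            if j > 0:
                entries.append((row - 1, -1.0))  # left neighbour
            if j + 1 < n:
                entries.append((row + 1, -1.0))  # right neighbour
            if i + 1 < n:
                entries.append((row + n, -1.0))  # neighbour in next grid row
            for col, val in sorted(entries):
                out.write(f"{row + 1} {col + 1} {val}\n")
                written += 1
    return written


if __name__ == "__main__":
    buf = io.StringIO()
    print("entries written for a 4x4 grid:", write_poisson5(buf, 4))
    print("nnz for a 1000x1000 grid:", poisson5_nnz(1000))
```

Writing to a file object rather than a fixed path keeps the sketch testable on a small grid before producing the full 1M-row file.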

Method. Generator gen_matrix.cpp writes the 5-point Poisson matrix for a 1000×1000 grid (10⁶ rows) in MatrixMarket format. Three test harnesses load the matrix and run 1000 SpMV iterations after 10 warm-up calls. Timing brackets the inner loop only; the CG-loop figure additionally includes vector operations and a synchronization per iteration.
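The timing protocol (warm-up calls first, then a timer around the inner loop only) can be sketched on the CPU as follows. This is a pure-Python illustration, not one of the three harnesses: build_poisson5_csr, spmv and time_spmv are hypothetical names, the matrix values (4 / −1) are assumed, and the grid is kept small because an interpreted loop is far slower than any of the measured stacks.

```python
import time


def build_poisson5_csr(n):
    """CSR arrays (row_ptr, col_idx, vals) for the 5-point Laplacian on an
    n x n grid. ASSUMPTION: diagonal 4.0, off-diagonal -1.0."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j in range(n):
            row = i * n + j
            cols = [row]
            if i > 0:
                cols.append(row - n)
            if j > 0:
                cols.append(row - 1)
            if j + 1 < n:
                cols.append(row + 1)
            if i + 1 < n:
                cols.append(row + n)
            for col in sorted(cols):
                col_idx.append(col)
                vals.append(4.0 if col == row else -1.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


def time_spmv(csr, x, warmup=10, iters=1000):
    """Return (milliseconds per iteration, last result vector)."""
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        spmv(*csr, x)
    start = time.perf_counter()
    for _ in range(iters):  # the timer brackets the inner loop only
        y = spmv(*csr, x)
    # On a GPU backend, kernel launches are asynchronous, so the device
    # would have to be synchronized here before reading the clock.
    elapsed = time.perf_counter() - start
    return 1e3 * elapsed / iters, y


if __name__ == "__main__":
    n = 32  # small grid; the diagnostic itself uses 1000
    csr = build_poisson5_csr(n)
    ms, _ = time_spmv(csr, [1.0] * (n * n), warmup=10, iters=100)
    print(f"{ms:.3f} ms/iter")
```

Applying the matrix to a vector of ones is a cheap sanity check: each row sums to 4 minus its neighbour count, so interior rows give 0, edge rows 1 and corner rows 2.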

Hardware: Intel Arc Pro B70, 32 GB GDDR6, BMG-G31 (device 0xe223). Software: oneAPI 2025.3.3 for PETSc β5h2, oneAPI 2026.0 for Ginkgo (/opt/ginkgo linked against libsycl.so.9).

Results.

| Stack | ms/iter | Effective BW |
|---|---|---|
| oneMKL Sparse CG (full loop) | 0.741 | 161 GB/s |
| PETSc aijkokkos (pure SpMV) | 0.287 | 418 GB/s (79 % Triad) |
| Ginkgo dpcpp (pure SpMV) | 0.089 | 1340 GB/s* |

* Cache-resident x (8 MB fits in B70 L2 ≈ 12 MB). The reported bandwidth is an arithmetic figure (modelled bytes divided by time), not measured DRAM traffic; the physical peak is 608 GB/s.
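The arithmetic behind the bandwidth column can be sketched as modelled bytes divided by time per iteration. The byte model is not stated in this document; the 24 bytes per nonzero used below is an assumption, inferred only because it reproduces the three table values to within about 1 %. Treat it as a plausibility check rather than as the harnesses' actual accounting.

```python
NNZ = 4_996_000      # nonzeros of the reference matrix (from the text)
BYTES_PER_NNZ = 24   # ASSUMPTION: inferred from the table, not stated in the source
PEAK_GBPS = 608.0    # physical peak bandwidth quoted in the footnote

# ms/iter values copied from the results table.
TIMINGS_MS = {
    "oneMKL Sparse CG (full loop)": 0.741,
    "PETSc aijkokkos (pure SpMV)": 0.287,
    "Ginkgo dpcpp (pure SpMV)": 0.089,
}


def effective_bw_gbps(ms_per_iter, nnz=NNZ, bytes_per_nnz=BYTES_PER_NNZ):
    """Effective bandwidth in GB/s: modelled bytes per iteration / time."""
    return nnz * bytes_per_nnz / (ms_per_iter * 1e-3) / 1e9


if __name__ == "__main__":
    for name, ms in TIMINGS_MS.items():
        bw = effective_bw_gbps(ms)
        # A ratio above 1.0 cannot come from DRAM alone, which points to cache reuse.
        print(f"{name:30s} {bw:8.1f} GB/s  {bw / PEAK_GBPS:5.2f} x peak")
```

Under this model the Ginkgo figure comes out at more than twice the 608 GB/s peak, which is consistent with the footnote's explanation that x stays resident in L2.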

Caveat. This is an SpMV-only microbenchmark. The Ginkgo number reflects cache effects that shrink for larger systems. Diagnostic value: it confirms that the B70 hardware is functional for sparse linear algebra; the AMG wall in the sister repo is a software bug, not a hardware limitation.

Logs: logs/diag-2026-05-10/ (gzipped).

Cross-stack interpretation: see findings 23-26 (PETSc repo) and finding 23 (Ginkgo repo) for the symmetric write-up.