- Olenik et al. 2024, Towards a platform-portable linear algebra backend for OpenFOAM, Meccanica, doi:10.1007/s11012-024-01806-1 → Defines the OGL design and the KIT recommendation "2× MPI subdomains per GPU". We tested with `ranksPerGPU 8` (single GPU, 8 ranks) per this guidance.
- Tsai et al. 2023, Providing performance portable numerics for Intel GPUs, Wiley CCPE, doi:10.1002/cpe.7400 → Documents ParIC/ParILU/ParICT/ISAI work on DPC++. Earlier versions of this repo claimed a discrepancy with the paper; that was wrong. Per Ginkgo team feedback (issue #2013), `ParIc`/`ParIlu` factorization does work on SYCL. The gap we hit on Battlemage is on the apply side: `lower_trs`/`upper_trs` kernels are missing in `dpcpp/solver/`, and `ParIct::add_candidates` SIGABRTs. The classic `Ic`/`Ilu` (sparselib-based) is genuinely not in SYCL. See findings/05 for the corrected mapping.
- Anzt et al. 2022, Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing, ACM TOMS, doi:10.1145/3480935 → Architecture / executor model that OGL builds on.
- hpsim/OGL — OpenFOAM Ginkgo Layer (GPU plugin)
- ginkgo-project/ginkgo
- intel/compute-runtime
- Bug filing planned for findings/13 (resource_info abort with multi-rank OGL)
- PMZFX/intel-arc-pro-b70-benchmarks https://github.com/PMZFX/intel-arc-pro-b70-benchmarks → Independent B70 Pro pioneer for LLM inference. Upstreamed Q8_0 SYCL fix (PRs #21527 / #21638 in llama.cpp), achieving 3.1× speedup. → Validates our broader observation that Battlemage SYCL kernels need targeted fixes per workload; this is not a generic driver/compiler issue.
- llama.cpp Issue #21517 ggml-org/llama.cpp#21517 → "Update from CR 26.05 to 26.09 did not improve performance; issue is in kernel code, not driver." Same pattern as our findings/13: driver updates alone do not solve the per-workload software-stack problems.
- Intel Arc Pro B70 Linux Benchmarks (Phoronix) → Reference benchmarks on the same hardware for non-CFD workloads (rendering, video, ML inference). Useful for hardware sanity-check comparisons.
- OGL/Ginkgo recommended fvSolution patterns → Source for the SPD-preconditioner `scaling -1.0` requirement we tested in findings/15.
- Intel Compute Runtime release notes
- Ginkgo release notes
- oneAPI Base Toolkit notes
Standalone cross-stack SpMV/CG diagnostic on Intel Arc Pro B70 (BMG-G31),
Ubuntu 26.04 LTS, oneAPI 2025.3.3 / 2026.0, comparing oneMKL Sparse,
PETSc aijkokkos, and Ginkgo dpcpp on an identical 1M-row Poisson 5-point
reference matrix (4.996M nnz).
Method. The generator `gen_matrix.cpp` writes the 5-point Poisson matrix
for a 1000×1000 grid (1M rows) in MatrixMarket format. Three test
harnesses load the matrix and run 1000 SpMV iterations after 10 warm-up
calls. Timing brackets the inner loop only; the CG-loop number also
includes the vector operations and a sync per iteration.
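The generator step can be sketched as follows. This is a minimal illustration, not the repo's `gen_matrix.cpp`: the function names, the stencil values (4 on the diagonal, -1 off the diagonal), and the choice of the `general` rather than `symmetric` MatrixMarket layout are all assumptions. What it does reproduce is the nonzero count: an n×n grid gives 5n² − 4n entries, which is 4,996,000 for n = 1000.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Nonzeros of the 5-point Laplacian on an n x n grid: five per grid point,
// minus the 4n neighbour entries that fall outside the boundary.
std::int64_t poisson5_nnz(std::int64_t n) { return 5 * n * n - 4 * n; }

// Write the matrix in MatrixMarket coordinate format (1-based indices).
// Stencil values 4 / -1 are an assumed convention. Returns entries written.
std::int64_t write_poisson5(std::ostream& out, std::int64_t n) {
    assert(n > 0);
    out << "%%MatrixMarket matrix coordinate real general\n";
    out << n * n << ' ' << n * n << ' ' << poisson5_nnz(n) << '\n';
    std::int64_t written = 0;
    auto emit = [&](std::int64_t row, std::int64_t col, double v) {
        out << row + 1 << ' ' << col + 1 << ' ' << v << '\n';
        ++written;
    };
    for (std::int64_t i = 0; i < n; ++i) {
        for (std::int64_t j = 0; j < n; ++j) {
            const std::int64_t row = i * n + j;      // row-major grid index
            if (i > 0) emit(row, row - n, -1.0);     // neighbour above
            if (j > 0) emit(row, row - 1, -1.0);     // neighbour left
            emit(row, row, 4.0);                     // diagonal
            if (j + 1 < n) emit(row, row + 1, -1.0); // neighbour right
            if (i + 1 < n) emit(row, row + n, -1.0); // neighbour below
        }
    }
    return written;
}

// Convenience wrapper for small in-memory checks.
std::string poisson5_to_string(std::int64_t n) {
    std::ostringstream os;
    write_poisson5(os, n);
    return os.str();
}
```

For the full-size case one would pass a `std::ofstream` and `n = 1000`; the small in-memory wrapper is only meant for checking the header and entry count on tiny grids.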
Hardware: Intel Arc Pro B70, 32 GB GDDR6, BMG-G31 (device 0xe223).
Software: oneAPI 2025.3.3 for PETSc β5h2, oneAPI 2026.0 for Ginkgo
(/opt/ginkgo linked against libsycl.so.9).
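The timing protocol from the Method paragraph (10 warm-up calls, then 1000 timed iterations, with the clock bracketing only the inner loop) can be sketched generically as below. The helper name `time_ms_per_iter` and the explicit `sync` callback are assumptions for illustration; the actual harnesses may structure this differently.

```cpp
#include <cassert>
#include <chrono>

// Returns the mean wall-clock time per call of `kernel`, in milliseconds.
// `sync` must block until all previously submitted device work is done;
// without it, asynchronous launches would be timed instead of execution.
template <typename Kernel, typename Sync>
double time_ms_per_iter(Kernel&& kernel, Sync&& sync,
                        int warmup = 10, int iters = 1000) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) kernel();  // untimed warm-up calls
    sync();                                     // drain warm-up work
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) kernel();   // timed inner loop
    sync();                                     // wait for the timed work
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = t1 - t0;
    return elapsed.count() / iters;
}
```

With a SYCL queue `q`, `sync` would typically be a lambda that calls `q.wait()`. A CG-loop measurement would instead synchronize inside `kernel` on every iteration, which is one reason its number is not directly comparable to a pure SpMV figure.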
Results.
| Stack | ms/iter | Effective BW |
|---|---|---|
| oneMKL Sparse CG (full loop) | 0.741 | 161 GB/s |
| PETSc aijkokkos (pure SpMV) | 0.287 | 418 GB/s (79 % Triad) |
| Ginkgo dpcpp (pure SpMV) | 0.089 | 1340 GB/s* |
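* The input vector x is cache-resident (8 MB fits in the B70 L2, ≈ 12 MB).
The reported BW is an arithmetic figure (bytes counted per iteration
divided by time), not measured DRAM traffic; the physical peak is 608 GB/s.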
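The "Effective BW" column can be reproduced with a short calculation. The source does not state how bytes per iteration were counted; the table values are consistent with roughly 24 bytes per stored nonzero (about 120 MB per SpMV for 4.996M nnz), so that figure is used here as an assumption, not as the documented method. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed traffic model: a fixed byte count per stored nonzero.
// The default of 24 bytes is inferred from the table, not from the source.
double assumed_bytes_per_iter(double nnz, double bytes_per_nnz = 24.0) {
    return nnz * bytes_per_nnz;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one iteration that
// moves `bytes_per_iter` bytes in `ms_per_iter` milliseconds.
double effective_bw_gbs(double bytes_per_iter, double ms_per_iter) {
    assert(ms_per_iter > 0.0);
    return bytes_per_iter / (ms_per_iter * 1e-3) / 1e9;
}
```

Under this model the PETSc row (0.287 ms) comes out near 418 GB/s, while the Ginkgo row (0.089 ms) lands well above the 608 GB/s physical peak. That is only possible if part of the counted traffic is served from cache, which is what the footnote describes.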
Caveat. This is an SpMV-only microbenchmark. The Ginkgo number reflects cache effects that shrink for larger systems. Its diagnostic value is to confirm that the B70 hardware is functional for sparse linear algebra, which indicates that the AMG wall in the sister repo is a software bug, not a hardware limitation.
Logs: logs/diag-2026-05-10/ (gzipped).
Cross-stack interpretation: see findings 23-26 (PETSc repo) and finding 23 (Ginkgo repo) for the symmetric write-up.