Add ROCm/HIP GPU backend by mjwilkins18 · Pull Request #5 · cornelisnetworks/cail

mjwilkins18 · 2026-06-15T01:02:32Z

Add ROCm/HIP GPU backend

This adds an AMD ROCm/HIP reduction backend so CAIL's transparent, GPU-aware
MPI_Allreduce interposition runs on AMD GPUs, mirroring the existing CUDA
backend.

What's included

- Build: --with-rocm[=PATH] and --with-rocm-arch=<gfx> configure
  options; m4/ax_check_rocm.m4 detects hipcc, HIP headers, and the ROCm
  runtime (under lib or lib64). CUDA and ROCm are mutually exclusive,
  chosen at configure time.
- Reduction kernels (src/gpu/rocm/cail_rocm_reduce.hip): elementwise
  SUM / PROD / MIN / MAX for all CAIL-supported datatypes.
- Memory layer (src/gpu/rocm/cail_rocm_mem.c): HIP device-pointer
  detection and staging copies. The staging device-to-device copy is followed
  by hipStreamSynchronize(0) so the staged data is ordered before it is
  handed to GPU-aware MPI (which reads it from a separate engine). The fence is
  scoped to the copy's default stream rather than the whole device, so
  unrelated device work is not serialized.
- Tests: a single-source compatibility shim (tests/test_gpu_compat.h)
  builds the GPU correctness/bench tests for either CUDA or HIP.
- Build glue: hip_lt.sh libtool wrapper for linking .hip objects.
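
The reduction kernels apply one operator per element. The .hip source is not shown in this description, so the following is only a host-side C++ sketch of the elementwise SUM / PROD / MIN / MAX semantics, the kind of reference a correctness test could compare device results against. ReduceOp, combine, and reduce_elementwise are hypothetical names chosen for illustration, not CAIL identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical operator tag; CAIL's real enum is not shown in this PR.
enum class ReduceOp { Sum, Prod, Min, Max };

// Combine two elements according to the selected operator.
template <typename T>
T combine(ReduceOp op, T a, T b) {
    switch (op) {
        case ReduceOp::Sum:  return a + b;
        case ReduceOp::Prod: return a * b;
        case ReduceOp::Min:  return std::min(a, b);
        case ReduceOp::Max:  return std::max(a, b);
    }
    throw std::invalid_argument("unknown reduce op");
}

// Elementwise in-place reduction: inout[i] = op(inout[i], in[i]).
// A device kernel would typically map one index i to one thread;
// a plain loop stands in for that here.
template <typename T>
void reduce_elementwise(ReduceOp op, const T* in, T* inout, std::size_t count) {
    assert(count == 0 || (in != nullptr && inout != nullptr));
    for (std::size_t i = 0; i < count; ++i) {
        inout[i] = combine(op, inout[i], in[i]);
    }
}
```

Calling reduce_elementwise(ReduceOp::Sum, a, b, n) leaves the elementwise sum in b and does not modify a. Templating on the element type is one way to cover several datatypes from a single definition; whether the actual kernels are structured this way is an assumption.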

Verification

Tested on AMD MI210 (gfx90a), ROCm 6.4.2, OpenMPI 5.0.x + libfabric:

- Correctness: 160/160 for np=2/3/4 across all algorithms
  (recursive_doubling, ring, rabenseifner), single-node and 2-node.
- OSU osu_allreduce interpose (LD_PRELOAD), CPU and ROCm device-to-device.

Performance

Experiments showed a significant speedup over native for messages above the
small-message cutoff, where the GPU-aware reduction path dominates. Below the
cutoff, native is faster because interposition and GPU-staging overhead
dominate at small sizes. The CPU path is within noise of native, as expected.

Notes

No changes to the CPU path or existing CUDA backend behavior.

Add an AMD ROCm/HIP reduction backend to CAIL alongside the existing CUDA
path, so GPU-aware MPI_Allreduce interposition works on AMD GPUs.

- configure: --with-rocm / --with-rocm-arch; m4/ax_check_rocm.m4 probes
  hipcc, headers, and runtime; CUDA and ROCm backends are mutually
  exclusive and selected at configure time.
- src/gpu/rocm: cail_rocm_mem.c (device-pointer detection,
  hipMalloc/hipMemcpy with a post-copy hipStreamSynchronize(0) release
  fence so the staged buffer is visible to a subsequent GPU-aware MPI /
  NIC / GDRCopy read; stream-scoped so unrelated device work is not
  serialized) and cail_rocm_reduce.hip (elementwise SUM/PROD/MIN/MAX
  kernels for all supported datatypes).
- tests: single-source GPU tests via tests/test_gpu_compat.h, compiled for
  CUDA or HIP from one source. The GPU correctness test sweeps a small set
  of element counts {1, 2, 3, base, base+1, base+3} to cover edge and
  non-power-of-two remainder paths in addition to the requested size.
- hip_lt.sh: libtool shim so .hip sources link cleanly under the autotools
  build.

Verified on AMD MI210 (gfx90a), ROCm 6.4.2: correctness 160/160 per count
across np=2/3/4 and all algorithms (recursive_doubling, ring,
rabenseifner), single-node and 2-node, plus OSU osu_allreduce interpose
(CPU and ROCm device-to-device). GPU-aware allreduce shows a significant
speedup over native for messages above the small-message cutoff, with
native faster for small messages.
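
The element-count sweep {1, 2, 3, base, base+1, base+3} used by the GPU correctness test can be sketched as a small helper. This is a minimal illustration, not the code in the test: sweep_counts is a hypothetical name, and sorting plus de-duplicating the list (so a small base does not repeat a count) is an assumption that the PR does not state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build the element counts to test for a requested
// size `base`. The tiny counts exercise edge cases; base+1 and base+3
// exercise remainder paths when base is a power of two.
std::vector<std::size_t> sweep_counts(std::size_t base) {
    assert(base >= 1);  // assumption: a zero-element request is not swept
    std::vector<std::size_t> counts = {1, 2, 3, base, base + 1, base + 3};
    // Assumption: run each distinct count once, in ascending order.
    std::sort(counts.begin(), counts.end());
    counts.erase(std::unique(counts.begin(), counts.end()), counts.end());
    return counts;
}
```

For a requested size of 1024 this yields 1, 2, 3, 1024, 1025, 1027; for a requested size of 2 the overlapping entries collapse to 1, 2, 3, 5.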

mjwilkins18 force-pushed the mjwilkins18/rocm-backend branch 2 times, most recently from 70f660b to 8227dbb (June 16, 2026 02:32)

mjwilkins18 force-pushed the mjwilkins18/rocm-backend branch from 8227dbb to a67328c (June 18, 2026 13:54)

mjwilkins18 wants to merge 1 commit into main from mjwilkins18/rocm-backend