Add ROCm/HIP GPU backend #5

Open
mjwilkins18 wants to merge 1 commit into main from mjwilkins18/rocm-backend

mjwilkins18 commented Jun 15, 2026

Collaborator

Add ROCm/HIP GPU backend

This adds an AMD ROCm/HIP reduction backend so CAIL's transparent, GPU-aware
MPI_Allreduce interposition runs on AMD GPUs, mirroring the existing CUDA
backend.

What's included

  • Build: --with-rocm[=PATH] and --with-rocm-arch=<gfx> configure
    options; m4/ax_check_rocm.m4 detects hipcc, HIP headers, and the ROCm
    runtime (under lib or lib64). CUDA and ROCm are mutually exclusive,
    chosen at configure time.
  • Reduction kernels (src/gpu/rocm/cail_rocm_reduce.hip): elementwise
    SUM / PROD / MIN / MAX for all CAIL-supported datatypes.
  • Memory layer (src/gpu/rocm/cail_rocm_mem.c): HIP device-pointer
    detection and staging copies. The staging device-to-device copy is followed
    by hipStreamSynchronize(0) so the staged data is complete and visible before
    the buffer is handed to GPU-aware MPI, which may read it through a separate
    engine (NIC or GDRCopy) rather than through the copy's stream. The fence is
    scoped to the default stream, on which the copy is issued, rather than to
    the whole device, so unrelated device work is not serialized.
  • Tests: a single-source compatibility shim (tests/test_gpu_compat.h)
    builds the GPU correctness/bench tests for either CUDA or HIP.
  • Build glue: hip_lt.sh libtool wrapper for linking .hip objects.
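
The staging copy and its stream-scoped fence can be sketched roughly as follows. This is a minimal illustration, not the code in cail_rocm_mem.c or tests/test_gpu_compat.h: the `gpu_*` macros, the `SKETCH_USE_HIP` / `SKETCH_USE_CUDA` switches, and `stage_copy` are hypothetical names, and a host-only fallback is included so the sketch builds without a GPU toolchain. Only hipMalloc, hipFree, hipMemcpy, hipStreamSynchronize and their CUDA counterparts are real runtime calls.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-source shim: the gpu_* names and SKETCH_USE_*
 * switches are illustrative, not taken from tests/test_gpu_compat.h. */
#if defined(SKETCH_USE_HIP)
#include <hip/hip_runtime_api.h>  /* plain C typically needs -D__HIP_PLATFORM_AMD__ */
#define GPU_OK                 hipSuccess
#define gpu_malloc(p, n)       hipMalloc((p), (n))
#define gpu_free(p)            hipFree(p)
#define gpu_copy_d2d(d, s, n)  hipMemcpy((d), (s), (n), hipMemcpyDeviceToDevice)
#define gpu_stream_fence()     hipStreamSynchronize(0)
#elif defined(SKETCH_USE_CUDA)
#include <cuda_runtime_api.h>
#define GPU_OK                 cudaSuccess
#define gpu_malloc(p, n)       cudaMalloc((p), (n))
#define gpu_free(p)            cudaFree(p)
#define gpu_copy_d2d(d, s, n)  cudaMemcpy((d), (s), (n), cudaMemcpyDeviceToDevice)
#define gpu_stream_fence()     cudaStreamSynchronize(0)
#else
/* Host-only fallback: "device" buffers are ordinary heap memory. */
#define GPU_OK                 0
#define gpu_malloc(p, n)       ((*(p) = malloc(n)) != NULL ? GPU_OK : 1)
#define gpu_free(p)            (free(p), GPU_OK)
#define gpu_copy_d2d(d, s, n)  (memcpy((d), (s), (n)), GPU_OK)
#define gpu_stream_fence()     (GPU_OK)
#endif

/* Copy the user buffer into the staging buffer, then fence the default
 * stream so the copy has completed before the staging buffer is handed
 * to GPU-aware MPI. Returns 0 on success, -1 on any runtime error. */
int stage_copy(void *staging, const void *user_buf, size_t nbytes)
{
    if (gpu_copy_d2d(staging, user_buf, nbytes) != GPU_OK)
        return -1;
    /* Stream-scoped fence: waits only for work queued on stream 0,
     * unlike a device-wide synchronize that waits for every stream. */
    if (gpu_stream_fence() != GPU_OK)
        return -1;
    return 0;
}
```

A caller would allocate both buffers with `gpu_malloc`, call `stage_copy(staging, user_buf, nbytes)`, and pass `staging` to MPI only after it returns 0.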

Verification

Tested on AMD MI210 (gfx90a), ROCm 6.4.2, OpenMPI 5.0.x + libfabric:

  • Correctness: 160/160 passing per element count for np=2/3/4 across all
    algorithms (recursive_doubling, ring, rabenseifner), single-node and 2-node.
  • OSU osu_allreduce with CAIL interposed via LD_PRELOAD, for both CPU
    buffers and ROCm device-to-device buffers.

Performance

In experiments, the GPU-aware reduction path showed a significant speedup over
native for messages above the small-message cutoff. Below the cutoff, native
is faster because interposition and GPU-staging overhead outweigh the
reduction cost. The CPU path is within noise of native, as expected.

Notes

  • No changes to the CPU path or existing CUDA backend behavior.

mjwilkins18 force-pushed the mjwilkins18/rocm-backend branch 2 times, most recently from 70f660b to 8227dbb on June 16, 2026 02:32
Add an AMD ROCm/HIP reduction backend to CAIL alongside the existing CUDA
path, so GPU-aware MPI_Allreduce interposition works on AMD GPUs.

- configure: --with-rocm / --with-rocm-arch; m4/ax_check_rocm.m4 probes
  hipcc, headers, and runtime; CUDA and ROCm backends are mutually exclusive
  and selected at configure time.
- src/gpu/rocm: cail_rocm_mem.c (device-pointer detection, hipMalloc/
  hipMemcpy with a post-copy hipStreamSynchronize(0) release fence so the
  staged buffer is visible to a subsequent GPU-aware MPI / NIC / GDRCopy
  read; stream-scoped so unrelated device work is not serialized) and
  cail_rocm_reduce.hip (elementwise SUM/PROD/MIN/MAX kernels for all
  supported datatypes).
- tests: single-source GPU tests via tests/test_gpu_compat.h, compiled for
  CUDA or HIP from one source. The GPU correctness test sweeps a small set
  of element counts {1, 2, 3, base, base+1, base+3} to cover edge and
  non-power-of-two remainder paths in addition to the requested size.
- hip_lt.sh: libtool shim so .hip sources link cleanly under the autotools
  build.
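
The count sweep and the elementwise semantics it checks can be sketched on the host as below. This is an illustrative sketch under assumptions, not the test source: `sk_op_t`, `sk_ref_reduce_int`, and `sk_sweep_counts` are hypothetical names, only the `int` datatype is shown, and comparing GPU results against a host reference is an assumed verification strategy.

```c
#include <stddef.h>

/* Hypothetical names throughout; only the int datatype is shown. */
typedef enum { SK_SUM, SK_PROD, SK_MIN, SK_MAX } sk_op_t;

/* Host reference for an elementwise reduction:
 * inout[i] = inout[i] (op) in[i] for every i in [0, n). */
void sk_ref_reduce_int(sk_op_t op, const int *in, int *inout, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (op) {
        case SK_SUM:  inout[i] += in[i]; break;
        case SK_PROD: inout[i] *= in[i]; break;
        case SK_MIN:  if (in[i] < inout[i]) inout[i] = in[i]; break;
        case SK_MAX:  if (in[i] > inout[i]) inout[i] = in[i]; break;
        }
    }
}

#define SK_NUM_COUNTS 6

/* Fill counts with {1, 2, 3, base, base+1, base+3}: tiny sizes for edge
 * cases, plus offsets from the requested size so that counts that do not
 * divide evenly are exercised as well. */
void sk_sweep_counts(size_t base, size_t counts[SK_NUM_COUNTS])
{
    counts[0] = 1;
    counts[1] = 2;
    counts[2] = 3;
    counts[3] = base;
    counts[4] = base + 1;
    counts[5] = base + 3;
}
```

A test loop would call `sk_sweep_counts` once with the requested size, then for each count run the allreduce under test and compare it against the result of applying `sk_ref_reduce_int` across every rank's input.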

Verified on AMD MI210 (gfx90a), ROCm 6.4.2: correctness 160/160 per count
across np=2/3/4 and all algorithms (recursive_doubling, ring, rabenseifner),
single-node and 2-node, plus OSU osu_allreduce interpose (CPU and ROCm
device-to-device). GPU-aware allreduce shows a significant speedup over
native for messages above the small-message cutoff, with native faster for
small messages.
mjwilkins18 force-pushed the mjwilkins18/rocm-backend branch from 8227dbb to a67328c on June 18, 2026 13:54