Add ROCm/HIP GPU backend #5
Open
mjwilkins18 wants to merge 1 commit into
Add an AMD ROCm/HIP reduction backend to CAIL alongside the existing CUDA
path, so GPU-aware MPI_Allreduce interposition works on AMD GPUs.
- configure: --with-rocm / --with-rocm-arch; m4/ax_check_rocm.m4 probes
hipcc, headers, and runtime; CUDA and ROCm backends are mutually exclusive
and selected at configure time.
- src/gpu/rocm: cail_rocm_mem.c (device-pointer detection, hipMalloc/
hipMemcpy with a post-copy hipStreamSynchronize(0) release fence so the
staged buffer is visible to a subsequent GPU-aware MPI / NIC / GDRCopy
read; stream-scoped so unrelated device work is not serialized) and
cail_rocm_reduce.hip (elementwise SUM/PROD/MIN/MAX kernels for all
supported datatypes).
- tests: single-source GPU tests via tests/test_gpu_compat.h, compiled for
CUDA or HIP from one source. The GPU correctness test sweeps a small set
of element counts {1, 2, 3, base, base+1, base+3} to cover edge and
non-power-of-two remainder paths in addition to the requested size.
- hip_lt.sh: libtool shim so .hip sources link cleanly under the autotools
build.
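The element-count sweep described in the tests item above can be sketched as follows. This is a minimal illustration, not the PR's actual test code: `sweep_counts` is a hypothetical helper, and sorting and de-duplicating the counts is an assumption made here so that a small `base` does not repeat a size.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (not taken from the CAIL test sources): build the
// element counts {1, 2, 3, base, base+1, base+3} used to exercise edge
// cases and non-power-of-two remainder paths around the requested size.
// Assumption: counts are returned sorted with duplicates removed.
std::vector<std::size_t> sweep_counts(std::size_t base) {
    assert(base >= 1);  // a zero-element reduction is outside this sketch
    const std::set<std::size_t> unique = {1, 2, 3, base, base + 1, base + 3};
    return std::vector<std::size_t>(unique.begin(), unique.end());
}
```

With `base = 1024` the sweep covers six sizes; with a small `base` such as 2 the overlapping entries collapse, so fewer distinct sizes are run.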
Verified on AMD MI210 (gfx90a), ROCm 6.4.2: correctness 160/160 per count
across np=2/3/4 and all algorithms (recursive_doubling, ring, rabenseifner),
single-node and 2-node, plus OSU osu_allreduce interpose (CPU and ROCm
device-to-device). GPU-aware allreduce shows a significant speedup over
native for messages above the small-message cutoff, with native faster for
small messages.
Add ROCm/HIP GPU backend
This adds an AMD ROCm/HIP reduction backend so CAIL's transparent, GPU-aware
MPI_Allreduce interposition runs on AMD GPUs, mirroring the existing CUDA
backend.
What's included
- `--with-rocm[=PATH]` and `--with-rocm-arch=<gfx>` configure options;
  `m4/ax_check_rocm.m4` detects hipcc, HIP headers, and the ROCm runtime (under
  `lib` or `lib64`). CUDA and ROCm are mutually exclusive, chosen at configure time.
- `src/gpu/rocm/cail_rocm_reduce.hip`: elementwise SUM / PROD / MIN / MAX for all CAIL-supported datatypes.
- `src/gpu/rocm/cail_rocm_mem.c`: HIP device-pointer detection and staging copies. The staging device-to-device copy is followed
  by `hipStreamSynchronize(0)` so the staged data is ordered before it is handed to GPU-aware MPI (which reads it from a separate engine). The fence is
  scoped to the copy's default stream rather than the whole device, so
  unrelated device work is not serialized.
- `tests/test_gpu_compat.h`: builds the GPU correctness/bench tests for either CUDA or HIP.
- `hip_lt.sh`: libtool wrapper for linking `.hip` objects.
Verification
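One way to check the elementwise SUM / PROD / MIN / MAX kernels listed above is to compare device results against a host-side reference. The sketch below is an assumption about how such a check could look, not code from this PR: `RefOp` and `reference_reduce` are hypothetical names, and the in-place "combine `in` into `inout`" convention is chosen here only to mirror a typical reduction step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operator tag for illustration; CAIL's internal operator
// encoding is not shown in this PR.
enum class RefOp { Sum, Prod, Min, Max };

// Host reference: combine `in` into `inout` element by element, i.e.
// inout[i] = op(inout[i], in[i]) for every index i.
template <typename T>
void reference_reduce(const std::vector<T>& in, std::vector<T>& inout, RefOp op) {
    assert(in.size() == inout.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        switch (op) {
            case RefOp::Sum:  inout[i] = static_cast<T>(inout[i] + in[i]); break;
            case RefOp::Prod: inout[i] = static_cast<T>(inout[i] * in[i]); break;
            case RefOp::Min:  inout[i] = std::min(inout[i], in[i]); break;
            case RefOp::Max:  inout[i] = std::max(inout[i], in[i]); break;
        }
    }
}
```

A test could copy the device output back to the host and compare it with the vector produced by `reference_reduce`, exactly for integer types and within a tolerance for floating-point types.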
Tested on AMD MI210 (gfx90a), ROCm 6.4.2, OpenMPI 5.0.x + libfabric:
- Correctness: 160/160 per count across np=2/3/4 and all algorithms
  (recursive_doubling, ring, rabenseifner), single-node and 2-node.
- OSU `osu_allreduce` interpose (`LD_PRELOAD`), CPU and ROCm device-to-device.
Performance
Experiments showed a significant speedup over native for messages above the
small-message cutoff, where the GPU-aware reduction path dominates. Below the
cutoff, native is faster because interposition and GPU-staging overhead
dominate at small sizes. The CPU path is within noise of native, as expected.
Notes