2 changes: 1 addition & 1 deletion Makefile.am
Expand Up @@ -7,5 +7,5 @@ ACLOCAL_AMFLAGS = -I m4
pkgconfigdir = $(libdir)/pkgconfig
pkgconfig_DATA = cail.pc

-EXTRA_DIST = autogen.sh cail.pc.in cuda_lt.sh
+EXTRA_DIST = autogen.sh cail.pc.in cuda_lt.sh hip_lt.sh
DISTCLEANFILES = cail.pc
20 changes: 14 additions & 6 deletions README.md
Expand Up @@ -2,15 +2,15 @@

CAIL is a drop-in GPU-aware `MPI_Allreduce` optimization library. It uses the
MPI profiling interface (PMPI) to transparently intercept `MPI_Allreduce` calls
-and route them through optimized algorithms with native CUDA reduction kernels.
+and route them through optimized algorithms with native CUDA/HIP reduction kernels.
No application changes are required, just `LD_PRELOAD` the library.

## Quick Start

### Prerequisites

- MPI implementation (OpenMPI, MPICH, Intel MPI, etc.)
-- CUDA Toolkit (nvcc, cudart)
+- CUDA Toolkit (nvcc, cudart) or ROCm Toolkit (hipcc, amdhip64)
- Autotools (autoconf >= 2.69, automake, libtool)

### Build
Expand Down Expand Up @@ -48,14 +48,22 @@ Look for `[cail] initialized:` and `[cail] algorithm=` on stderr to confirm CAIL
|---------------------------------|------------------------------------------------------|----------|
| `--with-cuda=PATH` | Path to CUDA toolkit installation | auto |
| `--with-cuda-arch=SM` | NVCC architecture flag (e.g. `sm_70`, `sm_90`) | `sm_70` |
+| `--with-rocm=PATH` | Path to ROCm toolkit installation | auto |
+| `--with-rocm-arch=GFX` | HIP offload architecture (e.g. `gfx90a`); empty lets hipcc autodetect | empty |
| `--with-mpi=PATH` | Path to MPI installation | auto |
| `--enable-host-path` | Build without GPU support (host-only, uses `MPI_Reduce_local`) | no |
| `--enable-debug` | Debug build with `-g -O0` | no |
| `--enable-recursive-doubling` | Enable recursive-doubling algorithm | yes |
| `--enable-ring` | Enable ring algorithm | yes |
| `--enable-rabenseifner` | Enable Rabenseifner algorithm | yes |

-### Host-Only Build (No CUDA)
+### ROCm Build
+
+```sh
+./configure --with-rocm=/opt/rocm --with-rocm-arch=gfx90a
+```
+
+### Host-Only Build (No CUDA/ROCm)

```sh
./configure --enable-host-path
Expand All @@ -68,7 +76,7 @@ with `malloc`/`free`. Useful for development or CPU-only clusters.

CAIL intercepts `MPI_Allreduce` when all of these are true:

-- Buffer resides on a CUDA device (or built with `--enable-host-path`)
+- Buffer resides on a CUDA or ROCm device (or built with `--enable-host-path`)
- Datatype is one of the 20 supported MPI types (see below)
- Operation is SUM, PROD, MAX, or MIN
- Communicator is an intracommunicator
Expand Down Expand Up @@ -211,9 +219,9 @@ CAIL_ALGO=ring CAIL_DEBUG=1 mpirun -np 4 -x LD_PRELOAD=... ./my_app
| `test_allreduce_basic` | CPU | Float SUM across 7 message sizes, internal count sweep |
| `test_allreduce_correctness` | CPU | All 20 datatypes × 4 ops × {normal, MPI_IN_PLACE}. Requires `-c count`. |
| `test_allreduce_edge` | CPU | Edge cases: count=0, count=1, large counts, single process |
-| `test_allreduce_correctness_gpu` | GPU | All 20 datatypes × 4 ops × {normal, MPI_IN_PLACE}. Requires `-c count`. CUDA build only. |
+| `test_allreduce_correctness_gpu` | GPU | All 20 datatypes × 4 ops × {normal, MPI_IN_PLACE}. Requires `-c count`. CUDA or ROCm build. |
| `bench_allreduce` | CPU | Performance benchmark across message sizes |
-| `bench_allreduce_gpu` | GPU | GPU performance benchmark. CUDA build only. |
+| `bench_allreduce_gpu` | GPU | GPU performance benchmark. CUDA or ROCm build. |

### Test CLI

Expand Down
65 changes: 63 additions & 2 deletions configure.ac
Expand Up @@ -99,6 +99,36 @@ AS_IF([test "x$enable_rabenseifner" = "xyes"], [
[Define to 1 to enable Rabenseifner algorithm])
])

dnl ---------------------------------------------------------------------------
dnl GPU backend selection options
dnl ---------------------------------------------------------------------------
AC_ARG_WITH([cuda],
[AS_HELP_STRING([--with-cuda=PATH],
[Path to CUDA toolkit @<:@default=auto@:>@])],
[with_cuda=$withval],
[with_cuda=auto])

AC_ARG_WITH([rocm],
[AS_HELP_STRING([--with-rocm=PATH],
[Path to ROCm toolkit @<:@default=auto@:>@])],
[with_rocm=$withval],
[with_rocm=auto])

user_with_cuda=$with_cuda
user_with_rocm=$with_rocm

AS_IF([test "x$with_rocm" != "xauto" && test "x$with_rocm" != "xno" && test "x$with_cuda" = "xauto"], [
with_cuda=no
])
AS_IF([test "x$with_cuda" != "xauto" && test "x$with_cuda" != "xno" && test "x$with_rocm" = "xauto"], [
with_rocm=no
])

AS_IF([test "x$enable_host_path" = "xyes"], [
with_cuda=no
with_rocm=no
])
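
Review note: the pre-detection normalization above can be sketched as a plain shell function, which makes the precedence easier to check by hand. `normalize_backends` is a hypothetical helper written for illustration and is not part of this PR; it assumes the three values arrive as positional arguments rather than as configure variables.

```sh
#!/bin/sh
# Hypothetical helper (not in the PR): mirrors the normalization of
# with_cuda / with_rocm that configure.ac performs before detection.
# Usage: normalize_backends WITH_CUDA WITH_ROCM ENABLE_HOST_PATH
# Prints "WITH_CUDA WITH_ROCM" after normalization.
normalize_backends() {
    with_cuda=$1
    with_rocm=$2
    enable_host_path=$3

    # An explicit ROCm path switches off CUDA autodetection.
    if [ "x$with_rocm" != "xauto" ] && [ "x$with_rocm" != "xno" ] \
        && [ "x$with_cuda" = "xauto" ]; then
        with_cuda=no
    fi

    # An explicit CUDA path switches off ROCm autodetection.
    if [ "x$with_cuda" != "xauto" ] && [ "x$with_cuda" != "xno" ] \
        && [ "x$with_rocm" = "xauto" ]; then
        with_rocm=no
    fi

    # Host-path builds disable both GPU backends.
    if [ "x$enable_host_path" = "xyes" ]; then
        with_cuda=no
        with_rocm=no
    fi

    echo "$with_cuda $with_rocm"
}
```

With this sketch, `normalize_backends auto /opt/rocm no` prints `no /opt/rocm`, while passing explicit paths for both backends leaves both values untouched, so that conflict is only caught by the later validation step.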

dnl ---------------------------------------------------------------------------
dnl --with-cuda-arch : NVCC architecture flag
dnl ---------------------------------------------------------------------------
Expand All @@ -111,10 +141,41 @@ AC_ARG_WITH([cuda-arch],
AC_SUBST([CUDA_ARCH], [$cuda_arch])

dnl ---------------------------------------------------------------------------
-dnl CUDA (macro provided by m4/ax_check_cuda.m4)
-dnl Must come after --with-cuda-arch so $cuda_arch is set for NVCCFLAGS
dnl --with-rocm-arch : HIP architecture flag
dnl ---------------------------------------------------------------------------
AC_ARG_WITH([rocm-arch],
[AS_HELP_STRING([--with-rocm-arch=GFX],
[Set ROCm offload architecture for hipcc (e.g. gfx90a); default empty uses hipcc autodetect])],
[rocm_arch=$withval],
[rocm_arch=])

AC_SUBST([ROCM_ARCH], [$rocm_arch])

dnl ---------------------------------------------------------------------------
dnl CUDA / ROCm detection
dnl ---------------------------------------------------------------------------
AX_CHECK_CUDA
AX_CHECK_ROCM

dnl ---------------------------------------------------------------------------
dnl Final backend selection validation
dnl ---------------------------------------------------------------------------
AS_IF([test "x$enable_host_path" = "xyes"], [
AS_IF([test "x$user_with_cuda" != "xauto" && test "x$user_with_cuda" != "xno"], [
AC_MSG_ERROR([host-path excludes GPU backends])
])
AS_IF([test "x$user_with_rocm" != "xauto" && test "x$user_with_rocm" != "xno"], [
AC_MSG_ERROR([host-path excludes GPU backends])
])
], [
AS_IF([test "x$have_cuda" = "xyes" && test "x$have_rocm" = "xyes"], [
AC_MSG_ERROR([choose one GPU backend: CUDA or ROCm])
], [
AS_IF([test "x$have_cuda" = "xno" && test "x$have_rocm" = "xno"], [
AC_MSG_ERROR([no GPU backend: pass --with-cuda=PATH, --with-rocm=PATH, or --enable-host-path])
])
])
])
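
Review note: the final validation above reduces to a small decision table. The sketch below restates it as a shell function so each outcome can be exercised without running configure. `validate_backends` is a hypothetical name, the detection results are passed in as arguments instead of being computed, and printing the chosen backend is an addition for illustration only (configure.ac itself only errors out or continues).

```sh
#!/bin/sh
# Hypothetical helper (not in the PR): mirrors the final backend validation.
# Usage: validate_backends ENABLE_HOST_PATH USER_WITH_CUDA USER_WITH_ROCM \
#                          HAVE_CUDA HAVE_ROCM
# Prints host, cuda, or rocm on success; returns 1 with a message on error.
validate_backends() {
    host_path=$1; user_cuda=$2; user_rocm=$3; have_cuda=$4; have_rocm=$5

    if [ "x$host_path" = "xyes" ]; then
        # Host-path conflicts with any explicitly requested GPU toolkit.
        for requested in "$user_cuda" "$user_rocm"; do
            if [ "x$requested" != "xauto" ] && [ "x$requested" != "xno" ]; then
                echo "host-path excludes GPU backends" >&2
                return 1
            fi
        done
        echo host
        return 0
    fi

    if [ "x$have_cuda" = "xyes" ] && [ "x$have_rocm" = "xyes" ]; then
        echo "choose one GPU backend: CUDA or ROCm" >&2
        return 1
    fi
    if [ "x$have_cuda" = "xno" ] && [ "x$have_rocm" = "xno" ]; then
        echo "no GPU backend found" >&2
        return 1
    fi

    # Exactly one backend was detected.
    if [ "x$have_cuda" = "xyes" ]; then echo cuda; else echo rocm; fi
}
```

One consequence worth noting: on a machine where autodetection finds both toolkits, a bare `./configure` fails with the "choose one GPU backend" error until one of `--with-cuda=PATH`, `--with-rocm=PATH`, `--without-cuda`, or `--without-rocm` is given.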

dnl ---------------------------------------------------------------------------
dnl Output
Expand Down
57 changes: 57 additions & 0 deletions hip_lt.sh
@@ -0,0 +1,57 @@
#!/bin/bash
# Copyright (c) 2026 Cornelis Networks. All rights reserved.

# hip_lt.sh — Wrapper to compile .hip files into libtool .lo objects.
# Implements the same UCC-pattern rationale as cuda_lt.sh: emit PIC/non-PIC
# objects plus .lo metadata so libtool keeps HIP objects during convenience-lib linking.

set -e

libtool_file=$1
lo_filepath=$2

# Derive .o path from .lo path
o_filepath="${lo_filepath%.lo}.o"
lo_dir=$(dirname "$o_filepath")
o_filename=$(basename "$o_filepath")

# Libtool convention: PIC objects go in .libs/, non-PIC in current dir
local_pic_dir=".libs/"
local_npic_dir=""
pic_dir="${lo_dir}/${local_pic_dir}"
npic_dir="${lo_dir}/${local_npic_dir}"

pic_filepath="${pic_dir}${o_filename}"
npic_filepath="${npic_dir}${o_filename}"
local_pic_filepath="${local_pic_dir}${o_filename}"
local_npic_filepath="${local_npic_dir}${o_filename}"

mkdir -p "$pic_dir"

# Build PIC version (for shared library)
cmd="${@:3} -fPIC -o ${pic_filepath}"
echo "$cmd"
$cmd

# Build non-PIC version (for static library)
cmd="${@:3} -o ${npic_filepath}"
echo "$cmd"
$cmd

# Write the .lo metadata file that libtool expects
libtool_version="$(${libtool_file} --version | head -1 | sed 's/^/#/g')"

cat > "${lo_filepath}" <<LOEOF
# ${lo_filepath} - a libtool object file
# Generated by hip_lt.sh for hipcc/libtool integration
# ${libtool_version}

# Please DO NOT delete this file!
# It is necessary for linking the library.

# Name of the PIC object.
pic_object='${local_pic_filepath}'

# Name of the non-PIC object.
non_pic_object='${local_npic_filepath}'
LOEOF
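
Review note: the path handling in hip_lt.sh can be checked in isolation. The sketch below reproduces only the derivation of the PIC and non-PIC object paths from a `.lo` path; `derive_object_paths` is a hypothetical helper for illustration and does not invoke hipcc or libtool.

```sh
#!/bin/sh
# Hypothetical helper (not in the PR): prints the two object paths that
# hip_lt.sh derives from a .lo path, PIC first, then non-PIC.
derive_object_paths() {
    lo_filepath=$1

    # Same derivation as the script: swap the .lo suffix for .o,
    # then split into directory and file name.
    o_filepath="${lo_filepath%.lo}.o"
    lo_dir=$(dirname "$o_filepath")
    o_filename=$(basename "$o_filepath")

    # Libtool convention: PIC objects live in .libs/, non-PIC next to the .lo.
    echo "${lo_dir}/.libs/${o_filename}"
    echo "${lo_dir}/${o_filename}"
}
```

For `cail_rocm_reduce.lo` this yields `./.libs/cail_rocm_reduce.o` and `./cail_rocm_reduce.o`, matching the `pic_filepath` and `npic_filepath` values the script computes when make runs it from the source directory.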
19 changes: 7 additions & 12 deletions m4/ax_check_cuda.m4
Expand Up @@ -3,11 +3,13 @@ dnl AX_CHECK_CUDA — Detect CUDA toolkit, nvcc, and set build variables
dnl ---------------------------------------------------------------------------
AC_DEFUN([AX_CHECK_CUDA], [
dnl --with-cuda=PATH
-AC_ARG_WITH([cuda],
-[AS_HELP_STRING([--with-cuda=PATH],
-[Path to CUDA toolkit @<:@default=auto@:>@])],
-[with_cuda=$withval],
-[with_cuda=auto])
+AS_IF([test "x$with_cuda" = "x"], [
+AC_ARG_WITH([cuda],
+[AS_HELP_STRING([--with-cuda=PATH],
+[Path to CUDA toolkit @<:@default=auto@:>@])],
+[with_cuda=$withval],
+[with_cuda=auto])
+])

have_cuda=no

Expand Down Expand Up @@ -52,13 +54,6 @@ AC_DEFUN([AX_CHECK_CUDA], [
])
])

-dnl If CUDA required (not host-path and not --without-cuda), fail
-AS_IF([test "x$have_cuda" = "xno" && test "x$with_cuda" != "xno" && test "x$enable_host_path" != "xyes"], [
-AS_IF([test "x$with_cuda" != "xauto"], [
-AC_MSG_ERROR([CUDA requested but not found. Use --without-cuda or --enable-host-path for CPU-only build.])
-])
-])
-
AC_SUBST([CUDA_HOME])
AC_SUBST([NVCC])
AC_SUBST([CUDA_CFLAGS])
Expand Down
91 changes: 91 additions & 0 deletions m4/ax_check_rocm.m4
@@ -0,0 +1,91 @@
dnl ---------------------------------------------------------------------------
dnl AX_CHECK_ROCM — Detect ROCm toolkit, hipcc, and set build variables
dnl ---------------------------------------------------------------------------
AC_DEFUN([AX_CHECK_ROCM], [
dnl --with-rocm=PATH
AS_IF([test "x$with_rocm" = "x"], [
AC_ARG_WITH([rocm],
[AS_HELP_STRING([--with-rocm=PATH],
[Path to ROCm toolkit @<:@default=auto@:>@])],
[with_rocm=$withval],
[with_rocm=auto])
])

have_rocm=no
ROCM_HOME=
HIPCC=
ROCM_CFLAGS=
ROCM_LIBS=
HIPCCFLAGS=

dnl Skip ROCm if --without-rocm
AS_IF([test "x$with_rocm" != "xno"], [
dnl Find hipcc
AS_IF([test "x$with_rocm" != "xauto"], [
HIPCC="$with_rocm/bin/hipcc"
ROCM_HOME="$with_rocm"
], [
AC_PATH_PROG([HIPCC], [hipcc], [])
AS_IF([test "x$HIPCC" != "x"], [
dnl Derive ROCM_HOME from hipcc location
rocm_bin_dir=`AS_DIRNAME([$HIPCC])`
ROCM_HOME=`AS_DIRNAME([$rocm_bin_dir])`
], [
AS_IF([test -x "/opt/rocm/bin/hipcc"], [
HIPCC="/opt/rocm/bin/hipcc"
ROCM_HOME="/opt/rocm"
])
])
])

dnl Check hipcc exists
AS_IF([test "x$HIPCC" != "x" && test -x "$HIPCC"], [
rocm_inc="$ROCM_HOME/include"
dnl Prefer lib64 if that is where libamdhip64.so lives (some distros)
AS_IF([test -f "$ROCM_HOME/lib64/libamdhip64.so"],
[rocm_lib="$ROCM_HOME/lib64"],
[rocm_lib="$ROCM_HOME/lib"])

dnl Validate HIP runtime header and ROCm runtime library
AC_CHECK_FILE([$rocm_inc/hip/hip_runtime.h], [
AC_CHECK_FILE([$rocm_lib/libamdhip64.so], [
have_rocm=yes
ROCM_CFLAGS="-I$rocm_inc -D__HIP_PLATFORM_AMD__"
ROCM_LIBS="-L$rocm_lib -lamdhip64"
HIPCCFLAGS="-fPIC"
AS_IF([test "x$rocm_arch" != "x"], [
HIPCCFLAGS="$HIPCCFLAGS --offload-arch=$rocm_arch"
])
AC_DEFINE([HAVE_ROCM], [1], [Define to 1 if ROCm is available])
AC_MSG_NOTICE([ROCm found: $ROCM_HOME])
], [
AS_IF([test "x$with_rocm" != "xauto"], [
AC_MSG_ERROR([libamdhip64.so not found in $rocm_lib])
], [
AC_MSG_NOTICE([libamdhip64.so not found in $rocm_lib — ROCm disabled])
])
])
], [
AS_IF([test "x$with_rocm" != "xauto"], [
AC_MSG_ERROR([hip_runtime.h not found in $rocm_inc/hip])
], [
AC_MSG_NOTICE([hip_runtime.h not found in $rocm_inc/hip — ROCm disabled])
])
])
], [
AS_IF([test "x$with_rocm" != "xauto"], [
AC_MSG_ERROR([hipcc not found at $HIPCC])
], [
AC_MSG_NOTICE([hipcc not found — ROCm disabled])
])
])
])

AC_SUBST([ROCM_HOME])
AC_SUBST([HIPCC])
AC_SUBST([ROCM_CFLAGS])
AC_SUBST([ROCM_LIBS])
AC_SUBST([HIPCCFLAGS])

AM_CONDITIONAL([HAVE_ROCM], [test "x$have_rocm" = "xyes"])
])
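
Review note: two pieces of AX_CHECK_ROCM are easy to test outside of configure: the choice between `lib64` and `lib`, and the assembly of `HIPCCFLAGS`. The sketch below restates both as shell functions. `rocm_lib_dir` and `hipcc_flags` are hypothetical names used only for illustration.

```sh
#!/bin/sh
# Hypothetical helpers (not in the PR).

# Print the directory expected to hold libamdhip64.so under ROCM_HOME ($1),
# preferring lib64 only when the library is actually present there.
rocm_lib_dir() {
    if [ -f "$1/lib64/libamdhip64.so" ]; then
        echo "$1/lib64"
    else
        echo "$1/lib"
    fi
}

# Print HIPCCFLAGS for an optional offload architecture ($1, may be empty).
# An empty architecture leaves the choice to hipcc.
hipcc_flags() {
    flags="-fPIC"
    if [ "x$1" != "x" ]; then
        flags="$flags --offload-arch=$1"
    fi
    echo "$flags"
}
```

Since `HIPCCFLAGS` always starts with `-fPIC` and src/gpu/rocm/Makefile.am passes `$(HIPCCFLAGS)` to hip_lt.sh, the object that hip_lt.sh labels non-PIC appears to be compiled with `-fPIC` as well.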
4 changes: 4 additions & 0 deletions src/Makefile.am
Expand Up @@ -15,6 +15,10 @@ if HAVE_CUDA
libcail_la_LDFLAGS += $(CUDA_LIBS) -lstdc++
endif

if HAVE_ROCM
libcail_la_LDFLAGS += $(ROCM_LIBS) -lstdc++
endif

include_HEADERS = core/cail.h

AM_CFLAGS = $(MPI_CFLAGS)
4 changes: 4 additions & 0 deletions src/gpu/Makefile.am
Expand Up @@ -13,6 +13,10 @@ libcail_gpu_la_LIBADD += host/libcail_host.la
else
if HAVE_CUDA
libcail_gpu_la_LIBADD += cuda/libcail_cuda.la
+else
+if HAVE_ROCM
+libcail_gpu_la_LIBADD += rocm/libcail_rocm.la
+endif
endif
endif

Expand Down
4 changes: 2 additions & 2 deletions src/gpu/cail_gpu.h
Expand Up @@ -3,12 +3,12 @@
/* cail_gpu.h — Backend-agnostic GPU interface for cail
* Implemented by: CUDA (cail_cuda_reduce.cu + cail_cuda_mem.c)
* Host-path (cail_host_reduce.c)
- * ROCm stub (cail_rocm_reduce_stub.c)
+ * ROCm (cail_rocm_reduce.hip + cail_rocm_mem.c)
*/
#ifndef CAIL_GPU_H
#define CAIL_GPU_H

-#include "cail_types.h"
+#include "../core/cail_types.h"
#include <stddef.h>
#include <mpi.h>

Expand Down
21 changes: 20 additions & 1 deletion src/gpu/rocm/Makefile.am
@@ -1,7 +1,26 @@
# Copyright (c) 2026 Cornelis Networks. All rights reserved.

if HAVE_ROCM

noinst_LTLIBRARIES = libcail_rocm.la

-libcail_rocm_la_SOURCES = cail_rocm_stub.c
+libcail_rocm_la_SOURCES = cail_rocm_mem.c cail_rocm_reduce.hip

SUFFIXES = .hip

.hip.lo:
/bin/bash $(top_srcdir)/hip_lt.sh "$(LIBTOOL)" $@ \
$(HIPCC) $(HIPCCFLAGS) $(ROCM_CFLAGS) $(MPI_CFLAGS) \
-I$(top_srcdir)/src/core -I$(top_srcdir)/src/gpu -c $<

AM_CFLAGS = $(ROCM_CFLAGS) $(MPI_CFLAGS) -I$(top_srcdir)/src/core -I$(top_srcdir)/src/gpu

CLEANFILES = cail_rocm_reduce.lo

else

noinst_LTLIBRARIES = libcail_rocm.la
libcail_rocm_la_SOURCES = cail_rocm_stub.c
AM_CFLAGS = -I$(top_srcdir)/src/core -I$(top_srcdir)/src/gpu

endif