
Synchronize staging copy before handing buffer to MPI (CUDA) #6

Open
mjwilkins18 wants to merge 1 commit into main from mjwilkins18/cuda-staging-fence

@mjwilkins18

Collaborator

cail_gpu_memcpy stages the send buffer with a device-to-device cudaMemcpy
and then hands the staged buffer to GPU-aware MPI, which reads it from a
separate NIC / GDRCopy engine. A device-to-device cudaMemcpy has no host-side
completion guarantee on return, so the external read can race the copy.

Add a cudaStreamSynchronize(0) release fence after the staging copy so the
staged data is visible before the buffer is exposed to PMPI. The fence is scoped
to the default stream (where the synchronous cudaMemcpy runs) rather than the
whole device, so unrelated device work is not serialized. Mirrors the equivalent
fence on the ROCm path.
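
The ordering described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `stage_for_send` and `CUDA_CHECK` are hypothetical names, the buffer size is arbitrary, and the MPI call is left as a comment so the sketch builds without an MPI installation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check helper, not taken from the PR.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the staging step inside cail_gpu_memcpy.
static void stage_for_send(void *staged, const void *src, size_t nbytes)
{
    // Device-to-device copy: the call may return to the host before the
    // copy has finished on the device.
    CUDA_CHECK(cudaMemcpy(staged, src, nbytes, cudaMemcpyDeviceToDevice));

    // Block the host until work queued on the default stream (including the
    // copy above) has completed, so an external reader sees the staged data.
    CUDA_CHECK(cudaStreamSynchronize(0));
}

int main()
{
    const size_t nbytes = (1u << 20) * sizeof(int);  // arbitrary example size
    void *src = nullptr;
    void *staged = nullptr;

    CUDA_CHECK(cudaMalloc(&src, nbytes));
    CUDA_CHECK(cudaMalloc(&staged, nbytes));
    CUDA_CHECK(cudaMemset(src, 0x2a, nbytes));       // fill with a test pattern

    stage_for_send(staged, src, nbytes);
    // Here the real code would pass `staged` to the PMPI send call.

    CUDA_CHECK(cudaFree(staged));
    CUDA_CHECK(cudaFree(src));
    return 0;
}
```

The sketch should build with `nvcc` alone. A device-wide `cudaDeviceSynchronize()` in place of the stream synchronize would also order the copy before the hand-off, but it would wait on every stream on the device.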

Draft: not yet validated on CUDA hardware.

@mjwilkins18 marked this pull request as ready for review on June 18, 2026, 13:40

cail_gpu_memcpy stages the send buffer with a device-to-device
cudaMemcpy and then hands the staged buffer straight to GPU-aware MPI,
which reads it from a separate NIC / GDRCopy engine. A device-to-device
cudaMemcpy has no host-side completion guarantee on return, so the
subsequent external read can race the copy.

Add a cudaStreamSynchronize(0) release fence after the staging copy so
the staged data is guaranteed visible before the buffer is exposed to
PMPI. The fence is scoped to the default stream (where the synchronous
cudaMemcpy runs) rather than the whole device, so unrelated device work
is not serialized. This mirrors the equivalent fence on the ROCm path.

@mjwilkins18 force-pushed the mjwilkins18/cuda-staging-fence branch from 364ffba to 4bb6962 on June 18, 2026, 13:46