Synchronize staging copy before handing buffer to MPI (CUDA) #6
Open
mjwilkins18 wants to merge 1 commit into
cail_gpu_memcpy stages the send buffer with a device-to-device cudaMemcpy and then hands the staged buffer straight to GPU-aware MPI, which reads it through a separate engine (the NIC or GDRCopy). A device-to-device cudaMemcpy does not guarantee host-side completion when it returns, so the subsequent external read can race the copy. This change adds a cudaStreamSynchronize(0) after the staging copy, acting as a release fence, so the staged data is guaranteed to be visible before the buffer is exposed to PMPI. The fence is scoped to the default stream (where cudaMemcpy issues the copy) rather than the whole device, so the host does not also wait on unrelated device work. This mirrors the equivalent fence on the ROCm path.
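The pattern described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: stage_and_fence is a hypothetical helper name, and the real cail_gpu_memcpy signature is not shown here, so the parameters are assumptions. Like the change itself, the sketch has not been validated on CUDA hardware.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper (not the actual cail_gpu_memcpy signature).
 * Copies `bytes` from `src` to `staged`, both assumed to be device
 * pointers, then blocks the host until the default stream has drained,
 * so a later external reader of `staged` sees the completed copy. */
static cudaError_t stage_and_fence(void *staged, const void *src, size_t bytes)
{
    /* Device-to-device copy; it may return to the host before the
     * copy has actually finished on the device. */
    cudaError_t err = cudaMemcpy(staged, src, bytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess)
        return err;

    /* Wait on the default stream (handle 0) only, instead of calling
     * cudaDeviceSynchronize() and waiting on every stream. */
    return cudaStreamSynchronize(0);
}
```

In this sketch a caller would check the returned cudaError_t and pass the staged pointer to the PMPI send call only on cudaSuccess.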
Force-pushed from 364ffba to 4bb6962.
Draft: not yet validated on CUDA hardware.