Flush GPU recv buffer after non-pof2 unfold receives #4

Open. mjwilkins18 wants to merge 1 commit into main from mjwilkins18/unfold-recv-flush

@mjwilkins18 (Collaborator)

Summary

For non-power-of-two process counts, the Phase-3 unfold PMPI_Recv on
even folded ranks delivers the final allreduce result into recvbuf via
GPU-aware MPI. The data arrives as posted PCIe writes, which carry no
ordering guarantee against subsequent kernel reads. Every other receive
site in the algorithms is followed by cail_gpu_flush_recv_buf(), but
these two are not:

  • src/coll/allreduce/cail_allreduce_recursive_doubling.c (unfold receive)
  • src/coll/allreduce/cail_allreduce_rabenseifner.c (unfold receive)

CAIL launches no kernel after the unfold, but the application may launch
one that reads recvbuf immediately after MPI_Allreduce returns and
observe stale device memory. This affects exactly the folded ranks at
non-pof2 scale.
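
To make "exactly the folded ranks" concrete, here is a minimal sketch of which ranks would take the unfold receive. It assumes the common MPICH-style fold, where the first 2*rem ranks are paired and the even rank of each pair sits out the power-of-two phase; CAIL's actual rank mapping is not shown in this PR, and all function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Largest power of two that is <= nprocs (nprocs must be positive). */
int largest_pof2_le(int nprocs)
{
    assert(nprocs > 0);
    int pof2 = 1;
    while (pof2 <= nprocs / 2)
        pof2 *= 2;
    return pof2;
}

/*
 * Assumed MPICH-style fold: with rem = nprocs - pof2, the first 2*rem ranks
 * are folded in pairs. The even rank of each pair hands its data to its odd
 * neighbour, skips the power-of-two exchange, and receives the final result
 * back during the unfold phase.
 */
bool is_unfold_receiver(int rank, int nprocs)
{
    assert(rank >= 0 && rank < nprocs);
    int rem = nprocs - largest_pof2_le(nprocs);
    return rank < 2 * rem && rank % 2 == 0;
}

/* Number of ranks that perform the unfold receive for a given job size. */
int count_unfold_receivers(int nprocs)
{
    int count = 0;
    for (int rank = 0; rank < nprocs; rank++)
        if (is_unfold_receiver(rank, nprocs))
            count++;
    return count;
}
```

Under that assumption, a 6-rank job has two affected ranks (0 and 2), while an 8-rank job has none, which is consistent with the bug appearing only at non-pof2 scale.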

This adds the missing flush after each unfold receive (no-op on the
host path).
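
The intended call order at the unfold site can be illustrated with a small mock. This is not the patched code: the functions below are stand-ins that only record what was called, the real signature of cail_gpu_flush_recv_buf() is not shown in this PR, and the buf_on_device flag is an assumed way of modelling the host-path no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical call log so the ordering can be inspected. */
enum unfold_step { STEP_RECV = 1, STEP_FLUSH = 2 };

struct unfold_log {
    enum unfold_step steps[4];
    size_t count;
};

void log_step(struct unfold_log *log, enum unfold_step step)
{
    assert(log != NULL);
    if (log->count < sizeof log->steps / sizeof log->steps[0])
        log->steps[log->count++] = step;
}

/* Stand-in for the blocking PMPI_Recv that lands the result in recvbuf. */
void mock_unfold_recv(struct unfold_log *log)
{
    log_step(log, STEP_RECV);
}

/*
 * Stand-in for cail_gpu_flush_recv_buf(). The host-path behaviour is
 * modelled as an early return, matching the "no-op on the host path" note.
 */
void mock_flush_recv_buf(struct unfold_log *log, bool buf_on_device)
{
    if (!buf_on_device)
        return;
    log_step(log, STEP_FLUSH);
}

/* Unfold receive as described in this PR: receive first, then flush. */
void unfold_receive(struct unfold_log *log, bool buf_on_device)
{
    mock_unfold_recv(log);
    mock_flush_recv_buf(log, buf_on_device);
}
```

The point of the sketch is only the ordering: the flush runs after the receive completes and before control returns to the caller, so a kernel launched by the application right after MPI_Allreduce would follow the flush.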

The Phase-3 unfold PMPI_Recv on even folded ranks lands the final
result in recvbuf via GPU-aware MPI (posted PCIe writes), but unlike
every other receive site no cail_gpu_flush_recv_buf follows. The
application may launch a kernel reading recvbuf immediately after
MPI_Allreduce returns and observe stale device memory.

Add the missing flush after the unfold receive in recursive doubling
and Rabenseifner.

@mjwilkins18 force-pushed the mjwilkins18/unfold-recv-flush branch from a25ffcc to 53c264b on June 18, 2026 13:46