Flush GPU recv buffer after non-pof2 unfold receives by mjwilkins18 · Pull Request #4 · cornelisnetworks/cail

mjwilkins18 · 2026-06-12T21:11:13Z

Summary

For non-power-of-two process counts, the Phase-3 unfold PMPI_Recv on
even folded ranks lands the final allreduce result in recvbuf via
GPU-aware MPI. The data arrives as posted PCIe writes, which carry no
ordering guarantee against subsequent kernel reads. Every other receive
site in the algorithms is followed by cail_gpu_flush_recv_buf(), but
these two are not:

src/coll/allreduce/cail_allreduce_recursive_doubling.c (unfold receive)
src/coll/allreduce/cail_allreduce_rabenseifner.c (unfold receive)

CAIL launches no kernel after the unfold, but the application may launch
one that reads recvbuf immediately after MPI_Allreduce returns and
observe stale device memory. This affects exactly the even folded ranks
at non-pof2 scale, since only they take the unfold receive.
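
To make "folded ranks at non-pof2 scale" concrete, here is a minimal sketch of which ranks would take the unfold receive. It assumes the common MPICH-style folding scheme: with pof2 the largest power of two not exceeding the communicator size and rem = size - pof2, the first 2*rem ranks are folded, and the even ones among them receive the final result in the unfold. The function names are hypothetical and not taken from CAIL. The PR only states that the even folded ranks receive, so the exact rank mapping is an assumption.

```c
#include <assert.h>

/* Largest power of two that is <= n (n must be >= 1). */
int largest_pof2_le(int n)
{
    assert(n >= 1);
    int p = 1;
    while (p <= n / 2)   /* written this way to avoid overflowing p * 2 */
        p *= 2;
    return p;
}

/*
 * Returns 1 if `rank` would take the Phase-3 unfold receive, 0 otherwise.
 * ASSUMPTION: MPICH-style fold, where the first 2*rem ranks are folded and
 * the even ones among them get the result back from rank + 1.
 */
int receives_in_unfold(int rank, int size)
{
    assert(size >= 1 && rank >= 0 && rank < size);
    int rem = size - largest_pof2_le(size);
    return (rank < 2 * rem) && (rank % 2 == 0);
}
```

Under these assumptions, a 6-rank job has pof2 = 4 and rem = 2, so ranks 0 and 2 take the unfold receive. An 8-rank job has rem = 0 and no rank does, which is consistent with the issue appearing only at non-pof2 scale.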

This adds the missing flush after each unfold receive (no-op on the
host path).
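
The ordering problem can be illustrated with a toy model. This is only a sketch: every name below is a hypothetical mock, not CAIL's or MPI's API, and the signature of cail_gpu_flush_recv_buf() is not shown in this PR. The model treats a receive into device memory as a staged write that becomes visible to a reader only after a flush. Real hardware is nondeterministic (the stale read may or may not happen); the mock makes the worst case deterministic so that the effect of the missing flush is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device buffer: an incoming write is staged until it is flushed. */
typedef struct {
    int  visible;     /* what a kernel reading the buffer would observe */
    int  staged;      /* value delivered by the receive, not yet visible */
    bool has_staged;  /* true while a delivered value awaits a flush */
} mock_devbuf;

/* Stand-in for the unfold receive: the data lands as a posted write. */
void mock_recv(mock_devbuf *buf, int value)
{
    assert(buf != NULL);
    buf->staged = value;
    buf->has_staged = true;
}

/* Stand-in for the flush: makes any staged write visible to readers. */
void mock_flush(mock_devbuf *buf)
{
    assert(buf != NULL);
    if (buf->has_staged) {
        buf->visible = buf->staged;
        buf->has_staged = false;
    }
}

/* Stand-in for an application kernel reading recvbuf. */
int mock_kernel_read(const mock_devbuf *buf)
{
    assert(buf != NULL);
    return buf->visible;
}

/* Unfold step before the change: returns right after the receive. */
void mock_unfold_no_flush(mock_devbuf *recvbuf, int result)
{
    mock_recv(recvbuf, result);
}

/* Unfold step after the change: flushes before returning to the caller. */
void mock_unfold_with_flush(mock_devbuf *recvbuf, int result)
{
    mock_recv(recvbuf, result);
    mock_flush(recvbuf);
}
```

In this model, a read after mock_unfold_no_flush() still returns the old buffer contents, while a read after mock_unfold_with_flush() returns the received result.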

The Phase-3 unfold PMPI_Recv on even folded ranks lands the final result in recvbuf via GPU-aware MPI (posted PCIe writes), but unlike every other receive site no cail_gpu_flush_recv_buf follows. The application may launch a kernel reading recvbuf immediately after MPI_Allreduce returns and observe stale device memory. Add the missing flush after the unfold receive in recursive doubling and Rabenseifner.

mjwilkins18 force-pushed the mjwilkins18/unfold-recv-flush branch from a25ffcc to 53c264b on June 18, 2026 13:46

mjwilkins18 wants to merge 1 commit into main from mjwilkins18/unfold-recv-flush