Skip to content

[CRL] Replace all CPU-blocking stream syncs in C2C workflow with asynchronous eventRecord + streamWaitEvent pairs#434

Draft
MC952-arch wants to merge 1 commit into
flagos-ai:mainfrom
MC952-arch:ccl/stream-wait-c2c
Draft

[CRL] Replace all CPU-blocking stream syncs in C2C workflow with asynchronous eventRecord + streamWaitEvent pairs#434
MC952-arch wants to merge 1 commit into
flagos-ai:mainfrom
MC952-arch:ccl/stream-wait-c2c

Conversation

@MC952-arch
Copy link
Copy Markdown
Collaborator

PR Category

CRL

PR Types

Optimizations

PR Description

@MC952-arch MC952-arch force-pushed the ccl/stream-wait-c2c branch 2 times, most recently from 922d4f1 to 53720a3 Compare March 26, 2026 09:25
@MC952-arch MC952-arch requested a review from Copilot March 26, 2026 14:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the C2C execution path to avoid CPU-blocking streamSynchronize() calls by switching to GPU-side synchronization using eventRecord + streamWaitEvent between the main stream and a temporary hetStream.

Changes:

  • Introduce a dedicated hetero stream (hetStream) plus two events for cross-stream ordering.
  • Replace multiple streamSynchronize() calls with eventRecord() / streamWaitEvent() pairs.
  • Add teardown for created events and the temporary stream.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flagcx/runner/c2c_algo.cc Outdated
Comment on lines +2392 to +2393
// Record event on stream so hetStream can wait on preHomoFunc completion
FLAGCXCHECK(deviceAdaptor->eventRecord(syncEvent, stream));
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This eventRecord(syncEvent, stream) has no corresponding streamWaitEvent() before syncEvent is recorded again at the end of refreshFunc_ (so this record is effectively overwritten/unused). Removing it would reduce overhead and avoid confusion about which dependency is actually enforced.

Suggested change
// Record event on stream so hetStream can wait on preHomoFunc completion
FLAGCXCHECK(deviceAdaptor->eventRecord(syncEvent, stream));

Copilot uses AI. Check for mistakes.
Comment thread flagcx/runner/c2c_algo.cc Outdated
Comment on lines 2430 to 2432
// Record hetStream progress before homoInterFunc
FLAGCXCHECK(deviceAdaptor->eventRecord(hetSyncEvent, hetStream));

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hetSyncEvent is recorded here but nothing waits on it (the actual wait is set up after the later record at line 2438). This first record appears redundant and adds event traffic inside the inner loop; consider removing it or clarifying which consumer uses this record.

Suggested change
// Record hetStream progress before homoInterFunc
FLAGCXCHECK(deviceAdaptor->eventRecord(hetSyncEvent, hetStream));

Copilot uses AI. Check for mistakes.
Comment thread flagcx/runner/c2c_algo.cc Outdated
Comment on lines 2460 to 2462
// Record event on stream so heteroFunc sees refreshed data
FLAGCXCHECK(deviceAdaptor->eventRecord(syncEvent, stream));

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this sequential-inter section the hetero funcs run on the same stream, so recording syncEvent does not enforce any additional ordering (there is no streamWaitEvent here) and the stream already orders refreshFunc_ before heteroFuncSteps_. This looks like leftover sync logic; consider removing these eventRecord calls (or, if cross-stream ordering was intended, add the corresponding streamWaitEvent on the other stream).

Copilot uses AI. Check for mistakes.
Comment thread flagcx/runner/c2c_algo.cc Outdated
Comment on lines +2376 to +2379
flagcxEvent_t syncEvent, hetSyncEvent;
FLAGCXCHECK(deviceAdaptor->eventCreate(&syncEvent, flagcxEventDisableTiming));
FLAGCXCHECK(
deviceAdaptor->eventCreate(&hetSyncEvent, flagcxEventDisableTiming));
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syncEvent/hetSyncEvent are reused and re-recorded many times while earlier streamWaitEvent() calls on the other stream may still be pending. With CUDA semantics (this adaptor calls cudaEventRecord* / cudaStreamWaitEvent), a wait observes the event’s most recent record, so re-recording can redirect earlier waits and can create circular stream dependencies (deadlock) or unintended extra synchronization. Consider using distinct events per sync point (e.g., per pipeline iteration, or a small ping-pong/event pool sized to the max in-flight waits) and only reusing an event once you know all dependent waits have passed.

Copilot uses AI. Check for mistakes.
Comment thread flagcx/runner/c2c_algo.cc
Comment on lines +2372 to +2379
FLAGCXCHECK(deviceAdaptor->streamCreate(&hetStream));

// Create events for async stream synchronization (avoid CPU-blocking
// streamSync)
flagcxEvent_t syncEvent, hetSyncEvent;
FLAGCXCHECK(deviceAdaptor->eventCreate(&syncEvent, flagcxEventDisableTiming));
FLAGCXCHECK(
deviceAdaptor->eventCreate(&hetSyncEvent, flagcxEventDisableTiming));
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New FLAGCXCHECK(...) calls after allocating scratchBuffer_ mean this function can now return early (on stream/event create/record/wait errors) without freeing scratchBuffer_ and without destroying hetStream / events. Please add a cleanup path (e.g., FLAGCXCHECKGOTO + goto cleanup that destroys any successfully-created resources and frees scratchBuffer_) so error propagation doesn’t leak resources.

Suggested change
FLAGCXCHECK(deviceAdaptor->streamCreate(&hetStream));
// Create events for async stream synchronization (avoid CPU-blocking
// streamSync)
flagcxEvent_t syncEvent, hetSyncEvent;
FLAGCXCHECK(deviceAdaptor->eventCreate(&syncEvent, flagcxEventDisableTiming));
FLAGCXCHECK(
deviceAdaptor->eventCreate(&hetSyncEvent, flagcxEventDisableTiming));
FLAGCXCHECKGOTO(deviceAdaptor->streamCreate(&hetStream), cleanup);
// Create events for async stream synchronization (avoid CPU-blocking
// streamSync)
flagcxEvent_t syncEvent, hetSyncEvent;
FLAGCXCHECKGOTO(deviceAdaptor->eventCreate(&syncEvent, flagcxEventDisableTiming), cleanup);
FLAGCXCHECKGOTO(
deviceAdaptor->eventCreate(&hetSyncEvent, flagcxEventDisableTiming), cleanup);

Copilot uses AI. Check for mistakes.
@MC952-arch MC952-arch force-pushed the ccl/stream-wait-c2c branch from 53720a3 to 1dba93d Compare March 27, 2026 10:11
@MC952-arch MC952-arch marked this pull request as draft March 28, 2026 04:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants