refactor(comm): split transfer role from transport#105
Conversation
|
Warning Review limit reached
More reviews will be available in 41 minutes and 23 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR refactors the distributed tensor communication infrastructure by replacing a single ChangesTransport enum refactor and Chunk/Bucket IR refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Failed to generate code suggestions for PR |
|
@codex review |
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
The chunk/bucket transfer IR conflated two orthogonal axes into one `TransferType` enum (`P2P`/`BROADCAST`/`SELF_COPY`/`SHADOW`) plus an overloaded `is_source` bool. `is_source` meant "ships data out" for prepare/execute, while `not is_source` was reused as "writes a target" in finalize. A self-copy chunk does both, so `Bucket.finalize`'s `if not is_source` dropped the local write — latent because production uses disjoint meshes, where self-copy never fires. Replace it with three orthogonal fields: - `transport: Transport` — how bytes cross ranks: P2P / BROADCAST / LOCAL (same-rank copy) / NONE (reduce-only, formerly SHADOW). - `is_source` — reads a local tensor into the buffer. - `is_target` — writes the buffer back into a local target tensor. SELF_COPY becomes `LOCAL ∧ is_source ∧ is_target`; SHADOW becomes `NONE ∧ is_source ∧ ¬is_target`. The two degenerate enum values are gone. `prepare` keys off `is_source`, `finalize` off `is_target` (one predicate, shared by Chunk and Bucket) — the self-copy bucket drop is fixed by construction. Slice fields renamed `src_slice`/`dst_slice` to pair with `src_idx`/`dst_idx`. `bucket_key` now includes `transport`, so a local chunk no longer bundles with a co-located broadcast source chunk — they share src_rank/dst_ranks but need different ops and produce broadcast buffers of different sizes across the group (a second latent bug). Verified: CPU suite 55/55; transfer benchmark on 2-node × 8-GPU NCCL across all 3 configs (P2P/BROADCAST/NONE routes live); vLLM weight-sync end-to-end (3/3 rounds). tests/test_self_copy_bucket.py covers both the chunk and bucket local-copy paths. Follow-ups: #103 (small cleanups), #104 (Bucket as sole transfer unit).
a3ef169 to
ca22345
Compare
What did you do
Follow-up to the typed-
Routerefactor (#99 / #101). The chunk/bucket transfer IR conflated two orthogonal axes into oneTransferTypeenum (P2P/BROADCAST/SELF_COPY/SHADOW) plus an overloadedis_sourcebool:is_sourcemeant "ships data out" forprepare/execute,not is_sourcewas reused as "writes a target" infinalize.A self-copy chunk does both, so
Bucket.finalize'sif not is_sourcedropped the local write. Latent in production (disjoint meshes → self-copy never fires), but real — seetests/test_self_copy_bucket.py.Split into three orthogonal fields:
transport: Transport— how bytes cross ranks:P2P/BROADCAST/LOCAL(same-rank copy) /NONE(reduce-only, wasSHADOW).is_source— reads a local tensor into the buffer.is_target— writes the buffer back into a local target tensor.SELF_COPY→LOCAL ∧ is_source ∧ is_target;SHADOW→NONE ∧ is_source ∧ ¬is_target. The two degenerate enum values disappear.preparekeys offis_source,finalizeoffis_target— one predicate shared byChunkandBucket, so the bucket self-copy drop is fixed by construction, not patched. Slice fields renamedsrc_slice/dst_sliceto pair withsrc_idx/dst_idx.Also:
bucket_keynow includestransport, so a local (self-copy) chunk no longer bundles with a co-located broadcast source chunk — they sharesrc_rank/dst_ranksbut need different ops and produce broadcast buffers of different sizes across the group (a second latent bug fixed here).No production behavior change on disjoint meshes; the two latent bugs above are now correct.
New test cases
tests/test_self_copy_bucket.py— a single-rank gloo probe that runs a local (self-copy) chunk through both execution paths (direct chunk + bucket) and asserts the copy lands in the target. The bucket path failed before this change.Test results
All three verification layers (per
CLAUDE.md):test_communication_*,test_partial_chunk_reduce)Transport.P2P/BROADCAST/NONEroutes live; source-Partial reduce exercisedOther comments
Deferred follow-ups, tracked separately:
execute()owningself.work,Chunk.dst_idxdead field, overlapping-mesh self-copy correctness).Bucketthe sole transfer unit, demoteChunkto a data descriptor, collapse the two execution paths into one (benchmark shows the chunk path is also ~2× slower).Summary by CodeRabbit
Release Notes
Tests
Refactor