refactor(comm): Bucket is the sole transfer unit; Chunk is pure data#106
Conversation
|
Warning Review limit reached
More reviews will be available in 44 minutes and 24 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
📝 WalkthroughWalkthroughThis PR removes chunk/Transferable execution, introduces dataclass Chunk and Bucket, centralizes execution in a pipelined bucket_comm, changes get_m2m_map to return M2MMap, replaces map_to_chunk_ops with m2m_to_chunks, and updates benchmarks and tests to use bucketized transfers. ChangesBucket-Based Communication Migration
Sequence Diagram(s)sequenceDiagram
participant Caller
participant get_m2m_map
participant m2m_to_chunks
participant chunk_to_bucket_ops
participant bucket_comm
Caller->>get_m2m_map: compute M2MMap
get_m2m_map->>m2m_to_chunks: m2m, rank -> chunks
m2m_to_chunks->>chunk_to_bucket_ops: chunks
chunk_to_bucket_ops->>bucket_comm: buckets (bucket_size=1)
bucket_comm->>bucket_comm: prepare → launch → finalize (pipelined)
bucket_comm->>Caller: synchronization complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Failed to generate code suggestions for PR |
|
@codex review |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/etha/comm/comm_methods.py (1)
81-119: Pipeline logic looks correct; one scalability note on the per-channel throttle.FIFO advancement of
prepared[0]/in_flight[0]preserves per-channel ordering, so cross-rank collective/P2P matching is maintained. Themax_in_flightcap at Line 90 is enforced per channel (channels[bucket.key]), so the total number of simultaneously prepared/in-flight buckets — and their buffers and outstanding collectives — scales with the number of distinct channel keys. Withbucket_sizeunset (single-entry buckets) and many distinct(src_rank, dst_ranks, transport)groups (e.g. a target rank receiving P2P from many sources), peak concurrent in-flight resources can benum_channels × max_in_flight. If channel counts grow large on big meshes, consider a global in-flight budget in addition to the per-channel cap to bound memory/descriptor pressure.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/etha/comm/comm_methods.py` around lines 81 - 119, The per-channel max_in_flight cap allows total in-flight buckets to grow as num_channels × max_in_flight; add a global in-flight budget to bound overall concurrency: introduce a shared counter (e.g., global_in_flight) and a config value (e.g., global_max_in_flight) and check/decrement it in the same places that manipulate channel in-flight state — when moving a bucket from candidate -> prepared ensure global_in_flight + total_prepared_in_flight < global_max_in_flight (or only when moving prepared -> in_flight if you want to limit launched work), when prepared[0].launch() succeeds increment global_in_flight, and when in_flight[0].is_complete() and you finalize() decrement global_in_flight; update the loops that reference channels, candidate, prepared, in_flight and bucket.prepare()/launch()/finalize() to consult and maintain this global counter atomically (or under the same lock) so the total memory/descriptor pressure is bounded across all channels.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/etha/comm/comm_methods.py`:
- Around line 81-119: The per-channel max_in_flight cap allows total in-flight
buckets to grow as num_channels × max_in_flight; add a global in-flight budget
to bound overall concurrency: introduce a shared counter (e.g.,
global_in_flight) and a config value (e.g., global_max_in_flight) and
check/decrement it in the same places that manipulate channel in-flight state —
when moving a bucket from candidate -> prepared ensure global_in_flight +
total_prepared_in_flight < global_max_in_flight (or only when moving prepared ->
in_flight if you want to limit launched work), when prepared[0].launch()
succeeds increment global_in_flight, and when in_flight[0].is_complete() and you
finalize() decrement global_in_flight; update the loops that reference channels,
candidate, prepared, in_flight and bucket.prepare()/launch()/finalize() to
consult and maintain this global counter atomically (or under the same lock) so
the total memory/descriptor pressure is bounded across all channels.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 03a04e00-695e-453d-b0d9-c8372b9a2b7a
📒 Files selected for processing (15)
bench/README.mdbench/transfer_benchmark.pysrc/etha/comm/__init__.pysrc/etha/comm/comm_methods.pysrc/etha/comm/execution.pysrc/etha/comm/get_buckets.pysrc/etha/comm/get_chunks.pysrc/etha/comm/ir.pysrc/etha/comm/transfer.pysrc/etha/tensor_bus/agent.pysrc/etha/tensor_bus/batch_state.pytests/test_communication_cpu.pytests/test_communication_replicate_shard.pytests/test_communication_symmetric_mesh.pytests/test_self_copy_bucket.py
💤 Files with no reviewable changes (3)
- src/etha/comm/get_buckets.py
- src/etha/comm/transfer.py
- src/etha/comm/execution.py
Collapse the tensor-transfer layer to one transfer unit and one execution engine, and thread the mesh-to-mesh topology as a single struct. Execution: - Chunk becomes a pure shape-dependent descriptor (no execute()/work). Bucket is the sole transfer unit, owning buffer/work and issuing the wire op in launch() (the Transferable base and Bucket.execute are gone). The two execution paths collapse into one: chunk_comm/execute_chunk_simple are deleted and a non-coalesced transfer is the degenerate bucket_size=1 case of bucket_comm (execute_bucket_pipeline inlined; execution.py removed). - BucketEntry removed: Bucket holds chunks: list[Chunk] directly, byte offsets precomputed once in __post_init__, total_bytes O(1). Topology: - get_m2m_map returns an M2MMap (routes + slicers + partial reductions), moved into comm.ir. m2m_to_chunks (renamed from map_to_chunk_ops) takes the M2MMap directly instead of three unpacked args; the agent neither reassembles nor destructures it. The materialized list[Route] value is named "routes". Agent always bucketizes (bucket_size or 1) and dispatches only bucket_comm; BatchState.send_chunks/recv_chunks removed (chunks are a build-time intermediate). Verified: CPU suite 57/57; transfer benchmark on 2-node x 8-GPU NCCL across all 3 configs (bucket throughput unchanged, ~319 GB/s); vLLM weight-sync 3/3 rounds. Closes #104.
54ffcbe to
773133d
Compare
What did you do
Implements #104 — a cleanup of the tensor-transfer layer that collapses it to one transfer unit, one execution engine, and threads the mesh-to-mesh topology as a single struct. No behavior change for the supported (disjoint-mesh) configs.
Execution layer
Chunkis now a pure shape-dependent descriptor — it no longer issues a wire op (droppedexecute()and theworkfield).Bucketis the sole transfer unit: it ownsbuffer/workand issues the op inlaunch(). TheTransferablebase class andBucket.executeare gone.chunk_comm/execute_chunk_simple(per-chunk synchronous, with acuda.synchronize()after every chunk) are deleted; a non-coalesced transfer is now the degeneratebucket_size=1case ofbucket_comm(one chunk → one single-entry bucket).execute_bucket_pipelinewas inlined intobucket_commandexecution.pyremoved.BucketEntryremoved:Bucketholdschunks: list[Chunk]directly; byte offsets are the prefix sum precomputed once in__post_init__, andtotal_bytesis O(1).Topology layer
get_m2m_mapnow returns anM2MMap(routes + slicers + partial reductions) instead of a loose 4-tuple;M2MMapwas moved fromtensor_bus/pair_state.pyintocomm/ir.py(it is pure comm-layer data).m2m_to_chunks(renamed frommap_to_chunk_ops) takes theM2MMapdirectly rather than three unpacked args — the agent neither reassembles nor destructures it. The materializedlist[Route]value is consistently namedroutes.Agent / state
bucket_size or 1) and dispatches onlybucket_comm.BatchState.send_chunks/recv_chunksare removed — chunks are a build-time intermediate; only buckets are persisted and executed, and the transfer guard checks buckets.Net:
Transferable + Chunk + Bucket + BucketEntry + 2 execution paths→Chunk(descriptor) +Bucket(transfer unit) + 1 engine (get_m2m_map → m2m_to_chunks → chunk_to_bucket_ops → bucket_comm). 17 files, −114 net lines (+297 / −411).New test cases
tests/test_self_copy_bucket.pycovers the twoBucket.preparebranches now that everything is a bucket:test_self_copy_single_entry(single-entry) andtest_self_copy_bundled(two local chunks coalesced → multi-entry path; asserts both the bundling and that the writes land).Test results
All three layers (per
CLAUDE.md):test_communication_*,test_partial_chunk_reduce,test_self_copy_bucket)BucketEntry/ build-once offsets)Reviewed by Codex (clean) and an independent scan agent (no missed renames, no incorrect changes). One regression was caught by the vLLM layer during validation — a
NameErrorafter the M2MMap-return change where the agent still referenced the removedpartial_red_1/2locals — and fixed (the CPU/unit layer didn't catch it since it doesn't drive the agent registration path).Other comments
Deferred follow-ups, tracked separately:
Chunk.dst_idxdead field; overlapping-mesh self-copy correctness — Codex flagged atensor=Nonebucketization crash there, latent under disjoint meshes).max_in_flightis per-channel; a global in-flight budget may be worth it on very large meshes (pre-existing, unmeasured).