refactor(comm): split transfer role from transport #105

Merged
junjzhang merged 1 commit into main from refactor/transport-role-split
May 29, 2026

@junjzhang junjzhang commented May 29, 2026

What did you do

Follow-up to the typed-Route refactor (#99 / #101). The chunk/bucket transfer IR conflated two orthogonal axes into one TransferType enum (P2P/BROADCAST/SELF_COPY/SHADOW) plus an overloaded is_source bool:

  • is_source meant "ships data out" for prepare/execute,
  • but not is_source was reused as "writes a target" in finalize.

A self-copy chunk does both, so Bucket.finalize's if not is_source dropped the local write. Latent in production (disjoint meshes → self-copy never fires), but real — see tests/test_self_copy_bucket.py.
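The overloaded flag can be illustrated with a minimal sketch. `LegacyChunk` and both helper functions are hypothetical stand-ins, not the project's actual classes; only the two readings of `is_source` come from the description above.

```python
from dataclasses import dataclass


@dataclass
class LegacyChunk:
    """Hypothetical stand-in for a pre-refactor chunk with one overloaded flag."""
    is_source: bool


def legacy_ships_data(chunk: LegacyChunk) -> bool:
    # prepare/execute read the flag as "ships data out".
    return chunk.is_source


def legacy_writes_target(chunk: LegacyChunk) -> bool:
    # finalize reused the negation as "writes a target".
    return not chunk.is_source


# A self-copy chunk reads a local tensor AND writes a local target,
# so it has to be marked as a source for prepare/execute to run.
self_copy = LegacyChunk(is_source=True)

print(legacy_ships_data(self_copy))     # True: the read side runs
print(legacy_writes_target(self_copy))  # False: the local write is skipped
```

With a single bool there is no value of `is_source` for which both predicates hold, which is why a chunk that is simultaneously source and target cannot be expressed.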

Split into three orthogonal fields:

  • transport: Transport — how bytes cross ranks: P2P / BROADCAST / LOCAL (same-rank copy) / NONE (reduce-only, was SHADOW).
  • is_source — reads a local tensor into the buffer.
  • is_target — writes the buffer back into a local target tensor.

SELF_COPY → LOCAL ∧ is_source ∧ is_target; SHADOW → NONE ∧ is_source ∧ ¬is_target. The two degenerate enum values disappear. prepare keys off is_source and finalize keys off is_target (one predicate shared by Chunk and Bucket), so the bucket self-copy drop is fixed by construction, not patched. Slice fields are renamed src_slice/dst_slice to pair with src_idx/dst_idx.
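As a rough sketch of the split model: the `Transport` member names and the two flags follow the description above, while `Role`, `reads_local_tensor`, and `writes_local_target` are illustrative names assumed here and may not match `src/etha/comm/transfer.py`.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Transport(Enum):
    P2P = auto()        # point-to-point between two ranks
    BROADCAST = auto()  # one source, several destinations
    LOCAL = auto()      # same-rank copy
    NONE = auto()       # reduce-only, nothing crosses ranks


@dataclass(frozen=True)
class Role:
    """Hypothetical holder for the three orthogonal fields."""
    transport: Transport
    is_source: bool
    is_target: bool


def reads_local_tensor(role: Role) -> bool:
    # The prepare side keys off is_source only.
    return role.is_source


def writes_local_target(role: Role) -> bool:
    # The finalize side keys off is_target only.
    return role.is_target


# The two former enum values, expressed as field combinations.
SELF_COPY = Role(Transport.LOCAL, is_source=True, is_target=True)
SHADOW = Role(Transport.NONE, is_source=True, is_target=False)

print(writes_local_target(SELF_COPY))  # True: the local write is kept
print(writes_local_target(SHADOW))     # False: reduce-only, no target write
```

Because the finalize predicate no longer depends on `is_source`, the same check can serve a chunk and a bucket without special-casing self-copy.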

Also: bucket_key now includes transport, so a local (self-copy) chunk no longer bundles with a co-located broadcast source chunk — they share src_rank/dst_ranks but need different ops and produce broadcast buffers of different sizes across the group (a second latent bug fixed here).
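The effect of adding `transport` to the key can be sketched as follows. `ChunkInfo`, `group`, and the rank values are made up for illustration, and the real `bucket_key` likely contains further fields not shown here.

```python
from collections import defaultdict
from typing import NamedTuple


class ChunkInfo(NamedTuple):
    """Hypothetical, trimmed-down view of a chunk for bucketing purposes."""
    name: str
    transport: str
    src_rank: int
    dst_ranks: tuple


def group(chunks, include_transport: bool) -> dict:
    """Group chunk names by a simplified bucket key."""
    buckets = defaultdict(list)
    for chunk in chunks:
        key = (chunk.src_rank, chunk.dst_ranks)
        if include_transport:
            key = (chunk.transport,) + key
        buckets[key].append(chunk.name)
    return dict(buckets)


# Two co-located chunks that share src_rank/dst_ranks but need different ops.
chunks = [
    ChunkInfo("local_copy", "LOCAL", 0, (0, 1)),
    ChunkInfo("bcast_src", "BROADCAST", 0, (0, 1)),
]

print(len(group(chunks, include_transport=False)))  # 1: both land in one bucket
print(len(group(chunks, include_transport=True)))   # 2: kept apart
```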

No production behavior change on disjoint meshes; the two latent bugs above are now fixed.

New test cases

tests/test_self_copy_bucket.py — a single-rank gloo probe that runs a local (self-copy) chunk through both execution paths (direct chunk + bucket) and asserts the copy lands in the target. The bucket path failed before this change.

Test results

All three verification layers (per CLAUDE.md):

| Layer | Result |
| --- | --- |
| CPU suite (test_communication_*, test_partial_chunk_reduce) | 55/55 |
| Transfer benchmark (2-node × 8-GPU NCCL, all 3 configs) | 24 charts; Transport.P2P / BROADCAST / NONE routes live; source-Partial reduce exercised |
| vLLM weight-sync (8 GPUs, end-to-end) | 3/3 sync rounds, weights propagated (v1→v2) |

Other comments

Deferred follow-ups, tracked separately:

Summary by CodeRabbit

Release Notes

  • Tests

    • Added test coverage for self-copy operations in both chunk and bucketized execution paths.
  • Refactor

    • Restructured communication routing logic and internal chunk/bucket handling for improved consistency in distributed tensor operations.

coderabbitai Bot commented May 29, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between a3ef169 and ca22345.

📒 Files selected for processing (6)
  • src/etha/comm/get_buckets.py
  • src/etha/comm/get_chunks.py
  • src/etha/comm/get_m2m_map.py
  • src/etha/comm/ir.py
  • src/etha/comm/transfer.py
  • tests/test_self_copy_bucket.py

Walkthrough

This PR refactors the distributed tensor communication infrastructure by replacing a single TransferType enum with a new Transport enum that separates data-movement routing (LOCAL, P2P, BROADCAST, NONE) from source/target semantics via is_source/is_target flags. The Chunk data model is updated to use src_slice/dst_slice instead of tuple-based slicing, and Bucket buffer assembly and finalization logic are rewritten to respect the new ownership rules.

Changes

Transport enum refactor and Chunk/Bucket IR refactor

| Layer / File(s) | Summary |
| --- | --- |
| Transport enum and Transferable refactor (src/etha/comm/transfer.py) | Introduces Transport enum (P2P, BROADCAST, LOCAL, NONE) replacing TransferType, updates Transferable dataclass with transport field and is_target flag, and rewrites execute() to match on transport instead of transfer_type. |
| IR data model: Route, Chunk slice fields, and Bucket logic (src/etha/comm/ir.py) | Updates Route.kind type to Transport; refactors Chunk slice fields from slice_tuples/src_slice_tuples to src_slice/dst_slice; rewrites Chunk.prepare() to branch on is_source (sources use src_slice, targets use dst_slice); rewrites Chunk.finalize() to finalize only when is_target; adds transport to Chunk.bucket_key for proper bucketing; updates Bucket.prepare() to handle chunk buffer ownership and Bucket.finalize() to finalize only target-side buckets. |
| Route mapping to Transport kinds (src/etha/comm/get_m2m_map.py) | Updates imports to use Transport and assigns route kinds via Transport.NONE (empty destinations), Transport.BROADCAST (broadcast), and Transport.P2P (single destination). |
| Chunk construction with Transport and slice model (src/etha/comm/get_chunks.py) | Switches broadcast group detection to Transport.BROADCAST and refactors chunk construction to use new Chunk fields (transport, src_slice, dst_slice, is_source, is_target); replaces actual_transfer_type branching with explicit local-vs-remote split. |
| Bucket construction with updated Chunk interface (src/etha/comm/get_buckets.py) | Updates _build_bucket to pass transport and is_target from first_chunk to Bucket constructor. |
| Self-copy tests for local transport validation (tests/test_self_copy_bucket.py) | Adds single-rank process group fixture, _make_self_copy_chunk helper, and two test cases validating self-copy behavior through both chunk_comm and bucket_comm paths. |
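The route-kind assignment summarized for get_m2m_map.py might look roughly like the sketch below. `route_kind` is a hypothetical helper, and treating "more than one destination" as the broadcast case is an assumption; the summary only states the empty and single-destination cases explicitly.

```python
from enum import Enum, auto
from typing import Sequence


class Transport(Enum):
    P2P = auto()
    BROADCAST = auto()
    LOCAL = auto()
    NONE = auto()


def route_kind(dst_ranks: Sequence[int]) -> Transport:
    """Pick a route kind from the destination list (illustrative only)."""
    if not dst_ranks:
        return Transport.NONE       # empty destinations: reduce-only route
    if len(dst_ranks) == 1:
        return Transport.P2P        # single destination
    return Transport.BROADCAST      # assumed: several destinations


print(route_kind([]).name)         # NONE
print(route_kind([3]).name)        # P2P
print(route_kind([1, 2, 3]).name)  # BROADCAST
```

In this sketch LOCAL is never assigned at the route level; judging by the get_chunks.py row, the local-vs-remote split appears to happen later, during chunk construction.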

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

  • cmriat/Etha#104: Directly overlaps the proposed refactor to use Transport enum, is_source/is_target flags, src_slice/dst_slice Chunk model, and updated Bucket.prepare/finalize semantics.
  • cmriat/Etha#103: Related through overlapping source/target handling, LOCAL/self-copy semantics, and routing/execute behavior changes in the same code paths.

Possibly related PRs

  • cmriat/Etha#101: Both refactor the m2m routing pipeline around typed Route/Endpoint lists in get_m2m_map/get_chunks/ir.py; this PR extends that by introducing Transport enum and updating chunk/bucket construction accordingly.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the core refactoring: separating transfer role (source/target semantics) from transport (how bytes move across ranks), which is the main point of the PR. |
| Description check | ✅ Passed | The description covers all required sections: comprehensive explanation of the refactoring, new test cases added with justification, detailed test results across multiple verification layers, and deferred follow-ups. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

Failed to generate code suggestions for PR

@junjzhang

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

The chunk/bucket transfer IR conflated two orthogonal axes into one
`TransferType` enum (`P2P`/`BROADCAST`/`SELF_COPY`/`SHADOW`) plus an
overloaded `is_source` bool. `is_source` meant "ships data out" for
prepare/execute, while `not is_source` was reused as "writes a target"
in finalize. A self-copy chunk does both, so `Bucket.finalize`'s
`if not is_source` dropped the local write — latent because production
uses disjoint meshes, where self-copy never fires.

Replace it with three orthogonal fields:
- `transport: Transport` — how bytes cross ranks: P2P / BROADCAST /
  LOCAL (same-rank copy) / NONE (reduce-only, formerly SHADOW).
- `is_source` — reads a local tensor into the buffer.
- `is_target` — writes the buffer back into a local target tensor.

SELF_COPY becomes `LOCAL ∧ is_source ∧ is_target`; SHADOW becomes
`NONE ∧ is_source ∧ ¬is_target`. The two degenerate enum values are
gone. `prepare` keys off `is_source`, `finalize` off `is_target` (one
predicate, shared by Chunk and Bucket) — the self-copy bucket drop is
fixed by construction. Slice fields renamed `src_slice`/`dst_slice` to
pair with `src_idx`/`dst_idx`.

`bucket_key` now includes `transport`, so a local chunk no longer
bundles with a co-located broadcast source chunk — they share
src_rank/dst_ranks but need different ops and produce broadcast buffers
of different sizes across the group (a second latent bug).

Verified: CPU suite 55/55; transfer benchmark on 2-node × 8-GPU NCCL
across all 3 configs (P2P/BROADCAST/NONE routes live); vLLM weight-sync
end-to-end (3/3 rounds). tests/test_self_copy_bucket.py covers both the
chunk and bucket local-copy paths.

Follow-ups: #103 (small cleanups), #104 (Bucket as sole transfer unit).
@junjzhang junjzhang force-pushed the refactor/transport-role-split branch from a3ef169 to ca22345 Compare May 29, 2026 06:27
@junjzhang junjzhang merged commit 016b52b into main May 29, 2026
3 of 4 checks passed
@junjzhang junjzhang deleted the refactor/transport-role-split branch May 29, 2026 06:31