Skip to content

refactor(comm): cleanups deferred from the transport×role split #103

@junjzhang

Description

@junjzhang

Tracking issue for small items found while refactoring the transfer IR (SELF_COPY/SHADOWtransport ∈ {P2P, BROADCAST, LOCAL, NONE} + orthogonal is_source/is_target). They were intentionally left out of that PR to keep it scoped to the role/transport split and the self-copy bucket-finalize fix. Collecting them here to do in one pass.

1. execute() should own self.work

Transferable.execute() returns the dist.Work handle, and the only two callers immediately assign it back:

  • src/etha/comm/execution.py:90chunk.work = chunk.execute()
  • src/etha/comm/ir.py:218self.work = self.execute() (in Bucket.launch)

Move the assignment into execute() so it sets self.work directly; call sites become chunk.execute() / self.execute(). Trade-off: execute() mutates self instead of being side-effect-free — but since neither caller uses the return value for anything else, the "pure" property buys nothing. Net simplification.

2. Overlapping source/target meshes (a rank in both) are not correctly handled

Latent — production uses disjoint meshes, so none of these fire. The current single-tensor local (self-copy) chunk encodes "read and write the same tensor", which only holds when the source and target tensors coincide (as in the unit test).

  • (a) Wrong-tensor read. The local chunk is built with tensor=target_tensor but reads src_slice (a source-grid slice) — src/etha/comm/get_chunks.py (the Transport.LOCAL branch, ~L100-110). When a rank holds distinct source and target tensors, this reads target_tensor[src_slice] instead of source_tensor[src_slice]. Proper fix: split the single tensor field into read_tensor/write_tensor (then is_source reads read_tensor[src_slice], is_target writes write_tensor[dst_slice], self-copy uses both).
  • (b) Skipped Partial reduce. The local chunk doesn't carry source_partial_groups, so a rank that is both a Partial source and a target won't reduce before its local write.
  • (c) P2P-to-self hang. A route whose sole dst is the source rank emits a P2P is_source chunk that isends to itself with no matching irecv (self-delivery goes through the LOCAL chunk).

All three disappear if overlapping meshes are either (i) properly supported — read/write tensor split + the self-cell routed only through LOCAL, never P2P, with the reduce applied — or (ii) explicitly rejected at map-build time. Decision needed: support vs reject.

Regression anchor: tests/test_self_copy_bucket.py covers the single-tensor LOCAL path. A distinct-tensor test would expose (a).


More to be appended as review continues.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions