Observation
The bucket execution pipeline (bucket_comm in src/etha/comm/comm_methods.py, formerly execute_bucket_pipeline) throttles with max_in_flight per channel: the cap is checked inside each channels[bucket.key] group.
So the global peak of simultaneously prepared/in-flight buckets, and therefore of their buffers and outstanding collectives/descriptors, scales as num_channels × max_in_flight, where a channel is a distinct bucket_key = (src_rank, dst_ranks, partial_sig, cell_key, transport).
On a large mesh this can grow: e.g. a target rank receiving P2P from many source ranks has one channel per source. With bucket_size unset (single-entry buckets, the no-coalesce default) every chunk is its own bucket, so each such channel can keep max_in_flight buffers plus outstanding ops live at once. Peak ≈ num_channels × max_in_flight concurrent buffers/descriptors.
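As a rough illustration of the bound above, the sketch below counts distinct channel keys for one target rank and multiplies by the per-channel cap. The BucketKey class, its field types, and peak_in_flight_bound are hypothetical stand-ins for this example, not the actual types in comm_methods.py.

```python
from typing import Iterable, NamedTuple, Tuple


class BucketKey(NamedTuple):
    """Hypothetical stand-in for the channel key; field types are assumed."""
    src_rank: int
    dst_ranks: Tuple[int, ...]
    partial_sig: str
    cell_key: str
    transport: str


def peak_in_flight_bound(keys: Iterable[BucketKey], max_in_flight: int) -> int:
    """Upper bound on concurrent buckets: distinct channels times the per-channel cap."""
    num_channels = len(set(keys))
    return num_channels * max_in_flight


# Rank 0 receives P2P from 7 source ranks, 4 single-entry buckets from each.
keys = [
    BucketKey(src, (0,), "sig", "cell", "p2p")
    for src in range(1, 8)
    for _ in range(4)
]
bound = peak_in_flight_bound(keys, max_in_flight=2)
print(bound)  # 7 channels * 2 = 14
```

The bound depends on the number of distinct keys, not on the number of buckets: the 28 buckets here collapse into 7 channels.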
Why this is not urgent
Bounded by per-rank peer count, not mesh size: num_channels for a rank is the number of distinct peer-groups it talks to, not the total rank count. On the 2-node × 8-GPU benchmark it's small (≤8 channels/rank, ×2 ≈ 16 concurrent), and no memory/descriptor pressure was observed across all 3 configs.
Unmeasured at scale: there is no evidence yet that this bites on the mesh sizes Etha actually runs.
Proposed fix (when/if it bites)
Add a global in-flight budget across channels, on top of the per-channel cap: a shared counter/semaphore incremented at launch() and decremented at finalize(), so total concurrent in-flight buckets is bounded regardless of channel count. The prepare/launch stages would gate on both the per-channel cap and the global budget.
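A minimal sketch of that double gate, assuming a synchronous toy scheduler rather than the real pipeline. InFlightBudget and simulate_peak are hypothetical names invented for this example; the real launch() and finalize() hooks may look quite different.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class InFlightBudget:
    """Hypothetical shared counter bounding in-flight buckets across all channels."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.in_flight = 0

    def try_acquire(self) -> bool:
        # Would be called at launch(); refuses once the global budget is spent.
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Would be called at finalize().
        self.in_flight -= 1


def simulate_peak(buckets_per_channel: Dict[str, int],
                  max_in_flight: int, global_limit: int) -> int:
    """Toy scheduler: launch while both caps allow, then retire the oldest bucket."""
    budget = InFlightBudget(global_limit)
    pending = dict(buckets_per_channel)
    per_channel: Dict[str, int] = defaultdict(int)
    live: Deque[str] = deque()  # channel of each launched bucket, oldest first
    peak = 0
    while live or any(pending.values()):
        progressed = True
        while progressed:
            progressed = False
            for ch in pending:
                # Gate on the per-channel cap first, then on the global budget.
                if (pending[ch] > 0 and per_channel[ch] < max_in_flight
                        and budget.try_acquire()):
                    pending[ch] -= 1
                    per_channel[ch] += 1
                    live.append(ch)
                    progressed = True
        peak = max(peak, budget.in_flight)
        done = live.popleft()  # finalize the oldest in-flight bucket
        per_channel[done] -= 1
        budget.release()
    return peak


work = {f"ch{i}": 4 for i in range(8)}
uncapped = simulate_peak(work, max_in_flight=2, global_limit=10**9)
capped = simulate_peak(work, max_in_flight=2, global_limit=6)
print(uncapped, capped)  # 16 6
```

With 8 channels and a per-channel cap of 2, the toy run peaks at 16 without a meaningful global limit and at 6 with one. A real implementation would also need to decide how the budget is shared fairly across channels, which this sketch does not address.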
Action
Don't change #106 (scope: transfer-layer simplification; this is a separate, unmeasured scaling concern). Revisit if peak HBM or NCCL-descriptor pressure is observed on large meshes — first measure peak concurrent buckets/buffers on a representative large run, then add the global budget if warranted.
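For the measurement step, one option is a small high-water-mark counter bumped wherever buckets are launched and finalized. This is only a sketch: PeakCounter is a hypothetical helper, and where it would hook into bucket_comm is an assumption.

```python
import threading


class PeakCounter:
    """Hypothetical thread-safe counter recording the peak number of live buckets."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def inc(self) -> None:
        # Call when a bucket is launched.
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)

    def dec(self) -> None:
        # Call when a bucket is finalized.
        with self._lock:
            self.current -= 1


counter = PeakCounter()
for event in ("inc", "inc", "dec", "inc", "inc"):
    getattr(counter, event)()
print(counter.current, counter.peak)  # 3 3
```

Logging counter.peak per rank at the end of a representative large run would show how close the observed peak gets to num_channels × max_in_flight.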
Context: #106 (refactor(comm): Bucket is the sole transfer unit; Chunk is pure data) only widened the default (no-coalesce) path's concurrency by routing it through this pipeline instead of the old fully serial per-chunk path (an intended ~2× speedup, 153→319 GB/s).
Surfaced by a CodeRabbit review comment on #106.