perf(comm): bucket pipeline throttle is per-channel; consider a global in-flight budget on large meshes #107

@junjzhang

Observation

The bucket execution pipeline (bucket_comm in src/etha/comm/comm_methods.py, formerly execute_bucket_pipeline) throttles with max_in_flight per channel — the cap is checked inside each channels[bucket.key] group:

if candidate and len(prepared) + len(in_flight) < max_in_flight:
    ...

So the global peak of simultaneously prepared/in-flight buckets — and therefore their buffers and outstanding collectives/descriptors — scales as num_channels × max_in_flight, where a channel is a distinct bucket_key = (src_rank, dst_ranks, partial_sig, cell_key, transport).

On a large mesh this can grow: e.g. a target rank receiving P2P from many source ranks has one channel per source. With bucket_size unset (single-entry buckets, the no-coalesce default) every chunk is its own bucket, so each such channel can keep max_in_flight buffers + outstanding ops live at once. Peak ≈ num_channels × max_in_flight concurrent buffers/descriptors.
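As a rough illustration of that bound, the sketch below counts channels from a list of bucket keys and sums what each channel may hold at once. The helper name peak_live_buckets and the simplified (src_rank, dst_rank) keys are hypothetical and not part of Etha, and max_in_flight=2 is an assumed value chosen for the example.

```python
from collections import Counter


def peak_live_buckets(bucket_keys, max_in_flight):
    """Worst-case buckets prepared or in flight at once under a per-channel cap.

    Each distinct key is one channel, and a channel never holds more than
    max_in_flight buckets (or fewer, if it has fewer buckets in total).
    """
    buckets_per_channel = Counter(bucket_keys)
    return sum(min(n, max_in_flight) for n in buckets_per_channel.values())


# Hypothetical target rank 0 receiving 4 single-entry buckets from each source.
small = [(src, 0) for src in range(8) for _ in range(4)]
large = [(src, 0) for src in range(256) for _ in range(4)]

print(peak_live_buckets(small, max_in_flight=2))  # 16
print(peak_live_buckets(large, max_in_flight=2))  # 512
```

The peak grows linearly with the number of source ranks, while the per-channel cap stays fixed.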

Why this is not urgent

  • Pre-existing design: the per-channel throttle predates the #106 refactor (refactor(comm): Bucket is the sole transfer unit; Chunk is pure data); it was the same in execute_bucket_pipeline. #106 only widened the concurrency of the default (no-coalesce) path by routing it through this pipeline instead of the old fully serial per-chunk path (an intended ~2× speedup, 153→319 GB/s).
  • Bounded by per-rank peer count, not mesh size: num_channels for a rank is the number of distinct peer groups it talks to, not the total rank count. On the 2-node × 8-GPU benchmark it is small (≤8 channels/rank, ×2 ≈ 16 concurrent), and no memory/descriptor pressure was observed across all 3 configs.
  • Unmeasured at scale: there is no evidence yet that this bites on the mesh sizes Etha actually runs.

Proposed fix (when/if it bites)

Add a global in-flight budget across channels, on top of the per-channel cap: a shared counter/semaphore incremented at launch() and decremented at finalize(), so total concurrent in-flight buckets is bounded regardless of channel count. The prepare/launch stages would gate on both the per-channel cap and the global budget.
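A minimal sketch of such a gate, assuming a single-threaded scheduling loop, might look like the following. InFlightBudget, try_acquire, release, and global_budget are hypothetical names, not existing Etha APIs. Whether the slot is reserved at prepare or at launch is left open here. If stages run on multiple threads, the counters would need a lock or a semaphore instead.

```python
class InFlightBudget:
    """Admission control with a per-channel cap and a global cap (hypothetical)."""

    def __init__(self, max_in_flight, global_budget):
        self.max_in_flight = max_in_flight
        self.global_budget = global_budget
        self._per_channel = {}
        self._total = 0

    def try_acquire(self, key):
        """Reserve a slot for channel `key`; return False if either cap is hit."""
        used = self._per_channel.get(key, 0)
        if used >= self.max_in_flight or self._total >= self.global_budget:
            return False
        self._per_channel[key] = used + 1
        self._total += 1
        return True

    def release(self, key):
        """Return a slot once the bucket on channel `key` has been finalized."""
        if self._per_channel.get(key, 0) == 0:
            raise ValueError(f"no in-flight bucket on channel {key!r}")
        self._per_channel[key] -= 1
        self._total -= 1

    @property
    def total(self):
        return self._total


budget = InFlightBudget(max_in_flight=2, global_budget=6)
channels = [("src", r) for r in range(8)]

# Two rounds over 8 channels would admit 16 buckets with the per-channel cap alone.
admitted = [key for _ in range(2) for key in channels if budget.try_acquire(key)]
print(len(admitted), budget.total)  # 6 6

budget.release(admitted[0])            # a bucket finalizes
print(budget.try_acquire(("src", 7)))  # True: a slot is free again
```

In this naive form, whichever channels are visited first take the whole budget (channels 6 and 7 are starved in the first pass above), so a real implementation would probably want some fairness across channels.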

Action

Don't change #106 (scope: transfer-layer simplification; this is a separate, unmeasured scaling concern). Revisit if peak HBM or NCCL-descriptor pressure is observed on large meshes — first measure peak concurrent buckets/buffers on a representative large run, then add the global budget if warranted.
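For the measurement step, one lightweight option is a high-water-mark counter updated wherever buckets are launched and finalized. The sketch below is an assumption about how that could be instrumented. PeakTracker, on_launch, and on_finalize are hypothetical names, and the event list stands in for a real run.

```python
class PeakTracker:
    """Records the high-water mark of concurrently live buckets (hypothetical)."""

    def __init__(self):
        self.live = 0
        self.peak = 0

    def on_launch(self):
        self.live += 1
        self.peak = max(self.peak, self.live)

    def on_finalize(self):
        self.live -= 1


tracker = PeakTracker()
# Stand-in event trace; a real run would call the hooks from the pipeline.
for event in ["launch", "launch", "finalize", "launch", "launch", "finalize"]:
    if event == "launch":
        tracker.on_launch()
    else:
        tracker.on_finalize()

print(tracker.peak)  # 3
```

Logging the peak per rank on a representative large run would show whether the num_channels × max_in_flight bound is actually approached.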

Surfaced by a CodeRabbit review comment on #106.
