perf(comm): bucket pipeline throttle is per-channel; consider a global in-flight budget on large meshes #107

## Observation

The bucket execution pipeline (`bucket_comm` in `src/etha/comm/comm_methods.py`, formerly `execute_bucket_pipeline`) throttles with `max_in_flight` **per channel** — the cap is checked inside each `channels[bucket.key]` group:

```python
if candidate and len(prepared) + len(in_flight) < max_in_flight:
    ...
```

So the global peak of simultaneously prepared/in-flight buckets — and therefore their buffers and outstanding collectives/descriptors — scales as `num_channels × max_in_flight`, where a channel is a distinct `bucket_key = (src_rank, dst_ranks, partial_sig, cell_key, transport)`.

On a large mesh this product can become large: for example, a target rank receiving P2P from many source ranks has one channel per source. With `bucket_size` unset (single-entry buckets, the no-coalesce default), every chunk is its own bucket, so each such channel can keep up to `max_in_flight` buffers and outstanding ops live at once. The per-rank peak is then roughly `num_channels × max_in_flight` concurrent buffers/descriptors.
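To make the scaling concrete, here is a rough back-of-the-envelope sketch. It is not taken from `comm_methods.py`: the helper `peak_live_buckets` and the example numbers (64 source ranks, 10 chunks each, `max_in_flight=2`) are hypothetical, and the key tuple only mimics the shape of `bucket_key` described above.

```python
from collections import Counter

def peak_live_buckets(bucket_keys, max_in_flight):
    """Upper bound on simultaneously live buckets under a per-channel cap.

    Hypothetical helper: each distinct key is one channel, and a channel
    can hold at most max_in_flight buckets (fewer if it has fewer queued).
    """
    per_channel = Counter(bucket_keys)
    return sum(min(count, max_in_flight) for count in per_channel.values())

# Assumed scenario: rank 0 receives P2P from 64 source ranks, and with
# bucket_size unset each of the 10 chunks per source is its own bucket.
keys = [
    (src, (0,), None, "cell", "p2p")  # (src_rank, dst_ranks, partial_sig, cell_key, transport)
    for src in range(64)
    for _ in range(10)
]

peak = peak_live_buckets(keys, max_in_flight=2)
print(peak)  # 64 channels x 2 in flight = 128
```

The bound grows linearly with the number of channels, while the per-channel cap stays constant, which is the behaviour described above.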

## Why this is not urgent

- **Pre-existing design** — the per-channel throttle predates the #106 refactor (it was the same in `execute_bucket_pipeline`); #106 only widened the *default* (no-coalesce) path's concurrency by routing it through this pipeline instead of the old fully-serial per-chunk path (an intended ~2× speedup, 153→319 GB/s).
- **Bounded by per-rank peer count, not mesh size** — `num_channels` for a rank is the number of distinct peer-groups it talks to, not the total rank count. On the 2-node × 8-GPU benchmark it's small (≤8 channels/rank, ×2 ≈ 16 concurrent), and no memory/descriptor pressure was observed across all 3 configs.
- **Unmeasured at scale** — there is no evidence yet that this bites on the mesh sizes Etha actually runs.

## Proposed fix (when/if it bites)

Add a **global in-flight budget** across channels, on top of the per-channel cap: a shared counter/semaphore incremented at `launch()` and decremented at `finalize()`, so total concurrent in-flight buckets is bounded regardless of channel count. The prepare/launch stages would gate on both the per-channel cap and the global budget.
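A minimal sketch of what such a budget could look like, assuming the pipeline is driven from a single thread that polls each channel (so a plain counter is enough and no locking is shown). `InFlightBudget` and `may_admit` are hypothetical names, not existing Etha APIs, and where exactly the acquire and release calls would sit relative to `launch()` and `finalize()` is left open.

```python
class InFlightBudget:
    """Shared cap on live buckets across all channels (hypothetical)."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.used = 0

    def try_acquire(self):
        # Non-blocking: a full budget just means "try again next poll".
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

    def release(self):
        if self.used == 0:
            raise RuntimeError("release() without a matching acquire")
        self.used -= 1


def may_admit(prepared, in_flight, max_in_flight, budget):
    # Check the per-channel cap first so a full channel never consumes
    # a global slot; only then try to take a slot from the shared budget.
    return len(prepared) + len(in_flight) < max_in_flight and budget.try_acquire()


# Three channels with a per-channel cap of 2 would allow 6 live buckets;
# a global budget of 4 stops admission earlier.
budget = InFlightBudget(limit=4)
channels = {name: {"prepared": [], "in_flight": []} for name in ("a", "b", "c")}
admitted = 0
for _ in range(2):
    for ch in channels.values():
        if may_admit(ch["prepared"], ch["in_flight"], 2, budget):
            ch["in_flight"].append(object())
            admitted += 1

print(admitted)  # 4
```

A non-blocking check is used here rather than a blocking semaphore so that a channel that cannot get a slot does not stall the loop servicing other channels; if the real pipeline is multi-threaded, the counter would need a lock or a `threading.BoundedSemaphore` instead.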

## Action

Don't change #106 (scope: transfer-layer simplification; this is a separate, unmeasured scaling concern). Revisit if peak HBM or NCCL-descriptor pressure is observed on large meshes — first measure peak concurrent buckets/buffers on a representative large run, then add the global budget if warranted.
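For the measurement step, something as small as the following tracker might be enough. This is only a sketch: `PeakTracker` and its hook names are hypothetical, and it assumes the pipeline can call one hook when a bucket's buffer becomes live and another when it is finalized.

```python
class PeakTracker:
    """Records the high-water mark of live buckets and buffer bytes (hypothetical)."""

    def __init__(self):
        self.live_buckets = 0
        self.live_bytes = 0
        self.peak_buckets = 0
        self.peak_bytes = 0

    def on_launch(self, nbytes):
        self.live_buckets += 1
        self.live_bytes += nbytes
        self.peak_buckets = max(self.peak_buckets, self.live_buckets)
        self.peak_bytes = max(self.peak_bytes, self.live_bytes)

    def on_finalize(self, nbytes):
        self.live_buckets -= 1
        self.live_bytes -= nbytes


# Illustrative sequence with made-up buffer sizes.
tracker = PeakTracker()
for size in (100, 200, 50):
    tracker.on_launch(size)
tracker.on_finalize(200)
tracker.on_launch(300)

print(tracker.peak_buckets, tracker.peak_bytes)  # 3 450
```

Logging the two peaks per rank at the end of a representative large run would show whether the `num_channels × max_in_flight` bound is approached in practice.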

Surfaced by a CodeRabbit review comment on #106.
