
[ttl] Use receiver-authored SRAM address tables for scalable PipeNet collectives(#620) #622

Draft
brnorris03 wants to merge 29 commits into bnorris/fix-pipe-unicast-608 from bnorris/pipes-620


@brnorris03 brnorris03 commented May 23, 2026

Problem description

PR #614 makes PipeNet transfers correct by using receiver-posted destination DFB addresses, but the physical pipe synchronization layout still uses hardware semaphore ids for resources that do not need to be local semaphores. Address publication is data storage, not synchronization, and sender-ready counts need to scale with PipeNet pipe count rather than the local semaphore-id budget.

This matters because the per-pipe layout can exceed the TT hardware semaphore limit (#619). Uniform collective transfers also need a clear correctness boundary: until per-receiver destination addresses are represented (#617), collective receivers must publish equivalent destination DFB addresses.

What's changed

TLDR: This PR separates PipeNet address storage from synchronization, makes the compiler emit the physical pipe resource plan, and uses GlobalSemaphore-backed sender-ready counters when local semaphore ids are not sufficient. Receiver posts publish actual destination DFB addresses into source-core SRAM address tables, receivers increment a counted ready object, and senders consume the receiver-authored address after the ready count is satisfied.
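
The receiver-post and sender-wait ordering described above can be modeled on the host as a small thread simulation. This is only an illustrative sketch: `SourceCoreState`, `receiver_post`, and `sender_wait_and_read` are hypothetical names, Python threads stand in for cores, and a condition variable stands in for the counted ready object. It is not the emitted TTKernel code.

```python
import threading


class SourceCoreState:
    """Hypothetical host-side model of the sender state for one uniform collective pipe."""

    def __init__(self, receiver_count):
        self.receiver_count = receiver_count
        self.address_table = [None]  # plain storage: one address-table slot
        self.ready_count = 0         # synchronization: counted ready object
        self._cond = threading.Condition()

    def receiver_post(self, dfb_address):
        # Publish the destination DFB address first, then bump the ready count,
        # so the address is visible before the sender can observe the count.
        with self._cond:
            self.address_table[0] = dfb_address
            self.ready_count += 1
            self._cond.notify_all()

    def sender_wait_and_read(self, timeout=5.0):
        # Wait until every receiver has posted, then read the receiver-authored address.
        with self._cond:
            satisfied = self._cond.wait_for(
                lambda: self.ready_count >= self.receiver_count, timeout
            )
            if not satisfied:
                raise TimeoutError("ready count was not satisfied")
            return self.address_table[0]


def run_uniform_collective(receiver_count, dfb_address):
    """Run one transfer in which all receivers publish the same address."""
    state = SourceCoreState(receiver_count)
    receivers = [
        threading.Thread(target=state.receiver_post, args=(dfb_address,))
        for _ in range(receiver_count)
    ]
    for thread in receivers:
        thread.start()
    address = state.sender_wait_and_read()
    for thread in receivers:
        thread.join()
    return address, state.ready_count
```

The model keeps the address slot and the counter as separate fields to mirror the storage versus synchronization split; it does not model counter reset between transfers or multiple live posts.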

In addition, pre-TTKernel compiler abstractions now use point-to-point and collective terminology for semantic pipe transfer contracts, instead of using unicast and multicast for user-level semantics. A Pipe defines a transfer relation from one source participant to one or more receiver participants. That transfer relation has a contract independent of how it is lowered:

  • point_to_point: exactly one source participant and exactly one receiver participant.
  • collective: one source participant and a receiver set participating in one transfer contract.

The lowering decision, such as a point-to-point NOC write versus a hardware multicast write, is a TTKernel-level implementation detail that can depend on the transfer contract, receiver count, hardware capabilities, and later cost-model decisions.
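
As a rough illustration of the contract distinction, the sketch below classifies a pipe from a preserved collective flag rather than from receiver count alone. `PipeSpec`, `TransferContract`, and `transfer_contract` are hypothetical names for this example and are not the compiler's types; the `is_collective` field only mirrors the idea of a flag carried from the frontend.

```python
from dataclasses import dataclass
from enum import Enum


class TransferContract(Enum):
    POINT_TO_POINT = "point_to_point"
    COLLECTIVE = "collective"


@dataclass(frozen=True)
class PipeSpec:
    """Hypothetical description of one pipe's participants."""
    source: tuple
    receivers: tuple
    is_collective: bool  # assumed to be preserved from the frontend


def transfer_contract(pipe):
    """Return the semantic contract, independent of how the pipe is lowered."""
    if not pipe.receivers:
        raise ValueError("a pipe needs at least one receiver")
    if pipe.is_collective:
        # A receiver set of size one is still a collective contract.
        return TransferContract.COLLECTIVE
    if len(pipe.receivers) != 1:
        raise ValueError("a point-to-point pipe has exactly one receiver")
    return TransferContract.POINT_TO_POINT
```

For example, `PipeSpec((0, 0), ((0, 1),), is_collective=True)` classifies as collective even though it has a single receiver, which is the degenerate case the contract is meant to keep distinct from point-to-point.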

Details

  • Pipe synchronization previously used hardware semaphore ids for both mailbox storage and synchronization. Address publication now uses compiler-managed source-core SRAM address tables, while ready counting and completion waits are represented as separate resources.
  • Source-in-destination uniform collective transfer previously used the posted-address mailbox protocol. It now uses aggregate ready counting: each receiver contributes to one ready count, the sender waits for the receiver count, and the sender reads one receiver-authored DFB address from its source-core address table.
  • Non-loopback collective transfer previously either needed the posted-address mailbox protocol or required the sender to reconstruct receiver DFB state. Receivers now publish their DFB address directly into the sender's SRAM address table with an inline 32-bit NOC write, so non-loopback uniform collective transfer uses the same aggregate ready counting protocol.
  • Sender-ready counting previously consumed one local semaphore id per source-local pipe. Ready counters are now selected by the compiler resource plan, with GlobalSemaphore-backed counters used for scalable layouts.
  • C++ lowering and Python host allocation previously maintained parallel resource estimates. The compiler now emits the pipe resource plan that specifies local semaphore descriptors, GlobalSemaphore descriptors, SRAM address-table allocations, and runtime argument positions.
  • Collective receive posts previously allowed the same logical pipe to publish different destination DFB addresses. PipeGraph now validates receiver DFB index, DFB type, and static destination tile offset before lowering; non-uniform or untraceable collective destination addresses emit a diagnostic citing #617 (Support per-destination pipe receive addresses for multicast lowering).
  • Transfer contract was previously inferred only from destination extent. The frontend now preserves the collective contract on ttl.create_pipe, so a slice-origin collective transfer covering one receiver still uses collective ready-counting layout.
  • Pre-TTKernel code previously used unicast and multicast for semantic pipe contracts, which made the user-level contract easy to confuse with hardware multicast lowering. Frontend and compiler wording now use point-to-point and collective transfer contracts; legacy is_unicast / is_multicast accessors remain as deprecated aliases, while TTKernel and profiler code keep hardware multicast terminology.
  • Semaphore over-allocation previously reached TTKernel emission, where it could request invalid semaphore ids. Pipe synchronization resource allocation now reports a deterministic error before TTKernel emission when no valid physical allocation is available.
  • TTKernel previously exposed only semaphore-style remote writes, not a general remote 32-bit L1 write for receiver-authored address publication. The tt-mlir submodule now includes ttkernel.noc_inline_dw_write, which lowers to noc_inline_dw_write<InlineWriteDst::L1>.
  • Aggregate ready counting and GlobalSemaphore-backed ready counters remove semaphore growth from collective receiver count and source-local pipe count for per-device PipeNets, including SPMD mesh execution where the same intra-chip PipeNet runs on each shard. They do not add inter-chip PipeNets or receiver DFB slot reuse; full-device all-to-all patterns with more overlapping arrivals than the supported DFB block count still need batched/phased DFB reuse after explicit pipe transfer IR.
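
The ready-counter selection and the early resource diagnostic in the list above might look roughly like the following host-side model. Everything here is an assumption made for illustration: the `PipeResourcePlan` fields, the function name, the budget parameters, and the one-word-per-pipe address-table sizing are hypothetical, and the real limits come from the hardware and the compiler-emitted plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipeResourcePlan:
    """Hypothetical summary of the resources planned for one source core."""
    ready_counter_kind: str   # "local" or "global"
    ready_counters: int
    address_table_words: int  # assumption: one 32-bit word per source-local pipe


def plan_ready_counters(source_local_pipes, other_local_semaphores,
                        local_budget, global_budget):
    """Pick a ready-counter backing, or fail before any kernel code is emitted."""
    if source_local_pipes < 0 or other_local_semaphores < 0:
        raise ValueError("counts must be non-negative")
    free_local = local_budget - other_local_semaphores
    if free_local < 0:
        raise ValueError("local semaphores already exceed the local budget")
    if source_local_pipes <= free_local:
        kind = "local"
    elif source_local_pipes <= global_budget:
        # Spill: local semaphore ids are not sufficient for the ready counters.
        kind = "global"
    else:
        raise ValueError("no valid physical allocation for pipe ready counters")
    return PipeResourcePlan(kind, source_local_pipes, source_local_pipes)
```

Because address publication is counted as SRAM words rather than semaphore ids, only the ready counters compete for the semaphore budget in this model.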

Tests

  • Aggregate ready counting pytests cover loopback collective transfers, full-grid fanout, non-loopback address-table publication, degenerate slice-origin collective transfers, row all-to-all collective transfers, 2D all-to-all collective transfers, and the many-PipeNet semaphore-limit regression.
  • GlobalSemaphore resource tests cover source-local pipe counts that exceed local semaphore capacity, interleaved PipeNets, and full-launch-grid point-to-point fanout at the GlobalSemaphore threshold.
  • Negative pytest coverage checks non-uniform collective destination addresses, impossible physical resource allocations, and the depth-1 protocol case that requires phased pipe transfer lowering (#623, PipeNet phased channel lowering for repeated sends on one logical channel).
  • Python-only PipeNet resource model tests cover degenerate collective transfers, non-loopback address-table allocation, overlapping source-local pipes, GlobalSemaphore ready counters, and large 2D all-to-all resource scaling.
  • Host-side runner unit tests cover GlobalSemaphore allocation, explicit device selection, tensor-derived device selection, zero-count allocation, and missing-device diagnostics.
  • MLIR lit tests cover aggregate send/receive lowering, non-loopback address-table lowering, resource-plan emission, GlobalSemaphore-ready lowering, degenerate collective metadata, dynamic/untraceable collective receive offsets, invalid collective metadata, and deterministic resource diagnostics.
  • Python lit coverage includes the liveness-based resource allocation reproducer for issue #625 (PipeNet dual delivery-path reproducer deadlocks when both routes are enabled), which exercises the combined row/column/helper PipeNet schedule on an 8x7 launch grid and verifies successful device synchronization.
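
The uniformity rule that the negative tests exercise can be approximated in a few lines. This is a sketch under stated assumptions: `ReceiverPost` and `check_uniform_destination` are hypothetical names, a `None` tile offset stands in for a dynamic or untraceable offset, and the error text is illustrative rather than the actual diagnostic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReceiverPost:
    """Hypothetical record of one receiver's post for a collective pipe."""
    core: tuple
    dfb_index: int
    dfb_type: str
    tile_offset: Optional[int]  # None models a dynamic/untraceable offset


def check_uniform_destination(posts):
    """Return the shared (index, type, offset) key, or raise if it is not unique."""
    if not posts:
        raise ValueError("collective pipe has no receiver posts")
    for post in posts:
        if post.tile_offset is None:
            raise ValueError(
                f"untraceable destination offset on core {post.core} (see #617)"
            )
    keys = {(p.dfb_index, p.dfb_type, p.tile_offset) for p in posts}
    if len(keys) != 1:
        raise ValueError("non-uniform collective destination addresses (see #617)")
    return next(iter(keys))
```

A negative test in this style would build two posts that differ only in `tile_offset` and assert that the check raises.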

Stacked PR sequence

This stack covers the pipe compilation work required to match Blaze's scalable communication model: GlobalSemaphore-backed counting, explicit L1 address/state, DFB lifetime allocation, and batching where storage is limiting.

| # | PR | Status | Main problem solved |
| --- | --- | --- | --- |
| 1 | #614 | Base PR | Makes PipeNet transfers correct by switching to receiver-posted destination DFB addresses. This removes sender inference of receiver DFB write-pointer state, but not semaphore pressure. |
| 2 | #622 | Current PR | Removes collective receiver-count and source-local pipe-count semaphore growth by using receiver-authored SRAM address tables, compiler-emitted pipe resource plans, and GlobalSemaphore-backed ready counters. This eliminates the local semaphore-id scaling limit for large uniform collective transfers. |
| 3 | #624 | Follow-up PR | Represents transfer phase, receiver-authored address publication, ready counting, send, and receive-token wait explicitly before control-flow-general lowering. This removes the need to infer transfer phase and queue depth from ttl.copy placement. |
| 4 | #627 | Follow-up PR | Assigns source-core address-table slots and sender-ready counters from explicit pipe transfer lifetimes. This reduces those resources from logical pipe count to concurrently live same-source transfer count while keeping receiver completion per PipeNet. |
| 5 | TBD | Planned follow-up | Supports phased pipe transfer lowering for receive-ahead and pipelined loops with monotonic ready counts, finite address-table depths, and deterministic diagnostics when safe pipe transfer state cannot be allocated. This extends the current one-live-post protocol to programs where later receive posts may be live before earlier sends complete. Issue #623 |
| 6 | TBD | Planned follow-up | Lowers large all-to-all into communication batches when receiver DFB capacity is smaller than the number of incoming pipes. This eliminates the simultaneous receiver DFB slot scaling limit when the program can consume batches incrementally. |
| 7 | TBD | Planned follow-up | Groups post/wait state when pipe tokens prove identical lifetime and completion behavior. This reduces duplicate pipe synchronization resources after liveness allocation exists. |
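
To make step 4 of the sequence concrete, the sketch below counts how many address-table slots a source core would need if slots were assigned from transfer lifetimes instead of one per logical pipe. It is a generic interval sweep written for illustration, not the allocator from #627; the `(start, end)` half-open lifetime representation and the function name are assumptions.

```python
def slots_needed(lifetimes):
    """Return the peak number of concurrently live same-source transfers.

    `lifetimes` is a list of half-open (start, end) intervals, a hypothetical
    stand-in for explicit pipe transfer lifetimes.
    """
    events = []
    for start, end in lifetimes:
        if end <= start:
            raise ValueError("each lifetime must have end > start")
        events.append((start, 1))   # transfer becomes live
        events.append((end, -1))    # transfer releases its slot
    # Tuple ordering sorts releases (-1) before acquisitions (+1) at equal
    # times, so back-to-back transfers can reuse a slot.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

Three back-to-back transfers such as `[(0, 2), (2, 4), (4, 6)]` need one slot under this model, whereas a one-slot-per-logical-pipe layout would reserve three.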

Fixes #620, #625.

@brnorris03 brnorris03 force-pushed the bnorris/pipes-620 branch from 3e5ebfb to e33add9 May 23, 2026 01:47
@brnorris03 brnorris03 changed the title from "[ttl] Pipe verification (part of 620)" to "[ttl] Aggregate rendezvous for uniform multicast PipeNets (part of 620)" May 23, 2026
@brnorris03 brnorris03 changed the base branch from main to bnorris/fix-pipe-unicast-608 May 23, 2026 06:30
brnorris03 added 13 commits May 23, 2026 08:20
- Rename aggregate rendezvous lookup helper to match ReceiverDFBInfo.

- Use dfbIndex and dfbType when constructing aggregate channel info.
Lower eligible uniform multicast pipes with a counted sender-ready rendezvous instead of per-pipe posted-address mailbox storage. Source-in-destination multicast derives the destination address from local receiver DFB state; safe non-loopback multicast uses a sender-local epoch counter plus static receiver slot metadata. Overlapping non-loopback multicast keeps the posted-address mailbox protocol because multiple reserve slots can be live.

Also validate multicast receiver DFB uniformity before lowering, preserve semantic multicast kind through create_pipe, align host semaphore counting with C++ lowering, and document the resource model. Adds device, sim, and lit coverage for loopback, degenerate, non-loopback, all-to-all, overlap fallback, and semaphore-limit cases.
Refactor PipeNet channel lowering so ready counting and address storage are represented separately. Source-in-destination uniform multicast now uses aggregate rendezvous by waiting on one sender-ready count and reading the local receiver DFB address, while non-loopback multicast stays on the receiver-posted mailbox protocol until explicit receiver-authored address tables exist.

Remove the non-loopback sender-local epoch reconstruction logic, since it inferred receiver DFB state from sender execution. Keep host-side semaphore accounting aligned with the implemented C++ resource model and restore deterministic PipeGraph slot assignment without reserve-slot metadata that is no longer used.

Update PipeNets documentation and focused tests to cover source-in-destination aggregate lowering, non-loopback posted-mailbox fallback, and semaphore-count expectations.

Validated with: cmake --build build; python -m pytest test/sim/test_operation_pipenets.py -q; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops.mlir; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops_invalid.mlir; Docker pytest test/python/pipe/test_pipenet_rendezvous.py -xvs -rxX.
- address storage carries receiver-authored DFB addresses;
- ready counting records how many receivers have posted a transfer;
- completion wait records when receiver-owned DFB storage contains the
  payload.
Update the tt-mlir submodule to the TTKernel change that allows remote_sram_write_u32 to use computed SRAM source addresses. This supports PipeNet address tables stored in ordinary SRAM instead of semaphore-backed mailbox words.
Uniform multicast now separates receiver-authored address storage from ready and completion synchronization. Receivers publish DFB write pointers to source-core SRAM address-table entries with TTKernel inline NOC writes, and senders consume those entries after the aggregate ready count instead of using semaphore-backed address mailbox words.

Add hidden L1 scratch allocation and common-runtime-arg plumbing for the address tables, update host semaphore counting to match the compiler layout, and refresh MLIR, simulator, and hardware pytest coverage for non-loopback multicast and semaphore scaling.
Parameterize the backend-neutral fanout semaphore test over several recipient counts, including 50 recipients, to verify that a single multicast pipe keeps constant semaphore usage as destination count grows.

Replace the fixed-row hardware fanout test with a grid=full variant that checks one receiver, a small fanout, and all device nodes except the source. The full-device case decomposes the all-but-source region into rectangular multicast pipes while still exercising receiver-authored SRAM address publication.
Rename the PipeNet rendezvous pytest to test_pipenet_sync.py for a shorter and clearer filename.

Document that aggregate multicast rendezvous removes semaphore growth with destination count but does not remove receiver DFB capacity requirements for overlapping all-to-all arrivals.
@brnorris03 brnorris03 force-pushed the bnorris/pipes-620 branch from 1ff998e to e667ae5 May 23, 2026 15:28
Record compiler-owned PipeNet resource requirements with module attrs for local semaphores, GlobalSemaphore ready counters, and SRAM address-table storage.

Lower receiver posts through receiver-authored SRAM address tables so address publication no longer consumes semaphore ids, and use GlobalSemaphore-backed ready counters when source-local pipes exceed the local semaphore budget.

Thread the resource plan through Python runtime allocation, update host-side PipeNet accounting, and add focused kernel-runner, simulator, MLIR, and hardware pytest coverage for global ready counters and aggregate ready-counting behavior.

Document the current lowering model and Device 2.0 transition points for resource binding.
Replace pre-TTKernel multicast classification with an explicit point-to-point vs collective transfer contract. The frontend now emits isCollective on ttl.create_pipe for slice-origin receiver sets, including degenerate one-receiver collectives, and PipeGraph/PipeLowering carry PipeTransferContract through resource planning instead of using hardware-oriented multicast terminology.

Keep hardware multicast naming in TTKernel emission, where the physical NOC operation is selected independently from the semantic transfer contract.
Replace the cached-kernel GlobalSemaphore lifetime list instead of appending to it on every execution. This keeps the current call's semaphore objects alive without retaining stale semaphore objects across repeated kernel invocations.

Add a Python-only runner test that executes a GlobalSemaphore-backed kernel twice and verifies the owner list remains bounded to the current allocation.
Advance tt-mlir to the fix that preserves noc_async_write_barrier after ttkernel.noc_inline_dw_write. Pipe receive posts rely on that barrier to publish receiver-authored address-table entries before incrementing sender-ready counters.
@brnorris03 brnorris03 changed the title from "[ttl] Aggregate rendezvous for uniform multicast PipeNets (part of 620)" to "[ttl] Use receiver-authored SRAM address tables for scalable PipeNet collectives(#620)" May 23, 2026
Use point-to-point and collective terminology for PipeNet transfer contracts before TTKernel lowering so semantic pipe contracts are not confused with hardware multicast lowering.

Add non-deprecated Python and C++ accessors for point-to-point/single-receiver and collective/multiple-receiver queries. Keep the old unicast/multicast accessors as deprecated compatibility aliases, while leaving TTKernel, profiler, and NOC hardware multicast terminology intact.

Update PipeNet docs, validation diagnostics, and frontend lowering comments to use the new semantic wording.
@brnorris03 brnorris03 force-pushed the bnorris/pipes-620 branch from 65b346d to d2c5897 May 23, 2026 19:08
Split pipe resource, address-table, and receiver-address checks into non-mutating preflight records before TTKernel emission. This keeps pipe send, post, and wait conversion patterns from returning failure after creating partial IR.

Move tensor accessor and DFB rank validation before tensor/DFB copy emission, and switch PipeGraph construction to a typed walk that interrupts on the first receiver validation failure.
Refine pipe resource plan helpers and validation so address storage, ready counters, and completion wait resources are represented explicitly.

Add runtime argument count validation for compiler-emitted pipe resource plans.

Expand MLIR and Python coverage for semaphore spill boundaries, collective metadata, and pipe runtime resource diagnostics.
Stage ttl.constants in the tt-lang-sim wheel because ttl._pipenets imports the shared hardware semaphore limit. This fixes the wheel smoke import failure for ttl.sim after adding the shared PipeNet constants module.
Document the single-receiver collective pipe contract in TableGen.

Use a distinct PipeSourceKey type for source-local ready-counter allocation.

Share the Python ready-counter spill predicate between local and GlobalSemaphore counts.
Factor pipe SRAM scratch, GlobalSemaphore, runtime argument, semaphore descriptor, and io_tensors setup into reusable kernel_runner helpers.

Make emitted runners import those helpers instead of duplicating the pipe runtime body.

Development

Successfully merging this pull request may close these issues.

Implement aggregate rendezvous lowering for uniform multicast PipeNets