[ttl] Use receiver-authored SRAM address tables for scalable PipeNet collectives (#620) #622
Draft
brnorris03 wants to merge 29 commits into
3e5ebfb to
e33add9
Compare
- Rename aggregate rendezvous lookup helper to match ReceiverDFBInfo.
- Use dfbIndex and dfbType when constructing aggregate channel info.
Lower eligible uniform multicast pipes with a counted sender-ready rendezvous instead of per-pipe posted-address mailbox storage. Source-in-destination multicast derives the destination address from local receiver DFB state; safe non-loopback multicast uses a sender-local epoch counter plus static receiver slot metadata. Overlapping non-loopback multicast keeps the posted-address mailbox protocol because multiple reserve slots can be live. Also validate multicast receiver DFB uniformity before lowering, preserve semantic multicast kind through create_pipe, align host semaphore counting with C++ lowering, and document the resource model. Adds device, sim, and lit coverage for loopback, degenerate, non-loopback, all-to-all, overlap fallback, and semaphore-limit cases.
Refactor PipeNet channel lowering so ready counting and address storage are represented separately. Source-in-destination uniform multicast now uses aggregate rendezvous by waiting on one sender-ready count and reading the local receiver DFB address, while non-loopback multicast stays on the receiver-posted mailbox protocol until explicit receiver-authored address tables exist. Remove the non-loopback sender-local epoch reconstruction logic, since it inferred receiver DFB state from sender execution. Keep host-side semaphore accounting aligned with the implemented C++ resource model and restore deterministic PipeGraph slot assignment without reserve-slot metadata that is no longer used. Update PipeNets documentation and focused tests to cover source-in-destination aggregate lowering, non-loopback posted-mailbox fallback, and semaphore-count expectations. Validated with: cmake --build build; python -m pytest test/sim/test_operation_pipenets.py -q; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops.mlir; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops_invalid.mlir; Docker pytest test/python/pipe/test_pipenet_rendezvous.py -xvs -rxX.
- address storage carries receiver-authored DFB addresses;
- ready counting records how many receivers have posted a transfer;
- completion wait records when receiver-owned DFB storage contains the payload.
Update the tt-mlir submodule to the TTKernel change that allows remote_sram_write_u32 to use computed SRAM source addresses. This supports PipeNet address tables stored in ordinary SRAM instead of semaphore-backed mailbox words.
Uniform multicast now separates receiver-authored address storage from ready and completion synchronization. Receivers publish DFB write pointers to source-core SRAM address-table entries with TTKernel inline NOC writes, and senders consume those entries after the aggregate ready count instead of using semaphore-backed address mailbox words. Add hidden L1 scratch allocation and common-runtime-arg plumbing for the address tables, update host semaphore counting to match the compiler layout, and refresh MLIR, simulator, and hardware pytest coverage for non-loopback multicast and semaphore scaling.
Parameterize the backend-neutral fanout semaphore test over several recipient counts, including 50 recipients, to verify that a single multicast pipe keeps constant semaphore usage as destination count grows. Replace the fixed-row hardware fanout test with a grid=full variant that checks one receiver, a small fanout, and all device nodes except the source. The full-device case decomposes the all-but-source region into rectangular multicast pipes while still exercising receiver-authored SRAM address publication.
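The decomposition of the all-but-source region into rectangles can be sketched as follows. This is a minimal illustration, not the test's actual code: the helper names `all_but_source_rects` and `covered_cells`, the inclusive-corner rectangle encoding, and the band-based split are all assumptions made here for clarity.

```python
from typing import List, Set, Tuple

# (x0, y0, x1, y1) with inclusive corners; this encoding is an assumption.
Rect = Tuple[int, int, int, int]


def all_but_source_rects(width: int, height: int, sx: int, sy: int) -> List[Rect]:
    """Cover every node of a width x height grid except (sx, sy) with
    disjoint rectangles. Hypothetical helper for illustration only."""
    if not (0 <= sx < width and 0 <= sy < height):
        raise ValueError("source must lie inside the grid")
    rects: List[Rect] = []
    if sy > 0:  # full-width band of rows before the source row
        rects.append((0, 0, width - 1, sy - 1))
    if sy < height - 1:  # full-width band of rows after the source row
        rects.append((0, sy + 1, width - 1, height - 1))
    if sx > 0:  # remainder of the source row on the low-x side
        rects.append((0, sy, sx - 1, sy))
    if sx < width - 1:  # remainder of the source row on the high-x side
        rects.append((sx + 1, sy, width - 1, sy))
    return rects


def covered_cells(rects: List[Rect]) -> Set[Tuple[int, int]]:
    """Expand rectangles into the set of grid nodes they cover."""
    return {
        (x, y)
        for x0, y0, x1, y1 in rects
        for x in range(x0, x1 + 1)
        for y in range(y0, y1 + 1)
    }
```

With this split, at most four rectangles are needed for any source position, so a full-device fanout would map to at most four rectangular pipes under these assumptions.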
Rename the PipeNet rendezvous pytest to test_pipenet_sync.py for a shorter and clearer filename. Document that aggregate multicast rendezvous removes semaphore growth with destination count but does not remove receiver DFB capacity requirements for overlapping all-to-all arrivals.
1ff998e to
e667ae5
Compare
This was referenced May 23, 2026
Record compiler-owned PipeNet resource requirements with module attrs for local semaphores, GlobalSemaphore ready counters, and SRAM address-table storage. Lower receiver posts through receiver-authored SRAM address tables so address publication no longer consumes semaphore ids, and use GlobalSemaphore-backed ready counters when source-local pipes exceed the local semaphore budget. Thread the resource plan through Python runtime allocation, update host-side PipeNet accounting, and add focused kernel-runner, simulator, MLIR, and hardware pytest coverage for global ready counters and aggregate ready-counting behavior. Document the current lowering model and Device 2.0 transition points for resource binding.
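As a rough illustration of the spill decision described above, the sketch below splits the ready counters of one source between local semaphores and GlobalSemaphore-backed counters. The function names, the index-based policy (the first `local_budget` pipes stay local, the rest spill), and the `local_budget` parameter are assumptions; the compiler's actual predicate and budget value are not shown in this PR text.

```python
from typing import Tuple


def ready_counter_spills(pipe_index: int, local_budget: int) -> bool:
    """Return True when the pipe at this source-local index would use a
    GlobalSemaphore-backed ready counter (assumed policy, for illustration)."""
    if pipe_index < 0 or local_budget < 0:
        raise ValueError("pipe_index and local_budget must be non-negative")
    return pipe_index >= local_budget


def count_ready_counters(num_source_local_pipes: int, local_budget: int) -> Tuple[int, int]:
    """Return (local_count, global_count) for one source under the assumed policy."""
    spilled = sum(
        ready_counter_spills(i, local_budget) for i in range(num_source_local_pipes)
    )
    return num_source_local_pipes - spilled, spilled
```

Using one predicate for both counts keeps the local and GlobalSemaphore totals consistent by construction, which is the property a shared spill predicate is meant to guarantee.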
Replace pre-TTKernel multicast classification with an explicit point-to-point vs collective transfer contract. The frontend now emits isCollective on ttl.create_pipe for slice-origin receiver sets, including degenerate one-receiver collectives, and PipeGraph/PipeLowering carry PipeTransferContract through resource planning instead of using hardware-oriented multicast terminology. Keep hardware multicast naming in TTKernel emission, where the physical NOC operation is selected independently from the semantic transfer contract.
Replace the cached-kernel GlobalSemaphore lifetime list instead of appending to it on every execution. This keeps the current call's semaphore objects alive without retaining stale semaphore objects across repeated kernel invocations. Add a Python-only runner test that executes a GlobalSemaphore-backed kernel twice and verifies the owner list remains bounded to the current allocation.
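The replace-instead-of-append behavior can be sketched in a few lines. `CachedKernelSketch`, its `semaphore_owners` attribute, and the use of plain `object()` instances in place of real GlobalSemaphore handles are hypothetical stand-ins, not the runner's actual API.

```python
from typing import List


class CachedKernelSketch:
    """Hypothetical stand-in for a cached kernel that must keep the current
    call's semaphore objects alive."""

    def __init__(self) -> None:
        self.semaphore_owners: List[object] = []

    def run(self, num_global_semaphores: int) -> List[object]:
        # Allocate fresh handles for this call; object() stands in for a
        # GlobalSemaphore allocation.
        current = [object() for _ in range(num_global_semaphores)]
        # Replace the owner list rather than extending it, so handles from
        # earlier invocations are released instead of accumulating.
        self.semaphore_owners = current
        return current
```

Running the sketch twice leaves `semaphore_owners` holding only the second call's handles, which mirrors the bounded-owner-list check described for the Python-only runner test.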
Advance tt-mlir to the fix that preserves noc_async_write_barrier after ttkernel.noc_inline_dw_write. Pipe receive posts rely on that barrier to publish receiver-authored address-table entries before incrementing sender-ready counters.
Use point-to-point and collective terminology for PipeNet transfer contracts before TTKernel lowering so semantic pipe contracts are not confused with hardware multicast lowering.

Add non-deprecated Python and C++ accessors for point-to-point/single-receiver and collective/multiple-receiver queries. Keep the old unicast/multicast accessors as deprecated compatibility aliases, while leaving TTKernel, profiler, and NOC hardware multicast terminology intact.

Update PipeNet docs, validation diagnostics, and frontend lowering comments to use the new semantic wording.

65b346d to
d2c5897
Compare
Split pipe resource, address-table, and receiver-address checks into non-mutating preflight records before TTKernel emission. This keeps pipe send, post, and wait conversion patterns from returning failure after creating partial IR. Move tensor accessor and DFB rank validation before tensor/DFB copy emission, and switch PipeGraph construction to a typed walk that interrupts on the first receiver validation failure.
Refine pipe resource plan helpers and validation so address storage, ready counters, and completion wait resources are represented explicitly. Add runtime argument count validation for compiler-emitted pipe resource plans. Expand MLIR and Python coverage for semaphore spill boundaries, collective metadata, and pipe runtime resource diagnostics.
Stage ttl.constants in the tt-lang-sim wheel because ttl._pipenets imports the shared hardware semaphore limit. This fixes the wheel smoke import failure for ttl.sim after adding the shared PipeNet constants module.
Document the single-receiver collective pipe contract in TableGen. Use a distinct PipeSourceKey type for source-local ready-counter allocation. Share the Python ready-counter spill predicate between local and GlobalSemaphore counts.
Factor pipe SRAM scratch, GlobalSemaphore, runtime argument, semaphore descriptor, and io_tensors setup into reusable kernel_runner helpers. Make emitted runners import those helpers instead of duplicating the pipe runtime body.
This was referenced May 24, 2026
Problem description
PR #614 makes PipeNet transfers correct by using receiver-posted destination DFB addresses, but the physical pipe synchronization layout still uses hardware semaphore ids for resources that do not need to be local semaphores. Address publication is data storage, not synchronization, and sender-ready counts need to scale with PipeNet pipe count rather than the local semaphore-id budget.
This matters because the per-pipe layout can exceed the TT hardware semaphore limit (#619). Uniform collective transfers also need a clear correctness boundary: until per-receiver destination addresses are represented (#617), collective receivers must publish equivalent destination DFB addresses.
What's changed
TLDR: This PR separates PipeNet address storage from synchronization, makes the compiler emit the physical pipe resource plan, and uses GlobalSemaphore-backed sender-ready counters when local semaphore ids are not sufficient. Receiver posts publish actual destination DFB addresses into source-core SRAM address tables, receivers increment a counted ready object, and senders consume the receiver-authored address after the ready count is satisfied.
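The post, count, and consume ordering can be modeled with a small single-threaded sketch. Everything here is illustrative: `SourceCoreSketch`, its field names, the slot numbering, and the counter reset after consumption are assumptions, and a Python dict stands in for the source-core SRAM address table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SourceCoreSketch:
    """Toy model of one source core: an address table (data storage) kept
    separate from a counted ready object (synchronization)."""

    num_receivers: int
    address_table: Dict[int, int] = field(default_factory=dict)
    ready_count: int = 0

    def receiver_post(self, slot: int, dfb_write_addr: int) -> None:
        # Step 1: the receiver publishes its destination DFB address.
        self.address_table[slot] = dfb_write_addr
        # Step 2: only afterwards does it increment the ready count. On
        # hardware this ordering depends on a write barrier; here it is
        # simply statement order.
        self.ready_count += 1

    def sender_ready(self) -> bool:
        return self.ready_count >= self.num_receivers

    def sender_consume(self) -> List[int]:
        # The sender reads addresses only after the ready count is satisfied.
        if not self.sender_ready():
            raise RuntimeError("ready count not satisfied")
        addrs = [self.address_table[s] for s in range(self.num_receivers)]
        # Resetting the counter for the next transfer is an assumption.
        self.ready_count -= self.num_receivers
        return addrs
```

In this model the address table needs no semaphore id at all, and one counter serves any number of receivers, which is the scaling property the PR is after.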
In addition, pre-TTKernel compiler abstractions now use point-to-point and collective terminology for semantic pipe transfer contracts, instead of using `unicast` and `multicast` for user-level semantics. A `Pipe` defines a transfer relation from one source participant to one or more receiver participants. That transfer relation has a contract independent of how it is lowered:
- `point_to_point`: exactly one source participant and exactly one receiver participant.
- `collective`: one source participant and a receiver set participating in one transfer contract.

The lowering decision, such as a point-to-point NOC write versus a hardware multicast write, is a TTKernel-level implementation detail that can depend on the transfer contract, receiver count, hardware capabilities, and later cost-model decisions.
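A minimal sketch of keeping the semantic contract separate from receiver count follows. `TransferContract`, `PipeSketch`, and the `is_collective` field are hypothetical names chosen to echo the wording above; they are not the frontend's real classes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

Coord = Tuple[int, int]


class TransferContract(Enum):
    POINT_TO_POINT = "point_to_point"
    COLLECTIVE = "collective"


@dataclass(frozen=True)
class PipeSketch:
    """Hypothetical pipe record: the contract comes from an explicit flag,
    not from counting receivers."""

    source: Coord
    receivers: Tuple[Coord, ...]
    is_collective: bool

    def __post_init__(self) -> None:
        if not self.receivers:
            raise ValueError("a pipe needs at least one receiver")
        if not self.is_collective and len(self.receivers) != 1:
            raise ValueError("point_to_point requires exactly one receiver")

    @property
    def contract(self) -> TransferContract:
        if self.is_collective:
            return TransferContract.COLLECTIVE
        return TransferContract.POINT_TO_POINT
```

Because the contract is carried by the flag, a collective that happens to cover one receiver stays `COLLECTIVE`, and the choice between a point-to-point NOC write and a hardware multicast write can be made later without changing it.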
Details
- [...] `ttl.create_pipe`, so a slice-origin collective transfer covering one receiver still uses collective ready-counting layout.
- [...] `unicast` and `multicast` for semantic pipe contracts, which made the user-level contract easy to confuse with hardware multicast lowering. Frontend and compiler wording now use point-to-point and collective transfer contracts; legacy `is_unicast`/`is_multicast` accessors remain as deprecated aliases, while TTKernel and profiler code keep hardware multicast terminology.
- [...] `ttkernel.noc_inline_dw_write`, which lowers to `noc_inline_dw_write<InlineWriteDst::L1>`.

Tests
Stacked PR sequence
Pipe compilation work required to match Blaze's scalable communication model: GlobalSemaphore-backed counting, explicit L1 address/state, DFB lifetime allocation, and batching where storage is limiting.
- [...] `ttl.copy` placement.

Fixes #620, #625.