[ttl] Allocate pipe resources by transfer liveness#626

Closed
brnorris03 wants to merge 61 commits into main from bnorris/pipe-transfer-liveness

@brnorris03
Contributor

No description provided.

phizalev-TT and others added 30 commits May 15, 2026 19:16
- Update tt-metal/tt-mlir versions and version checks.
- Create and validate the toolchain Python venv before tt-metal configure.
- Build tt-metal with source-local runtime roots and explicit firmware precompile.
- Install uplifted tt-metal runtime artifacts, descriptors, ttnn extensions, and precompiled firmware.
- Add build/install regression coverage.
Allocate pipe synchronization state from a shared runtime layout instead of deriving semaphore IDs directly from the PipeNet ID.

Receiver completion semaphores remain per PipeNet. Sender-ready and mailbox semaphores are allocated per pipe on each source core, so distinct pipes from the same source do not alias. Receive-post mailbox staging is allocated per NOC data-movement thread to avoid BRISC/NCRISC races before the remote SRAM write consumes the staged address.

Update lowering tests and PipeNet documentation to match the posted-receive protocol and the implemented semaphore layout.
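The allocation rule described above can be sketched as a small lookup table. This is a minimal illustration, not the compiler's data structure: the `SemaphoreLayout` class, its method names, and the single shared ID space are all assumptions, and the per-NOC-thread receive-post staging is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SemaphoreLayout:
    """Hypothetical model of the shared layout (names are illustrative only)."""

    next_id: int = 0
    completion: dict = field(default_factory=dict)    # pipenet_id -> id
    sender_ready: dict = field(default_factory=dict)  # (source_core, pipe_id) -> id
    mailbox: dict = field(default_factory=dict)       # (source_core, pipe_id) -> id

    def _lookup(self, table, key):
        # Hand out a fresh ID the first time a key is seen, then reuse it.
        if key not in table:
            table[key] = self.next_id
            self.next_id += 1
        return table[key]

    def completion_for(self, pipenet_id):
        # Receiver completion stays keyed by PipeNet.
        return self._lookup(self.completion, pipenet_id)

    def sender_ready_for(self, source_core, pipe_id):
        # Keyed per pipe on each source core, so two pipes never alias.
        return self._lookup(self.sender_ready, (source_core, pipe_id))

    def mailbox_for(self, source_core, pipe_id):
        return self._lookup(self.mailbox, (source_core, pipe_id))
```

Because the key includes both the source core and the pipe, two pipes leaving the same core receive different sender-ready and mailbox entries, which is the aliasing property the commit message calls out.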
Make pipe receive expansion validate before mutating IR, remove stale post-expansion receive-copy handling, and tighten internal pipe receive op verification.

Stabilize PipeGraph slot assignment by sorting on the complete pipe key, improve duplicate receiver diagnostics, and stop PipeNet guard verification after unknown PipeNet references.

Add regression coverage for internal receive op verifier diagnostics and update invalid guard expectations for the cleaner unknown-PipeNet diagnostic.
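Sorting on the complete key is what makes slot assignment independent of traversal order. A minimal sketch, assuming a pipe key is a tuple such as `(pipenet_id, source_core, destinations)`; the tuple shape and the `assign_slots` name are hypothetical and not taken from PipeGraph:

```python
def assign_slots(pipe_keys):
    """Map each distinct pipe key to a slot index in sorted-key order.

    Sorting on the whole tuple (not just a prefix) gives every pipe a
    total order, so the result does not depend on discovery order.
    """
    return {key: slot for slot, key in enumerate(sorted(set(pipe_keys)))}
```

If only a prefix of the key were compared, two pipes sharing that prefix could swap slots between runs; comparing the full tuple removes that freedom.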
- Reject non-uniform multicast receive addresses until per-destination addresses are supported
- Diagnose semaphore id over-allocation before TTKernel emission
- Add tests
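Both checks above are early rejections that run before any kernel code is emitted. The sketch below shows the general shape under stated assumptions: the function names, the error type, and the `limit` parameter are hypothetical, and the real hardware semaphore limit is not given in this conversation.

```python
class PipeValidationError(ValueError):
    """Hypothetical error type for rejected pipe configurations."""


def check_uniform_receive_address(addresses):
    # A multicast write lands at one address on every destination, so all
    # receivers must agree until per-destination addresses are supported.
    distinct = sorted(set(addresses))
    if len(distinct) > 1:
        raise PipeValidationError(
            f"non-uniform multicast receive addresses: {distinct}"
        )


def check_semaphore_budget(requested, limit):
    # Diagnose over-allocation up front instead of failing during emission.
    if requested > limit:
        raise PipeValidationError(
            f"pipe lowering needs {requested} semaphore ids, limit is {limit}"
        )
```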
brnorris03 added 29 commits May 23, 2026 08:20
- Rename aggregate rendezvous lookup helper to match ReceiverDFBInfo.

- Use dfbIndex and dfbType when constructing aggregate channel info.
Lower eligible uniform multicast pipes with a counted sender-ready rendezvous instead of per-pipe posted-address mailbox storage. Source-in-destination multicast derives the destination address from local receiver DFB state; safe non-loopback multicast uses a sender-local epoch counter plus static receiver slot metadata. Overlapping non-loopback multicast keeps the posted-address mailbox protocol because multiple reserve slots can be live.

Also validate multicast receiver DFB uniformity before lowering, preserve semantic multicast kind through create_pipe, align host semaphore counting with C++ lowering, and document the resource model. Adds device, sim, and lit coverage for loopback, degenerate, non-loopback, all-to-all, overlap fallback, and semaphore-limit cases.
Refactor PipeNet channel lowering so ready counting and address storage are represented separately. Source-in-destination uniform multicast now uses aggregate rendezvous by waiting on one sender-ready count and reading the local receiver DFB address, while non-loopback multicast stays on the receiver-posted mailbox protocol until explicit receiver-authored address tables exist.

Remove the non-loopback sender-local epoch reconstruction logic, since it inferred receiver DFB state from sender execution. Keep host-side semaphore accounting aligned with the implemented C++ resource model and restore deterministic PipeGraph slot assignment without reserve-slot metadata that is no longer used.

Update PipeNets documentation and focused tests to cover source-in-destination aggregate lowering, non-loopback posted-mailbox fallback, and semaphore-count expectations.

Validated with: cmake --build build; python -m pytest test/sim/test_operation_pipenets.py -q; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops.mlir; llvm-lit -v build/test/ttlang/Dialect/TTL/Transforms/convert_pipe_ops_invalid.mlir; Docker pytest test/python/pipe/test_pipenet_rendezvous.py -xvs -rxX.
- address storage carries receiver-authored DFB addresses;
- ready counting records how many receivers have posted a transfer;
- completion wait records when receiver-owned DFB storage contains the
  payload.
Update the tt-mlir submodule to the TTKernel change that allows remote_sram_write_u32 to use computed SRAM source addresses. This supports PipeNet address tables stored in ordinary SRAM instead of semaphore-backed mailbox words.
Uniform multicast now separates receiver-authored address storage from ready and completion synchronization. Receivers publish DFB write pointers to source-core SRAM address-table entries with TTKernel inline NOC writes, and senders consume those entries after the aggregate ready count instead of using semaphore-backed address mailbox words.

Add hidden L1 scratch allocation and common-runtime-arg plumbing for the address tables, update host semaphore counting to match the compiler layout, and refresh MLIR, simulator, and hardware pytest coverage for non-loopback multicast and semaphore scaling.
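The ordering in this protocol (publish the address, then count the receiver as ready, then let the sender consume) can be modelled in plain Python. This is a single-threaded illustration only: `SourceCoreState`, `receiver_post`, and `sender_try_send` are hypothetical names, a dict stands in for receiver memory, and no NOC writes or barriers are modelled.

```python
class SourceCoreState:
    """Hypothetical model of per-pipe state held on the source core."""

    def __init__(self, num_receivers):
        self.address_table = [None] * num_receivers  # receiver-authored entries
        self.ready_count = 0                         # aggregate ready counter


def receiver_post(state, receiver_index, dfb_write_ptr):
    # Publish the destination address first, then signal readiness.
    state.address_table[receiver_index] = dfb_write_ptr
    state.ready_count += 1


def sender_try_send(state, payload, memory):
    # The sender may read the table only after every receiver has posted.
    if state.ready_count < len(state.address_table):
        return False
    for address in state.address_table:
        memory[address] = payload
    state.ready_count = 0  # reset for the next transfer
    return True
```

The counter says how many receivers are ready, while the table says where each one wants the data; keeping them separate is what lets address publication stop consuming semaphore-backed words.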
Parameterize the backend-neutral fanout semaphore test over several recipient counts, including 50 recipients, to verify that a single multicast pipe keeps constant semaphore usage as destination count grows.

Replace the fixed-row hardware fanout test with a grid=full variant that checks one receiver, a small fanout, and all device nodes except the source. The full-device case decomposes the all-but-source region into rectangular multicast pipes while still exercising receiver-authored SRAM address publication.
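One way to split an all-but-source region into rectangles is to take the rows above the source, the rows below it, and the two remaining strips in the source's own row. The sketch below shows that decomposition; it is an assumption about how such a split could be done, not the decomposition the test itself uses, and `all_but_source_rectangles` is a hypothetical name.

```python
def all_but_source_rectangles(rows, cols, source):
    """Cover every cell of a rows x cols grid except `source`.

    Returns at most four rectangles as ((r0, c0), (r1, c1)) with
    inclusive corners. The rectangles do not overlap.
    """
    sr, sc = source
    if not (0 <= sr < rows and 0 <= sc < cols):
        raise ValueError("source lies outside the grid")
    rects = []
    if sr > 0:                      # full-width block above the source row
        rects.append(((0, 0), (sr - 1, cols - 1)))
    if sr < rows - 1:               # full-width block below the source row
        rects.append(((sr + 1, 0), (rows - 1, cols - 1)))
    if sc > 0:                      # strip left of the source, same row
        rects.append(((sr, 0), (sr, sc - 1)))
    if sc < cols - 1:               # strip right of the source, same row
        rects.append(((sr, sc + 1), (sr, cols - 1)))
    return rects
```

Each rectangle would then correspond to one multicast pipe, so a source in the interior of the grid needs four pipes and a corner source needs two.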
Rename the PipeNet rendezvous pytest to test_pipenet_sync.py for a shorter and clearer filename.

Document that aggregate multicast rendezvous removes semaphore growth with destination count but does not remove receiver DFB capacity requirements for overlapping all-to-all arrivals.
Record compiler-owned PipeNet resource requirements with module attrs for local semaphores, GlobalSemaphore ready counters, and SRAM address-table storage.

Lower receiver posts through receiver-authored SRAM address tables so address publication no longer consumes semaphore ids, and use GlobalSemaphore-backed ready counters when source-local pipes exceed the local semaphore budget.

Thread the resource plan through Python runtime allocation, update host-side PipeNet accounting, and add focused kernel-runner, simulator, MLIR, and hardware pytest coverage for global ready counters and aggregate ready-counting behavior.

Document the current lowering model and Device 2.0 transition points for resource binding.
Replace pre-TTKernel multicast classification with an explicit point-to-point vs collective transfer contract. The frontend now emits isCollective on ttl.create_pipe for slice-origin receiver sets, including degenerate one-receiver collectives, and PipeGraph/PipeLowering carry PipeTransferContract through resource planning instead of using hardware-oriented multicast terminology.

Keep hardware multicast naming in TTKernel emission, where the physical NOC operation is selected independently from the semantic transfer contract.
Replace the cached-kernel GlobalSemaphore lifetime list instead of appending to it on every execution. This keeps the current call's semaphore objects alive without retaining stale semaphore objects across repeated kernel invocations.

Add a Python-only runner test that executes a GlobalSemaphore-backed kernel twice and verifies the owner list remains bounded to the current allocation.
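The difference between replacing and appending the lifetime list is easy to show in isolation. In this sketch `CachedKernel`, `semaphore_owners`, and the `allocate` callback are hypothetical stand-ins for the runner's cached-kernel object and its allocation step.

```python
class CachedKernel:
    """Hypothetical cached kernel that keeps semaphore objects alive."""

    def __init__(self):
        self.semaphore_owners = []

    def run(self, allocate):
        semaphores = allocate()
        # Replace rather than extend: only the current call's objects stay
        # referenced, so repeated runs do not accumulate stale semaphores.
        self.semaphore_owners = list(semaphores)
        return semaphores
```

With `extend` in place of the assignment, the list would grow by one allocation per invocation and keep every earlier semaphore object reachable.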
Advance tt-mlir to the fix that preserves noc_async_write_barrier after ttkernel.noc_inline_dw_write. Pipe receive posts rely on that barrier to publish receiver-authored address-table entries before incrementing sender-ready counters.
Use point-to-point and collective terminology for PipeNet transfer contracts before TTKernel lowering so semantic pipe contracts are not confused with hardware multicast lowering.

Add non-deprecated Python and C++ accessors for point-to-point/single-receiver and collective/multiple-receiver queries. Keep the old unicast/multicast accessors as deprecated compatibility aliases, while leaving TTKernel, profiler, and NOC hardware multicast terminology intact.

Update PipeNet docs, validation diagnostics, and frontend lowering comments to use the new semantic wording.
Split pipe resource, address-table, and receiver-address checks into non-mutating preflight records before TTKernel emission. This keeps pipe send, post, and wait conversion patterns from returning failure after creating partial IR.

Move tensor accessor and DFB rank validation before tensor/DFB copy emission, and switch PipeGraph construction to a typed walk that interrupts on the first receiver validation failure.
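The preflight idea is a two-phase loop: run every check first, and emit only once all checks have passed. A minimal sketch, with `convert_all`, `validate`, and `emit` as hypothetical callbacks rather than the actual conversion patterns:

```python
def convert_all(ops, validate, emit):
    """Validate every op before emitting anything.

    `validate(op)` returns None on success or an error message.
    Returns (results, None) on success and (None, message) on failure.
    """
    checked = []
    for op in ops:
        error = validate(op)
        if error is not None:
            # Nothing has been emitted yet, so no partial output exists.
            return None, error
        checked.append(op)
    # Every op passed; emission can no longer fail on validation grounds.
    return [emit(op) for op in checked], None
```

Interleaving validation and emission in one loop would leave the output of earlier ops behind when a later op fails, which is the partial-IR situation the commit avoids.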
Refine pipe resource plan helpers and validation so address storage, ready counters, and completion wait resources are represented explicitly.

Add runtime argument count validation for compiler-emitted pipe resource plans.

Expand MLIR and Python coverage for semaphore spill boundaries, collective metadata, and pipe runtime resource diagnostics.
Stage ttl.constants in the tt-lang-sim wheel because ttl._pipenets imports the shared hardware semaphore limit. This fixes the wheel smoke import failure for ttl.sim after adding the shared PipeNet constants module.
Document the single-receiver collective pipe contract in TableGen.

Use a distinct PipeSourceKey type for source-local ready-counter allocation.

Share the Python ready-counter spill predicate between local and GlobalSemaphore counts.
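Sharing one predicate keeps the local and GlobalSemaphore counts from disagreeing about when a spill happens. The sketch below assumes an all-or-nothing policy (every ready counter of a source moves to GlobalSemaphores once the local budget is exceeded); that policy, the function names, and the `local_budget` parameter are assumptions, not the documented behaviour.

```python
def ready_counters_spill(num_pipes, local_budget):
    # Single source of truth for the spill decision (assumed policy).
    return num_pipes > local_budget


def local_ready_count(num_pipes, local_budget):
    # Local semaphores are used only while the pipes fit the budget.
    return 0 if ready_counters_spill(num_pipes, local_budget) else num_pipes


def global_ready_count(num_pipes, local_budget):
    # GlobalSemaphore-backed counters take over once the budget is exceeded.
    return num_pipes if ready_counters_spill(num_pipes, local_budget) else 0
```

Because both counts call the same predicate, exactly one of them is non-zero for any non-empty set of pipes, and their sum always equals the number of pipes.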
Factor pipe SRAM scratch, GlobalSemaphore, runtime argument, semaphore descriptor, and io_tensors setup into reusable kernel_runner helpers.

Make emitted runners import those helpers instead of duplicating the pipe runtime body.
Use Pipe Transfer IR intervals to reuse sender-ready counters and source-core address-table slots when transfer lifetimes do not overlap. Keep receiver completion per PipeNet.

Add focused MLIR checks and a Python lit runtime reproducer for the dual-route PipeNet case.
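Reusing a resource across non-overlapping lifetimes is an interval-scheduling problem, and a linear scan over intervals sorted by start is one common way to solve it. The sketch below is a generic illustration under stated assumptions: intervals are half-open `(start, end)` pairs, and `allocate_slots` is a hypothetical name, not the pass's actual algorithm.

```python
import heapq


def allocate_slots(intervals):
    """Assign a slot to each transfer, reusing slots whose lifetime ended.

    `intervals` maps a transfer name to a half-open (start, end) pair.
    Returns (assignment, number_of_slots_used).
    """
    assignment = {}
    active = []     # min-heap of (end, slot) for live transfers
    free = []       # min-heap of released slot indices
    next_slot = 0
    ordered = sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[1][1], kv[0]))
    for name, (start, end) in ordered:
        if end <= start:
            raise ValueError(f"empty interval for {name!r}")
        # Release every slot whose transfer finished before this one starts.
        while active and active[0][0] <= start:
            _, slot = heapq.heappop(active)
            heapq.heappush(free, slot)
        if free:
            slot = heapq.heappop(free)
        else:
            slot = next_slot
            next_slot += 1
        assignment[name] = slot
        heapq.heappush(active, (end, slot))
    return assignment, next_slot
```

For example, transfers live over `(0, 2)`, `(1, 3)`, and `(2, 4)` need only two slots, because the third starts exactly when the first ends. Receiver completion would stay outside this scheme, since the commit keeps it per PipeNet.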
@brnorris03 brnorris03 closed this May 25, 2026

2 participants