[fix][ttl] Materialize tensor loop state before compute lowering (#527) #540
Draft
brnorris03 wants to merge 7 commits into
…rgs before compute conversion. Preserve additive recurrences by lowering them to the existing accumulate-store form, and materialize other tensor recurrences through compiler-allocated DFB state.
brnorris03
commented
Apr 30, 2026
Code just moved into this common file from TTLInsertIntermediateDFBs.cpp, no changes
brnorris03
added a commit
that referenced
this pull request
May 13, 2026
- New pass ttl-verify-dfb-spsc (lib/Dialect/TTL/Transforms/TTLVerifyDFBSPSC.cpp): module-level walk that groups cb_reserve/cb_wait ops by cb_index + enclosing ttl.kernel_thread func, rejects any DFB with >1 producer or >1 consumer thread.
- TableGen pass record in Passes.td; build entry in Transforms/CMakeLists.txt.
- Shared helpers: kKernelThreadAttrName constant (TTL.h), getEnclosingKernelThread (TTLOpsUtils.h).
- Pass wired into three pipeline definitions: TTLPipelines.cpp, python/ttl/ttl_api.py:1451, test/me2e/builder/pipeline.py:66.
- SPSC refactor in test/python/pipe/test_pipenet_multi_iter.py: split partial_cb/sum_cb into per-consumer copies, dropped xfail on test_gather_bcast_multi_iter, added TODO referencing PR #540.
- SPSC refactor in test/python/test_store_patterns.py::store_then_forward_kernel: split main_dfb into main_for_compute_dfb + main_for_write_dfb.
- SPSC refactor in _bugs/repro_574.py: matches the test refactor.
- New lit tests: verify_dfb_spsc.mlir (positive cases), verify_dfb_spsc_invalid.mlir (negative cases).
- Docs: new "Single-producer Single-consumer Semantics" section in docs/development/DFBManagement.md (Contract / Violation / Correct form / Verification).
- Filed tt-lang issue #580 for pipeline-definition consolidation.
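The single-producer single-consumer check described in this commit message can be sketched in plain Python. This is only an illustration of the grouping idea, not the pass itself: the tuple representation of ops, the thread names, and the function `find_spsc_violations` are hypothetical, and treating `cb_reserve` as the producer side and `cb_wait` as the consumer side is an assumption.

```python
from collections import defaultdict


def find_spsc_violations(ops):
    """Return (cb_index, role, threads) for every CB with >1 producer or consumer thread.

    `ops` is a hypothetical flattened view of the module: tuples of
    (op_name, cb_index, enclosing_thread).
    """
    producers = defaultdict(set)  # cb_index -> threads that reserve (assumed producers)
    consumers = defaultdict(set)  # cb_index -> threads that wait (assumed consumers)
    for op_name, cb_index, thread in ops:
        if op_name == "cb_reserve":
            producers[cb_index].add(thread)
        elif op_name == "cb_wait":
            consumers[cb_index].add(thread)

    violations = []
    for cb_index in sorted(set(producers) | set(consumers)):
        if len(producers[cb_index]) > 1:
            violations.append((cb_index, "producer", sorted(producers[cb_index])))
        if len(consumers[cb_index]) > 1:
            violations.append((cb_index, "consumer", sorted(consumers[cb_index])))
    return violations


# CB 0 is consumed by two threads, which the contract forbids; CB 1 is fine.
example_ops = [
    ("cb_reserve", 0, "reader"),
    ("cb_wait", 0, "compute"),
    ("cb_wait", 0, "writer"),
    ("cb_reserve", 1, "compute"),
    ("cb_wait", 1, "writer"),
]
print(find_spsc_violations(example_ops))
```

Splitting CB 0 into one copy per consumer, as the test refactors in this commit do for `main_dfb`, would make the example pass the check.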
- Drop dead FailureOr from materializeToDFB; cleanup caller.
- Restore block_count=2 and BindCBOp-at-function-entry rationale on the shared DFB helpers; document them as contracts in the header.
- Replace defensive parent-null branches in TTLMaterializeLoopState with asserts; add accumulator-recognition and next-state-store comments.
- Extend Python AST self-rebind / if-carried detection to support tuple and starred targets and drop the RankedTensorType filter so scalar iter args carry too; factor a shared _ScopedCollector base.
- Fix TTKernelInsertL1Accumulation::precededByNonAccumulatingPack so a cb_reserve_back or cb_push_back for a CB already covered by a closer pack no longer aborts the scan; multiple accumulators in one loop now enable L1 acc correctly.
- Hardware tests: N_ITERS bumped from 1 to 3, dtype-parametrized (bf16, f32), new tuple-target test, conditional-rebind test added as strict xfail referencing #587.
- Lit: add multi-use-add fallback case; tighten binary_recurrence with captured SSA + CHECK-NEXT.
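To illustrate what self-rebind detection over tuple and starred targets might look like, here is a simplified sketch using Python's standard `ast` module. It is not the frontend's actual logic: the helper names `_target_names` and `self_rebound_names` are hypothetical, and the rule used (a name is carried if one top-level assignment in the loop body both writes and reads it) is a deliberate simplification.

```python
import ast


def _target_names(target):
    """Collect the names bound by an assignment target, including tuple and starred forms."""
    if isinstance(target, ast.Name):
        return {target.id}
    if isinstance(target, (ast.Tuple, ast.List)):
        names = set()
        for elt in target.elts:
            names |= _target_names(elt)
        return names
    if isinstance(target, ast.Starred):
        return _target_names(target.value)
    return set()


def self_rebound_names(loop_source):
    """Return names that a top-level assignment in the loop body both reads and rebinds."""
    loop = ast.parse(loop_source).body[0]  # assumes the source is a single `for` loop
    carried = set()
    for stmt in loop.body:
        if not isinstance(stmt, ast.Assign):
            continue
        read = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        for target in stmt.targets:
            carried |= _target_names(target) & read
    return carried


print(sorted(self_rebound_names("for i in range(3):\n    a, b = a + d, b + d\n    t = d * 2\n")))
# prints ['a', 'b']
```

In this sketch `t` is not carried because its right-hand side never reads `t`; a real implementation would also need to account for reads and writes spread across several statements.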
- `+=` on a CB-attached block (`out_blk = cb.reserve()`): emit an L1 accumulating `ttl.store` via the registered `__iadd__` method. Guarantees L1 accumulation at lowering time.
- Any other case on a tensor target (non-block target, or non-`Add` op on a block): rewrite `target op= value` to `target = target op value` and visit, so the result flows through `ttl.add` / `ttl.sub` / ... and (when inside a loop) `ttl-materialize-loop-state`. L1 acc is then used opportunistically when the surrounding pattern matches the accumulator preconditions.
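The `target op= value` to `target = target op value` rewrite in the second case can be sketched with the standard `ast` module. This is an illustration only: the class `ExpandAugAssign` and the function `expand` are hypothetical names, the sketch rewrites every augmented assignment on a plain name, and it does not model the CB-attached `+=` case that stays on the `__iadd__` path.

```python
import ast


class ExpandAugAssign(ast.NodeTransformer):
    """Rewrite `target op= value` into `target = target op value` for plain names."""

    def visit_AugAssign(self, node):
        # Attribute and subscript targets are left untouched in this sketch.
        if not isinstance(node.target, ast.Name):
            return node
        new_node = ast.Assign(
            targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
            value=ast.BinOp(
                left=ast.Name(id=node.target.id, ctx=ast.Load()),
                op=node.op,
                right=node.value,
            ),
        )
        return ast.copy_location(new_node, node)


def expand(source):
    """Parse `source`, expand augmented assignments, and return the new source text."""
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or newer


print(expand("acc += x"))    # prints: acc = acc + x
print(expand("s -= y * 2"))  # prints: s = s - y * 2
```

After the rewrite, the statement has the same shape as an explicit self-rebind, which is why both forms can share one lowering path.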
Problem description
Python kernels can express tensor recurrences by rebinding a tensor variable to an expression that reads its previous value:
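A minimal plain-Python sketch of such recurrences, with floats standing in for tensors. The function name `run_recurrences` and the choice of operations are illustrative assumptions, not code from this PR or the tt-lang API:

```python
def run_recurrences(xs, init):
    """Run one additive and one non-additive self-rebinding recurrence over `xs`."""
    acc = init    # additive state: rebinds to an expression that reads the previous acc
    scale = init  # non-additive state: rebinds through a multiply instead of an add
    for x in xs:
        acc = acc + x
        scale = scale * x
    # The values visible after the loop are the final recurrence values.
    return acc, scale


print(run_recurrences([1.0, 2.0, 3.0], 1.0))  # prints (7.0, 6.0)
```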
The value after the loop must be the final recurrence value. Before this change, the compiler did not consistently represent and lower that tensor state through control flow, so compute lowering could not consume loops that produced tensor results.
Additive recurrences also have a performance requirement. They should preserve the L1 accumulation strategy that `+=` already uses on a reserved block, instead of storing every intermediate accumulator value in a compiler-allocated DFB. After this PR, `acc = acc + x` and `out_blk += x` lower to the same in-loop accumulating `ttl.store`; `+=` is now a special case of self-rebinding.
What's changed
- `ttl-materialize-loop-state`, a TTL transform that removes ranked-tensor `scf.for` iter args before compute lowering. Additive recurrences of the form `ttl.add(%iter_arg, %contribution)` (and the commuted form) lower to a pre-loop initial store and an in-loop accumulating `ttl.store`; non-additive recurrences lower through a compiler-allocated DFB state slot.
- Python frontend: `scf.for` iter args for self-rebinding loop variables (including tuple targets like `a, b = a + d, b + d`) and resultful `scf.if` for tensor variables assigned in branches.
- Shared DFB helpers, so `ttl-insert-intermediate-dfbs` and `ttl-materialize-loop-state` use the same allocation convention.
- Fix `TTKernelInsertL1Accumulation::precededByNonAccumulatingPack` so a `cb_reserve_back` or `cb_push_back` for a CB already covered by a closer pack no longer aborts the scan. Loops that accumulate into multiple CBs now detect their pre-loop init packs and reconfig L1 acc as "enable" before the loop.
- The conditional-rebind test is a strict `xfail` end-to-end: `ttl-assign-dst` does not yet descend into nested regions, tracked as "Extend ttl-assign-dst to cover tile ops nested in scf.if / scf.for regions" (#587).
Fixes #527
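The two lowering strategies described under "What's changed" can be modeled in plain Python to show why they preserve the recurrence value. This is a toy model under stated assumptions: the `Slot` class is a hypothetical stand-in for a reserved block or DFB state slot, its `store` method is not the `ttl.store` API, and floats stand in for tensors.

```python
class Slot:
    """Toy storage slot; `accumulate=True` adds into the stored value instead of overwriting."""

    def __init__(self):
        self.value = None

    def store(self, v, accumulate=False):
        self.value = (self.value + v) if accumulate else v


def reference(init, contributions):
    """The source-level recurrence: acc = acc + x."""
    acc = init
    for x in contributions:
        acc = acc + x
    return acc


def lowered_additive(init, contributions):
    """Additive form: one pre-loop initial store, then in-loop accumulating stores."""
    out = Slot()
    out.store(init)
    for x in contributions:
        out.store(x, accumulate=True)
    return out.value


def lowered_generic(init, contributions, step):
    """Non-additive form: read the carried state, compute, store the next state."""
    state = Slot()
    state.store(init)
    for x in contributions:
        previous = state.value
        state.store(step(previous, x))
    return state.value


print(reference(2.0, [3.0, 4.0]), lowered_additive(2.0, [3.0, 4.0]))  # prints 9.0 9.0
print(lowered_generic(2.0, [3.0, 4.0], lambda a, x: a * x))           # prints 24.0
```

In the additive form the intermediate accumulator values are never read back by the loop body, which is the property the PR relies on to keep L1 accumulation; the generic form has to round-trip the state through the slot on every iteration.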