[fix][ttl] Materialize tensor loop state before compute lowering (#527) by brnorris03 · Pull Request #540 · tenstorrent/tt-lang

brnorris03 · 2026-04-29T04:24:30Z

Problem description

Python kernels can express tensor recurrences by rebinding a tensor variable to an expression that reads its previous value:

acc = acc + x

The value after the loop must be the final recurrence value. Before this change, the compiler did not consistently represent and lower that tensor state through control flow, so compute lowering could not consume loops that produced tensor results.

Additive recurrences also have a performance requirement. They should preserve the L1 accumulation strategy that += already uses on a reserved block, instead of storing every intermediate accumulator value in a compiler-allocated DFB. After this PR, acc = acc + x and out_blk += x lower to the same in-loop accumulating ttl.store; += is now a special case of self-rebinding.
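As a rough illustration of the semantics described above, the following plain-Python model compares the self-rebinding form with the in-place `+=` form. The `Block` class and both helper functions are hypothetical stand-ins invented for this sketch; they are not tt-lang types or APIs, and nothing here models L1 accumulation or DFB allocation.

```python
class Block:
    """Hypothetical stand-in for a tensor block (not a tt-lang type)."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Out-of-place add: returns a new Block, as in `acc = acc + x`.
        return Block(a + b for a, b in zip(self.values, other.values))

    def __iadd__(self, other):
        # In-place add: mutates this Block, as in `out_blk += x`.
        for i, b in enumerate(other.values):
            self.values[i] += b
        return self


def rebind_sum(init, contributions):
    """Self-rebinding recurrence: the name is rebound on every iteration."""
    acc = init
    for x in contributions:
        acc = acc + x
    return acc  # value after the loop is the final recurrence value


def inplace_sum(init, contributions):
    """In-place accumulation into a copy of the initial block."""
    out_blk = Block(init.values)
    for x in contributions:
        out_blk += x
    return out_blk
```

Both forms are expected to agree on the final value, and a zero-trip loop should leave the initial value unchanged, which is the behavior the zero-trip tests in this PR are described as covering.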

What's changed

- Added ttl-materialize-loop-state, a TTL transform that removes ranked-tensor scf.for iter args before compute lowering. Additive recurrences of the form ttl.add(%iter_arg, %contribution) (and the commuted form) lower to a pre-loop initial store and an in-loop accumulating ttl.store; non-additive recurrences lower through a compiler-allocated DFB state slot.
- Extended Python AST lowering to emit scf.for iter args for self-rebinding loop variables (including tuple targets like a, b = a + d, b + d) and resultful scf.if for tensor variables assigned in branches.
- Factored shared compiler-allocated DFB materialization helpers so ttl-insert-intermediate-dfbs and ttl-materialize-loop-state use the same allocation convention.
- Fixed TTKernelInsertL1Accumulation::precededByNonAccumulatingPack so a cb_reserve_back or cb_push_back for a CB already covered by a closer pack no longer aborts the scan. Loops that accumulate into multiple CBs now detect their pre-loop init packs and reconfig L1 acc as "enable" before the loop.
- Added lit coverage for additive, commuted additive, unary and binary recurrences, mixed tensor states, scalar iter args, multiple final-result users, multi-use add fallback, conditional recurrence, and zero-trip loop semantics. Added device coverage (dtype-parametrized bf16/f32) for additive, non-additive, tuple-target, three-accumulator, multi-tile-block, and zero-trip recurrences.
- Conditional tensor rebinding inside a loop is correct at the materialize-loop-state level but is left as xfail end-to-end: ttl-assign-dst does not yet descend into nested regions. This is tracked in #587 (Extend ttl-assign-dst to cover tile ops nested in scf.if / scf.for regions).
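The self-rebinding detection mentioned in the list above can be sketched with Python's standard `ast` module. This is a minimal illustration only: the function name `self_rebound_names` is hypothetical, it inspects only top-level statements of the loop body, and it ignores starred targets and nested regions, so it should not be read as the PR's actual implementation.

```python
import ast


def self_rebound_names(loop: ast.For) -> list:
    """Return names that a loop body assigns from an expression reading the same name.

    Hypothetical helper: handles `x = f(x)` and element-wise tuple
    assignments such as `a, b = a + d, b + d`.
    """
    found = []
    for stmt in loop.body:
        if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
            continue
        target, value = stmt.targets[0], stmt.value
        if (
            isinstance(target, ast.Tuple)
            and isinstance(value, ast.Tuple)
            and len(target.elts) == len(value.elts)
        ):
            # Pair each tuple element with its matching right-hand side.
            pairs = list(zip(target.elts, value.elts))
        else:
            pairs = [(target, value)]
        for tgt, val in pairs:
            if not isinstance(tgt, ast.Name):
                continue
            reads = {n.id for n in ast.walk(val) if isinstance(n, ast.Name)}
            if tgt.id in reads and tgt.id not in found:
                found.append(tgt.id)
    return found


source = (
    "for i in range(3):\n"
    "    a, b = a + d, b + d\n"
    "    c = d * 2\n"
    "    acc = acc + x\n"
)
loop_node = ast.parse(source).body[0]
carried = self_rebound_names(loop_node)
```

In this model, each name in `carried` corresponds to a value that would need to be threaded through the loop as loop-carried state; `c` is excluded because its right-hand side does not read `c`.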

Fixes #527

brnorris03 · 2026-04-30T01:21:55Z

Code just moved into this common file from TTLInsertIntermediateDFBs.cpp, no changes

- New pass ttl-verify-dfb-spsc (lib/Dialect/TTL/Transforms/TTLVerifyDFBSPSC.cpp): module-level walk that groups cb_reserve/cb_wait ops by cb_index + enclosing ttl.kernel_thread func, rejects any DFB with >1 producer or >1 consumer thread.
- TableGen pass record in Passes.td; build entry in Transforms/CMakeLists.txt.
- Shared helpers: kKernelThreadAttrName constant (TTL.h), getEnclosingKernelThread (TTLOpsUtils.h).
- Pass wired into three pipeline definitions: TTLPipelines.cpp, python/ttl/ttl_api.py:1451, test/me2e/builder/pipeline.py:66.
- SPSC refactor in test/python/pipe/test_pipenet_multi_iter.py: split partial_cb/sum_cb into per-consumer copies, dropped xfail on test_gather_bcast_multi_iter, added TODO referencing PR #540.
- SPSC refactor in test/python/test_store_patterns.py::store_then_forward_kernel: split main_dfb into main_for_compute_dfb + main_for_write_dfb.
- SPSC refactor in _bugs/repro_574.py: matches the test refactor.
- New lit tests: verify_dfb_spsc.mlir (positive cases), verify_dfb_spsc_invalid.mlir (negative cases).
- Docs: new "Single-producer Single-consumer Semantics" section in docs/development/DFBManagement.md (Contract / Violation / Correct form / Verification).
- Filed tt-lang issue #580 for pipeline-definition consolidation.
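The single-producer single-consumer check described for ttl-verify-dfb-spsc can be modeled in plain Python as a grouping problem. The sketch below is an assumption-laden illustration: the `(kind, cb_index, thread)` tuple encoding and the function name `find_spsc_violations` are invented here, and it assumes that a reserve marks a producer thread and a wait marks a consumer thread. The real pass walks MLIR ops, which this does not attempt.

```python
from collections import defaultdict


def find_spsc_violations(ops):
    """Report DFBs used by more than one producer or consumer thread.

    `ops` is an iterable of (kind, cb_index, thread) tuples, where kind is
    "reserve" (assumed producer side) or "wait" (assumed consumer side).
    This encoding is hypothetical, chosen only for the sketch.
    """
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for kind, cb_index, thread in ops:
        if kind == "reserve":
            producers[cb_index].add(thread)
        elif kind == "wait":
            consumers[cb_index].add(thread)
        else:
            raise ValueError(f"unknown op kind: {kind!r}")

    violations = []
    for cb_index in sorted(set(producers) | set(consumers)):
        if len(producers[cb_index]) > 1:
            violations.append((cb_index, "producer", sorted(producers[cb_index])))
        if len(consumers[cb_index]) > 1:
            violations.append((cb_index, "consumer", sorted(consumers[cb_index])))
    return violations


# One producer, two consumer threads on DFB 0: the consumer side is rejected.
shared = [("reserve", 0, "compute"), ("wait", 0, "writer"), ("wait", 0, "reader")]
# Splitting into per-consumer DFBs, as the test refactors above do, passes.
split = [
    ("reserve", 0, "compute"), ("wait", 0, "writer"),
    ("reserve", 1, "compute"), ("wait", 1, "reader"),
]
```

The `split` case mirrors the refactors listed above, where a shared DFB is replaced by one copy per consumer.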

- Drop dead FailureOr from materializeToDFB; cleanup caller.
- Restore block_count=2 and BindCBOp-at-function-entry rationale on the shared DFB helpers; document them as contracts in the header.
- Replace defensive parent-null branches in TTLMaterializeLoopState with asserts; add accumulator-recognition and next-state-store comments.
- Extend Python AST self-rebind / if-carried detection to support tuple and starred targets and drop the RankedTensorType filter so scalar iter args carry too; factor a shared _ScopedCollector base.
- Fix TTKernelInsertL1Accumulation::precededByNonAccumulatingPack so a cb_reserve_back or cb_push_back for a CB already covered by a closer pack no longer aborts the scan; multiple accumulators in one loop now enable L1 acc correctly.
- Hardware tests: N_ITERS bumped from 1 to 3, dtype-parametrized (bf16, f32), new tuple-target test, conditional-rebind test added as strict xfail referencing #587.
- Lit: add multi-use-add fallback case; tighten binary_recurrence with captured SSA + CHECK-NEXT.

- `+=` on a CB-attached block (`out_blk = cb.reserve()`): emit an L1 accumulating `ttl.store` via the registered `__iadd__` method. Guarantees L1 accumulation at lowering time.
- Any other case on a tensor target (non-block target, or non-`Add` op on a block): rewrite `target op= value` to `target = target op value` and visit, so the result flows through `ttl.add` / `ttl.sub` / ... and (when inside a loop) `ttl-materialize-loop-state`. L1 acc is then used opportunistically when the surrounding pattern matches the accumulator preconditions.
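The second case above, rewriting `target op= value` to `target = target op value`, can be sketched with the standard `ast` module. The class `ExpandAugAssign` and helper `expand` are hypothetical names for this illustration; the sketch rewrites every augmented assignment unconditionally and does not model the block-target `__iadd__` path or any tt-lang lowering.

```python
import ast
import copy


class ExpandAugAssign(ast.NodeTransformer):
    """Rewrite `target op= value` into `target = target op value` (sketch)."""

    def visit_AugAssign(self, node):
        self.generic_visit(node)
        # The original target is in Store context; the copy that is read
        # on the right-hand side needs Load context.
        read_target = copy.deepcopy(node.target)
        read_target.ctx = ast.Load()
        assign = ast.Assign(
            targets=[node.target],
            value=ast.BinOp(left=read_target, op=node.op, right=node.value),
        )
        return ast.copy_location(assign, node)


def expand(source: str) -> str:
    """Return `source` with augmented assignments expanded (Python 3.9+)."""
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

For example, `expand("acc -= x")` yields `acc = acc - x`. Note that for subscript or attribute targets this naive expansion evaluates the target expression twice, which a real lowering would need to account for.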

Add ttl-materialize-loop-state to remove ranked-tensor scf.for iter a…

a6b339d

…rgs before compute conversion. Preserve additive recurrences by lowering them to the existing accumulate-store form, and materialize other tensor recurrences through compiler-allocated DFB state.

brnorris03 mentioned this pull request May 8, 2026

[fix][ttl] ttl-insert-cb-sync: Allow tensor SSA uses past the next-acquire boundary (#536) #554

Merged

brnorris03 added 2 commits May 13, 2026 16:18

Merge remote-tracking branch 'origin/main' into bnorris/fix-527

dcdddb2

remove xfail fixed by this pr

1114a24

brnorris03 mentioned this pull request May 14, 2026

Extend ttl-assign-dst to cover tile ops nested in scf.if / scf.for regions #587

Open

brnorris03 added 4 commits May 13, 2026 19:08

fix error message; add more tests

7714183

add more tests

cd64e8a

brnorris03 mentioned this pull request May 15, 2026

Element Read/Write #575

Closed

1 task

brnorris03 wants to merge 7 commits into main from bnorris/fix-527
