[fix][ttl] Materialize tensor loop state before compute lowering (#527) #540

Draft
brnorris03 wants to merge 7 commits into main from bnorris/fix-527

brnorris03 commented Apr 29, 2026

Problem description

Python kernels can express tensor recurrences by rebinding a tensor variable to an expression that reads its previous value:

acc = acc + x

The value after the loop must be the final recurrence value. Before this change, the compiler did not consistently represent and lower that tensor state through control flow, so compute lowering could not consume loops that produced tensor results.

Additive recurrences also have a performance requirement. They should preserve the L1 accumulation strategy that += already uses on a reserved block, instead of storing every intermediate accumulator value in a compiler-allocated DFB. After this PR, acc = acc + x and out_blk += x lower to the same in-loop accumulating ttl.store; += is now a special case of self-rebinding.
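
The equivalence claimed above can be sketched in plain Python. This is only a semantic model, not the TTL API: scalars stand in for tensor tiles, and `rebind_recurrence` and `in_place_accumulate` are hypothetical names introduced for illustration.

```python
def rebind_recurrence(init, contributions):
    """Model of `acc = acc + x` inside a loop: acc is carried across iterations."""
    acc = init
    for x in contributions:
        acc = acc + x  # self-rebinding: the right-hand side reads the previous acc
    return acc  # final recurrence value; equals init when the loop runs zero times


def in_place_accumulate(init, contributions):
    """Model of `out_blk += x`: accumulate into a single reserved slot."""
    slot = [init]  # one-element list stands in for a reserved block
    for x in contributions:
        slot[0] += x  # no per-iteration intermediate storage is created
    return slot[0]
```

Both functions return the same value for the same inputs, including the zero-trip case where the result is just the initial value. That is the property the lowering has to preserve when it maps both spellings to one accumulating store.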

What's changed

  • Added ttl-materialize-loop-state, a TTL transform that removes ranked-tensor scf.for iter args before compute lowering. Additive recurrences of the form ttl.add(%iter_arg, %contribution) (and the commuted form) lower to a pre-loop initial store and an in-loop accumulating ttl.store; non-additive recurrences lower through a compiler-allocated DFB state slot.
  • Extended Python AST lowering to emit scf.for iter args for self-rebinding loop variables (including tuple targets like a, b = a + d, b + d) and resultful scf.if for tensor variables assigned in branches.
  • Factored shared compiler-allocated DFB materialization helpers so ttl-insert-intermediate-dfbs and ttl-materialize-loop-state use the same allocation convention.
  • Fixed TTKernelInsertL1Accumulation::precededByNonAccumulatingPack so that a cb_reserve_back or cb_push_back for a CB already covered by a closer pack no longer aborts the scan. Loops that accumulate into multiple CBs now detect their pre-loop init packs and reconfigure L1 accumulation to "enable" before the loop.
  • Lit coverage for additive, commuted additive, unary and binary recurrences, mixed tensor states, scalar iter args, multiple final-result users, multi-use add fallback, conditional recurrence, and zero-trip loop semantics. Device coverage (dtype-parametrized bf16/f32) for additive, non-additive, tuple-target, three-accumulator, multi-tile-block, and zero-trip recurrences.
  • Conditional tensor rebinding inside a loop is correct at the materialize-loop-state level but is left as xfail end-to-end: ttl-assign-dst does not yet descend into nested regions. This is tracked in #587 ("Extend ttl-assign-dst to cover tile ops nested in scf.if / scf.for regions").
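
For readers unfamiliar with the self-rebinding detection mentioned in the list above, here is a rough sketch using the standard-library `ast` module. It is not the PR's implementation: `self_rebound_names` is a hypothetical helper, it only inspects top-level assignments in the first `for` loop it finds, and it treats any name read anywhere on the right-hand side as a read of the previous value.

```python
import ast


def self_rebound_names(loop_src):
    """Return names assigned in a for-loop body whose right-hand side reads the same name."""
    tree = ast.parse(loop_src)
    loop = next(n for n in ast.walk(tree) if isinstance(n, ast.For))
    carried = []
    for stmt in loop.body:
        if not isinstance(stmt, ast.Assign):
            continue
        # Every name read on the right-hand side of this assignment.
        reads = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        for target in stmt.targets:
            # Flatten tuple targets such as `a, b = a + d, b + d`.
            elts = target.elts if isinstance(target, (ast.Tuple, ast.List)) else [target]
            for elt in elts:
                if isinstance(elt, ast.Starred):
                    elt = elt.value  # `*rest` carries the underlying name
                if isinstance(elt, ast.Name) and elt.id in reads and elt.id not in carried:
                    carried.append(elt.id)
    return carried
```

Under these assumptions, a body containing `a, b = a + d, b + d` reports `a` and `b` as loop-carried, while `t = x * 2` reports nothing. Names found this way are the ones that would become `scf.for` iter args.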

Fixes #527

…rgs before

compute conversion. Preserve additive recurrences by lowering them to the
existing accumulate-store form, and materialize other tensor recurrences through
compiler-allocated DFB state.
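
The split between the two lowering strategies can be modeled roughly as follows. This is a sketch, not the pass itself: `Op` and `classify_recurrence` are hypothetical, SSA values are represented by their names, and treating `ttl.add(%acc, %acc)` as non-additive is an assumption of this sketch rather than something the PR states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str        # operation name, e.g. "ttl.add" or "ttl.mul"
    operands: tuple  # names of the SSA values the op reads


def classify_recurrence(yielded, iter_arg):
    """Return ("accumulate", contribution) or ("dfb_state", None) for a yielded op."""
    if yielded.name == "ttl.add" and len(yielded.operands) == 2:
        lhs, rhs = yielded.operands
        if lhs == iter_arg and rhs != iter_arg:
            return ("accumulate", rhs)  # ttl.add(%iter_arg, %contribution)
        if rhs == iter_arg and lhs != iter_arg:
            return ("accumulate", lhs)  # commuted form
    # Everything else goes through a compiler-allocated DFB state slot.
    return ("dfb_state", None)
```

An "accumulate" result corresponds to the pre-loop initial store plus in-loop accumulating `ttl.store`; a "dfb_state" result corresponds to the general path.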

Code just moved into this common file from TTLInsertIntermediateDFBs.cpp, no changes

brnorris03 added a commit that referenced this pull request May 13, 2026
  - New pass ttl-verify-dfb-spsc (lib/Dialect/TTL/Transforms/TTLVerifyDFBSPSC.cpp):
  module-level walk that groups cb_reserve/cb_wait ops by cb_index + enclosing
  ttl.kernel_thread func, rejects any DFB with >1 producer or >1 consumer thread.
  - TableGen pass record in Passes.td; build entry in Transforms/CMakeLists.txt.
  - Shared helpers: kKernelThreadAttrName constant (TTL.h), getEnclosingKernelThread
  (TTLOpsUtils.h).
  - Pass wired into three pipeline definitions: TTLPipelines.cpp,
  python/ttl/ttl_api.py:1451, test/me2e/builder/pipeline.py:66.
  - SPSC refactor in test/python/pipe/test_pipenet_multi_iter.py: split
  partial_cb/sum_cb into per-consumer copies, dropped xfail on
  test_gather_bcast_multi_iter, added TODO referencing PR #540.
  - SPSC refactor in test/python/test_store_patterns.py::store_then_forward_kernel:
  split main_dfb into main_for_compute_dfb + main_for_write_dfb.
  - SPSC refactor in _bugs/repro_574.py: matches the test refactor.
  - New lit tests: verify_dfb_spsc.mlir (positive cases), verify_dfb_spsc_invalid.mlir
  (negative cases).
  - Docs: new "Single-producer Single-consumer Semantics" section in
  docs/development/DFBManagement.md (Contract / Violation / Correct form /
  Verification).
  - Filed tt-lang issue #580 for pipeline-definition consolidation.
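
The grouping rule described for ttl-verify-dfb-spsc can be sketched in Python. This is an illustrative model only: `find_spsc_violations` is a hypothetical function, ops are plain tuples rather than MLIR operations, and it assumes `cb_reserve` marks the producer side and `cb_wait` the consumer side.

```python
from collections import defaultdict


def find_spsc_violations(ops):
    """ops: iterable of (kind, cb_index, thread), kind in {"cb_reserve", "cb_wait"}.

    Returns {cb_index: (producer_threads, consumer_threads)} for each violating DFB.
    """
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for kind, cb_index, thread in ops:
        if kind == "cb_reserve":
            producers[cb_index].add(thread)  # assumed producer-side op
        elif kind == "cb_wait":
            consumers[cb_index].add(thread)  # assumed consumer-side op
    violations = {}
    for cb in sorted(set(producers) | set(consumers)):
        # SPSC: at most one producer thread and one consumer thread per DFB.
        if len(producers[cb]) > 1 or len(consumers[cb]) > 1:
            violations[cb] = (sorted(producers[cb]), sorted(consumers[cb]))
    return violations
```

In this model, a DFB waited on by both a compute thread and a writer thread is reported, which mirrors why the tests above split shared DFBs into per-consumer copies.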
- Drop dead FailureOr from materializeToDFB; cleanup caller.
- Restore block_count=2 and BindCBOp-at-function-entry rationale on the
  shared DFB helpers; document them as contracts in the header.
- Replace defensive parent-null branches in TTLMaterializeLoopState with
  asserts; add accumulator-recognition and next-state-store comments.
- Extend Python AST self-rebind / if-carried detection to support tuple
  and starred targets and drop the RankedTensorType filter so scalar
  iter args carry too; factor a shared _ScopedCollector base.
- Fix TTKernelInsertL1Accumulation::precededByNonAccumulatingPack so a
  cb_reserve_back or cb_push_back for a CB already covered by a closer
  pack no longer aborts the scan; multiple accumulators in one loop now
  enable L1 acc correctly.
- Hardware tests: N_ITERS bumped from 1 to 3, dtype-parametrized (bf16,
  f32), new tuple-target test, conditional-rebind test added as
  strict xfail referencing #587.
- Lit: add multi-use-add fallback case; tighten binary_recurrence with
  captured SSA + CHECK-NEXT.
- `+=` on a CB-attached block (`out_blk = cb.reserve()`): emit an
 L1 accumulating `ttl.store` via the registered `__iadd__`
 method. Guarantees L1 accumulation at lowering time.

- Any other case on a tensor target (non-block target, or
 non-`Add` op on a block): rewrite `target op= value` to
 `target = target op value` and visit, so the result flows
 through `ttl.add` / `ttl.sub` / ... and (when inside a loop)
 `ttl-materialize-loop-state`. L1 acc is then used
 opportunistically when the surrounding pattern matches the
 accumulator preconditions.
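
The two-way rule above can be approximated with the standard-library `ast` module. This is a sketch under stated assumptions, not the PR's code: `desugar_augassign` and its `target_is_cb_block` flag are hypothetical, the flag stands in for type information the real lowering has, and only simple name targets are handled.

```python
import ast


def desugar_augassign(stmt_src, target_is_cb_block=False):
    """Rewrite `target op= value` as `target = target op value` unless the fast path applies."""
    node = ast.parse(stmt_src).body[0]
    if not (isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name)):
        return stmt_src
    if target_is_cb_block and isinstance(node.op, ast.Add):
        # `+=` on a CB-attached block keeps its dedicated accumulating-store path.
        return stmt_src
    name = node.target.id
    new = ast.Assign(
        targets=[ast.Name(id=name, ctx=ast.Store())],
        value=ast.BinOp(left=ast.Name(id=name, ctx=ast.Load()), op=node.op, right=node.value),
    )
    ast.copy_location(new, node)
    return ast.unparse(ast.fix_missing_locations(new))
```

With this model, `acc += x` on a non-block target becomes `acc = acc + x`, which is the self-rebinding form the loop-state materialization handles, while `out_blk += x` with `target_is_cb_block=True` is left untouched.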
@brnorris03 brnorris03 mentioned this pull request May 15, 2026
1 task

Development

Successfully merging this pull request may close these issues.

[external bug report][from claude] ttl.add inside scf.for is dropped when result is not carried out of the loop
