Skip to content

[fix][ttl] ttl-insert-cb-sync: Allow tensor SSA uses past the next-acquire boundary (#536)#554

Merged
brnorris03 merged 15 commits into
mainfrom
bnorris/auto-push-pop-fixes
May 12, 2026
Merged

[fix][ttl] ttl-insert-cb-sync: Allow tensor SSA uses past the next-acquire boundary (#536)#554
brnorris03 merged 15 commits into
mainfrom
bnorris/auto-push-pop-fixes

Conversation

@brnorris03
Copy link
Copy Markdown
Contributor

@brnorris03 brnorris03 commented May 8, 2026

Problem:
ttl-insert-cb-sync inserts cb_pop / cb_push after the last use of each acquired slot, but only searched up to the next same-class acquire on the same DFB. Tensor SSA uses can legitimately appear after a later acquire (a tile from cb_wait t1 may be consumed after cb_wait t2). Those uses were missed, the pop was inserted right after the acquire, and the read pointer advanced before the data was read. The case_a / case_b reproducers from the issue 536 follow-up comment miscompiled silently.

Independently, deferred-consume patterns (t1 = cb.wait(); t2 = cb.wait(); use(t1); use(t2)) lowered to N independent cb_wait_front(k) / cb_pop_front(k) calls. Because metal's wait/pop primitives are non-cumulative, the first pop advances the front before the producer has pushed enough tiles to satisfy the next read.

Fixes (~530 LOC):

  1. Split findLastOwnedUse: direct CB uses keep the boundary, tensor SSA uses ignore it. Extend findOwnedReleases' upper bound to the last owned use.
  2. New pass ttl-coalesce-dfb-acquires rewrites a maximal run of same-DFB acquires + their matched releases into the canonical tt-metal cumulative-wait shape (cb_wait_front(N*k) + per-block tensor.extract_slice views + cb_pop_front(N*k)). addSliceOffset already folds the slice offsets into the per-tile src_idx / dst_idx, so no lowering changes. Same template handles producer-side cb_reserve/cb_push. The detection rule is locally-checkable: an op between members terminates the group iff it touches our DFB or consumes a group member's result, or carries a region (the conservative envelope of "could trigger a release on the DFB"). This means matmul-style patterns where acquires on two DFBs are interleaved (a1, b1, a2, b2) coalesce per-DFB independently, since an acquire on a different DFB is benign. See docs/development/DFBManagement.md for the full correctness argument. Fixes [ttl] Coalesce consecutive cb_wait acquires into multi-tile cb_wait_front(N) with per-acquire src_idx #556.

Tests:

  • test/python/test_auto_pop_push.py: device coverage including the case_a / case_b reproducers, an interleaved-baseline regression guard, and axis-coverage cases (fanout, mixed immediate/deferred, long chains, multi-DFB, multi-tile geometry, tight block_count, single-slot, long DM loops, producer/consumer in scf.for, multiple direct DFB uses on one acquire). The reordered-consume case (formerly xfail [ttl] Coalesce consecutive cb_wait acquires into multi-tile cb_wait_front(N) with per-acquire src_idx #556) now runs as a passing test.
  • test/ttlang/Dialect/TTL/Transforms/insert_cb_sync.mlir: IR-level coverage of the deferred-use pattern for compute-thread consumers (flat and inside scf.for) and producers, plus a regression guard for the scf.for iter_args pattern.
  • test/ttlang/Dialect/TTL/Transforms/coalesce_dfb_acquires.mlir: positive (consumer / producer / loop-body / matmul-style two-DFB interleaved), negative (interleaved consume, single-acquire-per-DFB alternation, pre-existing num_tiles), and idempotency coverage for the new pass.
  • docs/development/DFBManagement.md: extended "DFB Sync Insertion" with the ownership criteria, release-placement invariants, and idempotency.

Future work:

Fixes #536 follow-up. Fixes #556.

@brnorris03 brnorris03 changed the title fix][ttl] Allow tensor SSA uses past the next-acquire boundary [fix][ttl] Allow tensor SSA uses past the next-acquire boundary May 8, 2026
@brnorris03 brnorris03 changed the title [fix][ttl] Allow tensor SSA uses past the next-acquire boundary [fix][ttl] ttl-insert-cb-sync: Allow tensor SSA uses past the next-acquire boundary (#536) May 8, 2026
@brnorris03 brnorris03 marked this pull request as ready for review May 8, 2026 15:40
@brnorris03 brnorris03 requested a review from a team as a code owner May 8, 2026 15:40
brnorris03 added 9 commits May 8, 2026 08:52
… acquires + N matching releases into the canonical tt-metal cumulative-wait shape (`cb_wait_front(N*k)` + per-block `tensor.extract_slice` views + `cb_pop_front(N*k)`). `addSliceOffset` already folds the slice offsets into the per-tile `src_idx` / `dst_idx`, so no lowering changes. Symmetric on the producer side. Fixes #556.
extend correctness argument and idempotency notes in DFBManagement.md;
add adversarial pytests (matmul-style two-DFB interleave, t1-fanout,
interposed third acquire) and matching lit coverage; rename CB->DFB in
new content.
ttl-auto-sync pipeline; update C++, Python, and me2e callers to use it.

Remove helper used just once.
Copy link
Copy Markdown
Contributor

@zoecarver zoecarver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Contributor

@arichinsTT arichinsTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@brnorris03 brnorris03 enabled auto-merge (squash) May 12, 2026 16:34
@brnorris03 brnorris03 merged commit 5368cce into main May 12, 2026
13 checks passed
@brnorris03 brnorris03 deleted the bnorris/auto-push-pop-fixes branch May 12, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants