Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
237 changes: 220 additions & 17 deletions docs/development/DFBManagement.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,29 +63,88 @@ correct DFB interval boundary.

The pass treats every acquire as opening a DFB live interval. The interval
starts at `cb_reserve` or `cb_wait` and ends after the last operation that can
use the acquired slot. A later acquire in the same DFB sync class bounds
release matching and use discovery, because its release belongs to a different
live interval.
use the acquired slot.

DFB sync classes separate the producer side from the consumer side:
`cb_reserve`/`cb_push` form producer intervals, and `cb_wait`/`cb_pop` form
consumer intervals. Producer acquires bound other producer intervals; consumer
acquires bound other consumer intervals.

The pass finds owned uses from two sources:

- Tensor-form uses follow the result of `cb_reserve` or `cb_wait` through
`ttl.attach_cb`, `ttl.store`, and compute operations.
- Direct DFB uses follow `ttl.copy` operations where the DFB operand direction
matches the interval's DFB sync class. Producer intervals include copies into
the DFB; consumer intervals include copies from the DFB. This is required for
data movement kernels, where copies do not use the tensor value returned by
the acquire op.

Uses inside descendant regions are projected to their ancestor operation in the
acquire's block. This conservatively places the release after the enclosing
structured op when the exact use is nested in an `scf.for` or `scf.if` body.

### Ownership

A use `U` is *owned by* `acquire` if `U` accesses the slot `acquire` acquired.
Two disjoint criteria establish ownership:

- **Tile-SSA ownership** -- `U` is reachable from `acquire`'s result through
the def-use chain over `attach_cb`, `tensor.extract`,
`tensor.extract_slice`, compute ops, and `ttl.store`. Per-tile SSA values
uniquely identify their source acquire, so this criterion has no positional
bound: a use of `cb_wait t1`'s tile is owned by `t1` regardless of where it
appears, even past later acquires on the same DFB.

- **Direct-CB ownership** -- `U` references the CB directly as a `ttl.copy`
operand on the side matching the acquire's sync class (the DM-thread case,
e.g. `ttl.copy %cb, %slice` for a writer). With no SSA tile handle,
ownership is positional: `U` belongs to the latest acquire on
`(cb, sync class)` that precedes it in op order. Equivalently, `U` is
bounded between `acquire` and the next acquire on the same sync class
(`interval.syncClassBoundary` in the pass).

The criteria are disjoint. DM-thread `ttl.copy` does not flow through
`attach_cb` (it takes the CB directly). Compute-thread uses always go through
`attach_cb` and never reference the CB as a direct operand of a tile op.

#### Why two criteria

Compute threads work through SSA tile handles
(`cb_wait` result -> `attach_cb` -> `ttl.store` / compute ops), so tile-SSA
ownership applies and the next-acquire boundary is irrelevant -- SSA already
distinguishes which slot the use refers to. DM threads use direct CB
references (`ttl.copy %cb, %slice`) where no tile handle exists, so direct-CB
ownership is the fallback and the boundary is essential to disambiguate
consecutive direct uses on the same CB. Unifying would require changing
`ttl.copy` to take the attached tensor instead of the CB, a dialect change
tracked as future work.

### Invariants on the inserted release

For each acquire `A`, the inserted release `R_A` must satisfy:

1. **Causal dominance** -- every owned use of `A` precedes `R_A` in op order
(after projecting nested uses to `A`'s block). The pass enforces this
directly: the release is positioned after the last owned use returned by
`findLastOwnedUse`.

2. **FIFO monotonicity** -- for `A_0 < A_1 < ...` on the same `(cb, sync
class)`, the inserted releases satisfy `R_0 < R_1 < ...` in op order. The
CB front (or back) pointer advances monotonically; out-of-order pops would
advance it past slots whose data is still needed.

(1) is enforced explicitly by the pass. (2) is enforced *implicitly* when
consumers under criterion (a) appear in declaration order
(`use(t1); use(t2); use(t3)`). Reordered consumes (`use(t2); use(t1)`) would
violate FIFO monotonicity on their own, but in the current pipeline `TTLCoalesceDFBAcquires`
runs immediately after `TTLInsertCBSync` and rewrites N consecutive same-DFB
acquires into one multi-tile acquire plus per-block `tensor.extract_slice`
views and a single coalesced release with `num_tiles = N*k`. Per-tile
`src_idx` values fall out of `extract_slice` offsets, so consume order is
decoupled from release order and (2) is preserved by construction.

### Idempotency

When the pass runs twice on the same IR, the second run must observe the
releases inserted by the first as already-present and skip re-injection.
Because tile-SSA ownership can place a release past the next-acquire boundary
(when a tile is consumed later than the next acquire on the same DFB),
`findOwnedReleases` extends its release-search upper bound to the acquire's
last owned use. Without this extension, the second run sees the inserted
release as past the boundary and treats the acquire as needing another
release.

### Slot State Model

The pass models producer and consumer acquires as separate slot lifetimes:
Expand Down Expand Up @@ -134,10 +193,11 @@ cb_wait A -> owned reads -> cb_pop A -> cb_wait B
inserted release
```

Once a later acquire in the same DFB sync class starts, subsequent releases are
considered part of that later interval. They cannot release the slot acquired by
the earlier operation because the earlier slot must already be released before
the DFB read or write pointer is reused.
Direct-CB ownership is positional: a release after the next acquire in the
same sync class is owned by that next acquire, not the earlier one. Tile-SSA
ownership is unbounded: a release placed after a tile's last use can sit past
the next acquire and still belong to the earlier interval. The pass
distinguishes these two cases by use criterion, not by a single bound.

### Algorithm

Expand Down Expand Up @@ -171,6 +231,149 @@ The same-block release check makes the pass idempotent. A release after the
next acquire in the same DFB sync class belongs to that later interval and does
not satisfy the earlier acquire.

## DFB Acquire Coalescing

`TTLCoalesceDFBAcquires` runs immediately after `TTLInsertCBSync` and
rewrites a maximal run of consecutive same-DFB acquires (and their matched
releases) into a single multi-tile acquire plus per-block
`tensor.extract_slice` views, with the matched releases collapsed into one
release carrying `num_tiles = N*k`.

```
%t1 = ttl.cb_wait %cb %g = ttl.cb_wait %cb {num_tiles=N*k}
%t2 = ttl.cb_wait %cb %t1 = extract_slice %g [0, 0] [1,k]
... %t2 = extract_slice %g [0, k] [1,k]
ttl.cb_pop %cb ...
ttl.cb_pop %cb ttl.cb_pop %cb {num_tiles=N*k}
```

This matches the canonical tt-metal "cumulative wait + indexed reads +
coalesced pop" pattern (eltwise_binary.cpp, bcast_h.cpp, the matmul
kernels). Without coalescing each acquire lowers to its own
non-cumulative `cb_wait_front(k)` / `cb_pop_front(k)`, which races
whenever consumes are deferred: the first pop advances the front before
the producer has pushed enough tiles to satisfy the next read.

`addSliceOffset` (`include/ttlang/Dialect/Utils/ConversionUtils.h`) folds
each `extract_slice` offset into the per-tile `src_idx` / `dst_idx` at
lowering, so no lowering changes are required. The producer side
(`cb_reserve` / `cb_push`) uses the same templated helpers — per-block
`extract_slice`s become the views of downstream `ttl.tile_store` /
`ttl.store` ops, and `addSliceOffset` handles store-side dst indices the
same way.

### Correctness criterion

For a candidate group of acquires `G = {a_1, ..., a_N}` on DFB `c`, the
rewrite is correct iff every op `O` between consecutive group members
preserves the synchronization invariant of `c` under the coalesced
schedule. The coalesced acquire blocks until `N*k` tiles are present
*before* anything between original `a_i` and `a_{i+1}` runs; the
coalesced release runs only after the last group member's last use.

This holds iff no op between members causes a release on `c` (directly or
transitively): the original IR may have allowed the producer to recycle
slots between `a_i` and `a_{i+1}`, and the coalesced version forbids that
until the very end. Forbidding inter-member releases is therefore
necessary for correctness at low `block_count`, and sufficient when paired
with the coalesced release placement.

A locally-checkable (sound, conservative) version of that criterion: an
op `O` between members is safe to skip past iff none of:

1. `O` operates on `c` directly (`c` appears as an operand). Covers
`cb_pop` / `cb_push` on `c` and any other op that reads or writes `c`.
2. `O` consumes the SSA result of any current group member. A consume can
flow into a release on `c` somewhere downstream, and we don't perform
transitive analysis.
3. `O` carries a region. Region bodies might contain a release on `c`;
conservative cutoff.

Anything else — an acquire or release on a different DFB, `arith.constant`,
pure compute on other DFBs — cannot affect `c` and is safe. `ttl.attach_cb`
is explicitly excluded from rules (1)–(2): it is an SSA-only identity op
(the metal lowering erases it) that always references the group's results
and `cb` as operands, so the generic check would otherwise wrongly break
the group at every `attach_cb`.

#### Why this is sufficient

Suppose `O` between `a_i` and `a_{i+1}` satisfies all three negations
above. Then:

- `O` does not directly call any release on `c` (rule 1).
- `O`'s outputs do not consume any tile from `G` (rule 2 on operands; the
outputs cannot make further data depend on `G`'s tiles).
- `O` has no inner region that could hide an indirect release on `c`
(rule 3).

So the only way a release on `c` could appear before the coalesced
release is via a transitive use of some non-`G` value. Because rule 2
forbids `G`'s outputs from being inputs to `O`, no fresh dataflow path is
created from `G` into a `c` release. Any release on `c` reachable from
some unrelated value would have run in the original IR too, at exactly
the same op-order position, so the coalesced version is no worse.

#### Why this is necessary

If `O` is itself a release on `c` (e.g., a user-written `cb_pop`), the
original IR lets the producer recycle one slot at `O`, but the coalesced
acquire holds all `N*k` slots from the start. With `block_count` only
slightly larger than the working set, the producer cannot push the next
batch and the consumer cannot release until all members are consumed —
deadlock. Same argument for transitive releases via group results.

### Detection algorithm

Per block, pre-collect all acquires of the kind under consideration
(`cb_wait` for the consumer pass; `cb_reserve` for the producer pass).
For each candidate leader (in op order):

```
if leader is already coalesced (num_tiles set) or already erased:
continue

group = [leader]
for op = leader.nextOp; op != nullptr; op = op.nextOp:
if op is a same-kind same-cb acquire with no num_tiles:
group.push_back(op); continue
if op is a same-kind acquire on a different DFB:
continue # benign: cannot touch our DFB or our group's results
if mayReleaseDFB(op, cb=leader.cb, group):
break
# else: tolerate (different-DFB op, attach_cb, arith, ...)

if group.size() < 2: continue
match N releases on cb after the last group member, in op order
apply rewrite, mark group members as erased
```

Because the candidate set is fixed before any rewrite, acquires on a
different DFB that the inner loop skips past (e.g., the matmul-style
`a1, b1, a2, b2` interleave) still get a chance to lead their own group
on a later iteration of the outer loop.

### Idempotency

The coalesced acquire and release carry a `num_tiles` attribute, and
`detectGroup` skips acquires that already have one. A second run of the
pass therefore finds no candidate groups and is a no-op. The doubled-pass
lit invocation
(`--pass-pipeline='builtin.module(func.func(ttl-coalesce-dfb-acquires,
ttl-coalesce-dfb-acquires))'`) verifies this.

### Limitations

- Non-rank-2 acquire result types are not coalesced. The existing
`num_tiles` convention (matching `TTLSubblockComputeForDST`) produces
`tensor<1, num_tiles, elem>`; the pass conservatively bails on other
ranks rather than picking an axis to scale.
- Acquires already carrying `num_tiles` (set by
`TTLSubblockComputeForDST`) are not extended.
- Region-bearing ops between members terminate the group, so coalescing
does not span control flow within an `scf.if` or `scf.for` (loop-body
coalescing still works because the body is its own block).

## Index Reuse

`TTLFinalizeDFBIndices` reduces the physical DFB count by assigning the same index to compiler-allocated DFBs whose lifetimes do not overlap. The algorithm runs per function. Compiler-allocated DFBs are intra-thread (both producer and consumer are in the same compute function), so their lifetimes are independent across functions.
Expand Down
16 changes: 12 additions & 4 deletions include/ttlang/Dialect/TTL/IR/TTLOps.td
Original file line number Diff line number Diff line change
Expand Up @@ -1199,15 +1199,19 @@ def TTL_CBWaitOp : TTL_Op<"cb_wait",
This operation is used by consumer threads (typically compute kernels or
data movement kernels writing to DRAM) to wait for data from producers.

The number of pages is derived from the CB's shape (elements per block).
The number of pages defaults to the CB's elements per block, but can be
overridden via the optional `num_tiles` attribute when the wait spans
multiple coalesced blocks (see `ttl-coalesce-dfb-acquires`).

Example:
```mlir
%view = ttl.cb_wait %cb : <[1, 1], !ttcore.tile<32x32, bf16>, 2> -> tensor<1x1x!ttcore.tile<32x32, bf16>>
%coalesced = ttl.cb_wait %cb {num_tiles = 3 : i64} : <[1, 1], !ttcore.tile<32x32, bf16>, 4> -> tensor<1x3x!ttcore.tile<32x32, bf16>>
```
}];
let arguments = (ins
TTL_CircularBuffer:$cb
TTL_CircularBuffer:$cb,
OptionalAttr<I64Attr>:$num_tiles
);
let results = (outs AnyRankedTensor:$result);
let assemblyFormat = "$cb attr-dict `:` type($cb) `->` type($result)";
Expand All @@ -1223,15 +1227,19 @@ def TTL_CBPopOp : TTL_Op<"cb_pop", [MemoryEffects<[MemWrite]>]> {
This operation must be called after reading data acquired via `ttl.cb_wait`.
It increments the CB's consumer pointer.

The number of pages is derived from the CB's shape (elements per block).
The number of pages defaults to the CB's elements per block, but can be
overridden via the optional `num_tiles` attribute when the pop releases
multiple coalesced blocks (see `ttl-coalesce-dfb-acquires`).

Example:
```mlir
ttl.cb_pop %cb : <[1, 1], !ttcore.tile<32x32, bf16>, 2>
ttl.cb_pop %cb {num_tiles = 3 : i64} : <[1, 1], !ttcore.tile<32x32, bf16>, 4>
```
}];
let arguments = (ins
TTL_CircularBuffer:$cb
TTL_CircularBuffer:$cb,
OptionalAttr<I64Attr>:$num_tiles
);
let assemblyFormat = "$cb attr-dict `:` type($cb)";
let hasVerifier = 1;
Expand Down
9 changes: 9 additions & 0 deletions include/ttlang/Dialect/TTL/IR/TTLOpsUtils.h
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,15 @@ inline mlir::Value traceUnrealizedCasts(mlir::Value value) {
return value;
}

/// Walk through `tensor.extract_slice` ops and return the underlying
/// `ttl.cb_reserve` op, or null if the chain doesn't end at one.
inline mlir::tt::ttl::CBReserveOp findCBReserveForView(mlir::Value view) {
while (auto slice = view.getDefiningOp<mlir::tensor::ExtractSliceOp>()) {
view = slice.getSource();
}
return view.getDefiningOp<mlir::tt::ttl::CBReserveOp>();
}

/// Resolve the CB index attached to `cb` by tracing through unrealized
/// conversion casts to its defining BindCBOp. Returns std::nullopt when the
/// value does not trace to a BindCBOp.
Expand Down
Loading
Loading