[ttl] Add block expressions (lazy IR creation)#446

Draft
brnorris03 wants to merge 13 commits into main from bnorris/block-expr
@brnorris03
Contributor

No description provided.

(preserves topological order that signpost categorization depends on).
For multi-store groups, merge traces via block-order walk.

Grouping criterion: Replaced union-find (shared block_expr ops) with identity
grouping (same getTensor() SSA value). Diamond DAGs where one sub-expression feeds
different tile ops packed to different CBs can't share a compute because init_sfpu
configures PACK for a single output CB.
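
The identity-grouping rule described above can be illustrated with a toy model. This is only a sketch under assumptions: stores are represented as plain tuples, the string keys stand in for the SSA value returned by `getTensor()`, and `group_stores_by_tensor` is a hypothetical helper, not a function from the compiler.

```python
from collections import defaultdict


def group_stores_by_tensor(stores):
    """Group stores by the identity of the tensor they write.

    `stores` is a list of (store_name, tensor_key, output_cb) tuples.
    `tensor_key` is a stand-in for the getTensor() SSA value (assumption).
    """
    groups = defaultdict(list)
    for store_name, tensor_key, _output_cb in stores:
        groups[tensor_key].append(store_name)
    return dict(groups)


# Diamond DAG: one shared sub-expression feeds two different tile ops,
# each packed to its own CB. The stored tensors are distinct values,
# so identity grouping keeps the two stores in separate groups.
diamond = [
    ("store0", "%exp_of_t", "cb1"),
    ("store1", "%relu_of_t", "cb2"),
]

# Multi-store group: the same tensor value is stored more than once,
# so both stores land in one group.
multi_store = [
    ("store0", "%sum", "cb1"),
    ("store1", "%sum", "cb2"),
]

diamond_groups = group_stores_by_tensor(diamond)
multi_groups = group_stores_by_tensor(multi_store)
```

A union-find over shared producer ops would have merged the two diamond stores into one group, which is the case the commit message says cannot share a compute.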
brnorris03 added a commit that referenced this pull request Apr 14, 2026
### Problem description

Matmul K accumulation across multiple K blocks had no L1 accumulation
support. The DST-based workaround (`prev + a @ b`, lowered to
`copy_tile` + `matmul_block`) required an extra accumulator DFB, a
per-K-block copy from L1 back into DST, and a final copy to the output
DFB.

### What's changed

Switches K accumulation from DST-based to L1 packer accumulation
(`pack_reconfig_l1_acc`). Each compute is now independent -- the packer
adds to the existing L1 value instead of overwriting.

Accumulation is explicit via `+=` on reserved blocks. Plain `store()`
always overwrites. This is an interim mechanism; the spec's full
`BlockExpr` pattern (`fill` + lazy `+=` + `store`) is deferred to #446.

DSL pattern:

```python
out_blk = out_dfb.reserve()
for kt in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk     # explicit L1 accumulation
    a_blk.pop()
    b_blk.pop()
out_blk.push()
```

`+=` emits `ttl.store` with `{accumulate}`, which the compiler detects
and annotates for L1 packer accumulation. `store()` emits a plain
`ttl.store` and always overwrites.
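
The difference between `+=` and `store()` can be modeled in plain Python. This is a toy sketch, not the tt-lang implementation: `MockBlock` is a hypothetical stand-in for `TensorBlock`, scalars stand in for tiles, and the reserved block is assumed to start at zero.

```python
class MockBlock:
    """Hypothetical stand-in for a reserved output block (not tt-lang API)."""

    def __init__(self, value=0.0):
        self.value = value  # models the contents of the reserved L1 block
        self.emitted = []   # records which kind of store each call models

    def store(self, expr):
        # Plain store: overwrite whatever is already in L1.
        self.value = expr
        self.emitted.append("ttl.store")

    def __iadd__(self, expr):
        # `+=`: store with {accumulate}, adding to the existing L1 value.
        self.value += expr
        self.emitted.append("ttl.store {accumulate}")
        return self


def accumulate_over_k(partial_products):
    out_blk = MockBlock()
    for p in partial_products:
        out_blk += p          # explicit accumulation across K blocks
    return out_blk


def overwrite_over_k(partial_products):
    out_blk = MockBlock()
    for p in partial_products:
        out_blk.store(p)      # each iteration replaces the previous value
    return out_blk


acc = accumulate_over_k([1.0, 2.0, 3.0])
ovw = overwrite_over_k([1.0, 2.0, 3.0])
```

With three partial products, the accumulating version ends with their sum, while the overwriting version keeps only the last one.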

Compiler pipeline additions:

- `TTLAnnotateL1AccLoops`: walks accumulating stores and uses dominance
to verify the `cb_reserve` is outside the enclosing loop. Rejects `+=`
inside conditionals (#504). Annotates enclosing loops with
`ttl.l1_acc_loop`.
- `TTKernelInsertL1Accumulation`: groups annotated loops into
accumulation scopes by shared pack CB targets. Consecutive sibling loops
sharing a CB get a single disable pair with re-enable between loops
(init ops between loops reset packer state). Nested annotated loops are
folded into the outermost ancestor. Max-reduce loops are excluded.
- `TTKernelCombinePackTiles`: updated to not combine pack tiles across
L1 acc loop boundaries.
- Subblocking (`TTLSubblockComputeForDST`): now allows matmul
accumulating computes. Added `--ttl-strict-f32-acc` option that errors
if a non-f32 output accumulation loop requires subblocking.
- `StoreOp`: added optional `{accumulate}` attribute (emitted by `+=`,
consumed by `TTLAnnotateL1AccLoops`).
- `operators.py`: added `__iadd__` on `TensorBlock` for explicit
accumulation.
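
The scope-grouping rule described for `TTKernelInsertL1Accumulation` can be sketched as follows. This is one reading of the description, not the pass itself: loops are modeled as `(name, set_of_cb_ids)` pairs in program order, and `group_into_scopes` is a hypothetical helper. Nested loops and max-reduce exclusion are not modeled.

```python
def group_into_scopes(sibling_loops):
    """Group consecutive sibling loops that share a pack CB target.

    `sibling_loops` is a list of (loop_name, cb_ids) pairs in program order,
    where `cb_ids` is the set of CBs the loop packs into (assumed model).
    Returns a list of scopes, each a list of loop names.
    """
    scopes = []
    for loop_name, cb_ids in sibling_loops:
        if scopes and scopes[-1]["cbs"] & cb_ids:
            # Shares a CB with the current scope: extend that scope.
            scopes[-1]["loops"].append(loop_name)
            scopes[-1]["cbs"] |= cb_ids
        else:
            # No shared CB with the preceding scope: start a new one.
            scopes.append({"loops": [loop_name], "cbs": set(cb_ids)})
    return [scope["loops"] for scope in scopes]


# Two consecutive loops pack into CB 16, a third packs into CB 17.
loops = [("k_loop_0", {16}), ("k_loop_1", {16}), ("k_loop_2", {17})]
scopes = group_into_scopes(loops)
```

Under this model, the first two loops form one scope and the third loop gets its own.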

### Performance

L1-only 2048^3 matmul with L1 acc runs in 0.70x the time of `ttnn.matmul` (30%
faster); DRAM-bound 4096^3 takes 2.76x as long (bottlenecked by per-core DRAM
reads, no multicast yet). Benchmarking and examples are on the
[`bnorris/matmul-bench`](https://github.com/tenstorrent/tt-lang/blob/bnorris/matmul-bench/examples/matmul_bench/MatmulPerformance.md)
branch (will be a separate PR).

### Future work (not in this PR)

- Compiler-generated outer K loop: user writes `out.store(a @ b)` with
K_block-sized DFBs, compiler generates the outer K_num_blocks L1 acc
loop and DM-side K-chunked reads (#446)
- Full `BlockExpr` accumulation syntax (`y = fill(0); y += a @ b;
out.store(y)`): lazy block expressions (#446)
- L1 acc for compiler-generated K loops: tile K to 1 in subblocking when
no enclosing user K loop, handle nested reduction loops
- Fused post-ops after K reduction (`out.store(relu(a @ b))`) (#486)
- Conditional `+=` support: track whether a pack actually happened
instead of using the loop IV (#504)

---------

Co-authored-by: Peter Hizalev <phizalev@tenstorrent.com>
Co-authored-by: Alex Richins <arichins@tenstorrent.com>