[ttl] Add block expressions (lazy IR creation) by brnorris03 · Pull Request #446 · tenstorrent/tt-lang

brnorris03 · 2026-04-01T01:58:48Z

No description provided.

(preserves topological order that signpost categorization depends on). For multi-store groups, merge traces via a block-order walk. Grouping criterion: replaced union-find (shared `block_expr` ops) with identity grouping (same `getTensor()` SSA value). Diamond DAGs in which one sub-expression feeds different tile ops packed to different CBs can't share a compute, because `init_sfpu` configures PACK for a single output CB.
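To illustrate the grouping criterion described above, here is a minimal sketch in plain Python. It is not the compiler's implementation: `Store`, its fields, and `group_by_tensor_identity` are hypothetical names, and a plain Python object stands in for the SSA value returned by `getTensor()`. It only shows that stores are grouped when they reference the very same tensor value, and that sharing a sub-expression alone does not merge them (which a union-find over shared ops would do).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """Hypothetical stand-in for a store op (not a tt-lang class)."""
    name: str
    tensor: object          # stands in for the SSA value from getTensor()
    subexprs: tuple = ()    # sub-expressions feeding the stored value


def group_by_tensor_identity(stores):
    """Group stores whose tensor is the same object, in first-seen order."""
    groups = {}
    for store in stores:
        # Key on object identity, mirroring "same SSA value".
        groups.setdefault(id(store.tensor), []).append(store)
    return list(groups.values())


t0, t1 = object(), object()
shared = "a + b"  # a sub-expression used by both tensors (diamond shape)

stores = [
    Store("s0", t0, (shared,)),
    Store("s1", t1, (shared,)),  # shares a sub-expression with s0, different tensor
    Store("s2", t0, (shared,)),  # same tensor as s0
]

groups = group_by_tensor_identity(stores)
group_names = [[s.name for s in g] for g in groups]
print(group_names)
```

Here `s0` and `s2` end up in one multi-store group, while `s1` stays separate even though it shares a sub-expression with them.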

### Problem description

Matmul K accumulation across multiple K blocks had no L1 accumulation support. The DST-based workaround (`prev + a @ b`, lowered to `copy_tile` + `matmul_block`) required an extra accumulator DFB, a per-K-block copy from L1 back into DST, and a final copy to the output DFB.

### What's changed

Switches K accumulation from DST-based to L1 packer accumulation (`pack_reconfig_l1_acc`). Each compute is now independent: the packer adds to the existing L1 value instead of overwriting. Accumulation is explicit via `+=` on reserved blocks. Plain `store()` always overwrites. This is an interim mechanism; the spec's full `BlockExpr` pattern (`fill` + lazy `+=` + `store`) is deferred to #446.

DSL pattern:

```python
out_blk = out_dfb.reserve()
for kt in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk  # explicit L1 accumulation
    a_blk.pop()
    b_blk.pop()
out_blk.push()
```

`+=` emits `ttl.store` with `{accumulate}`, which the compiler detects and annotates for L1 packer accumulation. `store()` emits a plain `ttl.store` and always overwrites.

Compiler pipeline additions:

- `TTLAnnotateL1AccLoops`: walks accumulating stores and uses dominance to verify the `cb_reserve` is outside the enclosing loop. Rejects `+=` inside conditionals (#504). Annotates enclosing loops with `ttl.l1_acc_loop`.
- `TTKernelInsertL1Accumulation`: groups annotated loops into accumulation scopes by shared pack CB targets. Consecutive sibling loops sharing a CB get a single disable pair with re-enable between loops (init ops between loops reset packer state). Nested annotated loops are folded into the outermost ancestor. Max-reduce loops are excluded.
- `TTKernelCombinePackTiles`: updated to not combine pack tiles across L1 acc loop boundaries.
- Subblocking (`TTLSubblockComputeForDST`): now allows matmul accumulating computes. Added `--ttl-strict-f32-acc` option that errors if a non-f32 output accumulation loop requires subblocking.
- `StoreOp`: added optional `{accumulate}` attribute (emitted by `+=`, consumed by `TTLAnnotateL1AccLoops`).
- `operators.py`: added `__iadd__` on `TensorBlock` for explicit accumulation.

### Performance

L1-only 2048^3 matmul with L1 acc runs at 0.70x of ttnn.matmul (30% faster); DRAM-bound 4096^3 is 2.76x (bottlenecked by per-core DRAM reads, no multicast yet). Benchmarking and examples are on the [`bnorris/matmul-bench`](https://github.com/tenstorrent/tt-lang/blob/bnorris/matmul-bench/examples/matmul_bench/MatmulPerformance.md) branch (will be a separate PR).

### Future work (not in this PR)

- Compiler-generated outer K loop: user writes `out.store(a @ b)` with K_block-sized DFBs, compiler generates the outer K_num_blocks L1 acc loop and DM-side K-chunked reads (#446)
- Full `BlockExpr` accumulation syntax (`y = fill(0); y += a @ b; out.store(y)`): lazy block expressions (#446)
- L1 acc for compiler-generated K loops: tile K to 1 in subblocking when no enclosing user K loop, handle nested reduction loops
- Fused post-ops after K reduction (`out.store(relu(a @ b))`) (#486)
- Conditional `+=` support: track whether a pack actually happened instead of using the loop IV (#504)

---------

Co-authored-by: Peter Hizalev <phizalev@tenstorrent.com>
Co-authored-by: Alex Richins <arichins@tenstorrent.com>
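The difference between `+=` and `store()` described in the commit message above can be modeled in plain Python. This is only a sketch of the semantics under stated assumptions: `ToyBlock` and `matmul` are hypothetical helpers, not the tt-lang `TensorBlock` or any hardware API, and the block is assumed to start zeroed, which simplifies whatever the packer does on the first K iteration.

```python
class ToyBlock:
    """Hypothetical model of a reserved output block (not tt-lang's TensorBlock)."""

    def __init__(self, rows, cols):
        # Assumption: the block starts zeroed.
        self.data = [[0.0] * cols for _ in range(rows)]

    def store(self, value):
        # Plain store: always overwrites the existing contents.
        self.data = [row[:] for row in value]

    def __iadd__(self, value):
        # Accumulating store: add element-wise to what is already there.
        self.data = [
            [x + y for x, y in zip(old_row, new_row)]
            for old_row, new_row in zip(self.data, value)
        ]
        return self


def matmul(a, b):
    """Naive matrix product on nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# A (1x4) and B (4x1), each split into two K blocks of size 2.
a_blocks = [[[1.0, 2.0]], [[3.0, 4.0]]]
b_blocks = [[[5.0], [6.0]], [[7.0], [8.0]]]

acc = ToyBlock(1, 1)
for a_blk, b_blk in zip(a_blocks, b_blocks):
    acc += matmul(a_blk, b_blk)      # partial products are summed

ovr = ToyBlock(1, 1)
for a_blk, b_blk in zip(a_blocks, b_blocks):
    ovr.store(matmul(a_blk, b_blk))  # only the last partial product survives

print(acc.data, ovr.data)  # [[70.0]] [[53.0]]
```

With `+=` the block ends up holding the full dot product over all K blocks (70.0), whereas plain `store()` leaves only the last partial product (53.0).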

brnorris03 added 11 commits March 25, 2026 22:08

initial block expr implementation

452f54b

update signpost related functionality

aad248f

multiple stores fusion

3ae70a5

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

c9b4d9f

redesign signposts

92a9861

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

be3e3b9

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

7c4c14f

[no ci] new files

8f9e1ee

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

78ef51b

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

b4b09f5

brnorris03 mentioned this pull request Apr 10, 2026

[ttl] Matmul with pack_reconfig_l1_acc #490

Merged

brnorris03 added 2 commits April 28, 2026 20:51

Merge remote-tracking branch 'origin/main' into bnorris/block-expr

a00eda4

post-merge cleanup

764df12

brnorris03 wants to merge 13 commits into main from bnorris/block-expr
