[ttl] Add block expressions (lazy IR creation) #446
Draft
brnorris03 wants to merge 13 commits into
(preserves topological order that signpost categorization depends on). For multi-store groups, merge traces via a block-order walk. Grouping criterion: replaced union-find (shared `block_expr` ops) with identity grouping (same `getTensor()` SSA value). Diamond DAGs where one sub-expression feeds different tile ops packed to different CBs can't share a compute, because `init_sfpu` configures PACK for a single output CB.
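The identity-grouping criterion can be illustrated with a minimal Python sketch. This is not the compiler's C++ implementation: `Store` and `group_by_tensor_identity` are hypothetical names, and a plain Python object stands in for the SSA value returned by `getTensor()`. Stores are grouped only when they refer to the very same tensor object, and groups come out in first-seen (block) order.

```python
class Store:
    """Hypothetical stand-in for a store op; `tensor` plays the role of getTensor()."""

    def __init__(self, name, tensor):
        self.name = name
        self.tensor = tensor


def group_by_tensor_identity(stores):
    """Group stores that target the same tensor object (identity, not equality).

    `stores` is assumed to be in block order; dicts preserve insertion order,
    so groups and their members keep that order.
    """
    groups = {}
    for store in stores:
        # id() keys on object identity, mirroring "same SSA value".
        groups.setdefault(id(store.tensor), []).append(store)
    return list(groups.values())


t0, t1 = object(), object()
stores = [Store("s0", t0), Store("s1", t1), Store("s2", t0)]
for group in group_by_tensor_identity(stores):
    print([s.name for s in group])
```

Unlike union-find over shared ops, two stores that merely share a sub-expression but target different tensors end up in separate groups here, which matches the diamond-DAG restriction described above.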
brnorris03 added a commit that referenced this pull request on Apr 14, 2026
### Problem description

Matmul K accumulation across multiple K blocks had no L1 accumulation support. The DST-based workaround (`prev + a @ b`, lowered to `copy_tile` + `matmul_block`) required an extra accumulator DFB, a per-K-block copy from L1 back into DST, and a final copy to the output DFB.

### What's changed

Switches K accumulation from DST-based to L1 packer accumulation (`pack_reconfig_l1_acc`). Each compute is now independent -- the packer adds to the existing L1 value instead of overwriting. Accumulation is explicit via `+=` on reserved blocks. Plain `store()` always overwrites. This is an interim mechanism; the spec's full `BlockExpr` pattern (`fill` + lazy `+=` + `store`) is deferred to #446.

DSL pattern:

```python
out_blk = out_dfb.reserve()
for kt in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk  # explicit L1 accumulation
    a_blk.pop()
    b_blk.pop()
out_blk.push()
```

`+=` emits `ttl.store` with `{accumulate}`, which the compiler detects and annotates for L1 packer accumulation. `store()` emits a plain `ttl.store` and always overwrites.

Compiler pipeline additions:

- `TTLAnnotateL1AccLoops`: walks accumulating stores and uses dominance to verify the `cb_reserve` is outside the enclosing loop. Rejects `+=` inside conditionals (#504). Annotates enclosing loops with `ttl.l1_acc_loop`.
- `TTKernelInsertL1Accumulation`: groups annotated loops into accumulation scopes by shared pack CB targets. Consecutive sibling loops sharing a CB get a single disable pair with re-enable between loops (init ops between loops reset packer state). Nested annotated loops are folded into the outermost ancestor. Max-reduce loops are excluded.
- `TTKernelCombinePackTiles`: updated to not combine pack tiles across L1 acc loop boundaries.
- Subblocking (`TTLSubblockComputeForDST`): now allows matmul accumulating computes. Added `--ttl-strict-f32-acc` option that errors if a non-f32 output accumulation loop requires subblocking.
- `StoreOp`: added optional `{accumulate}` attribute (emitted by `+=`, consumed by `TTLAnnotateL1AccLoops`).
- `operators.py`: added `__iadd__` on `TensorBlock` for explicit accumulation.

### Performance

L1-only 2048^3 matmul with L1 acc runs at 0.70x of ttnn.matmul (30% faster); DRAM-bound 4096^3 is 2.76x (bottlenecked by per-core DRAM reads, no multicast yet). Benchmarking and examples are on the [`bnorris/matmul-bench`](https://github.com/tenstorrent/tt-lang/blob/bnorris/matmul-bench/examples/matmul_bench/MatmulPerformance.md) branch (will be a separate PR).

### Future work (not in this PR)

- Compiler-generated outer K loop: user writes `out.store(a @ b)` with K_block-sized DFBs, compiler generates the outer K_num_blocks L1 acc loop and DM-side K-chunked reads (#446)
- Full `BlockExpr` accumulation syntax (`y = fill(0); y += a @ b; out.store(y)`): lazy block expressions (#446)
- L1 acc for compiler-generated K loops: tile K to 1 in subblocking when no enclosing user K loop, handle nested reduction loops
- Fused post-ops after K reduction (`out.store(relu(a @ b))`) (#486)
- Conditional `+=` support: track whether a pack actually happened instead of using the loop IV (#504)

---------

Co-authored-by: Peter Hizalev <phizalev@tenstorrent.com>
Co-authored-by: Alex Richins <arichins@tenstorrent.com>
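The `+=` versus `store()` semantics in the commit message above can be modeled in plain Python to show why summing per-K-block partial products reproduces the full matmul. This is a host-side sketch only, not the tt-lang API: `ToyBlock`, `matmul`, and `k_blocked_matmul` are hypothetical names, nested lists stand in for tiles, and the block is assumed to start at zero (how the real packer treats the first K block is not modeled).

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ToyBlock:
    """Hypothetical stand-in for a reserved output block (not TensorBlock)."""

    def __init__(self, rows, cols):
        # Assumption: the block starts zeroed.
        self.data = [[0] * cols for _ in range(rows)]

    def store(self, value):
        """Plain store: always overwrites the existing contents."""
        self.data = [list(row) for row in value]

    def __iadd__(self, value):
        """Accumulating store: adds to the existing contents."""
        self.data = [
            [old + new for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(self.data, value)
        ]
        return self


def k_blocked_matmul(a, b, k_block):
    """Accumulate partial products over K in chunks of k_block."""
    out = ToyBlock(len(a), len(b[0]))
    for k0 in range(0, len(b), k_block):
        a_blk = [row[k0:k0 + k_block] for row in a]  # columns k0..k0+k_block of A
        b_blk = b[k0:k0 + k_block]                   # rows k0..k0+k_block of B
        out += matmul(a_blk, b_blk)                  # explicit accumulation
    return out.data


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(k_blocked_matmul(a, b, k_block=2) == matmul(a, b))
```

Each loop iteration depends only on its own `a_blk` and `b_blk`, which mirrors the point that every compute is independent once accumulation happens at the destination rather than through a carried `prev` value.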