[ttl] Matmul with pack_reconfig_l1_acc by brnorris03 · Pull Request #490 · tenstorrent/tt-lang

brnorris03 · 2026-04-10T04:43:02Z

Problem description

Matmul K accumulation across multiple K blocks had no L1 accumulation support. The DST-based workaround (prev + a @ b, lowered to copy_tile + matmul_block) required an extra accumulator DFB, a per-K-block copy from L1 back into DST, and a final copy to the output DFB.

What's changed

Switches K accumulation from DST-based to L1 packer accumulation (pack_reconfig_l1_acc). Each compute is now independent -- the packer adds to the existing L1 value instead of overwriting.

Accumulation is explicit via += on reserved blocks. Plain store() always overwrites. This is an interim mechanism; the spec's full BlockExpr pattern (fill + lazy += + store) is deferred to #446.

DSL pattern:

out_blk = out_dfb.reserve()
for kt in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk     # explicit L1 accumulation
    a_blk.pop()
    b_blk.pop()
out_blk.push()

+= emits ttl.store with {accumulate}, which the compiler detects and annotates for L1 packer accumulation. store() emits a plain ttl.store and always overwrites.

Compiler pipeline additions:

TTLAnnotateL1AccLoops: walks accumulating stores and uses dominance to verify the cb_reserve is outside the enclosing loop. Rejects += inside conditionals (Support += inside conditionals in accumulation loops #504). Annotates enclosing loops with ttl.l1_acc_loop.
TTKernelInsertL1Accumulation: groups annotated loops into accumulation scopes by shared pack CB targets. Consecutive sibling loops sharing a CB get a single disable pair with re-enable between loops (init ops between loops reset packer state). Nested annotated loops are folded into the outermost ancestor. Max-reduce loops are excluded.
TTKernelCombinePackTiles: updated to not combine pack tiles across L1 acc loop boundaries.
Subblocking (TTLSubblockComputeForDST): now allows matmul accumulating computes. Added --ttl-strict-f32-acc option that errors if a non-f32 output accumulation loop requires subblocking.
StoreOp: added optional {accumulate} attribute (emitted by +=, consumed by TTLAnnotateL1AccLoops).
operators.py: added __iadd__ on TensorBlock for explicit accumulation.

Performance

L1-only 2048^3 matmul with L1 acc runs at 0.70x of ttnn.matmul (30% faster); DRAM-bound 4096^3 is 2.76x (bottlenecked by per-core DRAM reads, no multicast yet). Benchmarking and examples are on the bnorris/matmul-bench branch (will be a separate PR).

Future work (not in this PR)

Compiler-generated outer K loop: user writes out.store(a @ b) with K_block-sized DFBs, compiler generates the outer K_num_blocks L1 acc loop and DM-side K-chunked reads ([ttl] Add block expressions (lazy IR creation) #446)
Full BlockExpr accumulation syntax (y = fill(0); y += a @ b; out.store(y)): lazy block expressions ([ttl] Add block expressions (lazy IR creation) #446)
L1 acc for compiler-generated K loops: tile K to 1 in subblocking when no enclosing user K loop, handle nested reduction loops
Fused post-ops after K reduction (out.store(relu(a @ b))) ([ttl] Add support matmul + elementwise fusion (fixes #476) #486)
Conditional += support: track whether a pack actually happened instead of using the loop IV (Support += inside conditionals in accumulation loops #504)

SubblockComputeForDST previously skipped all accumulating computes. Matmul K accumulates in-place in DST without consuming DST slots, so subblocking the parallel (M, N) dims is safe. LowerMatmulBlock handles the subblocked compute before LowerToLoops sees it.

### Problem description The existing matmul examples had stale patterns from before the `acc=True` removal, `split_work_to_nodes` in `utils/block_allocation.py` had a bug that produced incorrect block parameters for certain matrix shapes. Adding 2D multicast matmul example. Generally cleaning up and fixing test in metal_examples ### What's changed - **API updates across metal examples**: Updated all existing metal matmul examples (`1d_mcast_matmul`, `multinode_matmul`, `multinode_reuse_matmul`, `single_node_matmul`) fixing back change ttnn apis and new update metal headers - **ttlang matmul accumulation pattern**: Updated all ttlang matmul examples to use the explicit `+=` accumulation pattern, replacing the removed `store(..., acc=True)` API. Accumulation is now expressed as `acc = ttl.math.fill(out_blk, 0)` followed by `acc += a_blk @ b_blk` in the K loop, with a final `out_blk.store(acc)`. - **2D mcast matmul (metal)**: Added a metal reference implementation under `examples/metal_examples/2d_mcast_matmul/metal/` - **2D mcast matmul (tt-lang)**: Added a tt-lang implementation under `examples/metal_examples/2d_mcast_matmul/ttlang/`. Uses `ttl.Pipe` and `ttl.PipeNet` to express the A (row-wise) and B (column-wise) multicast patterns, with `get_large_matmul_params` for block parameter selection. ### Checklist - [ ] New/Existing tests provide coverage for changes - `test_block_allocation.py` updated with new coverage for the block allocation fix - 2D mcast matmul example manually tested on hardware and in sim

Problem description When users pass host tensors (e.g., after ttnn.from_device()) to a @ttl.operation, they get an opaque AttributeError: 'NoneType' object has no attribute 'compute_with_storage_grid_size' with no indication that the tensor needs to be on a device. This happens at two call sites: _resolve_grid when grid='auto' -- tries to query the device compute grid from a None device CompiledTTNNKernel.__call__ -- validates kernel grid against device compute grid on a None device closes #389 What's changed Added a shared _require_device(args) helper that scans tensor arguments for an on-device tensor and returns the device. When all tensors are on host, it raises a ValueError listing the host tensor shapes and showing how to fix it: ``` ValueError: No device found on any tensor argument. All ttnn tensor inputs are on host: arg[0]: Shape([32, 32]) Place tensors on device before calling the operation, e.g.: ttnn.to_device(tensor, device) ttnn.from_torch(tensor, ..., device=device) ``` Both crash sites (_resolve_grid and CompiledTTNNKernel.__call__) now use this helper instead of calling .device() without a None check. The five other .device() call sites in the file already had proper None guards and were not changed. The simulator path (python/sim/) is unaffected Checklist - [x] New/Existing tests provide coverage for changes

…er cleanup; add tests

zoecarver · 2026-04-13T20:51:31Z

    scale = math.sqrt(k_tiles)
+    # DST capacity with fp32_dest_acc_en=true is 4. Output block is
+    # 1 x block_n. When block_n > 4, L1 acc activates with per-K-step
+    # bf16 truncation, requiring relaxed error bounds.


This is really unfortunate. As a user, if I care about acc accuracy, is there any way to enforce always f32 acc and error if the block size is too big?

Not currently, but it would be straightforward to add. Maybe a compiler option like --strict-f32-acc that errors during SubblockComputeForDST if the output block exceeds f32 DST capacity (4 tiles) and L1 accumulation would be needed. Filed issue #497.

Added the --ttl-strict-f32-acc option which errors at compile time if a += accumulation loop's output block requires subblocking (exceeds f32 DST capacity of 4 tiles with double-buffering), which would accumulate in bf16 for bf16 operands if subblocked by the compiler. Reduction loops (compiler-generated) are not affected by this check.

Also added a TODO for potentially adding an f32 L1 temporary (when the compiler can do thread-local temps) and emit a cast to bf16 after the loop. This would give full f32 precision regardless of block size, at the cost of 2x L1 per tile.

Fantastic, thank you!

I measured actual error accumulation out of curiosity, here it is. I think in practice, the error growth is pretty slow, so this option would probably not be needed frequently.

+= (L1 packer accumulation):

┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐ │ K │ M, N │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │ pcc │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 1 │ 2 │ no │ 0.059 │ 0.007 │ 1.00 │ 1.00 │ 0.999998391 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 1 │ 4 │ no │ 0.062 │ 0.007 │ 1.00 │ 1.00 │ 0.999998152 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 1 │ 8 │ yes │ 0.062 │ 0.007 │ 1.00 │ 1.00 │ 0.999998629 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 1 │ 16 │ yes │ 0.068 │ 0.006 │ 1.00 │ 1.00 │ 0.999998629 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 4 │ 2 │ no │ 0.228 │ 0.026 │ 3.87 │ 3.87 │ 0.999995112 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 4 │ 16 │ yes │ 0.244 │ 0.026 │ 3.59 │ 4.08 │ 0.999995291 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 16 │ 2 │ no │ 0.659 │ 0.086 │ 11.18 │ 13.05 │ 0.999985993 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 16 │ 16 │ yes │ 0.906 │ 0.086 │ 13.31 │ 13.64 │ 0.999985516 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 64 │ 2 │ no │ 2.434 │ 0.320 │ 41.28 │ 48.46 │ 0.999950767 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 64 │ 16 │ yes │ 3.497 │ 0.309 │ 51.35 │ 48.82 │ 0.999953330 │ └─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘

manual (DFB round-trip accumulation):

┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐ │ K │ M, N │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │ pcc │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 4 │ 2 │ no │ 0.225 │ 0.027 │ 3.82 │ 4.13 │ 0.999995947 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 4 │ 16 │ yes │ 0.307 │ 0.028 │ 4.51 │ 4.40 │ 0.999996006 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 16 │ 2 │ no │ 0.853 │ 0.123 │ 14.47 │ 18.63 │ 0.999984086 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 16 │ 16 │ yes │ 1.191 │ 0.122 │ 17.49 │ 19.29 │ 0.999983847 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 64 │ 2 │ no │ 3.934 │ 0.638 │ 66.72 │ 96.70 │ 0.999920487 │ ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤ │ 64 │ 16 │ yes │ 5.405 │ 0.644 │ 79.37 │ 101.76 │ 0.999918163 │ └─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘

zoecarver · 2026-04-13T20:52:21Z

+    try:
+        # 32x32x32 tiles = 1024x1024x1024, 8x8x8 blocks -> K_num_blocks=4
+        Mt, Kt, Nt = 32, 32, 32
+        M, K, N = Mt * TILE, Kt * TILE, Nt * TILE


can we try a big enough input where each core gets multiple blocks?

Increased the lit test dimensions to 96x32x96 tiles (3072x1024x3072), giving 12x4x12 blocks. On an 8x8 grid each core handles ceil(12/8)=2 blocks per axis (4 output blocks total).

Also added a multi-block-per-core case to test_matmul_l1_acc_multinode.py: (Mt=16, Kt=4, Nt=16, block=4x4, grid=2x2) giving M_num=4, N_num=4 on a 2x2 grid -> 2 blocks/core/axis.

zoecarver · 2026-04-14T13:59:24Z

Claude suggested the following test cases. The first two look particularly interesting. Feel free to add any or none of them:

1. Two consecutive += loops to the same reserved block (HIGH - silent miscompilation)

out_blk = out_dfb.reserve()
for k1 in range(K1):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk
    a_blk.pop(); b_blk.pop()
for k2 in range(K2):
    c_blk = c_dfb.wait()
    d_blk = d_dfb.wait()
    out_blk += c_blk @ d_blk
    c_blk.pop(); d_blk.pop()
out_blk.push()

What happens: Both loops are independently annotated with l1_acc_loop. Each gets its own disable/enable bracket. The disable-after for loop 1 turns off L1 acc. Loop 2's first iteration packs without L1 acc, overwriting loop 1's accumulated result.

Why: findOutermostL1AccLoop for each loop returns itself (neither is nested inside the other). Disable guards bracket each loop independently. Between the two loops, L1 acc is off.

Generated code:

disable(0)                      // before loop 1
for k1 { packs; if(k1==0) enable(1) }
disable(0)                      // after loop 1
disable(0)                      // before loop 2
for k2 { packs; if(k2==0) enable(1) }  // k2=0 OVERWRITES loop 1's result
push
disable(0)                      // after loop 2

Expected: The user wants the total accumulation across both loops. Loop 2's first iteration should add to loop 1's result, not overwrite it.

This is a real pattern -- two different input streams contributing to one output (e.g., two different A/B sources with different K dimensions).

2. Mixing += and .store() in the same loop (HIGH - silent miscompilation)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    if k == 0:
        out_blk.store(a_blk @ b_blk)   # intend: overwrite on first iter
    else:
        out_blk += a_blk @ b_blk       # intend: accumulate on subsequent
    a_blk.pop(); b_blk.pop()
out_blk.push()

What happens: The loop contains an accumulating store (from +=), so it gets annotated with l1_acc_loop. The packer L1 acc state is global -- it doesn't distinguish between stores emitted by store() vs +=. After the enable guard fires (end of iteration 0), ALL subsequent packs are additive, including the ones from .store().

But the actual problem is subtler: the user's own if/else means iteration 0 uses store() (non-accumulating store) and subsequent iterations use +=. Since L1 acc is disabled during iteration 0 (enable fires at the end), the store() writes correctly. On iteration 1+, L1 acc is on and += works. So this specific pattern actually works!

The real problem is the reverse:

for k in range(K):
    if k < K - 1:
        out_blk += a_blk @ b_blk       # accumulate
    else:
        out_blk.store(final_result)     # intend: overwrite on last iter

Here the user wants the last iteration to overwrite. But L1 acc is enabled (from the enable after iteration 0), so the store() on the last iteration adds instead of overwriting. The compiler has no way to distinguish -- L1 acc is a packer-global state.

3. += with a non-matmul RHS (MEDIUM - untested path, may work or crash)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    out_blk += a_blk          # just accumulate input blocks (sum reduction)
    a_blk.pop()
out_blk.push()

What happens: __iadd__ emits ttl.store %a_blk, %reserve {accumulate}. The loop gets annotated. ConvertTTLToCompute's LowerStoreToCompute handles the CB-attached-to-CB-attached case by creating a passthrough compute (unpack from input CB to DST, pack from DST to output CB).

Risk: This lowering path may not have been tested with L1 acc. The passthrough compute doesn't have a matmul, so TTLSubblockComputeForDST would proceed normally (no TTLAccumulatingOpTrait on the compute body). If the output block exceeds DST capacity, subblocking would create subblock loops inside the l1_acc_loop. L1 acc guard insertion should handle this, but it's untested.

Also: does the passthrough compute even make sense here? The user wants out += a_blk, which means the pack should be additive. The compute would just be unpack(input) -> DST -> pack(output). With L1 acc, the pack adds to the existing output. So the first iteration writes a_blk[0], the second adds a_blk[1], etc. The result is a per-tile sum. This is semantically correct but untested.

4. Multiple += to different output blocks in the same loop (MEDIUM - likely works but untested)

out_blk_a = out_dfb_a.reserve()
out_blk_b = out_dfb_b.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk_a += a_blk @ b_blk
    c_blk = c_dfb.wait()
    d_blk = d_dfb.wait()
    out_blk_b += c_blk @ d_blk
    ...pop all...
out_blk_a.push()
out_blk_b.push()

What happens: Two accumulating stores in the same loop. The loop gets one l1_acc_loop annotation. Both stores create separate computes, each with their own acquire/release. The enable guard fires after the last release (second compute's release). So on iteration 0, both computes pack without L1 acc. On iteration 1+, both pack with L1 acc.

Risk: L1 acc is a global packer state -- it applies to ALL packs regardless of which CB they target. After enable, packs to both CB_a and CB_b are additive. This should be correct (each CB accumulates independently). But the interaction between two independent computes in a single L1 acc loop is untested. If the enable point is calculated wrong (e.g., it ends up between the two computes instead of after both), the second compute on iteration 0 would accumulate instead of overwrite.

I'd want to verify that findTopLevelAncestor with the last TileRegsReleaseOp truly finds the second compute's release as the enable point.

5. += where the output block fits in f32 DST (block_n <= 4) (MEDIUM - possible double accumulation)

# 2x2 output block = 4 tiles, fits in f32 DST
out_blk = out_dfb.reserve()
for k in range(K):
    out_blk += a_blk @ b_blk
out_blk.push()

What happens: With a 2x2 output, no subblocking is needed. But matmul_block already accumulates in DST across internal K tiles. The user's outer K loop triggers L1 acc for inter-block accumulation.

The question is: does LowerMatmulBlock also generate a reduction_loop for the internal K? If yes, there could be two levels of L1 acc -- the user's l1_acc_loop and the compiler's reduction_loop. findL1AccLoop would return the user's l1_acc_loop (preferred), so the inner reduction_loop would be ignored by this pass. But the inner loop might still need L1 acc for its own accumulation (if K_block > 1 within the matmul_block).

Wait, for a 2x2 output with K_block = 1 (the DFB shape is (M, 1) for a), there's no internal K reduction. Each outer loop iteration does a single matmul_block. So this case should be straightforward.

But what if K_block > 1 (DFB a shape is (M, K_BLOCK))? Then matmul_block accumulates K_BLOCK tiles in DST. The outer loop handles K_num_blocks iterations via L1 acc. These compose: DST accumulates within a block, L1 acc accumulates across blocks. Should work.

The risk is if the block fits in DST and the compiler tries to use DST accumulation for the outer loop too (the old prev + a @ b pattern). With +=, the compiler should use L1 acc instead. But if both mechanisms activate, the result could be double-counted.

6. += with K=1 (single iteration, LOW - probably fine but worth checking)

out_blk = out_dfb.reserve()
for _ in range(1):
    out_blk += a_blk @ b_blk
out_blk.push()

What happens: The loop may be canonicalized away by MLIR (scf.for with trip count 1 -> just the body). If the loop is eliminated, the l1_acc_loop annotation is lost, and no L1 acc guards are inserted. The {accumulate} attribute on the store becomes orphaned.

Risk: The store with {accumulate} gets lowered as a regular store (no L1 acc). Semantically correct for K=1 (nothing to accumulate with). But if the canonicalization happens after TTLAnnotateL1AccLoops and before TTKernelInsertL1Accumulation, the annotation is on a loop that no longer exists. The insertion pass wouldn't find the loop. No crash (the annotation is just an attribute on the erased op), but worth verifying the pipeline ordering handles this.

7. += inside a conditional inside a loop (LOW - annotation may fail)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    if some_runtime_condition:
        out_blk += a_blk @ b_blk
    a_blk.pop(); b_blk.pop()
out_blk.push()

What happens: The ttl.store {accumulate} is inside an scf.if inside the scf.for. The annotation pass checks store->getParentOfType<scf::ForOp>() == forOp. Walking up from the store: scf.if (not a ForOp) -> scf.for (match!). The loop gets annotated. OK so far.

Risk: On iterations where the condition is false, no pack happens. L1 acc is still enabled. The next true-condition iteration packs additively. This is correct from an L1 acc perspective. But the generated code has if (k == lb) enable(1) at the end of every iteration (after the enable point). On iteration 0, if the condition is false, no packs happen, and then enable fires. On iteration 1 with condition=true, the pack is additive. But iteration 0 never wrote anything! The output CB slot contains garbage (whatever was there before). The additive pack adds to garbage.

Actually, does cb_reserve zero-initialize the CB slot? If not, accumulating into an uninitialized slot produces wrong results even in the non-conditional case (the first iteration should write, not accumulate, and writing with L1 acc off is effectively an overwrite). But if the condition is false on iteration 0, nothing is written, and the enable still fires. Subsequent iterations accumulate into garbage.

This might be a user error (conditional += without initialization), but the compiler silently generates wrong code.

8. += on an output block larger than what fits in f32 DST, with --ttl-strict-f32-acc off (LOW - precision, not crash)

Already partially tested, but worth verifying: a 8x8 output block (64 tiles >> DST capacity 4 with f32) with K=128. Heavy subblocking means many subblock iterations per K step, each truncating to bf16 on pack. With 128 K steps and bf16 truncation per step, the accumulated error could exceed PCC thresholds for large inputs.

Not a correctness bug per se, but the existing tests use PCC > 0.999 which might fail for extreme K values with large subblocked outputs.

brnorris03 · 2026-04-14T14:29:07Z

Thanks for the extra tests, I will add those (already did 8 before seeing this, see my last response to the accuracy issue).

tests 1. Two consecutive += loops to same reserve MLIR: annotate_l1_acc_loops.mlir::consecutive_loops_same_reserve insert_l1_accumulation.mlir::consecutive_l1_acc_loops Device: test_matmul_l1_acc.py::test_l1_acc_consecutive_loops 2. Mixing += and .store() in same loop MLIR: annotate_l1_acc_loops.mlir::mixed_acc_and_plain_store Device: test_matmul_l1_acc.py::test_l1_acc_mixed_store 3. += with non-matmul RHS (sum reduction) MLIR: annotate_l1_acc_loops.mlir::non_matmul_accumulate Device: test_matmul_l1_acc.py::test_l1_acc_sum_reduction 4. Multiple += to different outputs in same loop MLIR: insert_l1_accumulation.mlir::two_outputs_one_loop Device: test_matmul_l1_acc.py::test_l1_acc_multi_output 5. Output fits in f32 DST (block_n <= 4) MLIR: matmul_subblock_l1_acc.mlir::matmul_3x3_k_loop Device: test_matmul_l1_acc.py::test_l1_acc_single_core[blk2x2_K2..K8] 6. K=1 single iteration MLIR: annotate_l1_acc_loops.mlir::single_iteration Device: test_matmul_l1_acc.py::test_l1_acc_single_iteration 7. += inside conditional MLIR: annotate_l1_acc_loops.mlir::acc_inside_conditional Device: not tested (DSL does not generate runtime conditionals in compute)

…fferent CBs

brnorris03 · 2026-04-14T16:47:41Z

Addressed most of the scenarios brought up above, added issue #504 to track "7. += inside a conditional inside a loop (LOW - annotation may fail)" which is not currently supported, now would emit an error.

#	Scenario	MLIR lit test	Runtime test
1	Two consecutive += loops to same reserve	`annotate_l1_acc_loops.mlir::consecutive_loops_same_reserve` + `insert_l1_accumulation.mlir::consecutive_l1_acc_loops`	`test_matmul_l1_acc.py::test_l1_acc_consecutive_loops`
2	.store() then += (overwrite then accumulate)	`annotate_l1_acc_loops.mlir::store_and_acc_in_same_loop`	`test_matmul_l1_acc.py::test_l1_acc_store_then_acc`
3	+= with non-matmul RHS (sum reduction)	`annotate_l1_acc_loops.mlir::non_matmul_accumulate`	`test_matmul_l1_acc.py::test_l1_acc_sum_reduction`
4	Multiple += to different outputs in same loop	`insert_l1_accumulation.mlir::two_outputs_one_loop`	`test_matmul_l1_acc.py::test_l1_acc_multi_output`
5	Output fits in f32 DST (block_n <= 4)	already tested in `matmul_subblock_l1_acc.mlir::matmul_3x3_k_loop`	`test_matmul_l1_acc.py::test_l1_acc_single_core[blk2x2_K2..K8]`
6	K=1 single iteration	`annotate_l1_acc_loops.mlir::single_iteration`	`test_matmul_l1_acc.py::test_l1_acc_single_iteration`
7	+= inside conditional	`annotate_l1_acc_loops_invalid.mlir::acc_inside_conditional` (error)	not needed (rejected at compile time, #504 tracks future support)
8	+= on an output block larger than what fits in f32 DST accuracy	see earlier comment about accuracy

zoecarver

thanks for all the added tests! this looks great

…loop in an accumulation group to avoid clobbering the Pack_L1_Acc register on Wormhole, while keeping the UNPACK+MATH reconfiguration that's needed for the different input CBs.

phizalev-TT and others added 2 commits April 10, 2026 03:02

[examples] Matmul tutorial (#485)

db6d4b8

brnorris03 changed the title ~~[ttl] Enable subblocking for matmul accumulating computes~~ [ttl] Matmul with pack_reconfig_l1_acc Apr 10, 2026

brnorris03 added 5 commits April 9, 2026 23:07

initial pack_reconfig_l1_acc support

01cb9b6

preformat

513cc4d

add pytest for l1 accumulation

6476460

correct placement of pack_reconfig_l1_acc

455ad86

more packer accumulation fixes

b51e554

brnorris03 force-pushed the bnorris/matmul-l1-acc branch 3 times, most recently from c4c4b54 to 0eee228 Compare April 10, 2026 16:16

brnorris03 force-pushed the bnorris/matmul-l1-acc branch 2 times, most recently from 0be1bed to 876dde7 Compare April 10, 2026 18:03

brnorris03 and others added 6 commits April 10, 2026 11:13

Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc

4e8f935

generalize the packer L1 accumulation guard placement

0350205

add tests

f919130

update tests

2d11d04

Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc

f59d850

brnorris03 force-pushed the bnorris/matmul-l1-acc branch from 76e7848 to f59d850 Compare April 10, 2026 22:07

brnorris03 added 4 commits April 10, 2026 15:07

Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc

b53d0bc

comments

1da0c9f

rename ttl-annotate-reduction-loops to ttl-annotate-l1-acc-loops; oth…

c1bf820

…er cleanup; add tests

Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc

5b64b30

brnorris03 force-pushed the bnorris/matmul-l1-acc branch from e022cb2 to 5b64b30 Compare April 13, 2026 04:28

brnorris03 added 3 commits April 12, 2026 22:38

implement += on blocks

3109b91

use _set_var helper

760d376

fix test

903af07

brnorris03 marked this pull request as ready for review April 13, 2026 06:14

brnorris03 added 2 commits April 13, 2026 07:07

add subblocking test

61d0a15

precommit

7f8f840

zoecarver reviewed Apr 13, 2026

View reviewed changes

Comment thread lib/Dialect/TTL/Transforms/TTLAnnotateL1AccLoops.cpp Outdated

zoecarver reviewed Apr 13, 2026

View reviewed changes

brnorris03 mentioned this pull request Apr 13, 2026

Add option to enforce f32 DST accumulation and error on oversized blocks #497

Closed

brnorris03 added 3 commits April 13, 2026 16:57

address comments

6b9a141

add --ttl-strict-f32-acc compiler option

3a3a264

add TODO

7295718

brnorris03 force-pushed the bnorris/matmul-l1-acc branch from 65430d3 to 7295718 Compare April 14, 2026 01:06

brnorris03 requested a review from zoecarver April 14, 2026 01:06

zoecarver approved these changes Apr 14, 2026

View reviewed changes

brnorris03 added 3 commits April 14, 2026 07:01

update doc

7c5ad75

move the strict-f32-acc option to the subblocking pass; other cleanup

f411240

precommit

c40a42e

brnorris03 added 3 commits April 14, 2026 09:17

update comments. add another test for consecutive loops targetting di…

d5bcba8

…fferent CBs

final cleanup

dea4d75

brnorris03 enabled auto-merge (squash) April 14, 2026 16:54

zoecarver approved these changes Apr 14, 2026

View reviewed changes

Downgrade from mm_block_init to mm_block_init_short for the 2nd+ …

6003a7d

…loop in an accumulation group to avoid clobbering the Pack_L1_Acc register on Wormhole, while keeping the UNPACK+MATH reconfiguration that's needed for the different input CBs.

brnorris03 disabled auto-merge April 14, 2026 18:09

a bit more refactoring cleanup

fd5b0bf

brnorris03 enabled auto-merge (squash) April 14, 2026 19:00

reorg includes

3394dfc

brnorris03 merged commit de81ac2 into main Apr 14, 2026
11 checks passed

brnorris03 deleted the bnorris/matmul-l1-acc branch April 14, 2026 19:49

Conversation

brnorris03 commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem description

What's changed

Performance

Future work (not in this PR)

Uh oh!

Uh oh!

zoecarver Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

brnorris03 Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brnorris03 Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brnorris03 Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

zoecarver Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

brnorris03 Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zoecarver Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

brnorris03 Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

zoecarver commented Apr 14, 2026

1. Two consecutive += loops to the same reserved block (HIGH - silent miscompilation)

2. Mixing += and .store() in the same loop (HIGH - silent miscompilation)

3. += with a non-matmul RHS (MEDIUM - untested path, may work or crash)

4. Multiple += to different output blocks in the same loop (MEDIUM - likely works but untested)

5. += where the output block fits in f32 DST (block_n <= 4) (MEDIUM - possible double accumulation)

6. += with K=1 (single iteration, LOW - probably fine but worth checking)

7. += inside a conditional inside a loop (LOW - annotation may fail)

8. += on an output block larger than what fits in f32 DST, with --ttl-strict-f32-acc off (LOW - precision, not crash)

Uh oh!

brnorris03 commented Apr 14, 2026

Uh oh!

brnorris03 commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zoecarver left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

brnorris03 commented Apr 10, 2026 •

edited

Loading

brnorris03 Apr 13, 2026 •

edited

Loading

brnorris03 Apr 14, 2026 •

edited

Loading

brnorris03 Apr 14, 2026 •

edited

Loading

brnorris03 commented Apr 14, 2026 •

edited

Loading