Skip to content

[ttl] Matmul with pack_reconfig_l1_acc#490

Merged
brnorris03 merged 35 commits into
mainfrom
bnorris/matmul-l1-acc
Apr 14, 2026
Merged

[ttl] Matmul with pack_reconfig_l1_acc#490
brnorris03 merged 35 commits into
mainfrom
bnorris/matmul-l1-acc

Conversation

@brnorris03
Copy link
Copy Markdown
Contributor

@brnorris03 brnorris03 commented Apr 10, 2026

Problem description

Matmul K accumulation across multiple K blocks had no L1 accumulation support. The DST-based workaround (prev + a @ b, lowered to copy_tile + matmul_block) required an extra accumulator DFB, a per-K-block copy from L1 back into DST, and a final copy to the output DFB.

What's changed

Switches K accumulation from DST-based to L1 packer accumulation (pack_reconfig_l1_acc). Each compute is now independent -- the packer adds to the existing L1 value instead of overwriting.

Accumulation is explicit via += on reserved blocks. Plain store() always overwrites. This is an interim mechanism; the spec's full BlockExpr pattern (fill + lazy += + store) is deferred to #446.

DSL pattern:

out_blk = out_dfb.reserve()
for kt in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk     # explicit L1 accumulation
    a_blk.pop()
    b_blk.pop()
out_blk.push()

+= emits ttl.store with {accumulate}, which the compiler detects and annotates for L1 packer accumulation. store() emits a plain ttl.store and always overwrites.

Compiler pipeline additions:

  • TTLAnnotateL1AccLoops: walks accumulating stores and uses dominance to verify the cb_reserve is outside the enclosing loop. Rejects += inside conditionals (Support += inside conditionals in accumulation loops #504). Annotates enclosing loops with ttl.l1_acc_loop.
  • TTKernelInsertL1Accumulation: groups annotated loops into accumulation scopes by shared pack CB targets. Consecutive sibling loops sharing a CB get a single disable pair with re-enable between loops (init ops between loops reset packer state). Nested annotated loops are folded into the outermost ancestor. Max-reduce loops are excluded.
  • TTKernelCombinePackTiles: updated to not combine pack tiles across L1 acc loop boundaries.
  • Subblocking (TTLSubblockComputeForDST): now allows matmul accumulating computes. Added --ttl-strict-f32-acc option that errors if a non-f32 output accumulation loop requires subblocking.
  • StoreOp: added optional {accumulate} attribute (emitted by +=, consumed by TTLAnnotateL1AccLoops).
  • operators.py: added __iadd__ on TensorBlock for explicit accumulation.

Performance

L1-only 2048^3 matmul with L1 acc runs at 0.70x of ttnn.matmul (30% faster); DRAM-bound 4096^3 is 2.76x (bottlenecked by per-core DRAM reads, no multicast yet). Benchmarking and examples are on the bnorris/matmul-bench branch (will be a separate PR).

Future work (not in this PR)

phizalev-TT and others added 2 commits April 10, 2026 03:02
  SubblockComputeForDST previously skipped all accumulating computes.
  Matmul K accumulates in-place in DST without consuming DST slots, so
  subblocking the parallel (M, N) dims is safe. LowerMatmulBlock handles
  the subblocked compute before LowerToLoops sees it.
@brnorris03 brnorris03 changed the title [ttl] Enable subblocking for matmul accumulating computes [ttl] Matmul with pack_reconfig_l1_acc Apr 10, 2026
@brnorris03 brnorris03 force-pushed the bnorris/matmul-l1-acc branch 3 times, most recently from c4c4b54 to 0eee228 Compare April 10, 2026 16:16
### Problem description

The existing matmul examples had stale patterns from before the
`acc=True` removal,
`split_work_to_nodes` in `utils/block_allocation.py` had a bug that
produced
incorrect block parameters for certain matrix shapes. Adding 2D
multicast
matmul example. Generally cleaning up and fixing test in metal_examples

  ### What's changed

- **API updates across metal examples**: Updated all existing metal
matmul examples
    (`1d_mcast_matmul`, `multinode_matmul`, `multinode_reuse_matmul`,
`single_node_matmul`) fixing back change ttnn apis and new update metal
headers

- **ttlang matmul accumulation pattern**: Updated all ttlang matmul
examples to use
the explicit `+=` accumulation pattern, replacing the removed
`store(..., acc=True)`
API. Accumulation is now expressed as `acc = ttl.math.fill(out_blk, 0)`
followed
by `acc += a_blk @ b_blk` in the K loop, with a final
`out_blk.store(acc)`.

- **2D mcast matmul (metal)**: Added a metal reference implementation
under
    `examples/metal_examples/2d_mcast_matmul/metal/`

  - **2D mcast matmul (tt-lang)**: Added a tt-lang implementation under
`examples/metal_examples/2d_mcast_matmul/ttlang/`. Uses `ttl.Pipe` and
`ttl.PipeNet` to express the A (row-wise) and B (column-wise) multicast
patterns,
    with `get_large_matmul_params` for block parameter selection.

  ### Checklist

  - [ ] New/Existing tests provide coverage for changes
- `test_block_allocation.py` updated with new coverage for the block
allocation fix
    - 2D mcast matmul example manually tested on hardware and in sim
@brnorris03 brnorris03 force-pushed the bnorris/matmul-l1-acc branch 2 times, most recently from 0be1bed to 876dde7 Compare April 10, 2026 18:03
brnorris03 and others added 6 commits April 10, 2026 11:13
Problem description
When users pass host tensors (e.g., after ttnn.from_device()) to a
@ttl.operation, they get an opaque AttributeError: 'NoneType' object has
no attribute 'compute_with_storage_grid_size' with no indication that
the tensor needs to be on a device. This happens at two call sites:

_resolve_grid when grid='auto' -- tries to query the device compute grid
from a None device
CompiledTTNNKernel.__call__ -- validates kernel grid against device
compute grid on a None device
closes #389 

What's changed
Added a shared _require_device(args) helper that scans tensor arguments
for an on-device tensor and returns the device. When all tensors are on
host, it raises a ValueError listing the host tensor shapes and showing
how to fix it:
```
ValueError: No device found on any tensor argument. All ttnn tensor inputs are on host:
  arg[0]: Shape([32, 32])
Place tensors on device before calling the operation, e.g.:
  ttnn.to_device(tensor, device)
  ttnn.from_torch(tensor, ..., device=device)
```

Both crash sites (_resolve_grid and CompiledTTNNKernel.__call__) now use
this helper instead of calling .device() without a None check. The five
other .device() call sites in the file already had proper None guards
and were not changed.

The simulator path (python/sim/) is unaffected

Checklist
- [x] New/Existing tests provide coverage for changes
@brnorris03 brnorris03 force-pushed the bnorris/matmul-l1-acc branch from 76e7848 to f59d850 Compare April 10, 2026 22:07
@brnorris03 brnorris03 force-pushed the bnorris/matmul-l1-acc branch from e022cb2 to 5b64b30 Compare April 13, 2026 04:28
@brnorris03 brnorris03 marked this pull request as ready for review April 13, 2026 06:14
Comment thread lib/Dialect/TTL/Transforms/TTLAnnotateL1AccLoops.cpp Outdated
scale = math.sqrt(k_tiles)
# DST capacity with fp32_dest_acc_en=true is 4. Output block is
# 1 x block_n. When block_n > 4, L1 acc activates with per-K-step
# bf16 truncation, requiring relaxed error bounds.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really unfortunate. As a user, if I care about acc accuracy, is there any way to enforce always f32 acc and error if the block size is too big?

Copy link
Copy Markdown
Contributor Author

@brnorris03 brnorris03 Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not currently, but it would be straightforward to add. Maybe a compiler option like --strict-f32-acc that errors during SubblockComputeForDST if the output block exceeds f32 DST capacity (4 tiles) and L1 accumulation would be needed. Filed issue #497.

Copy link
Copy Markdown
Contributor Author

@brnorris03 brnorris03 Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the --ttl-strict-f32-acc option which errors at compile time if a += accumulation loop's output block requires subblocking (exceeds f32 DST capacity of 4 tiles with double-buffering), which would accumulate in bf16 for bf16 operands if subblocked by the compiler. Reduction loops (compiler-generated) are not affected by this check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a TODO for potentially adding an f32 L1 temporary (when the compiler can do thread-local temps) and emit a cast to bf16 after the loop. This would give full f32 precision regardless of block size, at the cost of 2x L1 per tile.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, thank you!

Copy link
Copy Markdown
Contributor Author

@brnorris03 brnorris03 Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured actual error accumulation out of curiosity, here it is. I think in practice, the error growth is pretty slow, so this option would probably not be needed frequently.

+= (L1 packer accumulation):

  ┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐                     
  │  K  │ M, N  │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │     pcc     │                     
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 1   │ 2     │ no     │ 0.059   │ 0.007    │ 1.00   │ 1.00    │ 0.999998391 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 1   │ 4     │ no     │ 0.062   │ 0.007    │ 1.00   │ 1.00    │ 0.999998152 │                     
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                   
  │ 1   │ 8     │ yes    │ 0.062   │ 0.007    │ 1.00   │ 1.00    │ 0.999998629 │                     
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 1   │ 16    │ yes    │ 0.068   │ 0.006    │ 1.00   │ 1.00    │ 0.999998629 │                   
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 4   │ 2     │ no     │ 0.228   │ 0.026    │ 3.87   │ 3.87    │ 0.999995112 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 4   │ 16    │ yes    │ 0.244   │ 0.026    │ 3.59   │ 4.08    │ 0.999995291 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 16  │ 2     │ no     │ 0.659   │ 0.086    │ 11.18  │ 13.05   │ 0.999985993 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 16  │ 16    │ yes    │ 0.906   │ 0.086    │ 13.31  │ 13.64   │ 0.999985516 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 64  │ 2     │ no     │ 2.434   │ 0.320    │ 41.28  │ 48.46   │ 0.999950767 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 64  │ 16    │ yes    │ 3.497   │ 0.309    │ 51.35  │ 48.82   │ 0.999953330 │
  └─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘                     

manual (DFB round-trip accumulation):

  ┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐                     
  │  K  │ M, N  │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │     pcc     │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 4   │ 2     │ no     │ 0.225   │ 0.027    │ 3.82   │ 4.13    │ 0.999995947 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                   
  │ 4   │ 16    │ yes    │ 0.307   │ 0.028    │ 4.51   │ 4.40    │ 0.999996006 │                     
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 16  │ 2     │ no     │ 0.853   │ 0.123    │ 14.47  │ 18.63   │ 0.999984086 │                     
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 16  │ 16    │ yes    │ 1.191   │ 0.122    │ 17.49  │ 19.29   │ 0.999983847 │                   
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 64  │ 2     │ no     │ 3.934   │ 0.638    │ 66.72  │ 96.70   │ 0.999920487 │
  ├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤                     
  │ 64  │ 16    │ yes    │ 5.405   │ 0.644    │ 79.37  │ 101.76  │ 0.999918163 │
  └─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘

try:
# 32x32x32 tiles = 1024x1024x1024, 8x8x8 blocks -> K_num_blocks=4
Mt, Kt, Nt = 32, 32, 32
M, K, N = Mt * TILE, Kt * TILE, Nt * TILE
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we try a big enough input where each core gets multiple blocks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increased the lit test dimensions to 96x32x96 tiles (3072x1024x3072), giving 12x4x12 blocks. On an 8x8 grid each core handles ceil(12/8)=2 blocks per axis (4 output blocks total).

Also added a multi-block-per-core case to test_matmul_l1_acc_multinode.py: (Mt=16, Kt=4, Nt=16, block=4x4, grid=2x2) giving M_num=4, N_num=4 on a 2x2 grid -> 2 blocks/core/axis.

@brnorris03 brnorris03 force-pushed the bnorris/matmul-l1-acc branch from 65430d3 to 7295718 Compare April 14, 2026 01:06
@brnorris03 brnorris03 requested a review from zoecarver April 14, 2026 01:06
@zoecarver
Copy link
Copy Markdown
Contributor

Claude suggested the following test cases. The first two look particularly interesting. Feel free to add any or none of them:

1. Two consecutive += loops to the same reserved block (HIGH - silent miscompilation)

out_blk = out_dfb.reserve()
for k1 in range(K1):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk += a_blk @ b_blk
    a_blk.pop(); b_blk.pop()
for k2 in range(K2):
    c_blk = c_dfb.wait()
    d_blk = d_dfb.wait()
    out_blk += c_blk @ d_blk
    c_blk.pop(); d_blk.pop()
out_blk.push()

What happens: Both loops are independently annotated with l1_acc_loop. Each gets its own disable/enable bracket. The disable-after for loop 1 turns off L1 acc. Loop 2's first iteration packs without L1 acc, overwriting loop 1's accumulated result.

Why: findOutermostL1AccLoop for each loop returns itself (neither is nested inside the other). Disable guards bracket each loop independently. Between the two loops, L1 acc is off.

Generated code:

disable(0)                      // before loop 1
for k1 { packs; if(k1==0) enable(1) }
disable(0)                      // after loop 1
disable(0)                      // before loop 2
for k2 { packs; if(k2==0) enable(1) }  // k2=0 OVERWRITES loop 1's result
push
disable(0)                      // after loop 2

Expected: The user wants the total accumulation across both loops. Loop 2's first iteration should add to loop 1's result, not overwrite it.

This is a real pattern -- two different input streams contributing to one output (e.g., two different A/B sources with different K dimensions).


2. Mixing += and .store() in the same loop (HIGH - silent miscompilation)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    if k == 0:
        out_blk.store(a_blk @ b_blk)   # intend: overwrite on first iter
    else:
        out_blk += a_blk @ b_blk       # intend: accumulate on subsequent
    a_blk.pop(); b_blk.pop()
out_blk.push()

What happens: The loop contains an accumulating store (from +=), so it gets annotated with l1_acc_loop. The packer L1 acc state is global -- it doesn't distinguish between stores emitted by store() vs +=. After the enable guard fires (end of iteration 0), ALL subsequent packs are additive, including the ones from .store().

But the actual problem is subtler: the user's own if/else means iteration 0 uses store() (non-accumulating store) and subsequent iterations use +=. Since L1 acc is disabled during iteration 0 (enable fires at the end), the store() writes correctly. On iteration 1+, L1 acc is on and += works. So this specific pattern actually works!

The real problem is the reverse:

for k in range(K):
    if k < K - 1:
        out_blk += a_blk @ b_blk       # accumulate
    else:
        out_blk.store(final_result)     # intend: overwrite on last iter

Here the user wants the last iteration to overwrite. But L1 acc is enabled (from the enable after iteration 0), so the store() on the last iteration adds instead of overwriting. The compiler has no way to distinguish -- L1 acc is a packer-global state.


3. += with a non-matmul RHS (MEDIUM - untested path, may work or crash)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    out_blk += a_blk          # just accumulate input blocks (sum reduction)
    a_blk.pop()
out_blk.push()

What happens: __iadd__ emits ttl.store %a_blk, %reserve {accumulate}. The loop gets annotated. ConvertTTLToCompute's LowerStoreToCompute handles the CB-attached-to-CB-attached case by creating a passthrough compute (unpack from input CB to DST, pack from DST to output CB).

Risk: This lowering path may not have been tested with L1 acc. The passthrough compute doesn't have a matmul, so TTLSubblockComputeForDST would proceed normally (no TTLAccumulatingOpTrait on the compute body). If the output block exceeds DST capacity, subblocking would create subblock loops inside the l1_acc_loop. L1 acc guard insertion should handle this, but it's untested.

Also: does the passthrough compute even make sense here? The user wants out += a_blk, which means the pack should be additive. The compute would just be unpack(input) -> DST -> pack(output). With L1 acc, the pack adds to the existing output. So the first iteration writes a_blk[0], the second adds a_blk[1], etc. The result is a per-tile sum. This is semantically correct but untested.


4. Multiple += to different output blocks in the same loop (MEDIUM - likely works but untested)

out_blk_a = out_dfb_a.reserve()
out_blk_b = out_dfb_b.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    out_blk_a += a_blk @ b_blk
    c_blk = c_dfb.wait()
    d_blk = d_dfb.wait()
    out_blk_b += c_blk @ d_blk
    ...pop all...
out_blk_a.push()
out_blk_b.push()

What happens: Two accumulating stores in the same loop. The loop gets one l1_acc_loop annotation. Both stores create separate computes, each with their own acquire/release. The enable guard fires after the last release (second compute's release). So on iteration 0, both computes pack without L1 acc. On iteration 1+, both pack with L1 acc.

Risk: L1 acc is a global packer state -- it applies to ALL packs regardless of which CB they target. After enable, packs to both CB_a and CB_b are additive. This should be correct (each CB accumulates independently). But the interaction between two independent computes in a single L1 acc loop is untested. If the enable point is calculated wrong (e.g., it ends up between the two computes instead of after both), the second compute on iteration 0 would accumulate instead of overwrite.

I'd want to verify that findTopLevelAncestor with the last TileRegsReleaseOp truly finds the second compute's release as the enable point.


5. += where the output block fits in f32 DST (block_n <= 4) (MEDIUM - possible double accumulation)

# 2x2 output block = 4 tiles, fits in f32 DST
out_blk = out_dfb.reserve()
for k in range(K):
    out_blk += a_blk @ b_blk
out_blk.push()

What happens: With a 2x2 output, no subblocking is needed. But matmul_block already accumulates in DST across internal K tiles. The user's outer K loop triggers L1 acc for inter-block accumulation.

The question is: does LowerMatmulBlock also generate a reduction_loop for the internal K? If yes, there could be two levels of L1 acc -- the user's l1_acc_loop and the compiler's reduction_loop. findL1AccLoop would return the user's l1_acc_loop (preferred), so the inner reduction_loop would be ignored by this pass. But the inner loop might still need L1 acc for its own accumulation (if K_block > 1 within the matmul_block).

Wait, for a 2x2 output with K_block = 1 (the DFB shape is (M, 1) for a), there's no internal K reduction. Each outer loop iteration does a single matmul_block. So this case should be straightforward.

But what if K_block > 1 (DFB a shape is (M, K_BLOCK))? Then matmul_block accumulates K_BLOCK tiles in DST. The outer loop handles K_num_blocks iterations via L1 acc. These compose: DST accumulates within a block, L1 acc accumulates across blocks. Should work.

The risk is if the block fits in DST and the compiler tries to use DST accumulation for the outer loop too (the old prev + a @ b pattern). With +=, the compiler should use L1 acc instead. But if both mechanisms activate, the result could be double-counted.


6. += with K=1 (single iteration, LOW - probably fine but worth checking)

out_blk = out_dfb.reserve()
for _ in range(1):
    out_blk += a_blk @ b_blk
out_blk.push()

What happens: The loop may be canonicalized away by MLIR (scf.for with trip count 1 -> just the body). If the loop is eliminated, the l1_acc_loop annotation is lost, and no L1 acc guards are inserted. The {accumulate} attribute on the store becomes orphaned.

Risk: The store with {accumulate} gets lowered as a regular store (no L1 acc). Semantically correct for K=1 (nothing to accumulate with). But if the canonicalization happens after TTLAnnotateL1AccLoops and before TTKernelInsertL1Accumulation, the annotation is on a loop that no longer exists. The insertion pass wouldn't find the loop. No crash (the annotation is just an attribute on the erased op), but worth verifying the pipeline ordering handles this.


7. += inside a conditional inside a loop (LOW - annotation may fail)

out_blk = out_dfb.reserve()
for k in range(K):
    a_blk = a_dfb.wait()
    b_blk = b_dfb.wait()
    if some_runtime_condition:
        out_blk += a_blk @ b_blk
    a_blk.pop(); b_blk.pop()
out_blk.push()

What happens: The ttl.store {accumulate} is inside an scf.if inside the scf.for. The annotation pass checks store->getParentOfType<scf::ForOp>() == forOp. Walking up from the store: scf.if (not a ForOp) -> scf.for (match!). The loop gets annotated. OK so far.

Risk: On iterations where the condition is false, no pack happens. L1 acc is still enabled. The next true-condition iteration packs additively. This is correct from an L1 acc perspective. But the generated code has if (k == lb) enable(1) at the end of every iteration (after the enable point). On iteration 0, if the condition is false, no packs happen, and then enable fires. On iteration 1 with condition=true, the pack is additive. But iteration 0 never wrote anything! The output CB slot contains garbage (whatever was there before). The additive pack adds to garbage.

Actually, does cb_reserve zero-initialize the CB slot? If not, accumulating into an uninitialized slot produces wrong results even in the non-conditional case (the first iteration should write, not accumulate, and writing with L1 acc off is effectively an overwrite). But if the condition is false on iteration 0, nothing is written, and the enable still fires. Subsequent iterations accumulate into garbage.

This might be a user error (conditional += without initialization), but the compiler silently generates wrong code.


8. += on an output block larger than what fits in f32 DST, with --ttl-strict-f32-acc off (LOW - precision, not crash)

Already partially tested, but worth verifying: a 8x8 output block (64 tiles >> DST capacity 4 with f32) with K=128. Heavy subblocking means many subblock iterations per K step, each truncating to bf16 on pack. With 128 K steps and bf16 truncation per step, the accumulated error could exceed PCC thresholds for large inputs.

Not a correctness bug per se, but the existing tests use PCC > 0.999 which might fail for extreme K values with large subblocked outputs.

@brnorris03
Copy link
Copy Markdown
Contributor Author

Thanks for the extra tests, I will add those (already did 8 before seeing this, see my last response to the accuracy issue).

tests

1. Two consecutive += loops to same reserve
   MLIR: annotate_l1_acc_loops.mlir::consecutive_loops_same_reserve
         insert_l1_accumulation.mlir::consecutive_l1_acc_loops
   Device: test_matmul_l1_acc.py::test_l1_acc_consecutive_loops

2. Mixing += and .store() in same loop
   MLIR: annotate_l1_acc_loops.mlir::mixed_acc_and_plain_store
   Device: test_matmul_l1_acc.py::test_l1_acc_mixed_store

3. += with non-matmul RHS (sum reduction)
   MLIR: annotate_l1_acc_loops.mlir::non_matmul_accumulate
   Device: test_matmul_l1_acc.py::test_l1_acc_sum_reduction

4. Multiple += to different outputs in same loop
   MLIR: insert_l1_accumulation.mlir::two_outputs_one_loop
   Device: test_matmul_l1_acc.py::test_l1_acc_multi_output

5. Output fits in f32 DST (block_n <= 4)
   MLIR: matmul_subblock_l1_acc.mlir::matmul_3x3_k_loop
   Device: test_matmul_l1_acc.py::test_l1_acc_single_core[blk2x2_K2..K8]

6. K=1 single iteration
   MLIR: annotate_l1_acc_loops.mlir::single_iteration
   Device: test_matmul_l1_acc.py::test_l1_acc_single_iteration

7. += inside conditional
   MLIR: annotate_l1_acc_loops.mlir::acc_inside_conditional
   Device: not tested (DSL does not generate runtime conditionals in compute)
@brnorris03
Copy link
Copy Markdown
Contributor Author

brnorris03 commented Apr 14, 2026

Addressed most of the scenarios brought up above, added issue #504 to track "7. += inside a conditional inside a loop (LOW - annotation may fail)" which is not currently supported, now would emit an error.

# Scenario MLIR lit test Runtime test
1 Two consecutive += loops to same reserve annotate_l1_acc_loops.mlir::consecutive_loops_same_reserve + insert_l1_accumulation.mlir::consecutive_l1_acc_loops test_matmul_l1_acc.py::test_l1_acc_consecutive_loops
2 .store() then += (overwrite then accumulate) annotate_l1_acc_loops.mlir::store_and_acc_in_same_loop test_matmul_l1_acc.py::test_l1_acc_store_then_acc
3 += with non-matmul RHS (sum reduction) annotate_l1_acc_loops.mlir::non_matmul_accumulate test_matmul_l1_acc.py::test_l1_acc_sum_reduction
4 Multiple += to different outputs in same loop insert_l1_accumulation.mlir::two_outputs_one_loop test_matmul_l1_acc.py::test_l1_acc_multi_output
5 Output fits in f32 DST (block_n <= 4) already tested in matmul_subblock_l1_acc.mlir::matmul_3x3_k_loop test_matmul_l1_acc.py::test_l1_acc_single_core[blk2x2_K2..K8]
6 K=1 single iteration annotate_l1_acc_loops.mlir::single_iteration test_matmul_l1_acc.py::test_l1_acc_single_iteration
7 += inside conditional annotate_l1_acc_loops_invalid.mlir::acc_inside_conditional (error) not needed (rejected at compile time, #504 tracks future support)
8 += on an output block larger than what fits in f32 DST accuracy see earlier comment about accuracy

@brnorris03 brnorris03 enabled auto-merge (squash) April 14, 2026 16:54
Copy link
Copy Markdown
Contributor

@zoecarver zoecarver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for all the added tests! this looks great

…loop in an accumulation group to

avoid clobbering the Pack_L1_Acc register on Wormhole, while keeping the UNPACK+MATH reconfiguration
that's needed for the different input CBs.
@brnorris03 brnorris03 disabled auto-merge April 14, 2026 18:09
@brnorris03 brnorris03 enabled auto-merge (squash) April 14, 2026 19:00
@brnorris03 brnorris03 merged commit de81ac2 into main Apr 14, 2026
11 checks passed
@brnorris03 brnorris03 deleted the bnorris/matmul-l1-acc branch April 14, 2026 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants