[ttl] Matmul with pack_reconfig_l1_acc#490
Conversation
SubblockComputeForDST previously skipped all accumulating computes. Matmul K accumulates in-place in DST without consuming DST slots, so subblocking the parallel (M, N) dims is safe. LowerMatmulBlock handles the subblocked compute before LowerToLoops sees it.
c4c4b54 to
0eee228
Compare
### Problem description
The existing matmul examples had stale patterns from before the
`acc=True` removal,
`split_work_to_nodes` in `utils/block_allocation.py` had a bug that
produced
incorrect block parameters for certain matrix shapes. Adding 2D
multicast
matmul example. Generally cleaning up and fixing test in metal_examples
### What's changed
- **API updates across metal examples**: Updated all existing metal
matmul examples
(`1d_mcast_matmul`, `multinode_matmul`, `multinode_reuse_matmul`,
`single_node_matmul`) fixing back change ttnn apis and new update metal
headers
- **ttlang matmul accumulation pattern**: Updated all ttlang matmul
examples to use
the explicit `+=` accumulation pattern, replacing the removed
`store(..., acc=True)`
API. Accumulation is now expressed as `acc = ttl.math.fill(out_blk, 0)`
followed
by `acc += a_blk @ b_blk` in the K loop, with a final
`out_blk.store(acc)`.
- **2D mcast matmul (metal)**: Added a metal reference implementation
under
`examples/metal_examples/2d_mcast_matmul/metal/`
- **2D mcast matmul (tt-lang)**: Added a tt-lang implementation under
`examples/metal_examples/2d_mcast_matmul/ttlang/`. Uses `ttl.Pipe` and
`ttl.PipeNet` to express the A (row-wise) and B (column-wise) multicast
patterns,
with `get_large_matmul_params` for block parameter selection.
### Checklist
- [ ] New/Existing tests provide coverage for changes
- `test_block_allocation.py` updated with new coverage for the block
allocation fix
- 2D mcast matmul example manually tested on hardware and in sim
0be1bed to
876dde7
Compare
Problem description When users pass host tensors (e.g., after ttnn.from_device()) to a @ttl.operation, they get an opaque AttributeError: 'NoneType' object has no attribute 'compute_with_storage_grid_size' with no indication that the tensor needs to be on a device. This happens at two call sites: _resolve_grid when grid='auto' -- tries to query the device compute grid from a None device CompiledTTNNKernel.__call__ -- validates kernel grid against device compute grid on a None device closes #389 What's changed Added a shared _require_device(args) helper that scans tensor arguments for an on-device tensor and returns the device. When all tensors are on host, it raises a ValueError listing the host tensor shapes and showing how to fix it: ``` ValueError: No device found on any tensor argument. All ttnn tensor inputs are on host: arg[0]: Shape([32, 32]) Place tensors on device before calling the operation, e.g.: ttnn.to_device(tensor, device) ttnn.from_torch(tensor, ..., device=device) ``` Both crash sites (_resolve_grid and CompiledTTNNKernel.__call__) now use this helper instead of calling .device() without a None check. The five other .device() call sites in the file already had proper None guards and were not changed. The simulator path (python/sim/) is unaffected Checklist - [x] New/Existing tests provide coverage for changes
76e7848 to
f59d850
Compare
e022cb2 to
5b64b30
Compare
| scale = math.sqrt(k_tiles) | ||
| # DST capacity with fp32_dest_acc_en=true is 4. Output block is | ||
| # 1 x block_n. When block_n > 4, L1 acc activates with per-K-step | ||
| # bf16 truncation, requiring relaxed error bounds. |
There was a problem hiding this comment.
This is really unfortunate. As a user, if I care about acc accuracy, is there any way to enforce always f32 acc and error if the block size is too big?
There was a problem hiding this comment.
Not currently, but it would be straightforward to add. Maybe a compiler option like --strict-f32-acc that errors during SubblockComputeForDST if the output block exceeds f32 DST capacity (4 tiles) and L1 accumulation would be needed. Filed issue #497.
There was a problem hiding this comment.
Added the --ttl-strict-f32-acc option which errors at compile time if a += accumulation loop's output block requires subblocking (exceeds f32 DST capacity of 4 tiles with double-buffering), which would accumulate in bf16 for bf16 operands if subblocked by the compiler. Reduction loops (compiler-generated) are not affected by this check.
There was a problem hiding this comment.
Also added a TODO for potentially adding an f32 L1 temporary (when the compiler can do thread-local temps) and emit a cast to bf16 after the loop. This would give full f32 precision regardless of block size, at the cost of 2x L1 per tile.
There was a problem hiding this comment.
I measured actual error accumulation out of curiosity, here it is. I think in practice, the error growth is pretty slow, so this option would probably not be needed frequently.
+= (L1 packer accumulation):
┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐
│ K │ M, N │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │ pcc │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 1 │ 2 │ no │ 0.059 │ 0.007 │ 1.00 │ 1.00 │ 0.999998391 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 1 │ 4 │ no │ 0.062 │ 0.007 │ 1.00 │ 1.00 │ 0.999998152 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 1 │ 8 │ yes │ 0.062 │ 0.007 │ 1.00 │ 1.00 │ 0.999998629 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 1 │ 16 │ yes │ 0.068 │ 0.006 │ 1.00 │ 1.00 │ 0.999998629 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 4 │ 2 │ no │ 0.228 │ 0.026 │ 3.87 │ 3.87 │ 0.999995112 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 4 │ 16 │ yes │ 0.244 │ 0.026 │ 3.59 │ 4.08 │ 0.999995291 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 16 │ 2 │ no │ 0.659 │ 0.086 │ 11.18 │ 13.05 │ 0.999985993 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 16 │ 16 │ yes │ 0.906 │ 0.086 │ 13.31 │ 13.64 │ 0.999985516 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 64 │ 2 │ no │ 2.434 │ 0.320 │ 41.28 │ 48.46 │ 0.999950767 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 64 │ 16 │ yes │ 3.497 │ 0.309 │ 51.35 │ 48.82 │ 0.999953330 │
└─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘
manual (DFB round-trip accumulation):
┌─────┬───────┬────────┬─────────┬──────────┬────────┬─────────┬─────────────┐
│ K │ M, N │ subblk │ max_abs │ mean_abs │ max/K1 │ mean/K1 │ pcc │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 4 │ 2 │ no │ 0.225 │ 0.027 │ 3.82 │ 4.13 │ 0.999995947 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 4 │ 16 │ yes │ 0.307 │ 0.028 │ 4.51 │ 4.40 │ 0.999996006 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 16 │ 2 │ no │ 0.853 │ 0.123 │ 14.47 │ 18.63 │ 0.999984086 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 16 │ 16 │ yes │ 1.191 │ 0.122 │ 17.49 │ 19.29 │ 0.999983847 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 64 │ 2 │ no │ 3.934 │ 0.638 │ 66.72 │ 96.70 │ 0.999920487 │
├─────┼───────┼────────┼─────────┼──────────┼────────┼─────────┼─────────────┤
│ 64 │ 16 │ yes │ 5.405 │ 0.644 │ 79.37 │ 101.76 │ 0.999918163 │
└─────┴───────┴────────┴─────────┴──────────┴────────┴─────────┴─────────────┘
| try: | ||
| # 32x32x32 tiles = 1024x1024x1024, 8x8x8 blocks -> K_num_blocks=4 | ||
| Mt, Kt, Nt = 32, 32, 32 | ||
| M, K, N = Mt * TILE, Kt * TILE, Nt * TILE |
There was a problem hiding this comment.
can we try a big enough input where each core gets multiple blocks?
There was a problem hiding this comment.
Increased the lit test dimensions to 96x32x96 tiles (3072x1024x3072), giving 12x4x12 blocks. On an 8x8 grid each core handles ceil(12/8)=2 blocks per axis (4 output blocks total).
Also added a multi-block-per-core case to test_matmul_l1_acc_multinode.py: (Mt=16, Kt=4, Nt=16, block=4x4, grid=2x2) giving M_num=4, N_num=4 on a 2x2 grid -> 2 blocks/core/axis.
65430d3 to
7295718
Compare
|
Claude suggested the following test cases. The first two look particularly interesting. Feel free to add any or none of them: 1. Two consecutive += loops to the same reserved block (HIGH - silent miscompilation)out_blk = out_dfb.reserve()
for k1 in range(K1):
a_blk = a_dfb.wait()
b_blk = b_dfb.wait()
out_blk += a_blk @ b_blk
a_blk.pop(); b_blk.pop()
for k2 in range(K2):
c_blk = c_dfb.wait()
d_blk = d_dfb.wait()
out_blk += c_blk @ d_blk
c_blk.pop(); d_blk.pop()
out_blk.push()What happens: Both loops are independently annotated with Why: Generated code: Expected: The user wants the total accumulation across both loops. Loop 2's first iteration should add to loop 1's result, not overwrite it. This is a real pattern -- two different input streams contributing to one output (e.g., two different A/B sources with different K dimensions). 2. Mixing += and .store() in the same loop (HIGH - silent miscompilation)out_blk = out_dfb.reserve()
for k in range(K):
a_blk = a_dfb.wait()
b_blk = b_dfb.wait()
if k == 0:
out_blk.store(a_blk @ b_blk) # intend: overwrite on first iter
else:
out_blk += a_blk @ b_blk # intend: accumulate on subsequent
a_blk.pop(); b_blk.pop()
out_blk.push()What happens: The loop contains an accumulating store (from But the actual problem is subtler: the user's own if/else means iteration 0 uses The real problem is the reverse: for k in range(K):
if k < K - 1:
out_blk += a_blk @ b_blk # accumulate
else:
out_blk.store(final_result) # intend: overwrite on last iterHere the user wants the last iteration to overwrite. But L1 acc is enabled (from the enable after iteration 0), so the 3. += with a non-matmul RHS (MEDIUM - untested path, may work or crash)out_blk = out_dfb.reserve()
for k in range(K):
a_blk = a_dfb.wait()
out_blk += a_blk # just accumulate input blocks (sum reduction)
a_blk.pop()
out_blk.push()What happens: Risk: This lowering path may not have been tested with L1 acc. The passthrough compute doesn't have a matmul, so Also: does the passthrough compute even make sense here? The user wants 4. Multiple += to different output blocks in the same loop (MEDIUM - likely works but untested)out_blk_a = out_dfb_a.reserve()
out_blk_b = out_dfb_b.reserve()
for k in range(K):
a_blk = a_dfb.wait()
b_blk = b_dfb.wait()
out_blk_a += a_blk @ b_blk
c_blk = c_dfb.wait()
d_blk = d_dfb.wait()
out_blk_b += c_blk @ d_blk
...pop all...
out_blk_a.push()
out_blk_b.push()What happens: Two accumulating stores in the same loop. The loop gets one Risk: L1 acc is a global packer state -- it applies to ALL packs regardless of which CB they target. After enable, packs to both CB_a and CB_b are additive. This should be correct (each CB accumulates independently). But the interaction between two independent computes in a single L1 acc loop is untested. If the enable point is calculated wrong (e.g., it ends up between the two computes instead of after both), the second compute on iteration 0 would accumulate instead of overwrite. I'd want to verify that 5. += where the output block fits in f32 DST (block_n <= 4) (MEDIUM - possible double accumulation)# 2x2 output block = 4 tiles, fits in f32 DST
out_blk = out_dfb.reserve()
for k in range(K):
out_blk += a_blk @ b_blk
out_blk.push()What happens: With a 2x2 output, no subblocking is needed. But The question is: does Wait, for a 2x2 output with K_block = 1 (the DFB shape is (M, 1) for a), there's no internal K reduction. Each outer loop iteration does a single matmul_block. So this case should be straightforward. But what if K_block > 1 (DFB a shape is (M, K_BLOCK))? Then matmul_block accumulates K_BLOCK tiles in DST. The outer loop handles K_num_blocks iterations via L1 acc. These compose: DST accumulates within a block, L1 acc accumulates across blocks. Should work. The risk is if the block fits in DST and the compiler tries to use DST accumulation for the outer loop too (the old 6. += with K=1 (single iteration, LOW - probably fine but worth checking)out_blk = out_dfb.reserve()
for _ in range(1):
out_blk += a_blk @ b_blk
out_blk.push()What happens: The loop may be canonicalized away by MLIR (scf.for with trip count 1 -> just the body). If the loop is eliminated, the Risk: The store with 7. += inside a conditional inside a loop (LOW - annotation may fail)out_blk = out_dfb.reserve()
for k in range(K):
a_blk = a_dfb.wait()
b_blk = b_dfb.wait()
if some_runtime_condition:
out_blk += a_blk @ b_blk
a_blk.pop(); b_blk.pop()
out_blk.push()What happens: The Risk: On iterations where the condition is false, no pack happens. L1 acc is still enabled. The next true-condition iteration packs additively. This is correct from an L1 acc perspective. But the generated code has Actually, does This might be a user error (conditional 8. += on an output block larger than what fits in f32 DST, with --ttl-strict-f32-acc off (LOW - precision, not crash)Already partially tested, but worth verifying: a 8x8 output block (64 tiles >> DST capacity 4 with f32) with K=128. Heavy subblocking means many subblock iterations per K step, each truncating to bf16 on pack. With 128 K steps and bf16 truncation per step, the accumulated error could exceed PCC thresholds for large inputs. Not a correctness bug per se, but the existing tests use PCC > 0.999 which might fail for extreme K values with large subblocked outputs. |
|
Thanks for the extra tests, I will add those (already did 8 before seeing this, see my last response to the accuracy issue). |
tests
1. Two consecutive += loops to same reserve
MLIR: annotate_l1_acc_loops.mlir::consecutive_loops_same_reserve
insert_l1_accumulation.mlir::consecutive_l1_acc_loops
Device: test_matmul_l1_acc.py::test_l1_acc_consecutive_loops
2. Mixing += and .store() in same loop
MLIR: annotate_l1_acc_loops.mlir::mixed_acc_and_plain_store
Device: test_matmul_l1_acc.py::test_l1_acc_mixed_store
3. += with non-matmul RHS (sum reduction)
MLIR: annotate_l1_acc_loops.mlir::non_matmul_accumulate
Device: test_matmul_l1_acc.py::test_l1_acc_sum_reduction
4. Multiple += to different outputs in same loop
MLIR: insert_l1_accumulation.mlir::two_outputs_one_loop
Device: test_matmul_l1_acc.py::test_l1_acc_multi_output
5. Output fits in f32 DST (block_n <= 4)
MLIR: matmul_subblock_l1_acc.mlir::matmul_3x3_k_loop
Device: test_matmul_l1_acc.py::test_l1_acc_single_core[blk2x2_K2..K8]
6. K=1 single iteration
MLIR: annotate_l1_acc_loops.mlir::single_iteration
Device: test_matmul_l1_acc.py::test_l1_acc_single_iteration
7. += inside conditional
MLIR: annotate_l1_acc_loops.mlir::acc_inside_conditional
Device: not tested (DSL does not generate runtime conditionals in compute)
|
Addressed most of the scenarios brought up above, added issue #504 to track "7. += inside a conditional inside a loop (LOW - annotation may fail)" which is not currently supported, now would emit an error.
|
zoecarver
left a comment
There was a problem hiding this comment.
thanks for all the added tests! this looks great
…loop in an accumulation group to avoid clobbering the Pack_L1_Acc register on Wormhole, while keeping the UNPACK+MATH reconfiguration that's needed for the different input CBs.
Problem description
Matmul K accumulation across multiple K blocks had no L1 accumulation support. The DST-based workaround (
prev + a @ b, lowered tocopy_tile+matmul_block) required an extra accumulator DFB, a per-K-block copy from L1 back into DST, and a final copy to the output DFB.What's changed
Switches K accumulation from DST-based to L1 packer accumulation (
pack_reconfig_l1_acc). Each compute is now independent -- the packer adds to the existing L1 value instead of overwriting.Accumulation is explicit via
+=on reserved blocks. Plainstore()always overwrites. This is an interim mechanism; the spec's fullBlockExprpattern (fill+ lazy+=+store) is deferred to #446.DSL pattern:
+=emitsttl.storewith{accumulate}, which the compiler detects and annotates for L1 packer accumulation.store()emits a plainttl.storeand always overwrites.Compiler pipeline additions:
TTLAnnotateL1AccLoops: walks accumulating stores and uses dominance to verify thecb_reserveis outside the enclosing loop. Rejects+=inside conditionals (Support += inside conditionals in accumulation loops #504). Annotates enclosing loops withttl.l1_acc_loop.TTKernelInsertL1Accumulation: groups annotated loops into accumulation scopes by shared pack CB targets. Consecutive sibling loops sharing a CB get a single disable pair with re-enable between loops (init ops between loops reset packer state). Nested annotated loops are folded into the outermost ancestor. Max-reduce loops are excluded.TTKernelCombinePackTiles: updated to not combine pack tiles across L1 acc loop boundaries.TTLSubblockComputeForDST): now allows matmul accumulating computes. Added--ttl-strict-f32-accoption that errors if a non-f32 output accumulation loop requires subblocking.StoreOp: added optional{accumulate}attribute (emitted by+=, consumed byTTLAnnotateL1AccLoops).operators.py: added__iadd__onTensorBlockfor explicit accumulation.Performance
L1-only 2048^3 matmul with L1 acc runs at 0.70x of ttnn.matmul (30% faster); DRAM-bound 4096^3 is 2.76x (bottlenecked by per-core DRAM reads, no multicast yet). Benchmarking and examples are on the
bnorris/matmul-benchbranch (will be a separate PR).Future work (not in this PR)
out.store(a @ b)with K_block-sized DFBs, compiler generates the outer K_num_blocks L1 acc loop and DM-side K-chunked reads ([ttl] Add block expressions (lazy IR creation) #446)BlockExpraccumulation syntax (y = fill(0); y += a @ b; out.store(y)): lazy block expressions ([ttl] Add block expressions (lazy IR creation) #446)out.store(relu(a @ b))) ([ttl] Add support matmul + elementwise fusion (fixes #476) #486)+=support: track whether a pack actually happened instead of using the loop IV (Support += inside conditionals in accumulation loops #504)