tenstorrent · brnorris03 · Apr 14, 2026 · Apr 10, 2026
@@ -17,6 +17,7 @@ python my_kernel.py --no-ttl-maximize-dst
 | `--ttl-block-matmul` / `--no-ttl-block-matmul` | enabled | Emit `matmul_block` (processes the full tile block atomically) instead of per-tile matmul loops. Disabling this option is not yet supported. |
 | `--ttl-auto-sync` / `--no-ttl-auto-sync` | disabled | Let the compiler insert and move DFB synchronization ops. When enabled, reserve/push may be refined to per-subblock granularity. When disabled, user-placed reserve/push is preserved as written. |
 | `--ttl-combine-pack-tiles` / `--no-ttl-combine-pack-tiles` | enabled | Combine consecutive `pack_tile` ops on the same CB with contiguous DST and CB indices into a single `pack_tile_block` call. |
+| `--ttl-strict-f32-acc` / `--no-ttl-strict-f32-acc` | disabled | Error at compile time if a `+=` accumulation loop's output block exceeds f32 DST capacity (4 tiles with double-buffering). When enabled, guarantees each accumulation step fits in a single DST section without subblocking. |
 
 ### Other Ways to Set These
 
@@ -110,24 +111,27 @@ ttlang-opt input.mlir -p 'ttl-to-ttkernel-pipeline{maximize-dst=true lower-to-em
 | `use-block-matmul` | bool | `true` | Lower matmul to block-level hardware calls (`experimental::matmul_block`). |
 | `auto-sync` | bool | `false` | Let the compiler insert and move DFB synchronization ops. |
 | `combine-pack-tiles` | bool | `true` | Combine consecutive `pack_tile` ops into `pack_tile_block`. |
+| `strict-f32-acc` | bool | `false` | Error if a `+=` accumulation loop's output block exceeds f32 DST capacity. |
 | `lower-to-emitc` | bool | `false` | Run the TTKernel-to-EmitC backend (produces C++ source). |
 
 The pipeline runs these passes in order:
 
-1. `convert-ttl-to-compute` — lower TTL elementwise tensor ops to `ttl.compute` with tile ops
-2. `ttl-set-compute-kernel-config` — set `fp32_dest_acc_en` / `dst_full_sync_en` defaults
-3. `ttl-assign-dst` — DST register allocation (linear scan with copy insertion)
-4. `ttl-subblock-compute-for-dst` — tile `ttl.compute` into DST-sized subblocks *(only if `maximize-dst=true`)*; optionally refine reserve/push to per-subblock granularity *(only if `auto-sync=true`)*
-5. `ttl-insert-tile-regs-sync` — insert math/pack thread synchronization
-6. `ttl-lower-matmul-block` — mark block-matmul computes and expand stores *(only if `use-block-matmul=true`)*
-7. `ttl-lower-to-loops` — lower `ttl.compute` to `scf.for` loops
-8. `ttl-schedule-operations` — reorder tile ops by dependency depth and kind *(only if `maximize-dst=true`)*
-9. `ttl-annotate-cb-associations` — annotate block args with CB indices
-10. `convert-ttl-to-ttkernel` — lower TTL DMA ops to TTKernel
-11. `ttkernel-insert-inits` — insert hardware init ops before compute ops
-12. `ttkernel-combine-pack-tiles` — combine consecutive `pack_tile` into `pack_tile_block` *(only if `combine-pack-tiles=true`)*
-13. Canonicalization and CSE cleanup
-14. *(if `lower-to-emitc=true`)* `lower-affine`, `convert-ttkernel-to-emitc`, `emitc-form-expressions`
+1. `ttl-annotate-l1-acc-loops` — detect `+=` accumulation loops and annotate for L1 packer accumulation
+2. `convert-ttl-to-compute` — lower TTL elementwise tensor ops to `ttl.compute` with tile ops
+3. `ttl-set-compute-kernel-config` — set `fp32_dest_acc_en` / `dst_full_sync_en` defaults
+4. `ttl-assign-dst` — DST register allocation (linear scan with copy insertion)
+5. `ttl-subblock-compute-for-dst` — tile `ttl.compute` into DST-sized subblocks *(only if `maximize-dst=true`)*; optionally refine reserve/push to per-subblock granularity *(only if `auto-sync=true`)*
+6. `ttl-insert-tile-regs-sync` — insert math/pack thread synchronization
+7. `ttl-lower-matmul-block` — mark block-matmul computes and expand stores *(only if `use-block-matmul=true`)*
+8. `ttl-lower-to-loops` — lower `ttl.compute` to `scf.for` loops
+9. `ttl-schedule-operations` — reorder tile ops by dependency depth and kind *(only if `maximize-dst=true`)*
+10. `ttl-annotate-cb-associations` — annotate block args with CB indices
+11. `convert-ttl-to-ttkernel` — lower TTL DMA ops to TTKernel
+12. `ttkernel-insert-inits` — insert hardware init ops before compute ops
+13. `ttkernel-insert-l1-accumulation` — insert `pack_reconfig_l1_acc` guards for `+=` and reduction loops
+14. `ttkernel-combine-pack-tiles` — combine consecutive `pack_tile` into `pack_tile_block` *(only if `combine-pack-tiles=true`)*
+15. Canonicalization and CSE cleanup
+16. *(if `lower-to-emitc=true`)* `lower-affine`, `convert-ttkernel-to-emitc`, `emitc-form-expressions`
 
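The conditional steps in the list above can be read as a mapping from pipeline options to a pass sequence. The following is a minimal Python sketch of that mapping, not the pipeline's actual C++ implementation. The function `build_pass_list` and its parameter names are hypothetical, `maximize_dst` has no default because this section does not state one, and `auto-sync` is omitted because it changes how `ttl-subblock-compute-for-dst` behaves rather than which passes run.

```python
def build_pass_list(*, maximize_dst, use_block_matmul=True,
                    combine_pack_tiles=True, lower_to_emitc=False):
    """Return pass names in pipeline order for the given options.

    Illustrative model only; names mirror the numbered list above.
    """
    passes = [
        "ttl-annotate-l1-acc-loops",
        "convert-ttl-to-compute",
        "ttl-set-compute-kernel-config",
        "ttl-assign-dst",
    ]
    if maximize_dst:
        passes.append("ttl-subblock-compute-for-dst")
    passes.append("ttl-insert-tile-regs-sync")
    if use_block_matmul:
        passes.append("ttl-lower-matmul-block")
    passes.append("ttl-lower-to-loops")
    if maximize_dst:
        passes.append("ttl-schedule-operations")
    passes += [
        "ttl-annotate-cb-associations",
        "convert-ttl-to-ttkernel",
        "ttkernel-insert-inits",
        "ttkernel-insert-l1-accumulation",
    ]
    if combine_pack_tiles:
        passes.append("ttkernel-combine-pack-tiles")
    # Placeholder label for the "Canonicalization and CSE cleanup" step.
    passes.append("canonicalization-and-cse")
    if lower_to_emitc:
        passes += [
            "lower-affine",
            "convert-ttkernel-to-emitc",
            "emitc-form-expressions",
        ]
    return passes
```

Note that `ttkernel-insert-l1-accumulation` always runs before `ttkernel-combine-pack-tiles`, which matches the ordering in the list.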
 ### Individual Pass Options
 
@@ -169,6 +173,7 @@ Partition `ttl.compute` into DST-sized subblocks.
 | Option | Type | Default | Description |
 |---|---|---|---|
 | `subblock-sync` | bool | `false` | Refine DFB reserve/push to per-subblock granularity, enabling `pack_tile_block` for contiguous subblocks. When disabled, user-placed reserve/push is preserved. |
+| `strict-f32-acc` | bool | `false` | Error if a `+=` accumulation loop with non-f32 output requires subblocking. Subblocking reduces accumulation precision because bf16 L1 intermediates truncate f32 DST values. |
 
 ```bash
 ttlang-opt input.mlir -p 'func.func(ttl-subblock-compute-for-dst{subblock-sync=true})'

@@ -64,6 +64,12 @@ constexpr llvm::StringLiteral kTileLoopStrideAttrName("ttl.tile_loop_stride");
 /// ComputeOp is lowered to loops.
 constexpr llvm::StringLiteral kReductionLoopAttrName("ttl.reduction_loop");
 
+/// Marks a user-written scf.for as an L1 accumulation loop. Each iteration
+/// packs to the same CB slot; pack_reconfig_l1_acc makes subsequent
+/// iterations additive. Distinct from kReductionLoopAttrName, which marks
+/// compiler-generated reduction loops.
+constexpr llvm::StringLiteral kL1AccLoopAttrName("ttl.l1_acc_loop");
+
 /// Output CB index on tile ops that need it for init insertion.
 constexpr llvm::StringLiteral
     kBcastOutputCBIndexAttrName("ttl.bcast_output_cb_index");

@@ -1039,20 +1039,28 @@ def TTL_StoreOp : TTL_Op<"store", [MemoryEffects<[MemWrite]>]> {
     `ttl.store` represents the user's intent to store a computed tensor into
     an output circular buffer.
 
-    Emitted by Python's `o.store(result)`. During `convert-ttl-to-compute`,
-    this op is transformed into a `ttl.tile_store` inside the compute body.
+    Emitted by Python's `o.store(result)` (overwrite) or `o += result`
+    (accumulate). During `convert-ttl-to-compute`, this op is transformed
+    into a `ttl.tile_store` inside the compute body.
+
+    When `accumulate` is set, the enclosing loop is annotated for L1
+    packer accumulation so that each iteration adds to the existing L1
+    value instead of overwriting.
 
     Example:
     ```mlir
     %reserve = ttl.cb_reserve %cb : ... -> tensor<1x1x!ttcore.tile<32x32, bf16>>
     %result = ttl.add %a, %b : ...
     ttl.store %result, %reserve : tensor<1x1x!ttcore.tile<32x32, bf16>>,
                                   tensor<1x1x!ttcore.tile<32x32, bf16>>
+    // Accumulating store (from +=):
+    ttl.store %result, %reserve {accumulate} : ...
     ```
   }];
   let arguments = (ins
     AnyRankedTensor:$tensor,
-    AnyRankedTensor:$view
+    AnyRankedTensor:$view,
+    OptionalAttr<UnitAttr>:$accumulate
   );
   let assemblyFormat = "$tensor `,` $view attr-dict `:` type($tensor) `,` type($view)";
   let hasVerifier = 1;

@@ -12,9 +12,11 @@
 #include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/Affine/Utils.h"
 #include "mlir/Dialect/Arith/Utils/Utils.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Interfaces/ViewLikeInterface.h"
 #include "llvm/ADT/SetVector.h"
+
 #include <cstdint>
 #include <optional>
 
@@ -427,6 +429,24 @@ inline TileOp createTileOpWithPlaceholderDstIndex(OpBuilder &builder,
   return tileOp;
 }
 
+/// Collect the CB values targeted by pack_tile ops inside a loop.
+llvm::SmallDenseSet<Value, 2> getPackTileCBs(scf::ForOp loop);
+
+/// Returns true if two loops share any pack_tile CB target.
+bool sharePackCB(scf::ForOp loopA, scf::ForOp loopB);
+
+/// A group of consecutive sibling loops that pack to the same output CB.
+struct LoopGroup {
+  scf::ForOp rootLoop;
+  SmallVector<scf::ForOp> loops;
+  Operation *scopeEnd = nullptr;
+};
+
+/// Collect groups of annotated sibling loops that share a pack CB target.
+SmallVector<LoopGroup> collectLoopGroups(
+    ArrayRef<scf::ForOp> l1AccLoops,
+    const llvm::SmallDenseMap<Operation *, Operation *> &enablePointPerLoop);
+
 } // namespace mlir::tt::ttl
 
 #endif // TTLANG_DIALECT_TTL_IR_TTLOPSUTILS_H
@@ -30,13 +30,24 @@ def TTKernelInsertL1Accumulation
     : Pass<"ttkernel-insert-l1-accumulation", "::mlir::ModuleOp"> {
   let summary = "Insert L1 accumulation guards for reduction loops";
   let description = [{
-    Inserts `pack_reconfig_l1_acc` guards inside reduction loops so that
-    pack operations accumulate into L1 instead of overwriting. On the first
-    iteration of a reduction loop, pack writes normally. On subsequent
-    iterations, the packer is reconfigured to add to the existing L1 value.
+    Inserts `pack_reconfig_l1_acc` guards so that pack operations inside
+    annotated loops accumulate into L1 instead of overwriting. Max-reduce
+    loops are excluded (max is not additive).
+
+    Loops are grouped into accumulation scopes: consecutive sibling loops
+    that pack to the same CB share a single pair of disable calls. Nested
+    annotated loops are folded into the outermost ancestor. Each group gets:
+    - `pack_reconfig_l1_acc(0)` before the first loop.
+    - `if (iv == lb) pack_reconfig_l1_acc(1)` after each loop's first
+      iteration (the L1 acc state persists across `tile_regs` boundaries).
+    - For the second and later loops in a group, `pack_reconfig_l1_acc(1)`
+      before the loop, to re-enable L1 acc after init ops that may reset
+      packer state.
+    - `pack_reconfig_l1_acc(0)` after the last `cb_push_back`.
+
+    Loops are identified by the `ttl.l1_acc_loop` (user-written) or
+    `ttl.reduction_loop` (compiler-generated) attributes on `scf.for`
+    ops, with `ttl.l1_acc_loop` taking precedence.
 
-    Reduction loops are identified by the `ttl.reduction_loop` attribute
-    on `scf.for` ops.
   }];
 
   let dependentDialects = [
@@ -46,6 +57,21 @@ def TTKernelInsertL1Accumulation
   ];
 }
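The guard placement described for `ttkernel-insert-l1-accumulation` can be checked with a small behavioral model. The sketch below is a toy Python simulation, assuming a packer that overwrites an L1 slot when L1 accumulation is off and adds to it when on. `ToyPacker` and `run_group` are hypothetical names, and the model ignores `tile_regs` synchronization, init ops, and the real hardware API.

```python
class ToyPacker:
    """Toy packer: overwrites an L1 slot, or adds to it when L1 acc is on."""

    def __init__(self):
        self.l1_acc = False
        self.l1 = {}

    def pack_reconfig_l1_acc(self, enable):
        self.l1_acc = bool(enable)

    def pack_tile(self, slot, value):
        if self.l1_acc:
            self.l1[slot] = self.l1.get(slot, 0.0) + value
        else:
            self.l1[slot] = value


def run_group(packer, loops, slot):
    """Run a group of sibling loops that pack to the same slot.

    `loops` is a list of non-empty lists; each inner list holds the value
    packed on each iteration of one loop.
    """
    packer.pack_reconfig_l1_acc(0)          # disable before the first loop
    for loop_index, values in enumerate(loops):
        if loop_index > 0:
            packer.pack_reconfig_l1_acc(1)  # re-enable before second+ loops
        for iv, value in enumerate(values):
            packer.pack_tile(slot, value)
            if iv == 0:                     # models `if (iv == lb)`
                packer.pack_reconfig_l1_acc(1)
    packer.pack_reconfig_l1_acc(0)          # disable after the group
    return packer.l1[slot]
```

In this model the first iteration of the first loop overwrites whatever stale value the slot held, and every later pack in the group adds to it, so the final slot value is the sum of all contributions.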
 
+def TTLAnnotateL1AccLoops
+    : Pass<"ttl-annotate-l1-acc-loops", "::mlir::func::FuncOp"> {
+  let summary = "Annotate user-written scf.for loops for L1 accumulation";
+  let description = [{
+    Detects user-written `scf.for` loops containing `ttl.store` ops with
+    the `accumulate` attribute (emitted by the `+=` operator) and annotates
+    them with `ttl.l1_acc_loop`. This enables `TTKernelInsertL1Accumulation`
+    to insert `pack_reconfig_l1_acc` guards so that packs accumulate across
+    iterations instead of overwriting.
+  }];
+  let dependentDialects = [
+    "::mlir::scf::SCFDialect"
+  ];
+}
+
 def TTLConvertTTLToCompute
     : Pass<"convert-ttl-to-compute", "::mlir::func::FuncOp"> {
   let summary = "Lower TTL elementwise tensor ops to ttl.compute with tile ops";
@@ -184,7 +210,11 @@ def TTLSubblockComputeForDST
     Option<"subblockSync", "subblock-sync", "bool", "false",
            "Refine DFB reserve/push to per-subblock granularity, enabling "
            "pack_tile_block for contiguous subblocks. When disabled (default), "
-           "user-placed reserve/push is preserved.">
+           "user-placed reserve/push is preserved.">,
+    Option<"strictF32Acc", "strict-f32-acc", "bool", "false",
+           "Error if a user-written accumulation loop (+=) with non-f32 "
+           "output requires subblocking. Subblocking reduces accumulation "
+           "precision because bf16 L1 intermediates are narrower than f32 DST.">
   ];
 
   let dependentDialects = [

@@ -44,6 +44,10 @@ struct TTLToTTKernelPipelineOptions
       *this, "reduce-full-fp32",
       llvm::cl::desc("Enable FP32 accumulation for reduce operations."),
       llvm::cl::init(true)};
+  Option<bool> strictF32Acc{
+      *this, "strict-f32-acc",
+      llvm::cl::desc("Error if accumulation output exceeds f32 DST capacity."),
+      llvm::cl::init(false)};
 };
 
 void createTTLToTTKernelPipeline(mlir::OpPassManager &pm,

@@ -11,12 +11,14 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "ttlang/Dialect/TTL/IR/TTL.h"
 #include "ttlang/Dialect/TTL/Passes.h"
 #include "ttmlir/Dialect/TTKernel/IR/TTKernel.h"
 #include "ttmlir/Dialect/TTKernel/IR/TTKernelOps.h"
 
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/IR/Builders.h"
 #include "mlir/Pass/Pass.h"
 
@@ -63,6 +65,19 @@ struct TTKernelCombinePackTilesPass
 
   void runOnOperation() override {
     getOperation().walk([](Block *block) {
+      // Skip blocks inside reduction or L1 accumulation loops:
+      // pack_tile_block is incompatible with L1 accumulation
+      // (pack_reconfig_l1_acc). L1 acc requires individual pack_tile calls
+      // so each K iteration can independently add to the existing L1 value.
+      for (Operation *parent = block->getParentOp(); parent;
+           parent = parent->getParentOp()) {
+        if (auto forOp = dyn_cast<scf::ForOp>(parent)) {
+          if (forOp->hasAttr(kReductionLoopAttrName) ||
+              forOp->hasAttr(kL1AccLoopAttrName)) {
+            return;
+          }
+        }
+      }
       // Collect all combinable runs first, then replace them. Replacing
       // during iteration would invalidate the block's operation list.
       SmallVector<SmallVector<ttk::PackTileOp>> runs;

@@ -23,6 +23,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "ttlang/Dialect/TTL/IR/TTL.h"
+#include "ttlang/Dialect/TTL/IR/TTLOpsUtils.h"
 #include "ttlang/Dialect/TTL/Passes.h"
 
 #include "ttmlir/Dialect/TTKernel/IR/TTKernel.h"
@@ -348,18 +349,18 @@ analyzeSyncRegion(ttk::TileRegsAcquireOp acquireOp, Value &inputCB,
 }
 
 /// Find the outermost enclosing insertion point by walking up through
-/// compiler-generated loops (marked with ttl.tile_loop_stride or
-/// ttl.subblock_loop_stride). By construction, these loops iterate over tiles
-/// within a single ttl.compute whose input/output CBs are fixed, so the
-/// CB configuration is invariant across iterations and hoisting is safe.
-/// Stops at unmarked loops to avoid hoisting past user loops that could
-/// contain multiple sync regions with different CB configurations.
+/// loops with invariant CB configurations: compiler-generated tile/subblock
+/// loops (ttl.tile_loop_stride, ttl.subblock_loop_stride) and L1
+/// accumulation loops (ttl.l1_acc_loop). All use fixed CBs across
+/// iterations, so init hoisting is safe. Stops at unmarked loops to avoid
+/// hoisting past user loops with varying CB configurations.
 static Operation *hoistAboveCompilerLoops(Operation *op) {
   Operation *insertBefore = op;
   while (auto *parentOp = insertBefore->getParentOp()) {
     if (isa<scf::ForOp>(parentOp) &&
         (parentOp->hasAttr(kTileLoopStrideAttrName) ||
-         parentOp->hasAttr(kSubblockLoopStrideAttrName))) {
+         parentOp->hasAttr(kSubblockLoopStrideAttrName) ||
+         parentOp->hasAttr(kL1AccLoopAttrName))) {
       insertBefore = parentOp;
     } else {
       break;
@@ -403,8 +404,36 @@ static LogicalResult insertCommonInits(ModuleOp moduleOp) {
       inputCB = outputCB;
     }
 
-    if (analysis.hasMatmul && in0CB && in1CB) {
-      // mm_block_init configures UNPACK + MATH + PACK for matmul_block.
+    // When a matmul init is hoisted before a loop that shares an
+    // output CB with a preceding sibling annotated loop, use
+    // init_short. The full init reconfigures the PACK pipeline,
+    // which clobbers packer state (including L1 acc on Wormhole);
+    // init_short only reconfigures UNPACK+MATH.
+    bool useInitShort = false;
+    if (analysis.hasMatmul) {
+      if (auto forOp = dyn_cast<scf::ForOp>(insertBefore)) {
+        if (forOp->hasAttr(kL1AccLoopAttrName) ||
+            forOp->hasAttr(kReductionLoopAttrName)) {
+          for (Operation *prev = forOp->getPrevNode(); prev;
+               prev = prev->getPrevNode()) {
+            if (auto prevFor = dyn_cast<scf::ForOp>(prev)) {
+              if ((prevFor->hasAttr(kL1AccLoopAttrName) ||
+                   prevFor->hasAttr(kReductionLoopAttrName)) &&
+                  sharePackCB(prevFor, forOp)) {
+                useInitShort = true;
+              }
+              break;
+            }
+          }
+        }
+      }
+    }
+
+    if (analysis.hasMatmul && in0CB && in1CB && useInitShort) {
+      ttk::MatmulBlockInitShortOp::create(
+          builder, loc, in0CB, in1CB, analysis.matmulTranspose,
+          analysis.matmulCt, analysis.matmulRt, analysis.matmulKt);
+    } else if (analysis.hasMatmul && in0CB && in1CB) {
       ttk::MatmulBlockInitOp::create(
           builder, loc, in0CB, in1CB, outputCB, analysis.matmulTranspose,
           analysis.matmulCt, analysis.matmulRt, analysis.matmulKt);
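The sibling scan that selects between the full and short matmul init in this hunk can be modeled on plain data. The following Python sketch is a toy under stated assumptions: ops are dictionaries, `choose_matmul_init` is a hypothetical name, `pack_cbs` stands in for what `getPackTileCBs` would collect, and the model assumes a matmul with both input CBs is already known to be present.

```python
L1_ACC = "ttl.l1_acc_loop"
REDUCTION = "ttl.reduction_loop"


def is_annotated(op):
    """True for an scf.for carrying either accumulation attribute."""
    return op["kind"] == "scf.for" and bool({L1_ACC, REDUCTION} & op["attrs"])


def choose_matmul_init(block, index):
    """Return "short" or "full" for an init hoisted before block[index].

    `block` is a list of sibling ops in program order (toy model).
    """
    target = block[index]
    if not is_annotated(target):
        return "full"
    # Only the nearest preceding scf.for sibling is considered.
    for prev in reversed(block[:index]):
        if prev["kind"] != "scf.for":
            continue
        if is_annotated(prev) and prev["pack_cbs"] & target["pack_cbs"]:
            return "short"
        break
    return "full"
```

As in the C++ above, the scan stops at the first preceding `scf.for`, so an unannotated loop between two annotated loops yields the full init in this model.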