Merged
35 commits
db6d4b8 [examples] Matmul tutorial (#485) (phizalev-TT, Apr 10, 2026)
2544cb1 Allow subblocking for matmul accumulating computes (brnorris03, Apr 10, 2026)
01cb9b6 initial pack_reconfig_l1_acc support (brnorris03, Apr 10, 2026)
513cc4d preformat (brnorris03, Apr 10, 2026)
6476460 add pytest for l1 accumulation (brnorris03, Apr 10, 2026)
455ad86 correct placement of pack_reconfig_l1_acc (brnorris03, Apr 10, 2026)
b51e554 more packer accumulation fixes (brnorris03, Apr 10, 2026)
ec703c7 matmul cleanup and 2d mcast (#465) (arichinsTT, Apr 10, 2026)
4e8f935 Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc (brnorris03, Apr 10, 2026)
fe2bb68 consistent missing device error (#481) (arichinsTT, Apr 10, 2026)
0350205 generalize the packer L1 accumulation guard placement (brnorris03, Apr 10, 2026)
f919130 add tests (brnorris03, Apr 10, 2026)
2d11d04 update tests (brnorris03, Apr 10, 2026)
f59d850 Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc (brnorris03, Apr 10, 2026)
b53d0bc Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc (brnorris03, Apr 10, 2026)
1da0c9f comments (brnorris03, Apr 13, 2026)
c1bf820 rename ttl-annotate-reduction-loops to ttl-annotate-l1-acc-loops; oth… (brnorris03, Apr 13, 2026)
5b64b30 Merge remote-tracking branch 'origin/main' into bnorris/matmul-l1-acc (brnorris03, Apr 13, 2026)
3109b91 implement += on blocks (brnorris03, Apr 13, 2026)
760d376 use _set_var helper (brnorris03, Apr 13, 2026)
903af07 fix test (brnorris03, Apr 13, 2026)
61d0a15 add subblocking test (brnorris03, Apr 13, 2026)
7f8f840 precommit (brnorris03, Apr 13, 2026)
6b9a141 address comments (brnorris03, Apr 13, 2026)
3a3a264 add --ttl-strict-f32-acc compiler option (brnorris03, Apr 14, 2026)
7295718 add TODO (brnorris03, Apr 14, 2026)
7c5ad75 update doc (brnorris03, Apr 14, 2026)
f411240 move the strict-f32-acc option to the subblocking pass; other cleanup (brnorris03, Apr 14, 2026)
c40a42e precommit (brnorris03, Apr 14, 2026)
3fc7ca9 rewrite L1 acc annotation with dominance, fix consecutive += loops. add (brnorris03, Apr 14, 2026)
d5bcba8 update comments. add another test for consecutive loops targetting di… (brnorris03, Apr 14, 2026)
dea4d75 final cleanup (brnorris03, Apr 14, 2026)
6003a7d Downgrade from `mm_block_init` to `mm_block_init_short` for the 2nd+ … (brnorris03, Apr 14, 2026)
fd5b0bf a bit more refactoring cleanup (brnorris03, Apr 14, 2026)
3394dfc reorg includes (brnorris03, Apr 14, 2026)
33 changes: 19 additions & 14 deletions docs/sphinx/reference/compiler-options.md
@@ -17,6 +17,7 @@ python my_kernel.py --no-ttl-maximize-dst
| `--ttl-block-matmul` / `--no-ttl-block-matmul` | enabled | Emit `matmul_block` (processes the full tile block atomically) instead of per-tile matmul loops. Disabling this option is not yet supported. |
| `--ttl-auto-sync` / `--no-ttl-auto-sync` | disabled | Let the compiler insert and move DFB synchronization ops. When enabled, reserve/push may be refined to per-subblock granularity. When disabled, user-placed reserve/push is preserved as written. |
| `--ttl-combine-pack-tiles` / `--no-ttl-combine-pack-tiles` | enabled | Combine consecutive `pack_tile` ops on the same CB with contiguous DST and CB indices into a single `pack_tile_block` call. |
| `--ttl-strict-f32-acc` / `--no-ttl-strict-f32-acc` | disabled | Error at compile time if a `+=` accumulation loop's output block exceeds f32 DST capacity (4 tiles with double-buffering). When enabled, guarantees each accumulation step fits in a single DST section without subblocking. |
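
To make the capacity rule in the last row concrete, here is a minimal Python sketch of the check it describes. The 4-tile limit is taken from the table; `check_strict_f32_acc`, its arguments, and the `ValueError` are hypothetical names chosen for illustration and are not tt-lang APIs.

```python
# Illustrative model only; not the compiler's implementation.
F32_DST_CAPACITY_TILES = 4  # from the option description: f32 DST with double-buffering


def check_strict_f32_acc(block_rows: int, block_cols: int, strict: bool) -> bool:
    """Return True if a += output block fits in a single DST section.

    With strict=True an oversized block raises instead of silently
    falling back to subblocking (hypothetical error type and message).
    """
    tiles = block_rows * block_cols
    fits = tiles <= F32_DST_CAPACITY_TILES
    if strict and not fits:
        raise ValueError(
            f"+= output block has {tiles} tiles; f32 DST holds "
            f"{F32_DST_CAPACITY_TILES} without subblocking"
        )
    return fits
```

Under this model a 2x2 block passes in either mode, while a 2x4 block is accepted only when the strict option is off.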

### Other Ways to Set These

@@ -110,24 +111,27 @@ ttlang-opt input.mlir -p 'ttl-to-ttkernel-pipeline{maximize-dst=true lower-to-em
| `use-block-matmul` | bool | `true` | Lower matmul to block-level hardware calls (`experimental::matmul_block`). |
| `auto-sync` | bool | `false` | Let the compiler insert and move DFB synchronization ops. |
| `combine-pack-tiles` | bool | `true` | Combine consecutive `pack_tile` ops into `pack_tile_block`. |
| `strict-f32-acc` | bool | `false` | Error if a `+=` accumulation loop's output block exceeds f32 DST capacity. |
| `lower-to-emitc` | bool | `false` | Run the TTKernel-to-EmitC backend (produces C++ source). |

The pipeline runs these passes in order:

- 1. `convert-ttl-to-compute` — lower TTL elementwise tensor ops to `ttl.compute` with tile ops
- 2. `ttl-set-compute-kernel-config` — set `fp32_dest_acc_en` / `dst_full_sync_en` defaults
- 3. `ttl-assign-dst` — DST register allocation (linear scan with copy insertion)
- 4. `ttl-subblock-compute-for-dst` — tile `ttl.compute` into DST-sized subblocks *(only if `maximize-dst=true`)*; optionally refine reserve/push to per-subblock granularity *(only if `auto-sync=true`)*
- 5. `ttl-insert-tile-regs-sync` — insert math/pack thread synchronization
- 6. `ttl-lower-matmul-block` — mark block-matmul computes and expand stores *(only if `use-block-matmul=true`)*
- 7. `ttl-lower-to-loops` — lower `ttl.compute` to `scf.for` loops
- 8. `ttl-schedule-operations` — reorder tile ops by dependency depth and kind *(only if `maximize-dst=true`)*
- 9. `ttl-annotate-cb-associations` — annotate block args with CB indices
- 10. `convert-ttl-to-ttkernel` — lower TTL DMA ops to TTKernel
- 11. `ttkernel-insert-inits` — insert hardware init ops before compute ops
- 12. `ttkernel-combine-pack-tiles` — combine consecutive `pack_tile` into `pack_tile_block` *(only if `combine-pack-tiles=true`)*
- 13. Canonicalization and CSE cleanup
- 14. *(if `lower-to-emitc=true`)* `lower-affine`, `convert-ttkernel-to-emitc`, `emitc-form-expressions`
+ 1. `ttl-annotate-l1-acc-loops` — detect `+=` accumulation loops and annotate for L1 packer accumulation
+ 2. `convert-ttl-to-compute` — lower TTL elementwise tensor ops to `ttl.compute` with tile ops
+ 3. `ttl-set-compute-kernel-config` — set `fp32_dest_acc_en` / `dst_full_sync_en` defaults
+ 4. `ttl-assign-dst` — DST register allocation (linear scan with copy insertion)
+ 5. `ttl-subblock-compute-for-dst` — tile `ttl.compute` into DST-sized subblocks *(only if `maximize-dst=true`)*; optionally refine reserve/push to per-subblock granularity *(only if `auto-sync=true`)*
+ 6. `ttl-insert-tile-regs-sync` — insert math/pack thread synchronization
+ 7. `ttl-lower-matmul-block` — mark block-matmul computes and expand stores *(only if `use-block-matmul=true`)*
+ 8. `ttl-lower-to-loops` — lower `ttl.compute` to `scf.for` loops
+ 9. `ttl-schedule-operations` — reorder tile ops by dependency depth and kind *(only if `maximize-dst=true`)*
+ 10. `ttl-annotate-cb-associations` — annotate block args with CB indices
+ 11. `convert-ttl-to-ttkernel` — lower TTL DMA ops to TTKernel
+ 12. `ttkernel-insert-inits` — insert hardware init ops before compute ops
+ 13. `ttkernel-insert-l1-accumulation` — insert `pack_reconfig_l1_acc` guards for `+=` and reduction loops
+ 14. `ttkernel-combine-pack-tiles` — combine consecutive `pack_tile` into `pack_tile_block` *(only if `combine-pack-tiles=true`)*
+ 15. Canonicalization and CSE cleanup
+ 16. *(if `lower-to-emitc=true`)* `lower-affine`, `convert-ttkernel-to-emitc`, `emitc-form-expressions`
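
The updated ordering can be restated as a small Python function that returns the pass names for a given set of options. This is only a summary of the list, not the pipeline's actual construction code: the function name is hypothetical, `canonicalize` and `cse` are assumed names standing in for the "Canonicalization and CSE cleanup" step, and `maximize_dst` has no default here because its default is not shown in this excerpt.

```python
def ttl_to_ttkernel_passes(maximize_dst, use_block_matmul=True,
                           combine_pack_tiles=True, lower_to_emitc=False):
    """Return pass names in the documented order (illustrative only)."""
    passes = [
        "ttl-annotate-l1-acc-loops",       # new first step
        "convert-ttl-to-compute",
        "ttl-set-compute-kernel-config",
        "ttl-assign-dst",
    ]
    if maximize_dst:
        passes.append("ttl-subblock-compute-for-dst")
    passes.append("ttl-insert-tile-regs-sync")
    if use_block_matmul:
        passes.append("ttl-lower-matmul-block")
    passes.append("ttl-lower-to-loops")
    if maximize_dst:
        passes.append("ttl-schedule-operations")
    passes += [
        "ttl-annotate-cb-associations",
        "convert-ttl-to-ttkernel",
        "ttkernel-insert-inits",
        "ttkernel-insert-l1-accumulation",  # new, runs right after init insertion
    ]
    if combine_pack_tiles:
        passes.append("ttkernel-combine-pack-tiles")
    passes += ["canonicalize", "cse"]       # assumed names for the cleanup step
    if lower_to_emitc:
        passes += ["lower-affine", "convert-ttkernel-to-emitc",
                   "emitc-form-expressions"]
    return passes
```

The two passes added by this change sit at opposite ends: annotation happens before any lowering, and guard insertion happens only after inits exist at the TTKernel level.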

### Individual Pass Options

@@ -169,6 +173,7 @@ Partition `ttl.compute` into DST-sized subblocks.
| Option | Type | Default | Description |
|---|---|---|---|
| `subblock-sync` | bool | `false` | Refine DFB reserve/push to per-subblock granularity, enabling `pack_tile_block` for contiguous subblocks. When disabled, user-placed reserve/push is preserved. |
| `strict-f32-acc` | bool | `false` | Error if a `+=` accumulation loop with non-f32 output requires subblocking. Subblocking reduces accumulation precision because bf16 L1 intermediates truncate f32 DST values. |
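
The precision concern behind `strict-f32-acc` can be illustrated numerically. The sketch below, in plain Python, compares a running sum kept in f32 with one that is narrowed to bf16 after every step. It assumes narrowing by dropping the low 16 bits of the f32 encoding, which is a simplification; the hardware's actual rounding behavior is not described in this diff, and all function names are hypothetical.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_to_bf16(x: float) -> float:
    """Keep only the top 16 bits of the f32 encoding (assumed narrowing rule)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def accumulate(values, through_bf16: bool) -> float:
    """Sum values in f32, optionally storing each intermediate as bf16."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
        if through_bf16:
            acc = truncate_to_bf16(acc)  # models a bf16 L1 intermediate
    return acc


partials = [1.0] + [0.001] * 100
f32_sum = accumulate(partials, through_bf16=False)   # close to 1.1
bf16_sum = accumulate(partials, through_bf16=True)   # small addends are lost
```

Near 1.0 the bf16 spacing is 2^-7, so each 0.001 contribution vanishes when the intermediate is narrowed, while the f32 accumulator retains it.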

```bash
ttlang-opt input.mlir -p 'func.func(ttl-subblock-compute-for-dst{subblock-sync=true})'
6 changes: 6 additions & 0 deletions include/ttlang/Dialect/TTL/IR/TTL.h
@@ -64,6 +64,12 @@ constexpr llvm::StringLiteral kTileLoopStrideAttrName("ttl.tile_loop_stride");
/// ComputeOp is lowered to loops.
constexpr llvm::StringLiteral kReductionLoopAttrName("ttl.reduction_loop");

/// Marks a user-written scf.for as an L1 accumulation loop. Each iteration
/// packs to the same CB slot; pack_reconfig_l1_acc makes subsequent
/// iterations additive. Distinct from kReductionLoopAttrName which marks
/// compiler-generated reduction loops.
constexpr llvm::StringLiteral kL1AccLoopAttrName("ttl.l1_acc_loop");
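
As a rough mental model of what the attribute's comment describes, the Python sketch below simulates a packer writing one CB slot. `PackerModel` and `run_l1_acc_loop` are hypothetical names; only the `pack_reconfig_l1_acc` and `pack_tile` names and the "first iteration overwrites, later iterations add" behavior come from the diff.

```python
class PackerModel:
    """Toy model of a packer targeting a single CB slot in L1."""

    def __init__(self, stale_value: float = 0.0):
        self.l1_acc = False
        self.slot = stale_value  # whatever was left in L1 before the loop

    def pack_reconfig_l1_acc(self, enable: int) -> None:
        self.l1_acc = bool(enable)

    def pack_tile(self, value: float) -> None:
        # With L1 accumulation on, the pack adds; otherwise it overwrites.
        self.slot = self.slot + value if self.l1_acc else value


def run_l1_acc_loop(partials, lb: int = 0, stale_value: float = 99.0) -> float:
    """Pack every partial result to the same slot with the guard pattern."""
    packer = PackerModel(stale_value)
    packer.pack_reconfig_l1_acc(0)        # start from overwrite mode
    for iv, value in enumerate(partials, start=lb):
        packer.pack_tile(value)
        if iv == lb:
            packer.pack_reconfig_l1_acc(1)  # later iterations become additive
    packer.pack_reconfig_l1_acc(0)        # restore the default afterwards
    return packer.slot
```

The first pack discards the stale slot contents, so the final value is the sum of the partials regardless of what L1 held before the loop.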

/// Output CB index on tile ops that need it for init insertion.
constexpr llvm::StringLiteral
kBcastOutputCBIndexAttrName("ttl.bcast_output_cb_index");
14 changes: 11 additions & 3 deletions include/ttlang/Dialect/TTL/IR/TTLOps.td
@@ -1039,20 +1039,28 @@ def TTL_StoreOp : TTL_Op<"store", [MemoryEffects<[MemWrite]>]> {
`ttl.store` represents the user's intent to store a computed tensor into
an output circular buffer.

- Emitted by Python's `o.store(result)`. During `convert-ttl-to-compute`,
- this op is transformed into a `ttl.tile_store` inside the compute body.
+ Emitted by Python's `o.store(result)` (overwrite) or `o += result`
+ (accumulate). During `convert-ttl-to-compute`, this op is transformed
+ into a `ttl.tile_store` inside the compute body.
+
+ When `accumulate` is set, the enclosing loop is annotated for L1
+ packer accumulation so that each iteration adds to the existing L1
+ value instead of overwriting.

Example:
```mlir
%reserve = ttl.cb_reserve %cb : ... -> tensor<1x1x!ttcore.tile<32x32, bf16>>
%result = ttl.add %a, %b : ...
ttl.store %result, %reserve : tensor<1x1x!ttcore.tile<32x32, bf16>>,
tensor<1x1x!ttcore.tile<32x32, bf16>>
// Accumulating store (from +=):
ttl.store %result, %reserve {accumulate} : ...
```
}];
let arguments = (ins
AnyRankedTensor:$tensor,
- AnyRankedTensor:$view
+ AnyRankedTensor:$view,
+ OptionalAttr<UnitAttr>:$accumulate
);
let assemblyFormat = "$tensor `,` $view attr-dict `:` type($tensor) `,` type($view)";
let hasVerifier = 1;
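
The op description says that `o.store(result)` and `o += result` both produce a `ttl.store`, differing only in the `accumulate` attribute. One way a Python front end could express that mapping is sketched below. The `StoreOp` and `OutputBlock` classes are hypothetical stand-ins and do not reflect the actual tt-lang Python implementation, which is not part of this excerpt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoreOp:
    """Stand-in for an emitted ttl.store op."""
    value: str
    accumulate: bool = False


@dataclass
class OutputBlock:
    """Stand-in for a reserved output block that records emitted ops."""
    ops: List[StoreOp] = field(default_factory=list)

    def store(self, value: str) -> None:
        # Plain store: overwrite semantics, no accumulate attribute.
        self.ops.append(StoreOp(value))

    def __iadd__(self, value: str) -> "OutputBlock":
        # `o += result`: same op, tagged with accumulate.
        self.ops.append(StoreOp(value, accumulate=True))
        return self  # keep `o` bound to the same block


o = OutputBlock()
o.store("%init")
o += "%partial"
```

Returning `self` from `__iadd__` matters in this sketch: Python rebinds the left-hand name to the return value, so returning anything else would replace the block.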
20 changes: 20 additions & 0 deletions include/ttlang/Dialect/TTL/IR/TTLOpsUtils.h
@@ -12,9 +12,11 @@
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Affine/Utils.h"
#include "mlir/Dialect/Arith/Utils/Utils.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/Interfaces/ViewLikeInterface.h"
#include "llvm/ADT/SetVector.h"

#include <cstdint>
#include <optional>

@@ -427,6 +429,24 @@ inline TileOp createTileOpWithPlaceholderDstIndex(OpBuilder &builder,
return tileOp;
}

/// Collect the CB values targeted by pack_tile ops inside a loop.
llvm::SmallDenseSet<Value, 2> getPackTileCBs(scf::ForOp loop);

/// Returns true if two loops share any pack_tile CB target.
bool sharePackCB(scf::ForOp loopA, scf::ForOp loopB);

/// A group of consecutive sibling loops that pack to the same output CB.
struct LoopGroup {
scf::ForOp rootLoop;
SmallVector<scf::ForOp> loops;
Operation *scopeEnd = nullptr;
};

/// Collect groups of annotated sibling loops that share a pack CB target.
SmallVector<LoopGroup> collectLoopGroups(
ArrayRef<scf::ForOp> l1AccLoops,
const llvm::SmallDenseMap<Operation *, Operation *> &enablePointPerLoop);
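
The grouping idea behind `sharePackCB` and `collectLoopGroups` can be sketched in Python as follows. This is a simplified model under stated assumptions: loops are given as an ordered list of siblings, each with the set of CBs its `pack_tile` ops target, and a loop joins the current group when it shares a CB with the previous loop. The real helper also takes an enable-point map and records a `scopeEnd`, neither of which is modelled here; the Python names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Loop:
    name: str
    pack_cbs: FrozenSet[str]  # CBs targeted by pack_tile ops in the loop


def share_pack_cb(a: Loop, b: Loop) -> bool:
    """True if the two loops pack to at least one common CB."""
    return bool(a.pack_cbs & b.pack_cbs)


def collect_loop_groups(sibling_loops: List[Loop]) -> List[List[Loop]]:
    """Group consecutive sibling loops that pack to a shared CB."""
    groups: List[List[Loop]] = []
    for loop in sibling_loops:
        if groups and share_pack_cb(groups[-1][-1], loop):
            groups[-1].append(loop)   # extend the current accumulation scope
        else:
            groups.append([loop])     # this loop becomes a new root loop
    return groups
```

For example, two `+=` loops that both pack to `out_cb` end up in one group, and a following loop that packs to a different CB starts a new one.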

} // namespace mlir::tt::ttl

#endif // TTLANG_DIALECT_TTL_IR_TTLOPSUTILS_H
44 changes: 37 additions & 7 deletions include/ttlang/Dialect/TTL/Passes.td
@@ -30,13 +30,24 @@ def TTKernelInsertL1Accumulation
: Pass<"ttkernel-insert-l1-accumulation", "::mlir::ModuleOp"> {
let summary = "Insert L1 accumulation guards for reduction loops";
let description = [{
- Inserts `pack_reconfig_l1_acc` guards inside reduction loops so that
- pack operations accumulate into L1 instead of overwriting. On the first
- iteration of a reduction loop, pack writes normally. On subsequent
- iterations, the packer is reconfigured to add to the existing L1 value.
+ Inserts `pack_reconfig_l1_acc` guards so that pack operations inside
+ annotated loops accumulate into L1 instead of overwriting. Max-reduce
+ loops are excluded (max is not additive).
+
+ Loops are grouped into accumulation scopes: consecutive sibling loops
+ that pack to the same CB share a single disable pair. Nested annotated
+ loops are folded into the outermost ancestor. Each group gets:
+ - `pack_reconfig_l1_acc(0)` before the first loop.
+ - `if (iv == lb) pack_reconfig_l1_acc(1)` after each loop's first
+ iteration (the L1 acc state persists across `tile_regs` boundaries).
+ - For the second+ loop in a group, an enable before the loop to re-enable
+ L1 acc after init ops that may reset packer state.
+ - `pack_reconfig_l1_acc(0)` after the last `cb_push_back`.
+
+ Loops are identified by the `ttl.l1_acc_loop` (user-written) or
+ `ttl.reduction_loop` (compiler-generated) attributes on `scf.for`
+ ops, with `ttl.l1_acc_loop` taking precedence.

- Reduction loops are identified by the `ttl.reduction_loop` attribute
- on `scf.for` ops.
}];

let dependentDialects = [
@@ -46,6 +57,21 @@ def TTKernelInsertL1Accumulation
];
}
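
The per-group guard placement described for `ttkernel-insert-l1-accumulation` can be written out as an ordered op schedule. The Python sketch below produces such a schedule as plain strings for one group of sibling loops. It is a reading aid based on the pass description only; the function name, the loop-body placeholder, and the `out_cb` operand are hypothetical, and the real pass operates on MLIR ops rather than text.

```python
from typing import List


def guard_schedule(loop_names: List[str]) -> List[str]:
    """List the guard ops for one accumulation group, in program order."""
    ops = ["pack_reconfig_l1_acc(0)"]           # disable before the first loop
    for index, name in enumerate(loop_names):
        if index > 0:
            # Second and later loops: inits hoisted before the loop may have
            # reset packer state, so L1 acc is re-enabled up front.
            ops.append("pack_reconfig_l1_acc(1)")
        ops.append(f"for iv in {name}:")
        ops.append("    <compute and pack_tile>")
        ops.append("    if iv == lb: pack_reconfig_l1_acc(1)")
    ops.append("cb_push_back(out_cb)")
    ops.append("pack_reconfig_l1_acc(0)")       # disable after the last push
    return ops
```

Calling `guard_schedule(["k_loop_a", "k_loop_b"])` yields one leading and one trailing disable for the whole group, with a single pre-loop enable ahead of the second loop.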

def TTLAnnotateL1AccLoops
: Pass<"ttl-annotate-l1-acc-loops", "::mlir::func::FuncOp"> {
let summary = "Annotate user-written scf.for loops for L1 accumulation";
let description = [{
Detects user-written `scf.for` loops containing `ttl.store` ops with
the `accumulate` attribute (emitted by the `+=` operator) and annotates
them with `ttl.l1_acc_loop`. This enables `TTKernelInsertL1Accumulation`
to insert `pack_reconfig_l1_acc` guards so that packs accumulate across
iterations instead of overwriting.
}];
let dependentDialects = [
"::mlir::scf::SCFDialect"
];
}
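
The detection step can be pictured as a walk over the op tree that tags loops containing accumulating stores. The Python sketch below models ops as nested dictionaries and annotates the nearest enclosing `scf.for` of each `ttl.store` that carries `accumulate`. Choosing the nearest enclosing loop is an assumption made for this illustration; the commit log mentions a dominance-based rewrite, so the actual pass may pick the loop differently. The dictionary layout and function name are hypothetical.

```python
L1_ACC_ATTR = "ttl.l1_acc_loop"


def annotate_l1_acc_loops(op: dict, enclosing_loop: dict = None) -> None:
    """Tag the nearest enclosing scf.for of every accumulating ttl.store."""
    if (op["name"] == "ttl.store" and op.get("accumulate")
            and enclosing_loop is not None):
        enclosing_loop.setdefault("attrs", set()).add(L1_ACC_ATTR)
    # Children of a loop see that loop as their nearest enclosing loop.
    next_loop = op if op["name"] == "scf.for" else enclosing_loop
    for child in op.get("body", []):
        annotate_l1_acc_loops(child, next_loop)


func = {"name": "func.func", "body": [
    {"name": "scf.for", "body": [{"name": "ttl.store", "accumulate": True}]},
    {"name": "scf.for", "body": [{"name": "ttl.store"}]},
]}
annotate_l1_acc_loops(func)
```

After the walk, only the first loop carries the attribute; the loop with a plain store is left untouched.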

def TTLConvertTTLToCompute
: Pass<"convert-ttl-to-compute", "::mlir::func::FuncOp"> {
let summary = "Lower TTL elementwise tensor ops to ttl.compute with tile ops";
@@ -184,7 +210,11 @@ def TTLSubblockComputeForDST
Option<"subblockSync", "subblock-sync", "bool", "false",
"Refine DFB reserve/push to per-subblock granularity, enabling "
"pack_tile_block for contiguous subblocks. When disabled (default), "
- "user-placed reserve/push is preserved.">
+ "user-placed reserve/push is preserved.">,
+ Option<"strictF32Acc", "strict-f32-acc", "bool", "false",
+ "Error if a user-written accumulation loop (+=) with non-f32 "
+ "output requires subblocking. Subblocking reduces accumulation "
+ "precision because bf16 L1 intermediates are narrower than f32 DST.">
];

let dependentDialects = [
4 changes: 4 additions & 0 deletions include/ttlang/Dialect/TTL/Pipelines/TTLPipelines.h
@@ -44,6 +44,10 @@ struct TTLToTTKernelPipelineOptions
*this, "reduce-full-fp32",
llvm::cl::desc("Enable FP32 accumulation for reduce operations."),
llvm::cl::init(true)};
Option<bool> strictF32Acc{
*this, "strict-f32-acc",
llvm::cl::desc("Error if accumulation output exceeds f32 DST capacity."),
llvm::cl::init(false)};
};

void createTTLToTTKernelPipeline(mlir::OpPassManager &pm,
15 changes: 15 additions & 0 deletions lib/Dialect/TTKernel/Transforms/TTKernelCombinePackTiles.cpp
@@ -11,12 +11,14 @@
//
//===----------------------------------------------------------------------===//

#include "ttlang/Dialect/TTL/IR/TTL.h"
#include "ttlang/Dialect/TTL/Passes.h"
#include "ttmlir/Dialect/TTKernel/IR/TTKernel.h"
#include "ttmlir/Dialect/TTKernel/IR/TTKernelOps.h"

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/Builders.h"
#include "mlir/Pass/Pass.h"

@@ -63,6 +65,19 @@ struct TTKernelCombinePackTilesPass

void runOnOperation() override {
getOperation().walk([](Block *block) {
// Skip blocks inside reduction loops: pack_tile_block is
// incompatible with L1 accumulation (pack_reconfig_l1_acc).
// L1 acc requires individual pack_tile calls so each K iteration
// can independently add to the existing L1 value.
for (Operation *parent = block->getParentOp(); parent;
parent = parent->getParentOp()) {
if (auto forOp = dyn_cast<scf::ForOp>(parent)) {
if (forOp->hasAttr(kReductionLoopAttrName) ||
forOp->hasAttr(kL1AccLoopAttrName)) {
return;
}
}
}
// Collect all combinable runs first, then replace them. Replacing
// during iteration would invalidate the block's operation list.
SmallVector<SmallVector<ttk::PackTileOp>> runs;
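
The new early return walks the parent chain of each block and skips it when any ancestor `scf.for` is marked as a reduction or L1 accumulation loop. The same check is sketched in Python below on a minimal op model. The `Op` class and function name are hypothetical; the attribute strings are the ones defined in `TTL.h`.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

ACCUMULATION_ATTRS = {"ttl.reduction_loop", "ttl.l1_acc_loop"}


@dataclass
class Op:
    name: str
    attrs: Set[str] = field(default_factory=set)
    parent: Optional["Op"] = None


def inside_accumulation_loop(op: Op) -> bool:
    """True if any ancestor scf.for carries an accumulation attribute."""
    parent = op.parent
    while parent is not None:
        if parent.name == "scf.for" and parent.attrs & ACCUMULATION_ATTRS:
            return True
        parent = parent.parent
    return False
```

Unlike the init-hoisting walk elsewhere in this change, this one does not stop at the first unmarked loop: a marked loop anywhere above the block is enough to disable pack combining.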
47 changes: 38 additions & 9 deletions lib/Dialect/TTKernel/Transforms/TTKernelInsertInits.cpp
@@ -23,6 +23,7 @@
//===----------------------------------------------------------------------===//

#include "ttlang/Dialect/TTL/IR/TTL.h"
#include "ttlang/Dialect/TTL/IR/TTLOpsUtils.h"
#include "ttlang/Dialect/TTL/Passes.h"

#include "ttmlir/Dialect/TTKernel/IR/TTKernel.h"
@@ -348,18 +349,18 @@ analyzeSyncRegion(ttk::TileRegsAcquireOp acquireOp, Value &inputCB,
}

/// Find the outermost enclosing insertion point by walking up through
- /// compiler-generated loops (marked with ttl.tile_loop_stride or
- /// ttl.subblock_loop_stride). By construction, these loops iterate over tiles
- /// within a single ttl.compute whose input/output CBs are fixed, so the
- /// CB configuration is invariant across iterations and hoisting is safe.
- /// Stops at unmarked loops to avoid hoisting past user loops that could
- /// contain multiple sync regions with different CB configurations.
+ /// loops with invariant CB configurations: compiler-generated tile/subblock
+ /// loops (ttl.tile_loop_stride, ttl.subblock_loop_stride) and L1
+ /// accumulation loops (ttl.l1_acc_loop). All use fixed CBs across
+ /// iterations, so init hoisting is safe. Stops at unmarked loops to avoid
+ /// hoisting past user loops with varying CB configurations.
static Operation *hoistAboveCompilerLoops(Operation *op) {
Operation *insertBefore = op;
while (auto *parentOp = insertBefore->getParentOp()) {
if (isa<scf::ForOp>(parentOp) &&
(parentOp->hasAttr(kTileLoopStrideAttrName) ||
- parentOp->hasAttr(kSubblockLoopStrideAttrName))) {
+ parentOp->hasAttr(kSubblockLoopStrideAttrName) ||
+ parentOp->hasAttr(kL1AccLoopAttrName))) {
insertBefore = parentOp;
} else {
break;
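
The hoisting walk can be mirrored in Python on a minimal op model: climb while the parent is an `scf.for` with one of the three accepted attributes, and stop at the first parent that is not. The `Op` class and the Python function name are hypothetical; the attribute strings match the constants referenced in the C++ code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

HOISTABLE_ATTRS = {
    "ttl.tile_loop_stride",
    "ttl.subblock_loop_stride",
    "ttl.l1_acc_loop",  # newly accepted by this change
}


@dataclass
class Op:
    name: str
    attrs: Set[str] = field(default_factory=set)
    parent: Optional["Op"] = None


def hoist_above_compiler_loops(op: Op) -> Op:
    """Return the op before which an init should be inserted."""
    insert_before = op
    while insert_before.parent is not None:
        parent = insert_before.parent
        if parent.name == "scf.for" and parent.attrs & HOISTABLE_ATTRS:
            insert_before = parent   # safe to hoist out of this loop
        else:
            break                    # unmarked loop or non-loop parent
    return insert_before
```

With a nest of an unmarked user loop, an L1 accumulation loop, and a tile loop, the insertion point lands just before the L1 accumulation loop and stays inside the user loop.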
@@ -403,8 +404,36 @@ static LogicalResult insertCommonInits(ModuleOp moduleOp) {
inputCB = outputCB;
}

- if (analysis.hasMatmul && in0CB && in1CB) {
- // mm_block_init configures UNPACK + MATH + PACK for matmul_block.
+ // When a matmul init is hoisted before a loop that shares an
+ // output CB with a preceding sibling annotated loop, use
+ // init_short. The full init reconfigures the PACK pipeline
+ // which clobbers packer state (including L1 acc on Wormhole).
+ // init_short only reconfigures UNPACK+MATH.
+ bool useInitShort = false;
+ if (analysis.hasMatmul) {
+ if (auto forOp = dyn_cast<scf::ForOp>(insertBefore)) {
+ if (forOp->hasAttr(kL1AccLoopAttrName) ||
+ forOp->hasAttr(kReductionLoopAttrName)) {
+ for (Operation *prev = forOp->getPrevNode(); prev;
+ prev = prev->getPrevNode()) {
+ if (auto prevFor = dyn_cast<scf::ForOp>(prev)) {
+ if ((prevFor->hasAttr(kL1AccLoopAttrName) ||
+ prevFor->hasAttr(kReductionLoopAttrName)) &&
+ sharePackCB(prevFor, forOp)) {
+ useInitShort = true;
+ }
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (analysis.hasMatmul && in0CB && in1CB && useInitShort) {
+ ttk::MatmulBlockInitShortOp::create(
+ builder, loc, in0CB, in1CB, analysis.matmulTranspose,
+ analysis.matmulCt, analysis.matmulRt, analysis.matmulKt);
+ } else if (analysis.hasMatmul && in0CB && in1CB) {
ttk::MatmulBlockInitOp::create(
builder, loc, in0CB, in1CB, outputCB, analysis.matmulTranspose,
analysis.matmulCt, analysis.matmulRt, analysis.matmulKt);
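
The choice between the full and short matmul init depends only on the nearest preceding sibling loop. The Python sketch below reproduces that decision on a flat list of sibling ops. The `Node` class and function name are hypothetical, and `pack_cbs` stands in for whatever `getPackTileCBs` would collect from each loop.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set

ANNOTATED = {"ttl.l1_acc_loop", "ttl.reduction_loop"}


@dataclass
class Node:
    kind: str
    attrs: Set[str] = field(default_factory=set)
    pack_cbs: FrozenSet[str] = frozenset()


def use_init_short(siblings: List[Node], index: int) -> bool:
    """Decide whether the init hoisted before siblings[index] should be short."""
    loop = siblings[index]
    if loop.kind != "scf.for" or not (loop.attrs & ANNOTATED):
        return False
    # Scan backwards; only the nearest preceding scf.for is considered.
    for prev in reversed(siblings[:index]):
        if prev.kind == "scf.for":
            return bool(prev.attrs & ANNOTATED) and bool(prev.pack_cbs & loop.pack_cbs)
    return False
```

Non-loop ops between the two loops are skipped, but an intervening loop that is unannotated or packs to a different CB ends the search and keeps the full init.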