Element Read/Write #575
Enable DM threads to read/write individual bf16/f32 elements from circular buffer blocks, eliminating host readback for argmax. Adds TTL dialect ops, Python frontend syntax handlers, and an EmitC lowering pass that emits inline C++ helpers handling face-based 32x32 tile layout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thmetic
element_write now auto-casts i64 literals and computed values to i32 via arith.TruncIOp, and handles Index→i32 via arith.IndexCastOp. Also adds test coverage for loop variable column indexing, if-conditionals on element_read results, and scalar arithmetic in DM threads.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scalar variables assigned inside for-loop bodies now survive the loop
via memref-backed storage. When a variable is assigned inside a loop
(detected by symbol table depth > 1), the compiler allocates a
memref<1xi32> at function entry, stores on each assignment, and loads
on read. This enables patterns like:
    for c in range(32):
        val = ttl.element_read(blk, 0, c)
    ttl.element_write(blk, 0, 0, val)  # val survives loop
Also fixes _load_func_arg error message crash when node.func is an
ast.Attribute (e.g., ttl.element_write) instead of ast.Name.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…p vars
The cross-scope memref treatment was too aggressive: it wrapped ALL assignments inside for-loops (including tx = ttl.copy(), m = tid // Nt, etc.) in memrefs, breaking tx.wait() and tensor subscript indexing. Now only i32 values (from element_read) get the memref treatment. Index, i64, transfer handles, tensors, and CBs use normal SSA scoping. Also adds a regression test for the standard DM write loop pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r types
Three fixes to enable fully on-device argmax with element_read/write:
1. ttl_ast.py: Broaden _is_integer_scalar to accept i64/index types for outer-scope variable updates (enables best_idx = tid * 32 + c inside if/for blocks). Keep narrow _is_i32_scalar for new loop variables to avoid breaking existing kernels' index subscripts.
2. LowerElementAccessToEmitC.cpp: Change helper functions from static inline (invalid nested inside kernel_main) to C++ lambdas (valid in C++17 function bodies). Fix missing semicolons after lambda closing braces.
3. Add parallel_index_find_kernel (kernel 3) to argmax.py. It compiles and runs at 0.67ms but has correctness issues in index computation that need further debugging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix API calls (kernel->operation, buffer_factor->block_count)
- Fix memref auto-load in visit_Name for loop-carried variables
- Fix alloca placement, nested scope lookup, i1 exclusion guard
- Change _cast_to_i32 from ExtSIOp to ExtUIOp for unsigned bit patterns
- Register EmitC dialect in Passes.td for isolated pass execution
- Add resolveBlockDirection error diagnostic instead of silent fallback
- Add op verifiers: static bounds check (row/col in [0,31]) and single-tile block requirement (tensor<1x1x!ttcore.tile<...>>)
- Add MLIR lit tests (positive + negative) and Python compile-only tests
- File #572 for bf16 bit-pattern magnitude comparison limitation
write rejection, bounds checking, single-tile constraint)
- LowerElementAccessToEmitC pass emitting face-based tile layout helpers with runtime bounds assertions
- Python DSL element_read()/element_write() with type conversions and cross-scope scalar variable support via memref<1xi32> promotion
- Restore reduce-full-fp32 flag inadvertently dropped from pipeline
- MLIR lit tests and Python compile-only regression tests
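The face-based tile layout that these commits refer to can be illustrated with a small host-side model. This is a sketch only, mirroring the offset arithmetic visible in the emitted C++ helper in this PR (face index from the row/column halves, 256 elements per 16x16 face); the names `face_offset`, `FACE_DIM`, and `TILE_DIM` are hypothetical and not part of tt-lang.

```python
# Host-side model of the face-based 32x32 tile offset (illustrative only).
# Assumption: a tile is four 16x16 faces stored back to back, in the order
# top-left, top-right, bottom-left, bottom-right, as in the emitted helper.
TILE_DIM = 32
FACE_DIM = 16
FACE_ELEMS = FACE_DIM * FACE_DIM  # 256 elements per face


def face_offset(row: int, col: int) -> int:
    """Return the linear element offset of (row, col) inside one tile."""
    # Mirrors the static bounds check (row/col in [0, 31]).
    if not (0 <= row < TILE_DIM and 0 <= col < TILE_DIM):
        raise IndexError(f"row/col must be in [0, {TILE_DIM - 1}]")
    face = (2 if row >= FACE_DIM else 0) + (1 if col >= FACE_DIM else 0)
    return face * FACE_ELEMS + (row % FACE_DIM) * FACE_DIM + (col % FACE_DIM)


if __name__ == "__main__":
    for rc in [(0, 0), (0, 16), (16, 0), (31, 31)]:
        print(rc, face_offset(*rc))
```

A model like this can be handy for checking expected values in compile-only or lit tests, since every (row, col) pair maps to a distinct offset in [0, 1023].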
| // Use lambdas (valid inside function bodies in C++17) instead of nested | ||
| // function definitions, which are not allowed in C++. | ||
| if (needsBF16) { | ||
| emitVerbatim(loc, |
I'm all for verbatim emitc but I wonder if there's an existing ttkernel op we can use?
Or maybe this can all be expressed in GEP/memref ops?
I agree, looking into it rn
I think the ttkernel route is the best option, though it would require one new op for the dereference for load/store, which we could do as a verbatim for now
| auto i32Type = rewriter.getI32Type(); | ||
| auto *ctaOp = createLiteral( |
Do these need to factor in base cta index?
probably, but will be switching to ttkernel op for getting that
| " reinterpret_cast<volatile tt_l1_ptr uint16_t*>(l1_addr);" | ||
| " uint32_t face = (row >= 16 ? 2 : 0) + (col >= 16 ? 1 : 0);" | ||
| " uint32_t offset = face * 256 + (row % 16) * 16 + (col % 16);" | ||
| " base[offset] = (uint16_t)val;" |
should we error rather than truncating?
truncating for the type?
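To make the two options in this thread concrete, here is a minimal sketch of what a `(uint16_t)val` cast does to an out-of-range value versus an explicit range check. Both function names are hypothetical and exist only for illustration; the PR itself emits the truncating cast shown in the diff above.

```python
# Illustrative comparison of truncating vs. checked 16-bit stores.
# Both helpers are hypothetical; they are not part of the lowering pass.
U16_MAX = 0xFFFF


def store_u16_truncating(val: int) -> int:
    """Model of `(uint16_t)val`: silently keep only the low 16 bits."""
    return val & U16_MAX


def store_u16_checked(val: int) -> int:
    """Alternative: reject values that do not fit in 16 bits."""
    if not 0 <= val <= U16_MAX:
        raise ValueError(f"value {val:#x} does not fit in 16 bits")
    return val


if __name__ == "__main__":
    print(hex(store_u16_truncating(0x12345)))  # upper bits are dropped
    print(hex(store_u16_checked(0x3F80)))      # a valid bf16 bit pattern
```

The truncating form is cheaper on device, while the checked form surfaces frontend casting mistakes early; which is preferable depends on whether the value is already known to be a 16-bit pattern at that point.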
| } | ||
| }; | ||
| struct ElementWriteLowering : OpConversionPattern<ElementWriteOp> { |
optional: most of the code here is duplicated with ElementReadLowering; could refactor into a helper
| "reduce_sum", | ||
| "reduce_max", |
Can you explain this design and the reasoning behind it in more detail? It would have been good to discuss this before implementing, but at least it should be documented in the docs/development/DFBManagement.md doc, focusing on the overall approach and how it fits in the rest of the compiler.

At a high level, I don't understand the rationale for using mutable For example, in You may be able to alleviate some of that by adding the -mem2reg to the pipeline(s) early on, but that's more of a workaround than a better design.

There are also a lot of changes to the frontend that seem unrelated to element read/write semantics (e.g., scopes helpers and control flow related changes). I think this PR should only focus on the element access ops (definition and lowering) and not modify the front end for unrelated purposes unless they clearly fit in the design. BTW, #540 implements frontend support for the control flow in things like argmax.
Problem description
tt-lang DM (data movement) threads had no way to read or write individual scalar elements from circular buffer (CB) blocks without host readback. This made on-device reductions like argmax impossible to express: the compute result had to be copied back to the host and re-issued as a dispatch, adding significant latency and complexity.
What's changed
compute the hardware face-based tile offset, and emit the memory access.
that need to outlive the scope are backed by memref<1xi32> allocated at function entry, preserving values across iterations without violating SSA dominance.
produces incorrect ordering for negative values or mixed-sign inputs. This is tracked in [ttl] element_read scan kernel compares bf16 values as raw i32 bit patterns #572; helper functions for bf16-aware comparison are follow-on work.
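The ordering limitation can be reproduced on the host with a short sketch. It assumes bf16 values are the top 16 bits of the IEEE-754 float32 encoding and that the raw pattern is zero-extended before comparison (as the ExtUIOp change suggests). The `bf16_order_key` function is a common sign-flip trick shown only for illustration; it is not the helper planned in #572, and all names here are hypothetical.

```python
import struct


def bf16_bits(x: float) -> int:
    """Top 16 bits of the float32 encoding (truncating, no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def bf16_order_key(bits: int) -> int:
    """Map a bf16 bit pattern to an unsigned key whose integer order
    matches numeric order (NaNs not handled in this sketch)."""
    if bits & 0x8000:             # negative: flip every bit
        return ~bits & 0xFFFF
    return bits | 0x8000          # non-negative: set the top bit


def argmax_raw(values):
    """Argmax comparing raw bit patterns, as the scan kernel does."""
    return max(range(len(values)), key=lambda i: bf16_bits(values[i]))


def argmax_keyed(values):
    """Argmax comparing order-preserving keys instead."""
    return max(range(len(values)),
               key=lambda i: bf16_order_key(bf16_bits(values[i])))


if __name__ == "__main__":
    data = [1.0, -2.0, 0.5]
    print(argmax_raw(data), argmax_keyed(data))
```

With all-non-negative inputs the raw comparison already agrees with numeric order, which is why the limitation only shows up for negative or mixed-sign data.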
Ticket
#572
Checklist