Element Read/Write by arichinsTT · Pull Request #575 · tenstorrent/tt-lang

arichinsTT · 2026-05-12T20:18:15Z

Problem description

tt-lang DM (data movement) threads had no way to read or write individual scalar elements from circular buffer (CB) blocks without host readback. This made on-device reductions like argmax impossible to express: the compute result had to be copied back to the host and re-issued as a dispatch, adding significant latency and complexity.

What's changed

Adds element_read and element_write operations that allow DM threads to access individual bf16/f32 elements within a CB block at a given tile coordinate [row, col], operating entirely on-device.

- New IR ops (TTL_ElementReadOp, TTL_ElementWriteOp): Read/write a single element from a CB-attached tensor at a tile coordinate. Elements are returned/accepted as i32 (raw bits).
- EmitC lowering pass (ttl-lower-element-access-to-emitc): Lowers the TTL ops to inline C++ lambdas that resolve the CB's L1 address (distinguishing get_read_ptr vs get_write_ptr based on whether the block came from cb.wait() or cb.reserve()), compute the hardware face-based tile offset, and emit the memory access.
- Python frontend: Adds element_read(block, row, col) and element_write(block, row, col, value) syntax in DM kernel bodies. Scalar integer values are auto-cast to i32 at operation boundaries. Scalar variables assigned inside loop/if bodies that need to outlive the scope are backed by memref<1xi32> allocated at function entry, preserving values across iterations without violating SSA dominance.
- Known limitation: element_read returns raw i32 bit patterns. Equality comparison (==) on these values is correct for bf16, but magnitude comparisons (>, <) are not: bf16 uses sign-magnitude representation, so raw unsigned integer comparison produces incorrect ordering for negative values or mixed-sign inputs. This is tracked in #572 ("[ttl] element_read scan kernel compares bf16 values as raw i32 bit patterns"); helper functions for bf16-aware comparison are follow-on work.
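To make the known limitation concrete, here is a minimal host-side sketch in Python. It shows raw bf16 bit patterns ordering incorrectly, and one common bit trick that restores numeric order. The names bf16_bits and bf16_order_key are hypothetical and not part of tt-lang; the helpers planned under #572 may look different, and NaN handling is ignored here.

```python
import struct


def bf16_bits(x: float) -> int:
    """Return a 16-bit bf16 pattern for x by truncating its float32 encoding."""
    (f32_bits,) = struct.unpack("<I", struct.pack("<f", x))
    return f32_bits >> 16


def bf16_order_key(bits: int) -> int:
    """Map a bf16 bit pattern to an unsigned key that sorts in numeric order.

    Hypothetical helper: patterns with the sign bit set have every bit
    flipped, all other patterns get the sign bit set. NaNs are not handled.
    """
    if bits & 0x8000:
        return ~bits & 0xFFFF
    return bits | 0x8000


neg_two, neg_one, pos_one = bf16_bits(-2.0), bf16_bits(-1.0), bf16_bits(1.0)

# Raw unsigned comparison gets both of these wrong.
raw_says_neg_one_below_neg_two = neg_one < neg_two
raw_says_pos_one_below_neg_one = pos_one < neg_one

# The keyed comparison matches numeric order: -2.0 < -1.0 < 1.0.
keyed_order_ok = (
    bf16_order_key(neg_two) < bf16_order_key(neg_one) < bf16_order_key(pos_one)
)
```

Equality on the raw patterns is unaffected by the key transform, since the mapping is one-to-one.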

Ticket

#572

Checklist

New/Existing tests provide coverage for changes

Enable DM threads to read/write individual bf16/f32 elements from circular buffer blocks, eliminating host readback for argmax. Adds TTL dialect ops, Python frontend syntax handlers, and an EmitC lowering pass that emits inline C++ helpers handling face-based 32x32 tile layout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…thmetic element_write now auto-casts i64 literals and computed values to i32 via arith.TruncIOp, and handles Index→i32 via arith.IndexCastOp. Also adds test coverage for loop variable column indexing, if-conditionals on element_read results, and scalar arithmetic in DM threads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scalar variables assigned inside for-loop bodies now survive the loop via memref-backed storage. When a variable is assigned inside a loop (detected by symbol table depth > 1), the compiler allocates a memref<1xi32> at function entry, stores on each assignment, and loads on read. This enables patterns like:

    for c in range(32):
        val = ttl.element_read(blk, 0, c)
    ttl.element_write(blk, 0, 0, val)  # val survives loop

Also fixes _load_func_arg error message crash when node.func is an ast.Attribute (e.g., ttl.element_write) instead of ast.Name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
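The store-on-assignment, load-on-read scheme from this commit can be modeled in plain Python. This is only a host-side analogy, assuming a one-slot cell stands in for memref<1xi32>; I32Cell and last_element_in_row are hypothetical names, and the real lowering operates on MLIR values, not Python objects.

```python
class I32Cell:
    """Hypothetical stand-in for a memref<1xi32> allocated at function entry."""

    def __init__(self) -> None:
        self.slot = [0]

    def store(self, value: int) -> None:
        # Keep only the low 32 bits, as an i32 cell would.
        self.slot[0] = value & 0xFFFFFFFF

    def load(self) -> int:
        return self.slot[0]


def last_element_in_row(row_bits):
    """Mirror the loop pattern: assign inside the loop, read after it."""
    val = I32Cell()  # allocated once, before the loop
    for c in range(len(row_bits)):
        val.store(row_bits[c])  # each assignment becomes a store
    return val.load()  # the read after the loop becomes a load
```

Because the cell is created before the loop, the value read afterwards is well defined even though the assignment happens in an inner scope.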

…p vars The cross-scope memref treatment was too aggressive — it wrapped ALL assignments inside for-loops (including tx = ttl.copy(), m = tid // Nt, etc.) in memrefs, breaking tx.wait() and tensor subscript indexing. Now only i32 values (from element_read) get the memref treatment. Index, i64, transfer handles, tensors, and CBs use normal SSA scoping. Also adds regression test for the standard DM write loop pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…r types

Three fixes to enable fully on-device argmax with element_read/write:
1. ttl_ast.py: Broaden _is_integer_scalar to accept i64/index types for outer-scope variable updates (enables best_idx = tid * 32 + c inside if/for blocks). Keep narrow _is_i32_scalar for new loop variables to avoid breaking existing kernels' index subscripts.
2. LowerElementAccessToEmitC.cpp: Change helper functions from static inline (invalid nested inside kernel_main) to C++ lambdas (valid in C++17 function bodies). Fix missing semicolons after lambda closing braces.
3. Add parallel_index_find_kernel (kernel 3) to argmax.py: compiles and runs at 0.67ms but has correctness issues in index computation that need further debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix API calls (kernel->operation, buffer_factor->block_count)
- Fix memref auto-load in visit_Name for loop-carried variables
- Fix alloca placement, nested scope lookup, i1 exclusion guard
- Change _cast_to_i32 from ExtSIOp to ExtUIOp for unsigned bit patterns
- Register EmitC dialect in Passes.td for isolated pass execution
- Add resolveBlockDirection error diagnostic instead of silent fallback
- Add op verifiers: static bounds check (row/col in [0,31]) and single-tile block requirement (tensor<1x1x!ttcore.tile<...>>)
- Add MLIR lit tests (positive + negative) and Python compile-only tests
- File #572 for bf16 bit-pattern magnitude comparison limitation
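The switch from sign extension to zero extension matters whenever the top bit of a narrow bit pattern is set. A small Python sketch of the difference follows, assuming a 16-bit source width for illustration; the commit does not state which source widths _cast_to_i32 handles, and both function names are hypothetical.

```python
def zero_extend_16_to_32(bits16: int) -> int:
    """Widen a 16-bit pattern to 32 bits, filling the upper half with zeros."""
    return bits16 & 0xFFFF


def sign_extend_16_to_32(bits16: int) -> int:
    """Widen a 16-bit pattern to 32 bits, copying bit 15 into the upper half."""
    bits16 &= 0xFFFF
    if bits16 & 0x8000:
        return bits16 | 0xFFFF0000
    return bits16


# 0xBF80 is the bf16 pattern of -1.0, so its top bit is set.
pattern = 0xBF80
zero_extended = zero_extend_16_to_32(pattern)
sign_extended = sign_extend_16_to_32(pattern)
```

With zero extension the low 16 bits are the only bits set, so the widened value still equals the original pattern; sign extension sets the upper half and changes the value.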

write rejection, bounds checking, single-tile constraint)
- LowerElementAccessToEmitC pass emitting face-based tile layout helpers with runtime bounds assertions
- Python DSL element_read()/element_write() with type conversions and cross-scope scalar variable support via memref<1xi32> promotion
- Restore reduce-full-fp32 flag inadvertently dropped from pipeline
- MLIR lit tests and Python compile-only regression tests

zoecarver · 2026-05-13T18:14:47Z

+  // Use lambdas (valid inside function bodies in C++17) instead of nested
+  // function definitions, which are not allowed in C++.
+  if (needsBF16) {
+    emitVerbatim(loc,


I'm all for verbatim emitc but I wonder if there's an existing ttkernel op we can use?

Or maybe this can all be expressed in GEP/memref ops?

I agree, looking into it rn

I think the ttkernel route is the best option, though it would require one new op for the dereference for load/store, which we could do as a verbatim for now

zoecarver · 2026-05-13T18:19:03Z

+
+    auto i32Type = rewriter.getI32Type();
+
+    auto *ctaOp = createLiteral(


Do these need to factor in base cta index?

probably, but will be switching to ttkernel op for getting that

zoecarver · 2026-05-13T18:21:04Z

+        " reinterpret_cast<volatile tt_l1_ptr uint16_t*>(l1_addr);"
+        " uint32_t face = (row >= 16 ? 2 : 0) + (col >= 16 ? 1 : 0);"
+        " uint32_t offset = face * 256 + (row % 16) * 16 + (col % 16);"
+        " base[offset] = (uint16_t)val;"


should we error rather than truncating?

truncating for the type?
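The offset arithmetic in the hunk above can be checked on the host. The sketch below transcribes the two expressions from the diff into Python (four 16x16 faces of 256 elements each inside a 32x32 tile); the function name tile_element_offset and the explicit bounds check are assumptions added for illustration.

```python
FACE_DIM = 16                      # rows/cols per face, from the `% 16` terms
FACE_ELEMS = FACE_DIM * FACE_DIM   # 256 elements per face, from `face * 256`
TILE_DIM = 32


def tile_element_offset(row: int, col: int) -> int:
    """Element offset of (row, col) inside one face-ordered 32x32 tile."""
    if not (0 <= row < TILE_DIM and 0 <= col < TILE_DIM):
        raise ValueError("row and col must be in [0, 31]")
    # Faces are numbered 0..3: top-left, top-right, bottom-left, bottom-right.
    face = (2 if row >= FACE_DIM else 0) + (1 if col >= FACE_DIM else 0)
    return face * FACE_ELEMS + (row % FACE_DIM) * FACE_DIM + (col % FACE_DIM)


# Every coordinate should map to a distinct offset in [0, 1023].
all_offsets = {
    tile_element_offset(r, c) for r in range(TILE_DIM) for c in range(TILE_DIM)
}
```

Note that column 16 of row 0 lands at offset 256, not 16, which is why a plain row-major index would read the wrong element.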

zoecarver · 2026-05-13T18:21:36Z

+  }
+};
+
+struct ElementWriteLowering : OpConversionPattern<ElementWriteOp> {


optional: most of the code here is duplicated with ElementReadLowering could refactor into a helper

zoecarver · 2026-05-13T18:22:07Z

+    "reduce_sum",
+    "reduce_max",


brnorris03 · 2026-05-15T05:24:25Z

Can you explain this design and the reasoning behind it in more detail? It would have been good to discuss this before implementing, but at a minimum it should be documented in docs/development/DFBManagement.md, focusing on the overall approach and how it fits in the rest of the compiler.

At a high level, I don't understand the rationale for using mutable memref<1xi32> cells for cross-scope scalars instead of SSA form. A number of analyses will not see the value dataflow, so any TTL pass that reasons about consumers of CB-attached blocks has to be patched individually to recognize the new ops, wherever an analysis walks SSA to find consumers or upstream dataflow analyses are used (having to change many passes as a result of adding an op is usually a sign of needing to rethink the design).

For example, in ttl-coalesce-dfb-acquires, mayReleaseDFB would not consider element_read to be a DFB consumer, so the pass would incorrectly coalesce cb_waits that have element reads in between (after which the element read would be reading the wrong element). I can provide an example of how this results in incorrect IR.

You may be able to alleviate some of that by adding the -mem2reg to the pipeline(s) early on, but that's more of a workaround than a better design.

There are also a lot of changes to the frontend that seem unrelated to element read/write semantics (e.g., scopes helpers and control flow related changes). I think this PR should only focus on the element access ops (definition and lowering) and not modify the front end for unrelated purposes unless they clearly fit in the design. BTW, #540 implements frontend support for the control flow in things like argmax.

ttssokorac and others added 11 commits May 12, 2026 10:14

removing emitOpError in pattern match and test

b289013

lint

08410ec

removing redundant test

175db27

requiring device for test

aad9e65

arichinsTT marked this pull request as ready for review May 13, 2026 02:50

arichinsTT requested a review from a team as a code owner May 13, 2026 02:50

zoecarver reviewed May 13, 2026

View reviewed changes

Merge branch 'main' into arichins/elementRW

37b6cb4

arichinsTT added 2 commits May 18, 2026 13:09

dropping bf16 temporarily, unsafe module, ttkernel/arith lowering and more examples

1ef8807

Merge branch 'main' into arichins/elementRW

79c39ce

arichinsTT closed this May 21, 2026

Element Read/Write #575

arichinsTT wants to merge 14 commits into main from arichins/elementRW

4 participants

