Element Read/Write #575

Closed
arichinsTT wants to merge 14 commits into main from arichins/elementRW
Contributor

@arichinsTT arichinsTT commented May 12, 2026

Problem description

tt-lang DM (data movement) threads had no way to read or write individual scalar elements of circular buffer (CB) blocks without host readback. This made on-device reductions such as argmax impossible to express: the compute result had to be copied back to the host and re-issued as a dispatch, adding significant latency and complexity.

What's changed

  • Adds element_read and element_write operations that allow DM threads to access individual bf16/f32 elements within a CB block at a given tile coordinate [row, col], operating entirely on-device.
  • New IR ops (TTL_ElementReadOp, TTL_ElementWriteOp): Read/write a single element from a CB-attached tensor at a tile coordinate. Elements are returned/accepted as i32 (raw bits).
  • EmitC lowering pass (ttl-lower-element-access-to-emitc): Lowers the TTL ops to inline C++ lambdas that resolve the CB's L1 address (distinguishing get_read_ptr vs get_write_ptr based on whether the block came from cb.wait() or cb.reserve()),
    compute the hardware face-based tile offset, and emit the memory access.
  • Python frontend: Adds element_read(block, row, col) and element_write(block, row, col, value) syntax in DM kernel bodies. Scalar integer values are auto-cast to i32 at operation boundaries. Scalar variables assigned inside loop/if bodies
    that need to outlive the scope are backed by memref<1xi32> allocated at function entry, preserving values across iterations without violating SSA dominance.
  • Known limitation: element_read returns raw i32 bit patterns. Equality comparison (==) on these values is correct for bf16 (apart from +0 and -0, which differ bitwise, and NaN patterns, which compare equal to themselves), but magnitude comparisons (>, <) are not: bf16 uses a sign-magnitude representation, so comparing the raw bits as unsigned integers misorders negative values and mixed-sign inputs.
    This is tracked in #572 ("[ttl] element_read scan kernel compares bf16 values as raw i32 bit patterns"); helper functions for bf16-aware comparison are follow-on work.
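
To make the limitation concrete, here is a minimal Python sketch of the misordering and of one common workaround: remapping the raw bits to a key whose unsigned order matches numeric order. The helper names `bf16_bits` and `bf16_order_key` are hypothetical and not part of this PR; the conversion truncates a float32 instead of rounding, and NaNs are ignored.

```python
import struct


def bf16_bits(x: float) -> int:
    """Raw bf16 bit pattern of x, obtained by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bf16_order_key(bits: int) -> int:
    """Map a raw bf16 pattern to a 16-bit key that sorts like the numeric value.

    Negative values (sign bit set) have all bits flipped, so larger magnitudes
    sort lower; non-negative values get the sign bit set, so they sort above
    every negative value. NaN patterns are not handled.
    """
    bits &= 0xFFFF
    if bits & 0x8000:
        return ~bits & 0xFFFF
    return bits | 0x8000


# Raw comparison claims -1.0 > 2.0 because the sign bit is the top bit.
raw_says_greater = bf16_bits(-1.0) > bf16_bits(2.0)
# The remapped keys restore the numeric order.
key_says_greater = bf16_order_key(bf16_bits(-1.0)) > bf16_order_key(bf16_bits(2.0))
```

Under this mapping -0.0 sorts just below +0.0, which may or may not be what a scan kernel wants; the bf16-aware helpers planned in #572 may take a different approach.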

Ticket

#572

Checklist

  • New/Existing tests provide coverage for changes

ttssokorac and others added 11 commits May 12, 2026 10:14
Enable DM threads to read/write individual bf16/f32 elements from
circular buffer blocks, eliminating host readback for argmax. Adds TTL
dialect ops, Python frontend syntax handlers, and an EmitC lowering
pass that emits inline C++ helpers handling face-based 32x32 tile layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thmetic

element_write now auto-casts i64 literals and computed values to i32 via
arith.TruncIOp, and handles Index→i32 via arith.IndexCastOp. Also adds
test coverage for loop variable column indexing, if-conditionals on
element_read results, and scalar arithmetic in DM threads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scalar variables assigned inside for-loop bodies now survive the loop
via memref-backed storage. When a variable is assigned inside a loop
(detected by symbol table depth > 1), the compiler allocates a
memref<1xi32> at function entry, stores on each assignment, and loads
on read. This enables patterns like:

    for c in range(32):
        val = ttl.element_read(blk, 0, c)
    ttl.element_write(blk, 0, 0, val)  # val survives loop

Also fixes _load_func_arg error message crash when node.func is an
ast.Attribute (e.g., ttl.element_write) instead of ast.Name.
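
The store-on-assignment, load-on-read scheme described in this commit can be modeled in plain Python with a one-slot list standing in for the memref<1xi32>. This is only an illustration of the idea, not the compiler's generated code; `scan_last` is a hypothetical name.

```python
def scan_last(row_values):
    """Model of `val` surviving the loop through a one-slot cell."""
    cell = [0]  # stands in for memref<1xi32>, allocated once at function entry
    for c in range(len(row_values)):
        cell[0] = row_values[c]  # store on every assignment inside the loop
    return cell[0]  # load when `val` is read after the loop


last = scan_last(list(range(32)))
```

One consequence visible in the model: if the loop body never runs, the read after the loop still yields whatever the cell held initially (0 in this sketch).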

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…p vars

The cross-scope memref treatment was too aggressive — it wrapped ALL
assignments inside for-loops (including tx = ttl.copy(), m = tid // Nt,
etc.) in memrefs, breaking tx.wait() and tensor subscript indexing.

Now only i32 values (from element_read) get the memref treatment.
Index, i64, transfer handles, tensors, and CBs use normal SSA scoping.
Also adds regression test for the standard DM write loop pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r types

Three fixes to enable fully on-device argmax with element_read/write:

1. ttl_ast.py: Broaden _is_integer_scalar to accept i64/index types for
   outer-scope variable updates (enables best_idx = tid * 32 + c inside
   if/for blocks). Keep narrow _is_i32_scalar for new loop variables to
   avoid breaking existing kernels' index subscripts.

2. LowerElementAccessToEmitC.cpp: Change helper functions from static
   inline (invalid nested inside kernel_main) to C++ lambdas (valid in
   C++17 function bodies). Fix missing semicolons after lambda closing
   braces.

3. Add parallel_index_find_kernel (kernel 3) to argmax.py — compiles
   and runs at 0.67ms but has correctness issues in index computation
   that need further debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix API calls (kernel->operation, buffer_factor->block_count)
- Fix memref auto-load in visit_Name for loop-carried variables
- Fix alloca placement, nested scope lookup, i1 exclusion guard
- Change _cast_to_i32 from ExtSIOp to ExtUIOp for unsigned bit patterns
- Register EmitC dialect in Passes.td for isolated pass execution
- Add resolveBlockDirection error diagnostic instead of silent fallback
- Add op verifiers: static bounds check (row/col in [0,31]) and
  single-tile block requirement (tensor<1x1x!ttcore.tile<...>>)
- Add MLIR lit tests (positive + negative) and Python compile-only tests
- File #572 for bf16 bit-pattern magnitude comparison limitation
  write rejection, bounds checking, single-tile constraint)
- LowerElementAccessToEmitC pass emitting face-based tile layout helpers
  with runtime bounds assertions
- Python DSL element_read()/element_write() with type conversions and
  cross-scope scalar variable support via memref<1xi32> promotion
- Restore reduce-full-fp32 flag inadvertently dropped from pipeline
- MLIR lit tests and Python compile-only regression tests
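
The ExtSIOp to ExtUIOp change listed above matters whenever a narrow bit pattern has its top bit set. Below is a minimal Python model of the two extensions (arith.extsi and arith.extui in MLIR). The 16-bit source width is an assumption chosen for illustration, and the function names are hypothetical; neither is taken from the PR.

```python
def ext_si(bits: int, width: int, to: int = 32) -> int:
    """Sign-extend a `width`-bit pattern to `to` bits (models arith.extsi)."""
    sign = 1 << (width - 1)
    value = ((bits & ((1 << width) - 1)) ^ sign) - sign  # two's-complement value
    return value & ((1 << to) - 1)


def ext_ui(bits: int, width: int, to: int = 32) -> int:
    """Zero-extend a `width`-bit pattern to `to` bits (models arith.extui)."""
    return bits & ((1 << width) - 1)


pattern = 0xBF80  # bf16 encoding of -1.0, top bit set
signed = ext_si(pattern, 16)    # upper 16 bits filled with ones
unsigned = ext_ui(pattern, 16)  # upper 16 bits stay zero
```

Zero extension keeps the low bits as the only non-zero content, which is what a raw bit pattern carried in an i32 needs.
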
@arichinsTT arichinsTT marked this pull request as ready for review May 13, 2026 02:50
@arichinsTT arichinsTT requested a review from a team as a code owner May 13, 2026 02:50
// Use lambdas (valid inside function bodies in C++17) instead of nested
// function definitions, which are not allowed in C++.
if (needsBF16) {
emitVerbatim(loc,
Contributor

I'm all for verbatim emitc but I wonder if there's an existing ttkernel op we can use?

Contributor

Or maybe this can all be expressed in GEP/memref ops?

Contributor Author

I agree, looking into it rn

Contributor Author

I think the ttkernel route is the best option, tho it would require one new op for the load/store dereference, which we could do as a verbatim for now


auto i32Type = rewriter.getI32Type();

auto *ctaOp = createLiteral(
Contributor

Do these need to factor in base cta index?

Contributor Author

probably, but will be switching to ttkernel op for getting that

" reinterpret_cast<volatile tt_l1_ptr uint16_t*>(l1_addr);"
" uint32_t face = (row >= 16 ? 2 : 0) + (col >= 16 ? 1 : 0);"
" uint32_t offset = face * 256 + (row % 16) * 16 + (col % 16);"
" base[offset] = (uint16_t)val;"
Contributor

should we error rather than truncating?

Contributor Author

truncating for the type?
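
For reference, the offset arithmetic in the snippet under review can be modeled in Python as below, together with one possible reading of "error rather than truncating": rejecting values that do not fit in 16 bits instead of silently dropping the upper bits. `face_offset` and `store_u16_checked` are hypothetical names, and the 1024-slot list stands in for the L1 storage of one 32x32 bf16 tile.

```python
TILE_DIM = 32
FACE_DIM = 16
FACE_ELEMS = FACE_DIM * FACE_DIM  # 256 elements per 16x16 face


def face_offset(row: int, col: int) -> int:
    """Element offset of (row, col) in a tile stored as four 16x16 faces."""
    if not (0 <= row < TILE_DIM and 0 <= col < TILE_DIM):
        raise ValueError(f"({row}, {col}) is outside the 32x32 tile")
    face = (2 if row >= FACE_DIM else 0) + (1 if col >= FACE_DIM else 0)
    return face * FACE_ELEMS + (row % FACE_DIM) * FACE_DIM + (col % FACE_DIM)


def store_u16_checked(tile: list, row: int, col: int, val: int) -> None:
    """Store val at (row, col), raising instead of truncating to 16 bits."""
    if not 0 <= val <= 0xFFFF:
        raise ValueError(f"value {val:#x} does not fit in 16 bits")
    tile[face_offset(row, col)] = val


tile = [0] * (TILE_DIM * TILE_DIM)
store_u16_checked(tile, 17, 3, 0x3F80)  # lands in face 2
```

The mapping is a bijection onto 0..1023, so every tile coordinate gets its own slot.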

}
};

struct ElementWriteLowering : OpConversionPattern<ElementWriteOp> {
Contributor

optional: most of the code here is duplicated with ElementReadLowering; could refactor into a helper

Comment thread python/ttl/operators.py
Comment on lines +741 to +742
"reduce_sum",
"reduce_max",
Contributor

unrelated?

@brnorris03
Contributor

Can you explain this design and the reasoning behind it in more detail? It would have been good to discuss this before implementing, but at least it should be documented in docs/development/DFBManagement.md, focusing on the overall approach and how it fits in the rest of the compiler.

At a high level, I don't understand the rationale for using mutable memref<1xi32> cells for cross-scope scalars instead of SSA form. A number of analyses will not see the value dataflow, so any TTL pass that reasons about consumers of CB-attached blocks (whether it walks SSA to find consumers or relies on upstream dataflow analyses) has to be patched individually to recognize the new ops. Having to change many passes as a result of adding an op is usually a sign that the design needs rethinking.

For example, in ttl-coalesce-dfb-acquires, mayReleaseDFB would not consider element_read to be a DFB consumer, so the pass would incorrectly coalesce cb_waits that have element reads in between (after which the element read would read the wrong element). I can provide an example of how this results in incorrect IR.

You may be able to alleviate some of that by adding -mem2reg to the pipeline(s) early on, but that's more of a workaround than a better design.

There are also a lot of changes to the frontend that seem unrelated to element read/write semantics (e.g., scopes helpers and control flow related changes). I think this PR should only focus on the element access ops (definition and lowering) and not modify the front end for unrelated purposes unless they clearly fit in the design. BTW, #540 implements frontend support for the control flow in things like argmax.

@arichinsTT arichinsTT closed this May 21, 2026

4 participants