[ttl] Make TRID DMA wait lowering selectable (default: global barriers)#267

Open
shutovilyaep wants to merge 8 commits into tenstorrent:main from shutovilyaep:feat/lower_copy_wait

@shutovilyaep shutovilyaep commented Jan 23, 2026

PR #267: [ttl] Make TRID DMA wait lowering selectable (default: global barriers)

Why

The current lowering emits only global DMA barriers; the TRID-scoped waits needed for #87 are not supported. Default behavior must stay unchanged for existing users.

What

  • New pass option use-trid-barriers (default off): choose global barriers (current) or TRID barriers.
  • When on: copy lowering emits set_trid; wait lowering emits barrier_with_trid. TRID slots (16) are tracked; reuse of an in-flight slot triggers an evict barrier first.
  • Pipeline and TTKernel cleanup forward the option; TRID dedup patterns run only in TRID mode.
  • Tests: conversion, translation, ME2E, and Python lit updated so both modes are covered (default path kept; TRID path added or enabled where needed).

How

  • Pass and pipeline accept the option and pass it through. In TRID mode, transfer handles become i32; SCF type conversions keep regions legal.
  • TridAllocator tracks 16 slots and direction; before reusing a busy slot the lowering emits the matching barrier. In global mode the allocator is unused (handle is 0).
  • Cleanup: TRID dedup is registered only when the option is true; dedup compares TRID and NOC.
  • Tests use the option explicitly where TRID output is checked; translation tests keep default RUN and add a second RUN with TRID checks where output differs.
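
The difference between the two modes can be sketched as the op sequence that one copy followed by one wait lowers to. This is a minimal illustration, not the pass itself: `lowerCopyThenWait`, `Dir`, and the `<dma ...>` placeholder are hypothetical names, and only the `set_trid`, `barrier_with_trid`, and global barrier op names come from the PR description.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Direction of a DMA transfer (hypothetical enum for this sketch).
enum class Dir { Read, Write };

// Returns the names of the ops that one copy followed by one wait would
// produce. "<dma ...>" stands in for the actual transfer op, which this
// sketch does not model.
std::vector<std::string> lowerCopyThenWait(Dir dir, bool useTridBarriers,
                                           uint32_t trid) {
  const std::string d = (dir == Dir::Read) ? "read" : "write";
  std::vector<std::string> ops;
  if (useTridBarriers) {
    // TRID mode: tag the upcoming transfer with a transaction ID.
    ops.push_back("noc_async_" + d + "_set_trid(" + std::to_string(trid) + ")");
  }
  ops.push_back("<dma " + d + ">");
  if (useTridBarriers) {
    // TRID mode: wait only for the transfer carrying this ID.
    ops.push_back("noc_async_" + d + "_barrier_with_trid(" +
                  std::to_string(trid) + ")");
  } else {
    // Default mode: wait for all outstanding transfers in this direction.
    ops.push_back("noc_async_" + d + "_barrier()");
  }
  return ops;
}
```

With `useTridBarriers` set to `false` the sequence has two entries and ends in the global barrier; with `true` it has three entries and the wait is scoped to the given TRID.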

@shutovilyaep shutovilyaep marked this pull request as ready for review January 26, 2026 13:47
@shutovilyaep shutovilyaep requested a review from a team as a code owner January 26, 2026 13:47
@shutovilyaep
Author

Comment received from @brnorris03:

It would be great if you can implement this as a pass option so we can choose between different lowerings (there will probably be more optimizations later), keeping the default the same as what's in main now.

@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch 2 times, most recently from d754a28 to 42dbe50 Compare January 30, 2026 11:59
@shutovilyaep shutovilyaep changed the title TTL: Lower async DMA waits to TRID barriers [ttl] Make TRID DMA wait lowering selectable (default: global barriers) Jan 30, 2026
shutovilyaep added a commit to shutovilyaep/tt-lang that referenced this pull request Jan 30, 2026
…getTileGridShapeFromValue

The test used tensor<32x32xf32> (element type f32). Copy lowering calls
getTileGridShapeFromValue() which asserts the tensor has TileType element
type. Use tensor<1x1x!ttcore.tile<32x32,f32>> like other DMA tests to fix
CI crash (SIGABRT) in TTLToTTKernel conversion.

Attempt to fix CI failure in PR tenstorrent#267 / #1222.
@shutovilyaep
Author

/codeowners ping

Contributor

@brnorris03 brnorris03 left a comment

Looks great, thank you! The only more significant issue I see is the lack of runtime tests. I think the best approach for now is to parameterize (some of) the test/me2e tests with the new option; what do you think? I can help with more concrete suggestions on how to do that if you agree.

Some general questions, mainly stemming from my lack of deep knowledge of the low-level semantics of the metal ops.

  1. Is the TRID value semantically meaningful, or does it just need to be unique per copy? I am guessing order doesn't matter? As defined, the generated TRIDs could be nondeterministic (but correctly unique) due to parallel pattern application.

  2. With the new ops requiring explicit NOC, I see that NOC 0 is always used -- is this appropriate or something that needs to be generalized (perhaps later PR)?

Again, thank you for contributing this!!

Comment on lines +73 to +82
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncReadBarrierOp>>(
    patterns.getContext());
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncWriteBarrierOp>>(
    patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncReadBarrierWithTridOp>>(
        patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncWriteBarrierWithTridOp>>(
        patterns.getContext());
Contributor

Probably doesn't matter that much, but could you make the relevant patterns conditional on the option that enables TRID?

Author

fixed:

  • populateTTKernelCleanupPatterns now takes useTridBarriers (default false in header); TRID dedup patterns are only added when true.
  • Convert pass calls it with useTridBarriers so the option is forwarded to cleanup.

Comment on lines 21 to +25

let options = [
  Option<"useTridBarriers", "use-trid-barriers", "bool", "false",
         "Use TRID-aware DMA waits (barrier_with_trid) instead of global barriers.">,
];
Contributor

Thank you for adding the option! Not asking you to do this in the PR but it would be interesting to profile the different approaches with a small set of representative benchmarks and set the default based on that (perhaps add a short TODO to that effect here if you agree?).

Author

  • Can't run the profiling; no device is available.
  • Added TODO in pass description: “Profile both modes on representative benchmarks and consider changing the default.”

Comment on lines +575 to +581
class TridAllocator {
public:
  uint32_t allocateTrid() { return nextTrid++ & 0xF; }

private:
  uint32_t nextTrid = 0;
};
Contributor

There is wrapping at 16 TRIDs, but what happens if the 0th, etc are still not completed at that point? Is there any way to check/detect TRID overflow? Maybe add a TODO for future improvement to make this more robust.
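
The wrap-around concern can be made concrete with a small standalone sketch. `WrappingTridAllocator` copies the counter scheme from the snippet under review; `firstAliasedAllocation` is a hypothetical helper, added only for illustration, that issues copies with no intervening waits and reports the first one whose TRID is still in flight.

```cpp
#include <cstdint>
#include <vector>

// Same counter scheme as the snippet under review: the low 4 bits of a
// monotonically increasing counter select one of 16 TRIDs.
class WrappingTridAllocator {
public:
  uint32_t allocateTrid() { return nextTrid++ & 0xF; }

private:
  uint32_t nextTrid = 0;
};

// Hypothetical helper: issues `count` copies with no waits in between and
// returns the zero-based index of the first allocation whose TRID is
// already in flight, or -1 if no TRID is reused.
int firstAliasedAllocation(int count) {
  WrappingTridAllocator alloc;
  std::vector<bool> inFlight(16, false);
  for (int i = 0; i < count; ++i) {
    uint32_t trid = alloc.allocateTrid();
    if (inFlight[trid])
      return i; // this copy would share a TRID with an unfinished one
    inFlight[trid] = true;
  }
  return -1;
}
```

Sixteen back-to-back copies get distinct TRIDs, so `firstAliasedAllocation(16)` returns -1. The seventeenth copy wraps to TRID 0 while the first is still outstanding, so `firstAliasedAllocation(17)` returns 16.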

Author

@shutovilyaep shutovilyaep Feb 24, 2026

fixed:

  • TridAllocator now tracks outstanding TRIDs and their direction. When a TRID would be reused while still in-flight, CopyLowering emits a barrier_with_trid for the old transfer before reassigning.
  • WaitLowering releases TRIDs via releaseTrid() so they can be reused without an auto-barrier.
  • Lit test (17 copies, no intervening waits) verifies auto-barrier on overflow.
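
A minimal sketch of the tracking scheme described in this reply, under stated assumptions: `TrackingTridAllocator`, `Allocation`, and `Direction` are hypothetical names, and the real `TridAllocator` in ConvertTTLToTTKernel.cpp may differ in interface and detail. The allocator records the direction of each outstanding TRID and tells the caller which barrier, if any, must be emitted before the slot is reused.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Direction of the transfer that owns a TRID slot (hypothetical enum).
enum class Direction { Read, Write };

// Result of one allocation: the TRID to use, plus the direction of the
// barrier the caller must emit first if the slot was still in flight.
struct Allocation {
  uint32_t trid;
  std::optional<Direction> evict;
};

class TrackingTridAllocator {
public:
  Allocation allocate(Direction dir) {
    uint32_t trid = next_++ & 0xF;          // 16 slots, round-robin
    Allocation result{trid, slots_[trid]};  // remember the previous owner
    slots_[trid] = dir;                     // slot is now busy for `dir`
    return result;
  }

  // Called when a wait has been lowered for `trid`, so the slot can be
  // reused without an extra barrier.
  void release(uint32_t trid) { slots_[trid & 0xF].reset(); }

private:
  std::array<std::optional<Direction>, 16> slots_{};
  uint32_t next_ = 0;
};
```

Under this sketch, seventeen allocations with no release return an `evict` value only on the seventeenth, which matches the overflow case the lit test is described as covering.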

Contributor

Nice!

@shutovilyaep
Author

@brnorris03 Hello, the fixes were ready about 3 weeks ago, but I haven't pushed them because I have been a little off after a sudden layoff from Tenstorrent. I will try to complete this today, if it is still relevant.

@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch 2 times, most recently from 6222a97 to e4f9ca4 Compare February 26, 2026 12:19
@shutovilyaep
Author

Tried to implement waiting for the DMA to complete when TRID IDs are rotated, instead of leaving a TODO.

Checked by locally running MLIR lit tests.

gloriouskilka pushed a commit to RedOrangeSweater/ML.TT.Lang that referenced this pull request Feb 27, 2026
This file contains the PR description to copy-paste to GitHub.
Will be excluded from final PR.

Made-with: Cursor
@shutovilyaep
Author

@brnorris03 Hello, please take a look. I squashed the changes into atomic commits; this should be mergeable.

Local verification complete (macOS, no TT hardware)

cmake --build build --target check-ttlang
| Test suite | Result |
| --- | --- |
| MLIR lit tests | 63/63 passed |
| Python binding tests | 4/4 passed |

Additional targeted runs:

  • llvm-lit test/ttlang/Conversion/TTLToTTKernel/ - 12/12 passed
  • llvm-lit test/ttlang/Translate/TTLToCpp/ - 11/11 passed

No TT device available for ME2E or hardware execution tests. Ready for CI.

@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch 4 times, most recently from 3987a00 to 8bb8613 Compare February 27, 2026 17:19
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch 3 times, most recently from 37c11ba to 2dba10e Compare February 27, 2026 17:44
Contributor

@brnorris03 brnorris03 left a comment

Apologies for taking so long with the re-review. Overall this looks great; my main concerns at the moment are about test coverage, which should be easy to address. So sorry to learn about the layoff (if you don't mind, can you email me so I have your contact info?).

Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp Outdated
Comment thread test/ttlang/Translate/TTLToCpp/cb_to_tensor_single_tile_write.mlir Outdated
Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp Outdated
Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp Outdated
Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp Outdated
Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp
Comment thread lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp Outdated
Comment thread test/me2e/config_specs.py Outdated
gloriouskilka pushed a commit to RedOrangeSweater/ML.TT.Lang that referenced this pull request Mar 14, 2026
This file contains the PR description to copy-paste to GitHub.
Will be excluded from final PR.
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch 2 times, most recently from 01c7404 to 0a1c087 Compare March 17, 2026 15:40
@shutovilyaep shutovilyaep requested a review from brnorris03 March 17, 2026 15:46
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch from 0a1c087 to d91c351 Compare March 18, 2026 12:02
gloriouskilka pushed a commit to RedOrangeSweater/ML.TT.Lang that referenced this pull request Mar 26, 2026
This file contains the PR description to copy-paste to GitHub.
Will be excluded from final PR.
gloriouskilka pushed a commit to RedOrangeSweater/ML.TT.Lang that referenced this pull request Mar 26, 2026
This file contains the PR description to copy-paste to GitHub.
Will be excluded from final PR.
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch from d91c351 to e44ee28 Compare April 14, 2026 11:01
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch from e44ee28 to 490fc2f Compare May 5, 2026 05:01
Contributor

@brnorris03 brnorris03 left a comment

Ready to land after some deprecated builder usage is updated (e.g., rewriter.create<arith::ConstantIntOp>(loc, 0, 8); should be arith::ConstantIntOp::create(rewriter, loc, 0, 8);, same for all rewriter.create calls).
Thank you!

gloriouskilka pushed a commit to RedOrangeSweater/ML.TT.Lang that referenced this pull request May 11, 2026
Replace deprecated PatternRewriter::create factory usage with static Op::create(rewriter, loc, ...) for arith constants and TTKernel NOC barrier/set_trid ops (review feedback on PR tenstorrent#267).
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch from 3da5e18 to 51b2e96 Compare May 13, 2026 03:59
Add use-trid-barriers option to convert-ttl-to-ttkernel pass and
ttl-to-ttkernel-pipeline. When enabled, ttl.copy emits
noc_async_{read,write}_set_trid before DMA operations, and ttl.wait
emits noc_async_{read,write}_barrier_with_trid instead of global
barriers.

Default behavior (use-trid-barriers=false) preserves existing global
barrier semantics from main branch.

Key changes:
- TridAllocator class manages 16 TRID slots with overflow handling
- lowerTensorCBCopy unified function supports both modes
- CopyLowering/WaitLowering patterns respect useTridBarriers flag
- TTKernel cleanup patterns conditionally registered for TRID mode
- SCF structural type conversions enabled for transfer handle types

TODO: Profile both modes on representative benchmarks and consider
changing the default.

- trid_barriers.mlir: Tests TRID-aware lowering with use-trid-barriers=true
  - Verifies noc_async_{read,write}_set_trid emission
  - Verifies noc_async_{read,write}_barrier_with_trid emission
  - Tests TRID overflow handling (17 copies without waits)

- dma_global_barriers.mlir: Tests default global barrier mode
  - Verifies noc_async_{read,write}_barrier emission (no TRID)
  - Ensures backward compatibility with main branch behavior

- Update existing tests to use explicit use-trid-barriers=true where
  they expect TRID-specific output

Enable use-trid-barriers in TTLToCpp translation tests that verify
TRID-specific C++ codegen output. Tests now explicitly request TRID
mode to match their expected noc_async_*_set_trid and
barrier_with_trid output.

Add use_trid_barriers to E2EConfig and TestConfig to enable runtime
testing of both barrier modes:

- E2EConfig.use_trid_barriers controls pipeline pass option
- TestConfig includes use_trid_barriers for test ID disambiguation
- Pipeline builder forwards option to convert-ttl-to-ttkernel
- Runner includes use_trid_barriers in kernel cache key
- CONFIGS includes one TRID-enabled config for coverage

Test IDs now include _trid suffix when use_trid_barriers=True to
ensure unique pytest node IDs.

Update Python hardware execution tests to use use_trid_barriers=True
for consistent TRID-mode testing. These tests exercise the full
compilation and execution path with TRID-aware DMA barriers.

- Remove unused releaseTrid; use SmallVector + trailing underscore in TridAllocator
- Replace tridAllocator check with assert; remove allocateTrid in non-TRID branch
- Add emitNocBarrier helper; assert i32 for handle in WaitLowering
- cb_to_tensor_single_tile_write: default RUN + TRID RUN with TRID: prefix
- dma_loop_single_tile: relax CHECK for in-loop runtime arg variable
- config_specs: add multi-tile config with use_trid_barriers=True

Addresses: tenstorrent#87

Migrate TRID/barrier and constant op construction in ConvertTTLToTTKernel to the modern Op::create API requested in PR267 review.
@shutovilyaep shutovilyaep force-pushed the feat/lower_copy_wait branch from 51b2e96 to def5a1f Compare May 13, 2026 05:47
@shutovilyaep
Author

@brnorris03 Pushed a history rewrite that removes the unneeded "makeZeroI32" from git entirely; I noticed CI broke due to the unused function. Please take a look.
