[ttl] Make TRID DMA wait lowering selectable (default: global barriers)#267
shutovilyaep wants to merge 8 commits into
…getTileGridShapeFromValue The test used tensor<32x32xf32> (element type f32). Copy lowering calls getTileGridShapeFromValue() which asserts the tensor has TileType element type. Use tensor<1x1x!ttcore.tile<32x32,f32>> like other DMA tests to fix CI crash (SIGABRT) in TTLToTTKernel conversion. Attempt to fix CI failure in PR tenstorrent#267 / #1222.
/codeowners ping |
brnorris03 left a comment
Looks great, thank you! The only significant issue I see is the lack of runtime tests. I think the best approach for now is to parameterize (some of) the test/me2e tests with the new option. What do you think? I can help with more concrete suggestions on how to do that if you agree.
Some general questions, mainly stemming from my lack of deep knowledge of the low-level semantics of the metal ops.
- Is the TRID value semantically meaningful, or does it just need to be unique per copy? I am guessing order doesn't matter? As defined, the generated TRIDs could be nondeterministic (but correctly unique) due to parallel pattern application.
- With the new ops requiring an explicit NOC, I see that NOC 0 is always used. Is this appropriate, or is it something that needs to be generalized (perhaps in a later PR)?
Again, thank you for contributing this!!
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncReadBarrierOp>>(
    patterns.getContext());
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncWriteBarrierOp>>(
    patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncReadBarrierWithTridOp>>(
        patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncWriteBarrierWithTridOp>>(
        patterns.getContext());
Probably doesn't matter that much, but could you make the relevant patterns conditional on the option that enables TRID?
fixed:
- populateTTKernelCleanupPatterns now takes useTridBarriers (default false in header); TRID dedup patterns are only added when true.
- Convert pass calls it with useTridBarriers so the option is forwarded to cleanup.
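
To make the conditional cleanup concrete, here is a minimal, self-contained sketch that models it on a flat list of ops rather than on MLIR rewrite patterns. The `Op` and `OpKind` types and the `cleanup` function are hypothetical stand-ins, and the rule that TRID barriers are merged only when they are adjacent and carry the same TRID is an assumption, not something this thread confirms about DeduplicateConsecutiveTridBarriers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the ops involved; not the TTKernel dialect types.
enum class OpKind {
  ReadBarrier,
  WriteBarrier,
  ReadBarrierWithTrid,
  WriteBarrierWithTrid,
  Other
};

struct Op {
  OpKind kind;
  uint32_t trid = 0; // only meaningful for the *WithTrid kinds
};

bool isGlobalBarrier(OpKind k) {
  return k == OpKind::ReadBarrier || k == OpKind::WriteBarrier;
}

bool isTridBarrier(OpKind k) {
  return k == OpKind::ReadBarrierWithTrid ||
         k == OpKind::WriteBarrierWithTrid;
}

// Drops a barrier that immediately repeats the previous op. TRID barriers
// are only considered when useTridBarriers is true, mirroring the idea of
// registering the TRID dedup patterns conditionally (default false).
std::vector<Op> cleanup(const std::vector<Op> &ops,
                        bool useTridBarriers = false) {
  std::vector<Op> out;
  for (const Op &op : ops) {
    if (!out.empty()) {
      const Op &prev = out.back();
      bool dupGlobal = isGlobalBarrier(op.kind) && prev.kind == op.kind;
      // Assumption: TRID barriers are duplicates only if the TRID matches.
      bool dupTrid = useTridBarriers && isTridBarrier(op.kind) &&
                     prev.kind == op.kind && prev.trid == op.trid;
      if (dupGlobal || dupTrid)
        continue;
    }
    out.push_back(op);
  }
  return out;
}
```

With the flag off, repeated TRID barriers are left untouched, which matches the intent that the default path does not pay for patterns it cannot use.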
let options = [
  Option<"useTridBarriers", "use-trid-barriers", "bool", "false",
         "Use TRID-aware DMA waits (barrier_with_trid) instead of global barriers.">,
];
Thank you for adding the option! Not asking you to do this in the PR but it would be interesting to profile the different approaches with a small set of representative benchmarks and set the default based on that (perhaps add a short TODO to that effect here if you agree?).
- Can't run the profiling: no device available.
- Added TODO in pass description: “Profile both modes on representative benchmarks and consider changing the default.”
class TridAllocator {
public:
  uint32_t allocateTrid() { return nextTrid++ & 0xF; }

private:
  uint32_t nextTrid = 0;
};
The allocator wraps at 16 TRIDs, but what happens if TRID 0 (and later ones) have not completed by then? Is there any way to check for or detect TRID overflow? Maybe add a TODO for a future improvement to make this more robust.
fixed:
- TridAllocator now tracks outstanding TRIDs and their direction. When a TRID would be reused while still in-flight, CopyLowering emits a barrier_with_trid for the old transfer before reassigning.
- WaitLowering releases TRIDs via releaseTrid() so they can be reused without an auto-barrier.
- Lit test (17 copies, no intervening waits) verifies auto-barrier on overflow.
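
The behaviour described in this reply could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code from the PR: the `Direction` enum, the `Allocation` struct, the `mustDrain` field, and the `allocateTrid(Direction)` signature are hypothetical, and only the names TridAllocator, allocateTrid, and releaseTrid come from the thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical direction tag; a real lowering would pick the read or write
// barrier_with_trid op based on it.
enum class Direction { Read, Write };

// Result of an allocation (hypothetical shape).
struct Allocation {
  uint32_t trid;                      // slot handed to the new copy
  std::optional<Direction> mustDrain; // set if the slot was still in flight
};

class TridAllocator {
public:
  static constexpr uint32_t kNumTrids = 16;

  // Round-robin over the 16 slots. If the chosen slot still has an
  // outstanding transfer, report its direction so the caller can emit a
  // barrier for the old transfer before reusing the slot.
  Allocation allocateTrid(Direction dir) {
    uint32_t trid = nextTrid_++ & (kNumTrids - 1);
    Allocation result{trid, inFlight_[trid]};
    inFlight_[trid] = dir;
    return result;
  }

  // Called when a wait has been lowered for this TRID, so the slot can be
  // reused later without an automatic barrier.
  void releaseTrid(uint32_t trid) {
    assert(trid < kNumTrids && "TRID out of range");
    inFlight_[trid].reset();
  }

private:
  uint32_t nextTrid_ = 0;
  std::array<std::optional<Direction>, kNumTrids> inFlight_{};
};
```

In this model, 16 allocations without a release fill every slot, and the 17th returns TRID 0 with `mustDrain` set, which corresponds to the overflow case the lit test exercises.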
@brnorris03 Hello, the fixes were ready about 3 weeks ago, but I haven't pushed them because I was away for a while after a sudden layoff from Tenstorrent. I will try to complete this today, if it is still relevant.
Tried to implement waiting for the DMA to complete when TRIDs are rotated, instead of leaving a TODO. Checked by running the MLIR lit tests locally.
This file contains the PR description to copy-paste to GitHub. Will be excluded from final PR. Made-with: Cursor
@brnorris03 Hello, please take a look. I squashed the commits into atomic ones, so the PR should be mergeable.
Local verification complete (macOS, no TT hardware)
Additional targeted runs:
No TT device available for ME2E or hardware execution tests. Ready for CI.
brnorris03 left a comment
Apologies for taking so long with the re-review. Overall this looks great; my main concerns at the moment are about test coverage, which should be easy to address. So sorry to learn about the layoff (if you don't mind, can you email me so I have your contact info?).
brnorris03 left a comment
Ready to land after some deprecated builder usage is updated (e.g., rewriter.create<arith::ConstantIntOp>(loc, 0, 8); should be arith::ConstantIntOp::create(rewriter, loc, 0, 8);, same for all rewriter.create calls).
Thank you!
Replace deprecated PatternRewriter::create factory usage with static Op::create(rewriter, loc, ...) for arith constants and TTKernel NOC barrier/set_trid ops (review feedback on PR tenstorrent#267).
Add use-trid-barriers option to convert-ttl-to-ttkernel pass and
ttl-to-ttkernel-pipeline. When enabled, ttl.copy emits
noc_async_{read,write}_set_trid before DMA operations, and ttl.wait
emits noc_async_{read,write}_barrier_with_trid instead of global
barriers.
Default behavior (use-trid-barriers=false) preserves existing global
barrier semantics from main branch.
Key changes:
- TridAllocator class manages 16 TRID slots with overflow handling
- lowerTensorCBCopy unified function supports both modes
- CopyLowering/WaitLowering patterns respect useTridBarriers flag
- TTKernel cleanup patterns conditionally registered for TRID mode
- SCF structural type conversions enabled for transfer handle types
TODO: Profile both modes on representative benchmarks and consider
changing the default.
- trid_barriers.mlir: Tests TRID-aware lowering with use-trid-barriers=true
- Verifies noc_async_{read,write}_set_trid emission
- Verifies noc_async_{read,write}_barrier_with_trid emission
- Tests TRID overflow handling (17 copies without waits)
- dma_global_barriers.mlir: Tests default global barrier mode
- Verifies noc_async_{read,write}_barrier emission (no TRID)
- Ensures backward compatibility with main branch behavior
- Update existing tests to use explicit use-trid-barriers=true where
they expect TRID-specific output
Enable use-trid-barriers in TTLToCpp translation tests that verify TRID-specific C++ codegen output. Tests now explicitly request TRID mode to match their expected noc_async_*_set_trid and barrier_with_trid output.
Add use_trid_barriers to E2EConfig and TestConfig to enable runtime testing of both barrier modes:
- E2EConfig.use_trid_barriers controls pipeline pass option
- TestConfig includes use_trid_barriers for test ID disambiguation
- Pipeline builder forwards option to convert-ttl-to-ttkernel
- Runner includes use_trid_barriers in kernel cache key
- CONFIGS includes one TRID-enabled config for coverage
Test IDs now include _trid suffix when use_trid_barriers=True to ensure unique pytest node IDs.
Update Python hardware execution tests to use use_trid_barriers=True for consistent TRID-mode testing. These tests exercise the full compilation and execution path with TRID-aware DMA barriers.
- Remove unused releaseTrid; use SmallVector + trailing underscore in TridAllocator
- Replace tridAllocator check with assert; remove allocateTrid in non-TRID branch
- Add emitNocBarrier helper; assert i32 for handle in WaitLowering
- cb_to_tensor_single_tile_write: default RUN + TRID RUN with TRID: prefix
- dma_loop_single_tile: relax CHECK for in-loop runtime arg variable
- config_specs: add multi-tile config with use_trid_barriers=True
Addresses: tenstorrent#87
Migrate TRID/barrier and constant op construction in ConvertTTLToTTKernel to the modern Op::create API requested in PR267 review.
@brnorris03 I noticed CI broke due to an unused function, so I pushed a change that removes the unneeded "makeZeroI32", rewriting history so it no longer appears in git at all. Please take a look.
PR #267: [ttl] Make TRID DMA wait lowering selectable (default: global barriers)
Why
Current lowering uses global DMA barriers. TRID-scoped waits needed for #87 are not supported. Default behavior must stay unchanged for existing users.
What
use-trid-barriers (default off): choose global barriers (current) or TRID barriers.
How