Fix TPut release fence before TNotify #873
Code Review
This pull request refactors the TNotify release analysis to track pipe drains and DDR-domain release fences independently using a new TNotifyReleaseState struct instead of a simple bitmask. This ensures that internal stores from operations like pto::TPutOp are properly drained and made DDR-visible before a notification signal is published, even if a pipe barrier is already present. The review feedback suggests improving the precision of the analysis in multi-block regions by operating on specific Regions rather than the parent Operation to avoid overly conservative barrier insertions, and adding a check to skip external function declarations in annotateTNotifyRelease to prevent potential crashes when calling getBody() on functions without a body.
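To illustrate why a single bitmask is not enough, here is a minimal sketch of a state that tracks pipe drains and the DDR-domain fence separately. The names `ReleaseStateSketch` and `PipeSketch`, and all of their members, are hypothetical and are not taken from the PR; the real `TNotifyReleaseState` may be shaped differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pipe identifiers; the real pass uses PTO's own pipe enum.
enum class PipeSketch : unsigned { MTE2 = 0, MTE3 = 1, FIX = 2 };

// Hypothetical stand-in for TNotifyReleaseState.
struct ReleaseStateSketch {
  std::uint32_t pendingDrains = 0; // bit i set: pipe i still has undrained GM writes
  bool needsDdrFence = false;      // true: writes are not yet known to be DDR-visible

  static std::uint32_t bit(PipeSketch p) {
    return 1u << static_cast<unsigned>(p);
  }

  // A GM store on pipe p needs both a drain of p and a later DDR fence.
  void notePendingStore(PipeSketch p) {
    pendingDrains |= bit(p);
    needsDdrFence = true;
  }

  // A pipe barrier satisfies the drain only; the fence requirement remains.
  void noteBarrier(PipeSketch p) { pendingDrains &= ~bit(p); }

  bool needsDrain(PipeSketch p) const { return (pendingDrains & bit(p)) != 0; }

  // Conservative join of two control-flow paths: keep every requirement.
  void merge(const ReleaseStateSketch &other) {
    pendingDrains |= other.pendingDrains;
    needsDdrFence = needsDdrFence || other.needsDdrFence;
  }
};
```

In this sketch, a store followed by a barrier on the same pipe leaves `needsDdrFence` set, which is the case the summary describes: a pipe barrier that is already present does not by itself make the store DDR-visible.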
```cpp
TNotifyReleaseState regionState = collectTNotifyReleaseState(&op);
TNotifyReleaseState nestedNotifyState = regionEntryState;
nestedNotifyState.merge(nestedLoopCarriedState);
nestedNotifyState.merge(regionState);
markNestedTNotifyWithState(&op, nestedNotifyState);
```
In multi-block regions, calling collectTNotifyReleaseState(&op) and markNestedTNotifyWithState(&op, ...) on the parent operation op instead of the specific region causes the analysis to be overly conservative. If an operation has multiple regions (such as scf::IfOp with then and else branches), a multi-block structure in one branch will conservatively collect and apply the release state to both branches, potentially leading to redundant barriers/fences.
We can improve precision by overloading collectTNotifyReleaseState and markNestedTNotifyWithState to operate directly on Region & instead of Operation *:
```cpp
static TNotifyReleaseState collectTNotifyReleaseState(Region &region) {
  TNotifyReleaseState state;
  for (Block &block : region)
    for (Operation &nested : block)
      state.merge(collectTNotifyReleaseState(&nested));
  return state;
}

static void markNestedTNotifyWithState(Region &region,
                                       const TNotifyReleaseState &state) {
  region.walk(
      [&](pto::TNotifyOp notify) { setTNotifyReleaseAttrs(notify, state); });
}
```
```cpp
TNotifyReleaseState regionState = collectTNotifyReleaseState(region);
TNotifyReleaseState nestedNotifyState = regionEntryState;
nestedNotifyState.merge(nestedLoopCarriedState);
nestedNotifyState.merge(regionState);
markNestedTNotifyWithState(region, nestedNotifyState);
```
```cpp
for (auto func : module.getOps<func::FuncOp>()) {
  if (func.getBody().hasOneBlock()) {
```
If the module contains external function declarations (which have empty bodies), calling func.getBody().hasOneBlock() might be unsafe or lead to unnecessary processing. It is a standard best practice in MLIR passes to explicitly skip function declarations early.
```cpp
for (auto func : module.getOps<func::FuncOp>()) {
  if (func.isDeclaration())
    continue;
  if (func.getBody().hasOneBlock()) {
```
6d92cfc to ad97aef
Codex Review
This comment is updated automatically by the review bot.
Summary
Review failed at stage Findings. No structured findings were generated because the review process failed early.
Log Tail
ad97aef to bcf801c
bcf801c to f96297f
Update for PR1 hardening:
Local validation:
Follow-up review update:
Validation:
Commit:
Design Document
`docs/designs/ptoas-memory-consistency-design.md` records the memory-consistency contract, PyPTO emission patterns, EmitC lowering support, and current VPTO fail-fast boundary.
Summary
- `pto-memory-consistency` Module pass that validates release/acquire memory-consistency contracts before backend lowering
  - `pto.cmo.clean all #pto.address_space<gm>` means GM cache clean
  - `pto.cmo.invalidate all #pto.address_space<gm>` means GM cache invalidate
  - `pto.fence.release #pto.fence_scope<ddr>` means DDR-domain release fence
  - `pto.fence.acquire #pto.fence_scope<ddr>` means DDR-domain acquire fence
  - `dcci((__gm__ void*)0, ENTIRE_DATA_CACHE, CACHELINE_OUT)`
  - `dcci((__gm__ void*)0, ENTIRE_DATA_CACHE)`
  - `dsb(DSB_DDR)`
- `dcci` or `dsb`; missing or misordered explicit CMO/fence operations are reported as compile-time errors.
- When `pto.fence.release #pto.fence_scope<ddr>` follows pending GM writes on MTE3 or FIX, PTOAS inserts the matching `pto.barrier` immediately before the release fence, so the generated order is `pipe_barrier(<producer pipe>); dsb(DSB_DDR); TNOTIFY`.
- `SyncMacroModel` to recognize macro-op MTE3 phases, so `pto.comm.tput` and other comm macro GM-store phases are validated before `TNotify`.
Fixes #872.
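The release-side rule in the summary can be pictured with a toy model over a straight-line op sequence. This is only a sketch under stated assumptions: `OpSketch`, `LoweringSketch`, and `lowerReleasePath` are hypothetical names, the emitted strings are illustrative labels rather than real PTOAS output, and the actual pass works on MLIR operations with control flow instead of a flat vector.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical op kinds: GM stores on MTE3 or FIX, a DDR release fence, a notify.
enum class OpSketch { StoreMte3, StoreFix, FenceReleaseDdr, Notify };

struct LoweringSketch {
  bool ok = true;               // false: a notify published unfenced GM writes
  std::vector<std::string> seq; // illustrative lowered order
};

// Walks the ops once. A release fence drains every pipe with pending GM
// writes right before the fence; a notify with unfenced writes is an error.
inline LoweringSketch lowerReleasePath(const std::vector<OpSketch> &ops) {
  LoweringSketch out;
  bool pendingMte3 = false, pendingFix = false, unfenced = false;
  for (OpSketch op : ops) {
    switch (op) {
    case OpSketch::StoreMte3:
      pendingMte3 = unfenced = true;
      out.seq.push_back("gm_store(MTE3)");
      break;
    case OpSketch::StoreFix:
      pendingFix = unfenced = true;
      out.seq.push_back("gm_store(FIX)");
      break;
    case OpSketch::FenceReleaseDdr:
      // Auto-inserted pipe drains, mirroring "PTOAS inserts the matching barrier".
      if (pendingMte3) out.seq.push_back("pipe_barrier(PIPE_MTE3)");
      if (pendingFix) out.seq.push_back("pipe_barrier(PIPE_FIX)");
      pendingMte3 = pendingFix = unfenced = false;
      out.seq.push_back("dsb(DSB_DDR)");
      break;
    case OpSketch::Notify:
      if (unfenced) out.ok = false; // missing explicit release fence
      out.seq.push_back("TNOTIFY");
      break;
    }
  }
  return out;
}
```

In this model, a store on MTE3 followed by a release fence and a notify yields the drain, then `dsb(DSB_DDR)`, then `TNOTIFY`, while a store followed directly by a notify is flagged instead of being silently repaired, matching the statement that missing fences are reported as errors.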
PyPTO Contract
- `pto.barrier #pto.pipe<PIPE_MTE3>` or `pto.barrier #pto.pipe<PIPE_FIX>` for `TStore` or `TPUT` publish paths. It should emit the semantic boundary:
  - `pto.tstore` or `pto.comm.tput`
  - `pto.fence.release #pto.fence_scope<ddr>`
  - `pto.comm.tnotify`
- `pto.cmo.clean all #pto.address_space<gm>` before `pto.fence.release #pto.fence_scope<ddr>`.
- `TWait` or successful `TTest`, PyPTO still needs to emit `pto.cmo.invalidate all #pto.address_space<gm>` before `pto.load_scalar`.
Covered Scenarios
- `TNotify`: requires explicit `pto.fence.release #pto.fence_scope<ddr>`; PTOAS auto-inserts the MTE3 pipe drain before the fence
- `TNotify`, such as ACC `TStore` or `TStoreFP`: requires explicit `pto.fence.release #pto.fence_scope<ddr>`; PTOAS auto-inserts the FIX pipe drain before the fence
- `TNotify`: still emits only `pipe_barrier(PIPE_MTE2)`, no DDR fence
- `TNotify`: requires explicit `pto.fence.release #pto.fence_scope<ddr>` and does not duplicate pipe drains
- `TPUT -> TNotify`: `TPUT` is recognized through macro modeling and must be followed by explicit DDR release fence before signal publish; PTOAS supplies the MTE3 drain
- `TBroadcast -> TNotify`: guards the generic `SyncMacroModel` MTE3 phase path, not just a `TPUT` special case
- `TNotify`: requires explicit clean plus DDR release fence before signal publish
- `TWait/TTest -> load_scalar`: requires explicit acquire invalidate before scalar GM payload read
- `memory_consistency_invalid.pto`
- `pto.cmo.*` and `pto.fence.*` with a clear unsupported-lowering diagnostic until VPTO/Bisheng exposes the DSB/DCCI ABI PTOAS should call
Tests
- `git diff --check`
- `python3 .github/scripts/check_license_headers.py --repo hw-native-sys/PTOAS --event-name pull_request --pr-number 873`
- `build-issue872` with LLVM21 and `PTO_ENABLE_PYTHON_BINDING=OFF`
- `ninja -C build-issue872 lib/PTO/Transforms/CMakeFiles/obj.PTOTransforms.dir/PTOMemoryConsistency.cpp.o`
- `ninja -C build-issue872 ptoas`
Local Build Note
- `ptoas` build is blocked by the available LLVM/MLIR environment before reaching this patch code: existing Float8 and `getStridesAndOffset` API mismatches fail in PTO IR/VPTO files.