Skip to content

Fix TPut release fence before TNotify#873

Open
TaoTao-real wants to merge 9 commits into
hw-native-sys:mainfrom
TaoTao-real:codex/fix-issue872-tput-tnotify-release
Open

Fix TPut release fence before TNotify#873
TaoTao-real wants to merge 9 commits into
hw-native-sys:mainfrom
TaoTao-real:codex/fix-issue872-tput-tnotify-release

Conversation

@TaoTao-real

@TaoTao-real TaoTao-real commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Design Document

  • docs/designs/ptoas-memory-consistency-design.md records the memory-consistency contract, PyPTO emission patterns, EmitC lowering support, and current VPTO fail-fast boundary.

Summary

  • add an independent pto-memory-consistency Module pass that validates release/acquire memory-consistency contracts before backend lowering
  • expose explicit memory-consistency semantic ops with PTX/ARM-style naming:
    • pto.cmo.clean all #pto.address_space<gm> means GM cache clean
    • pto.cmo.invalidate all #pto.address_space<gm> means GM cache invalidate
    • pto.fence.release #pto.fence_scope<ddr> means DDR-domain release fence
    • pto.fence.acquire #pto.fence_scope<ddr> means DDR-domain acquire fence
  • EmitC lowering supports these semantic ops today:
    • clean -> dcci((__gm__ void*)0, ENTIRE_DATA_CACHE, CACHELINE_OUT)
    • invalidate -> dcci((__gm__ void*)0, ENTIRE_DATA_CACHE)
    • DDR fence -> dsb(DSB_DDR)
  • VPTO lowering now has an explicit fail-fast stub for these semantic ops. PTOAS reports that VPTO still needs a confirmed DSB/DCCI intrinsic ABI instead of silently leaving unsupported ops in backend IR.
  • PTOAS no longer auto-inserts dcci or dsb; missing or misordered explicit CMO/fence operations are reported as compile-time errors.
  • PTOAS does still derive low-level pipe drains from pending pipe state. When pto.fence.release #pto.fence_scope<ddr> follows pending GM writes on MTE3 or FIX, PTOAS inserts the matching pto.barrier immediately before the release fence, so the generated order is pipe_barrier(<producer pipe>); dsb(DSB_DDR); TNOTIFY.
  • use SyncMacroModel to recognize macro-op MTE3 phases, so pto.comm.tput and other comm macro GM-store phases are validated before TNotify.

Fixes #872.

PyPTO Contract

  • PyPTO should not manually insert pto.barrier #pto.pipe<PIPE_MTE3> or pto.barrier #pto.pipe<PIPE_FIX> for TStore or TPUT publish paths. It should emit the semantic boundary:
    • pto.tstore or pto.comm.tput
    • pto.fence.release #pto.fence_scope<ddr>
    • pto.comm.tnotify
  • PTOAS inserts the required MTE3 or FIX drain before that release fence when pending GM write work exists.
  • For cacheable scalar GM stores, PyPTO still needs to emit pto.cmo.clean all #pto.address_space<gm> before pto.fence.release #pto.fence_scope<ddr>.
  • For scalar GM payload reads after TWait or successful TTest, PyPTO still needs to emit pto.cmo.invalidate all #pto.address_space<gm> before pto.load_scalar.

Covered Scenarios

  • direct MTE3 payload producer before TNotify: requires explicit pto.fence.release #pto.fence_scope<ddr>; PTOAS auto-inserts the MTE3 pipe drain before the fence
  • direct FIX GM payload producer before TNotify, such as ACC TStore or TStoreFP: requires explicit pto.fence.release #pto.fence_scope<ddr>; PTOAS auto-inserts the FIX pipe drain before the fence
  • direct MTE2 work before TNotify: still emits only pipe_barrier(PIPE_MTE2), no DDR fence
  • existing MTE3, FIX, or PIPE_ALL barrier before TNotify: requires explicit pto.fence.release #pto.fence_scope<ddr> and does not duplicate pipe drains
  • TPUT -> TNotify: TPUT is recognized through macro modeling and must be followed by explicit DDR release fence before signal publish; PTOAS supplies the MTE3 drain
  • TBroadcast -> TNotify: guards the generic SyncMacroModel MTE3 phase path, not just a TPUT special case
  • cacheable scalar GM store before TNotify: requires explicit clean plus DDR release fence before signal publish
  • TWait/TTest -> load_scalar: requires explicit acquire invalidate before scalar GM payload read
  • dirty scalar GM store before acquire: requires explicit clean, release fence, then invalidate
  • invalid or missing memory-consistency actions produce diagnostics in memory_consistency_invalid.pto
  • VPTO backend rejects pto.cmo.* and pto.fence.* with a clear unsupported-lowering diagnostic until VPTO/Bisheng exposes the DSB/DCCI ABI PTOAS should call

Tests

  • git diff --check
  • python3 .github/scripts/check_license_headers.py --repo hw-native-sys/PTOAS --event-name pull_request --pr-number 873
  • configured a local build-issue872 with LLVM21 and PTO_ENABLE_PYTHON_BINDING=OFF
  • ninja -C build-issue872 lib/PTO/Transforms/CMakeFiles/obj.PTOTransforms.dir/PTOMemoryConsistency.cpp.o
  • attempted ninja -C build-issue872 ptoas

Local Build Note

  • Local full ptoas build is blocked by the available LLVM/MLIR environment before reaching this patch code: existing Float8 and getStridesAndOffset API mismatches fail in PTO IR/VPTO files.
  • Static checks, license checks, and the modified pass object build pass. The lit tests are intended for CI matching build environment.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the TNotify release analysis to track pipe drains and DDR-domain release fences independently using a new TNotifyReleaseState struct instead of a simple bitmask. This ensures that internal stores from operations like pto::TPutOp are properly drained and made DDR-visible before a notification signal is published, even if a pipe barrier is already present. The review feedback suggests improving the precision of the analysis in multi-block regions by operating on specific Regions rather than the parent Operation to avoid overly conservative barrier insertions, and adding a check to skip external function declarations in annotateTNotifyRelease to prevent potential crashes when calling getBody() on functions without a body.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lib/PTO/Transforms/PTOToEmitC.cpp Outdated
Comment on lines +400 to +404
TNotifyReleaseState regionState = collectTNotifyReleaseState(&op);
TNotifyReleaseState nestedNotifyState = regionEntryState;
nestedNotifyState.merge(nestedLoopCarriedState);
nestedNotifyState.merge(regionState);
markNestedTNotifyWithState(&op, nestedNotifyState);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In multi-block regions, calling collectTNotifyReleaseState(&op) and markNestedTNotifyWithState(&op, ...) on the parent operation op instead of the specific region causes the analysis to be overly conservative. If an operation has multiple regions (such as scf::IfOp with then and else branches), a multi-block structure in one branch will conservatively collect and apply the release state to both branches, potentially leading to redundant barriers/fences.\n\nWe can improve precision by overloading collectTNotifyReleaseState and markNestedTNotifyWithState to operate directly on Region & instead of Operation *:\n\ncpp\nstatic TNotifyReleaseState collectTNotifyReleaseState(Region &region) {\n TNotifyReleaseState state;\n for (Block &block : region)\n for (Operation &nested : block)\n state.merge(collectTNotifyReleaseState(&nested));\n return state;\n}\n\nstatic void markNestedTNotifyWithState(Region &region,\n const TNotifyReleaseState &state) {\n region.walk(\n [&](pto::TNotifyOp notify) { setTNotifyReleaseAttrs(notify, state); });\n}\n

        TNotifyReleaseState regionState = collectTNotifyReleaseState(region);\n        TNotifyReleaseState nestedNotifyState = regionEntryState;\n        nestedNotifyState.merge(nestedLoopCarriedState);\n        nestedNotifyState.merge(regionState);\n        markNestedTNotifyWithState(region, nestedNotifyState);

Comment thread lib/PTO/Transforms/PTOToEmitC.cpp Outdated
Comment on lines 420 to 421
for (auto func : module.getOps<func::FuncOp>()) {
if (func.getBody().hasOneBlock()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the module contains external function declarations (which have empty bodies), calling func.getBody().hasOneBlock() might be unsafe or lead to unnecessary processing. It is a standard best practice in MLIR passes to explicitly skip function declarations early.

  for (auto func : module.getOps<func::FuncOp>()) {\n    if (func.isDeclaration())\n      continue;\n    if (func.getBody().hasOneBlock()) {

@TaoTao-real TaoTao-real force-pushed the codex/fix-issue872-tput-tnotify-release branch 5 times, most recently from 6d92cfc to ad97aef Compare June 29, 2026 02:31
@reedhecre

reedhecre commented Jun 29, 2026

Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

  • PR: Fix TPut release fence before TNotify #873 Fix TPut release fence before TNotify
  • Author: TaoTao-real
  • Base/Head: main / codex/fix-issue872-tput-tnotify-release
  • Head SHA: ec771dc312dc
  • Trigger: PR 有新提交
  • Generated At: 2026-07-01T10:20:44Z
  • Previous Head SHA: 4c6e22c678c1
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

 include/PTO/IR/PTOOps.td                           |  46 ++
 include/PTO/Transforms/MemoryConsistencyAttrs.h    |  35 +
 include/PTO/Transforms/Passes.h                    |   1 +
 include/PTO/Transforms/Passes.td                   |  19 +
 lib/PTO/Transforms/CMakeLists.txt                  |   1 +
 lib/PTO/Transforms/PTOMemoryConsistency.cpp        | 702 +++++++++++++++++++++
 lib/PTO/Transforms/PTOToEmitC.cpp                  | 234 +++----
 lib/PTO/Transforms/VPTOCANN900LLVMEmitter.cpp      |  31 +
 lib/PTO/Transforms/VPTOLLVMEmitter.cpp             |  31 +
 test/lit/pto/issue711_tnotify_mte_drain.pto        | 134 ++++
 test/lit/pto/issue872_tput_tnotify_release.pto     | 167 +++++
 test/lit/pto/memory_consistency_external_func.pto  |  37 ++
 test/lit/pto/memory_consistency_invalid.pto        |  88 +++
 test/lit/pto/memory_consistency_loop_release.pto   |  56 ++
 .../memory_consistency_noninline_call_invalid.pto  |  59 ++
 test/lit/pto/signal_payload_cache_consistency.pto  | 197 ++++++
 test/lit/pto/tnotify_release_local_ops.pto         |  79 +++
 .../vpto/memory_consistency_cmo_unsupported.pto    |  21 +
 .../vpto/memory_consistency_fence_unsupported.pto  |  21 +
 tools/ptoas/ptoas.cpp                              |   1 +
 22 files changed, 2219 insertions(+), 135 deletions(-)
===== END STAGE clone rc=0 @ 2026-07-01 18:20:35 =====

===== STAGE codex-review @ 2026-07-01 18:20:35 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/review_prompt.txt'
[monitor] stage timeout: 1800s
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019f1d31-e148-7830-8c48-b35d49967de3
--------
user
你现在在审查 GitHub PR。

仓库:hw-native-sys/PTOAS
PR:#873 Fix TPut release fence before TNotify
作者:TaoTao-real
base branch:origin/main
head branch:HEAD(当前已 checkout 到 PR head)

要求:
1. 只审查这个 PR 相对 origin/main 的改动,必要时可以看上下文文件。
2. 重点找真实的 correctness / regression / contract mismatch / CI / runtime / compatibility 问题。
3. 不要提纯风格建议,不要提低价值猜测。
4. 严格按优先级输出:
   - P1:高概率会导致错误结果、编译/运行失败、严重回归、发布阻断
   - P2:重要缺陷、行为回归、遗漏校验/测试、较大兼容性问题
   - P3:次要但明确可改的问题
5. 如果没有问题,summary 直接写:未检查到 PR #873 存在问题,并返回 findings=[]。
6. 如果有问题,summary 简洁概括,findings 里每条都要给出:
   - severity
   - title
   - body(说明为什么是问题,尽量具体)
   - file(尽量给相对路径)
   - line(能确定就填整数,否则 null)

建议先查看:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

最终输出必须严格匹配 JSON schema。

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144add67c93c28f-LAX, request id: d95ecdc6-bbad-453a-a3bd-2f2902f56aa4)
Reconnecting... 2/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144adda0e798e58-LAX, request id: 6ffd17a4-4025-4854-bea1-066f29bb92b1)
Reconnecting... 3/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144addeba2c4747-LAX, request id: 50cc3e4a-6b8e-4e4c-b535-798883ad04f0)
Reconnecting... 4/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144ade59e486e22-LAX, request id: 702bfd32-107e-491c-a2d2-ad6ddfe6d946)
Reconnecting... 5/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144adf1298719ff-LAX, request id: 4426c42f-9865-4c7e-87e1-2e288f10dea4)
ERROR: unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144ae059b488c16-LAX, request id: 5a7a5a37-5a8a-446b-a4e9-beaa60fee24e
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_182027_pr873/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-07-01 18:20:44 =====

@TaoTao-real TaoTao-real force-pushed the codex/fix-issue872-tput-tnotify-release branch from ad97aef to bcf801c Compare June 30, 2026 07:21
@TaoTao-real TaoTao-real force-pushed the codex/fix-issue872-tput-tnotify-release branch from bcf801c to f96297f Compare June 30, 2026 07:37
@TaoTao-real

Copy link
Copy Markdown
Contributor Author

Update for PR1 hardening:

  • Removed the broad release fallback that classified every OpPipeInterface MTE3 op as a GM payload publish.
  • TNotify release detection is now payload-specific: TLoad/TPrefetch only request MTE2 drain, TStore MTE3/FIX and TStoreFP request release-side GM write handling, scalar GM stores require explicit CMO+fence, and macro MTE3 phases still cover TPUT/TBroadcast publish paths.
  • Added test/lit/pto/tnotify_release_local_ops.pto to guard local A5 TInsert MTE3 and local TMov FIX before TNotify; neither should require DDR release nor emit release drains.
  • Updated the design doc to call out that MTE3, like FIX, cannot be treated as publish solely from pipe type.

Local validation:

  • git diff --cached --check
  • license header check
  • PTOMemoryConsistency.cpp.o builds in the local LLVM21 configuration
  • full ptoas remains blocked by the existing local LLVM21 Float8/getStridesAndOffset API mismatch, before this patch code.

@TaoTao-real

Copy link
Copy Markdown
Contributor Author

Follow-up review update:

  • Addressed the CFG/region precision concern by making the conservative fallback region-scoped. Multi-block regions now collect and mark pending state only within the current Region, instead of using the parent op and accidentally mixing sibling regions.
  • func.func declarations without bodies are skipped in both release and acquire analysis.
  • Added test/lit/pto/memory_consistency_external_func.pto to guard external function declarations.
  • Updated the design doc to describe the region-scoped conservative traversal and external declaration behavior.

Validation:

  • git diff --cached --check
  • license header check
  • PTOMemoryConsistency.cpp.o builds with the local LLVM21 configuration

Commit: 4c6e22c6 Refine memory consistency region analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Chunked pto.comm.tput (staging tile smaller than transfer) does not drain its stores before a following TNOTIFY → peer reads partial data (a2a3)

2 participants