Add PTODSL A5 DSL ST coverage #886

Draft
jimmychou0 wants to merge 3 commits into
hw-native-sys:mainfrom
jimmychou0:codex/ptodsl-a5-dsl-st-validation

@jimmychou0 jimmychou0 commented Jun 30, 2026

Abstract

This PR adds the first PTODSL-based A5 DSL ST coverage under test/dsl-st/npu_a5 and fixes the frontend/runtime/backend gaps that were exposed while replacing selected tilelang_st cases with PTODSL cases.

Problem scenarios covered:

  • tadd vector tile-op path: validate that PTODSL can author A5 vector tile operations without the tilelang_st harness, including explicit-mode sync authored in the DSL case.
  • tload/tstore data-movement path: PTODSL naturally emits pto.make_tensor_view -> pto.partition_view -> pto.tile.load/store. The old VPTO tile-op pipeline primarily handled the lowered memref view chain, so native TensorViewType / PartitionTensorViewType operands could fail during tile-op expansion or later intrinsic folding.
  • tcolexpand / tcolsum tile-op data movement plus broadcast/reduction-style paths: validate that non-trivial tile shapes, valid rows/cols, and GM view metadata survive through PTODSL, VPTO expansion, and runtime validation.
  • tmatmul cube/MX pipeline path: validate that PTODSL can cover a cube pipeline-style A5 case alongside existing MX DSL ST cases.
  • Explicit same-kind subkernel authoring: @pto.jit(kernel_kind="vector") plus same-kind @pto.simd previously produced redundant section wrappers in some shapes. The intended explicit single-kind form should use function/kernel-kind metadata directly, while mismatched explicit kind plus subkernel kind should fail early.
  • Native build configuration drift: the same MLIR cache entry could be reused across different effective compile configurations such as explicit mode, insert-sync policy, or PTO level.
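
The explicit same-kind rule described above can be sketched in plain Python. This is an illustrative model only, not PTODSL's implementation: `plan_subkernel_scope`, `ScopePlan`, and `KernelKindMismatch` are hypothetical names, and the sketch assumes that `@pto.simd` corresponds to kind `"vector"`, that `@pto.cube` corresponds to kind `"cube"`, and that an omitted `kernel_kind` keeps the section wrapper.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_KERNEL_KIND = "vector"  # historical default when kernel_kind is omitted


class KernelKindMismatch(ValueError):
    """Hypothetical diagnostic for explicit kind vs. subkernel kind conflicts."""


@dataclass(frozen=True)
class ScopePlan:
    effective_kind: str  # kind recorded on the function/kernel metadata
    needs_section: bool  # whether a section wrapper is still emitted


def plan_subkernel_scope(explicit_kind: Optional[str], subkernel_kind: str) -> ScopePlan:
    """Decide how a subkernel scope is lowered inside a jit kernel."""
    if explicit_kind is None:
        # Assumption: kind not authored, so keep the default and the wrapper.
        return ScopePlan(DEFAULT_KERNEL_KIND, needs_section=True)
    if explicit_kind != subkernel_kind:
        # Fail early instead of emitting a contradictory module.
        raise KernelKindMismatch(
            f"kernel_kind={explicit_kind!r} conflicts with "
            f"subkernel kind {subkernel_kind!r}"
        )
    # Same explicit kind: function-level metadata is enough, no wrapper.
    return ScopePlan(explicit_kind, needs_section=False)
```

In this model, `plan_subkernel_scope("vector", "vector")` yields a plan without a section wrapper, while `plan_subkernel_scope("vector", "cube")` raises before any IR would be built.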

Implementation changes:

  • Add test/dsl-st/npu_a5 PTODSL cases for tadd, tload/tstore, tcolexpand, tcolsum, and tmatmul.
  • Track whether kernel_kind was explicitly authored in @pto.jit, while preserving the historical default effective kind of vector when the user omits it.
  • Lower same-kind explicit @pto.simd / @pto.cube scopes without redundant pto.section.* wrappers, and report a diagnostic for explicit kind/subkernel kind mismatches.
  • Map PTODSL mode="explicit" native builds to ptoas --pto-level=level3, and keep explicit mode from implicitly enabling insert-sync.
  • Include the effective compile configuration in the native build cache manifest, so cache hits are invalidated when PTO level, insert-sync policy, mode, or related compile settings change.
  • Teach ExpandTileOp to specialize tile-op templates using native TensorViewType / PartitionTensorViewType operands, including view shape, stride, memory space, and layout in the specialization key where they affect tload/tstore DMA behavior.
  • Teach FoldTileBufIntrinsics to fold tensor_view_addr, get_tensor_view_dim, and get_tensor_view_stride from both the lowered memref chain and the native pto.partition_view -> pto.make_tensor_view chain.
  • Clean up now-dead PTODSL subkernel helper functions after backend-helper inlining so leftover helper bodies do not reach later folding/lowering passes.
  • Update existing DSL ST cases (cube_matrix_pipeline.py, gemv_mx_pipeline.py, predicate_pack.py, t_gm_memory_core.py, vmulscvt.py) for the validated A5/simulator PTODSL flow.
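
The cache-manifest change can be illustrated with a small sketch. It is a hedged model rather than the code in `ptodsl/ptodsl/_runtime/cache.py`: the function names `effective_config` and `cache_key`, the field names, and the `"auto"` mode value used in the usage note are assumptions made for illustration. Only the mapping from `mode="explicit"` to `level3` and the rule that explicit mode does not turn on insert-sync by itself come from the description above.

```python
import hashlib
import json


def effective_config(mode, insert_sync=False, pto_level=None):
    """Resolve the settings that actually reach the native build."""
    if pto_level is None and mode == "explicit":
        # Explicit mode builds at level3 unless the caller overrides it.
        pto_level = "level3"
    # insert_sync stays whatever the caller asked for; explicit mode
    # never enables it implicitly.
    return {"mode": mode, "insert_sync": bool(insert_sync), "pto_level": pto_level}


def cache_key(mlir_text, config):
    """Hash the MLIR together with the effective configuration."""
    payload = json.dumps(
        {
            "mlir": hashlib.sha256(mlir_text.encode("utf-8")).hexdigest(),
            "config": config,
        },
        sort_keys=True,  # key order must not change the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this shape, the same MLIR text built under `effective_config("explicit")` and under `effective_config("auto")` produces two different keys, so a stale artifact from one configuration is not reused for the other.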

Validation

Real A5 NPU validation on ssh a5 under /root/ptoas/pr-work/ptodsl-a5-real-current:

    /home/wengwentao/miniconda3/envs/torch_npu/bin/python test/dsl-st
    /home/wengwentao/miniconda3/envs/torch_npu/bin/python test/dsl-st/npu_a5

Both suites passed on real NPU with ptoas 0.48, CANN /home/wenquan/cann29/cann-9.0.0, and /dev/davinci0.

Local static and frontend checks:

    git diff --check
    python3 -m py_compile ptodsl/ptodsl/_tracing/module_builder.py ptodsl/tests/test_jit_compile.py test/dsl-st/npu_a5/tadd.py test/dsl-st/npu_a5/tcolexpand.py test/dsl-st/npu_a5/tcolsum.py test/dsl-st/npu_a5/tload_store.py
    python3 .github/scripts/check_license_headers.py --repo hw-native-sys/PTOAS --event-name pull_request --pr-number 886

Local VPTO/PTODSL build checks were also run before the last amend:

    ninja -C build-local-vpto ptoas PTOPythonModules
    /Users/jimmychou/work/ptoas/.venv-ptoas-local/bin/python ptodsl/tests/test_jit_compile.py

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces native tensor_view and partition_tensor_view folding support in the FoldTileBufIntrinsics pass, updates ExpandTileOp to include view shape and strides in the specialization key, and adds a pto_level parameter to @pto.jit to forward build-level overrides to ptoas. Additionally, VPTOSplitCVModule is updated to normalize sections in-place for pre-annotated modules. Feedback on the changes highlights a concurrency violation in FoldTileBufIntrinsics where a FuncOp pass queries the parent module's symbol table, a limitation in traceViewChain that fails on nested partitions, and an inefficient cleanup loop that should be optimized using a worklist-based dead code elimination approach.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +94 to +105

    static bool isDeadPTODSLSubkernelHelper(func::FuncOp func) {
      if (!func->hasAttr("pto.ptodsl.subkernel_helper"))
        return false;

      auto module = func->getParentOfType<ModuleOp>();
      if (!module)
        return false;

      SymbolTable symbolTable(module);
      auto uses = symbolTable.getSymbolUses(func, module);
      return uses && uses->empty();
    }

high

Constructing a SymbolTable on the parent ModuleOp and calling getSymbolUses inside a FuncOp pass violates MLIR's pass nesting and concurrency model. Since FuncOp passes can be scheduled to run concurrently on different functions, traversing the parent module to find symbol uses can lead to data races, undefined behavior, or crashes in multi-threaded mode.

To resolve this, consider either:

  1. Changing FoldTileBufIntrinsics to a ModulePass so it can safely perform module-wide symbol analysis.
  2. Avoiding the use-check entirely in this pass (e.g., by relying on a subsequent dead-symbol-elimination pass to clean up dead functions, or simply skipping any function with the pto.ptodsl.subkernel_helper attribute if they are always intended to be inlined).
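
The first option can be pictured with a toy model in Python. This is only a sketch of the idea, not MLIR code: a module is represented as a dictionary from function name to a record with a `helper` flag and a list of called symbols, and `find_dead_helpers` is a hypothetical name. The point it illustrates is that symbol uses are collected once, at module scope, so no per-function step has to inspect sibling functions.

```python
def find_dead_helpers(module):
    """Return helper functions that no function in the module references.

    `module` maps a function name to {"helper": bool, "calls": [names]}.
    """
    referenced = set()
    for body in module.values():
        # One module-wide scan gathers every referenced symbol.
        referenced.update(body["calls"])
    return sorted(
        name
        for name, body in module.items()
        if body["helper"] and name not in referenced
    )
```

For example, a module with a kernel that calls `live_helper` and a second helper that nothing references would report only the second helper as dead.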

Comment on lines +304 to +306

    if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
      auto makeView =
          partition.getSource().getDefiningOp<pto::MakeTensorViewOp>();

high

The current implementation of traceViewChain assumes that the source of a PartitionViewOp is directly defined by a MakeTensorViewOp. If a user authors nested partitions (e.g., make_tensor_view -> partition_view -> partition_view), partition.getSource().getDefiningOp<pto::MakeTensorViewOp>() will return nullptr, causing the validation to fail and emitting a false error.

We should trace back through any nested PartitionViewOps to find the base MakeTensorViewOp.

Suggested change

    - if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
    -   auto makeView =
    -       partition.getSource().getDefiningOp<pto::MakeTensorViewOp>();
    + if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
    +   Value source = partition.getSource();
    +   while (auto parentPart = source.getDefiningOp<pto::PartitionViewOp>())
    +     source = parentPart.getSource();
    +   auto makeView = source.getDefiningOp<pto::MakeTensorViewOp>();
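
The loop in the suggestion can be checked against a small stand-in model in Python. The classes `MakeTensorView` and `PartitionView` and the function `trace_base_view` below are hypothetical stand-ins for the MLIR ops, used only to show the walk through nested partitions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class MakeTensorView:
    name: str


@dataclass
class PartitionView:
    # A partition's source is another partition, a base view, or unknown.
    source: Union["PartitionView", MakeTensorView, None]


def trace_base_view(view) -> Optional[MakeTensorView]:
    """Walk through any number of nested partitions to the base view."""
    source = view
    while isinstance(source, PartitionView):
        source = source.source
    # Anything other than a MakeTensorView at the root is untraceable.
    return source if isinstance(source, MakeTensorView) else None
```

Finding the base view is only part of the job. A real fold would presumably also have to accumulate the offsets of each intermediate partition, which this sketch does not model.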

Comment on lines +947 to +959

    while (true) {
      SmallVector<Operation *, 8> deadViewOps;
      func.walk([&](Operation *op) {
        if ((isa<pto::PartitionViewOp>(op) ||
             isa<pto::MakeTensorViewOp>(op)) &&
            op->use_empty())
          deadViewOps.push_back(op);
      });
      if (deadViewOps.empty())
        break;
      for (auto *op : llvm::reverse(deadViewOps))
        op->erase();
    }

medium

Repeatedly walking the entire function using func.walk in a while (true) loop to erase dead view operations is inefficient, resulting in $O(N \times M)$ complexity where $N$ is the number of dead operations and $M$ is the total number of operations in the function.

We can optimize this to $O(M)$ by using a worklist-based dead code elimination approach.

    SmallVector<Operation *, 8> worklist;
    func.walk([&](Operation *op) {
      if ((isa<pto::PartitionViewOp>(op) || isa<pto::MakeTensorViewOp>(op)) && op->use_empty())
        worklist.push_back(op);
    });
    while (!worklist.empty()) {
      Operation *op = worklist.pop_back_val();
      SmallVector<Value, 4> operands(op->getOperands());
      op->erase();
      for (Value operand : operands) {
        if (Operation *defOp = operand.getDefiningOp()) {
          if ((isa<pto::PartitionViewOp>(defOp) || isa<pto::MakeTensorViewOp>(defOp)) && defOp->use_empty()) {
            worklist.push_back(defOp);
          }
        }
      }
    }
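
The worklist idea can also be exercised outside MLIR with a toy model in Python. The `Op` class and `erase_dead_views` are hypothetical and exist only for illustration. Unlike the C++ sketch above, this version guards against erasing the same op twice, which could otherwise happen if a dead op listed the same view as an operand more than once.

```python
VIEW_KINDS = {"partition_view", "make_tensor_view"}


class Op:
    """Minimal stand-in for an operation with operands and users."""

    def __init__(self, name, kind, operands=()):
        self.name = name
        self.kind = kind
        self.operands = list(operands)
        self.users = set()
        self.erased = False
        for operand in self.operands:
            operand.users.add(self)


def erase_dead_views(ops):
    """Erase unused view ops, then any view ops that become unused."""
    # Seed the worklist with a single scan over all ops.
    worklist = [op for op in ops if op.kind in VIEW_KINDS and not op.users]
    erased = []
    while worklist:
        op = worklist.pop()
        if op.erased:
            continue  # already handled via a duplicate worklist entry
        op.erased = True
        erased.append(op.name)
        for operand in op.operands:
            operand.users.discard(op)
            # Only producers of the erased op can newly become dead.
            if operand.kind in VIEW_KINDS and not operand.users and not operand.erased:
                worklist.append(operand)
    return erased
```

Each op is scanned once for seeding and is revisited only when one of its users is erased, which is the behavior the review comment is asking for.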

reedhecre commented Jun 30, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #886 Add PTODSL A5 DSL ST coverage
  • Author: jimmychou0
  • Base/Head: main / codex/ptodsl-a5-dsl-st-validation
  • Head SHA: e01a8ec149ac
  • Trigger: new commits were pushed to the PR
  • Generated At: 2026-07-01T10:05:44Z
  • Previous Head SHA: c4a238b51523
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review process failed early.

Log Tail

 .../Transforms/PTOInstantiateAndInlineOpLib.cpp    |  18 +-
 ptodsl/docs/user_guide/01-introduction.md          |   6 +-
 .../user_guide/03-kernel-entry-and-subkernels.md   |   6 +-
 ptodsl/ptodsl/_diagnostics.py                      |  10 +
 ptodsl/ptodsl/_jit.py                              |  28 +-
 ptodsl/ptodsl/_runtime/cache.py                    |   4 +
 ptodsl/ptodsl/_runtime/native_build.py             |  38 +-
 ptodsl/ptodsl/_tracing/module_builder.py           |   1 +
 ptodsl/ptodsl/_tracing/session.py                  |  47 ++-
 ptodsl/tests/test_jit_compile.py                   | 103 +++++-
 scripts/sim_dsl.sh                                 |  45 ++-
 test/dsl-st/cube_matrix_pipeline.py                | 113 +++---
 test/dsl-st/gemv_mx_pipeline.py                    |  16 -
 test/dsl-st/npu_a5/__main__.py                     |  32 ++
 test/dsl-st/npu_a5/tadd.py                         | 197 +++++++++++
 test/dsl-st/npu_a5/tcolexpand.py                   | 151 ++++++++
 test/dsl-st/npu_a5/tcolsum.py                      | 150 ++++++++
 test/dsl-st/npu_a5/tload_store.py                  | 209 +++++++++++
 test/dsl-st/npu_a5/tmatmul.py                      | 199 +++++++++++
 test/dsl-st/vmulscvt.py                            |  15 +-
 24 files changed, 2002 insertions(+), 247 deletions(-)
===== END STAGE clone rc=0 @ 2026-07-01 18:05:35 =====

===== STAGE codex-review @ 2026-07-01 18:05:35 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/review_prompt.txt'
[monitor] stage timeout: 1800s
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019f1d24-2429-7bd2-a12e-59a08af2cd29
--------
user
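You are now reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #886 Add PTODSL A5 DSL ST coverage
Author: jimmychou0
base branch: origin/main
head branch: HEAD (already checked out to the PR head)

Requirements:
1. Review only this PR's changes relative to origin/main; consult context files when necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility problems.
3. Do not raise pure style suggestions or low-value speculation.
4. Output strictly by priority:
   - P1: highly likely to cause wrong results, compile/run failures, severe regressions, or release blockers
   - P2: important defects, behavior regressions, missing validation/tests, or significant compatibility problems
   - P3: minor but clearly fixable problems
5. If there are no problems, write the summary as exactly "No problems found in PR #886" and return findings=[].
6. If there are problems, summarize them concisely, and for every finding provide:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested first steps:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.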

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497db3c1daef8-LAX, request id: 8624e299-c274-4f15-90f3-c7df3bd3a664)
Reconnecting... 2/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497deac502ab8-LAX, request id: 410c83c7-5368-4b3b-bea3-4a85e34f51ad)
Reconnecting... 3/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497e398d734db-LAX, request id: 1259f634-604d-4f85-aa4b-d99238cb8a47)
Reconnecting... 4/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497eadd86dcc7-LAX, request id: 8f895a3e-ff47-4eb8-80a1-cbe0e195d09f)
Reconnecting... 5/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497f7e92f2719-LAX, request id: 1f120af0-f7f6-42e9-b871-3c4642d2abfd)
ERROR: unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144980ecfd61ddd-LAX, request id: 7aaa1e33-1664-4f9d-b0e6-2ad7460a8612
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-07-01 18:05:44 =====

@jimmychou0 jimmychou0 force-pushed the codex/ptodsl-a5-dsl-st-validation branch 3 times, most recently from c14549e to ff01d12 Compare July 1, 2026 06:19
@jimmychou0 jimmychou0 force-pushed the codex/ptodsl-a5-dsl-st-validation branch from ff01d12 to 870dadc Compare July 1, 2026 07:34