Add PTODSL A5 DSL ST coverage by jimmychou0 · Pull Request #886 · hw-native-sys/PTOAS

jimmychou0 · 2026-06-30T12:20:06Z

Abstract

This PR adds the first PTODSL-based A5 DSL ST coverage under test/dsl-st/npu_a5 and fixes the frontend/runtime/backend gaps that were exposed while replacing selected tilelang_st cases with PTODSL cases.

Problem scenarios covered:

tadd vector tile-op path: validate that PTODSL can author A5 vector tile operations without the tilelang_st harness, including explicit-mode sync authored in the DSL case.
tload/tstore data-movement path: PTODSL naturally emits pto.make_tensor_view -> pto.partition_view -> pto.tile.load/store. The old VPTO tile-op pipeline primarily handled the lowered memref view chain, so native TensorViewType / PartitionTensorViewType operands could fail during tile-op expansion or later intrinsic folding.
tcolexpand / tcolsum tile-op data movement plus broadcast/reduction-style paths: validate that non-trivial tile shapes, valid rows/cols, and GM view metadata survive through PTODSL, VPTO expansion, and runtime validation.
tmatmul cube/MX pipeline path: validate that PTODSL can cover a cube pipeline-style A5 case alongside existing MX DSL ST cases.
Explicit same-kind subkernel authoring: @pto.jit(kernel_kind="vector") plus same-kind @pto.simd previously produced redundant section wrappers in some shapes. The intended explicit single-kind form should use function/kernel-kind metadata directly, while mismatched explicit kind plus subkernel kind should fail early.
Native build configuration drift: the same MLIR cache entry could be reused across different effective compile configurations such as explicit mode, insert-sync policy, or PTO level.
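The explicit same-kind rule above can be modeled in a few lines. This is only a sketch under stated assumptions: `resolve_kind`, `KindDecision`, and `KernelKindMismatch` are hypothetical names, the `cube` -> `"cube"` mapping is assumed, and the choice to keep section wrappers whenever the kind was not authored explicitly is a guess rather than the PR's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_KERNEL_KIND = "vector"  # historical default when kernel_kind is omitted
# Assumed mapping from subkernel decorator to kernel kind (illustrative only).
SUBKERNEL_KIND = {"simd": "vector", "cube": "cube"}


class KernelKindMismatch(ValueError):
    """Raised when an explicit kernel kind disagrees with the subkernel kind."""


@dataclass(frozen=True)
class KindDecision:
    effective_kind: str
    explicit: bool
    needs_section_wrapper: bool


def resolve_kind(kernel_kind: Optional[str], subkernel: str) -> KindDecision:
    explicit = kernel_kind is not None
    effective = kernel_kind if explicit else DEFAULT_KERNEL_KIND
    sub_kind = SUBKERNEL_KIND[subkernel]
    if explicit and sub_kind != effective:
        # Fail early instead of emitting a section the kernel kind contradicts.
        raise KernelKindMismatch(
            f"kernel_kind={effective!r} conflicts with @pto.{subkernel}")
    # Explicit same-kind: kernel-kind metadata is enough, so no wrapper.
    # Assumption: non-explicit kernels keep their section wrappers.
    return KindDecision(effective, explicit, needs_section_wrapper=not explicit)
```

With this model, `resolve_kind("vector", "simd")` needs no wrapper, `resolve_kind(None, "simd")` falls back to the `vector` default, and `resolve_kind("vector", "cube")` raises before any lowering happens.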

Implementation changes:

Add test/dsl-st/npu_a5 PTODSL cases for tadd, tload/tstore, tcolexpand, tcolsum, and tmatmul.
Track whether kernel_kind was explicitly authored in @pto.jit, while preserving the historical default effective kind of vector when the user omits it.
Lower same-kind explicit @pto.simd / @pto.cube scopes without redundant pto.section.* wrappers, and report a diagnostic for explicit kind/subkernel kind mismatches.
Map PTODSL mode="explicit" native builds to ptoas --pto-level=level3, and keep explicit mode from implicitly enabling insert-sync.
Include the effective compile configuration in the native build cache manifest, so cache hits are invalidated when PTO level, insert-sync policy, mode, or related compile settings change.
Teach ExpandTileOp to specialize tile-op templates using native TensorViewType / PartitionTensorViewType operands, including view shape, stride, memory space, and layout in the specialization key where they affect tload/tstore DMA behavior.
Teach FoldTileBufIntrinsics to fold tensor_view_addr, get_tensor_view_dim, and get_tensor_view_stride from both the lowered memref chain and the native pto.partition_view -> pto.make_tensor_view chain.
Clean up now-dead PTODSL subkernel helper functions after backend-helper inlining so leftover helper bodies do not reach later folding/lowering passes.
Update existing DSL ST cases (cube_matrix_pipeline.py, gemv_mx_pipeline.py, predicate_pack.py, t_gm_memory_core.py, vmulscvt.py) for the validated A5/simulator PTODSL flow.
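To illustrate the cache-manifest change, here is a minimal sketch of keying a build cache on both the MLIR text and the effective compile configuration. The helper names `effective_config` and `cache_key`, the dictionary layout, and the non-explicit mode name `"auto"` used in the usage note are assumptions for illustration; the real manifest in ptodsl/ptodsl/_runtime/cache.py may be structured differently.

```python
import hashlib
import json
from typing import Optional


def effective_config(mode: str, insert_sync: bool = False,
                     pto_level: Optional[str] = None) -> dict:
    """Resolve the settings that actually reach the compiler (illustrative)."""
    if mode == "explicit" and pto_level is None:
        pto_level = "level3"  # explicit mode maps to --pto-level=level3
    # insert_sync is passed through untouched: the mode never turns it on.
    return {"mode": mode, "insert_sync": insert_sync, "pto_level": pto_level}


def cache_key(mlir_text: str, config: dict) -> str:
    """Hash the MLIR text together with the effective configuration."""
    payload = json.dumps({"mlir": mlir_text, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two builds of the same MLIR, one with `effective_config("explicit")` and one with `effective_config("auto")`, then produce different keys, so a stale artifact compiled under another PTO level or sync policy cannot be reused.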

Validation

Real A5 NPU validation on ssh a5 under /root/ptoas/pr-work/ptodsl-a5-real-current:

/home/wengwentao/miniconda3/envs/torch_npu/bin/python test/dsl-st
/home/wengwentao/miniconda3/envs/torch_npu/bin/python test/dsl-st/npu_a5

Both suites passed on real NPU with ptoas 0.48, CANN /home/wenquan/cann29/cann-9.0.0, and /dev/davinci0.

Local static and frontend checks:

git diff --check
python3 -m py_compile ptodsl/ptodsl/_tracing/module_builder.py ptodsl/tests/test_jit_compile.py test/dsl-st/npu_a5/tadd.py test/dsl-st/npu_a5/tcolexpand.py test/dsl-st/npu_a5/tcolsum.py test/dsl-st/npu_a5/tload_store.py
python3 .github/scripts/check_license_headers.py --repo hw-native-sys/PTOAS --event-name pull_request --pr-number 886

Local VPTO/PTODSL build checks were also run before the last amend:

ninja -C build-local-vpto ptoas PTOPythonModules
/Users/jimmychou/work/ptoas/.venv-ptoas-local/bin/python ptodsl/tests/test_jit_compile.py

gemini-code-assist

Code Review

This pull request introduces native tensor_view and partition_tensor_view folding support in the FoldTileBufIntrinsics pass, updates ExpandTileOp to include view shape and strides in the specialization key, and adds a pto_level parameter to @pto.jit to forward build-level overrides to ptoas. Additionally, VPTOSplitCVModule is updated to normalize sections in-place for pre-annotated modules. Feedback on the changes highlights a concurrency violation in FoldTileBufIntrinsics where a FuncOp pass queries the parent module's symbol table, a limitation in traceViewChain that fails on nested partitions, and an inefficient cleanup loop that should be optimized using a worklist-based dead code elimination approach.


gemini-code-assist · 2026-06-30T12:23:46Z

+static bool isDeadPTODSLSubkernelHelper(func::FuncOp func) {
+  if (!func->hasAttr("pto.ptodsl.subkernel_helper"))
+    return false;
+
+  auto module = func->getParentOfType<ModuleOp>();
+  if (!module)
+    return false;
+
+  SymbolTable symbolTable(module);
+  auto uses = symbolTable.getSymbolUses(func, module);
+  return uses && uses->empty();
+}


Constructing a SymbolTable on the parent ModuleOp and calling getSymbolUses inside a FuncOp pass violates MLIR's pass nesting and concurrency model. Since FuncOp passes can be scheduled to run concurrently on different functions, traversing the parent module to find symbol uses can lead to data races, undefined behavior, or crashes in multi-threaded mode.

To resolve this, consider either:

Changing FoldTileBufIntrinsics to a ModulePass so it can safely perform module-wide symbol analysis.

Avoiding the use-check entirely in this pass (e.g., by relying on a subsequent dead-symbol-elimination pass to clean up dead functions, or simply skipping any function with the pto.ptodsl.subkernel_helper attribute if they are always intended to be inlined).
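The first option can be pictured with a small language-neutral model, written here in Python rather than MLIR C++. It only sketches the idea of collecting symbol references once at module scope and then filtering helpers; the dictionary-based module representation and the function name `dead_subkernel_helpers` are hypothetical.

```python
HELPER_ATTR = "pto.ptodsl.subkernel_helper"


def dead_subkernel_helpers(module: dict) -> list:
    """Return helper functions that no function in the module references.

    `module` maps a function name to {"attrs": set, "calls": list}, where
    "calls" lists the symbol names that function refers to.
    """
    # One module-wide scan, done at module scope, so no per-function worker
    # ever has to look outside the function it owns.
    referenced = set()
    for fn in module.values():
        referenced.update(fn["calls"])
    return sorted(
        name for name, fn in module.items()
        if HELPER_ATTR in fn["attrs"] and name not in referenced
    )
```

The module-wide scan is the part that is unsafe inside a function-level pass; performing it once at module scope leaves the per-function folding free of cross-function reads.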

gemini-code-assist · 2026-06-30T12:23:46Z

+    if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
+      auto makeView =
+          partition.getSource().getDefiningOp<pto::MakeTensorViewOp>();


The current implementation of traceViewChain assumes that the source of a PartitionViewOp is directly defined by a MakeTensorViewOp. If a user authors nested partitions (e.g., make_tensor_view -> partition_view -> partition_view), partition.getSource().getDefiningOp<pto::MakeTensorViewOp>() will return nullptr, causing the validation to fail and emitting a false error.

We should trace back through any nested PartitionViewOps to find the base MakeTensorViewOp.

Suggested change

-    if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
-      auto makeView =
-          partition.getSource().getDefiningOp<pto::MakeTensorViewOp>();
+    if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) {
+      Value source = partition.getSource();
+      while (auto parentPart = source.getDefiningOp<pto::PartitionViewOp>())
+        source = parentPart.getSource();
+      auto makeView = source.getDefiningOp<pto::MakeTensorViewOp>();
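The tracing logic in this suggestion can be checked against a small Python model of the view chain. The classes below are stand-ins for the MLIR ops, not the PTOAS API, and `trace_base` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class MakeTensorView:
    name: str


@dataclass
class PartitionView:
    source: Union["PartitionView", MakeTensorView, None]


def trace_base(view) -> Optional[MakeTensorView]:
    """Walk through any number of nested partitions to the base view."""
    while isinstance(view, PartitionView):
        view = view.source
    # Anything other than a MakeTensorView at the root is an unsupported chain.
    return view if isinstance(view, MakeTensorView) else None
```

A chain such as `PartitionView(PartitionView(MakeTensorView("gm")))` resolves to the `gm` base view, while a partition whose root is not a `MakeTensorView` yields `None` and can still be reported as an error.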

gemini-code-assist · 2026-06-30T12:23:46Z

+    while (true) {
+      SmallVector<Operation *, 8> deadViewOps;
+      func.walk([&](Operation *op) {
+        if ((isa<pto::PartitionViewOp>(op) ||
+             isa<pto::MakeTensorViewOp>(op)) &&
+            op->use_empty())
+          deadViewOps.push_back(op);
+      });
+      if (deadViewOps.empty())
+        break;
+      for (auto *op : llvm::reverse(deadViewOps))
+        op->erase();
+    }


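Repeatedly walking the entire function with func.walk inside a while (true) loop to erase dead view operations is inefficient. Each iteration only removes the ops that are dead at that moment, so the loop costs $O(N \times M)$ in the worst case, where $N$ is the number of dead operations (each walk may expose just one more of them) and $M$ is the total number of operations in the function.

A worklist-based dead code elimination approach needs only a single walk: after erasing an op, it re-checks just that op's operands, so the cost is $O(M)$ plus the number of operands visited.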

SmallVector<Operation *, 8> worklist;
func.walk([&](Operation *op) {
  if ((isa<pto::PartitionViewOp>(op) ||
       isa<pto::MakeTensorViewOp>(op)) &&
      op->use_empty())
    worklist.push_back(op);
});
while (!worklist.empty()) {
  Operation *op = worklist.pop_back_val();
  SmallVector<Value, 4> operands(op->getOperands());
  op->erase();
  for (Value operand : operands) {
    if (Operation *defOp = operand.getDefiningOp()) {
      if ((isa<pto::PartitionViewOp>(defOp) ||
           isa<pto::MakeTensorViewOp>(defOp)) &&
          defOp->use_empty()) {
        worklist.push_back(defOp);
      }
    }
  }
}
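The same worklist idea can be exercised outside MLIR with a small Python model. The `Op` class and `erase_dead_views` are hypothetical stand-ins, and the `erased` guard is an addition not present in the suggestion: it protects against an op being queued twice, which could happen if an erased op used the same view value in more than one operand.

```python
VIEW_KINDS = {"partition_view", "make_tensor_view"}


class Op:
    """Tiny stand-in for an MLIR operation with a single result."""

    def __init__(self, name, kind, operands=()):
        self.name, self.kind = name, kind
        self.operands = list(operands)
        self.users = set()
        for producer in self.operands:
            producer.users.add(self)


def erase_dead_views(ops):
    """Remove unused view ops, following operand chains via a worklist."""
    def is_dead_view(op):
        return op.kind in VIEW_KINDS and not op.users

    worklist = [op for op in ops if is_dead_view(op)]  # the single walk
    erased = set()
    while worklist:
        op = worklist.pop()
        if op in erased:  # guard against duplicate worklist entries
            continue
        erased.add(op)
        for producer in op.operands:
            producer.users.discard(op)
            # Only the erased op's operands can have become dead.
            if producer not in erased and is_dead_view(producer):
                worklist.append(producer)
    return [op for op in ops if op not in erased]
```

For a function holding one unused `make_tensor_view -> partition_view -> partition_view` chain and one chain that feeds a load, only the unused chain is removed, and each op is visited a bounded number of times.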

reedhecre · 2026-06-30T13:15:28Z

Codex Review

This comment is updated automatically by the review bot.

PR: #886 Add PTODSL A5 DSL ST coverage
Author: jimmychou0
Base/Head: main / codex/ptodsl-a5-dsl-st-validation
Head SHA: e01a8ec149ac
Trigger: new commits were pushed to the PR
Generated At: 2026-07-01T10:05:44Z
Previous Head SHA: c4a238b51523
Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review process failed early.

Log Tail

 .../Transforms/PTOInstantiateAndInlineOpLib.cpp    |  18 +-
 ptodsl/docs/user_guide/01-introduction.md          |   6 +-
 .../user_guide/03-kernel-entry-and-subkernels.md   |   6 +-
 ptodsl/ptodsl/_diagnostics.py                      |  10 +
 ptodsl/ptodsl/_jit.py                              |  28 +-
 ptodsl/ptodsl/_runtime/cache.py                    |   4 +
 ptodsl/ptodsl/_runtime/native_build.py             |  38 +-
 ptodsl/ptodsl/_tracing/module_builder.py           |   1 +
 ptodsl/ptodsl/_tracing/session.py                  |  47 ++-
 ptodsl/tests/test_jit_compile.py                   | 103 +++++-
 scripts/sim_dsl.sh                                 |  45 ++-
 test/dsl-st/cube_matrix_pipeline.py                | 113 +++---
 test/dsl-st/gemv_mx_pipeline.py                    |  16 -
 test/dsl-st/npu_a5/__main__.py                     |  32 ++
 test/dsl-st/npu_a5/tadd.py                         | 197 +++++++++++
 test/dsl-st/npu_a5/tcolexpand.py                   | 151 ++++++++
 test/dsl-st/npu_a5/tcolsum.py                      | 150 ++++++++
 test/dsl-st/npu_a5/tload_store.py                  | 209 +++++++++++
 test/dsl-st/npu_a5/tmatmul.py                      | 199 +++++++++++
 test/dsl-st/vmulscvt.py                            |  15 +-
 24 files changed, 2002 insertions(+), 247 deletions(-)
===== END STAGE clone rc=0 @ 2026-07-01 18:05:35 =====

===== STAGE codex-review @ 2026-07-01 18:05:35 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/review_prompt.txt'
[monitor] stage timeout: 1800s
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019f1d24-2429-7bd2-a12e-59a08af2cd29
--------
user
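You are reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #886 Add PTODSL A5 DSL ST coverage
Author: jimmychou0
base branch: origin/main
head branch: HEAD (the PR head is already checked out)

Requirements:
1. Review only this PR's changes relative to origin/main; consult surrounding context files when necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility problems.
3. Do not raise purely stylistic suggestions or low-value speculation.
4. Output strictly in priority order:
   - P1: highly likely to cause wrong results, compile/runtime failures, severe regressions, or release blockers
   - P2: important defects, behavior regressions, missing validation/tests, or significant compatibility problems
   - P3: minor but clearly fixable problems
5. If there are no problems, write the summary as exactly "No problems found in PR #886" and return findings=[].
6. If there are problems, summarize them concisely, and give each entry in findings:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested starting points:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.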

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497db3c1daef8-LAX, request id: 8624e299-c274-4f15-90f3-c7df3bd3a664)
Reconnecting... 2/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497deac502ab8-LAX, request id: 410c83c7-5368-4b3b-bea3-4a85e34f51ad)
Reconnecting... 3/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497e398d734db-LAX, request id: 1259f634-604d-4f85-aa4b-d99238cb8a47)
Reconnecting... 4/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497eadd86dcc7-LAX, request id: 8f895a3e-ff47-4eb8-80a1-cbe0e195d09f)
Reconnecting... 5/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14497f7e92f2719-LAX, request id: 1f120af0-f7f6-42e9-b871-3c4642d2abfd)
ERROR: unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a144980ecfd61ddd-LAX, request id: 7aaa1e33-1664-4f9d-b0e6-2ad7460a8612
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_180526_pr886/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-07-01 18:05:44 =====

gemini-code-assist Bot reviewed Jun 30, 2026

jimmychou0 force-pushed the codex/ptodsl-a5-dsl-st-validation branch 3 times, most recently from c14549e to ff01d12 Compare July 1, 2026 06:19

Add PTODSL A5 DSL ST coverage

870dadc

jimmychou0 force-pushed the codex/ptodsl-a5-dsl-st-validation branch from ff01d12 to 870dadc Compare July 1, 2026 07:34

jimmychou0 added 2 commits July 1, 2026 16:43

Fix PTODSL simulator CI environment handling

c4a238b

Fix PTODSL DSL ST simulator CI

e01a8ec

jimmychou0 wants to merge 3 commits into hw-native-sys:main from jimmychou0:codex/ptodsl-a5-dsl-st-validation
