Add PTODSL A5 DSL ST coverage#886
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces native tensor_view and partition_tensor_view folding support in the FoldTileBufIntrinsics pass, updates ExpandTileOp to include view shape and strides in the specialization key, and adds a pto_level parameter to @pto.jit to forward build-level overrides to ptoas. Additionally, VPTOSplitCVModule is updated to normalize sections in-place for pre-annotated modules. Feedback on the changes highlights a concurrency violation in FoldTileBufIntrinsics where a FuncOp pass queries the parent module's symbol table, a limitation in traceViewChain that fails on nested partitions, and an inefficient cleanup loop that should be optimized using a worklist-based dead code elimination approach.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static bool isDeadPTODSLSubkernelHelper(func::FuncOp func) { | ||
| if (!func->hasAttr("pto.ptodsl.subkernel_helper")) | ||
| return false; | ||
|
|
||
| auto module = func->getParentOfType<ModuleOp>(); | ||
| if (!module) | ||
| return false; | ||
|
|
||
| SymbolTable symbolTable(module); | ||
| auto uses = symbolTable.getSymbolUses(func, module); | ||
| return uses && uses->empty(); | ||
| } |
There was a problem hiding this comment.
Constructing a SymbolTable on the parent ModuleOp and calling getSymbolUses inside a FuncOp pass violates MLIR's pass nesting and concurrency model. Since FuncOp passes can be scheduled to run concurrently on different functions, traversing the parent module to find symbol uses can lead to data races, undefined behavior, or crashes in multi-threaded mode.
To resolve this, consider either:
- Changing
FoldTileBufIntrinsicsto aModulePassso it can safely perform module-wide symbol analysis. - Avoiding the use-check entirely in this pass (e.g., by relying on a subsequent dead-symbol-elimination pass to clean up dead functions, or simply skipping any function with the
pto.ptodsl.subkernel_helperattribute if they are always intended to be inlined).
| if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) { | ||
| auto makeView = | ||
| partition.getSource().getDefiningOp<pto::MakeTensorViewOp>(); |
There was a problem hiding this comment.
The current implementation of traceViewChain assumes that the source of a PartitionViewOp is directly defined by a MakeTensorViewOp. If a user authors nested partitions (e.g., make_tensor_view -> partition_view -> partition_view), partition.getSource().getDefiningOp<pto::MakeTensorViewOp>() will return nullptr, causing the validation to fail and emitting a false error.
We should trace back through any nested PartitionViewOps to find the base MakeTensorViewOp.
| if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) { | |
| auto makeView = | |
| partition.getSource().getDefiningOp<pto::MakeTensorViewOp>(); | |
| if (auto partition = view.getDefiningOp<pto::PartitionViewOp>()) { | |
| Value source = partition.getSource(); | |
| while (auto parentPart = source.getDefiningOp<pto::PartitionViewOp>()) | |
| source = parentPart.getSource(); | |
| auto makeView = source.getDefiningOp<pto::MakeTensorViewOp>(); |
| while (true) { | ||
| SmallVector<Operation *, 8> deadViewOps; | ||
| func.walk([&](Operation *op) { | ||
| if ((isa<pto::PartitionViewOp>(op) || | ||
| isa<pto::MakeTensorViewOp>(op)) && | ||
| op->use_empty()) | ||
| deadViewOps.push_back(op); | ||
| }); | ||
| if (deadViewOps.empty()) | ||
| break; | ||
| for (auto *op : llvm::reverse(deadViewOps)) | ||
| op->erase(); | ||
| } |
There was a problem hiding this comment.
Repeatedly walking the entire function using func.walk in a while (true) loop to erase dead view operations is inefficient, resulting in
We can optimize this to
SmallVector<Operation *, 8> worklist;
func.walk([&](Operation *op) {
if ((isa<pto::PartitionViewOp>(op) || isa<pto::MakeTensorViewOp>(op)) && op->use_empty())
worklist.push_back(op);
});
while (!worklist.empty()) {
Operation *op = worklist.pop_back_val();
SmallVector<Value, 4> operands(op->getOperands());
op->erase();
for (Value operand : operands) {
if (Operation *defOp = operand.getDefiningOp()) {
if ((isa<pto::PartitionViewOp>(defOp) || isa<pto::MakeTensorViewOp>(defOp)) && defOp->use_empty()) {
worklist.push_back(defOp);
}
}
}
}
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
c14549e to
ff01d12
Compare
ff01d12 to
870dadc
Compare
Abstract
This PR adds the first PTODSL-based A5 DSL ST coverage under
test/dsl-st/npu_a5and fixes the frontend/runtime/backend gaps that were exposed while replacing selectedtilelang_stcases with PTODSL cases.Problem scenarios covered:
taddvector tile-op path: validate that PTODSL can author A5 vector tile operations without thetilelang_stharness, including explicit-mode sync authored in the DSL case.tload/tstoredata-movement path: PTODSL naturally emitspto.make_tensor_view -> pto.partition_view -> pto.tile.load/store. The old VPTO tile-op pipeline primarily handled the lowered memref view chain, so nativeTensorViewType/PartitionTensorViewTypeoperands could fail during tile-op expansion or later intrinsic folding.tcolexpand/tcolsumtile-op data movement plus broadcast/reduction-style paths: validate that non-trivial tile shapes, valid rows/cols, and GM view metadata survive through PTODSL, VPTO expansion, and runtime validation.tmatmulcube/MX pipeline path: validate that PTODSL can cover a cube pipeline-style A5 case alongside existing MX DSL ST cases.@pto.jit(kernel_kind="vector")plus same-kind@pto.simdpreviously produced redundant section wrappers in some shapes. The intended explicit single-kind form should use function/kernel-kind metadata directly, while mismatched explicit kind plus subkernel kind should fail early.Implementation changes:
test/dsl-st/npu_a5PTODSL cases fortadd,tload/tstore,tcolexpand,tcolsum, andtmatmul.kernel_kindwas explicitly authored in@pto.jit, while preserving the historical default effective kind ofvectorwhen the user omits it.@pto.simd/@pto.cubescopes without redundantpto.section.*wrappers, and report a diagnostic for explicit kind/subkernel kind mismatches.mode="explicit"native builds toptoas --pto-level=level3, and keep explicit mode from implicitly enabling insert-sync.ExpandTileOpto specialize tile-op templates using nativeTensorViewType/PartitionTensorViewTypeoperands, including view shape, stride, memory space, and layout in the specialization key where they affecttload/tstoreDMA behavior.FoldTileBufIntrinsicsto foldtensor_view_addr,get_tensor_view_dim, andget_tensor_view_stridefrom both the lowered memref chain and the nativepto.partition_view -> pto.make_tensor_viewchain.cube_matrix_pipeline.py,gemv_mx_pipeline.py,predicate_pack.py,t_gm_memory_core.py,vmulscvt.py) for the validated A5/simulator PTODSL flow.Validation
Real A5 NPU validation on
ssh a5under/root/ptoas/pr-work/ptodsl-a5-real-current:Both suites passed on real NPU with
ptoas 0.48, CANN/home/wenquan/cann29/cann-9.0.0, and/dev/davinci0.Local static and frontend checks:
Local VPTO/PTODSL build checks were also run before the last amend: