17 changes: 8 additions & 9 deletions docs/designs/mix-kernel-mix-backend-compile-flow.md
@@ -133,7 +133,7 @@ For example:
```python
@pto.jit(target="a5", entry=False, backend="vpto", mode="explicit")
def scale_row(base_gm: pto.ptr(pto.f32, "gm"), row: pto.i32):
-    with pto.simd():
+    with pto.tileop():
         ...

@pto.jit(target="a5", backend="emitc")
```
@@ -168,10 +168,9 @@ kernel. The Vector/Cube execution ownership is a PTOAS responsibility:
the VPTO backend.

This keeps the PTODSL programming model independent of the physical sectioning
-rules. PTODSL can still expose helper abstractions such as `@pto.simd`,
-`@pto.cube`, `with pto.simd():`, and `with pto.cube():`, but the design does
-not require users or the frontend to manually partition every operation into a
-final section.
+rules. PTODSL exposes logical helper abstractions such as `@pto.tileop` and
+`with pto.tileop():`, but the design does not require users or the frontend to
+manually partition every operation into a final section.

### PTODSL IR Codegen Shape

@@ -233,11 +232,10 @@ Python-only structure. This lowering records PTODSL helper structure and call
boundaries; it does not make PTODSL responsible for the final Vector/Cube
section partition.

-For `@pto.simd` / `@pto.cube` and inline `with pto.simd():` / `with pto.cube():`
-scopes, PTODSL:
+For `@pto.tileop` and inline `with pto.tileop():` scopes, PTODSL:

- outlines the subkernel body into a helper `func.func` when needed
-- marks the helper with `pto.ptodsl.subkernel_helper`
+- marks the helper with canonical `pto.tileop.helper`
- emits a helper call from the caller body

This is the PTODSL-side expression of a logical mixed kernel: the entry or
@@ -526,4 +524,5 @@ Use this order when debugging mixed compilation:
| `pto.aicore` | `func.func` | Legacy entry marker accepted for compatibility. |
| `pto.internal.non_entry` | `func.func` | Frontend/helper metadata; not used for current entry inference. |
| `pto.ptodsl.logical_name` | `func.func` | Source-level logical name used when assembling wrappers and peer references. |
-| `pto.ptodsl.subkernel_helper` | `func.func` | Frontend helper classification: `simd`, `cube`, or `simt`. |
+| `pto.tileop.helper` | `func.func` | Canonical tileop-style helper marker emitted for `@pto.tileop` helpers. |
+| `pto.ptodsl.subkernel_helper` | `func.func` | Legacy helper role marker retained for compatibility with older/manual IR. |
4 changes: 2 additions & 2 deletions docs/designs/ptodsl-ast-preprocess-control-flow.md
@@ -19,8 +19,8 @@ static specialization readable.
## Goals

- Make native Python control-flow syntax usable by default for runtime control
-  flow in `@pto.jit(...)` kernels and named `@pto.cube` / `@pto.simd` /
-  `@pto.simt` sub-kernels.
+  flow in `@pto.jit(...)` kernels and named `@pto.tileop` / `@pto.simt`
+  sub-kernels.
- Use `ast_rewrite` as the public name for the source rewrite feature.
- Rewrite legal Python `if` / `for range(...)` into existing PTODSL
control-flow surfaces.
192 changes: 192 additions & 0 deletions docs/designs/ptodsl-redesign-of-simd-simt-cube-subkernel.md
@@ -0,0 +1,192 @@

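# Final Design

## 1. Goals and User Model

`pto.tileop` is the single marker for custom subkernels. It takes over the public subkernel entry role of `pto.cube` / `pto.simd`; `pto.simt` remains dedicated to launched SIMT. A tileop is modeled on tile-level IR as a named helper plus `func.call`, with tile/tensorview/scalar IO and a phase summary, so that `PTOInsertSync` and `PTOPlanMemory` can handle it as a first-class construct. The decorator takes no arguments: **the summary is derived entirely by the backend `PTOInferTileOpSummaryPass`, and the canonical helper marker converges on the `pto.tileop.helper` unit attr.**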

```python
@pto.tileop
def softmax(src_view: pto.TensorView, out_tile: pto.Tile, scratch_tile: pto.Tile,
            rows: pto.i32, cols: pto.i32):
    # The caller passes in the scratch/out tiles; the body creates no new tile buffer.
    pto.tload(src_view, scratch_tile)          # MTE
    m = pto.vmax(scratch_tile)                 # PIPE_V
    e = pto.vexp(pto.vsub(m, scratch_tile))    # PIPE_V
    s = pto.vsum(e)                            # PIPE_V
    r = pto.vdiv(e, s)                         # PIPE_V
    pto.tstore(out_tile, r)                    # MTE

@pto.jit
def kernel(out, x, scratch, rows, cols):
    softmax(x, out, scratch, rows, cols)
    softmax(x, out, scratch, rows, cols)  # reuse
```

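Constraints (enforced at compile time):
- IO is limited to Tile / TensorView / PartitionTensorView / PTO scalar. **Output tiles/tensorviews always go through output operands with operand_effects=write/readwrite; in the MVP, func.call results may only be scalars.**
- **Inside a helper, `alloc_tile` / `reserve_buffer` / `TAlloc` and any op that would require PlanMemory to plan callee-local memory are forbidden; internal tile buffers must come from caller operands. Internal vreg/mask/scalar temporaries may exist but must not cross the boundary.**
- The body may contain tload/tstore, vector ops, scalar (PIPE_S) ops, cube ops, and `pipe_barrier` synchronization.
- Host tensor / TensorSpec / vreg / mask / pipe_handle must not cross the boundary; SIMT-only ops are not allowed.
- **A tileop must not be called inside another tileop or section** (this avoids nested sections after inlining; `VPTOSplitCVModule:113-118` forbids nested sections).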

Review comment (Contributor): It seems this could be special-cased at inline time, or `VPTOSplitCVModule` could be adapted; in principle, nesting sections of the same kind is reasonable too.

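- **Negative case: a tileop that contains only MTE/S/sync, with no evidence of a vector/cube main computation, is an error.**
- The MVP supports a single primary compute domain (vector **or** cube) plus multiple auxiliary pipes; mixed cube+vector computation is rejected. Multiple phases are required for correctness.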

## 2. Key Backend Facts (verified, including the measured pipeline order)

- `PTOInsertSync` and `PTOPlanMemory` both use `func.walk`, which recurses fully into regions.
- InsertSync currently has two PTODSL subkernel paths. The legacy `simd/cube` compatibility path still maps a helper role to a single pipe and **conservatively models memory operands as read+write**. The `tileop` path already reads `primary_domain/phases/operand_effects` and splits a call into multiple `CompoundInstanceElement`s, one per **non-empty boundary-effect phase**.
- **`CompoundInstanceElement` (`SyncCommon.h:334-341`) carries a single `kPipeValue` and a single def/use set**; a node with empty def/use is legal but contributes no cross-boundary dependency.
- **`classifyTileOpByPipe` (`PTONormalizeUncoveredTileSections.cpp:252-254`) classifies MTE1 as Cube**; `inferSegmentKind` reports an error for mixed segments.
- **`normalizeFunction` (:740-764)**: for a func that carries neither kernel_kind nor the tileop helper marker, it runs `collectUncoveredTopLevelSegments` → `inferSegmentKind`, and a mixed segment fails with `emitSegmentInferenceError`. `hasKnownKernelKindContext` now treats canonical `pto.tileop.helper` (and, for compatibility, legacy `pto.ptodsl.subkernel_helper = "tileop"`) as known context and skips NormalizeUncovered directly.
- **`VPTOSplitCVModule`**: `hasSectionKind` (:58-83) checks whether the func contains `SectionCubeOp` / `SectionVectorOp`; split candidates without a section are erased by `eraseSectionSplitCandidatesWithoutSectionKind` (:170-175); `:135` requires "must contain section"; `:113-118` forbids nested sections.
- **Measured pipeline order (`ptoas.cpp:1780-1900`)**:
```
preBackendPM: NormalizeUncoveredTileSections (1786)
main pm: ... → PTOInferTileOpSummaryPass → PTOMaterializeTileOpSectionsPass
→ PTOVerifyTileOpContractPass → ... → ViewToMemref
→ PlanMemory → ResolveReservedBuffers
→ VerifySubkernelPipeContract → InsertSync → ... →
MaterializeTileHandles → InlineBackendHelpers
```
- MLIR attributes cannot reference SSA values; PTO has no ValueAttr mechanism; custom attrs must be registered in `PTOAttrs.td` (`PTO_Attr` base class, :36).
- `alloc_tile` (`PTOOps.td:318`), `reserve_buffer` (:1792), and `TAllocOp` (:2240, handled at PlanMemory:478) all exist in the codebase.

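## 3. IR Carrier: Option B (named helper + `func.call` + backend-derived phase summary)

The design reuses `func.call`, named helpers, and callee resolution. **It does not reuse `kernel_kind`**; it uses the backend-derived `pto.tileop.primary_domain` instead. **It does not pre-wrap the body in a section**; instead, the verifier and NormalizeUncovered are changed to accept a bare tileop body. **The frontend only attaches `pto.tileop.helper`; the summary is generated entirely by the backend.**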

```
// After frontend tracing: the helper carries only the unified marker
func.func @softmax(%src: !pto.tensorview<...>, %out: !pto.tile<...>, %scratch: !pto.tile<...>,
                   %rows: i32, %cols: i32)
    { pto.tileop.helper } {
  pto.tload %src, %scratch
  %m = pto.vmax %scratch
  ...
  pto.tstore %out, %r
  return
}

// After PTOInferTileOpSummaryPass: summary filled in (real MLIR attr structure)
func.func @softmax(%src, %out, %scratch, %rows, %cols)
    { pto.tileop.helper,
      pto.tileop.primary_domain = #pto.kernel_kind<vector>,
      pto.tileop.phases = #array<#dict<{
        pipe = #pto.pipe<MTE1>, operand_uses = [0], operand_defs = [2], result_defs = []
      }>, #dict<{
        pipe = #pto.pipe<V>, operand_uses = [2], operand_defs = [1], result_defs = []
      }>, #dict<{
        pipe = #pto.pipe<MTE1>, operand_uses = [1], operand_defs = [1], result_defs = []
      }>>,
      pto.tileop.operand_effects = ["read", "readwrite", "readwrite", "read", "read"]
    } {
  ...
}

func.func @kernel(%out, %x, %scratch, %rows, %cols) {
  func.call @softmax(%x, %out, %scratch, %rows, %cols) : (...) -> ()
  func.call @softmax(%x, %out, %scratch, %rows, %cols) : (...) -> () // reuse
}
```

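### Phase attr schema (real MLIR structure; must be registered in PTOAttrs.td)

- `pto.tileop.phases`: `ArrayAttr<DictionaryAttr>`; each phase dict contains:
  - `pipe`: reuses the existing pipe integer enum (`PTOAttrs.td:213-227`) or a new `PipeAttr` (which would need registration); derived from each op's `getPipe()`.
  - `operand_uses`: integer `ArrayAttr`. Operand indices refer to all operands of the function; **InsertSync consumes only memory-like operands**, while scalars may appear in the summary for verification or be ignored, and do not take part in graph construction. May be empty.
  - `operand_defs`: integer `ArrayAttr` (same rules; may be empty).
  - `result_defs`: integer `ArrayAttr` (**in the MVP always empty or scalar results only, and not part of memory sync**; richer semantics are deferred).
  - **Effects may be empty**: a purely internal phase may have all effects empty. It stays in phases for verification and primary-domain inference, and **InsertSync skips it without building a `CompoundInstanceElement`**. Only phases with a boundary effect get a node. Whether a phase records use/def is a **policy, not an IR invariant**.
- `pto.tileop.operand_effects`: derived from the non-empty effects in phases (union: use→read, def→write); unknown→readwrite; scalars are marked read but build no graph nodes.
- `pto.tileop.primary_domain`: the primary compute domain, vector or cube (borrows the enum values without attaching a `kernel_kind` attr).
- **`pipe_footprint` is removed**: the set of pipes in the body is expressed by the set of `phases.pipe` values.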
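
The derivation rule for `pto.tileop.operand_effects` can be sketched in plain Python. This is a minimal model under stated assumptions, not the pass implementation: `derive_operand_effects`, the dict layout, and the `default` parameter are hypothetical names introduced for illustration. `default` is a parameter because the fallback for operands with no recorded effect is still open between the design (`readwrite`) and the current implementation (`"read"`). The sample phases mirror the softmax summary shown earlier.

```python
from typing import Dict, List


def derive_operand_effects(phases: List[Dict], num_operands: int,
                           default: str = "read") -> List[str]:
    """Union per-phase use/def indices into one effect string per operand.

    Hypothetical model: use -> read, def -> write, both -> readwrite.
    Operands with no recorded effect get `default` (an assumption).
    """
    reads, writes = set(), set()
    for phase in phases:
        reads.update(phase.get("operand_uses", []))
        writes.update(phase.get("operand_defs", []))

    effects = []
    for index in range(num_operands):
        if index in reads and index in writes:
            effects.append("readwrite")
        elif index in writes:
            effects.append("write")
        elif index in reads:
            effects.append("read")
        else:
            effects.append(default)
    return effects


# Operands: 0=src view, 1=out tile, 2=scratch tile, 3=rows, 4=cols.
SOFTMAX_PHASES = [
    {"pipe": "MTE1", "operand_uses": [0], "operand_defs": [2]},
    {"pipe": "V", "operand_uses": [2], "operand_defs": [1]},
    {"pipe": "MTE1", "operand_uses": [1], "operand_defs": [1]},
]

softmax_effects = derive_operand_effects(SOFTMAX_PHASES, 5)
```

With the default set to `"read"`, the two scalar operands come out as `"read"`, which matches the `operand_effects` array in the softmax example.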

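### Operand index scope (explicit)

- Indices refer to **all operands of the function** (scalars included).
- InsertSync consumes only the use/def of **memory-like operands** (tile/tensorview) when building `CompoundInstanceElement`s.
- Scalar operands may appear in the summary for verification or be ignored; they **do not take part in def/use graph construction**.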
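
The filtering rule above can be illustrated with a small Python model. It is a sketch only: `build_sync_nodes`, the `operand_kinds` list, and the dict-shaped nodes are hypothetical stand-ins for what InsertSync does with `CompoundInstanceElement`, and the pipe names are plain strings chosen for the example.

```python
from typing import Dict, List

# Operand kinds that the text above calls "memory-like".
MEMORY_LIKE = {"tile", "tensorview"}


def build_sync_nodes(phases: List[Dict], operand_kinds: List[str]) -> List[Dict]:
    """Return one node per phase that has a boundary effect.

    Scalar indices are dropped from use/def; a phase whose filtered
    use and def lists are both empty produces no node.
    """
    nodes = []
    for phase in phases:
        uses = [i for i in phase.get("operand_uses", [])
                if operand_kinds[i] in MEMORY_LIKE]
        defs = [i for i in phase.get("operand_defs", [])
                if operand_kinds[i] in MEMORY_LIKE]
        if not uses and not defs:
            continue  # purely internal or scalar-only phase: skipped
        nodes.append({"pipe": phase["pipe"], "uses": uses, "defs": defs})
    return nodes


kinds = ["tensorview", "tile", "tile", "scalar", "scalar"]
example_phases = [
    {"pipe": "MTE1", "operand_uses": [0, 3], "operand_defs": [2]},
    {"pipe": "S", "operand_uses": [3, 4], "operand_defs": []},
    {"pipe": "V", "operand_uses": [2], "operand_defs": [1]},
]
sync_nodes = build_sync_nodes(example_phases, kinds)
```

Here the scalar-only `S` phase yields no node, and the scalar index `3` is dropped from the first phase's uses, so only two nodes remain.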

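### Division of responsibilities among summary attributes

| Responsibility | Carried by |
|---|---|
| Set of pipes appearing in the body | The set of `phases.pipe` values (the verifier checks that each body op's getPipe() ∈ this set) |
| Caller-side cross-boundary sync modeling | Phases with a boundary effect (non-empty use/def on memory-like operands); InsertSync builds a `CompoundInstanceElement` for each |
| Primary compute domain | `primary_domain` |
| Per-operand side effects | `operand_effects` (derived from non-empty phase effects; scalars are marked read but build no graph nodes) |

> `has_sync` is not kept: InsertSync assumes the helper manages its internal synchronization itself, and the caller level handles only cross-boundary sync.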

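### Output / results boundary (hard MVP constraints)

- Output tiles/tensorviews always go through output operands with operand_effects=write/readwrite.
- **In the MVP, func.call results may only be scalars** (alias handles are deferred).
- **`alloc_tile` / `reserve_buffer` / `TAlloc` are forbidden inside a helper**; internal tiles must come from caller operands; internal vreg/mask/scalar temporaries do not cross the boundary.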

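## 4. MVP Boundaries

- Single primary compute domain plus multiple auxiliary pipes; mixed cube+vector computation is rejected.
- **Multiple phases are required for correctness**; pipes are derived via `getPipe()`; effects may be empty (only non-empty effects build sync nodes), and recording use/def on the softmax V phase is a policy choice.
- MTE1/2/3/4 and PIPE_S are recorded as phases but do not take part in determining primary_domain (a tileop-specific rule; the global `classifyTileOpByPipe` is unchanged).
- Helper-local tile allocation is forbidden.
- SIMT-only ops are excluded. **Nested tileop calls are forbidden.**
- **Negative case: a tileop with only MTE/S/sync and no primary compute evidence is an error.**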
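
The primary-domain rule in this list can be sketched as follows. This is an illustrative model, not the `PTOInferTileOpSummaryPass` code: `infer_primary_domain` and `TileOpError` are hypothetical names, and the pipe strings (in particular `"CUBE"` for the cube compute pipe) are placeholders rather than the real enum spellings.

```python
from typing import Iterable

# Placeholder mapping from compute pipes to domains (assumed names).
COMPUTE_DOMAIN = {"V": "vector", "CUBE": "cube"}


class TileOpError(ValueError):
    """Raised when a tileop body violates the MVP domain rules."""


def infer_primary_domain(body_pipes: Iterable[str]) -> str:
    """Pick the single primary compute domain of a tileop body.

    Any pipe outside COMPUTE_DOMAIN (MTE1..4, S, sync) is treated as
    auxiliary and does not influence the result.
    """
    domains = {COMPUTE_DOMAIN[p] for p in body_pipes if p in COMPUTE_DOMAIN}
    if len(domains) > 1:
        raise TileOpError("mixed cube+vector compute is rejected in the MVP")
    if not domains:
        raise TileOpError("no vector/cube compute evidence (only MTE/S/sync)")
    return next(iter(domains))


softmax_domain = infer_primary_domain(["MTE1", "V", "V", "V", "V", "MTE1"])
```

Keeping the auxiliary pipes out of the decision is what lets a body such as load, vector compute, store resolve to `vector` even though the global classifier would assign MTE1 to Cube.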

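## 5. Changes

### Frontend (`_subkernels.py` + `_tracing/session.py`)

1. `_create_subkernel_section_op`: a tileop is **not pre-wrapped in a section**.
2. The helper carries only `pto.tileop.helper`; the frontend **does not write primary_domain/phases/operand_effects**.
3. Helper function type: all outputs go through operands; results are used only for scalars.
4. Frontend public boundary contract: vreg/mask still must not escape; results are limited to scalars; **nested tileop calls are forbidden**. Helper-local resource allocation inside the helper body (`alloc_tile/reserve_buffer/TAlloc`, etc.) is rejected by the backend `PTOVerifyTileOpContractPass` as a backstop.
5. The decorator has no `kind` parameter; the public surface keeps only `@pto.tileop` / `@pto.simt`, and legacy `@pto.cube` / `@pto.simd` raise a migration diagnostic directly in the frontend.

### Backend

1. **Verifier changes**: the tile op verifier treats a func carrying `pto.tileop.helper` as a legal context; results are limited to scalars; **alloc_tile/reserve_buffer/TAlloc are rejected**; internal vreg/mask/scalar temporaries do not cross the boundary.
2. **`PTONormalizeUncoveredTileSections` skips tileops** (P0): `normalizeFunction` / `hasKnownKernelKindContext` gain a condition so that a func carrying `pto.tileop.helper` is skipped, which prevents preBackendPM:1786 from scanning a bare body and failing on a mixed segment.
3. **New `PTOInferTileOpSummaryPass`**: scans the helper body and derives primary_domain, phases (pipe from `getPipe()`; effects may be empty; operand indices refer to all operands; only memory-like operands build graph nodes), and operand_effects (derived from non-empty effects). It applies tileop-specific MTE/S rules and does not change the global `classifyTileOpByPipe`.
4. **New materialize pass**: materializes `SectionCubeOp` / `SectionVectorOp` from primary_domain + phases (wrapping only the cube/vector main segment), while MTE/S/sync stay at top level. **lit cases cover two shapes**: MTE+section.vector+MTE and MTE+section.cube+MTE, verifying that `VPTOSplitCVModule` / EmitC / VPTO accept them (note that `hasSectionKind`:58-83 requires the func to contain a section, and `eraseSectionSplitCandidatesWithoutSectionKind`:170-175 erases candidates without one).
5. **`UpdatePTODSLSubkernelCallInfo` changes**: reads primary_domain + phases; splits a call into multiple `CompoundInstanceElement`s by **phases with a boundary effect** (empty-effect phases are skipped; scalar operands build no graph nodes); memory operand side effects switch from the conservative all-R+W model to reading operand_effects; callsite scalar results are supported in the dependency graph.
6. **New `PTOVerifyTileOpContractPass`**: checks that each body op's `getPipe()` ∈ the phases pipe set, that the main-domain pipe agrees with primary_domain, that operand_effects == the value derived from non-empty phases, that SIMT-only ops are absent, that mixed cube+vector computation is rejected, that results are limited to scalars, that there are no nested tileop calls, **that alloc_tile/reserve_buffer/TAlloc are rejected**, and **the negative case (only MTE/S/sync with no primary compute evidence is an error)**. The old `PTOVerifySubkernelPipeContractPass` is kept for cube/simd compatibility.
7. The main pipeline no longer relies on `PTOWrapFunctionsInSectionsPass` to wrap a tileop helper in a single section automatically; tileop section formation is governed by `PTOMaterializeTileOpSectionsPass`.
8. **`PTOInlineBackendHelpers`**: guarantees that sync ops surrounding the call are not lost; forbidding nested tileop calls avoids nested sections after inlining at the source.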
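
Two of the contract checks listed for `PTOVerifyTileOpContractPass` can be modeled in a few lines of Python. This is a hedged sketch, not the pass itself: `verify_tileop_contract`, the `(op_name, pipe)` tuple encoding, and the diagnostic strings are all hypothetical, and only the forbidden-allocation and pipe-membership checks are covered.

```python
from typing import List, Optional, Set, Tuple

# Op names as they are spelled in the design text.
FORBIDDEN_OPS = {"alloc_tile", "reserve_buffer", "TAlloc"}


def verify_tileop_contract(body_ops: List[Tuple[str, Optional[str]]],
                           phase_pipes: Set[str]) -> List[str]:
    """Return diagnostics for a helper body; an empty list means it passes.

    body_ops holds (op_name, pipe) pairs; pipe is None for ops that
    are not bound to any pipe (an assumption of this model).
    """
    errors = []
    for name, pipe in body_ops:
        if name in FORBIDDEN_OPS:
            errors.append(f"{name}: helper-local allocation is not allowed")
        if pipe is not None and pipe not in phase_pipes:
            errors.append(f"{name}: pipe {pipe} is missing from phases")
    return errors


good_body = [("tload", "MTE1"), ("vmax", "V"), ("tstore", "MTE1")]
bad_body = [("alloc_tile", None), ("vmax", "V")]

good_diags = verify_tileop_contract(good_body, {"MTE1", "V"})
bad_diags = verify_tileop_contract(bad_body, {"MTE1"})
```

Collecting all diagnostics instead of stopping at the first one is a choice made for this sketch; it does not describe how the real pass reports errors.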

### Pass order (corrected from measurement, P0)

```
frontend trace → verifier (bare tileop body)
→ [preBackendPM] NormalizeUncoveredTileSections (skips `pto.tileop.helper`) ← P0: must skip
→ [main pm] PTOInferTileOpSummaryPass
→ PTOMaterializeTileOpSectionsPass
→ PTOVerifyTileOpContractPass
→ ... → ViewToMemref → PlanMemory → ResolveReservedBuffers →
VerifySubkernelPipeContract → InsertSync → ... →
MaterializeTileHandles → InlineBackendHelpers
```

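### Optional follow-ups (non-MVP)

- Callsite clone/inline for helper-local tile allocation (lifting the bans on returning tiles as results, on alloc_tile, and so on).
- Alias handle results (allowing non-scalar results and richer result_defs semantics).
- Refining phase def/use down to UB sub-regions.
- An inlining optimization pass.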

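## 6. Current Implementation Status

Parts that have landed and agree with the main design in this document:

1. `NormalizeUncoveredTileSections` now treats the tileop helper marker as known context and skips pre-normalization.
2. `PTOInferTileOpSummaryPass`, `PTOMaterializeTileOpSectionsPass`, and `PTOVerifyTileOpContractPass` are wired into the main pipeline, all before `PlanMemory`.
3. `UpdatePTODSLSubkernelCallInfo` can consume the tileop summary and models InsertSync dependencies across the helper boundary per phase; the legacy `simd/cube` compatibility path keeps the conservative single-pipe model.
4. The tileop helper ABI has converged on Tile / TensorView / PartitionTensorView / PTO scalar; `ptr` remains SIMT-only.
5. `@pto.tileop` is unified at the IR level onto the tileop helper role and uses the canonical `pto.tileop.helper` marker; the legacy public `@pto.simd` / `@pto.cube` now raise errors in the frontend; the backend still accepts legacy `pto.ptodsl.subkernel_helper = "tileop"`.
6. Inline `with pto.tileop()` is aligned with decorated `@pto.tileop`: the frontend no longer pre-wraps a section and leaves materialization to the backend `PTOMaterializeTileOpSectionsPass`.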

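## 7. Differences Still to Be Reconciled Separately

1. **The default value of `pto.tileop.operand_effects` when there is no explicit boundary effect has not yet been re-aligned with the final wording of this document.**
   This document still records the target as `unknown→readwrite`; the current implementation materializes/verifies an operand with no boundary effect as `"read"`. This needs a separate decision before the design and the implementation are unified.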
27 changes: 27 additions & 0 deletions include/PTO/IR/PTO.h
@@ -190,9 +190,36 @@ inline constexpr llvm::StringLiteral kPTOSimtMaxRegistersAttrName =
inline constexpr llvm::StringLiteral kPTOVisibilityAttrName = "pto.visibility";
inline constexpr llvm::StringLiteral kPTOVisibilityInternalValue = "internal";
inline constexpr llvm::StringLiteral kPTOVisibilityExternalValue = "external";
inline constexpr llvm::StringLiteral kPTODSLSubkernelHelperAttrName =
    "pto.ptodsl.subkernel_helper";
inline constexpr llvm::StringLiteral kPTOTileOpHelperAttrName =
    "pto.tileop.helper";
inline constexpr llvm::StringLiteral kPTODSLLogicalNameAttrName =
    "pto.ptodsl.logical_name";

/// Return the logical PTODSL helper role when present.
///
/// Canonical tileop helpers use the unit attr `pto.tileop.helper`. Legacy
/// helper roles still use `pto.ptodsl.subkernel_helper = "<role>"`.
inline StringRef getPTODSLSubkernelHelperRole(::mlir::func::FuncOp func) {
  if (!func)
    return {};
  if (func->hasAttrOfType<UnitAttr>(kPTOTileOpHelperAttrName))
    return "tileop";
  if (auto attr =
          func->getAttrOfType<StringAttr>(kPTODSLSubkernelHelperAttrName))
    return attr.getValue();
  return {};
}

inline bool hasPTODSLSubkernelHelperMarker(::mlir::func::FuncOp func) {
  return !getPTODSLSubkernelHelperRole(func).empty();
}

inline bool isPTODSLTileOpHelper(::mlir::func::FuncOp func) {
  return getPTODSLSubkernelHelperRole(func) == "tileop";
}

/// Return the PTODSL logical function name when present, otherwise fall back to
/// the current symbol name. PTODSL uses this to mark ABI-specialized helper and
/// kernel-module symbols without relying on symbol-name parsing.
3 changes: 3 additions & 0 deletions include/PTO/Transforms/Passes.h
@@ -40,6 +40,9 @@ std::unique_ptr<Pass> createPTOInferValidatePipeInitPass();
std::unique_ptr<Pass> createPTOResolveReservedBuffersPass();
std::unique_ptr<Pass> createPTOWrapFunctionsInSectionsPass();
std::unique_ptr<Pass> createPTONormalizeUncoveredTileSectionsPass();
std::unique_ptr<Pass> createPTOInferTileOpSummaryPass();
std::unique_ptr<Pass> createPTOMaterializeTileOpSectionsPass();
std::unique_ptr<Pass> createPTOVerifyTileOpContractPass();
std::unique_ptr<Pass> createVPTOSplitCVModulePass();
std::unique_ptr<Pass> createVPTONormalizeContainerPass();
std::unique_ptr<Pass> createPTOVerifyTFreePass();