Draft
40 commits
9fd6076
Add PTODSL alloc_buffer surface
Jun 25, 2026
b8456d6
Add PTODSL contiguous scalar vector access
Jun 25, 2026
7955639
Merge branch 'alloc_buf_dsl' into rmsNorm_merge_test
Jun 25, 2026
d54f745
Merge branch 'ldst_contiguous_ext_dsl' into rmsNorm_merge_test
Jun 25, 2026
b458deb
feat(ptodsl): implement simt_allreduce_sum for SIMT cross-workitem al…
Jun 23, 2026
af96964
fix(ptodsl): align allreduce scratch interface
Jun 24, 2026
875db5b
test(ptodsl): cover alloc_buffer allreduce scratch
Jun 25, 2026
1f08361
example(ptodsl): add RMSNorm alloc_buffer SIMT kernel
Jun 25, 2026
c64d7d1
Merge remote-tracking branch 'hw/main' into rmsNorm_merge_test
Jun 25, 2026
edae0af
test(ptodsl): cover RMSNorm example compile
Jun 25, 2026
1b39788
docs(ptodsl): avoid fixed alloc_buffer alignment contract
Jun 25, 2026
1e7cc23
docs(ptodsl): summarize alloc_buffer scopes in table
Jun 25, 2026
b5a1d78
test(ptodsl): compact rmsnorm simt loops
Jun 25, 2026
2a254f5
test(ptodsl): align rmsnorm simt body
Jun 25, 2026
4e7e957
test(ptodsl): match rmsnorm rstd store
Jun 25, 2026
3df78fe
test(ptodsl): rename rmsnorm simt helper
Jun 25, 2026
ce139da
test(ptodsl): validate rmsnorm simt partition
Jun 25, 2026
e3f9746
docs(ptodsl): clarify alloc buffer parameters
Jun 26, 2026
ab7239d
docs(ptodsl): restrict alloc buffer scope
Jun 26, 2026
0c4e634
docs(ptodsl): split scalar contiguous access docs
Jun 26, 2026
e7fa2e9
docs(ptodsl): preserve scalar access wording
Jun 26, 2026
e69811c
docs(ptodsl): move builtin vector docs
Jun 26, 2026
0e46cb4
feat(ptodsl): support inline simt launch dimensions
Jun 26, 2026
ff75684
Stay close to the golden reference for easier comparison
Jun 26, 2026
1339499
Make rmsNorm align to MLIR golden for easier implementation comparison
Jun 26, 2026
08185ba
Clarify inline SIMT launch context docs
Jun 26, 2026
993ab62
refactor(ptodsl): use python loops in rmsnorm simt body
Jun 26, 2026
f10301a
refactor(ptodsl): remove alloc buffer persistent flag
Jun 26, 2026
25d0bd5
Align RMSNorm SIMT loops with golden MLIR
Jun 26, 2026
1835bc1
test(ptodsl): add rmsnorm launch validation script
Jun 27, 2026
d1719fe
fix(ptodsl): load rmsnorm pingpong tile via UB pointer
Jun 29, 2026
3cb1af6
fix(ptodsl): pass dynamic shared memory to runtime launch
Jun 30, 2026
ceb8e01
fix(ptodsl): inline cross-warp allreduce path
Jun 30, 2026
3f8afd4
fix(ptodsl): use pto sqrt in rmsnorm simt example
Jun 30, 2026
28fa9ef
refactor(ptodsl): make alloc_buffer local-only
Jun 30, 2026
102e008
test(ptodsl): add manual dyn-UB RMSNorm launch
Jun 30, 2026
59ac739
feat(ptodsl): inline SIMT allreduce implementation
kuri780 Jun 30, 2026
433ff88
test(ptodsl): keep only manual RMSNorm launch entry
Jun 30, 2026
59ad97e
fix(ptodsl): remove automatic dyn shared launch bytes
Jun 30, 2026
709d936
test(ptodsl): relax RMSNorm MLIR shape checks
Jul 1, 2026
20 changes: 20 additions & 0 deletions ptodsl/README.md
@@ -152,6 +152,24 @@ Direct run on a real NPU:
python3 ptodsl/examples/flash_attention_softmax_launch.py
```

### `rms_norm/rmsnorm_alloc_buffer_simt.py`

Compile-only RMSNorm example for explicit-mode SIMT kernels. It exercises
SIMT-local `pto.alloc_buffer(...)`, hand-authored dynamic UB scratch offsets,
contiguous `scalar.load` / `scalar.store`, `pto.vec`,
`pto.simt_allreduce_sum(...)`, explicit pipe `set_flag` / `wait_flag` sync,
and a runtime token loop that lowers to `scf.for`.
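
For orientation, the arithmetic that an RMSNorm kernel of this kind performs can be modeled in plain Python. This is a hedged sketch, not the example's source: the `eps` default, the function name, the strided split across work-items, and the omission of any learned scale vector are assumptions made for illustration. The partial-sum step only stands in for the role that `pto.simt_allreduce_sum(...)` plays.

```python
import math


def rmsnorm_reference(row, eps=1e-6, num_workitems=4):
    """Plain-Python model of RMSNorm over one token row (no PTODSL calls)."""
    # Each hypothetical work-item accumulates a partial sum of squares over a
    # strided slice of the row.
    partials = [
        sum(x * x for x in row[w::num_workitems])
        for w in range(num_workitems)
    ]
    # Allreduce-sum model: every work-item ends up seeing the same total.
    total = sum(partials)
    # rstd is the reciprocal root-mean-square of the row.
    rstd = 1.0 / math.sqrt(total / len(row) + eps)
    return [x * rstd for x in row], rstd


out, rstd = rmsnorm_reference([3.0, 4.0, 0.0, 0.0], eps=0.0)
```

A model like this can serve as a numerical reference when the compiled kernel is later run on hardware.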

```bash
python3 ptodsl/examples/rms_norm/rmsnorm_alloc_buffer_simt.py --variant x128 > /tmp/rmsnorm_x128.mlir
python3 ptodsl/examples/rms_norm/rmsnorm_alloc_buffer_simt.py --variant x64 > /tmp/rmsnorm_x64.mlir
```

Expected: MLIR containing `@rmsnorm_4096_alloc_buffer_simt_context_kernel`,
`scf.for`, `vector<4xf32>` for both `x128` and `x64`, and inline
`pto.redux_add` / `pto.syncthreads` allreduce ops. The main token loop should also contain dynamic
`pto.set_flag_dyn` / `pto.wait_flag_dyn` operations for the ping-pong events.
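
The expectation above can be spot-checked with a few lines of Python. This is a minimal sketch: `EXPECTED_MARKERS` and `missing_markers` are hypothetical names, not part of PTODSL, and a substring check only confirms that each marker occurs somewhere in the output. It does not verify that the flag operations sit inside the token loop.

```python
# Substrings taken from the expected-output description above.
EXPECTED_MARKERS = (
    "@rmsnorm_4096_alloc_buffer_simt_context_kernel",
    "scf.for",
    "vector<4xf32>",
    "pto.redux_add",
    "pto.syncthreads",
    "pto.set_flag_dyn",
    "pto.wait_flag_dyn",
)


def missing_markers(mlir_text, markers=EXPECTED_MARKERS):
    """Return the expected substrings that do not occur in the MLIR text."""
    return [marker for marker in markers if marker not in mlir_text]


# Tiny inline demo; real use would read the emitted .mlir file instead.
demo_missing = missing_markers("scf.for ... vector<4xf32>", ("scf.for", "pto.redux_add"))
```

To check a real run, read `/tmp/rmsnorm_x128.mlir` (or the `x64` file) and pass its contents to `missing_markers`; an empty list means every marker was found.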

### Launch artifacts

- `~/.cache/ptodsl/` — JIT-compiled kernel `.so` cache
@@ -167,6 +185,7 @@ python3 ptodsl/tests/test_jit_compile.py
python3 ptodsl/tests/test_jit_diagnostics.py
python3 ptodsl/tests/test_subkernel_diagnostics.py
python3 ptodsl/tests/test_flash_attention_demo_compile.py
python3 ptodsl/tests/test_rmsnorm_example_compile.py
python3 ptodsl/tests/test_ptoas_frontend_verify.py
python3 ptodsl/tests/test_docs_as_test.py
```
@@ -178,6 +197,7 @@ ptodsl_jit_compile: PASS
ptodsl_jit_diagnostics: PASS
ptodsl_subkernel_diagnostics: PASS
ptodsl_flash_attention_demo_compile: PASS
ptodsl_rmsnorm_example_compile: PASS
ptodsl_ptoas_frontend_verify: PASS
ptodsl_docs_as_test: PASS
```
4 changes: 3 additions & 1 deletion ptodsl/docs/user_guide/01-introduction.md
@@ -257,7 +257,9 @@ These are hardware-bound compute sub-kernels, each mapped to a specific NPU comp

Each can be invoked as a named decorated function (`@pto.cube` /
`@pto.simd` / `@pto.simt`) or inline as a context manager
(`with pto.cube():`, `with pto.simd():`, `with pto.simt():`).
(`with pto.cube():`, `with pto.simd():`, `with pto.simt():`). Inline SIMT
scopes can also specify launch dimensions directly with
`with pto.simt(dim_x, dim_y, dim_z):`.

The boundary contract is strict: vreg values do not escape a simd kernel, cube-local state does not leak into UB, and data crosses layer boundaries only through UB-backed tiles or typed UB pointers.

25 changes: 19 additions & 6 deletions ptodsl/docs/user_guide/03-kernel-entry-and-subkernels.md
@@ -736,8 +736,9 @@ two ways:

1. **As decorated functions** — reusable, named sub-kernels called from
`@pto.jit` entries and modules.
2. **As context managers** (`with pto.cube():`, etc.) — inline blocks for
one-off snippets (see Section 3.8).
2. **As context managers** (`with pto.cube():`, `with pto.simd():`,
`with pto.simt():`, and `with pto.simt(dim_x, dim_y, dim_z):`) — inline
blocks for one-off snippets (see Section 3.8).

Named sub-kernel decorators use the same default AST rewrite model as
`@pto.jit`: supported Python `if` and `for range(...)` statements lower to
@@ -997,10 +998,13 @@ Specific SIMT micro-op APIs are documented in Chapter 13.

## 3.8 Inline context manager syntax

In addition to the decorator form, each sub-kernel unit provides a context
manager: `with pto.cube():`, `with pto.simd():`, and `with pto.simt():`. These
open one-off anonymous sub-kernel bodies without requiring a separate named
Python function. Inline scopes are supported in top-level `@pto.jit` bodies.
In addition to the decorator form, each sub-kernel unit provides an inline
context manager form: `with pto.cube():`, `with pto.simd():`,
`with pto.simt():`, and `with pto.simt(dim_x, dim_y, dim_z):`. These open
one-off anonymous sub-kernel bodies without requiring a separate named Python
function. Inline scopes are supported in top-level `@pto.jit` bodies. The
dimensioned SIMT form uses the same inline body style, but the caller emits
an explicit `pto.simt_launch` with the authored dimensions.

### Syntax

@@ -1022,6 +1026,12 @@ with pto.simt():
scalar.store(o_next, o_next_tile[row, col])
```

```python
with pto.simt(128, 1, 1):
tid = pto.get_tid_x()
scalar.store(tid, scratch_ub, scalar.index_cast(tid))
```

<!-- ptodsl-doc-test: {"mode":"compile_fragment","fixture":"kernel_entry.inline_cube_scope","symbol":"kernel_entry_inline_cube_scope_probe","compile":{"BLOCK_M":16,"BLOCK_K":16,"BLOCK_N":16}} -->
```python
with pto.cube():
@@ -1041,6 +1051,9 @@ with pto.cube():
/ `pto.section.cube` bodies inside the outlined helper.
- `with pto.simt():` preserves its scalar body inside one outlined
`pto.simt_entry` helper, and the caller emits `pto.store_vfsimt_info`.
- `with pto.simt(dim_x, dim_y, dim_z):` uses the same inline outlining and
automatic capture rules, but emits a caller-side explicit SIMT launch with
the authored dimensions.
- Values defined inside the inline sub-kernel cannot escape the block directly.
Use Tiles, typed pointers, or other mutable references to communicate results
back to the caller.
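
The effect of a dimensioned inline SIMT scope can be pictured with a plain-Python model. This is a sketch under stated assumptions, not the PTODSL implementation: `run_simt_model` and `store_tid` are hypothetical names, thread IDs are assumed to range from `0` to `dim - 1` on each axis, and work-items run one after another here, whereas a real launch gives no such ordering.

```python
from itertools import product


def run_simt_model(dim_x, dim_y, dim_z, body):
    """Call body(tid_x, tid_y, tid_z) once per modeled work-item."""
    for tid_z, tid_y, tid_x in product(range(dim_z), range(dim_y), range(dim_x)):
        body(tid_x, tid_y, tid_z)


scratch_ub = [None] * 128  # stand-in for a UB-backed scratch buffer


def store_tid(tid_x, tid_y, tid_z):
    # Results leave the body only through the mutable buffer, mirroring the
    # rule that values defined inside the inline scope cannot escape directly.
    scratch_ub[tid_x] = tid_x


# Models a launch with dimensions (128, 1, 1).
run_simt_model(128, 1, 1, store_tid)
```

After the modeled launch, each slot of `scratch_ub` holds the ID of the work-item that wrote it, which is the pattern the `with pto.simt(128, 1, 1):` fragment in the Syntax subsection expresses.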
30 changes: 27 additions & 3 deletions ptodsl/docs/user_guide/04-type-system-and-buffer.md
@@ -175,7 +175,31 @@ ptr_ub = pto.ptr(pto.f16, pto.MemorySpace.UB)
| `MemorySpace.ACC` | Cube L0C accumulator buffer |
| `MemorySpace.BIAS` | Cube bias table buffer |

## 4.5 TensorView
## 4.5 Explicit scratch buffers

`pto.alloc_buffer` allocates SIMT lane-local scratch storage for pointer-style
load and store operations inside a SIMT helper.

```text
pto.alloc_buffer(shape, dtype)
```

<!-- ptodsl-doc-pending: {"reason":"illustrative fragment; covered by test_jit_compile alloc_buffer probes"} -->
```python
scratch = pto.alloc_buffer((32,), pto.f32)
```

| Parameter | Description |
|-----------|-------------|
| `shape` | Static positive integer shape. Pass an `int`, `tuple[int, ...]`, or `list[int]`. |
| `dtype` | Element type of the returned buffer, such as `pto.f32` or `pto.i32`. |

The returned pointer refers to an allocation that is local to the SIMT helper
invocation that created it. Use it for per-workitem temporary fragments, scalar
scratch values, or staged values that are accessed through pointer-style loads
and stores.
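
The accepted `shape` spellings can be illustrated with a small plain-Python normalizer. This is only a sketch of the documented contract (static positive integers, given as an `int`, `tuple`, or `list`); `normalize_shape` and `element_count` are hypothetical helpers, and the real validation and error messages may differ.

```python
import math


def normalize_shape(shape):
    """Turn an int / tuple / list shape into a tuple of positive ints."""
    dims = (shape,) if isinstance(shape, int) else tuple(shape)
    for dim in dims:
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(dim, bool) or not isinstance(dim, int) or dim <= 0:
            raise ValueError(
                f"shape entries must be static positive integers, got {dim!r}"
            )
    return dims


def element_count(shape):
    """Number of elements a buffer of this shape would hold."""
    return math.prod(normalize_shape(shape))
```

Under this model, `32`, `(32,)`, and `[32]` all describe the same 32-element buffer.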

## 4.6 TensorView

`TensorView` is a descriptor for a tensor in Global Memory. Create one inside a `@pto.jit` body with `make_tensor_view`:

@@ -205,7 +229,7 @@ def kernel(

Strides support non-contiguous tensors. Pass `strides=A.strides` from the source tensor for the default row-major layout, or supply explicit strides for sub-views. Use `tv.as_ptr()` to obtain a typed GM pointer for use with MTE Ops in explicit-mode orchestration.

## 4.6 PartitionTensorView
## 4.7 PartitionTensorView

`partition_view` creates a sub-view of a TensorView at a given offset and size. It describes *which part* of the GM tensor a `tile.load` or `tile.store` should operate on:

@@ -216,7 +240,7 @@ part = pto.partition_view(tv, offsets=[row_offset, 0], sizes=[BLOCK, dim])

The result is a `PartitionTensorView` — a lightweight descriptor, not a data buffer. It carries the partition's shape, strides, and element type (inherited from the source TensorView). Use `part.as_ptr()` to obtain a typed GM pointer for MTE Ops in explicit-mode orchestration.

## 4.7 Tile
## 4.8 Tile

A `Tile` is an on-chip buffer allocated in UB or cube-local memory. Allocate tiles with `alloc_tile`:

65 changes: 64 additions & 1 deletion ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md
@@ -35,7 +35,9 @@ When in doubt, ask: *can this value change between launches of the same compiled

## 6.2 Scalar access: load and store

`scalar.load` reads a single scalar element from a typed pointer or tile location. `scalar.store` writes a scalar back. These are the canonical scalar memory ops for SIMT authoring. The offset is counted in elements, not bytes.
`scalar.load` reads one scalar element from a typed pointer or tile location.
`scalar.store` writes one scalar element back. These are the canonical scalar
memory ops for SIMT authoring. Offsets are counted in elements, not bytes.
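
The difference between element offsets and byte offsets can be shown with the Python standard library alone. This sketch does not call PTODSL; the `array` buffer merely stands in for memory behind a typed pointer, and the variable names are illustrative.

```python
import struct
from array import array

buf = array("f", [0.0, 1.0, 2.0, 3.0, 4.0])  # single-precision elements
offset = 3  # counted in elements, as scalar.load / scalar.store expect

element = buf[offset]

# The same location expressed in bytes: element offset times element size
# (12 when each element occupies 4 bytes).
byte_offset = offset * buf.itemsize
raw = buf.tobytes()
(same_element,) = struct.unpack_from("f", raw, byte_offset)
```

Both reads return the same value; only the unit of the offset differs.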

#### `scalar.load(ptr: PtrType, offset: Index) -> ScalarType`

@@ -101,6 +103,67 @@ scalar.store(value, tile[row, col])
scalar.store(value, ptr, offset)
```

### Contiguous vector access

Use `contiguous=N` when a single work-item should read or write `N` adjacent
elements as one vector value. `N` must be a Python integer greater than `1`.

#### `scalar.load(ptr: PtrType, offset: Index, *, contiguous: int) -> VecValue`

**Description**: Loads `contiguous` adjacent elements from a typed pointer.

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `ptr` | `PtrType` | Typed source pointer |
| `offset` | `Index` | First element to load |
| `contiguous` | Python `int` greater than `1` | Number of adjacent elements to load |

**Returns**:

| Return Value | Type | Description |
|--------------|------|-------------|
| `value` | `pto.vec(T, N)` | Vector value with `N == contiguous` and element type `T` |

**Example**:

```python
x4 = scalar.load(ptr, offset, contiguous=4)
```

For a `pto.ptr(pto.f32, "ub")`, this produces a DSL vector value with type
`pto.vec(pto.f32, 4)`.

---

#### `scalar.store(value: VecValue, ptr: PtrType, offset: Index, *, contiguous: int | None = None) -> None`

**Description**: Stores a vector value to adjacent elements of a typed pointer.
The store width is taken from the vector lane count. If `contiguous` is
provided, it must match that lane count.

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `value` | `pto.vec(T, N)` | Vector value to write |
| `ptr` | `PtrType` | Typed destination pointer |
| `offset` | `Index` | First element to store |
| `contiguous` | `int` or `None` | Optional width check; when provided, it must equal `N` |

**Example**:

```python
scalar.store(x4, ptr, offset)
scalar.store(x4, ptr, offset, contiguous=4) # optional width check
```

`scalar.store(scalar_value, ptr, offset, contiguous=N)` is rejected because
scalar values are not implicitly broadcast for vector stores. To build an
explicit broadcast vector, use `pto.vec(...)`; see Section 8.4.
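
The rules above can be summarized in a plain-Python model, with a list standing in for memory and a tuple standing in for a vector value. This is a sketch of the documented behavior, not the PTODSL implementation: `load_contiguous`, `store_contiguous`, the exception types, and the bounds check are assumptions of the model.

```python
def _check_bounds(buf, offset, width):
    # Bounds handling is a model-only assumption.
    if offset < 0 or offset + width > len(buf):
        raise IndexError("contiguous access runs past the buffer")


def load_contiguous(buf, offset, contiguous):
    """Model of a contiguous load: returns N adjacent elements as a tuple."""
    if isinstance(contiguous, bool) or not isinstance(contiguous, int) or contiguous <= 1:
        raise ValueError("contiguous must be a Python int greater than 1")
    _check_bounds(buf, offset, contiguous)
    return tuple(buf[offset:offset + contiguous])


def store_contiguous(value, buf, offset, contiguous=None):
    """Model of a vector store: the width comes from the vector itself."""
    if not isinstance(value, tuple):
        # Scalars are not implicitly broadcast for vector stores.
        raise TypeError("expected a vector value; broadcast the scalar first")
    if contiguous is not None and contiguous != len(value):
        raise ValueError("contiguous must match the vector lane count")
    _check_bounds(buf, offset, len(value))
    buf[offset:offset + len(value)] = value


memory = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x4 = load_contiguous(memory, 2, 4)
store_contiguous(x4, memory, 4)
store_contiguous(x4, memory, 4, contiguous=4)  # optional width check passes
```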

### Scalar value adaptation

`scalar.store` adapts the authored `value` to the destination element type.
40 changes: 40 additions & 0 deletions ptodsl/docs/user_guide/08-compute-operations.md
@@ -1864,3 +1864,43 @@ The `mte_l1_l0a`/`mte_l1_l0b` stage operands from the authored source tiles into
| `pto.mad_mx_bias(lhs, rhs, dst, bias, m, n, k, **clauses)` | MX-format bias-init matmul |

MX variants require MX-enabled dtypes (f8) and pre-loaded scale payloads. For most users, the standard `mad`, `mad_acc`, and `mad_bias` are the primary interface.

---

## 8.4 Builtin vector values

Builtin vector values are small fixed-lane vectors used by contiguous scalar
accesses and element-wise vector expressions. They are distinct from the
`VRegType` values used inside `@pto.simd` kernels.

#### `pto.vec(dtype, lanes, *, init=None)`

**Description**: Names a builtin vector type. When `init` is provided,
constructs a vector value. A scalar initializer is broadcast to every lane.

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `dtype` | PTO dtype | Element type, such as `pto.f32` |
| `lanes` | Positive Python `int` | Number of lanes |
| `init` | Scalar value, vector value, or `None` | Optional initializer; scalar values are broadcast to all lanes |

**Returns**:

| Return Value | Type | Description |
|--------------|------|-------------|
| `result` | Vector type or `pto.vec(dtype, lanes)` value | Without `init`, returns a vector type descriptor; with `init`, returns a vector value |

**Example**:

<!-- ptodsl-doc-pending: {"reason":"illustrative fragment; covered by test_jit_compile scalar contiguous vector probes"} -->
```python
x4 = scalar.load(ptr, offset, contiguous=4)
rstd4 = pto.vec(pto.f32, 4, init=rstd)
y4 = x4 * rstd4
scalar.store(y4, ptr, offset)
```

Use this form when a scalar value must participate in element-wise arithmetic
with a vector value returned by `scalar.load(..., contiguous=N)`.
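
The broadcast-then-multiply pattern can be modeled in plain Python, with tuples standing in for builtin vector values. This is an illustrative sketch only: `broadcast` and `vec_mul` are hypothetical helpers, and the lane-count check is an assumption about how mismatched operands would be treated.

```python
def broadcast(scalar_value, lanes):
    """Model of a scalar initializer copied into every lane."""
    return (scalar_value,) * lanes


def vec_mul(lhs, rhs):
    """Element-wise multiply of two vectors with equal lane counts."""
    if len(lhs) != len(rhs):
        raise ValueError("lane counts must match")
    return tuple(a * b for a, b in zip(lhs, rhs))


x4 = (1.0, 2.0, 3.0, 4.0)     # as if returned by a contiguous load
rstd4 = broadcast(0.5, 4)     # scalar copied into all four lanes
y4 = vec_mul(x4, rstd4)
```

Making the broadcast an explicit step keeps the lane count of both operands visible at the point of the multiply.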
4 changes: 2 additions & 2 deletions ptodsl/docs/user_guide/13-simt-micro-ops.md
@@ -10,8 +10,8 @@ scalar values loaded from tiles.
#### `pto.store_vfsimt_info(dim_z, dim_y, dim_x) -> None`

**Description**: Emits the low-level VPTO launch descriptor operation. Most
code should use `body[dim_x, dim_y, dim_z](...)` or `pto.simt_launch(...)`
instead.
code should use `body[dim_x, dim_y, dim_z](...)`, `pto.simt_launch(...)`, or
the inline form `with pto.simt(dim_x, dim_y, dim_z):` instead.

**Parameters**:
