hw-native-sys · and0d0 · Jun 25, 2026
diff --git a/ptodsl/README.md b/ptodsl/README.md
@@ -152,6 +152,24 @@ Direct run on a real NPU:
 python3 ptodsl/examples/flash_attention_softmax_launch.py
 ```
 
+### `rms_norm/rmsnorm_alloc_buffer_simt.py`
+
+Compile-only RMSNorm example for explicit-mode SIMT kernels. It exercises
+SIMT-local `pto.alloc_buffer(...)`, hand-authored dynamic UB scratch offsets,
+contiguous `scalar.load` / `scalar.store`, `pto.vec`,
+`pto.simt_allreduce_sum(...)`, explicit pipe `set_flag` / `wait_flag` sync,
+and a runtime token loop that lowers to `scf.for`.
+
+```bash
+python3 ptodsl/examples/rms_norm/rmsnorm_alloc_buffer_simt.py --variant x128 > /tmp/rmsnorm_x128.mlir
+python3 ptodsl/examples/rms_norm/rmsnorm_alloc_buffer_simt.py --variant x64 > /tmp/rmsnorm_x64.mlir
+```
+
+Expected: MLIR containing `@rmsnorm_4096_alloc_buffer_simt_context_kernel`,
+`scf.for`, `vector<4xf32>` for both `x128` and `x64`, and inline
+`pto.redux_add` / `pto.syncthreads` allreduce ops. The main token loop should also contain dynamic
+`pto.set_flag_dyn` / `pto.wait_flag_dyn` operations for the ping-pong events.
+
 ### Launch artifacts
 
 - `~/.cache/ptodsl/` — JIT-compiled kernel `.so` cache
@@ -167,6 +185,7 @@ python3 ptodsl/tests/test_jit_compile.py
 python3 ptodsl/tests/test_jit_diagnostics.py
 python3 ptodsl/tests/test_subkernel_diagnostics.py
 python3 ptodsl/tests/test_flash_attention_demo_compile.py
+python3 ptodsl/tests/test_rmsnorm_example_compile.py
 python3 ptodsl/tests/test_ptoas_frontend_verify.py
 python3 ptodsl/tests/test_docs_as_test.py
 ```
@@ -178,6 +197,7 @@ ptodsl_jit_compile: PASS
 ptodsl_jit_diagnostics: PASS
 ptodsl_subkernel_diagnostics: PASS
 ptodsl_flash_attention_demo_compile: PASS
+ptodsl_rmsnorm_example_compile: PASS
 ptodsl_ptoas_frontend_verify: PASS
 ptodsl_docs_as_test: PASS
 ```
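
The expected markers listed for the RMSNorm example in the README hunk can be checked mechanically once the MLIR has been redirected to a file. The following is a minimal sketch, not the repository's `test_rmsnorm_example_compile.py`: the helper names `missing_markers` and `check_file` are hypothetical, and only the marker strings come from the README text.

```python
from pathlib import Path

# Marker strings are copied from the README's "Expected" paragraph.
# The helper functions below are hypothetical, written for illustration only.
EXPECTED_MARKERS = (
    "@rmsnorm_4096_alloc_buffer_simt_context_kernel",
    "scf.for",
    "vector<4xf32>",
    "pto.redux_add",
    "pto.syncthreads",
    "pto.set_flag_dyn",
    "pto.wait_flag_dyn",
)


def missing_markers(mlir_text, markers=EXPECTED_MARKERS):
    """Return the markers that do not occur anywhere in the MLIR text."""
    return [marker for marker in markers if marker not in mlir_text]


def check_file(path):
    """Read one emitted .mlir file and report which markers are absent."""
    return missing_markers(Path(path).read_text())
```

Calling `check_file("/tmp/rmsnorm_x128.mlir")` after running the README command would return an empty list when every marker is present. This is a plain substring search, so it does not confirm that the flag operations sit inside the token loop.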

diff --git a/ptodsl/docs/user_guide/01-introduction.md b/ptodsl/docs/user_guide/01-introduction.md
@@ -257,7 +257,9 @@ These are hardware-bound compute sub-kernels, each mapped to a specific NPU comp
 
 Each can be invoked as a named decorated function (`@pto.cube` /
 `@pto.simd` / `@pto.simt`) or inline as a context manager
-(`with pto.cube():`, `with pto.simd():`, `with pto.simt():`).
+(`with pto.cube():`, `with pto.simd():`, `with pto.simt():`). Inline SIMT
+scopes can also specify launch dimensions directly with
+`with pto.simt(dim_x, dim_y, dim_z):`.
 
 The boundary contract is strict: vreg values do not escape a simd kernel, cube-local state does not leak into UB, and data crosses layer boundaries only through UB-backed tiles or typed UB pointers.
 

diff --git a/ptodsl/docs/user_guide/03-kernel-entry-and-subkernels.md b/ptodsl/docs/user_guide/03-kernel-entry-and-subkernels.md
@@ -736,8 +736,9 @@ two ways:
 
 1. **As decorated functions** — reusable, named sub-kernels called from
    `@pto.jit` entries and modules.
-2. **As context managers** (`with pto.cube():`, etc.) — inline blocks for
-   one-off snippets (see Section 3.8).
+2. **As context managers** (`with pto.cube():`, `with pto.simd():`,
+   `with pto.simt():`, and `with pto.simt(dim_x, dim_y, dim_z):`) — inline
+   blocks for one-off snippets (see Section 3.8).
 
 Named sub-kernel decorators use the same default AST rewrite model as
 `@pto.jit`: supported Python `if` and `for range(...)` statements lower to
@@ -997,10 +998,13 @@ Specific SIMT micro-op APIs are documented in Chapter 13.
 
 ## 3.8 Inline context manager syntax
 
-In addition to the decorator form, each sub-kernel unit provides a context
-manager: `with pto.cube():`, `with pto.simd():`, and `with pto.simt():`. These
-open one-off anonymous sub-kernel bodies without requiring a separate named
-Python function. Inline scopes are supported in top-level `@pto.jit` bodies.
+In addition to the decorator form, each sub-kernel unit provides an inline
+context manager form: `with pto.cube():`, `with pto.simd():`,
+`with pto.simt():`, and `with pto.simt(dim_x, dim_y, dim_z):`. These open
+one-off anonymous sub-kernel bodies without requiring a separate named Python
+function. Inline scopes are supported in top-level `@pto.jit` bodies. The
+dimensioned SIMT form uses the same inline body style while making the caller
+emit an explicit `pto.simt_launch`.
 
 ### Syntax
 
@@ -1022,6 +1026,12 @@ with pto.simt():
     scalar.store(o_next, o_next_tile[row, col])
 ```
 
+```python
+with pto.simt(128, 1, 1):
+    tid = pto.get_tid_x()
+    scalar.store(tid, scratch_ub, scalar.index_cast(tid))
+```
+
 <!-- ptodsl-doc-test: {"mode":"compile_fragment","fixture":"kernel_entry.inline_cube_scope","symbol":"kernel_entry_inline_cube_scope_probe","compile":{"BLOCK_M":16,"BLOCK_K":16,"BLOCK_N":16}} -->
 ```python
 with pto.cube():
@@ -1041,6 +1051,9 @@ with pto.cube():
   / `pto.section.cube` bodies inside the outlined helper.
 - `with pto.simt():` preserves its scalar body inside one outlined
   `pto.simt_entry` helper, and the caller emits `pto.store_vfsimt_info`.
+- `with pto.simt(dim_x, dim_y, dim_z):` uses the same inline outlining and
+  automatic capture rules, but emits a caller-side explicit SIMT launch with
+  the authored dimensions.
 - Values defined inside the inline sub-kernel cannot escape the block directly.
   Use Tiles, typed pointers, or other mutable references to communicate results
   back to the caller.

diff --git a/ptodsl/docs/user_guide/04-type-system-and-buffer.md b/ptodsl/docs/user_guide/04-type-system-and-buffer.md
@@ -175,7 +175,31 @@ ptr_ub  = pto.ptr(pto.f16, pto.MemorySpace.UB)
 | `MemorySpace.ACC` | Cube L0C accumulator buffer |
 | `MemorySpace.BIAS` | Cube bias table buffer |
 
-## 4.5 TensorView
+## 4.5 Explicit scratch buffers
+
+`pto.alloc_buffer` allocates SIMT lane-local scratch storage for pointer-style
+load and store operations inside a SIMT helper.
+
+```text
+pto.alloc_buffer(shape, dtype)
+```
+
+<!-- ptodsl-doc-pending: {"reason":"illustrative fragment; covered by test_jit_compile alloc_buffer probes"} -->
+```python
+scratch = pto.alloc_buffer((32,), pto.f32)
+```
+
+| Parameter | Description |
+|-----------|-------------|
+| `shape` | Static positive integer shape. Pass an `int`, `tuple[int, ...]`, or `list[int]`. |
+| `dtype` | Element type of the returned buffer, such as `pto.f32` or `pto.i32`. |
+
+The returned pointer refers to storage that is local to the SIMT helper
+invocation that allocated it. Use it for per-work-item temporary fragments,
+scalar scratch values, or staged values that are accessed through pointer-style
+loads and stores.
+
+## 4.6 TensorView
 
 `TensorView` is a descriptor for a tensor in Global Memory. Create one inside a `@pto.jit` body with `make_tensor_view`:
 
@@ -205,7 +229,7 @@ def kernel(
 
 Strides support non-contiguous tensors. Pass `strides=A.strides` from the source tensor for the default row-major layout, or supply explicit strides for sub-views. Use `tv.as_ptr()` to obtain a typed GM pointer for use with MTE Ops in explicit-mode orchestration.
 
-## 4.6 PartitionTensorView
+## 4.7 PartitionTensorView
 
 `partition_view` creates a sub-view of a TensorView at a given offset and size. It describes *which part* of the GM tensor a `tile.load` or `tile.store` should operate on:
 
@@ -216,7 +240,7 @@ part = pto.partition_view(tv, offsets=[row_offset, 0], sizes=[BLOCK, dim])
 
 The result is a `PartitionTensorView` — a lightweight descriptor, not a data buffer. It carries the partition's shape, strides, and element type (inherited from the source TensorView). Use `part.as_ptr()` to obtain a typed GM pointer for MTE Ops in explicit-mode orchestration.
 
-## 4.7 Tile
+## 4.8 Tile
 
 A `Tile` is an on-chip buffer allocated in UB or cube-local memory. Allocate tiles with `alloc_tile`:
 

diff --git a/ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md b/ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md
@@ -35,7 +35,9 @@ When in doubt, ask: *can this value change between launches of the same compiled
 
 ## 6.2 Scalar access: load and store
 
-`scalar.load` reads a single scalar element from a typed pointer or tile location. `scalar.store` writes a scalar back. These are the canonical scalar memory ops for SIMT authoring. The offset is counted in elements, not bytes.
+`scalar.load` reads one scalar element from a typed pointer or tile location.
+`scalar.store` writes one scalar element back. These are the canonical scalar
+memory ops for SIMT authoring. Offsets are counted in elements, not bytes.
 
 #### `scalar.load(ptr: PtrType, offset: Index) -> ScalarType`
 
@@ -101,6 +103,67 @@ scalar.store(value, tile[row, col])
 scalar.store(value, ptr, offset)
 ```
 
+### Contiguous vector access
+
+Use `contiguous=N` when a single work-item should read or write `N` adjacent
+elements as one vector value. `N` must be a Python integer greater than
+`1`.
+
+#### `scalar.load(ptr: PtrType, offset: Index, *, contiguous: int) -> VecValue`
+
+**Description**: Loads `contiguous` adjacent elements from a typed pointer.
+
+**Parameters**:
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `ptr` | `PtrType` | Typed source pointer |
+| `offset` | `Index` | First element to load |
+| `contiguous` | Python `int` greater than `1` | Number of adjacent elements to load |
+
+**Returns**:
+
+| Return Value | Type | Description |
+|--------------|------|-------------|
+| `value` | `pto.vec(T, N)` | Vector value with `N == contiguous` and element type `T` |
+
+**Example**:
+
+```python
+x4 = scalar.load(ptr, offset, contiguous=4)
+```
+
+For a `pto.ptr(pto.f32, "ub")`, this produces a DSL vector value with type
+`pto.vec(pto.f32, 4)`.
+
+---
+
+#### `scalar.store(value: VecValue, ptr: PtrType, offset: Index, *, contiguous: int | None = None) -> None`
+
+**Description**: Stores a vector value to adjacent elements of a typed pointer.
+The store width is taken from the vector lane count. If `contiguous` is
+provided, it must match that lane count.
+
+**Parameters**:
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `value` | `pto.vec(T, N)` | Vector value to write |
+| `ptr` | `PtrType` | Typed destination pointer |
+| `offset` | `Index` | First element to store |
+| `contiguous` | `int` or `None` | Optional width check; when provided, it must equal `N` |
+
+**Example**:
+
+```python
+scalar.store(x4, ptr, offset)
+scalar.store(x4, ptr, offset, contiguous=4)  # optional width check
+```
+
+`scalar.store(scalar_value, ptr, offset, contiguous=N)` is rejected because
+scalar values are not implicitly broadcast for vector stores. To build an
+explicit broadcast vector, use `pto.vec(...)`; see Section 8.4.
+
 ### Scalar value adaptation
 
 `scalar.store` adapts the authored `value` to the destination element type.
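
The contiguous access rules added in this hunk can be modeled in plain Python to make the width checks concrete. This is a sketch of the documented semantics only, not the PTO DSL implementation: `Vec`, `load`, and `store` are hypothetical stand-ins, a Python list plays the role of the typed pointer, and the exception types are assumptions.

```python
class Vec:
    """Hypothetical stand-in for a fixed-lane vector value."""

    def __init__(self, lanes):
        self.lanes = list(lanes)


def load(mem, offset, contiguous=None):
    """Read one element, or `contiguous` adjacent elements as a Vec."""
    if contiguous is None:
        return mem[offset]
    if isinstance(contiguous, bool) or not isinstance(contiguous, int) or contiguous <= 1:
        raise ValueError("contiguous must be a Python int greater than 1")
    if offset + contiguous > len(mem):
        raise IndexError("contiguous load runs past the end of the buffer")
    return Vec(mem[offset:offset + contiguous])


def store(value, mem, offset, contiguous=None):
    """Write a scalar, or write every lane of a Vec to adjacent elements."""
    if isinstance(value, Vec):
        width = len(value.lanes)  # the store width comes from the lane count
        if contiguous is not None and contiguous != width:
            raise ValueError("contiguous must match the vector lane count")
        if offset + width > len(mem):
            raise IndexError("contiguous store runs past the end of the buffer")
        mem[offset:offset + width] = value.lanes
        return
    if contiguous is not None:
        # Mirrors the documented rule: scalars are not implicitly broadcast.
        raise TypeError("scalar values are not broadcast for vector stores")
    mem[offset] = value
```

In this model, `store(load(mem, 2, contiguous=4), mem, 4)` copies elements 2 to 5 onto elements 4 to 7, while `store(1.0, mem, 0, contiguous=4)` raises, matching the rejection described above.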

diff --git a/ptodsl/docs/user_guide/08-compute-operations.md b/ptodsl/docs/user_guide/08-compute-operations.md
@@ -1864,3 +1864,43 @@ The `mte_l1_l0a`/`mte_l1_l0b` stage operands from the authored source tiles into
 | `pto.mad_mx_bias(lhs, rhs, dst, bias, m, n, k, **clauses)` | MX-format bias-init matmul |
 
 MX variants require MX-enabled dtypes (f8) and pre-loaded scale payloads. For most users, the standard `mad`, `mad_acc`, and `mad_bias` are the primary interface.
+
+---
+
+## 8.4 Builtin vector values
+
+Builtin vector values are small fixed-lane vectors used by contiguous scalar
+accesses and element-wise vector expressions. They are distinct from the
+`VRegType` values used inside `@pto.simd` kernels.
+
+#### `pto.vec(dtype, lanes, *, init=None)`
+
+**Description**: Without `init`, names a builtin vector type. With `init`,
+constructs a vector value; a scalar initializer is broadcast to every lane.
+
+**Parameters**:
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `dtype` | PTO dtype | Element type, such as `pto.f32` |
+| `lanes` | Positive Python `int` | Number of lanes |
+| `init` | Scalar value, vector value, or `None` | Optional initializer; scalar values are broadcast to all lanes |
+
+**Returns**:
+
+| Return Value | Type | Description |
+|--------------|------|-------------|
+| `result` | Vector type or `pto.vec(dtype, lanes)` value | Without `init`, returns a vector type descriptor; with `init`, returns a vector value |
+
+**Example**:
+
+<!-- ptodsl-doc-pending: {"reason":"illustrative fragment; covered by test_jit_compile scalar contiguous vector probes"} -->
+```python
+x4 = scalar.load(ptr, offset, contiguous=4)
+rstd4 = pto.vec(pto.f32, 4, init=rstd)
+y4 = x4 * rstd4
+scalar.store(y4, ptr, offset)
+```
+
+Use this form when a scalar value must participate in element-wise arithmetic
+with a vector value returned by `scalar.load(..., contiguous=N)`.
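
The broadcast-then-multiply pattern in the Section 8.4 example can be checked against a host-side reference. The sketch below is plain Python, not PTO DSL code: `broadcast` and `rmsnorm_row` are hypothetical helpers, the `eps` default is an assumed value, and the formula is textbook RMSNorm without a learned weight, which may differ from what the example kernel computes.

```python
import math


def broadcast(scalar, lanes):
    """Model a scalar initializer copied to every lane of a vector."""
    return [scalar] * lanes


def rmsnorm_row(x, eps=1e-6, lanes=4):
    """Normalize one row in chunks of `lanes` elements.

    eps and the absence of a weight term are assumptions for illustration.
    """
    if len(x) % lanes != 0:
        raise ValueError("row length must be a multiple of the lane count")
    # Reciprocal standard deviation: 1 / sqrt(mean(x^2) + eps).
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    rstd_vec = broadcast(rstd, lanes)
    out = []
    for start in range(0, len(x), lanes):
        chunk = x[start:start + lanes]  # models a contiguous load
        out.extend(a * b for a, b in zip(chunk, rstd_vec))  # lane-wise multiply
    return out
```

Because every lane of the broadcast vector holds the same `rstd`, the chunked result equals scaling each element of the row by `rstd` directly.
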
diff --git a/ptodsl/docs/user_guide/13-simt-micro-ops.md b/ptodsl/docs/user_guide/13-simt-micro-ops.md
@@ -10,8 +10,8 @@ scalar values loaded from tiles.
 #### `pto.store_vfsimt_info(dim_z, dim_y, dim_x) -> None`
 
 **Description**: Emits the low-level VPTO launch descriptor operation. Most
-code should use `body[dim_x, dim_y, dim_z](...)` or `pto.simt_launch(...)`
-instead.
+code should use `body[dim_x, dim_y, dim_z](...)`, `pto.simt_launch(...)`, or
+the inline form `with pto.simt(dim_x, dim_y, dim_z):` instead.
 
 **Parameters**: