hw-native-sys · SergioMartin86 · Jun 25, 2026
diff --git a/RECONCILIATION_NOTES.md b/RECONCILIATION_NOTES.md
@@ -0,0 +1,134 @@
+# Polling PR — Rebase Reconciliation Notes
+
+State of `polling-pr-minimal` (HEAD `188be7e4`) vs `upstream/main`
+(currently `ecfb1663`, 14 commits ahead of the PR's base `fcc33bcb`).
+
+## TL;DR
+
+- **`git rebase upstream/main` produces 15 file conflicts.** Mechanical
+  resolution (take "theirs" for files we rewrote, take "ours" for files
+  with upstream feature additions, rename `Arg→L0TaskArgs` and `.ptr→.ref()`
+  throughout) gets the tree to **compile clean**.
+- **The compile-clean tree still hangs at runtime** with the now-familiar
+  507018 AICore op-timeout. The hang is a **protocol-level mismatch**
+  between upstream's evolved init/dispatch handshake and the polling-side
+  SHM/scheduler, so more renames will not fix it. Estimated 1-2 days of
+  targeted protocol alignment and re-benchmarking to land cleanly.
+- Decision (24 Jun 2026): pause the rebase. Either the reviewer/maintainer
+  rebases as part of the merge, or this PR is rebased in a future session.
+
+## Upstream commits since `fcc33bcb`
+
+| Commit | Touches polling design? | Why |
+|---|---|---|
+| `10a7680b` `Refactor: tensormap L0/L2TaskArgs arg hierarchy` | **Yes — heavy** | `Arg` → `Arg<MaxT,MaxS>` template; `L0TaskArgs = Arg<32,16>` for core submit; `TensorRef::ptr` → `.ref()`/`.create_info()` accessors. Renames propagate through every submit signature. |
+| `c6354842` `feat(runtime): unify runtime_env ring sizing` | **Yes — heavy** | `PTO2RuntimeArenaLayout` gains `task_window_sizes[PTO2_MAX_RING_DEPTH]`/`heap_sizes[]`/`dep_pool_capacities[]` arrays. New `init_per_ring` on `PTO2SharedMemoryHandle`. `runtime_init_data_from_layout` per-ring overload. |
+| `6dd8a5dc` `consolidate profiling init into SchedulerContext::init()` | Yes — medium | `SchedulerContext::init()` signature changed; `l2_swimlane_level` moved from `PTO2OrchestratorState` to `SchedulerContext`. `runtime_destroy(rt, arena)` 2-arg signature. |
+| `6c3a9e49` `consumed/reuse deadlock fix` | **No** | Fixes interaction between `fanout_refcount` / `fanout_count` / `task_state=CONSUMED` / `scope_end` producer-release — all four mechanisms removed by the polling design. |
+| `11f0bf40` `AICPU callable prewarm` | Yes — light | Adds `aicpu_prewarm_callable` C entry to `aicpu_executor.cpp`. |
+| `4725ef7b` `dispatcher fresh-process retry` | Yes — light | Adds retry path in `device_runner.cpp`. |
+| `78b123e7` `rename init-claim flag to init_claimed_` | Trivial | Field rename in scheduler. |
+| `ae59a8e9` `in-place card recovery` | No | `device_runner.cpp` only. |
+| `3aa94a99` `close unpublished sim host orchestration handles` | No | Sim host only. |
+| `e2112e9f` `restore SDMA async completion demo` | No | Example. |
+| Others (`ecfb1663`, `cce30871`, `2f77399a`, `e583b8a0`) | Trivial | CI / docs / examples. |
+
+## Per-file conflict matrix
+
+After `git rebase upstream/main`, 15 files conflict. Recommended
+resolution + work needed:
+
+| File | Recommended action | Status |
+|---|---|---|
+| `runtime/pto_runtime2_types.h` | take theirs (polling) | ✓ compile-fixed |
+| `runtime/pto_runtime2.h` | take theirs + add per-ring overloads | ✓ compile-fixed (added `runtime_reserve_layout` and `runtime_init_data_from_layout` per-ring overloads; added `runtime_destroy(rt, arena)` overload) |
+| `runtime/pto_runtime2.cpp` | take theirs (stub) | ✓ |
+| `runtime/pto_orchestrator.cpp` | take theirs (stub) | ✓ |
+| `runtime/pto_orchestrator.h` | take theirs + rename `Arg → L0TaskArgs` + `.create_info → .create_info()` + `.ptr → &.ref()` + add `l2_swimlane_level` field | ✓ compile-fixed |
+| `runtime/pto_dep_compute.h` | take theirs + `inputs.tensors[i].ptr → &inputs.tensors[i].ref()` | ✓ compile-fixed |
+| `runtime/scheduler/pto_scheduler.h` | take theirs (polling) | ✓ |
+| `runtime/scheduler/scheduler_context.h` | take theirs + add `thread_idx` to `on_orchestration_done` signature | ✓ compile-fixed |
+| `runtime/scheduler/scheduler_cold_path.cpp` | take theirs (stub) | ✓ |
+| `runtime/scheduler/scheduler_dispatch.cpp` | take theirs (stub) | ✓ |
+| `runtime/pto_shared_memory.h` | take theirs (polling) + add `init_per_ring` method (broadcast to scalar init) | ✓ compile-fixed |
+| `runtime/runtime.h` | add `needs_copy_back` to `TensorPair` (upstream-API compat) | ✓ compile-fixed |
+| `aicpu/aicpu_executor.cpp` | take ours (upstream — has prewarm, profiling consolidation, deadlock-fix-related changes) | ✓ compile-fixed via signature adapters above |
+| `host/runtime_maker.cpp` | take ours (upstream — has per-ring env parsing #1128) | ✓ compile-fixed |
+| `orchestration/pto_arg_with_deps.h` | take ours (upstream) | ✓ trivial |
+| `orchestration/pto_orchestration_api.h` | take ours (upstream) | ✓ trivial |
+| `docs/MULTI_RING.md` | take theirs (updated for polling) | ✓ trivial |
+
+## Runtime hang — root cause hypothesis
+
+After the compile-clean tree above runs `paged_attention` Case1, AICore
+times out at 507018 with no orchestration log past the `simpler-dispatcher`
+init. Suspect chain:
+
+1. **`init_per_ring` is a stub**. My implementation broadcasts
+   `task_window_sizes[0]` to the old scalar `init_header` /
+   `setup_pointers`. If upstream's `aicpu_executor` writes
+   `prebuilt_layout.task_window_sizes[r]` for r > 0 with different values
+   than [0], the SHM layout's per-ring offsets diverge from what the
+   AICPU expects → wrong pointers → silent corruption or hang.
+2. **`PTO2OrchestratorState::l2_swimlane_level`** is back as a field, but
+   upstream's `SchedulerContext::init` may now own that state. Adding
+   the field in two places creates a tearing concern only if both writers
+   actually fire — unlikely to be the hang root cause but worth checking.
+3. **`runtime_destroy(rt, arena)`**: my overload calls the 1-arg form,
+   but upstream's `arena` parameter may be used for staged teardown
+   (e.g., scope finalize). The polling design's destroy doesn't need it
+   but the *order* of teardown might matter for upstream's aicpu_executor
+   loop. Not the boot-time hang, but a leak/reset issue downstream.
+4. **AICPU dispatch handshake**: upstream's aicpu_executor may have
+   ordering expectations around when the polling design's wiring queue
+   is initialized vs when the AICore handshake fires. The polling
+   scheduler initializes wiring lazily in `init_data_from_layout`; if
+   upstream's executor handshakes AICore *before* the wiring queue is
+   ready, AICore spins for tasks that never arrive.
+
+The fix path: thread true per-ring sizes through `PTO2SharedMemoryHandle`
+(currently the polling code uses a uniform per-ring layout — needs to
+honor the array), then add a runtime trace point at the boundary
+between aicpu_executor's `init_per_ring` call and the scheduler's first
+`drain_wiring_queue` to confirm where the AICore handshake is firing
+vs when the wiring becomes ready.
+
+## What to do next session
+
+1. `git rebase upstream/main`, then apply the resolutions above (they are
+   mechanical now that this doc records them).
+2. Build (should compile clean as documented).
+3. Run `paged_attention` Case1 to confirm the runtime hang reproduces.
+4. Add device-side `LOG_INFO_V0` traces at:
+   - `PTO2SharedMemoryHandle::init_per_ring` entry/exit (per ring)
+   - `AicpuExecutor::run` immediately before / after the first scheduler
+     `drain_wiring_queue` call
+   - `SchedulerContext::on_orchestration_done` entry
+5. Diagnose the gap revealed by the traces; align the polling SHM /
+   wiring init order with upstream's handshake.
+6. Re-run the 26-test benchmark sweep (the one in `PR_NOTES.md`) and
+   confirm parity with the pre-rebase result.
+
+## Quick repro recipe
+
+```bash
+git checkout polling-pr-minimal             # HEAD = 188be7e4
+git rebase upstream/main                    # 15 conflicts
+
+# Take theirs (polling) for files we rewrote:
+git checkout --theirs \
+  src/a2a3/runtime/tensormap_and_ringbuffer/runtime/{pto_runtime2_types.h,pto_runtime2.cpp,pto_runtime2.h,pto_orchestrator.cpp,pto_orchestrator.h,pto_dep_compute.h,scheduler/pto_scheduler.h,scheduler/scheduler_context.h,scheduler/scheduler_cold_path.cpp,scheduler/scheduler_dispatch.cpp} \
+  src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
+
+# Take ours (upstream) for files where upstream adds features:
+git checkout --ours \
+  src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp \
+  src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp \
+  src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/{pto_arg_with_deps.h,pto_orchestration_api.h}
+
+git add -u src/
+
+# Apply compile-fixes (see "Per-file conflict matrix" for details).
+# Build is clean after these. Runtime hangs — see "Runtime hang — root
+# cause hypothesis" above for the next investigation steps.
+```
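Note on hypothesis 1 in the notes above: a minimal sketch of how a scalar-broadcast `init_per_ring` could diverge from a true per-ring layout. It assumes rings are packed back to back and that each ring's region scales with its task window. `SLOT_BYTES`, `ring_offsets`, `broadcast_ring0` and the sample sizes are hypothetical and only illustrate the arithmetic; the real layout is whatever `PTO2SharedMemoryHandle` computes.

```python
from itertools import accumulate

SLOT_BYTES = 64  # hypothetical per-task slot size, for illustration only


def ring_offsets(task_window_sizes):
    """Byte offset of each ring when rings are packed back to back."""
    sizes = [w * SLOT_BYTES for w in task_window_sizes]
    # Exclusive prefix sum: ring r starts where rings 0..r-1 end.
    return [0] + list(accumulate(sizes))[:-1]


def broadcast_ring0(task_window_sizes):
    """Model of a stub that applies ring 0's size to every ring."""
    return [task_window_sizes[0]] * len(task_window_sizes)


# Hypothetical per-ring sizes as the upstream side might write them.
upstream_sizes = [8192, 16384, 131072, 524288]

expected = ring_offsets(upstream_sizes)                 # per-ring view
actual = ring_offsets(broadcast_ring0(upstream_sizes))  # broadcast view

for ring, (exp, act) in enumerate(zip(expected, actual)):
    status = "ok" if exp == act else "MISMATCH"
    print(f"ring {ring}: per-ring offset {exp}, broadcast offset {act} -> {status}")
```

Under these assumptions rings 0 and 1 still line up and every later ring starts at the wrong offset, which would fit a failure that only shows up once deeper scopes are used. If all entries of `task_window_sizes` are equal, the two views coincide and this hypothesis can be ruled out.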
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h b/src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h
@@ -63,7 +63,7 @@
  *     compiled, ran without error, and produced wrong output. Use
  *     `get_sub_block_id(args)` instead, which reads from the runtime's
  *     `GlobalContext.sub_block_id` that the scheduler initializes per
- *     AIV core in `scheduler_cold_path.cpp::SchedulerContext::init`.
+ *     AIV core in `scheduler_context.h::SchedulerContext::init`.
  *
  *   - `get_block_idx()` and `get_block_num()` are not redirected to
  *     simpler's LocalContext either — use the `(args)` variants below
@@ -97,7 +97,7 @@ static constexpr int32_t PTO2_EXT_PARAMS_COUNT = 2;
 
 /**
  * Args[] suffix indices for context pointers.
- * Derived from MAX_TENSOR_ARGS(32) + MAX_SCALAR_ARGS(16).
+ * Derived from MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(32).
  * Users should not depend on these values; use the Get* functions below.
  */
 static constexpr int32_t SPMD_LOCAL_CONTEXT_INDEX = 48;

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
@@ -179,8 +179,9 @@ Each ring's `last_task_alive` advances independently:
 
 ```text
 advance_ring_pointers(ring_id):  // protected by per-ring advance_lock
-    la = ring->fc.last_task_alive
-    while ring->get_slot_state_by_task_id(la).task_state >= CONSUMED:
+    watermark = ring->completed_watermark
+    la = last_task_alive
+    while la <= watermark and watermark >= slot[la].last_consumer_local_id:
         reset slot for reuse
         la++
     sync_to_sm()  // release-store last_task_alive
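The watermark-based loop in the hunk above can be modelled in a few lines. This is a sketch under stated assumptions, not the runtime's code: it assumes `completed_watermark` is the highest local task id known to be complete, that a task with no recorded consumer counts as its own last consumer, and that slots are indexed by `task_id & (window - 1)`. The `Ring` class and its fields are hypothetical, and the per-ring `advance_lock` is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    window: int                    # slot count, assumed to be a power of two
    last_task_alive: int = 0
    completed_watermark: int = -1  # highest local id known complete (assumed)
    last_consumer: dict = field(default_factory=dict)  # task id -> last consumer id

    def advance(self):
        """Reclaim slots whose task and last consumer are both at or below the watermark."""
        reclaimed = []
        la = self.last_task_alive
        wm = self.completed_watermark
        while la <= wm and wm >= self.last_consumer.get(la, la):
            reclaimed.append(la & (self.window - 1))  # slot index reset for reuse
            la += 1
        self.last_task_alive = la  # stands in for sync_to_sm()
        return reclaimed


ring = Ring(window=4, last_consumer={0: 1, 1: 3, 2: 2})

ring.completed_watermark = 2
first = ring.advance()   # task 1 is still needed by consumer 3, so the loop stops there

ring.completed_watermark = 3
second = ring.advance()  # consumer 3 is now complete, so tasks 1..3 are reclaimed

print(first, second, ring.last_task_alive)
```

The point of the model is that reclamation stalls at the first task whose last consumer is still above the watermark, even if later tasks are already complete.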
@@ -235,91 +236,25 @@ AICore uses `last_reg_val` to detect new dispatches — identical values cause s
 | `PTO2_HEAP_SIZE` | 256 MB | 1 GB |
 | `PTO2_DEP_LIST_POOL_SIZE` | 16384 | 65536 |
 
-### 7.2 Runtime Overrides
-
-Each ring resource (`ring_task_window` / `ring_heap` / `ring_dep_pool`) is a
-single `CallConfig.runtime_env` field that accepts **either** a scalar (broadcast
-to every ring) **or** a list of four per-ring values. Precedence is resolved
-independently for each resource and ring:
-
-```text
-per-ring CallConfig entry (a scalar is broadcast to every entry)
-  > per-ring PTO2_RING_* env value
-  > scalar PTO2_RING_* env value
-  > compile-time default
-```
-
-`ring_id` is the scope-depth ring selected by the runtime:
-
-```text
-scope depth 0 -> ring 0
-scope depth 1 -> ring 1
-scope depth 2 -> ring 2
-scope depth >=3 -> ring 3
-```
+### 7.2 Runtime Environment Overrides
 
-Per-task via `CallConfig.runtime_env` — different L2 tasks in one launch can
-each carry their own sizes. Invalid values raise at submit time (`validate()`).
-Assign a scalar to size every ring the same:
-
-```python
-cfg = CallConfig()
-cfg.runtime_env.ring_task_window = 128   # power of 2, >= 4
-cfg.runtime_env.ring_heap = 262144       # bytes/ring, >= 1024
-cfg.runtime_env.ring_dep_pool = 256      # 4 .. INT32_MAX
-orchestrator.submit_next_level(handle, args, cfg)
-```
-
-Assign a four-entry list to tune the scope-depth rings independently. The list
-must contain exactly four entries; use `0` for an entry that should fall through
-to the next precedence tier. All `CallConfig` values are integer byte/count
-values, and each field always reads back as a four-entry list.
-
-```python
-cfg = CallConfig()
-cfg.runtime_env.ring_task_window = [8192, 16384, 131072, 524288]
-cfg.runtime_env.ring_heap = [
-    128 * 1024 * 1024,
-    256 * 1024 * 1024,
-    384 * 1024 * 1024,
-    512 * 1024 * 1024,
-]
-cfg.runtime_env.ring_dep_pool = [4096, 8192, 16384, 32768]
-orchestrator.submit_next_level(handle, args, cfg)
-```
-
-Scene tests set the same keys under a nested `runtime_env` block in the
-per-case `config` dict — each value is a scalar or a four-entry list:
-
-```python
-"config": {
-    "runtime_env": {
-        "ring_task_window": [8192, 16384, 131072, 524288],
-        "ring_heap": [134217728, 268435456, 402653184, 536870912],
-        "ring_dep_pool": 256,  # scalar broadcasts to every ring
-    }
-}
-```
-
-Process-wide env fallback accepts either one scalar value or exactly four
-comma-separated per-ring values. Invalid env values are logged and ignored, then
-fall through to defaults. `PTO2_RING_HEAP` values are integer bytes:
+Uniform (applies to all rings):
 
 ```bash
-# Uniform, old behavior:
 PTO2_RING_TASK_WINDOW=1024
 PTO2_RING_HEAP=1048576
 PTO2_RING_DEP_POOL=1024
-
-# Per-ring, indexed by ring_id 0..3:
-PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
-PTO2_RING_HEAP=134217728,268435456,402653184,536870912
-PTO2_RING_DEP_POOL=4096,8192,16384,32768
 ```
 
-Use `--enable-scope-stats` to confirm the effective values for a real run. The
-first line of `scope_stats/scope_stats.jsonl` includes `task_window_max`,
-`heap_max`, and `dep_pool_max`, indexed by `ring`.
+In `kernel_config.py`:
+
+```python
+RUNTIME_ENV = {
+    "PTO2_RING_TASK_WINDOW": "128",
+    "PTO2_RING_HEAP": "262144",
+    "PTO2_RING_DEP_POOL": "256",
+}
+```
 
 ### 7.3 Sizing Guidelines
 

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
@@ -538,7 +538,7 @@ This is protected by a per-ring try-lock (`advance_lock`) in `RingSchedState`, e
 
 ### 8.5 SchedulerContext
 
-All scheduler-side state and methods live in `SchedulerContext` (`runtime/scheduler/scheduler_context.h`). It is held as a `sched_ctx_` member of `AicpuExecutor`; `AicpuExecutor` is a thin wrapper that owns the lifecycle atomics and the orchestration SO handle, and delegates everything else to `SchedulerContext`.
+All scheduler-side state and methods live in `SchedulerContext` (`runtime/scheduler_context.h`). It is held as a `sched_ctx_` member of `AicpuExecutor`; `AicpuExecutor` is a thin wrapper that owns the lifecycle atomics and the orchestration SO handle, and delegates everything else to `SchedulerContext`.
 
 Public surface (called from `AicpuExecutor::init/run/deinit`):
 
@@ -552,11 +552,7 @@ Public surface (called from `AicpuExecutor::init/run/deinit`):
 | `deinit()` | once per run | Reset every scheduler-owned field to its post-construction default |
 | Read-only accessors | various | `aic_count()` / `aiv_count()` / `is_completed()` / `completed_tasks_count()` |
 
-Private internals are split across three .cpp files by responsibility:
-
-- `scheduler_completion.cpp` — completion polling, drain protocol
-- `scheduler_dispatch.cpp` — task dispatch loop and helpers
-- `scheduler_cold_path.cpp` — exit checks, stall diagnostics, profiling, lifecycle (`init/deinit`), core management (`handshake_all_cores` / `assign_cores_to_threads` / `emergency_shutdown`), and `on_orchestration_done`
+Private internals all live inline in `scheduler_context.h`, covering completion polling, drain protocol, task dispatch loop and helpers, exit checks, stall diagnostics, profiling, lifecycle (`init/deinit`), core management (`handshake_all_cores` / `assign_cores_to_threads` / `reassign_cores_for_all_threads` / `emergency_shutdown`), and `on_orchestration_done`.
 
 `AicpuExecutor` calls neither `handshake_*`, `assign_*`, `reassign_*`, nor `emergency_shutdown` directly — they are private, invoked only by `init` and `on_orchestration_done`.
 

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SCALAR_DATA_ACCESS.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SCALAR_DATA_ACCESS.md
@@ -32,7 +32,7 @@ addr null-check → TensorMap lookup → spin-wait producer COMPLETED → comput
 
 - **addr null-check**: `buffer.addr == 0` means unallocated — log error, return 0
 - **TensorMap lookup**: find producer task by `buffer.addr`
-- **spin-wait**: wait until producer `task_state >= PTO2_TASK_COMPLETED`
+- **spin-wait**: wait until producer's `completion_flags[local_id & mask] == 1`
 - **No producer** (lookup callback never fires): skip waiting, read immediately
 
 ### 3.2 set_tensor_data Flow
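The new spin-wait condition in the hunk above, `completion_flags[local_id & mask] == 1`, can be illustrated with a small host-side model. This is a sketch only: `WINDOW`, `wait_for_producer`, `producer` and the timeout are hypothetical, Python threads stand in for AICPU/AICore agents, and the real code would need acquire/release ordering on the flag rather than a plain list read.

```python
import threading
import time

WINDOW = 8          # hypothetical ring window, assumed to be a power of two
MASK = WINDOW - 1
completion_flags = [0] * WINDOW


def wait_for_producer(local_id, timeout_s=1.0):
    """Spin until the producer's flag is set; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    slot = local_id & MASK
    while completion_flags[slot] != 1:
        if time.monotonic() > deadline:
            return False
        time.sleep(0)  # yield so the producer thread can run
    return True


def producer(local_id):
    time.sleep(0.01)                       # pretend to run the kernel
    completion_flags[local_id & MASK] = 1  # publish completion


t = threading.Thread(target=producer, args=(10,))
t.start()
ok = wait_for_producer(10)
t.join()
print("producer 10 completed:", ok, "in slot", 10 & MASK)
```

Because the flag array is indexed by `local_id & mask`, two task ids that differ by a multiple of the window share a slot. The model therefore only holds if a slot's flag is cleared when the slot is reset for reuse, which is assumed here rather than taken from the source.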

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
@@ -52,7 +52,7 @@ Thread 3: PTO2 total submitted tasks = 16704
 
 ### Field Reference
 
-| Field | Source (`pto_orchestrator.cpp`) | Description |
+| Field | Source (`pto_orchestrator.h`) | Description |
 | ----- | ------------------------------- | ----------- |
 | **cost** | Wall-clock around `orch_func()` call | Total time including orchestration logic + scope overhead |
 | **total** | Sum of all sub-steps below | Accumulated time inside `submit_task` across all tasks |

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
@@ -48,7 +48,7 @@ Each sub-level macro requires `PTO2_PROFILING=1`:
 
 - Debug/diagnostic logs (always present)
 - Progress tracking (`PTO2 progress: completed=...`)
-- Stall detection and dump (triggered after the `SCHEDULER_TIMEOUT_MS` wall-clock no-progress budget)
+- Stall detection and dump (triggered only after `MAX_IDLE_ITERATIONS` idle loops)
 - Deadlock/livelock detection (`diagnose_stuck_state`, called on stall)
 
 **What's NOT compiled:**
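For the stall-detection bullet changed in the hunk above, a minimal sketch of an idle-iteration trigger. It assumes the counter resets whenever a loop iteration makes progress and that the dump is due once the count reaches the limit. `StallDetector` and the value `5` are hypothetical; the real `MAX_IDLE_ITERATIONS` is defined by the runtime.

```python
MAX_IDLE_ITERATIONS = 5  # hypothetical value, for illustration only


class StallDetector:
    def __init__(self, limit=MAX_IDLE_ITERATIONS):
        self.limit = limit
        self.idle = 0

    def tick(self, made_progress):
        """Call once per scheduler loop; returns True when a stall dump is due."""
        if made_progress:
            self.idle = 0  # any progress restarts the idle budget
            return False
        self.idle += 1
        return self.idle >= self.limit


detector = StallDetector(limit=3)
trace = [detector.tick(p) for p in [False, False, True, False, False, False]]
print(trace)
```

Compared with the wall-clock budget named on the removed line, an iteration count is cheaper to check but its real-time meaning depends on how long one loop iteration takes.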
@@ -255,7 +255,7 @@ Identity fields the AICPU side used to write at level 1 (`func_id`,
   collector (`L2SwimlaneCollector::set_core_types`).
 
 AICore buffer rotation no longer piggy-backs on `complete_task`. AICPU
-counts dispatches per core in the dispatch path (scheduler_dispatch in
+counts dispatches per core in the dispatch path (scheduler_context in
 tensormap_and_ringbuffer; aicpu_executor in host_build_graph) and rotates
 the AICore buffer when the count is about to cross a
 `PLATFORM_AICORE_BUFFER_SIZE` boundary — strictly before
@@ -428,7 +428,7 @@ definitions to runtime headers.
 ### Code Locations
 
 - Macro defaults and validation: `src/common/task_interface/profiling_config.h`
-- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp` and `scheduler_cold_path.cpp`
+- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler_context.h`
 - Orchestrator profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`
 - TensorMap profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h`
 

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp
@@ -556,7 +556,7 @@ dep_gen_replay_emit_deps_json(const DepGenRecord *records, size_t num_records, c
         // `explicit_dep_count` / `over->dep_count` originate from device
         // shared memory and are bounded by the writer to the array sizes, but
         // we clamp on read too so a corrupted record never drives an OOB read
-        // off the end of rec.explicit_deps[64] / over->deps[582].
+        // off the end of rec.explicit_deps[64] / over->deps[326].
         const uint64_t *deps_data;
         int32_t dc;
         if (rec.flags & DEP_GEN_FLAG_HAS_OVERFLOW) {
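The clamp-on-read rule described in the comment above reduces to bounding the count before it is used as a length. A minimal sketch, assuming the count may arrive negative or oversized from a corrupted record; `clamp_dep_count` and `read_deps` are hypothetical helpers, and only the `64` capacity is taken from the comment.

```python
EXPLICIT_DEPS_CAP = 64  # matches rec.explicit_deps[64] in the comment above


def clamp_dep_count(raw_count, capacity):
    """Bound a count read from shared memory to the range [0, capacity]."""
    return max(0, min(raw_count, capacity))


def read_deps(deps, raw_count, capacity=EXPLICIT_DEPS_CAP):
    """Return only the entries that the clamped count allows."""
    return deps[:clamp_dep_count(raw_count, capacity)]


deps = list(range(EXPLICIT_DEPS_CAP))
print(read_deps(deps, 3))          # well-formed record
print(len(read_deps(deps, 1000)))  # corrupted count, clamped to capacity
print(read_deps(deps, -5))         # negative count, clamped to zero
```

Since the overflow array's size changes in this hunk, the overflow path needs the same clamp with its own capacity, which `read_deps` takes as a parameter rather than hard-coding.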

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/common.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/common.cpp
@@ -10,6 +10,13 @@
  */
 #include "common.h"
 
+// LOG_ERROR can't be pulled from common/unified_log.h here because that header
+// would re-#define LOG_INFO_V0..V9 already provided by pto_orchestration_api.h
+// (orchestration routes them through the runtime ops table). For the limited
+// use inside this file, write directly to stderr.
+#include <cstdio>
+#define LOG_ERROR(fmt, ...) std::fprintf(stderr, "[ERROR] " fmt "\n", ##__VA_ARGS__)
+
 #ifdef __linux__
 #include <cxxabi.h>
 #include <dlfcn.h>