134 changes: 134 additions & 0 deletions RECONCILIATION_NOTES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
# Polling PR — Rebase Reconciliation Notes

State of `polling-pr-minimal` (HEAD `188be7e4`) vs `upstream/main`
(currently `ecfb1663`, 14 commits ahead of the PR's base `fcc33bcb`).

## TL;DR

- **`git rebase upstream/main` produces 15 file conflicts.** Mechanical
resolution (take "theirs" for files we rewrote, take "ours" for files
with upstream feature additions, rename `Arg→L0TaskArgs` and `.ptr→.ref()`
throughout) gets the tree to **compile clean**.
- **The compile-clean tree still hangs at runtime** with the now-familiar
507018 AICore op-timeout. The hang is a **protocol-level mismatch**
between upstream's evolved init/dispatch handshake and the polling-side
SHM/scheduler, so a few more renames will not fix it. Estimated 1-2 days
of targeted protocol alignment + re-benchmarking to land cleanly.
- Decision (24 Jun 2026): pause the rebase. Either the reviewer/maintainer
rebases as part of the merge, or this PR is rebased in a future session.

## Upstream commits since `fcc33bcb`

| Commit | Touches polling design? | Why |
|---|---|---|
| `10a7680b` `Refactor: tensormap L0/L2TaskArgs arg hierarchy` | **Yes — heavy** | `Arg` → `Arg<MaxT,MaxS>` template; `L0TaskArgs = Arg<32,16>` for core submit; `TensorRef::ptr` → `.ref()`/`.create_info()` accessors. Renames propagate through every submit signature. |
| `c6354842` `feat(runtime): unify runtime_env ring sizing` | **Yes — heavy** | `PTO2RuntimeArenaLayout` gains `task_window_sizes[PTO2_MAX_RING_DEPTH]`/`heap_sizes[]`/`dep_pool_capacities[]` arrays. New `init_per_ring` on `PTO2SharedMemoryHandle`. `runtime_init_data_from_layout` per-ring overload. |
| `6dd8a5dc` `consolidate profiling init into SchedulerContext::init()` | Yes — medium | `SchedulerContext::init()` signature changed; `l2_swimlane_level` moved from `PTO2OrchestratorState` to `SchedulerContext`. `runtime_destroy(rt, arena)` 2-arg signature. |
| `6c3a9e49` `consumed/reuse deadlock fix` | **No** | Fixes interaction between `fanout_refcount` / `fanout_count` / `task_state=CONSUMED` / `scope_end` producer-release — all four mechanisms removed by the polling design. |
| `11f0bf40` `AICPU callable prewarm` | Yes — light | Adds `aicpu_prewarm_callable` C entry to `aicpu_executor.cpp`. |
| `4725ef7b` `dispatcher fresh-process retry` | Yes — light | Adds retry path in `device_runner.cpp`. |
| `78b123e7` `rename init-claim flag to init_claimed_` | Trivial | Field rename in scheduler. |
| `ae59a8e9` `in-place card recovery` | No | `device_runner.cpp` only. |
| `3aa94a99` `close unpublished sim host orchestration handles` | No | Sim host only. |
| `e2112e9f` `restore SDMA async completion demo` | No | Example. |
| Others (`ecfb1663`, `cce30871`, `2f77399a`, `e583b8a0`) | Trivial | CI / docs / examples. |

## Per-file conflict matrix

After `git rebase upstream/main`, 15 files conflict. Recommended
resolution + work needed:

| File | Recommended action | Status |
|---|---|---|
| `runtime/pto_runtime2_types.h` | take theirs (polling) | ✓ compile-fixed |
| `runtime/pto_runtime2.h` | take theirs + add per-ring overloads | ✓ compile-fixed (added `runtime_reserve_layout` and `runtime_init_data_from_layout` per-ring overloads; added `runtime_destroy(rt, arena)` overload) |
| `runtime/pto_runtime2.cpp` | take theirs (stub) | ✓ |
| `runtime/pto_orchestrator.cpp` | take theirs (stub) | ✓ |
| `runtime/pto_orchestrator.h` | take theirs + rename `Arg → L0TaskArgs` + `.create_info → .create_info()` + `.ptr → &.ref()` + add `l2_swimlane_level` field | ✓ compile-fixed |
| `runtime/pto_dep_compute.h` | take theirs + `inputs.tensors[i].ptr → &inputs.tensors[i].ref()` | ✓ compile-fixed |
| `runtime/scheduler/pto_scheduler.h` | take theirs (polling) | ✓ |
| `runtime/scheduler/scheduler_context.h` | take theirs + add `thread_idx` to `on_orchestration_done` signature | ✓ compile-fixed |
| `runtime/scheduler/scheduler_cold_path.cpp` | take theirs (stub) | ✓ |
| `runtime/scheduler/scheduler_dispatch.cpp` | take theirs (stub) | ✓ |
| `runtime/pto_shared_memory.h` | take theirs (polling) + add `init_per_ring` method (broadcast to scalar init) | ✓ compile-fixed |
| `runtime/runtime.h` | add `needs_copy_back` to `TensorPair` (upstream-API compat) | ✓ compile-fixed |
| `aicpu/aicpu_executor.cpp` | take ours (upstream — has prewarm, profiling consolidation, deadlock-fix-related changes) | ✓ compile-fixed via signature adapters above |
| `host/runtime_maker.cpp` | take ours (upstream — has per-ring env parsing #1128) | ✓ compile-fixed |
| `orchestration/pto_arg_with_deps.h` | take ours (upstream) | ✓ trivial |
| `orchestration/pto_orchestration_api.h` | take ours (upstream) | ✓ trivial |
| `docs/MULTI_RING.md` | take theirs (updated for polling) | ✓ trivial |

## Runtime hang — root cause hypothesis

After the compile-clean tree above runs `paged_attention` Case1, AICore
times out at 507018 with no orchestration log past the `simpler-dispatcher`
init. Suspect chain:

1. **`init_per_ring` is a stub**. My implementation broadcasts
`task_window_sizes[0]` to the old scalar `init_header` /
`setup_pointers`. If upstream's `aicpu_executor` writes
`prebuilt_layout.task_window_sizes[r]` for r > 0 with values that differ
from `task_window_sizes[0]`, the SHM layout's per-ring offsets diverge
from what the AICPU expects → wrong pointers → silent corruption or hang.
2. **`PTO2OrchestratorState::l2_swimlane_level`** is back as a field, but
upstream's `SchedulerContext::init` may now own that state. Adding
the field in two places creates a tearing concern only if both writers
actually fire — unlikely to be the hang root cause but worth checking.
3. **`runtime_destroy(rt, arena)`**: my overload calls the 1-arg form,
but upstream's `arena` parameter may be used for staged teardown
(e.g., scope finalize). The polling design's destroy doesn't need it
but the *order* of teardown might matter for upstream's aicpu_executor
loop. Not the boot-time hang, but a leak/reset issue downstream.
4. **AICPU dispatch handshake**: upstream's aicpu_executor may have
ordering expectations around when the polling design's wiring queue
is initialized vs when the AICore handshake fires. The polling
scheduler initializes wiring lazily in `init_data_from_layout`; if
upstream's executor handshakes AICore *before* the wiring queue is
ready, AICore spins for tasks that never arrive.

The fix path: first thread true per-ring sizes through
`PTO2SharedMemoryHandle` (the polling code currently uses a uniform
per-ring layout and needs to honor the array). Then add a runtime trace
point at the boundary between aicpu_executor's `init_per_ring` call and
the scheduler's first `drain_wiring_queue` to confirm when the AICore
handshake fires relative to when the wiring becomes ready.
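
To make hypothesis 1 concrete, here is a minimal Python sketch of how ring base offsets can diverge when one side broadcasts `task_window_sizes[0]` while the other honors the per-ring array. It is not the runtime's layout code: `ring_offsets`, the 64-byte slot size, and the back-to-back ring placement are assumptions for illustration, and the sizes are borrowed from the `MULTI_RING.md` example.

```python
from itertools import accumulate

SLOT_BYTES = 64  # hypothetical per-task slot size, not taken from the runtime


def ring_offsets(task_window_sizes, slot_bytes=SLOT_BYTES):
    """Return the byte offset at which each ring's slot array starts."""
    ring_bytes = [n * slot_bytes for n in task_window_sizes]
    # Ring r starts where rings 0..r-1 end, so prefix-sum the ring sizes.
    return [0] + list(accumulate(ring_bytes))[:-1]


per_ring = [8192, 16384, 131072, 524288]   # what upstream may write
broadcast = [per_ring[0]] * len(per_ring)  # what the stub init_per_ring assumes

expected = ring_offsets(per_ring)
actual = ring_offsets(broadcast)

# Rings whose base offset differs between the two views of the layout.
diverged = [r for r, (e, a) in enumerate(zip(expected, actual)) if e != a]
print(diverged)
```

Under these assumptions rings 0 and 1 still agree, because ring 1's base depends only on ring 0's size; the mismatch starts at ring 2. With uniform sizes the two views coincide, which would be consistent with the hang appearing only when upstream writes differing per-ring values.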

## What to do next session

1. `git rebase upstream/main`, apply the resolutions above (the order is
mechanical now that this doc records them).
2. Build (should compile clean as documented).
3. Run `paged_attention` Case1 to confirm the runtime hang reproduces.
4. Add device-side `LOG_INFO_V0` traces at:
   - `PTO2SharedMemoryHandle::init_per_ring` entry/exit (per ring)
   - `AicpuExecutor::run` immediately before / after the first scheduler
     `drain_wiring_queue` call
   - `SchedulerContext::on_orchestration_done` entry
5. Diagnose the gap revealed by the traces; align the polling SHM /
wiring init order with upstream's handshake.
6. Re-run the 26-test benchmark sweep (the one in `PR_NOTES.md`) and
confirm parity with the pre-rebase result.
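
The diagnosis in step 5 mostly comes down to an ordering check over the trace lines from step 4. A minimal sketch, assuming the traces have been reduced to an ordered list of event labels; the label strings and both helper functions are hypothetical, not names from the runtime or its logs:

```python
def first_index(events, name):
    """Index of the first occurrence of `name`, or None if it never fired."""
    return next((i for i, e in enumerate(events) if e == name), None)


def handshake_before_wiring(events):
    """True when the AICore handshake fired before the first wiring drain."""
    handshake = first_index(events, "aicore_handshake")
    drain = first_index(events, "drain_wiring_queue")
    if handshake is None:
        return False  # no handshake recorded, so nothing to compare
    # A handshake with no drain at all also counts as "too early".
    return drain is None or handshake < drain


# Hypothetical trace, in log order.
trace = [
    "init_per_ring:enter",
    "init_per_ring:exit",
    "aicore_handshake",
    "drain_wiring_queue",
]
print(handshake_before_wiring(trace))
```

If the check comes back true on a real trace, that would support hypothesis 4 (AICore handshakes before the wiring queue is ready); if false, the per-ring layout in hypothesis 1 becomes the more likely suspect.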

## Quick repro recipe

```bash
git checkout polling-pr-minimal # HEAD = 188be7e4
git rebase upstream/main # 15 conflicts

# Take theirs (polling) for files we rewrote:
git checkout --theirs \
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/{pto_runtime2_types.h,pto_runtime2.cpp,pto_runtime2.h,pto_orchestrator.cpp,pto_orchestrator.h,pto_dep_compute.h,scheduler/pto_scheduler.h,scheduler/scheduler_context.h,scheduler/scheduler_cold_path.cpp,scheduler/scheduler_dispatch.cpp} \
src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md

# Take ours (upstream) for files where upstream adds features:
git checkout --ours \
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp \
src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp \
src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/{pto_arg_with_deps.h,pto_orchestration_api.h}

git add -u src/

# Apply compile-fixes (see "Per-file conflict matrix" for details).
# Build is clean after these. Runtime hangs — see "Runtime hang — root
# cause hypothesis" above for the next investigation steps.
```
4 changes: 2 additions & 2 deletions src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h
Original file line number Diff line number Diff line change
@@ -63,7 +63,7 @@
* compiled, ran without error, and produced wrong output. Use
* `get_sub_block_id(args)` instead, which reads from the runtime's
* `GlobalContext.sub_block_id` that the scheduler initializes per
* AIV core in `scheduler_cold_path.cpp::SchedulerContext::init`.
* AIV core in `scheduler_context.h::SchedulerContext::init`.
*
* - `get_block_idx()` and `get_block_num()` are not redirected to
* simpler's LocalContext either — use the `(args)` variants below
@@ -97,7 +97,7 @@ static constexpr int32_t PTO2_EXT_PARAMS_COUNT = 2;

/**
* Args[] suffix indices for context pointers.
* Derived from MAX_TENSOR_ARGS(32) + MAX_SCALAR_ARGS(16).
* Derived from MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(32).
* Users should not depend on these values; use the Get* functions below.
*/
static constexpr int32_t SPMD_LOCAL_CONTEXT_INDEX = 48;
93 changes: 14 additions & 79 deletions src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
Original file line number Diff line number Diff line change
@@ -179,8 +179,9 @@ Each ring's `last_task_alive` advances independently:

```text
advance_ring_pointers(ring_id): // protected by per-ring advance_lock
la = ring->fc.last_task_alive
while ring->get_slot_state_by_task_id(la).task_state >= CONSUMED:
watermark = ring->completed_watermark
la = last_task_alive
while la <= watermark and watermark >= slot[la].last_consumer_local_id:
reset slot for reuse
la++
sync_to_sm() // release-store last_task_alive
Expand Down Expand Up @@ -235,91 +236,25 @@ AICore uses `last_reg_val` to detect new dispatches — identical values cause s
| `PTO2_HEAP_SIZE` | 256 MB | 1 GB |
| `PTO2_DEP_LIST_POOL_SIZE` | 16384 | 65536 |

### 7.2 Runtime Overrides

Each ring resource (`ring_task_window` / `ring_heap` / `ring_dep_pool`) is a
single `CallConfig.runtime_env` field that accepts **either** a scalar (broadcast
to every ring) **or** a list of four per-ring values. Precedence is resolved
independently for each resource and ring:

```text
per-ring CallConfig entry (a scalar is broadcast to every entry)
> per-ring PTO2_RING_* env value
> scalar PTO2_RING_* env value
> compile-time default
```

`ring_id` is the scope-depth ring selected by the runtime:

```text
scope depth 0 -> ring 0
scope depth 1 -> ring 1
scope depth 2 -> ring 2
scope depth >=3 -> ring 3
```
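
The precedence chain and the scope-depth mapping above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the runtime's resolver: `resolve`, `broadcast`, and `ring_for_scope_depth` are hypothetical names, the ring count of four is inferred from the four-entry lists, and treating `0` as "unset" in every tier (not just in `CallConfig` lists) is an assumption.

```python
PTO2_MAX_RING_DEPTH = 4  # assumed from the four-entry per-ring lists


def ring_for_scope_depth(depth):
    """Scope depths 0..2 get their own ring; depth >= 3 shares ring 3."""
    return min(depth, PTO2_MAX_RING_DEPTH - 1)


def broadcast(value):
    """Expand a scalar to one entry per ring; pass a four-entry list through."""
    if isinstance(value, int):
        return [value] * PTO2_MAX_RING_DEPTH
    if len(value) != PTO2_MAX_RING_DEPTH:
        raise ValueError("expected a scalar or exactly four per-ring values")
    return list(value)


def resolve(call_config, env_per_ring, env_scalar, default):
    """Pick, per ring, the first non-zero value in precedence order."""
    tiers = [broadcast(call_config), broadcast(env_per_ring),
             broadcast(env_scalar), broadcast(default)]
    # zip(*tiers) yields one column per ring, ordered highest tier first.
    return [next((v for v in column if v), 0) for column in zip(*tiers)]


# Ring 1 is pinned by CallConfig, ring 0 by the per-ring env, the rest
# fall through to the scalar env value.
print(resolve([0, 16384, 0, 0], [8192, 0, 0, 0], 1024, 4096))
```

Resolving each ring independently is what lets a single `0` entry fall through to a lower tier without disturbing the other rings.
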
### 7.2 Runtime Environment Overrides

Per-task via `CallConfig.runtime_env` — different L2 tasks in one launch can
each carry their own sizes. Invalid values raise at submit time (`validate()`).
Assign a scalar to size every ring the same:

```python
cfg = CallConfig()
cfg.runtime_env.ring_task_window = 128 # power of 2, >= 4
cfg.runtime_env.ring_heap = 262144 # bytes/ring, >= 1024
cfg.runtime_env.ring_dep_pool = 256 # 4 .. INT32_MAX
orchestrator.submit_next_level(handle, args, cfg)
```

Assign a four-entry list to tune the scope-depth rings independently. The list
must contain exactly four entries; use `0` for an entry that should fall through
to the next precedence tier. All `CallConfig` values are integer byte/count
values, and each field always reads back as a four-entry list.

```python
cfg = CallConfig()
cfg.runtime_env.ring_task_window = [8192, 16384, 131072, 524288]
cfg.runtime_env.ring_heap = [
128 * 1024 * 1024,
256 * 1024 * 1024,
384 * 1024 * 1024,
512 * 1024 * 1024,
]
cfg.runtime_env.ring_dep_pool = [4096, 8192, 16384, 32768]
orchestrator.submit_next_level(handle, args, cfg)
```

Scene tests set the same keys under a nested `runtime_env` block in the
per-case `config` dict — each value is a scalar or a four-entry list:

```python
"config": {
"runtime_env": {
"ring_task_window": [8192, 16384, 131072, 524288],
"ring_heap": [134217728, 268435456, 402653184, 536870912],
"ring_dep_pool": 256, # scalar broadcasts to every ring
}
}
```

Process-wide env fallback accepts either one scalar value or exactly four
comma-separated per-ring values. Invalid env values are logged and ignored, then
fall through to defaults. `PTO2_RING_HEAP` values are integer bytes:
Uniform (applies to all rings):

```bash
# Uniform, old behavior:
PTO2_RING_TASK_WINDOW=1024
PTO2_RING_HEAP=1048576
PTO2_RING_DEP_POOL=1024

# Per-ring, indexed by ring_id 0..3:
PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
PTO2_RING_HEAP=134217728,268435456,402653184,536870912
PTO2_RING_DEP_POOL=4096,8192,16384,32768
```

Use `--enable-scope-stats` to confirm the effective values for a real run. The
first line of `scope_stats/scope_stats.jsonl` includes `task_window_max`,
`heap_max`, and `dep_pool_max`, indexed by `ring`.
In `kernel_config.py`:

```python
RUNTIME_ENV = {
"PTO2_RING_TASK_WINDOW": "128",
"PTO2_RING_HEAP": "262144",
"PTO2_RING_DEP_POOL": "256",
}
```

### 7.3 Sizing Guidelines

Original file line number Diff line number Diff line change
@@ -538,7 +538,7 @@ This is protected by a per-ring try-lock (`advance_lock`) in `RingSchedState`, e

### 8.5 SchedulerContext

All scheduler-side state and methods live in `SchedulerContext` (`runtime/scheduler/scheduler_context.h`). It is held as a `sched_ctx_` member of `AicpuExecutor`; `AicpuExecutor` is a thin wrapper that owns the lifecycle atomics and the orchestration SO handle, and delegates everything else to `SchedulerContext`.
All scheduler-side state and methods live in `SchedulerContext` (`runtime/scheduler_context.h`). It is held as a `sched_ctx_` member of `AicpuExecutor`; `AicpuExecutor` is a thin wrapper that owns the lifecycle atomics and the orchestration SO handle, and delegates everything else to `SchedulerContext`.

Public surface (called from `AicpuExecutor::init/run/deinit`):

@@ -552,11 +552,7 @@ Public surface (called from `AicpuExecutor::init/run/deinit`):
| `deinit()` | once per run | Reset every scheduler-owned field to its post-construction default |
| Read-only accessors | various | `aic_count()` / `aiv_count()` / `is_completed()` / `completed_tasks_count()` |

Private internals are split across three .cpp files by responsibility:

- `scheduler_completion.cpp` — completion polling, drain protocol
- `scheduler_dispatch.cpp` — task dispatch loop and helpers
- `scheduler_cold_path.cpp` — exit checks, stall diagnostics, profiling, lifecycle (`init/deinit`), core management (`handshake_all_cores` / `assign_cores_to_threads` / `emergency_shutdown`), and `on_orchestration_done`
Private internals all live inline in `scheduler_context.h`, covering completion polling, drain protocol, task dispatch loop and helpers, exit checks, stall diagnostics, profiling, lifecycle (`init/deinit`), core management (`handshake_all_cores` / `assign_cores_to_threads` / `reassign_cores_for_all_threads` / `emergency_shutdown`), and `on_orchestration_done`.

`AicpuExecutor` calls neither `handshake_*`, `assign_*`, `reassign_*`, nor `emergency_shutdown` directly — they are private, invoked only by `init` and `on_orchestration_done`.

Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ addr null-check → TensorMap lookup → spin-wait producer COMPLETED → comput

- **addr null-check**: `buffer.addr == 0` means unallocated — log error, return 0
- **TensorMap lookup**: find producer task by `buffer.addr`
- **spin-wait**: wait until producer `task_state >= PTO2_TASK_COMPLETED`
- **spin-wait**: wait until producer's `completion_flags[local_id & mask] == 1`
- **No producer** (lookup callback never fires): skip waiting, read immediately

### 3.2 set_tensor_data Flow
Original file line number Diff line number Diff line change
@@ -52,7 +52,7 @@ Thread 3: PTO2 total submitted tasks = 16704

### Field Reference

| Field | Source (`pto_orchestrator.cpp`) | Description |
| Field | Source (`pto_orchestrator.h`) | Description |
| ----- | ------------------------------- | ----------- |
| **cost** | Wall-clock around `orch_func()` call | Total time including orchestration logic + scope overhead |
| **total** | Sum of all sub-steps below | Accumulated time inside `submit_task` across all tasks |
Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@ Each sub-level macro requires `PTO2_PROFILING=1`:

- Debug/diagnostic logs (always present)
- Progress tracking (`PTO2 progress: completed=...`)
- Stall detection and dump (triggered after the `SCHEDULER_TIMEOUT_MS` wall-clock no-progress budget)
- Stall detection and dump (triggered only after `MAX_IDLE_ITERATIONS` idle loops)
- Deadlock/livelock detection (`diagnose_stuck_state`, called on stall)

**What's NOT compiled:**
@@ -255,7 +255,7 @@ Identity fields the AICPU side used to write at level 1 (`func_id`,
collector (`L2SwimlaneCollector::set_core_types`).

AICore buffer rotation no longer piggy-backs on `complete_task`. AICPU
counts dispatches per core in the dispatch path (scheduler_dispatch in
counts dispatches per core in the dispatch path (scheduler_context in
tensormap_and_ringbuffer; aicpu_executor in host_build_graph) and rotates
the AICore buffer when the count is about to cross a
`PLATFORM_AICORE_BUFFER_SIZE` boundary — strictly before
@@ -428,7 +428,7 @@ definitions to runtime headers.
### Code Locations

- Macro defaults and validation: `src/common/task_interface/profiling_config.h`
- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp` and `scheduler_cold_path.cpp`
- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler_context.h`
- Orchestrator profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`
- TensorMap profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h`

Original file line number Diff line number Diff line change
@@ -556,7 +556,7 @@ dep_gen_replay_emit_deps_json(const DepGenRecord *records, size_t num_records, c
// `explicit_dep_count` / `over->dep_count` originate from device
// shared memory and are bounded by the writer to the array sizes, but
// we clamp on read too so a corrupted record never drives an OOB read
// off the end of rec.explicit_deps[64] / over->deps[582].
// off the end of rec.explicit_deps[64] / over->deps[326].
const uint64_t *deps_data;
int32_t dc;
if (rec.flags & DEP_GEN_FLAG_HAS_OVERFLOW) {
Original file line number Diff line number Diff line change
@@ -10,6 +10,13 @@
*/
#include "common.h"

// LOG_ERROR can't be pulled from common/unified_log.h here because that header
// would re-#define LOG_INFO_V0..V9 already provided by pto_orchestration_api.h
// (orchestration routes them through the runtime ops table). For the limited
// use inside this file, write directly to stderr.
#include <cstdio>
#define LOG_ERROR(fmt, ...) std::fprintf(stderr, "[ERROR] " fmt "\n", ##__VA_ARGS__)

#ifdef __linux__
#include <cxxabi.h>
#include <dlfcn.h>