diff --git a/.claude/skills/l0-swimlane/SKILL.md b/.claude/skills/l0-swimlane/SKILL.md
new file mode 100644
index 000000000..130c603f0
--- /dev/null
+++ b/.claude/skills/l0-swimlane/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: l0-swimlane
+description: Produce an L0 (intra-core AICore pipeline) swimlane for one task — a single kernel or a mix — via the dump-driven `simpler_setup.tools.l0_swimlane` tool. Use when the user asks to "run/produce an l0 swimlane", "trace a task's intra-core pipeline", profile why one AICore task is slow inside the core(s), or needs help choosing the tool's manual flags (`--func-id`, `--set-arg`, `--spmd-block-num`, `--case`). The tool captures real per-task args from an args dump and auto-generates the `msprof op simulator` replay — no hand-authored workspace. For a hand-authored single-`kernel_entry` replay use [insight-trace](../insight-trace/SKILL.md); for cross-task / scheduler / dependency timing use the L2 swimlane.
+---
+
+# L0 Swimlane — Intra-core Pipeline Trace for a Task
+
+`python -m simpler_setup.tools.l0_swimlane` dumps a task's real `args[]`,
+reconstructs them, generates a combined `msprof op simulator` replay of the
+**whole task** (a mix runs AIC + AIV0 + AIV1 in one op), and exports an
+Insight `trace.json` whose lanes are the cluster's pipes. Full reference:
+[docs/dfx/l0-swimlane-profiling.md](../../../docs/dfx/l0-swimlane-profiling.md).
+This skill is the **operating procedure** — above all the one genuinely
+manual decision: the slot/value for `--set-arg`.
+
+## When to use
+
+- **Use** when one task (single kernel or mix) is slow and you need the
+  per-pipe (`MTE2` / `MTE1` / `CUBE` / `FIXP` / `SCALAR` / `VECTOR`)
+  intra-core picture, or to confirm AIC↔AIV overlap inside a mix.
+- **Not** for cross-task dependencies / scheduler / dispatch / finish
+  timing — that is the **L2 swimlane**. L0 traces ONE task in isolation
+  with no AICPU, so inter-task ordering is out of scope (doc §9, tier C).
+- **vs `insight-trace`**: that skill hand-authors a wrapper around one
+  `kernel_entry`; `l0_swimlane` automates the whole thing from a real dump
+  (real args, mix-together, SPMD context synthesised). Reach for
+  `insight-trace` only when there is no test/dump to drive the capture.
+
+## Run
+
+```bash
+source .venv/bin/activate
+source "$ASCEND_HOME_PATH/set_env.sh"          # CANN env (msprof on PATH)
+# Sim dump (no NPU); task-submit locks a device for the step-5 collect.
+task-submit --device auto --max-time 1800 --run \
+  "python -m simpler_setup.tools.l0_swimlane --platform a2a3sim \
+     --func-id <set> --test <test_file.py>"
+```
+
+For onboard `a2a3` instead of `a2a3sim`, first run
+`.claude/skills/onboard-arch-precheck/check.sh a2a3` (the dump then
+runs on the locked device). The five internal steps and all flags are in
+doc §3.2 / §3.3.
+
+## Choosing the manual flags (the hard part)
+
+### `--func-id` — the task's member set
+
+You wrote the orchestration, so the members are known. `--func-id 0` traces
+the single-kernel task `{0}`; `--func-id 0,1,2` traces that 3-way mix.
+It must equal a dispatched task's func_id **set**. For an SPMD mix that runs
+the same AIV kernel on both lanes, the dump records a duplicate (`[0,1,1]`),
+so pass `--func-id 0,1` (`set([0,1,1]) == {0,1}`). Wrong set → the tool lists
+the func_id shapes present in the dump; pick one of those.
+
+### `--set-arg SLOT=VALUE` — only when a loop count must shrink
+
+First classify where the kernel's loop trip count comes from:
+
+| Trip count from | `--set-arg`? | Rule |
+| --------------- | ------------ | ---- |
+| **Tensor shape** (e.g. `shapes[0] / TILE_ELEMS`) | **No** | shape is the real dump value; changing it distorts. (mixed_example / single-kernel rows need no `--set-arg`.) |
+| **A scalar arg** (e.g. `n_blocks`) | **Yes** — set the count directly | camodel would run the full loop; shrink to ≥ 3–4 (doc §7.2: floor 3, prefer 4). |
+| **A control-tensor's content** (e.g. `context_lens`) | **Yes** — fill the buffer | the kernel *derives* the count from the data; fill so the derived count ≈ 4 (need `block_size` to back out the value). Integer dtypes only. |
+| **The SPMD `block_num`** | **No** — use `--spmd-block-num` | block_num lives in the synthesised slot-48 context, which `--set-arg` cannot reach. |
+
+Then find the slot — it is **per-kernel, never fixed**. Discover it:
+
+1. Run once with `--no-collect`; step 3 prints the **arg-slot table**
+   (every slot: index / kind / shape / scalar value).
+2. Identify which slot is the loop bound by cross-referencing **any** of:
+   the kernel's `args[N]` reads, the kernel-top **args-layout comment**
+   (paged-attention kernels have one, e.g.
+   `args[15] = total_logical_blocks scalar`), or the orchestration's
+   `add_input` / `add_scalar` **order** (the i-th `add_*` is slot `i`).
+3. Set the value per the table above.
+4. Re-run, then **self-check** (below).
+
+Verified examples (slots read from source):
+
+| Test | Loop bound | Flag |
+| ---- | ---------- | ---- |
+| `paged_attention_unroll` | `aic_qk_matmul.cpp` `args[4] = n_blocks` (scalar) | `--set-arg 4=4` |
+| `qwen3_14b_decode` (fa_fused) | `fa_fused_aic/aiv.cpp` outer loop `for(i=block_idx; i<v1[0]; i+=24)`; `v1[0]` = slot 0 `fa_total` (a 1-elem **INT32 tensor** read as the work-item count) | `--set-arg 0=96` → `ceil(96/24)=4` blocks. Slot 0 is 0 in replay → empty trace without this |
+| `batch_paged_attention` | `context_lens` **tensor** (slot 1; 2nd `params_sf` `add_input`); the SF kernel (func 1) derives per-batch blocks from its content | `--set-arg 1=512` |
+
+### `--spmd-block-num N` — SPMD grid width
+
+`block_num` is written into the synthesised slot-48 `LocalContext`. Default
+is the case's `block_dim`; override only for a kernel that branches or
+grid-strides on `block_num` (e.g. set the real hw width `24`). `block_idx`
+is always synthesised to `0` (a representative block) and is **not** a flag:
+it introduces no branches in the instruction stream (doc §8).
+
+### `--case NAME` — pin a small case on a multi-case test
+
+When the test declares several `CASES[*]`, omitting `--case` auto-pins the
+**first** case that lists your `--platform` (a deterministic single-case
+dump). That first case is **not** guaranteed to be the smallest, and the
+replay rebuilds every tensor at its **real dumped shape**: a production-size
+case (long sequence, big batch, large KV cache) makes the camodel — a
+cycle-accurate, whole-chip, serial simulator — **crawl or look hung** on the
+oversized buffers. So **pin the smallest-shape case yourself** with
+`--case <name>` (accepts `ClassName::Case`) whenever the auto-pinned case is
+not the smallest. `--set-arg` shrinks a *loop count*; `--case` shrinks the
+*tensor shapes*, so reach for `--case` first when a replay stalls.
+Single-case tests need no `--case`.
+
+Pick a case that is **scaled down, not reshaped** — same tile geometry
+(M/K/N, head_dim, tile size), just fewer blocks / shorter sequence — so the
+per-block pipeline stays identical to production (you lose only iteration
+*count*, which does not change the pipeline shape). A case with different
+tile shapes traces only itself, not production.
+
+## Self-check after every run
+
+A known msprof/camodel export bug can truncate the last loop iteration(s).
+Verify that the `MMAD` count, the `FIX_L0C_TO_DST` count, and `n_blocks`
+agree in the trace. If they disagree, the tail was cut: draw no timing
+conclusions, and re-run or change the loop count (doc §7.4). Read the
+auto-generated `*_trace_perfetto.json`, not the raw Insight `trace.json`,
+for sub-laned per-instruction overlap (doc §3.4).
+
+## Coverage
+
+Representative command per task shape (single AIC / single AIV / 1+1 mix /
+2-AIV mix / 3-way mix / SPMD single-source / SPMD coop mix /
+same-AIV-both-lanes / paged-attn scalar & control-tensor loops / qwen3) is in
+doc §3.7.
diff --git a/docs/dfx/l0-swimlane-profiling.md b/docs/dfx/l0-swimlane-profiling.md
new file mode 100644
index 000000000..2e48ad310
--- /dev/null
+++ b/docs/dfx/l0-swimlane-profiling.md
@@ -0,0 +1,623 @@
+# L0 Swimlane Profiling — Intra-core Pipeline Trace for a Task
+
+## 1. Background & Motivation
+
+[L2 swimlane](l2-swimlane-profiling.md) answers *where each task ran on
+the wall clock and how the scheduler spent its loop*. It stops at the
+AICore task boundary — one task is one opaque `[start, end]` block. When
+a single task is slow, the next question is **why inside the core(s)**:
+which pipe (`MTE2` GM→L1, `MTE1` L1→L0, `CUBE` matmul, `FIXP` write-back,
+`SCALAR`, `VECTOR`) is the bottleneck, and how the per-instruction issue
+overlaps across the cluster's sub-cores.
+
+L0 swimlane captures exactly that — the **intra-core pipeline** of a
+task. It runs the task in isolation under `msprof op simulator` (the
+AICore camodel) and exports a MindStudio Insight `trace.json` whose lanes
+are the cluster's pipes, not the chip's cores. It deliberately
+**bypasses AICPU orchestration**: scheduler / tensormap / ringbuffer
+state is out of scope (that is L2's job, and needs real silicon). L0 is
+the per-pipe, per-instruction zoom that sits one level below an L2 task
+block.
+
+A task may be a single kernel or a **mix** — multiple sub-task kernels
+sharing one `args[]` on the 1C2V cluster (1 AIC + up to 2 AIV). L0
+replays the **whole task together**: a mix runs its AIC + AIV0 + AIV1
+kernels in one combined op, so the trace shows all the cluster's
+sub-cores side by side, not one kernel in isolation.
+
+The hard part of an isolated replay is rebuilding the task's exact
+`args[]` — Tensor descriptors (shape / dtype / strides / start_offset)
+plus scalar values — which orchestration normally computes on the fly.
+Hand-authoring them is error-prone. L0 swimlane removes the guesswork:
+it captures the **real** per-task args from an [args
+dump](args-dump.md), uses the dump's `func_id` array to identify the
+task's mix members, and generates the whole replay workspace from those
+captured args — zero hand-written shapes or scalars.
+
+## 2. Overview
+
+- **Per-pipe instruction timeline** — one Insight lane per sub-core pipe
+  (`MTE2` / `MTE1` / `CUBE` / `FIXP` / `SCALAR` / `VECTOR`), each carrying
+  the kernel's individual instructions with simulated `ts` / `dur`.
+- **Mix-together replay** — an entire mix task (any mix: same- or
+  different-source members, 2-way or 3-way) replays as **one** combined
+  `msprof op simulator` op. The cube sub-core runs the AIC member, the
+  vec sub-cores run the AIV member(s) → a combined AIC+AIV swimlane.
+- **Zero-guess args** — the task's real Tensor descriptors and scalars
+  come from a JSON-only `--dump-args 3` capture (metadata + scalar
+  values, no `.bin` payload — all reconstruction needs). The dump's
+  `func_id` array gives the task's mix membership directly.
+- **Loop-count control (`--set-arg SLOT=VALUE`)** — when a kernel's loop
+  trip count comes from a scalar or a control tensor, override it to
+  shrink a runaway loop (so the camodel doesn't hang) or to fix a
+  "fake-fast" zero-filled control tensor — without distorting the
+  per-iteration pipeline structure. Repeatable; default uses the real
+  dump values. See
+  [§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion).
+- **Source-line attribution (`--debug-line` / `-g`)** — compile the
+  kernel with `-g` (skipping the link strip) so the trace carries
+  `debug_line` and Insight maps each instruction back to its kernel
+  source line. Off by default.
+- **Sim or onboard capture** — with a sim `--platform` (`a2a3sim` /
+  `a5sim`) the dump runs with no NPU; with an onboard `--platform`
+  (`a2a3` / `a5`) it runs on a real device. The dump only needs arg
+  **geometry**, which sim captures identically to onboard, and the
+  replay is camodel either way, so `a2a3sim` is the default: its dump needs
+  no NPU and no arch-precheck. Use onboard only for a kernel whose sync
+  idiom (e.g. a manual `prod.record()`) compiles only for the device.
+- **Two trace outputs** — a native Insight `trace.json` and an
+  auto-generated Perfetto-friendly variant (sub-laned + atomic flags;
+  see [§3.4](#34-viewing--insight-vs-perfetto)).
+
+Drive it in one line (`--func-id` is the task's member set):
+
+```bash
+python -m simpler_setup.tools.l0_swimlane \
+    --test tests/st/<case>/test_<name>.py --func-id 0,1,2 --platform a2a3sim
+```
+
+## 3. How to Use
+
+### 3.1 Prerequisites (one-time per test case)
+
+L0 swimlane reuses the args-dump pipeline to recover args, so the target
+case must satisfy what the dump needs (see
+[args-dump.md](args-dump.md)):
+
+1. **Args dump is compiled in.** It is built into the platform code; run
+   `pip install --no-build-isolation -e .` so the build includes it.
+2. **Incores declare complete signatures.** Under the #1181 positional
+   model, each incore declares its full tensor `signature` (covering the
+   task payload in slot order); the dump maps signature entry `i` to
+   payload slot `i` and stamps every record with the task's active
+   sub-task `func_id` **array** (its mix membership). This is the repo
+   norm — no l0-specific marker.
+3. **The case declares the `--platform` you pass.** `CASES[*].platforms`
+   must include it. Pick a case with shapes small enough for the camodel
+   replay buffers.
+4. **`name` is optional.** When `incores[*].name` is absent the tool
+   falls back to the kernel source filename for labels / paths.
+
+### 3.2 Run
+
+```bash
+# Environment (once per shell): activate the venv and source CANN.
+source .venv/bin/activate
+export ASCEND_HOME_PATH=<your CANN install>     # e.g. .../cann-9.0.0
+source "$ASCEND_HOME_PATH/set_env.sh"
+
+# Sim capture (no NPU dump) — the default.
+python -m simpler_setup.tools.l0_swimlane \
+    --test tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py \
+    --func-id 0,1,2 --platform a2a3sim
+
+# Onboard capture — wrap the WHOLE tool in one task-submit so the dump and
+# the collect share the locked $TASK_DEVICE (no nested lock). Only needed for
+# a kernel whose sync idiom compiles only for the device.
+task-submit --device auto --device-num 1 --run \
+    "python -m simpler_setup.tools.l0_swimlane \
+        --test tests/st/<case>/test_<name>.py --func-id 0 --platform a2a3"
+```
+
+The tool runs five steps internally (the "Uses NPU" column is for an
+onboard `--platform`; a sim `--platform` uses no NPU until step 5):
+
+| Step | Action | Uses NPU |
+| ---- | ------ | -------- |
+| 1 | Read the test's `CALLABLE`; build a `func_id → (source, core_type)` table | No |
+| 2 | Run `--dump-args 3` (JSON-only) → `args_dump.json` (or reuse via `--dump-json`) | Onboard only |
+| 3 | Select the task whose member set == `--func-id`, reconstruct its full positional args, **print the arg-slot table** (slot / kind / shape / value) | No |
+| 4 | Emit the replay workspace and smoke-build it locally | No |
+| 5 | `msprof op simulator` collect + export → `trace.json`, then auto-converts a Perfetto variant | **Yes** |
+
+Step 3 prints every arg slot so you can pick a `--set-arg` target without
+tracing the kernel body. Slot names are not in the dump (only kind / shape
+/ value), so cross-reference the kernel's `args:` header for those:
+
+```text
+[l0_swimlane] func_id=0 task=0x... mix=[0, 1, 2] mode=mix block_dim=3
+              members=[MATMUL(aic,func 0), ADD(aiv,func 1), MUL(aiv,func 2)]
+[l0_swimlane] arg slots (override with --set-arg SLOT=VALUE):
+    slot 0  tensor  FLOAT32  [16384]
+    ...
+```
+
+A scalar slot holds the value directly (`--set-arg 4=4`); a tensor slot
+holds a pointer, so `--set-arg` fills its buffer (`--set-arg 1=512`). See
+[§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion).
+
+### 3.3 Key flags
+
+| Flag | Meaning |
+| ---- | ------- |
+| `--test <file.py>` | SceneTest test file (required) |
+| `--func-id A[,B,C]` | The task's **member set** (comma-separated func_ids), required. `--func-id 0` traces the single-kernel task `{0}`; `--func-id 0,1,2` traces that 3-way mix. The set must exactly match a dispatched task's `func_id` array (you wrote the orchestration, so you know the members) |
+| `--task-id <hex>` | Which task instance to replay (default: lowest). Instances of the same mix shape are structurally identical |
+| `--platform <p>` | Dump platform → arch / compile / SoC params (default `a2a3sim`). Sim (`a2a3sim` / `a5sim`) dumps with no NPU; onboard (`a2a3` / `a5`) dumps on `$TASK_DEVICE` (wrap the tool in `task-submit`). The replay is camodel regardless; geometry is identical, so prefer sim |
+| `--device <ID>` | NPU device for an onboard dump + collect. **Auto-supplied** — `task-submit` appends `--device <id>` (also `$TASK_DEVICE`). Sim platforms ignore it |
+| `--case <NAME>` | Pin the dump to one `CASES[*].name`. Omitting it auto-pins the first case that lists `--platform`; pass it to target a smaller case when that first one overflows the camodel. Accepts `ClassName::Case` |
+| `--dump-json <path>` | Reuse an existing `args_dump.json`, skipping the dump re-run |
+| `--set-arg SLOT=VALUE` | Override an arg by `args[]` slot. Scalar slot → rewrite value; tensor slot → fill its buffer (integer dtypes). Shrinks a loop count without distortion. Repeatable. Default: real dump values |
+| `--spmd-block-num N` | `block_num` written into the synthesized SPMD context (slot 48). Default: the **selected** case's `block_dim` |
+| `--debug-line` / `-g` | Compile with `-g` (skip strip) so the trace carries `debug_line` → Insight maps instructions to source lines |
+| `--no-collect` | Generate + smoke-build only; do not take an NPU |
+| `--max-time <sec>` | `task-submit` budget (default 1800) |
+
+Per-arch build parameters are fixed in the tool's `ARCH_CONFIG`:
+
+| arch | SoC (camodel) | aicore-arch (compile) | prologue macros |
+| ---- | ------------- | --------------------- | --------------- |
+| a2a3 | `dav_2201` | `dav-c220` | `__CCE_AICORE__ 220` / `PTO_NPU_ARCH_A2A3` |
+| a5 | `dav_3510` | `dav-c310` | `__CCE_AICORE__ 310` / `PTO_NPU_ARCH_A5` |
+
+### 3.4 Viewing — Insight vs Perfetto
+
+The workspace lands at
+`outputs/l0_swimlane_<label>_<ts>/`, where
+`<label>` = `<TestClass>_<Case>_<platform>_<kernel>_mix<members>` (the
+`mix<members>` segment is the task's func_id set, e.g. `mix0_1_2` for a
+3-way mix or `mix0` for a single-kernel task). Two final
+traces are written, both using that same `<label>` (pretty-printed):
+
+| File | Open in |
+| ---- | ------- |
+| `<label>_trace.json` | **MindStudio Insight** (a copy of the export) |
+| `<label>_trace_perfetto.json` | **Perfetto** (auto-converted, see below) |
+
+The raw export is under `<ws>/insight_export/OPPROF_*/simulator/`.
+
+- **Insight** — drag the `simulator/` directory in (native, correct), or
+  open `<label>_trace.json`.
+- **Perfetto** — opening the raw Insight `trace.json` directly **drops
+  task records and mis-pairs flags** (Insight packs concurrent,
+  pipelined instructions onto one track; overlapping `ph:X` events break
+  stack nesting and `B`/`E` pairing). The tool therefore emits
+  `<label>_trace_perfetto.json` with two **lossless** transforms:
+  concurrent instructions on a pipe are split into sub-lanes
+  (`MTE1#0..#k`, none overlapping within a lane), and each `B`/`E` flag
+  pair is merged into one `ph:X` slice. Open that file in Perfetto. The
+  same transform is documented in
+  [`.claude/skills/insight-trace/SKILL.md`](../../.claude/skills/insight-trace/SKILL.md);
+  here it is built into the tool.
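+
+The sub-lane split can be pictured as greedy interval partitioning. The sketch
+below is illustrative only and is not the tool's converter: `assign_sublanes`
+and the `(ts, dur)` tuple layout are hypothetical names chosen for this example.
+
+```python
+def assign_sublanes(events):
+    """Greedy interval partitioning for one pipe.
+
+    events: list of (ts, dur) tuples (hypothetical layout).
+    Returns one sub-lane index per event, in input order, such that no two
+    events on the same sub-lane overlap in time.
+    """
+    order = sorted(range(len(events)), key=lambda i: events[i][0])
+    lane_end = []                 # end time of the last event on each sub-lane
+    lanes = [0] * len(events)
+    for i in order:
+        ts, dur = events[i]
+        for lane, end in enumerate(lane_end):
+            if end <= ts:         # first sub-lane that is free at ts
+                lane_end[lane] = ts + dur
+                lanes[i] = lane
+                break
+        else:                     # every sub-lane is busy: open a new one
+            lane_end.append(ts + dur)
+            lanes[i] = len(lane_end) - 1
+    return lanes
+
+
+# Three pipelined instructions on one pipe; the first two overlap.
+print(assign_sublanes([(0, 10), (5, 10), (12, 4)]))  # → [0, 1, 0]
+```
+
+Each instruction lands on the lowest-numbered sub-lane that is free at its
+start time, so the number of sub-lanes equals the peak concurrency on the pipe
+and no event is dropped or retimed.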
+
+### 3.5 Selecting a task / mix, and what to `--set-arg`
+
+`--func-id` **is** the task's member set — you name the exact func_ids the
+task is made of, and the tool picks the task whose `func_id` array matches.
+There is no shape-guessing: you wrote the orchestration, so you know which
+func_ids bind into a task. Name the task's **full** member set — for a mix,
+that means all of its members, so the trace shows the whole cluster's
+sub-cores cooperating as they do in production.
+
+- **Single-kernel task** — `--func-id 0` selects the task whose set is
+  exactly `{0}` — a kernel the orchestration dispatches on its own (e.g.
+  `vector_example`'s `kernel_add`, or a standalone AIC matmul).
+- **A mix** — name every member: `--func-id 0,1,2` selects the 3-way mix
+  `{0,1,2}`, `--func-id 3,4` the 2-AIV mix `{3,4}`.
+- If the set matches no dispatched task, the tool errors and lists the
+  `func_id` shapes present in the dump.
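+
+The matching rule above is plain set equality. A minimal sketch, assuming each
+dump record exposes its `func_id` array under a `func_id` key (the record
+layout and both helper names are hypothetical, not the tool's code):
+
+```python
+def select_tasks(records, requested):
+    """Return the records whose func_id array, as a set, equals `requested`."""
+    want = set(requested)
+    return [r for r in records if set(r["func_id"]) == want]
+
+
+def shapes_present(records):
+    """Distinct func_id arrays in the dump, e.g. for a 'no match' message."""
+    return sorted({tuple(r["func_id"]) for r in records})
+
+
+# Hypothetical records; only the fields used here are shown.
+records = [
+    {"task_id": 1, "func_id": [0, 1, 2]},   # 3-way mix
+    {"task_id": 2, "func_id": [3, 4]},      # 2-AIV mix
+    {"task_id": 3, "func_id": [0, 1, 1]},   # same AIV kernel on both lanes
+]
+
+print([r["task_id"] for r in select_tasks(records, [0, 1])])  # → [3]
+print(select_tasks(records, [0]))                             # → []
+print(shapes_present(records))  # → [(0, 1, 1), (0, 1, 2), (3, 4)]
+```
+
+Because the comparison is on sets, the duplicate-id array `[0, 1, 1]` is
+selected by `--func-id 0,1`, while `--func-id 0` matches nothing in this dump
+and would trigger the list of available shapes.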
+
+**What to shrink** for a fast camodel run is a scalar loop count
+(`n_blocks`), a control *tensor* (`context_lens`), or nothing. The slot is
+shown in the step-3 table and `4` is the prefetch floor (see
+[§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion)). For
+the `mixed_example` matmul/add/mul kernels the loop count derives from
+the tensor **shape** (`shapes[0]`), which the dump captures truthfully,
+so **no `--set-arg` is needed** — the real count (one 128×128 tile) is
+already small.
+
+Omitting `--case` auto-pins the **first** `CASES[*]` that lists your
+`--platform`, so the dump always targets exactly one case (deterministic —
+no "run every case, reconstruct from the newest dump dir" ambiguity). Pass
+`--case` explicitly when that first case is not the smallest — a full-size
+production case's shapes overflow the camodel replay (§3.1). The synthesized
+slot-48 `block_num` is taken from the **selected** case's `block_dim`.
+
+### 3.6 Reusing a dump across kernels
+
+The dump in step 2 is the slow part. When tracing several tasks from the
+same case, capture once and reuse:
+
+```bash
+# First: runs the dump, traces one task (the 3-way mix).
+python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0,1,2 --platform a2a3sim
+
+# Subsequent: another task from the same case, reusing the manifest.
+python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 3,4 --platform a2a3sim \
+    --dump-json outputs/<ClassName>_<Case>_<ts>/args_dump/args_dump.json
+```
+
+The manifest holds every task's args for the whole case.
+
+### 3.7 Coverage across the #1181 test suite
+
+Commit `b1e4bd23` (#1181) touched ~70 test files. The
+`tensormap_and_ringbuffer` kernels among them fall into these l0
+categories — one representative each, with its verified `--func-id`. The
+runnable commands follow the table, wrapped in `task-submit` (the step-5
+`msprof` collect takes a device on the shared box). Most use
+`--platform a2a3sim` (the dump runs off-NPU); `alternating_matmul_add`,
+`paged_attention_unroll`, and `qwen3_14b_decode`, whose `CASES` declare
+**no `a2a3sim`**, are grouped separately and use `--platform a2a3` after an
+arch-precheck (the case must declare the `--platform` you pass — §3.1).
+
+| Category | Representative `<TEST>` | `--func-id` + flags | What it exercises |
+| -------- | ----------------------- | ------------------- | ----------------- |
+| Single AIC | `alternating_matmul_add` | `--func-id 0` | standalone `rt_submit_aic_task(MATMUL)` — a genuine single-AIC task, not a mix member (a2a3-only) |
+| Single AIV | `vector_example` | `--func-id 0` | `kernel_add`, dispatched `rt_submit_aiv_task(0)` (vec only) |
+| Mix 2 AIV (per-lane) | `mixed_example` | `--func-id 3,4` | ADD_STD@AIV0 + MUL_STD@AIV1 (`get_subblockid` routing) |
+| Mix 3-way 1C2V | `mixed_example` | `--func-id 0,1,2` | MATMUL@AIC + ADD@AIV0 + MUL@AIV1 |
+| SPMD single-source | `spmd_multiblock_aiv` | `--func-id 0` | single AIV reading `get_block_idx` (`block_dim=24`; replay traces block 0) |
+| SPMD mix, 2 AIV share a source | `spmd_multiblock_mix` | `--func-id 0,1,2` | func 1 & 2 are distinct ids but **both `kernel_spmd_mix.cpp`** → the 2 AIV collapse to one (both lanes run it). Routes by `get_sub_block_id` (slot 49) → in replay both lanes read `sub_block_id=0`; AIV0/AIV1 differ only by write offset, so the pipeline stays representative. (The same-source collapse also covers the duplicate-func_id `[0,1,1]` shape an SPMD mix produces when `aiv0 = aiv1`.) |
+| Paged-attn, loop = scalar | `paged_attention_unroll` | `--func-id 0 --set-arg 4=4` | QK stage; `n_blocks` scalar (slot 4) → shrink to 4 ([§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion)) |
+| Paged-attn, loop = control tensor | `batch_paged_attention` | `--func-id 1 --set-arg 1=512 --case CaseSmall1` | SF reads `context_lens` (**slot 1**) content (`aiv_softmax_prepare.cpp`); `--set-arg 1=512` fills it uniformly → shrinks the derived per-batch block count |
+| Real SPMD workload | `qwen3_14b_decode` | `--func-id 16,17 --set-arg 0=96` | the `fa_fused` mix `{16,17,17}` (AIC + 2 same-source AIV → collapses). a2a3-only. `--set-arg 0=96` sets slot 0 `fa_total` (the outer work-item count) → `ceil(96/24)=4` fa blocks → real QK/PV `MMAD` + online-softmax (`VMAX`/`VEXP`/`VSUB`/`VCADD`) lanes. Slot 0 defaults to 0 in replay → empty trace, so this `--set-arg` is required. camodel simulates ~19k instrs cycle-by-cycle — expect minutes, not a hang (§7.1) |
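+
+The `ceil(96/24)=4` figure in the last row is the trip count of a grid-stride
+loop seen from block 0. A small sketch of that arithmetic, assuming a loop of
+the form `for (i = block_idx; i < total; i += stride)` with stride 24 as in the
+row above (`grid_stride_trips` is a hypothetical helper):
+
+```python
+def grid_stride_trips(total, block_idx=0, stride=24):
+    """Iterations of `for (i = block_idx; i < total; i += stride)`."""
+    if total <= block_idx:
+        return 0
+    # Integer ceiling of (total - block_idx) / stride.
+    return (total - block_idx + stride - 1) // stride
+
+
+print(grid_stride_trips(96))  # → 4  (i = 0, 24, 48, 72)
+print(grid_stride_trips(0))   # → 0  (zero-filled slot: the loop never runs)
+```
+
+The second call shows why the override is required for this row: with the
+replay's zero-filled slot 0 the loop body never executes, which is the empty
+trace described above.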
+
+Runnable commands (one per category):
+
+```bash
+T=tests/st/a2a3/tensormap_and_ringbuffer        # most representatives
+E=examples/a2a3/tensormap_and_ringbuffer        # vector_example / qwen3
+
+# --- a2a3sim cases (case declares a2a3sim; dump takes no NPU) ---
+L0="python -m simpler_setup.tools.l0_swimlane --platform a2a3sim -g"  # -g: source-line attribution
+# Single AIV — vector_example kernel_add
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 0     --test $E/vector_example/test_vector_example.py"
+# Mix 2 AIV (per-lane) — ADD_STD + MUL_STD
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 3,4   --test $T/mixed_example/test_mixed_example.py"
+# Mix 3-way 1C2V — MATMUL + ADD + MUL
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 0,1,2 --test $T/mixed_example/test_mixed_example.py"
+# SPMD single-source
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 0     --test $T/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py"
+# SPMD mix, 2 AIV share a source
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 0,1,2 --test $T/spmd_multiblock_mix/test_spmd_multiblock_mix.py"
+# Paged-attn, loop = control tensor (context_lens = slot 1; fill it to shrink the per-batch block count)
+task-submit --device auto --max-time 1800 --run "$L0 --func-id 1 --set-arg 1=512 --case CaseSmall1 --test $T/batch_paged_attention/test_batch_paged_attention.py"
+
+# --- a2a3-ONLY cases (CASES declare no a2a3sim) ---
+# Onboard: run arch-precheck once, then --platform a2a3 (the dump runs on the locked device).
+.claude/skills/onboard-arch-precheck/check.sh a2a3 || exit 1
+L0a="python -m simpler_setup.tools.l0_swimlane --platform a2a3 -g"
+# Single AIC — standalone matmul (genuine single-AIC task)
+task-submit --device auto --max-time 1800 --run "$L0a --func-id 0 --test $T/alternating_matmul_add/test_alternating_matmul_add.py"
+# Paged-attn, loop = scalar (shrink n_blocks to 4)
+task-submit --device auto --max-time 1800 --run "$L0a --func-id 0 --set-arg 4=4 --test $T/paged_attention_unroll/test_paged_attention_unroll.py"
+# Real SPMD workload — qwen3 fa_fused mix {16,17,17}. --set-arg 0=96 sets slot 0
+# (fa_total = outer work-item count) → ceil(96/24)=4 fa blocks → real MMAD + online
+# softmax. Slot 0 is 0 in replay otherwise → empty trace. camodel takes minutes (§7.1).
+task-submit --device auto --max-time 1800 --run "$L0a --func-id 16,17 --set-arg 0=96 --test $E/qwen3_14b_decode/test_qwen3_14b_decode.py"
+```
+
+**Not l0 targets (excluded).** Runtime-mechanics tests (`orch_so_cache`,
+`prepared_callable`, `dynamic_register`, `l3_group`, `l3_dependency`,
+`l3_l2_orch_comm`, `aicore_op_timeout`, `scope_stats`); comm / notify
+demos (`async_notify_demo`, `deferred_notify_demo`,
+`sdma_async_completion_demo`); DFX wrappers that reuse other kernels
+(`dep_gen`, `pmu`, `args_dump`, `l2_swimlane` — they trace `vector_example`
+/ `mixed_example`); `host_build_graph/*` (a different runtime whose dump
+stamps `func_id=[-1]`); `spmd_paged_attention` (`pytest.mark.skip`, a known
+a2a3 507018 flake, #1156; its `[0,1,1]` same-source collapse is covered by
+`spmd_multiblock_mix`); and the `ut/py/test_task_interface.py` unit test.
+
+## 4. Capabilities
+
+What the L0 swimlane shows:
+
+- **Per-pipe occupancy** per sub-core for one task, so a memory-bound vs
+  compute-bound diagnosis is direct.
+- **Cluster overlap** — for a mix, the AIC and AIV sub-cores appear as
+  separate lanes in one trace, so you see how the cooperating kernels'
+  pipelines overlap intra-cluster.
+- **Per-instruction issue overlap** — each instruction is a slice on its
+  pipe lane; the Perfetto sub-lane split makes concurrent issue legible.
+- **Source-line attribution** (with `--debug-line`).
+- **Cross-arch comparison** (`a2a3sim` vs `a5sim`) surfaces real ISA
+  differences (see [§7](#7-findings)).
+
+What it does **not** show (use [L2 swimlane](l2-swimlane-profiling.md)):
+
+- AICPU dispatch / finish latency, scheduler phases, dependency arrows.
+- **Cross-core synchronization timing.** The isolated replay has no
+  AICPU, so orchestration-driven inter-core waits are absent — sub-cores
+  appear freely parallel (see [§9](#9-limitations), tier C).
+- Multi-task placement across clusters. L0 is one task, one cluster.
+
+## 5. How It Works
+
+L0 swimlane is **tooling-only** — there is no dedicated device-side data
+path. It composes three existing pieces: args dump (for args), `msprof
+op simulator` (for the pipeline trace), and a generated replay workspace
+(for the isolated build).
+
+### 5.1 The generated workspace
+
+A single mix-arch translation unit — no per-member files:
+
+| File | Role |
+| ---- | ---- |
+| `replay_kernel.cpp` | The combined `replay_entry`. The AIC member is `#include`d under `#if defined(__DAV_CUBE__)`, the AIV member(s) under `#if defined(__DAV_VEC__)`; `replay_entry` routes each sub-core to its kernel (see [§5.4](#54-mix-together-codegen)) |
+| `replay_launch.cpp` | `replay_entry<<<1, ...>>>` launcher — one block = 1 AIC + 2 AIV sub-cores |
+| `replay_host.cpp` | Builds the 128-byte Tensor descriptors from the dump's real args + fills scalars, then launches. **Auto-generated; never hand-edited** |
+| `CMakeLists.txt` | Single mix-arch `.so` (`--cce-aicore-arch=dav-cXXX`) |
+| `run_collect.sh` | `msprof op simulator` collect (`--kernel-name=replay_entry`) + export |
+
+### 5.2 Args reconstruction (the zero-guess part)
+
+`reconstruct_task_args` reads `args_dump.json` (`data["args"]`), selects
+the records whose `func_id` set equals `--func-id`, groups them by `task_id`
+and keeps one instance (default: the lowest), then **unions both dump
+stages** (inputs + scalars from `before_dispatch`, outputs from
+`after_completion`), keyed by `arg_index`. It returns the
+task's **full positional payload** (every slot, sorted by `arg_index`)
+plus the mix membership (`func_id` array, slot order AIC, AIV0, AIV1).
+Each member kernel reads its own slice of the shared `args[]` (the
+replay places each tensor at its real slot), so feeding the whole
+payload to every member is correct. For each tensor it emits the literal
+shape / strides / dtype / start_offset into `make_desc`, with these
+correctness-critical details:
+
+- **Descriptor field offsets** are pinned by the `static_assert`s in
+  `src/{arch}/runtime/tensormap_and_ringbuffer/runtime/tensor.h`
+  (`sizeof(Tensor) == 128`).
+- **dtype** comes from a string→enum table mirroring
+  `src/common/task_interface/data_type.h` (note `BFLOAT16 = 6`).
+- **Buffer size uses the extent formula**
+  `(start_offset + 1 + Σ(shape[i]-1)*stride[i]) * elem_size`, not
+  `numel` — strided / offset views read past `numel`.
+- Replay data is **memset to 0**; only the descriptor metadata is real.
+  Data-dependent branches / addresses can distort while pure pipeline
+  structure stays faithful (see [§8](#8-fidelity-rules)).
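+
+To see why the extent formula matters, the sketch below compares it with a
+plain `numel` size for a strided, offset view. It assumes strides and
+`start_offset` are counted in elements, strides are non-negative, and no
+dimension is zero; the function names are hypothetical, not the tool's own.
+
+```python
+def extent_bytes(shape, strides, start_offset, elem_size):
+    """Bytes needed so the highest-addressed element of the view is in bounds."""
+    last = start_offset + sum((n - 1) * s for n, s in zip(shape, strides))
+    return (last + 1) * elem_size
+
+
+def numel_bytes(shape, elem_size):
+    """Bytes for a dense tensor of the same shape (too small for views)."""
+    n = 1
+    for d in shape:
+        n *= d
+    return n * elem_size
+
+
+# A 4x8 FLOAT32 view with row pitch 16, starting at element 32.
+shape, strides = [4, 8], [16, 1]
+print(extent_bytes(shape, strides, 32, 4))  # → 352
+print(numel_bytes(shape, 4))                # → 128
+```
+
+For a contiguous tensor at offset 0 the two sizes coincide; for the view above
+a `numel`-sized buffer would be 224 bytes short of the last element the kernel
+touches.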
+
+### 5.3 Build & collect
+
+The smoke build (no NPU) runs cmake + builds `replay_host`, then asserts
+`replay_entry` and `launch_replay` are present in `libreplay_kernel.so`.
+With `--no-collect` it stops here. Otherwise the collect step runs
+`run_collect.sh` (the camodel needs a device context), locates the
+exported `trace.json`, and writes the two viewer copies. Device
+selection follows the lock already held:
+
+- **Under an outer `task-submit`** (`$TASK_DEVICE` set): reuse it, no
+  nested `task-submit`.
+- **Standalone** with `task-submit` on `PATH`: self-lock via
+  `task-submit --device auto`.
+- **No `task-submit`**: unlocked run with a warning (per
+  [running-onboard.md](../../.claude/rules/running-onboard.md)).
+
+### 5.4 Mix-together codegen
+
+`emit_replay_kernel_combined` builds one `replay_entry` that runs every
+member of the mix on its sub-core, in a single translation unit:
+
+- **AIC member** — `#include`d under `#if defined(__DAV_CUBE__)`, so it
+  compiles in the cube ISA variant.
+- **AIV member(s)** — `#include`d under `#if defined(__DAV_VEC__)`, so
+  the vector ISA target feature is in scope (compiling an AIV kernel
+  outside the vec variant fails on `vadd` / `set_vector_mask`).
+- **2 AIV members** — both kernels live in the **same** vec section. To
+  avoid same-TU clashes (both define `extern "C" kernel_entry` and a
+  `static get_num_tiles`), each `#include` is wrapped in
+  `#define kernel_entry l0_f<id>_entry` + `#define get_num_tiles
+  l0_f<id>_get_num_tiles` … `#undef`. Keeping it one TU avoids the
+  cross-object device-link problem (bisheng device-links per `.o`, so a
+  call into a separately-compiled member object does not resolve).
+- **`replay_entry`** (`__global__`) routes: the cube section calls the
+  AIC member; the vec section calls
+  `get_subblockid() == 0 ? <AIV0> : <AIV1>`. A sub-core with no member in
+  the set gets an empty body.
+
+**Per-AIV-lane routing primitive — `get_subblockid()`.** simpler's
+*runtime* treats CCE `get_subblockid()` as unreliable (issue #900: it
+returns 0 for both AIV lanes because the runtime does not program that
+register) and reads `get_sub_block_id(args)` from the slot-49
+`GlobalContext` instead. That variant is **not** usable here: the
+isolated replay synthesizes one shared `args[]`, so slot-49 is a single
+value both lanes read identically. The bare camodel op, however, **does**
+model the physical sub-block id per AIV lane, so `get_subblockid()` is
+the correct primitive in this context — and it is validated to route
+correctly (see [§6](#6-validation)).
+
+### 5.5 SPMD context synthesis
+
+SPMD kernels read an execution context the orchestration builds per
+dispatch — `LocalContext{block_idx, block_num}` at args slot 48 and
+`GlobalContext{sub_block_id}` at slot 49. The isolated replay has no
+orchestration, so `replay_host.cpp` **synthesizes** it: one
+`LocalContext{block_idx=0, block_num=block_dim}` + `GlobalContext`
+pointed at slots 48/49. This is harmless for positional kernels (they
+ignore 48/49) and required for SPMD kernels that read `get_block_idx` /
+`get_block_num` (which would otherwise dereference null). `block_idx=0`
+traces a representative block; `block_num = block_dim` (the **selected**
+case's grid width — the `--case` case, else the auto-pinned first-platform
+case) keeps steady-state branches (`block_idx+1 < block_num`) on their
+normal path — see [§8](#8-fidelity-rules). `--spmd-block-num` overrides
+`block_num`. Note that the per-AIV-lane routing for a mix uses the hardware
+`get_subblockid()` (§5.4), not the synthesized slot-49 value.
+
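+To see why `block_num` must stay above 1, substitute the synthesized
+`block_idx = 0` into the steady-state guard `block_idx+1 < block_num`:
+
+$$
+\text{block\_idx} + 1 < \text{block\_num} \;\Longrightarrow\; 1 < \text{block\_num} \;\Longrightarrow\; \text{block\_num} \ge 2
+$$
+
+For a kernel that uses this guard, it evaluates false at `block_num = 1`,
+so a case whose `block_dim` is 1 would need `--spmd-block-num` set to 2 or
+more to stay on the steady-state path.
+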
+## 6. Validation
+
+Confirmed on the `a2a3sim` camodel (`mixed_example`):
+
+| Mix | func_id | Result |
+| --- | ------- | ------ |
+| MATMUL + ADD | `[0,1]` | `cubecore0` MMAD (MATMUL) + `veccore` VADD (ADD) |
+| ADD_STD + MUL_STD | `[3,4]` | `veccore0` VADD (ADD), `veccore1` VMUL (MUL) |
+| MATMUL + ADD + MUL | `[0,1,2]` | `cubecore0` MMAD, `veccore0` VADD, `veccore1` VMUL |
+
+The 2-AIV cases (`[3,4]`, `[0,1,2]`) confirm `get_subblockid()` routes
+the two physical AIV lanes to distinct kernels in the bare camodel op —
+i.e. the issue-#900 "0-for-both" behavior is a *runtime* artifact and
+does not apply to an isolated replay.
+
+## 7. Findings
+
+Measured behaviors worth knowing before you read a trace.
+
+### 7.1 The a5 camodel is much slower than a2a3 (wall-clock)
+
+The camodel is a cycle-by-cycle, whole-chip (32-core), serial software
+model. "Total tick" is **not** comparable across platforms (tick
+granularity differs); wall-clock and the simulated µs are. a5 costs
+roughly 25× more wall-clock per tick than a2a3, so it is slower
+end-to-end even with fewer ticks. Much of the cost is fixed setup, so
+shrinking a loop count helps only modestly. **Prefer a2a3 for logic
+validation; run a5 only when you specifically need the a5 pipeline.**
+
+### 7.2 `--set-arg` floor for a loop count (without distortion)
+
+Double-buffered prefetch kernels guard the prefetch + `pipe_barrier`
+with `if (i+1 < n_blocks)`:
+
+| `n_blocks` | Captures | Distortion |
+| ---------- | -------- | ---------- |
+| 1 | No prefetch (`if` never runs) | **Distorted** — double-buffering lost |
+| 2 | Prefetch, single buffer phase | Slightly incomplete |
+| 3 | Ping-pong both phases + tail block | Faithful (minimum) |
+| 4 | Plus one steady-state block | Faithful, most stable |
+
+→ **Floor 3, recommend 4; `n_blocks = 1` always distorts.** Shrinking
+the loop count cuts iterations without changing per-block pipeline
+structure; it does **not** change template branches (those are decided
+by tile shape `shapes[0]`, which must stay real).
+
+**Where the loop count lives — scalar vs tensor.** A single-task
+`n_blocks` is a **scalar** (`--set-arg 4=4`); a mix paged-attention
+`n_blocks` is **derived from a `context_lens` tensor**, so `--set-arg`
+fills that buffer (`--set-arg 4=512` → every element 512 →
+`n_blocks = ceil(512 / block_size)`). `--set-arg` accepts a tensor slot
+only for **integer** dtypes. A kernel whose loop count is purely a
+function of tensor **shape** needs no `--set-arg` (the dump shape is
+real).
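+
+As a worked instance of the tensor-derived count, assume a hypothetical
+`block_size = 128` (the real value comes from the test case, not from this
+document). Filling `context_lens` with 512 then gives
+
+$$
+\text{n\_blocks} = \left\lceil \frac{512}{128} \right\rceil = 4,
+$$
+
+the recommended value; under the same assumption, any fill value in
+$(256, 384]$ gives the floor of 3.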
+
+### 7.3 a2a3 (`dav-c220`) vs a5 (`dav-c310`) swimlanes differ
+
+Each platform runs the kernel under a different msprof SoC config (a2a3 =
+`dav_2201` / `dav-c220`, a5 = `dav_3510` / `dav-c310`), so the same kernel
+produces a different L0 swimlane — both in **lane export** (a2a3 shows
+`cubecore0 + veccore0/1`; a5 exports only the sub-cores that ran real
+code, so an AIC-only kernel shows just `cubecore0`) and in **instructions
+/ per-pipe timing** (real ISA). Both are expected, not tool bugs. Read
+`cubecore0` for AIC and `veccore` for AIV, and compare structure *within*
+one platform, not absolute numbers across the two.
+
+### 7.4 Trace can truncate the last loop iteration(s) — self-check
+
+A known **collection-side bug** in CANN's msprof/camodel: the exported
+instruction stream sometimes **ends early**, dropping the last loop
+iteration's compute / write-back. Symptom: `MMAD` / `FIX_L0C_TO_DST`
+counts come out **less than `n_blocks`** while the loads are complete.
+It is **not** a fixed `n-1` rule; the sim runs all blocks, only the
+exported stream is cut.
+
+**Self-check:** after each run, verify `MMAD == n_blocks` **and**
+`FIX_L0C_TO_DST == n_blocks`. If either count falls short, the tail was
+truncated: do not draw timing conclusions; retry with a different `n_blocks`.
+
+## 8. Fidelity Rules
+
+| Knob | Change it? | Distorts? |
+| ---- | ---------- | --------- |
+| Tile M/K/N (shape) | **No** | Alters cycle counts and switches template branches |
+| Case selection (`--case`) | Pick a *scaled-down* case | Faithful if it keeps the tile geometry (just fewer blocks / shorter sequence); a case that changes tile M/K/N / head_dim traces only itself, not production |
+| Scalar values (`scale` / offsets …) | Use real dump values | Wrong value → wrong branch → distorted |
+| Loop count (`n_blocks`, via `--set-arg`) | Shrinkable to ≥ 3–4 | Faithful at ≥ 3–4; `= 1` distorts (§7.2) |
+| Data filled to 0 | Default (memset 0) | Data-dependent branches / addresses distort; pure pipeline structure is fine |
+| SPMD `block_idx` (slot 48) | Fixed 0 | Traces a real block 0 — representative for uniform SPMD |
+| SPMD `block_num` (slot 48) | Default `block_dim`; `--spmd-block-num` | Any value ≥ 2 keeps steady-state branches |
+| Per-AIV-lane routing (`get_subblockid`) | Automatic | Faithful — lanes run their real kernels (§6) |
+| Cross-core sync timing | Not modeled | **Optimistic** — sub-cores appear freely parallel (§9 tier C) |
+| Cross-platform (a2a3 vs a5) | Set by target | Instructions / timing genuinely differ (different real ISA, not a tool bug; §7.3) |
+
+## 9. Limitations
+
+- **AICPU orchestration is out of scope.** L0 sees only the AICore
+  pipeline of one task. For dispatch / finish / scheduler / dependency
+  data use [L2 swimlane](l2-swimlane-profiling.md).
+- **Orchestration-driven sync is not modeled (tier C).** Two kinds of
+  cross-core sync: **(a) in-kernel** — cross-core flags / L2 FIFOs written
+  in the kernel (the AIC↔AIV producer/consumer handshake of a cooperative
+  mix) — these **are reproduced**, the camodel runs the combined binary
+  instruction-by-instruction; **(b) orchestration-driven** — task
+  dependencies / barriers / scheduling the AICPU enforces — these are
+  **absent**, the isolated replay has no AICPU. So a mix's own in-kernel
+  AIC↔AIV coordination is faithful; what's lost is mainly **inter-task**
+  ordering (task A → task B), which is out of L0's single-task scope
+  anyway — that is [L2 swimlane](l2-swimlane-profiling.md)'s view. Edge
+  case: if a mix's sub-core ordering relied on the AICPU rather than
+  in-kernel flags, the replay shows those cores more parallel than
+  reality.
+- **Simulation clock, not silicon.** Use it for *relative* per-pipe /
+  per-arch structure, not absolute-latency claims.
+- **Replay data is zero.** Only descriptor metadata is real; data-driven
+  control flow can diverge (§8).
+- **Tail-truncation collection bug.** Validate `MMAD`/`FIX` counts every
+  run (§7.4).
+- **1C2V only.** The mix path assumes 1 AIC + up to 2 AIV (the only
+  cluster shape both current chips support). A mix with > 2 AIV members,
+  or > 1 AIC, is rejected.
+
+## 10. FAQ / Debug Guide
+
+**`func_id=N not found`.** The first `--func-id` member is not an incore;
+the tool prints the available `(func_id, name, core_type)` from the test's
+`CALLABLE.incores`.
+
+**`--func-id [...] matches no task`.** No dispatched task has exactly that
+member set. The tool lists the `func_id` shapes present in the dump — pick
+one of those (a shape it printed, not an arbitrary combination of func_ids).
+
+**No dump records for the task.** The incore `signature` likely
+disagrees with the dispatched payload, so the dump skipped it — see
+[§3.1](#31-prerequisites-one-time-per-test-case) and
+[args-dump.md](args-dump.md).
+
+**Smoke build fails on a missing symbol.** `replay_entry` /
+`launch_replay` must appear in `libreplay_kernel.so`. A wrong
+`--platform` picks the wrong `ARCH_CONFIG`.
+
+**`ASCEND_HOME_PATH is not set`.** Source CANN's `set_env.sh` first.
+
+**Both AIV lanes show the same kernel.** `get_subblockid()` did not
+distinguish the lanes in your camodel build (the issue-#900 behavior).
+Trace each AIV kernel as its own single-kernel task instead (e.g.
+`--func-id 3` then `--func-id 4`).
+
+**Perfetto shows overlapping / missing slices.** Open
+`<label>_trace_perfetto.json`, not the raw Insight `trace.json`
+(see [§3.4](#34-viewing--insight-vs-perfetto)).
+
+**`MMAD` / `FIX` count < `n_blocks`.** The export truncated the tail
+(§7.4). Re-run or change `n_blocks`.
+
+## 11. Related docs
+
+- [`.claude/skills/l0-swimlane/SKILL.md`](../../.claude/skills/l0-swimlane/SKILL.md)
+  — the operating procedure for this tool (picking `--func-id` /
+  `--set-arg` / `--spmd-block-num`).
+- [l2-swimlane-profiling.md](l2-swimlane-profiling.md) — the
+  per-task / scheduler swimlane one level up.
+- [args-dump.md](args-dump.md) — the `func_id`-array-tagged per-task arg
+  capture L0 reconstructs from.
+- [`.claude/skills/insight-trace/SKILL.md`](../../.claude/skills/insight-trace/SKILL.md)
+  — the manual `msprof op simulator` replay recipe + Perfetto notes.
+- [chip-level-arch.md](../chip-level-arch.md) — the AICore pipe model the
+  lanes represent.
diff --git a/simpler_setup/tools/l0_swimlane.py b/simpler_setup/tools/l0_swimlane.py
new file mode 100644
index 000000000..0a1db15da
--- /dev/null
+++ b/simpler_setup/tools/l0_swimlane.py
@@ -0,0 +1,1431 @@
+#!/usr/bin/env python3
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""l0_swimlane — generate an AICore intra-core swimlane trace.json for a task.
+
+Given a SceneTest test file + platform + a comma-list of func_ids (the mix
+member set), this tool:
+  1. runs (or reuses) a JSON-only args dump (`--dump-args 3`) to capture the
+     task's real per-arg metadata,
+  2. picks the task whose active-subtask set == `--func-id` and reconstructs its
+     FULL positional args[] (shapes / dtypes / strides / start_offset / scalar
+     values) from the #1181 dump; the task's func_id ARRAY is its mix membership,
+  3. generates the "intra-core replay" workspace — a single combined
+     `replay_entry` whose cube sub-core runs the AIC member and vec sub-core(s)
+     the AIV member, so a MIX task replays AS ONE OP (msprof op simulator recipe
+     from .claude/skills/insight-trace/SKILL.md),
+  4. smoke-builds it, then runs `msprof op simulator` and exports the task's
+     trace.json (a combined AIC+AIV swimlane for a mix).
+
+Mix support: any mix (same- or different-source members) replays as one task.
+`--func-id` IS the member set — name the task's FULL set: `--func-id 0` traces
+a task the orchestration dispatches on its own (a single-core task {0}),
+`--func-id 0,1,2` traces a 3-way mix in full. The author wrote the
+orchestration, so the member set is known directly — there is no shape-guessing.
+
+LIMITATION (tier C): cross-core synchronisation that the orchestration drove (task deps /
+barriers) is absent in this isolated replay, so inter-core waits are optimistic
+— the per-core pipeline structure is faithful, the AIC<->AIV handoff timing is
+not. 2-AIV mixes (e.g. func_id 0,1,2) route per AIV lane via the hardware
+get_subblockid(); validated on the a2a3sim camodel (the two AIV lanes run
+different kernels, e.g. VADD vs VMUL).
+
+Usage (onboard — recommended; wrap the whole tool in one task-submit so the dump
+and the camodel collect share the locked device):
+    task-submit --device auto --run \
+        "python -m simpler_setup.tools.l0_swimlane \
+            --test tests/st/a2a3/.../test_paged_attention_unroll.py \
+            --func-id 0 --platform a2a3 --case <small case>"
+
+Both the dump (when `--platform` is onboard) and the msprof collect step need an
+NPU device context. Run the tool under a single outer task-submit: it appends
+--device <id> to this command (also exported as $TASK_DEVICE); that one device
+threads through both the dump and the collect, so neither grabs a second lock.
+With a sim `--platform` (a2a3sim/a5sim) the dump needs no NPU, and the collect
+self-locks via its own task-submit. Requires ASCEND_HOME_PATH (source CANN
+set_env.sh first).
+"""
+
+import argparse
+import importlib.util
+import json
+import os
+import re
+import shutil
+import subprocess
+import sys
+from collections import defaultdict
+from datetime import datetime
+from pathlib import Path
+
+from simpler_setup.environment import PROJECT_ROOT
+from simpler_setup.platform_info import parse_platform
+from simpler_setup.pto_isa import ensure_pto_isa_root
+
+# Dump emits dtype as an UPPERCASE string (src/common/task_interface/data_type.h
+# get_dtype_name). Map string -> (DataType raw enum value, element bytes).
+DTYPE_RAW = {
+    "FLOAT32": 0,
+    "FLOAT16": 1,
+    "INT32": 2,
+    "INT16": 3,
+    "INT8": 4,
+    "UINT8": 5,
+    "BFLOAT16": 6,
+    "INT64": 7,
+    "UINT64": 8,
+    "UINT16": 9,
+    "UINT32": 10,
+}
+DTYPE_SIZE = {
+    "FLOAT32": 4,
+    "FLOAT16": 2,
+    "INT32": 4,
+    "INT16": 2,
+    "INT8": 1,
+    "UINT8": 1,
+    "BFLOAT16": 2,
+    "INT64": 8,
+    "UINT64": 8,
+    "UINT16": 2,
+    "UINT32": 4,
+}
+
+# --set-arg tensor fill writes an integer into every element, so it only makes
+# sense for integer-typed control tensors (loop counts / indices like
+# context_lens). Filling a float/bf16 tensor with an int is almost always a
+# mistake, so it is refused.
+INTEGER_DTYPES = frozenset(
+    {
+        "INT8",
+        "INT16",
+        "INT32",
+        "INT64",
+        "UINT8",
+        "UINT16",
+        "UINT32",
+        "UINT64",
+    }
+)
+
+KARGS_SLOTS = 50  # MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(32) + 2
+
+# Per-arch build parameters. soc = msprof/simulator SoC version (and the
+# $CANN/aarch64-linux/simulator/<soc>/lib path); aicore_arch = bisheng
+# --cce-aicore-arch (single mix-arch, no -cube/-vec suffix); cce / npu_arch are
+# the standalone-compile prologue macros the kernel headers expect.
+# aiv_lanes_per_block = AIV lanes per AICore cluster (the hardware subblockdim).
+# Both a2a3 and a5 are 1C2V (1 AIC + 2 AIV per cluster), so this is 2; it sizes
+# the AIV context rows in the mix replay. It is a hardware constant, not a
+# per-kernel property — hence it lives here, not in the test.
+ARCH_CONFIG = {
+    "a2a3": {
+        "soc": "dav_2201",
+        "aicore_arch": "dav-c220",
+        "cce": 220,
+        "npu_arch": "PTO_NPU_ARCH_A2A3",
+        "aiv_lanes_per_block": 2,
+    },
+    "a5": {
+        "soc": "dav_3510",
+        "aicore_arch": "dav-c310",
+        "cce": 310,
+        "npu_arch": "PTO_NPU_ARCH_A5",
+        "aiv_lanes_per_block": 2,
+    },
+}
+
+# --- SPMD context + mix detection (no per-test markers needed) --------------
+# SPMD kernels read an execution context at args slots 48/49 —
+# LocalContext{block_idx, block_num} (48) and GlobalContext{sub_block_id} (49) —
+# which the orchestration builds per dispatch. l0_swimlane runs an isolated
+# replay (no orchestration), so it SYNTHESIZES that context host-side. None of
+# its inputs need a per-test marker; they are all derived:
+#   * is-mix      — a cooperative mix kernel is the SAME source compiled for both
+#                   sub-cores, so it appears as an (aic, aiv) incore PAIR sharing
+#                   one source. Detected by load_kernel_meta; everything else
+#                   (incl. independent kernels packed into a mix dispatch, like
+#                   mixed_example) goes through the AIC/AIV-only path.
+#   * hw_block_dim / block_num — the case's `block_dim` (the SPMD grid width).
+#   * aiv_lanes_per_block      — the arch's hardware subblockdim (ARCH_CONFIG).
+# The mix path additionally needs ONE incore to declare the full tensor
+# `signature` (so the dump captures the shared args) — a standard CALLABLE
+# field, not a tool-specific one.
+
+
+# ---------------------------------------------------------------------------
+# Step 1: read kernel metadata (source path + core_type) from the test file
+# ---------------------------------------------------------------------------
+def _first_platform_case(cls, platform):
+    """The case l0 auto-pins when `--case` is omitted: the name of the FIRST
+    `CASES[*]` that lists `platform` in its `platforms` (manual or not — l0
+    always dumps with `--manual include`). None if no case lists the platform.
+    Auto-pinning one case makes the dump deterministic (no "run all cases,
+    reconstruct from the newest dump dir" ambiguity), and ties the synthesized
+    slot-48 block_num to the SAME case the dump ran (block_dim resolved via the
+    caller's per-case map)."""
+    for c in getattr(cls, "CASES", []):
+        if platform in c.get("platforms", []):
+            return c.get("name")
+    return None
+
+
+def load_kernel_meta(test_path: Path, func_id: int, platform: str):
+    spec = importlib.util.spec_from_file_location("l0_swimlane_testmod", str(test_path))
+    if spec is None or spec.loader is None:
+        raise ImportError(f"Cannot import test module from {test_path}")
+    module = importlib.util.module_from_spec(spec)
+    # Register before exec so @scene_test's inspect.getfile(cls) can resolve the
+    # class -> module -> file path for relative CALLABLE source resolution.
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+
+    from simpler_setup import SceneTestCase  # noqa: PLC0415
+
+    classes = [
+        v
+        for v in vars(module).values()
+        if isinstance(v, type) and issubclass(v, SceneTestCase) and v is not SceneTestCase and hasattr(v, "CALLABLE")
+    ]
+    if not classes:
+        raise ValueError(f"No SceneTestCase with CALLABLE found in {test_path}")
+
+    # Build a func_id -> metadata lookup across every incore of every class. With
+    # the #1181 dump the mix membership is read from the dump's func_id array (a
+    # task property), not guessed here from a shared source — so this is a flat
+    # per-func lookup, and the codegen resolves each mix member through it.
+    #
+    # CALLABLE `source` is relative to the test file's directory (e.g.
+    # "../../mixed_example/kernels/aic/kernel_matmul.cpp"). Resolve each to an
+    # ABSOLUTE path so the generated replay_kernel.cpp can `#include` it from the
+    # workspace dir, and so a mix whose members live in different dirs each
+    # resolve correctly.
+    by_func = {}
+    owner_cls = {}
+    for cls in classes:
+        for inc in cls.CALLABLE.get("incores", []):
+            fid = inc["func_id"]
+            by_func[fid] = {
+                "func_id": fid,
+                "source": (test_path.parent / inc["source"]).resolve(),
+                "core_type": inc["core_type"],
+                "name": inc.get("name") or Path(inc["source"]).stem,
+            }
+            owner_cls[fid] = cls
+    if func_id not in by_func:
+        avail = ", ".join(f"{f}={m['name']}({m['core_type']})" for f, m in sorted(by_func.items()))
+        raise ValueError(f"func_id={func_id} not found. Available incores: {avail}")
+    tgt = by_func[func_id]
+    cls = owner_cls[func_id]
+    auto_case = _first_platform_case(cls, platform)
+    # name -> block_dim for every case, so main can resolve block_dim from the
+    # case actually selected (--case X, or the auto-pinned first-platform case),
+    # not an arbitrary CASES entry. A case declaring no block_dim maps to 1.
+    block_dim_by_case = {
+        c.get("name"): int(c.get("config", {}).get("block_dim") or 1) for c in getattr(cls, "CASES", [])
+    }
+    return {
+        "by_func": by_func,
+        "target_func_id": func_id,
+        "source": tgt["source"],
+        "core_type": tgt["core_type"],
+        "name": tgt["name"],
+        "class_name": cls.__name__,
+        "auto_case": auto_case,
+        "block_dim_by_case": block_dim_by_case,
+    }
+
+
+def _case_from_manifest(manifest: Path, class_name: str) -> str:
+    """Recover the case name from the dump dir `<ClassName>_<Case>_<YYYYMMDD>_<HHMMSS>`."""
+    run_dir = manifest.parent.parent.name  # args_dump's parent = the run dir
+    m = re.match(rf"{re.escape(class_name)}_(.+)_\d{{8}}_\d{{6}}$", run_dir)
+    return m.group(1) if m else "case"
+
+
+# ---------------------------------------------------------------------------
+# Step 2: obtain an args_dump.json (run the test with --dump-args 3, or reuse one)
+# ---------------------------------------------------------------------------
+def get_or_run_dump(test_path: Path, platform: str, variant: str, dump_json, case=None, device=None):
+    if dump_json:
+        p = Path(dump_json)
+        if not p.is_file():
+            raise FileNotFoundError(f"--dump-json not found: {p}")
+        return p
+
+    outputs = PROJECT_ROOT / "outputs"
+    before = set(outputs.glob("*/args_dump")) if outputs.is_dir() else set()
+    # Level 3 (full, JSON-only): every task's tensor *metadata* (shape/dtype/
+    # strides) + scalar values, no .bin payload copy. That is exactly what arg
+    # reconstruction consumes (it never reads the payload), and it skips the
+    # device->host arena copy entirely — cheaper, and avoids the large-shape
+    # copy failing onboard.
+    cmd = [sys.executable, str(test_path), "-p", platform, "--dump-args", "3"]
+    # Pin the dump to exactly one case, allowing it to be `manual` (l0_swimlane
+    # tracing targets are often manual to stay out of CI). `case` is main's
+    # resolved case: the explicit --case, else the auto-pinned first-platform
+    # case. Pinning one case keeps the dump deterministic — the reconstruction
+    # can't pick the wrong (newest) dump dir when several cases ran. `case` is
+    # None only when the test declares no case on this platform (single-case /
+    # no-CASES); then the dump runs the test as-is.
+    if case:
+        cmd += ["--case", case, "--manual", "include"]
+    # Onboard dump runs on a real NPU and must use the task-submit-locked device.
+    # When the whole tool is wrapped in an outer task-submit, that device reaches
+    # us via the appended --device <id> (resolved into `device` by main) — thread
+    # it into the test so the dump and the later camodel collect share the one
+    # lock instead of grabbing separate devices. Sim variants need no device.
+    if variant != "sim":
+        if device:
+            cmd += ["--device", device]
+        else:
+            print(
+                "[l0_swimlane] WARNING: onboard dump with no locked device — not "
+                "under task-submit; the dump will use the framework default "
+                "device unlocked. Wrap the whole tool in task-submit (see "
+                ".claude/rules/running-onboard.md)."
+            )
+    print(f"[l0_swimlane] running dump: {' '.join(cmd[1:])}")
+    # check=True on purpose: a golden PASS on the dump run is the free
+    # validation that func_id / signature / args are wired correctly, so the
+    # captured dump is trustworthy. A golden FAIL means the capture is suspect
+    # — never reconstruct a trace from it.
+    subprocess.run(cmd, cwd=str(PROJECT_ROOT), check=True)
+    after = set(outputs.glob("*/args_dump"))
+    new = sorted(after - before, key=lambda p: p.stat().st_mtime)
+    # MUST come from THIS run. Never fall back to a pre-existing args_dump —
+    # that would silently reconstruct args from an unrelated test's dump and
+    # produce a wrong-but-passing trace. An empty `new` means the dump was
+    # skipped (e.g. signature/payload mismatch -> "tensor dump skipped" /
+    # "No tensor dump data to export" warnings above).
+    if not new:
+        raise RuntimeError(
+            f"dump for {test_path.name} produced no NEW outputs/*/args_dump — the "
+            f"tensor dump was skipped (see the 'tensor dump skipped' / 'No tensor "
+            f"dump data to export' warnings above). The incore signature likely does "
+            f"not match the dispatched payload. Refusing to reuse a stale dump."
+        )
+    cand = new[-1]
+    manifest = cand / "args_dump.json"
+    if not manifest.is_file():
+        raise RuntimeError(f"manifest missing (dump produced no data): {manifest}")
+    return manifest
+
+
+# ---------------------------------------------------------------------------
+# Step 3: reconstruct one task's args from the #1181 dump (func_id ARRAY model)
+# ---------------------------------------------------------------------------
+def reconstruct_task_args(manifest: Path, func_id_list, task_id=None):
+    """Reconstruct one task's full positional args[] from the #1181 dump.
+
+    The dump (`args_dump.json`, top-level "args") stamps every record with the
+    task's active-subtask membership as a func_id ARRAY (slot order AIC, AIV0,
+    AIV1) and emits each payload slot once, positionally. The caller names the
+    exact member set via `--func-id` (e.g. `0,1,2`); we pick the task whose
+    func_id SET matches it and take its FULL payload (every slot, sorted by
+    arg_index); each member kernel reads its own slice. Returns
+    (chosen_task_id, tensor_count, args, mix_func_ids), with mix_func_ids the
+    chosen task's func_id array AS STORED (slot order — NOT the typed order, so
+    lane assignment stays correct).
+
+    `task_id` pins a specific instance; default = lowest.
+    """
+    data = json.loads(manifest.read_text())
+    entries = data["args"]
+
+    def fids(t):
+        return tuple(t.get("func_id") or [])
+
+    want = set(func_id_list)
+    recs = [t for t in entries if set(fids(t)) == want]
+    if not recs:
+        shapes = [list(s) for s in sorted({fids(t) for t in entries})]
+        raise ValueError(f"--func-id {sorted(want)} matches no task; dump has func_id shapes {shapes}")
+
+    tasks = sorted({t["task_id"] for t in recs})
+    chosen = task_id if task_id is not None else tasks[0]
+    if chosen not in set(tasks):
+        raise ValueError(f"task_id {chosen} not among the selected tasks ({tasks})")
+    trecs = [t for t in recs if t["task_id"] == chosen]
+    mix_func_ids = list(fids(trecs[0]))  # slot order: AIC, AIV0, AIV1
+
+    # Union both stages, keyed by arg_index (INOUT appears twice -> keep one).
+    by_arg = {}
+    for t in trecs:
+        ai = t["arg_index"]
+        if ai not in by_arg or t["stage"] == "before_dispatch":
+            by_arg[ai] = t
+
+    tensors = sorted((t for t in by_arg.values() if t["kind"] != "scalar"), key=lambda t: t["arg_index"])
+    scalars = sorted((t for t in by_arg.values() if t["kind"] == "scalar"), key=lambda t: t["arg_index"])
+    tensor_count = len(tensors)
+    # Args need NOT start at arg_index 0 or be contiguous: a kernel dispatched as
+    # a non-first MIX subtask reads its tensors at an offset (e.g. mixed_example's
+    # ADD reads args[3..5], MUL reads args[6..8]). So validate only that every arg
+    # slot is distinct and each tensor has a shape; the kernel reads args[slot],
+    # and the replay places each tensor at its real slot (decoupled from the
+    # 0-based descriptor-array index — see emit_replay_host).
+    seen = set()
+    for t in tensors:
+        if not t.get("shape"):
+            raise ValueError(f"tensor arg {t['arg_index']} has empty shape")
+        if t["arg_index"] in seen:
+            raise ValueError(f"duplicate tensor arg_index {t['arg_index']}")
+        seen.add(t["arg_index"])
+    for s in scalars:
+        if s["arg_index"] in seen:
+            raise ValueError(f"scalar arg_index {s['arg_index']} collides with a tensor slot")
+        seen.add(s["arg_index"])
+
+    args = []
+    for t in tensors:
+        dt = t["dtype"].upper()
+        shape = list(t["shape"])
+        strides = list(t.get("strides") or _row_major(shape))
+        args.append(
+            {
+                "kind": "tensor",
+                "slot": t["arg_index"],
+                "dtype": dt,
+                "shape": shape,
+                "strides": strides,
+                "start_offset": int(t.get("start_offset", 0)),
+            }
+        )
+    for s in scalars:
+        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
+    return chosen, tensor_count, args, mix_func_ids
+
+
+def _row_major(shape):
+    st = [1] * len(shape)
+    acc = 1
+    for i in range(len(shape) - 1, -1, -1):
+        st[i] = acc
+        acc *= shape[i]
+    return st
+
+
+def _extent_elem(shape, strides):
+    e = 1
+    for s, stride in zip(shape, strides):
+        if s > 0:
+            e += (s - 1) * stride
+    return e
+
+
+def _is_contiguous(shape, strides, start_offset):
+    exp = 1
+    for s, stride in zip(reversed(shape), reversed(strides)):
+        if stride != exp:
+            return False
+        exp *= s
+    return start_offset == 0
+
+
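+# Worked example for the stride helpers above (illustrative values, computed by
+# hand; not taken from a real dump):
+#   dense shape [2, 3, 4]:
+#     _row_major([2, 3, 4])                     -> [12, 4, 1]
+#     _extent_elem([2, 3, 4], [12, 4, 1])       -> 1 + 1*12 + 2*4 + 3*1 = 24
+#     _is_contiguous([2, 3, 4], [12, 4, 1], 0)  -> True
+#   strided view, shape [2, 3] with strides [8, 1] and start_offset 4:
+#     _extent_elem([2, 3], [8, 1])              -> 1 + 1*8 + 2*1 = 11
+#     _is_contiguous([2, 3], [8, 1], 4)         -> False (stride 8 != 3)
+#     replay buffer for a 4-byte dtype          -> (4 + 11) * 4 = 60 bytes
+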
+# ---------------------------------------------------------------------------
+# Step 4: code-generation for the 5 workspace files
+# ---------------------------------------------------------------------------
+def _prologue(cfg) -> str:
+    return f"""\
+#ifndef __CCE_AICORE__
+#define __CCE_AICORE__ {cfg["cce"]}
+#endif
+#include <cce_aicore_intrinsics.h>
+#ifndef {cfg["npu_arch"]}
+#define {cfg["npu_arch"]}
+#endif
+#ifndef EVENT_ID7
+#define EVENT_ID7 ((event_t)7)
+#endif
+#ifndef PIPE_FIX
+#define PIPE_FIX ((pipe_t)10)
+#endif
+"""
+
+
+def _split_cores(members):
+    """Partition mix members into the AIC and AIV sub-cores (slot order preserved)."""
+    aic = [m for m in members if m["core_type"] == "aic"]
+    aiv = [m for m in members if m["core_type"] == "aiv"]
+    return aic, aiv
+
+
+def _member_entry_symbol(m):
+    """Per-member renamed kernel entry symbol (B-tier `#define`-rename path, single TU)."""
+    return f"l0_f{m['func_id']}_entry"
+
+
+def emit_replay_kernel_combined(members, cfg) -> str:
+    """Combined replay_entry for a whole mix task (mix-together).
+
+    `members` is the mix's incore metadata in func_id (slot) order — each
+    {"func_id","core_type","source"(abs Path),"name"}.
+
+    Two regimes by AIV count:
+
+    * **A-tier (<=1 AIV)** — single translation unit: the AIC member is
+      `#include`d under `#if defined(__DAV_CUBE__)`, the AIV member under
+      `#if defined(__DAV_VEC__)`. Each guard is live in a different arch variant
+      of the one mix-arch compile, so the two `kernel_entry`/`get_num_tiles`
+      definitions never clash. The simulator runs both from one `<<<1>>>` launch (one
+      block = 1 AIC + 2 AIV), giving a combined swimlane. With one AIV member
+      the vec body has no per-lane guard, so BOTH AIV lanes run it — redundant
+      for a 1-AIV mix, but the faithful behavior for an SPMD mix whose two AIV
+      lanes legitimately run the same kernel (same-source AIV members are
+      collapsed in the body below), each lane differing only by `get_subblockid()`.
+
+    * **B-tier (2 AIV)** — still ONE translation unit, but the two AIV `.cpp`
+      share file-scope names (`static get_num_tiles`, `extern "C" kernel_entry`),
+      so each `#include` is wrapped in `#define`-renames + `#undef`
+      (`kernel_entry`->`l0_f<id>_entry`, `get_num_tiles`->`l0_f<id>_get_num_tiles`).
+      Both AIV includes sit under `#if defined(__DAV_VEC__)` so the vector ISA
+      target feature is in scope (an un-guarded include compiles for the wrong
+      arch — that is why separate compilation failed). `replay_entry` routes per
+      sub-core, and per AIV lane via `get_subblockid()`.
+
+    NOTE (tier C, PLAN §3.4): cross-core sync the orchestration drove is absent
+    in this isolated replay, so inter-core waits are optimistic.
+    """
+    aic, aiv = _split_cores(members)
+    # Two AIV members that share ONE source are the same program on both lanes —
+    # whether the orchestration gave them the same func_id (spmd_paged_attention_highperf:
+    # aiv0 = aiv1 = PA_AIV → dump `[0,1,1]`) or two distinct func_ids that compile
+    # the SAME `.cpp` (spmd_multiblock_mix: func 1 & 2 both `kernel_spmd_mix.cpp` →
+    # dump `[0,1,2]`). Collapse them by source: the A-tier single-AIV path
+    # `#include`s the source ONCE and both lanes run it (the kernel self-routes by
+    # sub-block id). Including the same source twice would redefine its file-scope
+    # statics (the rename below only covers `kernel_entry` / `get_num_tiles`); only
+    # DISTINCT-source AIV members need the 2-AIV rename path below.
+    seen_aiv = set()
+    aiv = [m for m in aiv if not (m["source"] in seen_aiv or seen_aiv.add(m["source"]))]
+    if len(aic) > 1:
+        raise NotImplementedError(f"{len(aic)} AIC members in one task; 1C2V has a single AIC sub-core")
+    if len(aiv) > 2:
+        raise NotImplementedError(f"{len(aiv)} distinct AIV members in one task; 1C2V has at most 2 AIV sub-cores")
+
+    if len(aiv) <= 1:
+        # A-tier: single-TU #include path (cube=AIC variant, vec=AIV variant).
+        def arch_section(arch_macro, member):
+            if member is None:
+                return "", f"#if defined({arch_macro})\n    // no member on this sub-core.\n#endif"
+            inc = f'#if defined({arch_macro})\n{_prologue(cfg)}#include "{member["source"]}"\n#endif\n'
+            body = (
+                f"#if defined({arch_macro})\n"
+                f"    kernel_entry(args);  // func_id={member['func_id']} {member['name']} ({member['core_type']})\n"
+                f"#endif"
+            )
+            return inc, body
+
+        cube_inc, cube_body = arch_section("__DAV_CUBE__", aic[0] if aic else None)
+        vec_inc, vec_body = arch_section("__DAV_VEC__", aiv[0] if aiv else None)
+        desc = ", ".join(f"{m['name']}(func {m['func_id']},{m['core_type']})" for m in members)
+        return f"""\
+#include <stdint.h>
+
+#ifndef AICORE
+#define AICORE [aicore]
+#endif
+
+// Combined mix replay_entry — cube sub-core runs the AIC member, vec the AIV
+// member; one mix-arch binary, the simulator runs each sub-core's path.
+// Mix members (slot order): {desc}
+{cube_inc}{vec_inc}
+extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args) {{
+{cube_body}
+{vec_body}
+}}
+"""
+
+    # B-tier (2 AIV): ONE TU like A-tier. The two AIV .cpp share file-scope names
+    # (static get_num_tiles + extern "C" kernel_entry), so each #include is
+    # wrapped in #define-renames + #undef. Both AIV includes go under
+    # #if defined(__DAV_VEC__) so the vector ISA target feature is in scope (an
+    # un-guarded include compiles for the wrong arch — that broke separate
+    # compilation). The renamed entries are routed per sub-core / AIV lane.
+    aiv0, aiv1 = aiv[0], aiv[1]
+
+    def rename_block(m):
+        # Rename every file-scope name the co-resident sources SHARE. Repo
+        # convention: kernel_entry + get_num_tiles (the *_impl helpers already
+        # differ, e.g. add_impl/mul_impl; standalone AIVs have no get_num_tiles,
+        # so renaming it is a harmless no-op there). A future 2-AIV pair that
+        # shares OTHER file-scope statics must add them to this list.
+        return (
+            f"#define kernel_entry {_member_entry_symbol(m)}\n"
+            f"#define get_num_tiles l0_f{m['func_id']}_get_num_tiles\n"
+            f'#include "{m["source"]}"\n'
+            f"#undef get_num_tiles\n"
+            f"#undef kernel_entry\n"
+        )
+
+    cube_inc = f"#if defined(__DAV_CUBE__)\n{_prologue(cfg)}{rename_block(aic[0])}#endif\n" if aic else ""
+    vec_inc = f"#if defined(__DAV_VEC__)\n{_prologue(cfg)}{rename_block(aiv0)}{rename_block(aiv1)}#endif\n"
+    if aic:
+        cube_body = (
+            f"#if defined(__DAV_CUBE__)\n"
+            f"    {_member_entry_symbol(aic[0])}(args);  // AIC: func_id={aic[0]['func_id']} {aic[0]['name']}\n"
+            f"#endif"
+        )
+    else:
+        cube_body = "#if defined(__DAV_CUBE__)\n    // no AIC member in this mix.\n#endif"
+    desc = ", ".join(f"{m['name']}(func {m['func_id']},{m['core_type']})" for m in members)
+    return f"""\
+#include <stdint.h>
+
+#ifndef AICORE
+#define AICORE [aicore]
+#endif
+
+// Combined mix replay_entry (B-tier: 2 AIV) — single TU. Each member #include is
+// wrapped in #define-renames (kernel_entry -> l0_f<id>_entry, get_num_tiles ->
+// l0_f<id>_get_num_tiles) + #undef so the two AIV kernels' shared file-scope
+// names don't clash; each include sits under its sub-core arch guard so the ISA
+// target feature is in scope. Mix members (slot order): {desc}
+{cube_inc}{vec_inc}
+extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args) {{
+{cube_body}
+#if defined(__DAV_VEC__)
+    // Per-AIV-lane routing. RISK (issue #900 / runtime intrinsic.h): simpler's
+    // runtime considers the CCE get_subblockid() unreliable (returns 0 for BOTH
+    // AIV lanes because the runtime does not program that register). Whether a
+    // bare camodel replay op returns the true physical lane (0=AIV0, 1=AIV1) is
+    // UNVERIFIED — validate the trace shows veccore0 and veccore1 running
+    // DIFFERENT kernels. get_sub_block_id(args) is NOT usable here: the replay's
+    // single shared args[] gives both lanes the same slot-49 GlobalContext.
+    if (get_subblockid() == 0) {{
+        {_member_entry_symbol(aiv0)}(args);  // AIV0: func_id={aiv0["func_id"]} {aiv0["name"]}
+    }} else {{
+        {_member_entry_symbol(aiv1)}(args);  // AIV1: func_id={aiv1["func_id"]} {aiv1["name"]}
+    }}
+#endif
+}}
+"""
+
+
+def emit_replay_launch() -> str:
+    return """\
+#include <stdint.h>
+#ifndef AICORE
+#define AICORE [aicore]
+#endif
+
+extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args);
+
+// HW_BLOCK_NUM = 1: single task in isolation.
+extern "C" void launch_replay(void *args, void *stream) {
+    replay_entry<<<1, nullptr, stream>>>((__gm__ int64_t *)args);
+}
+"""
+
+
+def _emit_tensor_alloc_descs(args):
+    """Per-tensor (alloc, desc, argrow, free) C snippets.
+
+    `ti` is the tensor's 0-based index in the descriptor array `d_tensors`;
+    `slot` is its real args[] position. These differ when the kernel reads its
+    tensors at an offset (a non-first MIX subtask, e.g. args[3..5]) — so the
+    descriptor index and the args slot are decoupled.
+    """
+    alloc, descs, argrow, frees = [], [], [], []
+    tensor_args = [a for a in args if a["kind"] == "tensor"]
+    for ti, a in enumerate(tensor_args):
+        slot = a["slot"]
+        shape, strides, dt = a["shape"], a["strides"], a["dtype"]
+        esz = DTYPE_SIZE[dt]
+        buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
+        contig = 1 if _is_contiguous(shape, strides, a["start_offset"]) else 0
+        ndims = len(shape)
+        shp = ", ".join(str(x) for x in shape)
+        strd = ", ".join(str(x) for x in strides)
+        # Default: data memset to 0 (only descriptor metadata is real). When
+        # --set-arg fills this tensor, write VALUE into every element instead —
+        # for control tensors whose CONTENT drives the kernel (e.g. paged
+        # attention reads n_blocks from the context_lens tensor). The low `esz`
+        # bytes of the int64 VALUE are copied per element (correct for any
+        # integer width, little-endian).
+        fill = a.get("fill")
+        if fill is None:
+            init = f"    ACL_CHECK(aclrtMemset(d_t{ti}, t{ti}Bytes, 0, t{ti}Bytes));"
+        else:
+            init = (
+                f"    {{\n"
+                f"        std::vector<unsigned char> hbuf{ti}(t{ti}Bytes, 0);\n"
+                f"        const int64_t fillv{ti} = {fill}LL;\n"
+                f"        for (size_t off = 0; off + {esz} <= t{ti}Bytes; off += {esz})\n"
+                f"            memcpy(hbuf{ti}.data() + off, &fillv{ti}, {esz});\n"
+                f"        ACL_CHECK(aclrtMemcpy(d_t{ti}, t{ti}Bytes, hbuf{ti}.data(), t{ti}Bytes,\n"
+                f"                              ACL_MEMCPY_HOST_TO_DEVICE));\n"
+                f"    }}"
+            )
+        alloc.append(
+            f"    void *d_t{ti} = nullptr;\n"
+            f"    const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
+            f"    ACL_CHECK(aclrtMalloc(&d_t{ti}, t{ti}Bytes, ACL_MEM_MALLOC_HUGE_FIRST));\n"
+            f"{init}"
+        )
+        descs.append(
+            f"    {{\n"
+            f"        const uint32_t shp[] = {{{shp}}};\n"
+            f"        const uint32_t strd[] = {{{strd}}};\n"
+            f"        make_desc(h_tensors.data() + {ti} * 128, (uint64_t)(uintptr_t)d_t{ti},\n"
+            f"                  t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
+            f"    }}"
+        )
+        argrow.append(f"    h_args[{slot}] = (int64_t)((uintptr_t)d_tensors + (size_t){ti} * 128);")
+        frees.append(f"    aclrtFree(d_t{ti});")
+    return alloc, descs, argrow, frees
+
+
+def emit_replay_host(tensor_count: int, args, block_num: int = 1) -> str:
+    alloc, descs, argrow, free_list = _emit_tensor_alloc_descs(args)
+    for a in args:
+        if a["kind"] == "scalar":
+            argrow.append(f"    h_args[{a['slot']}] = (int64_t){a['value']}LL;  // scalar")
+    frees = "\n".join(free_list)
+
+    return f"""\
+// Auto-generated by simpler_setup.tools.l0_swimlane — do not edit by hand.
+// Builds the kernel's real args (from tensor dump) and launches replay_entry.
+#include <acl/acl.h>
+
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <vector>
+
+#define ACL_CHECK(expr)                                                        \\
+    do {{                                                                       \\
+        aclError _e = (expr);                                                  \\
+        if (_e != ACL_SUCCESS) {{                                              \\
+            fprintf(stderr, "ACL error %d at %s:%d\\n", _e, __FILE__, __LINE__);\\
+            return 1;                                                          \\
+        }}                                                                     \\
+    }} while (0)
+
+// 128B Tensor descriptor — offsets pinned by static_assert in tensor.h:
+//   buffer.addr@0 buffer.size@8 start_offset@24 ndims@36 dtype@40
+//   is_contiguous@42 shapes@44 strides@72.
+static void make_desc(void *dst128, uint64_t dev_addr, uint64_t buf_bytes,
+                      uint64_t start_offset, const uint32_t *shapes,
+                      const uint32_t *strides, uint32_t ndims, uint8_t dtype_raw,
+                      uint8_t is_contig) {{
+    uint8_t b[128];
+    memset(b, 0, sizeof(b));
+    *reinterpret_cast<uint64_t *>(b + 0) = dev_addr;
+    *reinterpret_cast<uint64_t *>(b + 8) = buf_bytes;
+    *reinterpret_cast<uint64_t *>(b + 24) = start_offset;
+    *reinterpret_cast<int32_t *>(b + 32) = 0;
+    *reinterpret_cast<uint32_t *>(b + 36) = ndims;
+    b[40] = dtype_raw;
+    b[42] = is_contig;
+    for (uint32_t i = 0; i < ndims; ++i)
+        *reinterpret_cast<uint32_t *>(b + 44 + 4 * i) = shapes[i];
+    for (uint32_t i = 0; i < ndims; ++i)
+        *reinterpret_cast<uint32_t *>(b + 72 + 4 * i) = strides[i];
+    memcpy(dst128, b, sizeof(b));
+}}
+
+extern "C" void launch_replay(void *args, void *stream);
+
+int main() {{
+    const char *dev_s = getenv("ACL_DEVICE_ID");
+    int32_t device_id = dev_s ? atoi(dev_s) : 0;
+    ACL_CHECK(aclInit(nullptr));
+    ACL_CHECK(aclrtSetDevice(device_id));
+    aclrtStream stream = nullptr;
+    ACL_CHECK(aclrtCreateStream(&stream));
+
+    constexpr int kArgsSlots = {KARGS_SLOTS};
+    constexpr int kNumTensors = {tensor_count};
+
+{chr(10).join(alloc)}
+
+    std::vector<uint8_t> h_tensors((size_t)kNumTensors * 128, 0);
+{chr(10).join(descs)}
+
+    void *d_tensors = nullptr;
+    ACL_CHECK(aclrtMalloc(&d_tensors, (size_t)kNumTensors * 128, ACL_MEM_MALLOC_HUGE_FIRST));
+    ACL_CHECK(aclrtMemcpy(d_tensors, (size_t)kNumTensors * 128, h_tensors.data(),
+                          (size_t)kNumTensors * 128, ACL_MEMCPY_HOST_TO_DEVICE));
+
+    std::vector<int64_t> h_args(kArgsSlots, 0);
+{chr(10).join(argrow)}
+
+    // SPMD context at slots 48/49 — built unconditionally. Harmless for
+    // positional kernels (they ignore 48/49); required for SPMD kernels that
+    // read get_block_idx / get_block_num / get_sub_block_id, which would
+    // otherwise dereference a null context. block_idx=0 traces a representative
+    // block; block_num={block_num} (the case's block_dim) keeps steady-state
+    // branches (e.g. `block_idx+1 < block_num`) on their normal path.
+    uint8_t h_local[64] = {{0}};   // LocalContext: block_idx@0, block_num@4
+    *reinterpret_cast<int32_t *>(h_local + 0) = 0;
+    *reinterpret_cast<int32_t *>(h_local + 4) = {block_num};
+    uint8_t h_global[16] = {{0}};  // GlobalContext: sub_block_id@0 = 0
+    void *d_local = nullptr, *d_global = nullptr;
+    ACL_CHECK(aclrtMalloc(&d_local, sizeof(h_local), ACL_MEM_MALLOC_HUGE_FIRST));
+    ACL_CHECK(aclrtMemcpy(d_local, sizeof(h_local), h_local, sizeof(h_local), ACL_MEMCPY_HOST_TO_DEVICE));
+    ACL_CHECK(aclrtMalloc(&d_global, sizeof(h_global), ACL_MEM_MALLOC_HUGE_FIRST));
+    ACL_CHECK(aclrtMemcpy(d_global, sizeof(h_global), h_global, sizeof(h_global), ACL_MEMCPY_HOST_TO_DEVICE));
+    h_args[48] = (int64_t)(uintptr_t)d_local;
+    h_args[49] = (int64_t)(uintptr_t)d_global;
+
+    void *d_args = nullptr;
+    ACL_CHECK(aclrtMalloc(&d_args, kArgsSlots * sizeof(int64_t), ACL_MEM_MALLOC_HUGE_FIRST));
+    ACL_CHECK(aclrtMemcpy(d_args, kArgsSlots * sizeof(int64_t), h_args.data(),
+                          kArgsSlots * sizeof(int64_t), ACL_MEMCPY_HOST_TO_DEVICE));
+
+    launch_replay(d_args, stream);
+    ACL_CHECK(aclrtSynchronizeStream(stream));
+    printf("[replay_host] done: kNumTensors=%d\\n", kNumTensors);
+
+{frees}
+    aclrtFree(d_tensors);
+    aclrtFree(d_local);
+    aclrtFree(d_global);
+    aclrtFree(d_args);
+    aclrtDestroyStream(stream);
+    aclrtResetDevice(device_id);
+    aclFinalize();
+    return 0;
+}}
+"""
+
+
+def emit_cmakelists(arch: str, name: str, cfg, debug: bool = False) -> str:
+    # With -g, also drop the linker `-s` (strip) so the device kernel's
+    # debug_line survives -> Insight can map instructions to source lines.
+    link_opts = "-Wl,-z,relro -Wl,-z,now" if debug else "-s -Wl,-z,relro -Wl,-z,now"
+    dbg_flag = "\n    -g" if debug else ""
+    return f"""\
+cmake_minimum_required(VERSION 3.16)
+
+set(CMAKE_C_COMPILER bisheng)
+set(CMAKE_CXX_COMPILER bisheng)
+
+project(l0_swimlane_{name}_replay)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+
+if(NOT DEFINED ENV{{ASCEND_HOME_PATH}})
+    message(FATAL_ERROR "ASCEND_HOME_PATH is not set (source CANN set_env.sh first)")
+endif()
+set(ASCEND_HOME_PATH $ENV{{ASCEND_HOME_PATH}})
+set(SOC_VERSION {cfg["soc"]} CACHE STRING "Simulator SoC version")
+set(PTO_ISA_ROOT $ENV{{PTO_ISA_ROOT}} CACHE PATH "PTO ISA root")
+set(REPO_ROOT $ENV{{REPO_ROOT}} CACHE PATH "simpler repo root")
+
+add_compile_options(
+    -D_FORTIFY_SOURCE=2 -O2 -std=c++17
+    -Wno-macro-redefined -Wno-ignored-attributes
+    -fstack-protector-strong -fPIC
+)
+add_link_options({link_opts})
+
+set(CMAKE_CCE_COMPILE_OPTIONS
+    -xcce -fenable-matrix --cce-aicore-enable-tl -fPIC
+    -Xhost-start -Xhost-end
+    "SHELL:-mllvm -cce-aicore-stack-size=0x8000"
+    "SHELL:-mllvm -cce-aicore-function-stack-size=0x8000"
+    "SHELL:-mllvm -cce-aicore-record-overflow=true"
+    "SHELL:-mllvm -cce-aicore-addr-transform"
+    "SHELL:-mllvm -cce-aicore-dcci-insert-for-scalar=false"
+)
+set(CMAKE_CPP_COMPILE_OPTIONS
+    -xc++
+    "SHELL:-include stdint.h"
+    "SHELL:-include stddef.h"
+)
+
+set(COMMON_INCLUDES
+    ${{PTO_ISA_ROOT}}/include
+    ${{PTO_ISA_ROOT}}/include/pto
+    ${{REPO_ROOT}}/src/{arch}/runtime/tensormap_and_ringbuffer/runtime
+    ${{REPO_ROOT}}/src/{arch}/runtime/tensormap_and_ringbuffer/common
+    ${{REPO_ROOT}}/src/common/task_interface
+    ${{REPO_ROOT}}/src/{arch}/platform/include
+    ${{REPO_ROOT}}/simpler_setup/incore
+    ${{ASCEND_HOME_PATH}}/pkg_inc
+    ${{ASCEND_HOME_PATH}}/pkg_inc/profiling
+    ${{ASCEND_HOME_PATH}}/pkg_inc/runtime/runtime
+    ${{ASCEND_HOME_PATH}}/include
+)
+
+add_library(replay_kernel SHARED replay_kernel.cpp replay_launch.cpp)
+target_compile_options(replay_kernel PRIVATE
+    ${{CMAKE_CCE_COMPILE_OPTIONS}}
+    --cce-aicore-arch={cfg["aicore_arch"]}
+    -DREGISTER_BASE -std=c++17{dbg_flag})
+target_include_directories(replay_kernel PRIVATE ${{COMMON_INCLUDES}})
+target_link_options(replay_kernel PRIVATE --cce-fatobj-link)
+
+add_executable(replay_host replay_host.cpp)
+target_compile_options(replay_host PRIVATE ${{CMAKE_CPP_COMPILE_OPTIONS}})
+target_include_directories(replay_host PRIVATE ${{COMMON_INCLUDES}})
+target_link_directories(replay_host PUBLIC
+    ${{ASCEND_HOME_PATH}}/lib64
+    ${{ASCEND_HOME_PATH}}/aarch64-linux/simulator/${{SOC_VERSION}}/lib
+)
+target_link_libraries(replay_host PRIVATE
+    replay_kernel
+    runtime_camodel
+    stdc++ ascendcl m tiling_api platform c_sec dl nnopbase
+)
+"""
+
+
+def emit_run_collect(cfg) -> str:
+    # Plain string (bash uses ${} braces) — substitute the SoC default via token.
+    return _RUN_COLLECT_TEMPLATE.replace("__SOC_DEFAULT__", cfg["soc"])
+
+
+_RUN_COLLECT_TEMPLATE = """\
+#!/usr/bin/env bash
+set -euo pipefail
+: "${CANN_HOME:?CANN_HOME must be set}"
+: "${PTO_ISA_ROOT:?PTO_ISA_ROOT must be set}"
+: "${REPO_ROOT:?REPO_ROOT must be set}"
+
+WS="${WS:-$(dirname "$(readlink -f "$0")")}"
+SOC_VERSION="${SOC_VERSION:-__SOC_DEFAULT__}"
+DEVICE_ID="${TARGET_DEVICE_ID:-${NPU_LOCKED_DEVICE:-0}}"
+BUILD_DIR="$WS/build"
+COLLECT_DIR="$WS/msprof_collect"
+EXPORT_ROOT="$WS/insight_export"
+
+source "$CANN_HOME/set_env.sh"
+export ASCEND_HOME_PATH="$CANN_HOME"
+SIM_LIB_DIR="$CANN_HOME/aarch64-linux/simulator/$SOC_VERSION/lib"
+LD_LIBS="$BUILD_DIR:$SIM_LIB_DIR:$CANN_HOME/lib64"
+LD_LIBS="$LD_LIBS:$CANN_HOME/aarch64-linux/devlib:$CANN_HOME/devlib"
+export LD_LIBRARY_PATH="$LD_LIBS:${LD_LIBRARY_PATH:-}"
+export ACL_DEVICE_ID="$DEVICE_ID"
+mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"
+
+cmake -G Ninja -S "$WS" -B "$BUILD_DIR" \\
+    -DSOC_VERSION="$SOC_VERSION" -DPTO_ISA_ROOT="$PTO_ISA_ROOT" -DREPO_ROOT="$REPO_ROOT"
+cmake --build "$BUILD_DIR" --target replay_host
+
+msprof op simulator \\
+    --application="$BUILD_DIR/replay_host" --kernel-name="replay_entry" \\
+    --launch-count=1 --soc-version="$SOC_VERSION" --timeout=120 \\
+    --output="$COLLECT_DIR/out" 2>&1 | tee "$COLLECT_DIR/msprof_collect.log"
+
+OPPROF_DIR="$(find "$COLLECT_DIR/out" -maxdepth 1 -mindepth 1 -type d -name 'OPPROF_*' | sort | tail -n 1)"
+test -n "$OPPROF_DIR" || { echo "[run_collect] no OPPROF_* directory under $COLLECT_DIR/out" >&2; exit 1; }
+if [[ -d "$OPPROF_DIR/device0/tmp_dump" ]]; then
+    EXPORT_SRC="$OPPROF_DIR/device0/tmp_dump"
+else
+    EXPORT_SRC="$OPPROF_DIR/dump"
+fi
+msprof op simulator --export="$EXPORT_SRC" --output="$EXPORT_ROOT" \\
+    2>&1 | tee "$EXPORT_ROOT/msprof_export.log"
+echo "[run_collect] done. Insight artifacts under: $EXPORT_ROOT/OPPROF_*/simulator/"
+"""
+
+
+def generate_workspace(  # noqa: PLR0913
+    ws: Path,
+    arch: str,
+    cfg,
+    members,
+    name: str,
+    tensor_count: int,
+    args,
+    debug: bool = False,
+    block_num: int = 1,
+):
+    ws.mkdir(parents=True, exist_ok=True)
+    # One combined replay_entry for the whole mix task (cube=AIC member, vec=AIV
+    # member), driven by ONE shared args[] and a single-block `<<<1>>>` launch
+    # (one block = 1 AIC + 2 AIV sub-cores). A single-kernel task is just a mix
+    # of size 1.
+    (ws / "replay_kernel.cpp").write_text(emit_replay_kernel_combined(members, cfg))
+    (ws / "replay_launch.cpp").write_text(emit_replay_launch())
+    (ws / "replay_host.cpp").write_text(emit_replay_host(tensor_count, args, block_num))
+    (ws / "CMakeLists.txt").write_text(emit_cmakelists(arch, name, cfg, debug))
+    rc = ws / "run_collect.sh"
+    rc.write_text(emit_run_collect(cfg))
+    rc.chmod(0o755)
+
+
+# ---------------------------------------------------------------------------
+# Step 5: build + collect
+# ---------------------------------------------------------------------------
+def _build_env():
+    cann = os.environ.get("ASCEND_HOME_PATH")
+    if not cann:
+        raise OSError("ASCEND_HOME_PATH is not set — source CANN set_env.sh first")
+    env = dict(os.environ)
+    env["ASCEND_HOME_PATH"] = cann
+    env["CANN_HOME"] = cann
+    env["REPO_ROOT"] = str(PROJECT_ROOT)
+    env["PTO_ISA_ROOT"] = ensure_pto_isa_root(verbose=True)
+    return env
+
+
+def smoke_build(ws: Path, env, cfg):
+    build = ws / "build"
+    build.mkdir(exist_ok=True)
+    subprocess.run(
+        [
+            "cmake",
+            "-G",
+            "Ninja",
+            "-S",
+            str(ws),
+            "-B",
+            str(build),
+            f"-DSOC_VERSION={cfg['soc']}",
+            f"-DPTO_ISA_ROOT={env['PTO_ISA_ROOT']}",
+            f"-DREPO_ROOT={env['REPO_ROOT']}",
+        ],
+        cwd=str(ws),
+        env=env,
+        check=True,
+    )
+    subprocess.run(["cmake", "--build", str(build), "--target", "replay_host"], cwd=str(ws), env=env, check=True)
+    so = build / "libreplay_kernel.so"
+    out = subprocess.run(["nm", "-D", str(so)], check=False, capture_output=True, text=True).stdout
+    syms = {line.split()[-1] for line in out.splitlines() if " T " in line}
+    for s in ("replay_entry", "launch_replay"):
+        if s not in syms:
+            raise RuntimeError(f"smoke build: missing symbol {s} in {so}")
+    print("[l0_swimlane] smoke build OK (replay_entry, launch_replay present)")
+
+
+def _to_perfetto(d):  # noqa: PLR0912
+    """In-place transform of an Insight trace into a Perfetto-friendly one.
+
+    Insight encodes each pipe as one track and packs concurrent, pipelined
+    instructions onto it; Perfetto drops overlapping `ph:X` complete events and
+    can mis-pair `B`/`E` flag events. This fixes both, losslessly (same events,
+    same ts/dur; only `tid` changes and `B`/`E` is re-encoded as `X`):
+
+      * sub-lanes: greedily pack each duration event into the lowest lane whose
+        previous event ended, so no two events on a lane overlap. A split pipe
+        `MTE1` becomes `MTE1#0..#k`.
+      * atomic flags: merge each `B`/`E` pair into one `ph:X` slice.
+      * thread_sort_index rebuilt as base*100+lane so lanes render adjacent.
+    """
+    EPS = 1e-9
+    evs = d["traceEvents"]
+    orig_sort = {(e["pid"], e["tid"]): e["args"]["sort_index"] for e in evs if e.get("name") == "thread_sort_index"}
+
+    def is_core(e):
+        return str(e.get("pid", "")).startswith("core")
+
+    # 1. Duration intervals per (pid,tid): X events + B/E pairs (matched by id).
+    intervals = defaultdict(list)
+    be = defaultdict(dict)
+    for e in evs:
+        if not is_core(e):
+            continue
+        ph = e.get("ph")
+        key = (e["pid"], e["tid"])
+        if ph == "X":
+            intervals[key].append(
+                {
+                    "ts": e["ts"],
+                    "end": e["ts"] + e.get("dur", 0.0),
+                    "name": e["name"],
+                    "args": e.get("args", {}),
+                }
+            )
+        elif ph in ("B", "E"):
+            slot = be[(e["pid"], e["tid"], e.get("id"))]
+            slot[ph] = e["ts"]
+            slot.setdefault("src", e)
+    for (pid, tid, eid), slot in be.items():
+        s = slot.get("B", slot.get("E"))
+        en = slot.get("E", slot.get("B"))
+        if s is not None and en is not None and s > en:
+            s, en = en, s
+        src = slot["src"]
+        intervals[(pid, tid)].append(
+            {
+                "ts": s,
+                "end": en,
+                "name": src["name"],
+                "args": src.get("args", {}),
+                "id": eid,
+            }
+        )
+
+    # 2. Greedy lane assignment (interval partitioning) per track.
+    max_lane = defaultdict(int)
+    be_lane = {}
+    for key, iv in intervals.items():
+        iv.sort(key=lambda t: (t["ts"], -(t["end"] - t["ts"])))
+        lane_end = []
+        for it in iv:
+            placed = None
+            for ln in range(len(lane_end)):
+                if lane_end[ln] <= it["ts"] + EPS:
+                    placed = ln
+                    break
+            if placed is None:
+                placed = len(lane_end)
+                lane_end.append(it["end"])
+            else:
+                lane_end[placed] = it["end"]
+            it["lane"] = placed
+            max_lane[key] = max(max_lane[key], placed)
+            if "id" in it:
+                be_lane[(key[0], key[1], it["id"])] = placed
+
+    def split(pid, tid):
+        return max_lane.get((pid, tid), 0) > 0
+
+    def laned(pid, tid, lane):
+        return f"{tid}#{lane}" if split(pid, tid) else tid
+
+    out = []
+    # 3. Emit every duration interval as an atomic X with its sub-lane tid.
+    for (pid, tid), iv in intervals.items():
+        for it in iv:
+            out.append(
+                {
+                    "ph": "X",
+                    "name": it["name"],
+                    "ts": it["ts"],
+                    "dur": it["end"] - it["ts"],
+                    "pid": pid,
+                    "tid": laned(pid, tid, it["lane"]),
+                    "args": it["args"],
+                }
+            )
+
+    # 4. Pass through the rest; anchor flow arrows to their id-pair's lane.
+    for e in evs:
+        ph = e.get("ph")
+        if e.get("name") == "thread_sort_index" or (is_core(e) and ph in ("X", "B", "E")):
+            continue
+        ne = dict(e)
+        if is_core(e) and ph in ("s", "t", "f") and split(e["pid"], e["tid"]):
+            ln = be_lane.get((e["pid"], e["tid"], e.get("id")), 0)
+            ne["tid"] = f"{e['tid']}#{ln}"
+        out.append(ne)
+
+    # 5. Rebuild thread_sort_index: integer, contiguous, lanes adjacent.
+    for (pid, tid), base in orig_sort.items():
+        mx = max_lane.get((pid, tid), 0)
+        if mx > 0:
+            for k in range(mx + 1):
+                out.append(
+                    {
+                        "args": {"sort_index": base * 100 + k},
+                        "name": "thread_sort_index",
+                        "ph": "M",
+                        "pid": pid,
+                        "tid": f"{tid}#{k}",
+                    }
+                )
+        else:
+            out.append(
+                {"args": {"sort_index": base * 100}, "name": "thread_sort_index", "ph": "M", "pid": pid, "tid": tid}
+            )
+
+    d["traceEvents"] = out
+    return d
+
+
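+# Worked example of the greedy sub-lane packing in _to_perfetto (illustrative
+# intervals, traced by hand; real Insight traces will differ). One track
+# "MTE1" carries four (ts, end) intervals:
+#   A (0, 10)   B (5, 8)   C (8, 12)   D (10, 11)
+# Sorted by (ts, longest first) they are visited in the order A, B, C, D:
+#   A -> lane 0                    lane_end = [10]
+#   B -> lane 1 (lane 0 busy)      lane_end = [10, 8]
+#   C -> lane 1 (8 <= 8 + EPS)     lane_end = [10, 12]
+#   D -> lane 0 (10 <= 10 + EPS)   lane_end = [11, 12]
+# max_lane is 1, so the track splits into "MTE1#0" (A, D) and "MTE1#1" (B, C);
+# with an original sort_index of 3 the rebuilt indices are 300 and 301.
+
+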
+def collect(ws: Path, env, max_time: int, device=None, dest_name: str = "trace.json"):
+    if device:
+        # Already inside an outer task-submit lock (the recommended onboard
+        # workflow: wrap the whole tool so the dump and this collect share one
+        # device — `device` is the appended --device <id>). Reuse it for the
+        # camodel collect rather than nesting a second task-submit, which would
+        # grab a different device and could deadlock against the outer lock.
+        cmd = ["bash", str(ws / "run_collect.sh")]
+        env = {**env, "TARGET_DEVICE_ID": str(device)}
+    elif shutil.which("task-submit") is not None:
+        # Standalone (no outer lock): self-lock just the collect step.
+        cmd = [
+            "task-submit",
+            "--device",
+            "auto",
+            "--max-time",
+            str(max_time),
+            "--run",
+            f"CANN_HOME={env['CANN_HOME']} PTO_ISA_ROOT={env['PTO_ISA_ROOT']} "
+            f"REPO_ROOT={env['REPO_ROOT']} TARGET_DEVICE_ID=$TASK_DEVICE "
+            f"bash {ws}/run_collect.sh",
+        ]
+    else:
+        print(
+            "[l0_swimlane] WARNING: task-submit not found; running run_collect.sh "
+            "unlocked (results may be noisy if another process shares the NPU)"
+        )
+        cmd = ["bash", str(ws / "run_collect.sh")]
+        env = {**env, "TARGET_DEVICE_ID": os.environ.get("ACL_DEVICE_ID", "0")}
+    subprocess.run(cmd, cwd=str(ws), env=env, check=True)
+
+    sims = list(ws.glob("insight_export/OPPROF_*/simulator"))
+    if not sims:
+        raise RuntimeError("no simulator/ export produced — check msprof_collect.log")
+    trace = sorted(sims, key=lambda p: p.stat().st_mtime)[-1] / "trace.json"
+    if not trace.is_file():
+        raise RuntimeError(f"trace.json missing under {trace.parent}")
+    dst = ws / dest_name
+    perfetto_dst = dst.with_name(f"{dst.stem}_perfetto{dst.suffix}")
+    # Pretty-print the Insight copy (indent=2); the export original stays
+    # compact (Insight loads the simulator/ directory). Then emit a
+    # Perfetto-friendly version (sub-lanes + atomic flags). Fall back to a
+    # verbatim copy if the JSON can't be read, parsed, or rewritten.
+    try:
+        # Explicit UTF-8: ensure_ascii=False writes non-ASCII characters as-is,
+        # which would otherwise depend on the locale's default encoding.
+        with open(trace, encoding="utf-8") as f:
+            data = json.load(f)
+        with open(dst, "w", encoding="utf-8") as f:
+            json.dump(data, f, ensure_ascii=False, indent=2)
+        _to_perfetto(data)  # mutates data in place (after the pretty dump above)
+        with open(perfetto_dst, "w", encoding="utf-8") as f:
+            json.dump(data, f, ensure_ascii=False, indent=2)
+    except (ValueError, OSError):
+        # ValueError covers json.JSONDecodeError and Unicode decode/encode errors.
+        shutil.copy(trace, dst)
+        perfetto_dst = None
+    return dst, perfetto_dst
+
+
+# ---------------------------------------------------------------------------
+def apply_arg_overrides(kargs: list[dict], set_arg, ap: argparse.ArgumentParser):
+    """Apply --set-arg SLOT=VALUE overrides to reconstructed args.
+
+    A scalar slot has its value rewritten; a tensor slot has its data buffer
+    filled with VALUE (every element) instead of memset-0. Both shrink a replay
+    loop count without touching shapes: single-task kernels carry the count as a
+    scalar (n_blocks), the mix paged-attention kernel derives it from the
+    context_lens tensor's content. Only real arg slots from the dump are
+    settable; tensor fill requires an integer dtype. Increasing a scalar beyond
+    its dump value risks out-of-bounds and is warned about.
+    """
+    if not set_arg:
+        return
+    by_slot = {a["slot"]: a for a in kargs}
+    for spec in set_arg:
+        if "=" not in spec:
+            ap.error(f"--set-arg must be SLOT=VALUE (got {spec!r})")
+        slot_s, val_s = spec.split("=", 1)
+        try:
+            slot, value = int(slot_s), int(val_s)
+        except ValueError:
+            ap.error(f"--set-arg SLOT and VALUE must be integers (got {spec!r})")
+        a = by_slot.get(slot)
+        if a is None:
+            ap.error(f"--set-arg slot {slot} is not an arg of this task")
+        if a["kind"] == "scalar":
+            old = a["value"]
+            a["value"] = value
+            note = "  WARNING: larger than dump value -> buffers may be undersized" if value > old else ""
+            print(f"[l0_swimlane] overriding scalar slot {slot}: {old} -> {value}{note}")
+        else:
+            if a["dtype"] not in INTEGER_DTYPES:
+                ap.error(
+                    f"--set-arg slot {slot} is a {a['dtype']} tensor; tensor "
+                    f"fill is only supported for integer dtypes (loop-count / "
+                    f"index control tensors like context_lens)"
+                )
+            a["fill"] = value
+            print(f"[l0_swimlane] filling tensor slot {slot} ({a['dtype']} {a['shape']}) with {value} in every element")
+
+
+def main():
+    ap = argparse.ArgumentParser(
+        description="Generate an AICore intra-core swimlane trace.json for one task (a single kernel or a mix).",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    # ----- what to trace -----
+    ap.add_argument("--test", required=True, help="SceneTest test file (.py)")
+    ap.add_argument(
+        "--func-id",
+        required=True,
+        metavar="A[,B,C]",
+        help="mix member set: comma-separated func_ids of the kernels that form "
+        "the task. Name the task's FULL set — `--func-id 0` for a task the "
+        "orchestration dispatches on its own, `--func-id 0,1,2` for a 3-way mix "
+        "(all its members). The set must exactly match a dispatched task's "
+        "func_id array (you wrote the orchestration, so you know the members).",
+    )
+    ap.add_argument("--task-id", default=None, help="task_id hex (default: lowest)")
+    # ----- dump source -----
+    ap.add_argument(
+        "--platform",
+        default="a2a3sim",
+        help="dump platform (default a2a3sim). Sim variants "
+        "(a2a3sim/a5sim) need no NPU. Onboard variants "
+        "(a2a3/a5) run the dump on a real device — wrap the "
+        "whole tool in task-submit so the dump and collect "
+        "share the locked $TASK_DEVICE. Required for kernels "
+        "whose sync idiom (e.g. manual prod.record()) only "
+        "compiles for the device, not the cpu sim.",
+    )
+    ap.add_argument(
+        "--device",
+        default=None,
+        metavar="ID",
+        help="NPU device id for an onboard dump + collect. Normally "
+        "supplied automatically — task-submit appends "
+        "--device <id> to the wrapped command; also read from "
+        "$TASK_DEVICE. Sim platforms ignore it.",
+    )
+    ap.add_argument(
+        "--case",
+        default=None,
+        metavar="NAME",
+        help="pin the dump to one CASES[*].name (e.g. SmallCase1). "
+        "Omitting it auto-pins the FIRST case that lists --platform "
+        "(deterministic dump). Pass it explicitly when that first "
+        "case is a full-size production case whose shapes overflow "
+        "the camodel replay — name the small one instead. Accepts "
+        "ClassName::Case too.",
+    )
+    ap.add_argument("--dump-json", default=None, help="reuse an existing tensor_dump.json")
+    # ----- replay tuning -----
+    ap.add_argument(
+        "--set-arg",
+        action="append",
+        default=[],
+        metavar="SLOT=VALUE",
+        help="override an arg by args[] slot for the replay. Scalar "
+        "slot -> rewrite its value; tensor slot -> fill its data "
+        "buffer with VALUE (integer dtypes only). Use to shrink a "
+        "loop count for the camodel without distorting pipeline "
+        "structure: single-task n_blocks is a scalar "
+        "(--set-arg 4=4); mix paged-attention derives n_blocks "
+        "from the context_lens tensor (--set-arg 4=512 -> "
+        "n_blocks=ceil(512/block_size)). Only real arg slots are "
+        "settable. Repeatable. Default: real dump values.",
+    )
+    ap.add_argument(
+        "--spmd-block-num",
+        type=int,
+        default=None,
+        metavar="N",
+        help="block_num written into the synthesized SPMD LocalContext "
+        "(slot 48). Default: the case's block_dim. Only matters for "
+        "kernels that branch/stride on block_num; set the real grid "
+        "width for those.",
+    )
+    ap.add_argument(
+        "--debug-line",
+        "-g",
+        action="store_true",
+        help="compile the kernel with -g (and skip link strip) so the "
+        "trace carries debug_line -> Insight maps instructions to "
+        "source lines. Default off.",
+    )
+    # ----- run control -----
+    ap.add_argument("--no-collect", action="store_true", help="stop after the smoke build (skip the collect step)")
+    ap.add_argument("--max-time", type=int, default=1800, help="task-submit budget (sec)")
+    args = ap.parse_args()
+
+    test_path = Path(args.test).resolve()
+    arch, variant = parse_platform(args.platform)
+    cfg = ARCH_CONFIG.get(arch)
+    if cfg is None:
+        ap.error(f"unsupported arch {arch} (from {args.platform}); supported: {', '.join(ARCH_CONFIG)}")
+
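+    # Report a malformed --func-id as a usage error instead of a raw traceback.
+    try:
+        func_id_list = [int(x) for x in args.func_id.split(",")]
+    except ValueError:
+        ap.error(f"--func-id must be comma-separated integers (got {args.func_id!r})")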
+    meta = load_kernel_meta(test_path, func_id_list[0], args.platform)
+    by_func = meta["by_func"]
+    name, class_name = meta["name"], meta["class_name"]
+
+    # Which case the dump runs. Explicit --case wins; otherwise auto-pin the
+    # first case that lists --platform so the dump targets exactly one case
+    # (deterministic — no "run all default cases, reconstruct from the newest
+    # dump dir" ambiguity). Still None only when the test declares no case on
+    # this platform (a single-case / no-CASES test — the dump then runs as-is).
+    selected_case = args.case or meta["auto_case"]
+    if not args.case and meta["auto_case"]:
+        print(f"[l0_swimlane] no --case; auto-pinned first {args.platform} case: {meta['auto_case']}")
+    # block_num for the synthesized slot-48 LocalContext: the grid width of the
+    # SELECTED case (--case bare name, ignoring any ClassName:: prefix), not an
+    # arbitrary CASES entry. Defaults to 1 (a non-SPMD single block) when the
+    # selected case declares no block_dim, or when no case is selected (a
+    # single-case / no-CASES test) — never guessed from a different case.
+    case_key = selected_case.split("::")[-1] if selected_case else None
+    block_dim = meta["block_dim_by_case"].get(case_key, 1)
+    block_num = args.spmd_block_num if args.spmd_block_num is not None else block_dim
+
+    # task-submit hands the locked device by appending --device <id> to argv
+    # (and may also set $TASK_DEVICE). One resolved value threads through both
+    # the dump and the collect so they share the single outer lock.
+    device = args.device or os.environ.get("TASK_DEVICE")
+    manifest = get_or_run_dump(test_path, args.platform, variant, args.dump_json, selected_case, device)
+    print(f"[l0_swimlane] manifest: {manifest}")
+
+    # Select the task whose member set == --func-id; reconstruct its full
+    # positional payload. mix_func_ids is the dump's array (slot order
+    # AIC,AIV0,AIV1), NOT the typed order, so lane assignment stays correct.
+    chosen, tensor_count, kargs, mix_func_ids = reconstruct_task_args(manifest, func_id_list, args.task_id)
+
+    # Resolve the mix members (slot order) to their sources/core_types.
+    missing = [f for f in mix_func_ids if f not in by_func]
+    if missing:
+        ap.error(f"dump task has func_id(s) {missing} with no matching incore in {test_path.name}")
+    members = [by_func[f] for f in mix_func_ids]
+    for m in members:
+        if m["core_type"] not in ("aic", "aiv"):
+            ap.error(f"unsupported core_type {m['core_type']} for func_id {m['func_id']} (only aic/aiv)")
+    mode = "mix" if len(members) > 1 else members[0]["core_type"]
+    member_desc = ", ".join(f"{m['name']}({m['core_type']},func {m['func_id']})" for m in members)
+    print(
+        f"[l0_swimlane] func_id={func_id_list} task={chosen} mix={mix_func_ids} mode={mode} "
+        f"block_dim={block_dim} block_num={block_num}\n              members=[{member_desc}]"
+    )
+
+    scalars = [a for a in kargs if a["kind"] == "scalar"]
+    print(f"[l0_swimlane] task {chosen}: {tensor_count} tensors, {len(scalars)} scalars")
+    # Full arg-slot map so the caller can pick a slot for --set-arg without
+    # cross-referencing the kernel source. Names are not in the dump (only
+    # kind/shape/value) — read the kernel's `args:` header for those.
+    print("[l0_swimlane] arg slots (override with --set-arg SLOT=VALUE):")
+    for a in sorted(kargs, key=lambda x: x["slot"]):
+        if a["kind"] == "tensor":
+            print(f"    slot {a['slot']:<2} tensor  {a['dtype']:<8} {a['shape']}")
+        else:
+            print(f"    slot {a['slot']:<2} scalar  = {a['value']}")
+    apply_arg_overrides(kargs, args.set_arg, ap)
+
+    # Self-describing label: <TestClass>_<Case>_<platform>_<kernel>_<mix>, so the
+    # workspace dir and the trace.json filename say which case/kernel/mix they are.
+    case = _case_from_manifest(manifest, class_name)
+    mix_tag = "_".join(str(f) for f in mix_func_ids)
+    label = f"{class_name}_{case}_{args.platform}_{name}_mix{mix_tag}"
+    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
+    ws = PROJECT_ROOT / "outputs" / f"l0_swimlane_{label}_{ts}"
+    generate_workspace(
+        ws,
+        arch,
+        cfg,
+        members,
+        name,
+        tensor_count,
+        kargs,
+        debug=args.debug_line,
+        block_num=block_num,
+    )
+    print(f"[l0_swimlane] workspace: {ws}")
+
+    env = _build_env()
+    smoke_build(ws, env, cfg)
+    if args.no_collect:
+        print(f"[l0_swimlane] --no-collect: stopping after smoke build.\n  {ws}")
+        return
+    trace, trace_perfetto = collect(ws, env, args.max_time, device, dest_name=f"{label}_trace.json")
+    print(f"[l0_swimlane] DONE. trace.json (Insight) -> {trace}")
+    if trace_perfetto:
+        print(f"[l0_swimlane]       perfetto-friendly  -> {trace_perfetto}")
+
+
+if __name__ == "__main__":
+    main()