diff --git a/.claude/skills/l0-swimlane/SKILL.md b/.claude/skills/l0-swimlane/SKILL.md
new file mode 100644
index 000000000..130c603f0
--- /dev/null
+++ b/.claude/skills/l0-swimlane/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: l0-swimlane
+description: Produce an L0 (intra-core AICore pipeline) swimlane for one task — a single kernel or a mix — via the dump-driven `simpler_setup.tools.l0_swimlane` tool. Use when the user asks to "run/produce an l0 swimlane", "trace a task's intra-core pipeline", profile why one AICore task is slow inside the core(s), or needs help choosing the tool's manual flags (`--func-id`, `--set-arg`, `--spmd-block-num`, `--case`). The tool captures real per-task args from an args dump and auto-generates the `msprof op simulator` replay — no hand-authored workspace. For a hand-authored single-`kernel_entry` replay use [insight-trace](../insight-trace/SKILL.md); for cross-task / scheduler / dependency timing use the L2 swimlane.
+---
+
+# L0 Swimlane — Intra-core Pipeline Trace for a Task
+
+`python -m simpler_setup.tools.l0_swimlane` dumps a task's real `args[]`,
+reconstructs them, generates a combined `msprof op simulator` replay of the
+**whole task** (a mix runs AIC + AIV0 + AIV1 in one op), and exports an
+Insight `trace.json` whose lanes are the cluster's pipes. Full reference:
+[docs/dfx/l0-swimlane-profiling.md](../../../docs/dfx/l0-swimlane-profiling.md).
+This skill is the **operating procedure** — above all the one genuinely
+manual decision: the slot/value for `--set-arg`.
+
+## When to use
+
+- **Use** when one task (single kernel or mix) is slow and you need the
+  per-pipe (`MTE2` / `MTE1` / `CUBE` / `FIXP` / `SCALAR` / `VECTOR`)
+  intra-core picture, or to confirm AIC↔AIV overlap inside a mix.
+- **Not** for cross-task dependencies / scheduler / dispatch / finish
+  timing — that is the **L2 swimlane**. L0 traces ONE task in isolation
+  with no AICPU, so inter-task ordering is out of scope (doc §9, tier C).
+- **vs `insight-trace`**: that skill hand-authors a wrapper around one
+  `kernel_entry`; `l0_swimlane` automates the whole thing from a real dump
+  (real args, mix-together, SPMD context synthesised). Reach for
+  `insight-trace` only when there is no test/dump to drive the capture.
+
+## Run
+
+```bash
+source .venv/bin/activate
+source "$ASCEND_HOME_PATH/set_env.sh" # CANN env (msprof on PATH)
+# Sim dump (no NPU); task-submit locks a device for the step-5 collect.
+task-submit --device auto --max-time 1800 --run \
+  "python -m simpler_setup.tools.l0_swimlane --platform a2a3sim \
+   --func-id --test "
+```
+
+Onboard `a2a3` instead of `a2a3sim`: run
+`.claude/skills/onboard-arch-precheck/check.sh a2a3` first (the dump then
+runs on the locked device). The five internal steps and all flags are in
+doc §3.2 / §3.3.
+
+## Choosing the manual flags (the hard part)
+
+### `--func-id` — the task's member set
+
+You wrote the orchestration, so the members are known. `--func-id 0` traces
+the single-kernel task `{0}`; `--func-id 0,1,2` traces that 3-way mix.
+It must equal a dispatched task's func_id **set** — for a same-AIV-on-both-
+lanes SPMD mix the dump records a duplicate (`[0,1,1]`), so pass
+`--func-id 0,1` (`set([0,1,1]) == {0,1}`). Wrong set → the tool lists the
+func_id shapes present in the dump; pick one of those.
+
+### `--set-arg SLOT=VALUE` — only when a loop count must shrink
+
+First classify where the kernel's loop trip count comes from:
+
+| Trip count from | `--set-arg`? | Rule |
+| --------------- | ------------ | ---- |
+| **Tensor shape** (e.g. `shapes[0] / TILE_ELEMS`) | **No** | shape is the real dump value; changing it distorts. (mixed_example / single-kernel rows need no `--set-arg`.) |
+| **A scalar arg** (e.g. `n_blocks`) | **Yes** — set the count directly | camodel would run the full loop; shrink to ≥ 3–4 (doc §7.2: floor 3, prefer 4). |
+| **A control-tensor's content** (e.g.
+`context_lens`) | **Yes** — fill the buffer | the kernel *derives* the count from the data; fill so the derived count ≈ 4 (need `block_size` to back out the value). Integer dtypes only. |
+| **The SPMD `block_num`** | **No** — use `--spmd-block-num` | block_num lives in the synthesised slot-48 context, which `--set-arg` cannot reach. |
+
+Then find the slot — it is **per-kernel, never fixed**. Discover it:
+
+1. Run once with `--no-collect`; step 3 prints the **arg-slot table**
+   (every slot: index / kind / shape / scalar value).
+2. Identify which slot is the loop bound by cross-referencing **any** of:
+   the kernel's `args[N]` reads, the kernel-top **args-layout comment**
+   (paged-attention kernels have one, e.g.
+   `args[15] = total_logical_blocks scalar`), or the orchestration's
+   `add_input` / `add_scalar` **order** (the i-th `add_*` is slot `i`).
+3. Set the value per the table above.
+4. Re-run, then **self-check** (below).
+
+Verified examples (slots read from source):
+
+| Test | Loop bound | Flag |
+| ---- | ---------- | ---- |
+| `paged_attention_unroll` | `aic_qk_matmul.cpp` `args[4] = n_blocks` (scalar) | `--set-arg 4=4` |
+| `qwen3_14b_decode` (fa_fused) | `fa_fused_aic/aiv.cpp` outer loop `for(i=block_idx; i`
+(accepts `ClassName::Case`) whenever the first-platform case is not the
+smallest. Pick the case with the smallest shapes. `--set-arg` shrinks a
+*loop count*; `--case` shrinks the *tensor shapes* — reach for `--case`
+first when a replay stalls. Single-case tests need no `--case`.
+
+Pick a case that is **scaled down, not reshaped** — same tile geometry
+(M/K/N, head_dim, tile size), just fewer blocks / shorter sequence — so the
+per-block pipeline stays identical to production (you lose only iteration
+*count*, which does not change the pipeline shape). A case with different
+tile shapes traces only itself, not production.
+
+## Self-check after every run
+
+A known msprof/camodel export bug can truncate the last loop iteration(s).
+Verify `MMAD == FIX_L0C_TO_DST == n_blocks` in the trace; if they disagree,
+the tail was cut — do not draw timing conclusions, re-run or change the loop
+count (doc §7.4). Read the auto-generated `*_trace_perfetto.json`, not the
+raw Insight `trace.json`, for sub-laned per-instruction overlap (doc §3.4).
+
+## Coverage
+
+Representative command per task shape (single AIC / single AIV / 1+1 mix /
+2-AIV mix / 3-way mix / SPMD single-source / SPMD coop mix / same-AIV-both-
+lanes / paged-attn scalar & control-tensor loops / qwen3) is in doc §3.7.
diff --git a/docs/dfx/l0-swimlane-profiling.md b/docs/dfx/l0-swimlane-profiling.md
new file mode 100644
index 000000000..2e48ad310
--- /dev/null
+++ b/docs/dfx/l0-swimlane-profiling.md
@@ -0,0 +1,623 @@
+# L0 Swimlane Profiling — Intra-core Pipeline Trace for a Task
+
+## 1. Background & Motivation
+
+[L2 swimlane](l2-swimlane-profiling.md) answers *where each task ran on
+the wall clock and how the scheduler spent its loop*. It stops at the
+AICore task boundary — one task is one opaque `[start, end]` block. When
+a single task is slow, the next question is **why inside the core(s)**:
+which pipe (`MTE2` GM→L1, `MTE1` L1→L0, `CUBE` matmul, `FIXP` write-back,
+`SCALAR`, `VECTOR`) is the bottleneck, and how the per-instruction issue
+overlaps across the cluster's sub-cores.
+
+L0 swimlane captures exactly that — the **intra-core pipeline** of a
+task. It runs the task in isolation under `msprof op simulator` (the
+AICore camodel) and exports a MindStudio Insight `trace.json` whose lanes
+are the cluster's pipes, not the chip's cores. It deliberately
+**bypasses AICPU orchestration**: scheduler / tensormap / ringbuffer
+state is out of scope (that is L2's job, and needs real silicon). L0 is
+the per-pipe, per-instruction zoom that sits one level below an L2 task
+block.
+
+A task may be a single kernel or a **mix** — multiple sub-task kernels
+sharing one `args[]` on the 1C2V cluster (1 AIC + up to 2 AIV). L0
+replays the **whole task together**: a mix runs its AIC + AIV0 + AIV1
+kernels in one combined op, so the trace shows all the cluster's
+sub-cores side by side, not one kernel in isolation.
+
+The hard part of an isolated replay is rebuilding the task's exact
+`args[]` — Tensor descriptors (shape / dtype / strides / start_offset)
+plus scalar values — which orchestration normally computes on the fly.
+Hand-authoring them is error-prone. L0 swimlane removes the guesswork:
+it captures the **real** per-task args from an [args
+dump](args-dump.md), uses the dump's `func_id` array to identify the
+task's mix members, and generates the whole replay workspace from those
+captured args — zero hand-written shapes or scalars.
+
+## 2. Overview
+
+- **Per-pipe instruction timeline** — one Insight lane per sub-core pipe
+  (`MTE2` / `MTE1` / `CUBE` / `FIXP` / `SCALAR` / `VECTOR`), each carrying
+  the kernel's individual instructions with simulated `ts` / `dur`.
+- **Mix-together replay** — an entire mix task (any mix: same- or
+  different-source members, 2-way or 3-way) replays as **one** combined
+  `msprof op simulator` op. The cube sub-core runs the AIC member, the
+  vec sub-cores run the AIV member(s) → a combined AIC+AIV swimlane.
+- **Zero-guess args** — the task's real Tensor descriptors and scalars
+  come from a JSON-only `--dump-args 3` capture (metadata + scalar
+  values, no `.bin` payload — all reconstruction needs). The dump's
+  `func_id` array gives the task's mix membership directly.
+- **Loop-count control (`--set-arg SLOT=VALUE`)** — when a kernel's loop
+  trip count comes from a scalar or a control tensor, override it to
+  shrink a runaway loop (so the camodel doesn't hang) or to fix a
+  "fake-fast" zero-filled control tensor — without distorting the
+  per-iteration pipeline structure. Repeatable; default uses the real
+  dump values. See
+  [§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion).
+- **Source-line attribution (`--debug-line` / `-g`)** — compile the
+  kernel with `-g` (skipping the link strip) so the trace carries
+  `debug_line` and Insight maps each instruction back to its kernel
+  source line. Off by default.
+- **Sim or onboard capture** — with a sim `--platform` (`a2a3sim` /
+  `a5sim`) the dump runs with no NPU; with an onboard `--platform`
+  (`a2a3` / `a5`) it runs on a real device. The dump only needs arg
+  **geometry**, which sim captures identically to onboard, and the
+  replay is camodel either way — so `a2a3sim` is the default and needs
+  no NPU and no arch-precheck. Use onboard only for a kernel whose sync
+  idiom (e.g. a manual `prod.record()`) compiles only for the device.
+- **Two trace outputs** — a native Insight `trace.json` and an
+  auto-generated Perfetto-friendly variant (sub-laned + atomic flags;
+  see [§3.4](#34-viewing--insight-vs-perfetto)).
+
+Drive it in one line (`--func-id` is the task's member set):
+
+```bash
+python -m simpler_setup.tools.l0_swimlane \
+  --test tests/st//test_.py --func-id 0,1,2 --platform a2a3sim
+```
+
+## 3. How to Use
+
+### 3.1 Prerequisites (one-time per test case)
+
+L0 swimlane reuses the args-dump pipeline to recover args, so the target
+case must satisfy what the dump needs (see
+[args-dump.md](args-dump.md)):
+
+1. **Args dump is compiled in.** Built into the platform code; needs a
+   `pip install --no-build-isolation -e .` so it is compiled in.
+2. **Incores declare complete signatures.** Under the #1181 positional
+   model, each incore declares its full tensor `signature` (covering the
+   task payload in slot order); the dump maps signature entry `i` to
+   payload slot `i` and stamps every record with the task's active
+   sub-task `func_id` **array** (its mix membership). This is the repo
+   norm — no l0-specific marker.
+3. **The case declares the `--platform` you pass.** `CASES[*].platforms`
+   must include it. Pick a case with shapes small enough for the camodel
+   replay buffers.
+4. **`name` is optional.** When `incores[*].name` is absent the tool
+   falls back to the kernel source filename for labels / paths.
+
+### 3.2 Run
+
+```bash
+# Environment (once per shell): activate the venv and source CANN.
+source .venv/bin/activate
+export ASCEND_HOME_PATH= # e.g. .../cann-9.0.0
+source "$ASCEND_HOME_PATH/set_env.sh"
+
+# Sim capture (no NPU dump) — the default.
+python -m simpler_setup.tools.l0_swimlane \
+  --test tests/st/a2a3/tensormap_and_ringbuffer/mixed_example/test_mixed_example.py \
+  --func-id 0,1,2 --platform a2a3sim
+
+# Onboard capture — wrap the WHOLE tool in one task-submit so the dump and
+# the collect share the locked $TASK_DEVICE (no nested lock). Only needed for
+# a kernel whose sync idiom compiles only for the device.
+task-submit --device auto --device-num 1 --run \
+  "python -m simpler_setup.tools.l0_swimlane \
+   --test tests/st//test_.py --func-id 0 --platform a2a3"
+```
+
+The tool runs five steps internally (the "Uses NPU" column is for an
+onboard `--platform`; a sim `--platform` uses no NPU until step 5):
+
+| Step | Action | Uses NPU |
+| ---- | ------ | -------- |
+| 1 | Read the test's `CALLABLE`; build a `func_id → (source, core_type)` table | No |
+| 2 | Run `--dump-args 3` (JSON-only) → `args_dump.json` (or reuse via `--dump-json`) | Onboard only |
+| 3 | Select the task whose member set == `--func-id`, reconstruct its full positional args, **print the arg-slot table** (slot / kind / shape / value) | No |
+| 4 | Emit the replay workspace and smoke-build it locally | No |
+| 5 | `msprof op simulator` collect + export → `trace.json`, then auto-converts a Perfetto variant | **Yes** |
+
+Step 3 prints every arg slot so you can pick a `--set-arg` target without
+reading the kernel source — names are not in the dump (only kind / shape
+/ value), so cross-reference the kernel's `args:` header for those:
+
+```text
+[l0_swimlane] func_id=0 task=0x... mix=[0, 1, 2] mode=mix block_dim=3
+  members=[MATMUL(aic,func 0), ADD(aiv,func 1), MUL(aiv,func 2)]
+[l0_swimlane] arg slots (override with --set-arg SLOT=VALUE):
+  slot 0 tensor FLOAT32 [16384]
+  ...
+```
+
+A scalar slot holds the value directly (`--set-arg 4=4`); a tensor slot
+holds a pointer, so `--set-arg` fills its buffer (`--set-arg 4=512`). See
+[§7.2](#72---set-arg-floor-for-a-loop-count-without-distortion).
+
+### 3.3 Key flags
+
+| Flag | Meaning |
+| ---- | ------- |
+| `--test ` | SceneTest test file (required) |
+| `--func-id A[,B,C]` | The task's **member set** (comma-separated func_ids), required. `--func-id 0` traces the single-kernel task `{0}`; `--func-id 0,1,2` traces that 3-way mix. The set must exactly match a dispatched task's `func_id` array (you wrote the orchestration, so you know the members) |
+| `--task-id ` | Which task instance to replay (default: lowest). Instances of the same mix shape are structurally identical |
+| `--platform ` | Dump platform → arch / compile / SoC params (default `a2a3sim`). Sim (`a2a3sim` / `a5sim`) dumps with no NPU; onboard (`a2a3` / `a5`) dumps on `$TASK_DEVICE` (wrap the tool in `task-submit`). The replay is camodel regardless; geometry is identical, so prefer sim |
+| `--device ` | NPU device for an onboard dump + collect. **Auto-supplied** — `task-submit` appends `--device ` (also `$TASK_DEVICE`). Sim platforms ignore it |
+| `--case ` | Pin the dump to one `CASES[*].name`. Omitting it auto-pins the first case that lists `--platform`; pass it to target a smaller case when that first one overflows the camodel. Accepts `ClassName::Case` |
+| `--dump-json ` | Reuse an existing `args_dump.json`, skipping the dump re-run |
+| `--set-arg SLOT=VALUE` | Override an arg by `args[]` slot. Scalar slot → rewrite value; tensor slot → fill its buffer (integer dtypes). Shrinks a loop count without distortion. Repeatable. Default: real dump values |
+| `--spmd-block-num N` | `block_num` written into the synthesized SPMD context (slot 48). Default: the **selected** case's `block_dim` |
+| `--debug-line` / `-g` | Compile with `-g` (skip strip) so the trace carries `debug_line` → Insight maps instructions to source lines |
+| `--no-collect` | Generate + smoke-build only; do not take an NPU |
+| `--max-time ` | `task-submit` budget (default 1800) |
+
+Per-arch build parameters are fixed in the tool's `ARCH_CONFIG`:
+
+| arch | SoC (camodel) | aicore-arch (compile) | prologue macros |
+| ---- | ------------- | --------------------- | --------------- |
+| a2a3 | `dav_2201` | `dav-c220` | `__CCE_AICORE__ 220` / `PTO_NPU_ARCH_A2A3` |
+| a5 | `dav_3510` | `dav-c310` | `__CCE_AICORE__ 310` / `PTO_NPU_ARCH_A5` |
+
+### 3.4 Viewing — Insight vs Perfetto
+
+The workspace lands at
+`outputs/l0_swimlane_