60 changes: 30 additions & 30 deletions docs/dfx/args-dump.md
@@ -481,12 +481,12 @@ normal execution continues.

`halHostRegister` maps device memory into host virtual address
space so the host can read device buffers directly.
`TensorDumpCollector` runs two background threads on top of a
`TensorDumpCollector` runs split mgmt threads and collector shards on top of a
[`BufferPoolManager<DumpModule>`](../src/common/platform/include/host/buffer_pool_manager.h):
a mgmt thread that polls SPSC ready queues and recycles full
metadata buffers **while kernels are still executing**, plus a
poll thread that drains the L2 hand-off queue into
`on_buffer_collected`.
drain/refill shards poll SPSC ready queues and recycle full metadata
buffers **while kernels are still executing**, a replenish thread keeps
free queues topped up, and collector shards drain the host hand-off queues
into `on_buffer_collected`.

```text
HOST DEVICE
@@ -499,19 +499,19 @@ poll thread that drains the L2 hand-off queue into
│ │ │ │
│ start() │ │ per-task run loop: │
│ ┌────────────────────┐ │ │ BEFORE_DISPATCH │
│ │ mgmt thread │ │ │ dump_arg_record() │
│ │ (BufferPool driver)│ │ SPSC ready │ → write to arena │
│ │ drain/refill shard │ │ │ dump_arg_record() │
│ │ + replenish thread │ │ SPSC ready │ → write to arena │
│ │ poll ready queue │<┼──queues──────<│ → append record │
│ │ recycle buffers │─┼──free queue──>│ → push to ready_q │
│ └────────────────────┘ │ │ dispatch kernel │
│ ┌────────────────────┐ │ │ wait FIN │
│ │ poll thread │ │ │ AFTER_COMPLETION │
│ │ collector shard │ │ │ AFTER_COMPLETION │
│ │ reads arena via │ │ shared mem │ dump_arg_record() │
│ │ host mapping │<┼──mapping─────<│ │
│ └────────────────────┘ │ │ │
│ │ │ dump_args_flush() │
│ stop() │ │ log per-thread stats │
│ join mgmt → join poll │ └──────────────────────────┘
│ join mgmt → collectors │ └──────────────────────────┘
│ reconcile_counters() │
│ recover leftovers │
│ + dropped accounting │
@@ -530,29 +530,28 @@ poll thread that drains the L2 hand-off queue into
init_tensor_dump()
dump_collector_.initialize(..., output_prefix_)
kernel_args_.args.dump_data_base = dump_collector_.get_dump_shm_device_ptr()
start() ← spawn mgmt thread (drains L1 ringbuffer)
then spawn poll thread (consumes L2 queue)
start() ← spawn split mgmt threads (drain/refill
+ replenish), then collector shards
launch AICPU / AICore
rtStreamSynchronize ← wait for kernel completion
stop() ← join mgmt (its final-drain pass into L2
has poll as the consumer), then signal
poll and join it
stop() ← join mgmt/replenish after final drain,
then signal collector shards and join them
reconcile_counters() ← recover leftover current buffers
+ dropped accounting
export_dump_files()
```
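
The `stop()` ordering in the lifecycle above (join mgmt after its final drain, then signal and join the collectors) can be illustrated with a minimal two-stage pipeline. This is only a sketch: `HandOffQueue` and `run_pipeline` are hypothetical names, not part of `ProfilerBase`, and a single mgmt thread and a single collector stand in for the split threads and shards.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical host hand-off queue between a mgmt thread and a collector.
class HandOffQueue {
public:
    void push(int v) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            q_.push_back(v);
        }
        cv_.notify_one();
    }
    // Blocks until an item arrives, or until close() was called and the
    // queue is empty (in which case it returns false).
    bool wait_pop(int &out) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }
    void close() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<int> q_;
    bool closed_ = false;
};

// Stop order: join mgmt first, then close the queue and join the collector.
std::vector<int> run_pipeline(const std::vector<int> &ready) {
    HandOffQueue handoff;
    std::vector<int> collected;
    std::thread collector([&] {
        int v;
        while (handoff.wait_pop(v)) collected.push_back(v);
    });
    std::thread mgmt([&] {
        for (int v : ready) handoff.push(v);  // includes the final-drain pass
    });
    mgmt.join();      // the collector is still alive to consume the final drain
    handoff.close();  // only now tell the collector to exit once it is drained
    collector.join();
    return collected;
}
```

Joining the collector first would leave the final-drain pass without a consumer, which is why the sketch closes the queue only after the mgmt thread has exited.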

[`TensorDumpCollector`](../src/a2a3/platform/include/host/tensor_dump_collector.h)
[`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
on a2a3 inherits from
[`profiling_common::ProfilerBase<TensorDumpCollector, DumpModule>`](../src/common/platform/include/host/profiler_base.h):
the base class owns the mgmt thread, the poll thread, and the
the base class owns split mgmt threads, collector shards, and the
`BufferPoolManager<DumpModule>` they share. `TensorDumpCollector`
only supplies the dump-specific pieces — the `DumpModule` trait
that describes the shared-memory layout, `initialize` that
allocates and pre-fills free queues, an `on_buffer_collected`
callback that gathers payload bytes into the in-memory record
list, plus `reconcile_counters` / `export_dump_files` /
`finalize`. The mgmt/poll threading, buffer pooling, and `Module`
`finalize`. The mgmt/collector threading, buffer pooling, and `Module`
trait pattern are shared with PMU and L2Swimlane — see
[profiling-framework.md](../profiling-framework.md) for the
framework reference.
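
As a rough illustration of the split described above, the following sketch shows a CRTP base that owns the generic collection loop and a derived collector that supplies only `on_buffer_collected`. All types here (`DemoModule`, `DemoProfilerBase`, `DemoDumpCollector`) are hypothetical stand-ins; the real `ProfilerBase` interface is in `profiler_base.h` and may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical trait: describes what a "buffer" is for this profiler.
struct DemoModule {
    using Buffer = std::vector<int>;
};

// Hypothetical CRTP base: owns the generic loop and forwards each buffer
// to the derived class without a virtual call.
template <typename Derived, typename Module>
class DemoProfilerBase {
public:
    void collect_all(const std::vector<typename Module::Buffer> &ready) {
        for (const auto &buf : ready) {
            static_cast<Derived *>(this)->on_buffer_collected(buf);
        }
    }
};

// The derived collector only supplies the profiler-specific callback.
class DemoDumpCollector
    : public DemoProfilerBase<DemoDumpCollector, DemoModule> {
public:
    void on_buffer_collected(const DemoModule::Buffer &buf) {
        // Gather payload into the in-memory record list.
        records_.insert(records_.end(), buf.begin(), buf.end());
    }
    std::size_t record_count() const { return records_.size(); }

private:
    std::vector<int> records_;
};

// Convenience wrapper used to exercise the sketch.
std::size_t demo_collect(const std::vector<std::vector<int>> &ready) {
    DemoDumpCollector collector;
    collector.collect_all(ready);
    return collector.record_count();
}
```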
@@ -561,7 +560,7 @@ framework reference.

a5's `TensorDumpCollector` derives from
`ProfilerBase<TensorDumpCollector, DumpModule>` and shares the
mgmt + poll thread structure with a2a3. The single behavioral
split mgmt + collector shard structure with a2a3. The single behavioral
deviation from §5.4 is the **transport channel**: a5 has no
`halHostRegister`, so each device buffer is paired with a
host-shadow `malloc()` and the mgmt loop synchronizes the two via
@@ -597,8 +596,8 @@ the buffer's records.
│ register_mapping(s) │ │ BEFORE_DISPATCH │
│ │ │ dump_arg_record() │
│ start(thread_factory) │ │ dispatch kernel │
mgmt_thread starts │ │ wait FIN │
poll_thread starts │ │ AFTER_COMPLETION │
split mgmt starts │ │ wait FIN │
collector shards start │ │ AFTER_COMPLETION │
│ │ │ dump_arg_record() │
│ mgmt every 10us tick: │ │ if buffer full: │
│ copy_from_device(shm) │<──memcpy─────<│ push ready entry, │
@@ -612,7 +611,7 @@ the buffer's records.
│ for each modified │ │ │
│ field │ │ │
│ │ │ │
poll thread: │ │ │
collector shard: │ │ │
│ wait_pop_ready │ │ │
│ on_buffer_collected → │ │ │
│ copy arena slice │<──memcpy─────<│ │
@@ -622,7 +621,7 @@ the buffer's records.
│ │ │ │
│ rtStreamSynchronize │ │ │
│ stop() │ │ │
│ join mgmt + poll │ │ │
│ join mgmt + collectors │ │ │
│ reconcile_counters() │ │ │
│ recover leftovers │ │ │
│ + dropped accounting │ │ │
@@ -638,17 +637,17 @@ the buffer's records.
init_tensor_dump()
dump_collector_.initialize(num_dump_threads, ..., output_prefix_)
kernel_args_.args.dump_data_base = dump_collector_.get_dump_shm_device_ptr()
dump_collector_.start(thread_factory) ← mgmt + poll threads
dump_collector_.start(thread_factory) ← split mgmt + collector shards
launch AICPU / AICore
rtStreamSynchronize
dump_collector_.stop() ← join mgmt + poll, drain final batch
dump_collector_.stop() ← join mgmt + collectors, drain final batch
dump_collector_.reconcile_counters() ← recover leftover current buffers
+ dropped accounting
dump_collector_.export_dump_files()
dump_collector_.finalize()
```

[`TensorDumpCollector`](../src/a5/platform/include/host/tensor_dump_collector.h)
[`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
on a5 inherits the same CRTP base
([`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h))
as a2a3 and parameterizes
@@ -670,7 +669,7 @@ before that flush runs, `reconcile_counters` recovers a non-empty
| Device-side layout | identical (same `DumpDataHeader` / `DumpMetaBuffer` / arena shape, `static_assert`-checked) | |
| AICPU recording logic | identical | |
| Buffer model | rotating pool (free + ready queues per thread) | identical |
| Host threads | mgmt + poll, streams during execution | identical |
| Host threads | split mgmt + collector shards, streams during execution | identical |
| Host-class shape | `ProfilerBase<TensorDumpCollector, DumpModule>` | identical |
| Host transport | `halHostRegister` shared memory | host-shadow `malloc` + per-tick `rtMemcpy`/`memcpy` |
| `MemoryOps` callbacks | 3 (`alloc`, `reg`, `free_`) | 5 (+ `copy_to_device`, `copy_from_device`) |
@@ -694,9 +693,10 @@ With `--dump-args`, AICPU records full `BEFORE_DISPATCH` /
non-contiguous views).
- The completion `pipe_barrier(PIPE_ALL)` before writing FIN, which
serializes all device-side writes for dumped tasks.
- The arena and metadata writes themselves; the host transport
cost is taken concurrently on a2a3 (mgmt + poll threads) or after
the stream finishes on a5.
- The arena and metadata writes themselves; host drain/replenish and
collector work runs concurrently with the stream on both architectures.
a5 additionally pays `rtMemcpy`/`memcpy` transport cost to keep host
shadows in sync.

For interactive debugging, total memory pressure is what to watch:
the default per-thread arena is 128 MiB
@@ -893,7 +893,7 @@ per-thread arena (default 128 MiB). Bump

**`dropped_overwrite > 0` in summary.** On a5, the run produced
more total payload than fits in the arena; on a2a3, the host
mgmt/poll threads couldn't keep up. Reduce the number of dumped
mgmt/collector pipeline couldn't keep up. Reduce the number of dumped
tasks (filter by `func_id` upstream) or increase
`PLATFORM_DUMP_BUFFERS_PER_THREAD`.
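
To see why a larger buffer pool helps when the host pipeline falls behind, consider this toy model of a rotating pool. It is an assumption-laden sketch, not the real accounting: `simulate_pool`, its parameters, and the tick model are hypothetical, and the actual `dropped_overwrite` counter may be computed differently.

```cpp
#include <cassert>
#include <cstddef>

struct PoolStats {
    std::size_t collected = 0;  // buffers that reached the host
    std::size_t dropped = 0;    // buffer-full events with no free buffer
};

// Toy model: on every tick the device fills `device_rate` buffers and the
// host recycles at most `host_rate` ready buffers back to the free queue.
PoolStats simulate_pool(std::size_t ticks, std::size_t buffers,
                        std::size_t device_rate, std::size_t host_rate) {
    PoolStats s;
    std::size_t free_bufs = buffers;
    std::size_t ready = 0;
    for (std::size_t t = 0; t < ticks; ++t) {
        for (std::size_t i = 0; i < device_rate; ++i) {
            if (free_bufs > 0) {
                --free_bufs;
                ++ready;
            } else {
                ++s.dropped;  // no free buffer: this payload is lost
            }
        }
        std::size_t drained = ready < host_rate ? ready : host_rate;
        ready -= drained;
        free_bufs += drained;
        s.collected += drained;
    }
    s.collected += ready;  // final drain at stop()
    return s;
}
```

In this model a pool of 4 buffers drops data when the device produces twice as fast as the host drains, while a pool of 16 absorbs the same 10-tick burst without loss. A bigger pool only buys headroom for bursts; a sustained rate mismatch still needs fewer dumped tasks.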

39 changes: 21 additions & 18 deletions docs/dfx/l2-swimlane-profiling.md
@@ -609,11 +609,11 @@ sched overhead per session as price for unbounded session length).

`halHostRegister` maps device memory into host virtual address
space so the host can read device buffers directly.
`L2SwimlaneCollector` runs two background threads on top of a
`L2SwimlaneCollector` runs split mgmt threads and collector shards on top of a
[`BufferPoolManager<L2SwimlaneModule>`](../src/common/platform/include/host/buffer_pool_manager.h):
a mgmt thread that polls SPSC ready queues and recycles full
buffers **while kernels are still executing**, plus a poll
thread that drains the L2 hand-off queue into
drain/refill shards poll SPSC ready queues and recycle full buffers
**while kernels are still executing**, a replenish thread keeps free
queues topped up, and collector shards drain the host hand-off queues into
`on_buffer_collected`.

`L2SwimlaneModule` declares four buffer kinds going through one ready
@@ -641,19 +641,19 @@ are single-kind.
│ │ │ │
│ start(tf) │ │ AICPU on FIN: │
│ ┌────────────────────┐ │ SPSC ready │ commit AicpuTask │
│ │ mgmt thread │ │ queues │ record (kind 0); fill │
│ │ (BufferPool driver)│ │<──4 kinds────<│ func_id / dispatch / │
│ │ drain/refill shard │ │ queues │ record (kind 0); fill │
│ │ + replenish thread │ │<──4 kinds────<│ func_id / dispatch / │
│ │ poll ready queue │<┼──multiplexed──│ finish; rotate buffer │
│ │ recycle buffers │─┼──free queue──>│ when full │
│ └────────────────────┘ │ │ AICPU scheduler thread: │
│ ┌────────────────────┐ │ │ per work iter: write │
│ │ poll thread │ │ │ SchedPhaseRecord │
│ │ collector shard │ │ │ SchedPhaseRecord │
│ │ reads via host │ │ shared mem │ (kind 1). Per submit: │
│ │ mapping; copies │<┼──mapping─────<│ write OrchPhaseRecord │
│ │ to host vectors │ │ │ (kind 2). │
│ └────────────────────┘ │ │ │
│ stop() │ │ │
│ join mgmt → join poll │ │ │
│ join mgmt → collectors │ │ │
│ read_phase_header_metadata() │ │
│ reconcile_counters() │ │ │
│ export_swimlane_json() │ │ │
@@ -667,10 +667,10 @@ are single-kind.
init_l2_swimlane()
l2_swimlane_collector_.initialize(num_aicore, ..., output_prefix_)
kernel_args_.args.l2_swimlane_data_base = l2_swimlane_collector_.get_l2_swimlane_shm_device_ptr()
start(tf) ← spawn mgmt + poll threads
start(tf) ← spawn split mgmt + collector shards
launch AICPU / AICore
rtStreamSynchronize
stop() ← join mgmt → join poll
stop() ← join mgmt/replenish → join collectors
read_phase_header_metadata() ← single-shot read of the
core→thread mapping
reconcile_counters() ← three-bucket accounting for both
@@ -684,7 +684,7 @@ finalize(unregister, free)
[`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
on a2a3 inherits from
[`profiling_common::ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>`](../src/common/platform/include/host/profiler_base.h):
the base class owns the mgmt thread, the poll thread, and the
the base class owns split mgmt threads, collector shards, and the
`BufferPoolManager<L2SwimlaneModule>` they share. `L2SwimlaneCollector`
supplies the L2-specific pieces — the `L2SwimlaneModule` trait
(notably `kBufferKinds = 4` and `kind_of()`), `initialize` that
@@ -694,17 +694,18 @@ allocates and pre-fills all four kinds of free queues, an
to copy into the right per-core or per-thread vector, plus
`read_phase_header_metadata` /
`reconcile_counters` / `export_swimlane_json` / `finalize`. The
mgmt/poll threading and `Module` trait pattern are shared with
mgmt/collector threading and `Module` trait pattern are shared with
PMU and TensorDump — see
[profiling-framework.md](../profiling-framework.md) for the
framework reference.
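
The role of `kBufferKinds` and `kind_of()` can be sketched as a demultiplexer over a single ready queue. The types below (`DemoReadyEntry`, `DemoL2Module`, `demux`) are hypothetical and only mirror the idea of tagging each ready entry with a kind; the real `L2SwimlaneModule` and `ReadyQueueEntry` definitions are not reproduced here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ready-queue entry tagged with a buffer kind.
struct DemoReadyEntry {
    std::uint32_t kind;
    int payload;
};

// Hypothetical module trait with four buffer kinds.
struct DemoL2Module {
    static constexpr std::size_t kBufferKinds = 4;
    static std::size_t kind_of(const DemoReadyEntry &e) { return e.kind; }
};

// Routes each entry from one multiplexed queue into its per-kind vector.
// Entries with an out-of-range kind are ignored in this sketch.
template <typename Module>
std::array<std::vector<int>, Module::kBufferKinds>
demux(const std::vector<DemoReadyEntry> &ready) {
    std::array<std::vector<int>, Module::kBufferKinds> out{};
    for (const auto &e : ready) {
        const std::size_t k = Module::kind_of(e);
        if (k < Module::kBufferKinds) out[k].push_back(e.payload);
    }
    return out;
}
```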

### 5.3 a5 — same framework, host-shadow transport

a5's `L2SwimlaneCollector` derives from
`ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` and shares the
mgmt + poll thread structure with a2a3. The single behavioral
deviation from §5.2 is the **transport channel**: a5 has no
`ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` and uses the same
framework abstractions as a2a3. Its current L2 module keeps the default
single mgmt + collector thread shape; the larger behavioral deviation
from §5.2 is the **transport channel**: a5 has no
`halHostRegister`, so each device buffer is paired with a
host-shadow `malloc()` and the mgmt loop synchronizes the two via
`profiling_copy.h` (`rtMemcpy` onboard, plain `memcpy` in sim).
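
A minimal sketch of the host-shadow idea, assuming the sim flavour where "device" memory is ordinary host memory and both copy callbacks reduce to `memcpy`. `DemoMemoryOps`, `DemoShadowedBuffer`, `make_sim_ops`, and `sync_tick` are hypothetical names; the real callback signatures live in the platform headers and are not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical callback bundle; only the two copy callbacks are modelled.
struct DemoMemoryOps {
    std::function<void(void *dst_dev, const void *src_host, std::size_t n)>
        copy_to_device;
    std::function<void(void *dst_host, const void *src_dev, std::size_t n)>
        copy_from_device;
};

// Sim flavour: both directions are a plain memcpy.
DemoMemoryOps make_sim_ops() {
    DemoMemoryOps ops;
    ops.copy_to_device = [](void *d, const void *s, std::size_t n) {
        std::memcpy(d, s, n);
    };
    ops.copy_from_device = [](void *d, const void *s, std::size_t n) {
        std::memcpy(d, s, n);
    };
    return ops;
}

// A device buffer paired with its host shadow.
struct DemoShadowedBuffer {
    unsigned char *device;
    unsigned char *shadow;
    std::size_t size;
};

// One mgmt tick: refresh the host shadow from the device buffer.
void sync_tick(const DemoMemoryOps &ops, DemoShadowedBuffer &b) {
    ops.copy_from_device(b.shadow, b.device, b.size);
}
```

Until a tick runs, the shadow is stale, which is the cost a5 pays for lacking a direct host mapping.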
@@ -836,7 +837,7 @@ PHASE), same shape as a2a3.
| AICPU commit on FIN | identical | |
| Buffer model | rotating pool (free + ready queues) per kind | identical |
| Ready queue | per-AICPU-thread, multiplexes 4 kinds via `ReadyQueueEntry::kind` | per-AICPU-thread, 2 kinds via `is_phase` |
| Host threads | mgmt + poll, streams during execution | identical |
| Host threads | split mgmt + collector shards, streams during execution | default single mgmt + collector thread |
| Host-class shape | `ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` (`kBufferKinds = 4`) | same base, `kBufferKinds = 2` |
| Host transport | `halHostRegister` shared memory | host-shadow `malloc` + per-tick `rtMemcpy`/`memcpy` |
| `MemoryOps` callbacks | 3 (`alloc`, `reg`, `free_`) | 5 (+ `copy_to_device`, `copy_from_device`) |
@@ -864,8 +865,10 @@ Phase-record overhead (only at `--enable-l2-swimlane >= 3`):
- a5 — one 40 B `L2SwimlaneAicpuPhaseRecord` per emitted phase
(legacy unified shape).

Both architectures drain buffers concurrently with execution via the
mgmt + poll thread pair; a5 additionally pays per-tick
Both architectures drain buffers concurrently with execution through the
ProfilerBase mgmt/collector pipeline; a2a3 uses split mgmt plus collector
shards for this profiler, while a5 currently uses the default single mgmt
plus collector thread. a5 additionally pays per-tick
`rtMemcpy`/`memcpy` round-trips to keep the host shadow in sync,
which overlap with device execution.
