hw-native-sys · zmnobug · Jun 25, 2026 · Jul 1, 2026 · Jul 1, 2026 · Jul 1, 2026
diff --git a/docs/dfx/args-dump.md b/docs/dfx/args-dump.md
@@ -481,12 +481,12 @@ normal execution continues.
 
 `halHostRegister` maps device memory into host virtual address
 space so the host can read device buffers directly.
-`TensorDumpCollector` runs two background threads on top of a
+`TensorDumpCollector` runs split mgmt threads and collector shards on top of a
 [`BufferPoolManager<DumpModule>`](../src/common/platform/include/host/buffer_pool_manager.h):
-a mgmt thread that polls SPSC ready queues and recycles full
-metadata buffers **while kernels are still executing**, plus a
-poll thread that drains the L2 hand-off queue into
-`on_buffer_collected`.
+drain/refill shards poll SPSC ready queues and recycle full metadata
+buffers **while kernels are still executing**, a replenish thread keeps
+free queues topped up, and collector shards drain the host hand-off queues
+into `on_buffer_collected`.
 
 ```text
         HOST                                         DEVICE
@@ -499,19 +499,19 @@ poll thread that drains the L2 hand-off queue into
 │                          │               │                          │
 │ start()                  │               │ per-task run loop:       │
 │   ┌────────────────────┐ │               │   BEFORE_DISPATCH        │
-│   │ mgmt thread        │ │               │     dump_arg_record()    │
-│   │ (BufferPool driver)│ │ SPSC ready    │     → write to arena     │
+│   │ drain/refill shard │ │               │     dump_arg_record()    │
+│   │ + replenish thread │ │ SPSC ready    │     → write to arena     │
 │   │   poll ready queue │<┼──queues──────<│     → append record      │
 │   │   recycle buffers  │─┼──free queue──>│     → push to ready_q    │
 │   └────────────────────┘ │               │   dispatch kernel        │
 │   ┌────────────────────┐ │               │   wait FIN               │
-│   │ poll thread        │ │               │   AFTER_COMPLETION       │
+│   │ collector shard    │ │               │   AFTER_COMPLETION       │
 │   │   reads arena via  │ │ shared mem    │     dump_arg_record()    │
 │   │   host mapping     │<┼──mapping─────<│                          │
 │   └────────────────────┘ │               │                          │
 │                          │               │ dump_args_flush()        │
 │ stop()                   │               │   log per-thread stats   │
-│   join mgmt → join poll  │               └──────────────────────────┘
+│   join mgmt → collectors │               └──────────────────────────┘
 │ reconcile_counters()     │
 │   recover leftovers      │
 │   + dropped accounting   │
@@ -530,29 +530,28 @@ poll thread that drains the L2 hand-off queue into
 init_tensor_dump()
   dump_collector_.initialize(..., output_prefix_)
   kernel_args_.args.dump_data_base = dump_collector_.get_dump_shm_device_ptr()
-start()                          ← spawn mgmt thread (drains L1 ringbuffer)
-                                   then spawn poll thread (consumes L2 queue)
+start()                          ← spawn split mgmt threads (drain/refill
+                                   + replenish), then collector shards
 launch AICPU / AICore
 rtStreamSynchronize              ← wait for kernel completion
-stop()                           ← join mgmt (its final-drain pass into L2
-                                   has poll as the consumer), then signal
-                                   poll and join it
+stop()                           ← join mgmt/replenish after final drain,
+                                   then signal collector shards and join them
 reconcile_counters()             ← recover leftover current buffers
                                    + dropped accounting
 export_dump_files()
 ```
 
-[`TensorDumpCollector`](../src/a2a3/platform/include/host/tensor_dump_collector.h)
+[`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
 on a2a3 inherits from
 [`profiling_common::ProfilerBase<TensorDumpCollector, DumpModule>`](../src/common/platform/include/host/profiler_base.h):
-the base class owns the mgmt thread, the poll thread, and the
+the base class owns the split mgmt threads, the collector shards, and the
 `BufferPoolManager<DumpModule>` they share. `TensorDumpCollector`
 only supplies the dump-specific pieces — the `DumpModule` trait
 that describes the shared-memory layout, `initialize` that
 allocates and pre-fills free queues, an `on_buffer_collected`
 callback that gathers payload bytes into the in-memory record
 list, plus `reconcile_counters` / `export_dump_files` /
-`finalize`. The mgmt/poll threading, buffer pooling, and `Module`
+`finalize`. The mgmt/collector threading, buffer pooling, and `Module`
 trait pattern are shared with PMU and L2Swimlane — see
 [profiling-framework.md](../profiling-framework.md) for the
 framework reference.
@@ -561,7 +560,7 @@ framework reference.
 
 a5's `TensorDumpCollector` derives from
 `ProfilerBase<TensorDumpCollector, DumpModule>` and shares the
-mgmt + poll thread structure with a2a3. The single behavioral
+split mgmt + collector shard structure with a2a3. The single behavioral
 deviation from §5.4 is the **transport channel**: a5 has no
 `halHostRegister`, so each device buffer is paired with a
 host-shadow `malloc()` and the mgmt loop synchronizes the two via
@@ -597,8 +596,8 @@ the buffer's records.
 │   register_mapping(s)    │               │   BEFORE_DISPATCH        │
 │                          │               │     dump_arg_record()    │
 │ start(thread_factory)    │               │   dispatch kernel        │
-│   mgmt_thread starts     │               │   wait FIN               │
-│   poll_thread starts     │               │   AFTER_COMPLETION       │
+│   split mgmt starts      │               │   wait FIN               │
+│   collector shards start │               │   AFTER_COMPLETION       │
 │                          │               │     dump_arg_record()    │
 │ mgmt every 10us tick:    │               │   if buffer full:        │
 │   copy_from_device(shm)  │<──memcpy─────<│     push ready entry,    │
@@ -612,7 +611,7 @@ the buffer's records.
 │     for each modified    │               │                          │
 │     field                │               │                          │
 │                          │               │                          │
-│ poll thread:             │               │                          │
+│ collector shard:         │               │                          │
 │   wait_pop_ready         │               │                          │
 │   on_buffer_collected →  │               │                          │
 │     copy arena slice     │<──memcpy─────<│                          │
@@ -622,7 +621,7 @@ the buffer's records.
 │                          │               │                          │
 │ rtStreamSynchronize      │               │                          │
 │ stop()                   │               │                          │
-│   join mgmt + poll       │               │                          │
+│   join mgmt + collectors │               │                          │
 │ reconcile_counters()     │               │                          │
 │   recover leftovers      │               │                          │
 │   + dropped accounting   │               │                          │
@@ -638,17 +637,17 @@ the buffer's records.
 init_tensor_dump()
   dump_collector_.initialize(num_dump_threads, ..., output_prefix_)
   kernel_args_.args.dump_data_base = dump_collector_.get_dump_shm_device_ptr()
-dump_collector_.start(thread_factory)   ← mgmt + poll threads
+dump_collector_.start(thread_factory)   ← split mgmt + collector shards
 launch AICPU / AICore
 rtStreamSynchronize
-dump_collector_.stop()                  ← join mgmt + poll, drain final batch
+dump_collector_.stop()                  ← join mgmt + collectors, drain final batch
 dump_collector_.reconcile_counters()    ← recover leftover current buffers
                                           + dropped accounting
 dump_collector_.export_dump_files()
 dump_collector_.finalize()
 ```
 
-[`TensorDumpCollector`](../src/a5/platform/include/host/tensor_dump_collector.h)
+[`TensorDumpCollector`](../src/common/platform/include/host/tensor_dump_collector.h)
 on a5 inherits the same CRTP base
 ([`profiling_common::ProfilerBase`](../src/common/platform/include/host/profiler_base.h))
 as a2a3 and parameterizes
@@ -670,7 +669,7 @@ before that flush runs, `reconcile_counters` recovers a non-empty
 | Device-side layout | identical (same `DumpDataHeader` / `DumpMetaBuffer` / arena shape, `static_assert`-checked) | |
 | AICPU recording logic | identical | |
 | Buffer model | rotating pool (free + ready queues per thread) | identical |
-| Host threads | mgmt + poll, streams during execution | identical |
+| Host threads | split mgmt + collector shards, streams during execution | identical |
 | Host-class shape | `ProfilerBase<TensorDumpCollector, DumpModule>` | identical |
 | Host transport | `halHostRegister` shared memory | host-shadow `malloc` + per-tick `rtMemcpy`/`memcpy` |
 | `MemoryOps` callbacks | 3 (`alloc`, `reg`, `free_`) | 5 (+ `copy_to_device`, `copy_from_device`) |
@@ -694,9 +693,10 @@ With `--dump-args`, AICPU records full `BEFORE_DISPATCH` /
   non-contiguous views).
 - The completion `pipe_barrier(PIPE_ALL)` before writing FIN, which
   serializes all device-side writes for dumped tasks.
-- The arena and metadata writes themselves; the host transport
-  cost is taken concurrently on a2a3 (mgmt + poll threads) or after
-  the stream finishes on a5.
+- The arena and metadata writes themselves; host drain/replenish and
+  collector work runs concurrently with the stream on both architectures.
+  a5 additionally pays `rtMemcpy`/`memcpy` transport cost to keep host
+  shadows in sync.
 
 For interactive debugging, total memory pressure is what to watch:
 the default per-thread arena is 128 MiB
@@ -893,7 +893,7 @@ per-thread arena (default 128 MiB). Bump
 
 **`dropped_overwrite > 0` in summary.** On a5, the run produced
 more total payload than fits in the arena; on a2a3, the host
-mgmt/poll threads couldn't keep up. Reduce the number of dumped
+mgmt/collector pipeline couldn't keep up. Reduce the number of dumped
 tasks (filter by `func_id` upstream) or increase
 `PLATFORM_DUMP_BUFFERS_PER_THREAD`.
 
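To make the `dropped_overwrite` advice above concrete, here is a toy, single-threaded model of a rotating buffer pool. It is only a sketch under stated assumptions: `simulate_pool`, `DropStats`, and the per-tick rates are hypothetical names, and the real drop policy behind `dropped_overwrite` is not shown in this diff. The model assumes a batch is dropped whenever the free queue is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical counters; not the collector's real accounting structure.
struct DropStats {
    std::size_t delivered;  // batches that reached the host
    std::size_t dropped;    // batches lost because no free buffer was available
};

// Toy model: each tick the device tries to fill `produced_per_tick` buffers,
// then the host recycles up to `recycled_per_tick` ready buffers.
DropStats simulate_pool(std::size_t buffers_per_thread,
                        std::size_t produced_per_tick,
                        std::size_t recycled_per_tick,
                        std::size_t ticks) {
    std::size_t free_count = buffers_per_thread;  // pool starts fully free
    std::size_t ready_count = 0;
    DropStats stats{0, 0};

    for (std::size_t t = 0; t < ticks; ++t) {
        // Device side: take a free buffer per batch, or count a drop.
        for (std::size_t i = 0; i < produced_per_tick; ++i) {
            if (free_count > 0) {
                --free_count;
                ++ready_count;
            } else {
                ++stats.dropped;
            }
        }
        // Host side: drain ready buffers and return them to the free queue.
        const std::size_t n = std::min(ready_count, recycled_per_tick);
        ready_count -= n;
        free_count += n;
        stats.delivered += n;
    }

    // Final drain after the stream finishes.
    stats.delivered += ready_count;

    // Every produced batch is either delivered or dropped.
    assert(stats.delivered + stats.dropped == produced_per_tick * ticks);
    return stats;
}
```

In this model, `simulate_pool(4, 3, 2, 10)` drops 8 of 30 batches, while `simulate_pool(16, 3, 2, 10)` drops none. A deeper pool absorbs a temporary host lag, which is the intuition behind raising `PLATFORM_DUMP_BUFFERS_PER_THREAD`.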

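The `stop()` ordering in the args-dump.md hunks above (join mgmt after its final drain, then signal and join the collectors) can be sketched as follows. This is a simplified illustration with one mgmt thread and one collector, not the `ProfilerBase` implementation: `ReadyQueue`, `HandoffQueue`, and `run_pipeline_demo` are hypothetical names, and a mutex-guarded deque stands in for the device-side SPSC ready queue.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Mutex-guarded stand-in for a device-side SPSC ready queue (hypothetical).
class ReadyQueue {
public:
    void push(int id) {
        std::lock_guard<std::mutex> lock(mu_);
        q_.push_back(id);
    }
    bool try_pop(int& out) {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::mutex mu_;
    std::deque<int> q_;
};

// Blocking host hand-off queue between the mgmt and collector stages (hypothetical).
class HandoffQueue {
public:
    void push(int id) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            q_.push_back(id);
        }
        cv_.notify_one();
    }
    // Returns false only once the queue is closed and fully drained.
    bool wait_pop(int& out) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<int> q_;
    bool closed_ = false;
};

// Pushes `num_buffers` ready entries through mgmt -> collector and returns
// the buffer ids in the order the collector saw them.
std::vector<int> run_pipeline_demo(int num_buffers) {
    ReadyQueue ready;
    HandoffQueue handoff;
    std::atomic<bool> stop_mgmt{false};
    std::vector<int> collected;  // written only by the collector until it is joined

    std::thread mgmt([&] {
        int id = 0;
        while (!stop_mgmt.load(std::memory_order_acquire)) {
            if (ready.try_pop(id)) handoff.push(id);
            else std::this_thread::yield();
        }
        // Final drain: forward everything that was ready before stop was requested.
        while (ready.try_pop(id)) handoff.push(id);
    });
    std::thread collector([&] {
        int id = 0;
        while (handoff.wait_pop(id)) collected.push_back(id);  // stands in for on_buffer_collected
    });

    for (int i = 0; i < num_buffers; ++i) ready.push(i);  // simulated device producer

    // stop(): join mgmt first so its final drain still has a live consumer,
    // then signal the collector and join it.
    stop_mgmt.store(true, std::memory_order_release);
    mgmt.join();
    handoff.close();
    collector.join();

    assert(collected.size() == static_cast<std::size_t>(num_buffers));
    return collected;
}
```

Joining the collector before the mgmt thread would risk stranding buffers that the final drain pushes after the consumer has exited. With the order shown, every ready entry reaches the callback.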
diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md
@@ -609,11 +609,11 @@ sched overhead per session as price for unbounded session length).
 
 `halHostRegister` maps device memory into host virtual address
 space so the host can read device buffers directly.
-`L2SwimlaneCollector` runs two background threads on top of a
+`L2SwimlaneCollector` runs split mgmt threads and collector shards on top of a
 [`BufferPoolManager<L2SwimlaneModule>`](../src/common/platform/include/host/buffer_pool_manager.h):
-a mgmt thread that polls SPSC ready queues and recycles full
-buffers **while kernels are still executing**, plus a poll
-thread that drains the L2 hand-off queue into
+drain/refill shards poll SPSC ready queues and recycle full buffers
+**while kernels are still executing**, a replenish thread keeps free
+queues topped up, and collector shards drain the host hand-off queues into
 `on_buffer_collected`.
 
 `L2SwimlaneModule` declares four buffer kinds going through one ready
@@ -641,19 +641,19 @@ are single-kind.
 │                          │               │                          │
 │ start(tf)                │               │ AICPU on FIN:            │
 │   ┌────────────────────┐ │ SPSC ready    │   commit AicpuTask       │
-│   │ mgmt thread        │ │ queues        │   record (kind 0); fill  │
-│   │ (BufferPool driver)│ │<──4 kinds────<│   func_id / dispatch /   │
+│   │ drain/refill shard │ │ queues        │   record (kind 0); fill  │
+│   │ + replenish thread │ │<──4 kinds────<│   func_id / dispatch /   │
 │   │   poll ready queue │<┼──multiplexed──│   finish; rotate buffer  │
 │   │   recycle buffers  │─┼──free queue──>│   when full              │
 │   └────────────────────┘ │               │ AICPU scheduler thread:  │
 │   ┌────────────────────┐ │               │   per work iter: write   │
-│   │ poll thread        │ │               │   SchedPhaseRecord       │
+│   │ collector shard    │ │               │   SchedPhaseRecord       │
 │   │   reads via host   │ │ shared mem    │   (kind 1). Per submit:  │
 │   │   mapping; copies  │<┼──mapping─────<│   write OrchPhaseRecord  │
 │   │   to host vectors  │ │               │   (kind 2).              │
 │   └────────────────────┘ │               │                          │
 │ stop()                   │               │                          │
-│   join mgmt → join poll  │               │                          │
+│   join mgmt → collectors │               │                          │
 │ read_phase_header_metadata()             │                          │
 │ reconcile_counters()     │               │                          │
 │ export_swimlane_json()   │               │                          │
@@ -667,10 +667,10 @@ are single-kind.
 init_l2_swimlane()
   l2_swimlane_collector_.initialize(num_aicore, ..., output_prefix_)
   kernel_args_.args.l2_swimlane_data_base = l2_swimlane_collector_.get_l2_swimlane_shm_device_ptr()
-start(tf)                          ← spawn mgmt + poll threads
+start(tf)                          ← spawn split mgmt + collector shards
 launch AICPU / AICore
 rtStreamSynchronize
-stop()                             ← join mgmt → join poll
+stop()                             ← join mgmt/replenish → join collectors
 read_phase_header_metadata()       ← single-shot read of the
                                      core→thread mapping
 reconcile_counters()               ← three-bucket accounting for both
@@ -684,7 +684,7 @@ finalize(unregister, free)
 [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
 on a2a3 inherits from
 [`profiling_common::ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>`](../src/common/platform/include/host/profiler_base.h):
-the base class owns the mgmt thread, the poll thread, and the
+the base class owns the split mgmt threads, the collector shards, and the
 `BufferPoolManager<L2SwimlaneModule>` they share. `L2SwimlaneCollector`
 supplies the L2-specific pieces — the `L2SwimlaneModule` trait
 (notably `kBufferKinds = 4` and `kind_of()`), `initialize` that
@@ -694,17 +694,18 @@ allocates and pre-fills all four kinds of free queues, an
 to copy into the right per-core or per-thread vector, plus
 `read_phase_header_metadata` /
 `reconcile_counters` / `export_swimlane_json` / `finalize`. The
-mgmt/poll threading and `Module` trait pattern are shared with
+mgmt/collector threading and `Module` trait pattern are shared with
 PMU and TensorDump — see
 [profiling-framework.md](../profiling-framework.md) for the
 framework reference.
 
 ### 5.3 a5 — same framework, host-shadow transport
 
 a5's `L2SwimlaneCollector` derives from
-`ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` and shares the
-mgmt + poll thread structure with a2a3. The single behavioral
-deviation from §5.2 is the **transport channel**: a5 has no
+`ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` and uses the same
+framework abstractions as a2a3. Unlike a2a3's split layout, its L2 module
+currently keeps the default single mgmt + collector thread shape; the larger
+behavioral deviation from §5.2 is the **transport channel**: a5 has no
 `halHostRegister`, so each device buffer is paired with a
 host-shadow `malloc()` and the mgmt loop synchronizes the two via
 `profiling_copy.h` (`rtMemcpy` onboard, plain `memcpy` in sim).
@@ -836,7 +837,7 @@ PHASE), same shape as a2a3.
 | AICPU commit on FIN | identical | |
 | Buffer model | rotating pool (free + ready queues) per kind | identical |
 | Ready queue | per-AICPU-thread, multiplexes 4 kinds via `ReadyQueueEntry::kind` | per-AICPU-thread, 2 kinds via `is_phase` |
-| Host threads | mgmt + poll, streams during execution | identical |
+| Host threads | split mgmt + collector shards, streams during execution | default single mgmt + collector thread |
 | Host-class shape | `ProfilerBase<L2SwimlaneCollector, L2SwimlaneModule>` (`kBufferKinds = 4`) | same base, `kBufferKinds = 2` |
 | Host transport | `halHostRegister` shared memory | host-shadow `malloc` + per-tick `rtMemcpy`/`memcpy` |
 | `MemoryOps` callbacks | 3 (`alloc`, `reg`, `free_`) | 5 (+ `copy_to_device`, `copy_from_device`) |
@@ -864,8 +865,10 @@ Phase-record overhead (only at `--enable-l2-swimlane >= 3`):
 - a5 — one 40 B `L2SwimlaneAicpuPhaseRecord` per emitted phase
   (legacy unified shape).
 
-Both architectures drain buffers concurrently with execution via the
-mgmt + poll thread pair; a5 additionally pays per-tick
+Both architectures drain buffers concurrently with execution through the
+`ProfilerBase` mgmt/collector pipeline. For this profiler, a2a3 uses split
+mgmt threads plus collector shards, while a5 currently uses the default
+single mgmt thread plus collector thread. a5 additionally pays per-tick
 `rtMemcpy`/`memcpy` round-trips to keep the host shadow in sync,
 which overlap with device execution.