
Support: buffer AICPU stall diagnostics in device memory and print on host#2

Open
ChaoZheng109 wants to merge 1 commit into main from logbuf


Conversation

@ChaoZheng109

Owner

DEV_ALWAYS is broken on a5 hardware, which silences the stall diagnostic logs. This PR replaces it with a host-allocated device-memory buffer that AICPU writes to atomically; the host copies the buffer back and prints it after execution.

  • Add DevLogBuffer struct (512 entries x 128 bytes) in dev_log_buffer.h
  • Add dev_log_buffer_dev_ptr_ field and accessors to Runtime
  • Allocate and zero-init the buffer in init_runtime_impl, and record it with host_ptr=nullptr so the copy-back loop skips it; device_free handles cleanup
  • Copy the buffer back from the device and printf each entry in validate_runtime_impl
  • Replace all DEV_ALWAYS calls in the stall block with dev_buf_log()
  • Remove the thread_idx==0 guard so all scheduler threads contribute log entries
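The buffer scheme listed above can be sketched roughly as follows. Only the 512 x 128-byte sizing and the names `DevLogBuffer` and `dev_buf_log` come from this PR; the field layout, the `next` counter, the function signatures, and `print_dev_log` are assumptions made for illustration, and the real code in dev_log_buffer.h may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstdarg>
#include <cstdint>
#include <cstdio>

// Sizing taken from the PR text; everything else here is hypothetical.
constexpr uint32_t kDevLogEntries = 512;
constexpr uint32_t kDevLogEntryBytes = 128;

struct DevLogBuffer {
    std::atomic<uint32_t> next{0};  // number of slot claims attempted so far
    char entries[kDevLogEntries][kDevLogEntryBytes];
};

// Writer side: claim a slot with a single atomic increment, then format the
// message into it. Returns false if the buffer is full and the message is
// dropped. Any thread may call this concurrently.
inline bool dev_buf_log(DevLogBuffer& buf, const char* fmt, ...) {
    uint32_t slot = buf.next.fetch_add(1, std::memory_order_relaxed);
    if (slot >= kDevLogEntries) return false;
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf.entries[slot], kDevLogEntryBytes, fmt, args);
    va_end(args);
    return true;
}

// Reader side: meant to run after the buffer has been copied back and all
// writers have finished. Prints every used entry and returns how many.
inline uint32_t print_dev_log(const DevLogBuffer& buf, std::FILE* out) {
    assert(out != nullptr);
    uint32_t claimed = buf.next.load(std::memory_order_acquire);
    uint32_t used = claimed < kDevLogEntries ? claimed : kDevLogEntries;
    for (uint32_t i = 0; i < used; ++i) {
        std::fprintf(out, "[aicpu] %s\n", buf.entries[i]);
    }
    return used;
}
```

Because each writer owns the slot it claimed, threads never write to the same entry, which is what makes it safe to drop a single-thread guard in a design like this. The counter keeps incrementing past the capacity, so the reader clamps it before iterating.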

zhangqi-chen added a commit to zhangqi-chen/simpler that referenced this pull request Apr 14, 2026
… host


Cherry-picked from ChaoZheng109#2 and adapted for stable branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChaoZheng109 pushed a commit that referenced this pull request Jun 9, 2026
…spill (hw-native-sys#989)

Combines two scheduler-side dispatch improvements that together hit
the two metrics tracked in the PR review thread:

1. **Per-thread first-to-last AICore start span** — bring it back to
   the hardware floor of ~60 ns. The original v3 attempt to hoist
   `handles[]`, `wmb()`, and the publish loop across distinct tasks
   in one pop was reverted in the 2026-06-06 investigation because
   it broke `spmd_sync_start_stress`: bursting all prior tasks' MMIO
   writes immediately before `enter_drain_mode()` collapsed the
   head-start that lets the surrounding completion loop catch up on
   FINs in the drain's resource-insufficient retry window, and the
   loop tripped the 1 s op timeout in ~40% of runs. This commit ships
   the follow-up #2 the investigation left for later: gate the
   cross-task hoist on the popped batch carrying no
   `requires_sync_start()` task. When the batch contains a sync_start
   task, fall back to per-task `flush_publish()` (one wmb + one
   publish per task) so prior tasks land on AICore with the same
   time separation the per-claim-only design had. The check is one
   mask-bit read per popped task — trivial. The drain-entry path
   still calls `flush_publish()` before `enter_drain_mode()` so any
   in-flight handles get out; when `any_sync_start == true` that
   flush is already drained per-task and the entry flush is a no-op.

2. **Cross-thread first-dispatch stagger** — bring the 3-scheduler-
   thread startup delay back to sub-microsecond. When
   `release_fanin_and_check_ready` fast-paths newly-ready consumers
   into the releasing thread's `local_bufs[shape]`, batch releases
   (e.g. attn_fence → 50 out_proj consumers) overshoot this thread's
   slot budget by 6×, and peers spin on an empty shared queue until
   the producing thread's `flush_local_bufs()` between IDLE and
   PENDING exposes the overflow. This commit adds an overflow gate
   at the top of `dispatch_ready_tasks`: if `local_bufs[s].count`
   exceeds the per-shape per-thread block budget AND a peer has idle
   cores in that shape, `push_batch` the trailing excess to the
   shared queue. The spill is an O(1) count decrement with no memmove. Capacity derives
   from `PLATFORM_MAX_BLOCKDIM / active_sched_threads_ ×
   cores_per_blockdim` so the threshold tracks platform scaling. The
   peer-idle check reads `core_trackers_[t]` (plain 8-byte load on a
   rarely-contended line), deliberately avoiding
   `ready_queues[s].size()` whose two atomic loads against producer/
   consumer cache lines were measurably slow when sampled in the
   swimlane queue-depth instrumentation.
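
A minimal sketch of the overflow gate from item 2, under stated assumptions: `LocalBuf`, `SharedQueue`, `per_thread_budget`, and `spill_overflow` are hypothetical stand-ins, the shared queue is a plain vector rather than a lock-free queue, and the peer-idle check is reduced to a boolean supplied by the caller. Only the budget formula and the "spill the trailing excess" behaviour are taken from the text.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using TaskId = uint32_t;

// Hypothetical stand-in for one thread's local_bufs[shape].
struct LocalBuf {
    TaskId tasks[512];
    uint32_t count = 0;
};

// Hypothetical stand-in for the shared ready queue of one shape.
struct SharedQueue {
    std::vector<TaskId> items;
    void push_batch(const TaskId* first, uint32_t n) {
        items.insert(items.end(), first, first + n);
    }
};

// Budget formula from the text:
// PLATFORM_MAX_BLOCKDIM / active_sched_threads_ x cores_per_blockdim.
inline uint32_t per_thread_budget(uint32_t max_blockdim,
                                  uint32_t active_threads,
                                  uint32_t cores_per_blockdim) {
    assert(active_threads > 0);
    return max_blockdim / active_threads * cores_per_blockdim;
}

// Spill only when the local buffer exceeds the budget AND a peer has idle
// cores. Returns the number of tasks moved to the shared queue.
inline uint32_t spill_overflow(LocalBuf& local, SharedQueue& shared,
                               uint32_t budget, bool peer_has_idle_cores) {
    if (local.count <= budget || !peer_has_idle_cores) return 0;
    uint32_t excess = local.count - budget;
    shared.push_batch(local.tasks + budget, excess);  // trailing excess only
    local.count = budget;  // O(1): truncate the tail, the kept head is not moved
    return excess;
}
```

With an illustrative budget of 8 and 50 locally buffered tasks, this moves the last 42 tasks to the shared queue and leaves the first 8 in place; with no idle peer it leaves the buffer untouched.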

Measurement on a2a3 onboard (qwen3 decode_layer level 4 swimlane,
n=8 runs):

| Metric                                                  | Prior (per-claim only) | This PR |
| ------------------------------------------------------- | ---------------------- | ------- |
| Per-thread first-wave dt span (median)                  | ~6 µs                  | 0 µs    |
| Per-thread first-wave st span (median)                  | ~6 µs                  | ~60 ns  |
| Cross-thread first-dispatch stagger (median)            | 8.78 µs                | 1.92 µs |
| `spmd_sync_start_stress` × 10                           | 9/10 (1 flake)         | 10/10   |
| Wall (median)                                           | 893 µs                 | 902.9 µs (within noise) |

The prepare_subtask_to_core / publish_subtask_to_core split — and
the PublishHandle plumbing it enables — are kept. The investigation
doc at `docs/investigations/2026-06-cross-task-batched-publish.md`
is updated from "dropped" to "shipped with sync_start exclusion"
with the revised measurement table.
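
The sync_start gating from item 1, layered on the prepare/publish split, might look roughly like the sketch below. The `Task`, `Publisher`, and `dispatch_batch` types and signatures are assumptions made for illustration; the barrier and the MMIO writes are modelled as a counter and a vector so the behaviour can be observed, and the drain-mode path is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Task {
    uint32_t id;
    bool sync_start;
    bool requires_sync_start() const { return sync_start; }
};

struct PublishHandle {
    uint32_t task_id;
};

// Hypothetical publisher: prepare() stages a handle, flush_publish() issues
// one barrier and then publishes every staged handle.
struct Publisher {
    std::vector<PublishHandle> pending;  // prepared, not yet published
    std::vector<uint32_t> published;     // stand-in for MMIO writes, in order
    uint32_t barriers = 0;               // stand-in for the number of wmb() calls

    void prepare(const Task& t) { pending.push_back(PublishHandle{t.id}); }

    void flush_publish() {
        if (pending.empty()) return;  // nothing in flight: no-op
        ++barriers;
        for (const PublishHandle& h : pending) published.push_back(h.task_id);
        pending.clear();
    }
};

// Hoist the flush across the whole batch unless some task requires
// sync_start; in that case fall back to one flush per task.
inline void dispatch_batch(Publisher& pub, const std::vector<Task>& batch) {
    bool any_sync_start = false;
    for (const Task& t : batch) any_sync_start |= t.requires_sync_start();

    for (const Task& t : batch) {
        pub.prepare(t);
        if (any_sync_start) pub.flush_publish();  // per-task fallback
    }
    pub.flush_publish();  // hoisted flush; no-op if already drained per task
    assert(pub.pending.empty());
}
```

For a three-task batch with no sync_start task this issues a single barrier, and with one sync_start task it issues three, which is the time separation the per-task path is meant to preserve.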

Co-authored-by: Chao Wang <wcwxyy@gmail.com>
Co-authored-by: poursoul <poursoul@126.com>