Support: buffer AICPU stall diagnostics in device memory and print on host by ChaoZheng109 · Pull Request #2 · ChaoZheng109/simpler

ChaoZheng109 · 2026-04-13T13:52:50Z

DEV_ALWAYS is broken on a5 hardware, silencing the stall diagnostic logs. Replace with a host-allocated device memory buffer that AICPU writes to atomically, copied back and printed by the host after execution.

- Add DevLogBuffer struct (512 entries x 128 bytes) in dev_log_buffer.h
- Add dev_log_buffer_dev_ptr_ field and accessors to Runtime
- Allocate and zero-init the buffer in init_runtime_impl, record with host_ptr=nullptr so copy-back loop skips it; device_free handles cleanup
- Copy buffer from device and printf each entry in validate_runtime_impl
- Replace all DEV_ALWAYS calls in the stall block with dev_buf_log()
- Remove thread_idx==0 guard so all scheduler threads contribute log entries

… host

Cherry-picked from ChaoZheng109#2 and adapted for stable branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…spill (hw-native-sys#989)

Combines two scheduler-side dispatch improvements that together hit the two metrics tracked in the PR review thread:

1. **Per-thread first-to-last AICore start span**: bring it back to the hardware floor of ~60 ns. The original v3 attempt to hoist `handles[]`, `wmb()`, and the publish loop across distinct tasks in one pop was reverted in the 2026-06-06 investigation because it broke `spmd_sync_start_stress`: bursting all prior tasks' MMIO writes immediately before `enter_drain_mode()` collapsed the head-start that lets the surrounding completion loop catch up on FINs in the drain's resource-insufficient retry window, and the loop tripped the 1 s op timeout ~40 % of runs.

   This commit ships the follow-up #2 the investigation left for later: gate the cross-task hoist on the popped batch carrying no `requires_sync_start()` task. When the batch contains a sync_start task, fall back to per-task `flush_publish()` (one wmb + one publish per task) so prior tasks land on AICore with the same time separation the per-claim-only design had. The check is one mask-bit read per popped task, which is trivial. The drain-entry path still calls `flush_publish()` before `enter_drain_mode()` so any in-flight handles get out; when `any_sync_start == true` that flush is already drained per-task and the entry flush is a no-op.

2. **Cross-thread first-dispatch stagger**: bring the 3-scheduler-thread startup delay back to sub-microsecond. When `release_fanin_and_check_ready` fast-paths newly-ready consumers into the releasing thread's `local_bufs[shape]`, batch releases (e.g. attn_fence → 50 out_proj consumers) overshoot this thread's slot budget by 6×, and peers spin on an empty shared queue until the producing thread's `flush_local_bufs()` between IDLE and PENDING exposes the overflow.

   This commit adds an overflow gate at the top of `dispatch_ready_tasks`: if `local_bufs[s].count` exceeds the per-shape per-thread block budget AND a peer has idle cores in that shape, `push_batch` the trailing excess to the shared queue. O(1) count decrement, no memmove. Capacity derives from `PLATFORM_MAX_BLOCKDIM / active_sched_threads_ × cores_per_blockdim` so the threshold tracks platform scaling. The peer-idle check reads `core_trackers_[t]` (plain 8-byte load on a rarely-contended line), deliberately avoiding `ready_queues[s].size()` whose two atomic loads against producer/consumer cache lines were measurably slow when sampled in the swimlane queue-depth instrumentation.

Measurement on a2a3 onboard (qwen3 decode_layer level 4 swimlane, n=8 runs):

| Metric | Prior (per-claim only) | This PR |
| ------------------------------------------------------- | ---------------------- | ------- |
| Per-thread first-wave dt span (median) | ~6 µs | 0 µs |
| Per-thread first-wave st span (median) | ~6 µs | ~60 ns |
| Cross-thread first-dispatch stagger (median) | 8.78 µs | 1.92 µs |
| `spmd_sync_start_stress` × 10 | 9/10 (1 flake) | 10/10 |
| Wall (median) | 893 µs | 902.9 µs (within noise) |

The prepare_subtask_to_core / publish_subtask_to_core split, and the PublishHandle plumbing it enables, are kept. The investigation doc at `docs/investigations/2026-06-cross-task-batched-publish.md` is updated from "dropped" to "shipped with sync_start exclusion" with the revised measurement table.

Co-authored-by: Chao Wang <wcwxyy@gmail.com>
Co-authored-by: poursoul <poursoul@126.com>
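The overflow gate from item 2 of that commit message can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the scheduler source: `LocalBuf`, `SharedQueue`, `per_thread_capacity`, and `spill_overflow` are hypothetical stand-ins, tasks are reduced to plain `int` ids, and the peer-idle check is passed in as a boolean instead of being read from `core_trackers_`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread, per-shape buffer of ready task ids.
struct LocalBuf {
    static constexpr std::size_t kMax = 256;
    int tasks[kMax] = {};
    std::size_t count = 0;
};

// Hypothetical shared ready queue; only the batch push is modelled.
struct SharedQueue {
    std::vector<int> items;
    void push_batch(const int* src, std::size_t n) {
        items.insert(items.end(), src, src + n);
    }
};

// Capacity formula quoted in the commit message:
// PLATFORM_MAX_BLOCKDIM / active_sched_threads_ x cores_per_blockdim.
inline std::size_t per_thread_capacity(std::size_t platform_max_blockdim,
                                       std::size_t active_sched_threads,
                                       std::size_t cores_per_blockdim) {
    return platform_max_blockdim / active_sched_threads * cores_per_blockdim;
}

// Overflow gate: when the local buffer holds more than this thread's budget
// and a peer has idle cores, hand the trailing excess to the shared queue.
// Returns the number of tasks spilled.
inline std::size_t spill_overflow(LocalBuf& local, SharedQueue& shared,
                                  std::size_t capacity, bool peer_has_idle_cores) {
    if (local.count <= capacity || !peer_has_idle_cores) return 0;
    std::size_t excess = local.count - capacity;
    shared.push_batch(local.tasks + capacity, excess);
    local.count = capacity;  // O(1) count decrement, no memmove of kept tasks
    return excess;
}
```

Spilling from the tail is what keeps the operation cheap: the entries this thread keeps stay in place at the front of the array, so only the count changes.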

ChaoZheng109 wants to merge 1 commit into main from logbuf
