Support: buffer AICPU stall diagnostics in device memory and print on host #2
Open
ChaoZheng109 wants to merge 1 commit into
… host

DEV_ALWAYS is broken on a5 hardware, silencing the stall diagnostic logs. Replace with a host-allocated device memory buffer that AICPU writes to atomically, copied back and printed by the host after execution.

- Add DevLogBuffer struct (512 entries x 128 bytes) in dev_log_buffer.h
- Add dev_log_buffer_dev_ptr_ field and accessors to Runtime
- Allocate and zero-init the buffer in init_runtime_impl, record with host_ptr=nullptr so copy-back loop skips it; device_free handles cleanup
- Copy buffer from device and printf each entry in validate_runtime_impl
- Replace all DEV_ALWAYS calls in the stall block with dev_buf_log()
- Remove thread_idx==0 guard so all scheduler threads contribute log entries
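The buffer scheme described above can be sketched roughly as follows. This is a minimal host-only illustration, not the code from the PR: only the struct name `DevLogBuffer`, the function name `dev_buf_log()`, and the 512 x 128 sizing come from the description. The field names, the `next` counter, the use of `std::atomic`, and the `dev_buf_print()` helper are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdarg>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Sizes from the PR description: 512 entries x 128 bytes.
constexpr uint32_t kDevLogEntries = 512;
constexpr uint32_t kDevLogEntryBytes = 128;

// Hypothetical layout; the real struct in dev_log_buffer.h may differ.
struct DevLogBuffer {
    std::atomic<uint32_t> next{0};  // number of slots claimed so far (assumed field)
    char entries[kDevLogEntries][kDevLogEntryBytes];
};

// Writer side (AICPU in the PR): claim a slot atomically, then format into it.
// Returns false when the buffer is full and the message is dropped.
inline bool dev_buf_log(DevLogBuffer* buf, const char* fmt, ...) {
    assert(buf != nullptr);
    // fetch_add hands each caller a distinct slot index, so concurrent
    // scheduler threads never write into the same entry.
    uint32_t slot = buf->next.fetch_add(1, std::memory_order_relaxed);
    if (slot >= kDevLogEntries) return false;
    va_list ap;
    va_start(ap, fmt);
    std::vsnprintf(buf->entries[slot], kDevLogEntryBytes, fmt, ap);  // always NUL-terminates
    va_end(ap);
    return true;
}

// Reader side (host, after copying the buffer back): print every non-empty
// claimed entry and return how many were printed.
inline uint32_t dev_buf_print(const DevLogBuffer& buf) {
    uint32_t claimed = buf.next.load(std::memory_order_acquire);
    uint32_t n = claimed < kDevLogEntries ? claimed : kDevLogEntries;
    uint32_t printed = 0;
    for (uint32_t i = 0; i < n; ++i) {
        if (std::strlen(buf.entries[i]) == 0) continue;  // claimed but never written
        std::printf("%s\n", buf.entries[i]);
        ++printed;
    }
    return printed;
}
```

Because the buffer starts zeroed, a counter of 0 and empty entries are the natural initial state, which is why the sketch treats an empty string as "nothing logged in this slot".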
zhangqi-chen added a commit to zhangqi-chen/simpler that referenced this pull request Apr 14, 2026
… host

DEV_ALWAYS is broken on a5 hardware, silencing the stall diagnostic logs. Replace with a host-allocated device memory buffer that AICPU writes to atomically, copied back and printed by the host after execution.

- Add DevLogBuffer struct (512 entries x 128 bytes) in dev_log_buffer.h
- Add dev_log_buffer_dev_ptr_ field and accessors to Runtime
- Allocate and zero-init the buffer in init_runtime_impl, record with host_ptr=nullptr so copy-back loop skips it; device_free handles cleanup
- Copy buffer from device and printf each entry in validate_runtime_impl
- Replace all DEV_ALWAYS calls in the stall block with dev_buf_log()
- Remove thread_idx==0 guard so all scheduler threads contribute log entries

Cherry-picked from ChaoZheng109#2 and adapted for stable branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChaoZheng109 pushed a commit that referenced this pull request Jun 9, 2026
…spill (hw-native-sys#989)

Combines two scheduler-side dispatch improvements that together hit the two metrics tracked in the PR review thread:

1. **Per-thread first-to-last AICore start span**: bring it back to the hardware floor of ~60 ns. The original v3 attempt to hoist `handles[]`, `wmb()`, and the publish loop across distinct tasks in one pop was reverted in the 2026-06-06 investigation because it broke `spmd_sync_start_stress`: bursting all prior tasks' MMIO writes immediately before `enter_drain_mode()` collapsed the head-start that lets the surrounding completion loop catch up on FINs in the drain's resource-insufficient retry window, and the loop tripped the 1 s op timeout in ~40 % of runs.

   This commit ships the follow-up #2 the investigation left for later: gate the cross-task hoist on the popped batch carrying no `requires_sync_start()` task. When the batch contains a sync_start task, fall back to per-task `flush_publish()` (one wmb + one publish per task) so prior tasks land on AICore with the same time separation the per-claim-only design had. The check is one mask-bit read per popped task, which is trivial. The drain-entry path still calls `flush_publish()` before `enter_drain_mode()` so any in-flight handles get out; when `any_sync_start == true` that flush is already drained per-task and the entry flush is a no-op.

2. **Cross-thread first-dispatch stagger**: bring the 3-scheduler-thread startup delay back to sub-microsecond. When `release_fanin_and_check_ready` fast-paths newly-ready consumers into the releasing thread's `local_bufs[shape]`, batch releases (e.g. attn_fence → 50 out_proj consumers) overshoot this thread's slot budget by 6×, and peers spin on an empty shared queue until the producing thread's `flush_local_bufs()` between IDLE and PENDING exposes the overflow.

   This commit adds an overflow gate at the top of `dispatch_ready_tasks`: if `local_bufs[s].count` exceeds the per-shape per-thread block budget AND a peer has idle cores in that shape, `push_batch` the trailing excess to the shared queue. O(1) count decrement, no memmove. Capacity derives from `PLATFORM_MAX_BLOCKDIM / active_sched_threads_ × cores_per_blockdim` so the threshold tracks platform scaling. The peer-idle check reads `core_trackers_[t]` (plain 8-byte load on a rarely-contended line), deliberately avoiding `ready_queues[s].size()` whose two atomic loads against producer/consumer cache lines were measurably slow when sampled in the swimlane queue-depth instrumentation.

Measurement on a2a3 onboard (qwen3 decode_layer level 4 swimlane, n=8 runs):

| Metric | Prior (per-claim only) | This PR |
| ------ | ---------------------- | ------- |
| Per-thread first-wave dt span (median) | ~6 µs | 0 µs |
| Per-thread first-wave st span (median) | ~6 µs | ~60 ns |
| Cross-thread first-dispatch stagger (median) | 8.78 µs | 1.92 µs |
| `spmd_sync_start_stress` × 10 | 9/10 (1 flake) | 10/10 |
| Wall (median) | 893 µs | 902.9 µs (within noise) |

The prepare_subtask_to_core / publish_subtask_to_core split, and the PublishHandle plumbing it enables, are kept. The investigation doc at `docs/investigations/2026-06-cross-task-batched-publish.md` is updated from "dropped" to "shipped with sync_start exclusion" with the revised measurement table.

Co-authored-by: Chao Wang <wcwxyy@gmail.com>
Co-authored-by: poursoul <poursoul@126.com>
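The overflow gate in item 2 can be illustrated with a small standalone sketch. Everything here is a simplified stand-in written for this example: `LocalBuf`, `SharedQueue`, `per_thread_budget()`, and `spill_overflow()` are hypothetical names, the peer-idle check is reduced to a boolean argument, and task ids are plain integers. Only the budget formula and the "spill the trailing excess by decrementing the count" idea come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalCap = 256;  // illustrative capacity, not from the source

// Hypothetical per-thread, per-shape buffer of ready task ids.
struct LocalBuf {
    int tasks[kLocalCap];
    std::size_t count = 0;
};

// Hypothetical shared ready queue that peers pop from.
struct SharedQueue {
    std::vector<int> items;
    void push_batch(const int* first, std::size_t n) {
        items.insert(items.end(), first, first + n);
    }
};

// Budget formula from the commit message:
// PLATFORM_MAX_BLOCKDIM / active_sched_threads_ x cores_per_blockdim.
inline std::size_t per_thread_budget(std::size_t max_blockdim,
                                     std::size_t active_sched_threads,
                                     std::size_t cores_per_blockdim) {
    assert(active_sched_threads > 0);
    return max_blockdim / active_sched_threads * cores_per_blockdim;
}

// If the local buffer exceeds the budget and a peer has idle cores, hand the
// trailing excess to the shared queue. The local side only lowers its count,
// so nothing that remains in the buffer is moved.
inline std::size_t spill_overflow(LocalBuf& local, SharedQueue& shared,
                                  std::size_t budget, bool peer_has_idle_core) {
    if (local.count <= budget || !peer_has_idle_core) return 0;
    std::size_t excess = local.count - budget;
    shared.push_batch(local.tasks + budget, excess);
    local.count = budget;
    return excess;
}
```

Spilling from the tail keeps the head of the local buffer untouched, so the releasing thread still dispatches its own share in the original order while peers pick up the remainder from the shared queue.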