
Add: opt-in CHIP_TIMING run_prepared() wall-clock breakdown #5

Open
ChaoZheng109 wants to merge 1 commit into main from feat/dfx-chip-timing


@ChaoZheng109 (Owner)

Break a single run_prepared() into per-stage host and device durations, reconciled against the existing host_wall_ns / device_wall_ns numbers (issue hw-native-sys#1012), without intrusive KernelArgs or scheduler changes.

  • Host: emit [CHIP_TIMING] B/E (begin/end) spans around the run_prepared stages (attach, bind_callable, bind_impl, runner_run, validate), plus the enclosing envelope and the host_wall / device_wall baselines, on the V4 log tier (off at the default V5 threshold). Implemented in c_api_shared.cpp for both onboard and sim.
  • Device: stamp aicpu_wall / aicpu_exec boundaries from get_sys_cnt in the onboard AICPU kernel (a5, a2a3); sim mirrors them in its in-process device_runner execution loop.
  • The timestamp rides in the message body (ns), independent of the log prefix, matching tools/benchmark_rounds.sh's device-log convention.
  • Add simpler_setup/tools/chip_timing.py: parse the logs into an indented per-run tree, reconciled against the wall baselines.
  • Add tests/ut/py/test_chip_timing.py and docs/dfx/chip-timing.md.
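
For readers unfamiliar with B/E span logs, the parsing step can be sketched roughly as follows. This is not the code in simpler_setup/tools/chip_timing.py: the line shape `[CHIP_TIMING] B|E <stage> <timestamp_ns>` and the names `Span`, `parse_spans` and `render` are assumptions made for illustration, and the actual format is whatever docs/dfx/chip-timing.md specifies. The sketch only shows the general technique of pairing begin/end markers with a stack and printing the result as an indented tree.

```python
import re
from dataclasses import dataclass, field

# Hypothetical line shape: "<any prefix> [CHIP_TIMING] B|E <stage> <timestamp_ns>".
# The timestamp is read from the message body, never from the log prefix.
SPAN_RE = re.compile(r"\[CHIP_TIMING\]\s+([BE])\s+(\S+)\s+(\d+)")


@dataclass
class Span:
    name: str
    begin_ns: int
    end_ns: int = -1
    children: list = field(default_factory=list)

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.begin_ns


def parse_spans(lines):
    """Pair B/E markers with a stack and return the top-level spans."""
    roots, stack = [], []
    for line in lines:
        match = SPAN_RE.search(line)
        if match is None:
            continue  # unrelated log line
        kind, name, ts = match.group(1), match.group(2), int(match.group(3))
        if kind == "B":
            span = Span(name, ts)
            # Nest under the currently open span, if any.
            (stack[-1].children if stack else roots).append(span)
            stack.append(span)
        else:
            if not stack or stack[-1].name != name:
                raise ValueError(f"unbalanced E marker for {name!r}")
            stack.pop().end_ns = ts
    if stack:
        raise ValueError(f"missing E marker for {stack[-1].name!r}")
    return roots


def render(spans, depth=0):
    """Format spans as an indented tree, one line per span."""
    out = []
    for span in spans:
        out.append(f"{'  ' * depth}{span.name}: {span.duration_ns} ns")
        out.extend(render(span.children, depth + 1))
    return out
```

Calling `"\n".join(render(parse_spans(log_lines)))` on a list of log lines would give one indented line per stage. Rejecting unbalanced markers, rather than guessing, keeps a truncated log from producing a misleading tree.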

Verified: the a2a3 and a5 sim examples reconcile exactly; at the default log level the feature is off and emits no lines; 22 unit tests pass.
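
One way to read "reconcile" is sketched below: sum the per-stage durations and compare the total with a wall-clock baseline, reporting any unattributed remainder. The function name `reconcile`, the `tolerance_ns` parameter and the rule itself are assumptions for illustration. The check that chip_timing.py actually performs may differ.

```python
def reconcile(stage_ns, baseline_ns, tolerance_ns=0):
    """Compare summed stage durations against a wall-clock baseline.

    Returns (total_ns, residual_ns, ok). residual_ns is the baseline time
    not attributed to any stage; a negative value means the stages overlap
    or overrun the baseline.
    """
    total_ns = sum(stage_ns.values())
    residual_ns = baseline_ns - total_ns
    return total_ns, residual_ns, abs(residual_ns) <= tolerance_ns
```

With `tolerance_ns=0` the check passes only when the stages account for the baseline exactly. A non-zero tolerance would suit hardware runs where a few nanoseconds fall between spans.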

ChaoZheng109 force-pushed the feat/dfx-chip-timing branch from ec94445 to b68d6d8 on June 27, 2026 04:10