Add: paged attention unroll scene test with 4D input shapes #2

Open: chenshengxin2026 wants to merge 14 commits into main from add-paged-attention-unroll-4dims-st

@chenshengxin2026 (Owner)

Summary

  • New paged_attention_unroll_4dims scene test under tensormap_and_ringbuffer runtime
  • Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim) instead of flattened 2D
  • 6 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV), AIC/AIV hub stubs
  • Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
  • Golden wraps shared paged_attention_golden with 4D reshape adapter
  • Three test cases covering varying batch/heads/head_dim at production scale (bfloat16)
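
A minimal sketch of what "online softmax accumulation" means arithmetically, assuming the usual running-max / running-sum formulation. It is pure Python for illustration only; the function names and `block_size` are hypothetical, and how the PR divides this work between `softmax_prepare` and `online_update` is not shown here.

```python
import math


def attention_reference(scores, values):
    """softmax(scores) applied to values, computed over all keys at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_online(scores, values, block_size):
    """Same result, consuming keys block by block with running statistics."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # sum of exp(score - running_max) so far
    acc = [0.0] * dim         # unnormalised weighted sum of values
    for start in range(0, len(scores), block_size):
        blk_scores = scores[start:start + block_size]
        blk_values = values[start:start + block_size]
        new_max = max(running_max, max(blk_scores))
        # Rescale earlier partial results to the new max (0.0 on first block).
        scale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in blk_scores]
        running_sum = running_sum * scale + sum(probs)
        acc = [a * scale + sum(p * v[d] for p, v in zip(probs, blk_values))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because every block rescales the previous accumulator by `exp(old_max - new_max)`, the blockwise result matches the one-shot softmax up to floating-point rounding, regardless of the block size.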

Testing

  • Simulation tests pass
  • Hardware tests pass

@chenshengxin2026 force-pushed the add-paged-attention-unroll-4dims-st branch from 2125ab0 to d8ccf87 (March 27, 2026 08:35)
hw-native-sys-bot and others added 11 commits April 13, 2026 11:23
Update all references in GitHub workflow skills, issue templates,
and shared library docs to reflect the repo transfer.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
Two independent fixes to orchestration SO handling on AICPU:

1. Orch SO file creation split by platform.  mkstemps
   (libdevice_orch_XXXXXX.so) ensures per-call uniqueness on sim where
   multiple workers may share a process, but is not always available
   on AICPU device libc.  Added platform interface create_orch_so_file
   so sim uses mkstemps + fchmod(0755) and onboard uses pid-based
   naming + open(...,0755) — sufficient since only one runtime runs
   per device process.

2. Deferred dlclose/unlink from run() to deinit().  Closing the SO
   handle at the end of run() made it impossible to re-run the
   orchestrator through repeated calls into the same executor.  The
   handle is kept until deinit, which then unlinks the file.

Applied to a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, and
a5 tensormap_and_ringbuffer.
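
The two naming strategies in item 1 can be illustrated with a Python analogue. This is a sketch, not the C/C++ code from the commit: `tempfile.mkstemp` stands in for `mkstemps`, the function names are hypothetical, and the exact pid-based file name is an assumption.

```python
import os
import tempfile


def create_orch_so_file_sim(directory):
    """Sim-style: a fresh random name on every call, then force mode 0755."""
    fd, path = tempfile.mkstemp(prefix="libdevice_orch_", suffix=".so",
                                dir=directory)
    os.fchmod(fd, 0o755)
    return fd, path


def create_orch_so_file_onboard(directory):
    """Onboard-style: pid-based name (format assumed for illustration).

    Unique only while a single runtime runs per process, which is the
    condition the commit message relies on.
    """
    path = os.path.join(directory, f"libdevice_orch_{os.getpid()}.so")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    return fd, path
```

Two sim-style calls in one process yield two different paths, while two onboard-style calls yield the same path and the second call truncates the first file.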

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
- route controlled PTO2 fatal status through aicpu_execute so platform
  runners no longer read tensormap_and_ringbuffer shared memory
- keep the PTO2 status helpers in runtime-local common code and preserve
  host-side finalize handling in runtime_maker
- add UT coverage for fatal short-circuit/reporting paths and keep the
  explicit fatal ST on a2a3sim only
Remove the per-call `runner->create_thread([&]() { ... }).join()`
wrapper introduced by hw-native-sys#493.  Running the body directly on the caller
thread plus two `RAIIScopeGuard`s that clean up on scope exit restores
the pre-hw-native-sys#493 behaviour, without any behavioural difference from the
caller's perspective.

The wrapper was added in anticipation of parallel ChipWorker
execution with GIL-released nanobind bindings.  Because the GIL
release never landed, the caller still holds the GIL across
t.join(), so the wrapper buys no parallelism and is pure overhead.

- src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
- src/a5/platform/onboard/host/pto_runtime_c_api.cpp
- `tsd_guard` clears `g_runner_key` TSD on every exit path.
- `device_guard` (created after `ensure_device_set` succeeds) calls
  `reset_device_context()` on every exit path from that point onward,
  including the `init_runtime_impl` / `runner->run` error paths and
  the `catch(...)` unwind.  Previously the worker thread's
  destructor handled both; on the caller thread we must do it
  explicitly or leak CANN streams between successive run_runtime()
  calls in the same process (which broke the device_test
  batch_paged_attention case that runs multiple cases).
- `DeviceRunner::create_thread()` and `reset_device_context()`
  retained — `create_thread()` is still used by the profiling
  collector thread inside `device_runner.cpp`.
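
The guard ordering described above can be sketched in Python with `contextlib.ExitStack` as a stand-in for the C++ `RAIIScopeGuard`s. The step names mirror the commit message, but the function itself and its `fail_at` parameter are hypothetical and exist only to exercise the exit paths.

```python
from contextlib import ExitStack


def run_runtime(fail_at=None):
    """Run all steps on the caller thread; return (rc, log of what happened)."""
    log = []

    def step(name):
        if name == fail_at:
            raise RuntimeError(name)
        log.append(name)

    try:
        with ExitStack() as guards:
            # tsd_guard analogue: registered first, so it fires on every exit.
            guards.callback(log.append, "clear_tsd")
            step("ensure_device_set")
            # device_guard analogue: registered only once the device is set.
            guards.callback(log.append, "reset_device_context")
            step("init_runtime")
            step("run")
        rc = 0
    except RuntimeError:
        rc = -1  # stands in for the catch(...) unwind path
    return rc, log
```

Callbacks run in reverse registration order, so the device context is reset before the thread-specific key is cleared, and a failure in `ensure_device_set` skips the device reset entirely.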
…ive-sys#527)

CANN 8.5.1 defines `#define BLK BLK_Type()` in
__clang_cce_vector_intrinsics.h, which causes the device compiler
to expand the local `constexpr uint64_t BLK = 64` declaration into
invalid code. Rename to `blk_size` in both a2a3 and a5 tensor.h.

Fixes hw-native-sys#517
…ain protocol (hw-native-sys#501)

Replace the non-blocking ack check (load and return if not all acked) with a
spin-wait loop that blocks until all scheduler threads have set their bit in
drain_ack_mask. This eliminates the window where a non-elected thread returns
to the scheduler loop and resumes tracker writes while the drain worker already
has exclusive tracker access.

Remove drain_barrier_mask (the second atomic introduced as an intermediate step)
— the single spin-wait on drain_ack_mask is sufficient for the full-stop
guarantee. Reset detection uses drain_ack_mask bit-clear (release store on
insufficient resources), not drain_worker_elected which remains zero until after
the barrier completes.

Also fix drain_ack_mask reset ordering: use memory_order_release instead of
relaxed so the clearing store is visible to threads spinning on their own bit.
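
A rough Python illustration of the blocking ack barrier: each scheduler thread sets its bit and then waits until the mask is full before proceeding. Python has no `memory_order` controls, so a lock stands in for the acquire/release atomics here; the class and function names are hypothetical, and the reset path is omitted.

```python
import threading
import time


class DrainState:
    def __init__(self, num_threads):
        self.all_mask = (1 << num_threads) - 1
        self.ack_mask = 0
        self._lock = threading.Lock()  # stand-in for atomic RMW / loads

    def ack(self, idx):
        with self._lock:
            self.ack_mask |= 1 << idx

    def all_acked(self):
        with self._lock:
            return self.ack_mask == self.all_mask


def run_drain(num_threads):
    """Return the mask each thread observed once it passed the barrier."""
    state = DrainState(num_threads)
    seen = []

    def scheduler_thread(idx):
        state.ack(idx)
        # Spin until every thread has acked; nobody returns early.
        while not state.all_acked():
            time.sleep(0)  # yield so other threads can make progress
        seen.append(state.ack_mask)

    threads = [threading.Thread(target=scheduler_thread, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With the earlier non-blocking check, a thread could have left the loop while the mask was still partial; in this sketch every thread observes the full mask before it continues.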
No cpplint configuration exists in the repo (no CPPLINT.cfg, no CI
invocation), so these markers suppressed a linter that is not wired
up. Strip all trailing `// NOLINT(...)` comments and remove four
standalone NOLINT-only comment lines across the tree. clang-format
rewraps a handful of log format strings that were previously split
to satisfy cpplint line-length.

- 53 files touched across src/a2a3, src/a5, src/common
- 198 NOLINT markers removed
…w-native-sys#538)

- tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention: set
  PTO2_RING_HEAP to 1 GiB (2^30) via RUNTIME_ENV. The default 1024 B
  heap is too small for this scene's intermediate tensors and causes
  the test to fail on hardware.
- tests/ut/py/test_dist_worker/test_group_task: remove
  TestGroupParallel.test_group_wall_time. The wall-time assertion is
  flaky under scheduler jitter, and without it the test is redundant
  with TestGroupBasic.test_group_both_workers_execute.
…#539)

Break the monolithic distributed_level_runtime.md and rename
architecture.md so each doc has one audience and one scope.

- Rename architecture.md -> chip-level-arch.md (L2 single-chip scope)
- Slim distributed_level_runtime.md to level model + component overview;
  move internal details to the new per-component docs
- Add orchestrator.md: submit flow, Ring, TensorMap, Scope, state machine
- Add scheduler.md: wiring/ready/completion queues, dispatch loop
- Add worker-manager.md: WorkerManager + WorkerThread,
  THREAD/PROCESS modes, fork + mailbox mechanics
- Add task-flow.md: Callable / TaskArgs / CallConfig handles,
  IWorker interface, L2 ABI edge, end-to-end walkthrough
- Update README, .claude/rules/architecture.md, callable.h doc comment
  to the new filenames
…s#536)

Split the L3 orchestration surface from Worker into a dedicated
Orchestrator class whose lifetime is scoped to one run() call. User
orch fn signature changes to `fn(orch, args)` and calls
orch.submit_next_level / submit_next_level_group / submit_sub /
submit_sub_group. scope_begin/scope_end/drain become private details of
Worker.run() (order: scope_begin -> orch_fn -> scope_end -> drain).

- Add python/simpler/orchestrator.py
- Remove Worker.submit, Worker.scope, and _ScopeGuard from the public
  API; Worker keeps register/init/close/run(Task)
- Migrate ut tests (test_host_worker, test_multi_worker, test_group_task)
  and st tests (test_l3_dependency, test_l3_group) to the new API
- Update simpler_setup/scene_test.py wrapper to pass the Orchestrator
- Refresh docs/distributed_level_runtime.md

C++ DistWorker.submit / scope_* remain internal; a later step of the
hierarchical runtime plan will take them private or re-expose them via
nanobind on Orchestrator directly.
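
The call order inside `Worker.run()` can be sketched with mock classes. Everything below is a hypothetical stand-in that only records a trace: the real `Worker.run` takes a Task, and the real `submit_next_level` signature is not shown in this commit message.

```python
class Orchestrator:
    """Mock orchestration surface; lives for a single run() call."""

    def __init__(self, trace):
        self._trace = trace

    def submit_next_level(self, name):
        # Assumed signature, for illustration only.
        self._trace.append(f"submit_next_level:{name}")


class Worker:
    """Mock worker that keeps scope_begin/scope_end/drain private to run()."""

    def __init__(self):
        self.trace = []

    def run(self, orch_fn, args):
        orch = Orchestrator(self.trace)  # created per run, not reused
        self.trace.append("scope_begin")
        orch_fn(orch, args)              # user function: fn(orch, args)
        self.trace.append("scope_end")
        self.trace.append("drain")


def my_orch(orch, args):
    for name in args:
        orch.submit_next_level(name)
```

Calling `Worker().run(my_orch, ["a", "b"])` records `scope_begin`, the two submissions, `scope_end`, and `drain` in that order, which is the sequence the commit message describes.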
…sys#542)

Direct `python tests/.../test_xxx.py` runs were failing on
`from simpler.task_interface import ...` because `python/` was only on
sys.path under pytest. Make `simpler` (and its dependencies) installable
via wheel so any entry point can find it.

- pyproject.toml: include `python/simpler` in `wheel.packages`; exclude
  the 4 files duplicated with `simpler_setup/` (`elf_parser`,
  `kernel_compiler`, `runtime_compiler`, `toolchain`) so the
  authoritative copies under `simpler_setup/` win in wheel mode while
  the source-tree copies stay reachable for un-migrated callers
- pyproject.toml: enable `editable.rebuild` + `editable.verbose`, set
  `build-dir = "build/{wheel_tag}"` so editable installs auto-rebuild
  the nanobind module on import
- CMakeLists.txt: install `src/` and `build/lib/` under
  `simpler_setup/_assets/` so wheel users get headers + pre-built
  runtime binaries without a source checkout
- simpler_setup/environment.py: rewrite `PROJECT_ROOT` to auto-resolve
  via `importlib.resources` — picks `_assets/` when present (wheel),
  falls back to repo root (source tree / editable). Drop
  `ensure_python_path` helper now that `simpler` is importable
  directly
- simpler_setup/scene_test.py: remove 6 `ensure_python_path()`
  callsites and the import; redundant after packaging
- examples/scripts/run_example.py: drop the `python/` sys.path insert
  for the same reason
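
The `PROJECT_ROOT` fallback can be sketched as follows. This is an assumption-laden illustration: the function name and parameters are hypothetical, and plain `pathlib` paths are used so the logic can be exercised without an installed package (the real code is described as going through `importlib.resources`).

```python
from pathlib import Path


def resolve_project_root(package_dir, repo_root):
    """Pick `_assets/` when it exists (wheel install), else the repo root.

    `package_dir` would be the directory of the installed package;
    `repo_root` is the source-tree / editable fallback.
    """
    assets = Path(package_dir) / "_assets"
    return assets if assets.is_dir() else Path(repo_root)
```

The same function covers both layouts: a wheel install ships `_assets/` next to the package, while a source checkout or editable install has no such directory and falls through to the repository root.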

Documentation:

- docs/developer-guide.md: update Directory Structure, add Path
  resolution + Python package layout sections, add Editable rebuild
  notes, fix dynamic kernel compilation example, refresh Disk layout
  with `build/{wheel_tag}/`
- docs/getting-started.md: switch broken `simpler.runtime_compiler` /
  `runtime_builder` imports to `simpler_setup.*`
- .claude/rules/architecture.md: add Python Package Layout table,
  update Build System Lookup with simpler_setup paths and PROJECT_ROOT
  entry
- .claude/rules/venv-isolation.md: document `--no-build-isolation`
  requirement and editable workflow
@ChaoWao force-pushed the add-paged-attention-unroll-4dims-st branch 3 times, most recently from 046acec to b355d40 (April 14, 2026 03:22)
…w-native-sys#544)

Disallow full-gap allocation when `tail - top == alloc_size` to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging.
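
A toy ring allocator showing why a full-gap allocation is ambiguous. The sketch assumes `top` is the next allocation offset, `tail` is the oldest live offset, and `top == tail` means "empty"; the class and method names are hypothetical, not the runtime's actual allocator.

```python
class RingAllocator:
    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0    # next allocation offset
        self.tail = 0   # oldest live offset; top == tail means empty

    def free_to(self, new_tail):
        """Release everything older than new_tail."""
        self.tail = new_tail

    def alloc(self, size):
        """Return an offset, or None if the request cannot be satisfied."""
        if size <= 0:
            return None
        if self.top >= self.tail:
            end = self.top + size
            # Fits before the end, as long as top does not land on tail.
            if end < self.capacity or (end == self.capacity and self.tail > 0):
                offset = self.top
                self.top = end % self.capacity
                return offset
            # Wrap around: strictly less than tail, never the full gap.
            if size < self.tail:
                self.top = size
                return 0
            return None
        # Already wrapped: the gap is [top, tail). Filling it exactly would
        # make top == tail, which is indistinguishable from an empty ring.
        if size < self.tail - self.top:
            offset = self.top
            self.top += size
            return offset
        return None
```

For example, with `capacity=100`, `top=20`, and `tail=40`, a request for 20 bytes is refused while a request for 19 bytes succeeds, so `top` can never catch up with `tail` while data is live.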

Made-with: Cursor
@ChaoWao force-pushed the add-paged-attention-unroll-4dims-st branch from b355d40 to 1da6433 (April 14, 2026 03:38)
hw-native-sys-bot and others added 2 commits April 14, 2026 15:15
…ocs (hw-native-sys#546)

The 4 build-time modules (runtime_compiler, kernel_compiler, elf_parser,
toolchain) are excluded from the simpler wheel — simpler_setup is the
authoritative source. Migrate all imports to simpler_setup.

Update testing.md: fix stale scene test example to current
CALLABLE/TaskArgsBuilder API and add migration guide section.

Fix self-hosted CI jobs to use per-run venv with pip upgrade, preventing
package conflicts on shared runners and old pip failures.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
- New paged_attention_unroll_4dims test under tensormap_and_ringbuffer
- Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim)
- 4 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV)
- Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
- Uses SceneTestCase pattern with shared paged_attention_golden
- Three test cases: varying batch/heads/head_dim at production scale (bfloat16)

Co-authored-by: chenshengxin <hw_chenshengxin@163.com>
@hw-native-sys-bot force-pushed the add-paged-attention-unroll-4dims-st branch from 1da6433 to c0c9bd3 (April 14, 2026 07:47)
6 participants