Add: paged attention unroll scene test with 4D input shapes by chenshengxin2026 · Pull Request #2 · chenshengxin2026/simpler

chenshengxin2026 · 2026-03-27T06:12:51Z

Summary

New paged_attention_unroll_4dims scene test under tensormap_and_ringbuffer runtime
Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim) instead of flattened 2D
6 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV), AIC/AIV hub stubs
Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
Golden wraps shared paged_attention_golden with 4D reshape adapter
Three test cases covering varying batch/heads/head_dim at production scale (bfloat16)
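
The online softmax accumulation mentioned in the summary can be sketched in plain Python. This is only an illustration of the general technique (running max, running sum, rescaled accumulator), not the code of the softmax_prepare/online_update kernels; the function names and the list-based data layout are assumptions made for the example.

```python
import math

def full_softmax_weighted_sum(scores, values):
    """Reference: softmax(scores) @ values computed in one pass."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def online_softmax_weighted_sum(score_blocks, value_blocks):
    """Same result, but consuming one block of scores/values at a time."""
    running_max = -math.inf   # max score seen so far
    running_sum = 0.0         # sum of exp(score - running_max) so far
    acc = None                # unnormalised weighted sum of values
    for scores, values in zip(score_blocks, value_blocks):
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        if acc is None:
            acc = [0.0] * len(values[0])
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, values))
               for d, a in enumerate(acc)]
        running_sum = running_sum * scale + sum(weights)
        running_max = new_max
    return [a / running_sum for a in acc]
```

Splitting the key/value sequence into blocks and feeding them through the online version should give the same output as the one-pass reference, up to floating-point rounding.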

Testing

Simulation tests pass
Hardware tests pass

Update all references in GitHub workflow skills, issue templates, and shared library docs to reflect the repo transfer. Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>

Two independent fixes to orchestration SO handling on AICPU:

1. Orch SO file creation split by platform. mkstemps (libdevice_orch_XXXXXX.so) ensures per-call uniqueness on sim where multiple workers may share a process, but is not always available on AICPU device libc. Added platform interface create_orch_so_file so sim uses mkstemps + fchmod(0755) and onboard uses pid-based naming + open(...,0755), which is sufficient since only one runtime runs per device process.
2. Deferred dlclose/unlink from run() to deinit(). Closing the SO handle at the end of run() made it impossible to re-run the orchestrator through repeated calls into the same executor. The handle is kept until deinit, which then unlinks the file.

Applied to a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, and a5 tensormap_and_ringbuffer.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
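
The two file-creation strategies in fix 1 can be illustrated with a Python analogue. This is a sketch, not the C++ `create_orch_so_file` implementation: the function names, the `directory` parameter, and the `os.open` flags are assumptions, and `tempfile.mkstemp` stands in for `mkstemps`.

```python
import os
import tempfile

def create_orch_so_file_sim(directory):
    """Sim-style: unique name per call, then force mode 0755 on the fd."""
    fd, path = tempfile.mkstemp(prefix="libdevice_orch_", suffix=".so",
                                dir=directory)
    os.fchmod(fd, 0o755)  # not affected by umask
    return fd, path

def create_orch_so_file_onboard(directory):
    """Onboard-style: pid-based name, unique only per process."""
    path = os.path.join(directory, f"libdevice_orch_{os.getpid()}.so")
    # Flags are an assumption; the requested mode is still filtered by umask.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    return fd, path
```

Two sim-style calls in the same process yield two distinct paths, while two onboard-style calls reuse the same path, which is why the pid-based scheme relies on one runtime per device process.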

- route controlled PTO2 fatal status through aicpu_execute so platform runners no longer read tensormap_and_ringbuffer shared memory
- keep the PTO2 status helpers in runtime-local common code and preserve host-side finalize handling in runtime_maker
- add UT coverage for fatal short-circuit/reporting paths and keep the explicit fatal ST on a2a3sim only

Remove the per-call `runner->create_thread([&]() { ... }).join()` wrapper introduced by hw-native-sys#493. Running the body directly on the caller thread plus two `RAIIScopeGuard`s that clean up on scope exit restores the pre-hw-native-sys#493 behaviour, without any behavioural difference from the caller's perspective.

The wrapper was added in anticipation of parallel ChipWorker execution with GIL-released nanobind bindings. Because the GIL release was never landed, the caller still holds the GIL across t.join(), so the wrap buys no parallelism and is pure overhead.

- src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
- src/a5/platform/onboard/host/pto_runtime_c_api.cpp

- `tsd_guard` clears `g_runner_key` TSD on every exit path.
- `device_guard` (created after `ensure_device_set` succeeds) calls `reset_device_context()` on every exit path from that point onward, including the `init_runtime_impl` / `runner->run` error paths and the `catch(...)` unwind. Previously the worker thread's destructor handled both; on the caller thread we must do it explicitly or leak CANN streams between successive run_runtime() calls in the same process (which broke the device_test batch_paged_attention case that runs multiple cases).
- `DeviceRunner::create_thread()` and `reset_device_context()` retained: `create_thread()` is still used by the profiling collector thread inside `device_runner.cpp`.
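
The guard pattern described above can be mimicked in Python with `contextlib.ExitStack`, which runs registered callbacks in reverse order on every exit path. This is a rough analogue of the two C++ `RAIIScopeGuard`s, not the actual code; the `events` list and the `fail_at` switch are hypothetical devices for making the exit paths observable.

```python
from contextlib import ExitStack

def run_runtime(events, fail_at=None):
    """Record which cleanup steps run on each exit path."""
    with ExitStack() as guards:
        events.append("set_tsd")
        # Analogue of tsd_guard: always clears the TSD key.
        guards.callback(events.append, "clear_tsd")

        if fail_at == "ensure_device_set":
            return -1  # device_guard not created yet
        events.append("ensure_device_set")
        # Analogue of device_guard: only armed once the device is set.
        guards.callback(events.append, "reset_device_context")

        if fail_at == "run":
            raise RuntimeError("run failed")
        events.append("run")
        return 0
```

On the success path and on the exception path the device reset runs before the TSD is cleared, matching reverse-order destruction of the two guards; an early return before the device is set triggers only the TSD cleanup.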

…ive-sys#527) CANN 8.5.1 defines `#define BLK BLK_Type()` in __clang_cce_vector_intrinsics.h, which causes the device compiler to expand the local `constexpr uint64_t BLK = 64` declaration into invalid code. Rename to `blk_size` in both a2a3 and a5 tensor.h. Fixes hw-native-sys#517

…ain protocol (hw-native-sys#501) Replace the non-blocking ack check (load and return if not all acked) with a spin-wait loop that blocks until all scheduler threads have set their bit in drain_ack_mask. This eliminates the window where a non-elected thread returns to the scheduler loop and resumes tracker writes while the drain worker already has exclusive tracker access. Remove drain_barrier_mask (the second atomic introduced as an intermediate step) — the single spin-wait on drain_ack_mask is sufficient for the full-stop guarantee. Reset detection uses drain_ack_mask bit-clear (release store on insufficient resources), not drain_worker_elected which remains zero until after the barrier completes. Also fix drain_ack_mask reset ordering: use memory_order_release instead of relaxed so the clearing store is visible to threads spinning on their own bit.
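
The blocking acknowledgement step can be sketched with Python threads. This is only a model of the idea (each thread sets its bit, then spins until the mask is full); the class and function names are hypothetical, a lock stands in for the atomic read-modify-write, and the election, drain, and reset steps of the real protocol are left out.

```python
import threading
import time

class DrainState:
    def __init__(self, num_threads):
        self.full_mask = (1 << num_threads) - 1
        self.ack_mask = 0
        self._lock = threading.Lock()  # stand-in for an atomic fetch_or

    def ack(self, thread_idx):
        with self._lock:
            self.ack_mask |= 1 << thread_idx

    def all_acked(self):
        with self._lock:
            return self.ack_mask == self.full_mask

def scheduler_thread(state, idx, observed):
    state.ack(idx)
    # Spin-wait: do not return to the scheduler loop until every bit is set.
    while not state.all_acked():
        time.sleep(0)  # yield to other threads
    observed.append((idx, state.ack_mask))

def run_drain(num_threads=4):
    state = DrainState(num_threads)
    observed = []
    threads = [threading.Thread(target=scheduler_thread,
                                args=(state, i, observed))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state, observed
```

Because no thread leaves the loop early, every thread observes the full mask before proceeding, which is the full-stop property the commit message relies on.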

No cpplint configuration exists in the repo (no CPPLINT.cfg, no CI invocation), so these markers suppressed a linter that is not wired up. Strip all trailing `// NOLINT(...)` comments and remove four standalone NOLINT-only comment lines across the tree. clang-format rewraps a handful of log format strings that were previously split to satisfy cpplint line-length.

- 53 files touched across src/a2a3, src/a5, src/common
- 198 NOLINT markers removed

…w-native-sys#538)

- tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention: set PTO2_RING_HEAP to 1 GiB (2^30) via RUNTIME_ENV. The default 1024 B heap is too small for this scene's intermediate tensors and causes the test to fail on hardware.
- tests/ut/py/test_dist_worker/test_group_task: remove TestGroupParallel.test_group_wall_time. The wall-time assertion is flaky under scheduler jitter, and without it the test is redundant with TestGroupBasic.test_group_both_workers_execute.

…#539) Break the monolithic distributed_level_runtime.md and rename architecture.md so each doc has one audience and one scope.

- Rename architecture.md -> chip-level-arch.md (L2 single-chip scope)
- Slim distributed_level_runtime.md to level model + component overview; move internal details to the new per-component docs
- Add orchestrator.md: submit flow, Ring, TensorMap, Scope, state machine
- Add scheduler.md: wiring/ready/completion queues, dispatch loop
- Add worker-manager.md: WorkerManager + WorkerThread, THREAD/PROCESS modes, fork + mailbox mechanics
- Add task-flow.md: Callable / TaskArgs / CallConfig handles, IWorker interface, L2 ABI edge, end-to-end walkthrough
- Update README, .claude/rules/architecture.md, callable.h doc comment to the new filenames

…s#536) Split the L3 orchestration surface from Worker into a dedicated Orchestrator class whose lifetime is scoped to one run() call. User orch fn signature changes to `fn(orch, args)` and calls orch.submit_next_level / submit_next_level_group / submit_sub / submit_sub_group. scope_begin/scope_end/drain become private details of Worker.run() (order: scope_begin -> orch_fn -> scope_end -> drain).

- Add python/simpler/orchestrator.py
- Remove Worker.submit, Worker.scope, and _ScopeGuard from the public API; Worker keeps register/init/close/run(Task)
- Migrate ut tests (test_host_worker, test_multi_worker, test_group_task) and st tests (test_l3_dependency, test_l3_group) to the new API
- Update simpler_setup/scene_test.py wrapper to pass the Orchestrator
- Refresh docs/distributed_level_runtime.md

C++ DistWorker.submit / scope_* remain internal; a later step of the hierarchical runtime plan will take them private or re-expose them via nanobind on Orchestrator directly.
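
The call order stated in this commit (scope_begin -> orch_fn -> scope_end -> drain) can be modelled with two mock classes. These are stand-ins written for illustration and do not reproduce the real `simpler` classes: `MockWorker`, `MockOrchestrator`, the `trace` list, and the string arguments passed to the submit methods are all assumptions.

```python
class MockOrchestrator:
    """Created inside run(); only valid for that one call."""
    def __init__(self, trace):
        self._trace = trace

    def submit_next_level(self, name):
        self._trace.append(f"submit_next_level:{name}")

    def submit_sub(self, name):
        self._trace.append(f"submit_sub:{name}")

class MockWorker:
    def __init__(self):
        self.trace = []

    def run(self, orch_fn, args):
        orch = MockOrchestrator(self.trace)
        self.trace.append("scope_begin")   # private detail of run()
        orch_fn(orch, args)                # user code sees only orch
        self.trace.append("scope_end")
        self.trace.append("drain")

def my_orch(orch, args):
    # User orchestration function with the fn(orch, args) signature.
    for name in args:
        orch.submit_next_level(name)
    orch.submit_sub("reduce")
```

Running `MockWorker().run(my_orch, ["a", "b"])` records the scope and drain steps around the user's submissions, so the user function never has to manage scopes itself.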

…sys#542) Direct `python tests/.../test_xxx.py` runs were failing on `from simpler.task_interface import ...` because `python/` was only on sys.path under pytest. Make `simpler` (and its dependencies) installable via wheel so any entry point can find it.

- pyproject.toml: include `python/simpler` in `wheel.packages`; exclude the 4 files duplicated with `simpler_setup/` (`elf_parser`, `kernel_compiler`, `runtime_compiler`, `toolchain`) so the authoritative copies under `simpler_setup/` win in wheel mode while the source-tree copies stay reachable for un-migrated callers
- pyproject.toml: enable `editable.rebuild` + `editable.verbose`, set `build-dir = "build/{wheel_tag}"` so editable installs auto-rebuild the nanobind module on import
- CMakeLists.txt: install `src/` and `build/lib/` under `simpler_setup/_assets/` so wheel users get headers + pre-built runtime binaries without a source checkout
- simpler_setup/environment.py: rewrite `PROJECT_ROOT` to auto-resolve via `importlib.resources`: picks `_assets/` when present (wheel), falls back to repo root (source tree / editable). Drop `ensure_python_path` helper now that `simpler` is importable directly
- simpler_setup/scene_test.py: remove 6 `ensure_python_path()` callsites and the import; redundant after packaging
- examples/scripts/run_example.py: drop the `python/` sys.path insert for the same reason

Documentation:

- docs/developer-guide.md: update Directory Structure, add Path resolution + Python package layout sections, add Editable rebuild notes, fix dynamic kernel compilation example, refresh Disk layout with `build/{wheel_tag}/`
- docs/getting-started.md: switch broken `simpler.runtime_compiler` / `runtime_builder` imports to `simpler_setup.*`
- .claude/rules/architecture.md: add Python Package Layout table, update Build System Lookup with simpler_setup paths and PROJECT_ROOT entry
- .claude/rules/venv-isolation.md: document `--no-build-isolation` requirement and editable workflow
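
The `PROJECT_ROOT` fallback rule (use `_assets/` when it exists, otherwise the repo root) is simple enough to sketch. The helper below is hypothetical and takes plain paths so it can be exercised in isolation; the commit itself says the real code resolves the package location via `importlib.resources`.

```python
from pathlib import Path

def resolve_project_root(package_dir: Path, repo_root: Path) -> Path:
    """Pick the wheel-installed assets if present, else the source tree.

    package_dir: directory of the installed simpler_setup package
                 (assumed to be obtained elsewhere, for example from
                 importlib.resources.files("simpler_setup")).
    repo_root:   repository root used for source-tree / editable installs.
    """
    assets = package_dir / "_assets"
    return assets if assets.is_dir() else repo_root
```

With a wheel install the `_assets/` directory exists next to the package modules and wins; in a source checkout or editable install it is absent and the repo root is returned.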

…w-native-sys#544) Disallow full-gap allocation when `tail - top == alloc_size` to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging. Made-with: Cursor
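
The reason for rejecting an allocation that exactly fills the gap can be shown with a toy ring allocator. The class below is an assumption-laden sketch, not the runtime's allocator: it assumes `top` is the next allocation offset, `tail` is the oldest live offset, and `top == tail` means "empty", so an allocation that would make `top` catch up with `tail` is refused.

```python
class RingHeap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0    # next offset to allocate from
        self.tail = 0   # offset of the oldest live allocation

    def alloc(self, size):
        """Return an offset, or None if the request cannot be satisfied."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.top >= self.tail:
            if self.top + size <= self.capacity:
                offset = self.top
                self.top += size
                return offset
            # Wrap around: the free gap is [0, tail).
            if size < self.tail:          # strict: size == tail is refused
                self.top = size
                return 0
            return None
        # Already wrapped: the free gap is [top, tail).
        if size < self.tail - self.top:   # strict: full-gap alloc is refused
            offset = self.top
            self.top += size
            return offset
        return None

    def release_to(self, new_tail):
        """Free everything before new_tail (oldest allocations first)."""
        self.tail = new_tail
```

In this model a request equal to the remaining gap returns `None` even though the bytes are technically free; one unit of capacity is traded for an unambiguous empty/full distinction.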

…ocs (hw-native-sys#546) The 4 build-time modules (runtime_compiler, kernel_compiler, elf_parser, toolchain) are excluded from the simpler wheel — simpler_setup is the authoritative source. Migrate all imports to simpler_setup. Update testing.md: fix stale scene test example to current CALLABLE/TaskArgsBuilder API and add migration guide section. Fix self-hosted CI jobs to use per-run venv with pip upgrade, preventing package conflicts on shared runners and old pip failures. Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>

- New paged_attention_unroll_4dims test under tensormap_and_ringbuffer
- Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim)
- 4 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV)
- Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
- Uses SceneTestCase pattern with shared paged_attention_golden
- Three test cases: varying batch/heads/head_dim at production scale (bfloat16)

Co-authored-by: chenshengxin <hw_chenshengxin@163.com>

chenshengxin2026 force-pushed the add-paged-attention-unroll-4dims-st branch from 2125ab0 to d8ccf87 Compare March 27, 2026 08:35

hw-native-sys-bot and others added 11 commits April 13, 2026 11:23

Update: rename upstream org ChaoWao → hw-native-sys (hw-native-sys#530)

9d5a7d0


ChaoWao force-pushed the add-paged-attention-unroll-4dims-st branch 3 times, most recently from 046acec to b355d40 Compare April 14, 2026 03:22

ChaoWao force-pushed the add-paged-attention-unroll-4dims-st branch from b355d40 to 1da6433 Compare April 14, 2026 03:38

hw-native-sys-bot and others added 2 commits April 14, 2026 15:15

hw-native-sys-bot force-pushed the add-paged-attention-unroll-4dims-st branch from 1da6433 to c0c9bd3 Compare April 14, 2026 07:47

chenshengxin2026 wants to merge 14 commits into main from add-paged-attention-unroll-4dims-st

chenshengxin2026 commented Mar 27, 2026


6 participants
