Add: paged attention unroll scene test with 4D input shapes#2
Open
chenshengxin2026 wants to merge 14 commits into
2125ab0 to d8ccf87
Update all references in GitHub workflow skills, issue templates, and shared library docs to reflect the repo transfer. Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
Two independent fixes to orchestration SO handling on AICPU:

1. Orch SO file creation split by platform. mkstemps (libdevice_orch_XXXXXX.so) ensures per-call uniqueness on sim where multiple workers may share a process, but is not always available on AICPU device libc. Added platform interface create_orch_so_file so sim uses mkstemps + fchmod(0755) and onboard uses pid-based naming + open(...,0755), which is sufficient since only one runtime runs per device process.
2. Deferred dlclose/unlink from run() to deinit(). Closing the SO handle at the end of run() made it impossible to re-run the orchestrator through repeated calls into the same executor. The handle is kept until deinit, which then unlinks the file.

Applied to a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, and a5 tensormap_and_ringbuffer.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
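The two fixes above can be pictured with a small Python analogue. This is only a sketch: the real code is C/C++ on AICPU, `tempfile.mkstemp` stands in for `mkstemps`, the `Executor` class and the `sim` flag are hypothetical, and a plain file stands in for the loaded SO handle.

```python
import os
import tempfile


def create_orch_so_file(directory: str, sim: bool) -> tuple[int, str]:
    """Return (fd, path) for a fresh orchestration .so file (hypothetical analogue)."""
    if sim:
        # Like mkstemps: a random infix makes every call unique, even when
        # several workers share one process.
        fd, path = tempfile.mkstemp(prefix="libdevice_orch_", suffix=".so", dir=directory)
        os.fchmod(fd, 0o755)
        return fd, path
    # Onboard: one runtime per device process, so the pid is unique enough.
    path = os.path.join(directory, f"libdevice_orch_{os.getpid()}.so")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    return fd, path


class Executor:
    """Keeps the file alive across run() calls; cleanup is deferred to deinit()."""

    def __init__(self, directory: str, sim: bool) -> None:
        fd, self.path = create_orch_so_file(directory, sim)
        os.close(fd)
        self.runs = 0

    def run(self) -> None:
        # The file (standing in for the SO handle) must still exist on every run.
        assert os.path.exists(self.path)
        self.runs += 1

    def deinit(self) -> None:
        # Only here is the file unlinked, so repeated run() calls stay valid.
        os.unlink(self.path)
```

Calling `run()` several times on one `Executor` works because nothing is removed until `deinit()`, which mirrors the reason given for moving dlclose/unlink out of run().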
- route controlled PTO2 fatal status through aicpu_execute so platform runners no longer read tensormap_and_ringbuffer shared memory
- keep the PTO2 status helpers in runtime-local common code and preserve host-side finalize handling in runtime_maker
- add UT coverage for fatal short-circuit/reporting paths and keep the explicit fatal ST on a2a3sim only
Remove the per-call `runner->create_thread([&]() { ... }).join()`
wrapper introduced by hw-native-sys#493. Running the body directly on the caller
thread plus two `RAIIScopeGuard`s that clean up on scope exit restores
the pre-hw-native-sys#493 behaviour, without any behavioural difference from the
caller's perspective.
The wrapper was added in anticipation of parallel ChipWorker
execution with GIL-released nanobind bindings. Because the GIL
release was never landed, the caller still holds the GIL across
t.join(), so the wrap buys no parallelism and is pure overhead.
- src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
- src/a5/platform/onboard/host/pto_runtime_c_api.cpp
- `tsd_guard` clears `g_runner_key` TSD on every exit path.
- `device_guard` (created after `ensure_device_set` succeeds) calls
`reset_device_context()` on every exit path from that point onward,
including the `init_runtime_impl` / `runner->run` error paths and
the `catch(...)` unwind. Previously the worker thread's
destructor handled both; on the caller thread we must do it
explicitly or leak CANN streams between successive run_runtime()
calls in the same process (which broke the device_test
batch_paged_attention case that runs multiple cases).
- `DeviceRunner::create_thread()` and `reset_device_context()`
retained — `create_thread()` is still used by the profiling
collector thread inside `device_runner.cpp`.
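The guard ordering described above can be illustrated with a Python analogue of scope guards. This is a sketch under assumptions: the actual change uses C++ `RAIIScopeGuard`s, while here `contextlib.ExitStack` plays that role, and `run_runtime`, `log`, and `fail_at` are hypothetical names used only to make the exit paths observable.

```python
from contextlib import ExitStack


def run_runtime(log, fail_at=None):
    """Hypothetical stand-in for run_runtime(); `log` records what ran."""
    with ExitStack() as guards:
        # tsd_guard: registered first, so it fires on every exit path.
        guards.callback(log.append, "clear_tsd")

        if fail_at == "ensure_device_set":
            raise RuntimeError("ensure_device_set failed")

        # device_guard: registered only after the device has been set, so it
        # covers every exit path from this point onward.
        guards.callback(log.append, "reset_device_context")

        if fail_at == "run":
            raise RuntimeError("runner->run failed")
        log.append("run_ok")
```

Callbacks fire in reverse registration order, so the device context is reset before the TSD key is cleared, and a failure before `ensure_device_set` succeeds skips the device reset entirely.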
…ive-sys#527) CANN 8.5.1 defines `#define BLK BLK_Type()` in __clang_cce_vector_intrinsics.h, which causes the device compiler to expand the local `constexpr uint64_t BLK = 64` declaration into invalid code. Rename to `blk_size` in both a2a3 and a5 tensor.h. Fixes hw-native-sys#517
…ain protocol (hw-native-sys#501)

Replace the non-blocking ack check (load and return if not all acked) with a spin-wait loop that blocks until all scheduler threads have set their bit in drain_ack_mask. This eliminates the window where a non-elected thread returns to the scheduler loop and resumes tracker writes while the drain worker already has exclusive tracker access.

Remove drain_barrier_mask (the second atomic introduced as an intermediate step); the single spin-wait on drain_ack_mask is sufficient for the full-stop guarantee. Reset detection uses drain_ack_mask bit-clear (release store on insufficient resources), not drain_worker_elected, which remains zero until after the barrier completes.

Also fix drain_ack_mask reset ordering: use memory_order_release instead of relaxed so the clearing store is visible to threads spinning on their own bit.
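A rough Python model of the blocking ack barrier follows. It is a sketch only: a lock-protected integer stands in for the atomic `drain_ack_mask`, the names `DrainState`, `enter_drain`, and `demo` are hypothetical, and worker election, reset detection, and memory ordering are deliberately left out.

```python
import threading
import time

NUM_THREADS = 4
ALL_ACKED = (1 << NUM_THREADS) - 1


class DrainState:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for atomic fetch_or / load
        self.ack_mask = 0

    def ack(self, thread_idx):
        with self._lock:
            self.ack_mask |= 1 << thread_idx

    def load(self):
        with self._lock:
            return self.ack_mask


def enter_drain(state, thread_idx, seen):
    state.ack(thread_idx)
    # Blocking spin-wait: no thread leaves until every thread has set its bit,
    # so none can go back to the scheduler loop while the drain is pending.
    while state.load() != ALL_ACKED:
        time.sleep(0)  # yield to the other threads
    seen[thread_idx] = state.load()


def demo():
    state = DrainState()
    seen = [None] * NUM_THREADS
    threads = [
        threading.Thread(target=enter_drain, args=(state, i, seen))
        for i in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Every thread observes the full mask before it proceeds, which is the full-stop property the commit relies on; the earlier non-blocking check would have let a thread return as soon as it saw a partial mask.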
No cpplint configuration exists in the repo (no CPPLINT.cfg, no CI invocation), so these markers suppressed a linter that is not wired up. Strip all trailing `// NOLINT(...)` comments and remove four standalone NOLINT-only comment lines across the tree. clang-format rewraps a handful of log format strings that were previously split to satisfy cpplint line-length.

- 53 files touched across src/a2a3, src/a5, src/common
- 198 NOLINT markers removed
…w-native-sys#538)

- tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention: set PTO2_RING_HEAP to 1 GiB (2^30) via RUNTIME_ENV. The default 1024 B heap is too small for this scene's intermediate tensors and causes the test to fail on hardware.
- tests/ut/py/test_dist_worker/test_group_task: remove TestGroupParallel.test_group_wall_time. The wall-time assertion is flaky under scheduler jitter, and without it the test is redundant with TestGroupBasic.test_group_both_workers_execute.
…#539)

Break the monolithic distributed_level_runtime.md and rename architecture.md so each doc has one audience and one scope.

- Rename architecture.md -> chip-level-arch.md (L2 single-chip scope)
- Slim distributed_level_runtime.md to level model + component overview; move internal details to the new per-component docs
- Add orchestrator.md: submit flow, Ring, TensorMap, Scope, state machine
- Add scheduler.md: wiring/ready/completion queues, dispatch loop
- Add worker-manager.md: WorkerManager + WorkerThread, THREAD/PROCESS modes, fork + mailbox mechanics
- Add task-flow.md: Callable / TaskArgs / CallConfig handles, IWorker interface, L2 ABI edge, end-to-end walkthrough
- Update README, .claude/rules/architecture.md, callable.h doc comment to the new filenames
…s#536)

Split the L3 orchestration surface from Worker into a dedicated Orchestrator class whose lifetime is scoped to one run() call. User orch fn signature changes to `fn(orch, args)` and calls orch.submit_next_level / submit_next_level_group / submit_sub / submit_sub_group. scope_begin/scope_end/drain become private details of Worker.run() (order: scope_begin -> orch_fn -> scope_end -> drain).

- Add python/simpler/orchestrator.py
- Remove Worker.submit, Worker.scope, and _ScopeGuard from the public API; Worker keeps register/init/close/run(Task)
- Migrate ut tests (test_host_worker, test_multi_worker, test_group_task) and st tests (test_l3_dependency, test_l3_group) to the new API
- Update simpler_setup/scene_test.py wrapper to pass the Orchestrator
- Refresh docs/distributed_level_runtime.md

C++ DistWorker.submit / scope_* remain internal; a later step of the hierarchical runtime plan will take them private or re-expose them via nanobind on Orchestrator directly.
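The call order inside run() can be sketched with toy classes. This is a minimal illustration, not the code in python/simpler/orchestrator.py: the method names come from the commit message, but their argument lists, the `trace` list, and the simplified `run(orch_fn, args)` signature (the commit says run takes a Task) are assumptions.

```python
class Orchestrator:
    """Hypothetical per-run handle exposing only the submit surface."""

    def __init__(self, trace):
        self._trace = trace

    def submit_next_level(self, name):
        self._trace.append(f"submit_next_level:{name}")

    def submit_sub(self, name):
        self._trace.append(f"submit_sub:{name}")


class Worker:
    def __init__(self):
        self.trace = []

    # Private details of run(); user code never calls these directly.
    def _scope_begin(self):
        self.trace.append("scope_begin")

    def _scope_end(self):
        self.trace.append("scope_end")

    def _drain(self):
        self.trace.append("drain")

    def run(self, orch_fn, args):
        orch = Orchestrator(self.trace)  # lives only for this run() call
        self._scope_begin()
        orch_fn(orch, args)  # user function with the fn(orch, args) signature
        self._scope_end()
        self._drain()
```

A user function then only touches the handle it is given, for example `def orch_fn(orch, args): orch.submit_next_level(args["kernel"])`, and scope handling stays out of user code.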
…sys#542)

Direct `python tests/.../test_xxx.py` runs were failing on `from simpler.task_interface import ...` because `python/` was only on sys.path under pytest. Make `simpler` (and its dependencies) installable via wheel so any entry point can find it.

- pyproject.toml: include `python/simpler` in `wheel.packages`; exclude the 4 files duplicated with `simpler_setup/` (`elf_parser`, `kernel_compiler`, `runtime_compiler`, `toolchain`) so the authoritative copies under `simpler_setup/` win in wheel mode while the source-tree copies stay reachable for un-migrated callers
- pyproject.toml: enable `editable.rebuild` + `editable.verbose`, set `build-dir = "build/{wheel_tag}"` so editable installs auto-rebuild the nanobind module on import
- CMakeLists.txt: install `src/` and `build/lib/` under `simpler_setup/_assets/` so wheel users get headers + pre-built runtime binaries without a source checkout
- simpler_setup/environment.py: rewrite `PROJECT_ROOT` to auto-resolve via `importlib.resources`; picks `_assets/` when present (wheel), falls back to repo root (source tree / editable). Drop `ensure_python_path` helper now that `simpler` is importable directly
- simpler_setup/scene_test.py: remove 6 `ensure_python_path()` callsites and the import; redundant after packaging
- examples/scripts/run_example.py: drop the `python/` sys.path insert for the same reason

Documentation:

- docs/developer-guide.md: update Directory Structure, add Path resolution + Python package layout sections, add Editable rebuild notes, fix dynamic kernel compilation example, refresh Disk layout with `build/{wheel_tag}/`
- docs/getting-started.md: switch broken `simpler.runtime_compiler` / `runtime_builder` imports to `simpler_setup.*`
- .claude/rules/architecture.md: add Python Package Layout table, update Build System Lookup with simpler_setup paths and PROJECT_ROOT entry
- .claude/rules/venv-isolation.md: document `--no-build-isolation` requirement and editable workflow
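The `PROJECT_ROOT` fallback described above might look roughly like this. It is a sketch under assumptions: `resolve_project_root` is a hypothetical helper, the package directory is passed in explicitly so the logic can be exercised without an installed wheel, and the repo root is assumed to be the package directory's parent. In the real module the directory would presumably come from `importlib.resources`.

```python
from pathlib import Path


def resolve_project_root(package_dir: Path) -> Path:
    """Pick `_assets/` when it exists (wheel), else the repo root (assumed parent)."""
    assets = package_dir / "_assets"
    if assets.is_dir():
        # Wheel install: headers and pre-built binaries were installed here.
        return assets
    # Source tree or editable install: fall back to the checkout root.
    return package_dir.parent
```

Keeping the decision in one function means callers never need to know which install mode is active.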
046acec to b355d40
…w-native-sys#544) Disallow full-gap allocation when `tail - top == alloc_size` to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging. Made-with: Cursor
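Why an exact-fit allocation is rejected becomes clearer with a toy ring allocator. This is a sketch, not the allocator in the repo: `RingAllocator`, `release`, and the convention that `top == tail` means "empty" are assumptions made for illustration.

```python
class RingAllocator:
    def __init__(self, capacity):
        self.capacity = capacity
        self.top = 0   # next allocation offset
        self.tail = 0  # oldest live offset; top == tail is read as "empty"

    def alloc(self, size):
        if size <= 0:
            return None
        if self.top >= self.tail:
            # Free space is [top, capacity) plus [0, tail).
            if self.top + size <= self.capacity:
                offset = self.top
                self.top += size
                return offset
            # Wrap-around: only [0, tail) is left, and an exact fit is refused.
            if size < self.tail:
                self.top = size
                return 0
            return None
        # Free space is the gap [top, tail). Filling it completely would make
        # top == tail, which is indistinguishable from "empty", so the
        # comparison is strict and tail - top == size is rejected.
        if size < self.tail - self.top:
            offset = self.top
            self.top += size
            return offset
        return None

    def release(self, new_tail):
        # Hypothetical helper: the consumer reports how far it has freed.
        self.tail = new_tail
```

With capacity 100, after `alloc(60)`, `release(40)`, `alloc(30)`, and a wrapped `alloc(20)`, the gap is exactly 20 bytes; `alloc(20)` returns `None` while `alloc(19)` succeeds.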
b355d40 to 1da6433
…ocs (hw-native-sys#546)

The 4 build-time modules (runtime_compiler, kernel_compiler, elf_parser, toolchain) are excluded from the simpler wheel; simpler_setup is the authoritative source. Migrate all imports to simpler_setup.

Update testing.md: fix stale scene test example to current CALLABLE/TaskArgsBuilder API and add migration guide section.

Fix self-hosted CI jobs to use per-run venv with pip upgrade, preventing package conflicts on shared runners and old pip failures.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
- New paged_attention_unroll_4dims test under tensormap_and_ringbuffer
- Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim)
- 4 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV)
- Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
- Uses SceneTestCase pattern with shared paged_attention_golden
- Three test cases: varying batch/heads/head_dim at production scale (bfloat16)

Co-authored-by: chenshengxin <hw_chenshengxin@163.com>
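The online softmax accumulation mentioned above can be sketched in plain Python. This illustrates the general technique only and does not reproduce the softmax_prepare/online_update kernels: the function names are hypothetical, and values are scalars (as if head_dim were 1) to keep the arithmetic visible.

```python
import math


def online_softmax_weighted_sum(score_blocks, value_blocks):
    """Compute sum(softmax(scores) * values) one block at a time."""
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    acc = 0.0          # sum of exp(score - running_max) * value seen so far
    for scores, values in zip(score_blocks, value_blocks):
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


def reference(scores, values):
    """Ordinary two-pass softmax over the whole row, for comparison."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Because each block only rescales the running sum and accumulator, the full row of scores never has to be held at once, which is what lets key/value blocks be processed in unrolled groups.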
1da6433 to c0c9bd3
Summary
- Add `paged_attention_unroll_4dims` scene test under `tensormap_and_ringbuffer` runtime
- Query and output tensors use 4D format `(batch, seq_len, num_heads, head_dim)` instead of flattened 2D
- Shared `paged_attention_golden` with 4D reshape adapter

Testing