Rewrite non-L3 Qwen3 kernels through L3 worker#22
Conversation
📝 WalkthroughWalkthroughThis PR implements L3 hierarchical orchestration support for chip program execution with deferred tensor allocation. The Worker class gains L3 submission APIs and orchestrator-aware memory operations. The npu_runner extends L2→L3 execution paths to defer child-memory tensor allocation/upload until inside L3 orchestration scope, enabling safer lifecycle management and post-submit device-to-host synchronization. ChangesL3 Orchestration and Deferred Tensor Handling
Sequence Diagram(s)sequenceDiagram
participant Runner as Qwen314B<br/>ModelRunner
participant Worker as L3<br/>Worker
participant Orch as Orchestrator<br/>Callback
participant Simpler as Simpler<br/>Worker
Runner->>Runner: _run_l2_program(...)
Runner->>Runner: _share_l2_host_args(args)
Runner->>Runner: _ensure_l2_program(callable_spec)
Runner->>Worker: initialize L3 worker
Runner->>Simpler: register_known_callables()
Runner->>Simpler: register_callable(callable_id)
Runner->>Runner: _build_l2_orch_args(spec, args, worker, orchestrator)
Runner->>Runner: _resolve_l3_child_tensor(deferred, worker, orchestrator)
Runner->>Worker: run_next_level(callable_id, orch_args)
Worker->>Orch: submit_next_level(callable_id, orch_args, orchestrator)
Orch->>Simpler: run(orch_args)
Simpler-->>Orch: result
Orch-->>Worker: orchestration complete
Worker-->>Runner: result
Runner->>Runner: post_submit_copies: copy_from(dst, src, nbytes)
Runner->>Worker: discard_l3_children()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the Qwen3-14B NPU runner to support deferred child tensor materialization and post-submit device-to-host copies within L3 orchestration scopes, alongside adding cleanup mechanisms for one-shot L3 workers. The review feedback highlights three critical issues: first, the reuse of the deferred child tensor normed is broken because the worker and its cached allocations are discarded at the end of each _run_l2_program execution; second, the blocking os.waitpid call during worker cleanup can cause the main process to hang indefinitely if a child process is deadlocked; and third, the type check for identifying the final output tensor fails to account for _DeferredChildTensor and WorkerTensor types, incorrectly classifying them as inputs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if isinstance(normed_arg, _DeferredChildTensor): | ||
| lm_head_input = self._l2_child_tensor( | ||
| compiled.lm_head.runtime_name, | ||
| normed, | ||
| self._l2_child_tensor(compiled.lm_head.runtime_name, compiled.padded_lm_head_weight), | ||
| logits_padded, | ||
| upload=False, | ||
| require_existing=True, | ||
| ) |
There was a problem hiding this comment.
The reuse of the deferred child tensor normed via require_existing=True is broken because _run_l2_program discards the worker at the end of every execution.
Specifically:
_run_l2_program(compiled.final_rms, ...)is called withnormed_arg(a_DeferredChildTensor)._resolve_l3_child_tensorallocates the tensor on the worker and caches it inself._l2_child_allocs.- In the
finallyblock of_run_l2_program,self._discard_l2_worker(handle.runtime_name)is called. _discard_l2_workerpops the worker fromself._l2_workers, deletes the cached allocation fromself._l2_child_allocs, and discards the worker's child processes (worker.discard_l3_children())._run_l2_program(compiled.lm_head, ...)is called withlm_head_input(which hasrequire_existing=True)._resolve_l3_child_tensorattempts to look up the existing allocation, but it has been deleted and the worker discarded. It then raisesRuntimeError("missing worker-resident tensor allocation").
To fix this, the lifecycle of the L3 worker and its child allocations needs to persist across these dependent L2 program runs, or both operations should be submitted within a single orchestration session before discarding the worker.
| for pid in list(getattr(impl, "_sub_pids", [])) + list(getattr(impl, "_chip_pids", [])) + list( | ||
| getattr(impl, "_next_level_pids", []) | ||
| ): | ||
| try: | ||
| os.waitpid(int(pid), 0) | ||
| except ChildProcessError: | ||
| pass | ||
| except OSError: | ||
| pass |
There was a problem hiding this comment.
Calling os.waitpid(int(pid), 0) is a blocking call. If any of the child processes (sub-workers, chip workers, or next-level workers) are hung or deadlocked (which is the exact scenario tracked in simpler#980), the main process will hang indefinitely here.
To prevent the main process from hanging, consider using os.waitpid(int(pid), os.WNOHANG) in a loop with a timeout, or sending a termination signal (e.g., SIGTERM or SIGKILL) to the process if it does not exit gracefully within a reasonable timeframe.
| for pid in list(getattr(impl, "_sub_pids", [])) + list(getattr(impl, "_chip_pids", [])) + list( | |
| getattr(impl, "_next_level_pids", []) | |
| ): | |
| try: | |
| os.waitpid(int(pid), 0) | |
| except ChildProcessError: | |
| pass | |
| except OSError: | |
| pass | |
| for pid in list(getattr(impl, "_sub_pids", [])) + list(getattr(impl, "_chip_pids", [])) + list( | |
| getattr(impl, "_next_level_pids", []) | |
| ): | |
| try: | |
| pid_val = int(pid) | |
| res, _ = os.waitpid(pid_val, os.WNOHANG) | |
| if res == 0: | |
| try: | |
| os.kill(pid_val, 9) | |
| os.waitpid(pid_val, 0) | |
| except OSError: | |
| pass | |
| except ChildProcessError: | |
| pass | |
| except OSError: | |
| pass |
| if isinstance(arg, torch.Tensor) and arg_index == arg_count - 1: | ||
| return tensor_arg_type.OUTPUT_EXISTING |
There was a problem hiding this comment.
The check isinstance(arg, torch.Tensor) will evaluate to False if the last argument is a _DeferredChildTensor or a WorkerTensor. This means that if the last argument is one of these types, it will not be automatically classified as OUTPUT_EXISTING and will instead fall back to INPUT, which can cause runtime errors or incorrect execution.
We should update the check to include _DeferredChildTensor and WorkerTensor.
| if isinstance(arg, torch.Tensor) and arg_index == arg_count - 1: | |
| return tensor_arg_type.OUTPUT_EXISTING | |
| if isinstance(arg, (torch.Tensor, _DeferredChildTensor, WorkerTensor)) and arg_index == arg_count - 1: | |
| return tensor_arg_type.OUTPUT_EXISTING |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/runtime/worker.py`:
- Around line 247-255: The current cleanup loop calling os.waitpid(pid, 0) can
block indefinitely; change it to poll non‑blocking using os.waitpid(pid,
os.WNOHANG) inside a bounded retry/timeout loop: for each pid from
impl._sub_pids/_chip_pids/_next_level_pids call os.waitpid with WNOHANG and if
it returns (0, 0) sleep briefly and retry until a deadline (use time.time()),
then if still alive attempt os.kill(pid, signal.SIGTERM) and after another short
deadline escalate to os.kill(pid, signal.SIGKILL); keep handling
ChildProcessError/OSError as before and ensure the loop breaks when waitpid
reports the child reaped or errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e5ffe8af-c374-42ac-b48c-1ac1af8d7d60
📒 Files selected for processing (3)
examples/model/qwen3_14b/runner/npu_runner.pypython/runtime/worker.pytests/test_npu_runner_l3_worker.py
| for pid in list(getattr(impl, "_sub_pids", [])) + list(getattr(impl, "_chip_pids", [])) + list( | ||
| getattr(impl, "_next_level_pids", []) | ||
| ): | ||
| try: | ||
| os.waitpid(int(pid), 0) | ||
| except ChildProcessError: | ||
| pass | ||
| except OSError: | ||
| pass |
There was a problem hiding this comment.
os.waitpid(..., 0) blocks indefinitely if a child process hangs.
If a chip child process becomes unresponsive (deadlock, infinite loop, etc.), this cleanup will block the caller forever. Consider using os.WNOHANG with a bounded retry/timeout loop, or wrapping in a timeout mechanism.
🛠️ Suggested non-blocking approach
for pid in list(getattr(impl, "_sub_pids", [])) + list(getattr(impl, "_chip_pids", [])) + list(
getattr(impl, "_next_level_pids", [])
):
try:
- os.waitpid(int(pid), 0)
+ # Non-blocking check; child may already be reaped or still running.
+ os.waitpid(int(pid), os.WNOHANG)
except ChildProcessError:
pass
except OSError:
pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/runtime/worker.py` around lines 247 - 255, The current cleanup loop
calling os.waitpid(pid, 0) can block indefinitely; change it to poll
non‑blocking using os.waitpid(pid, os.WNOHANG) inside a bounded retry/timeout
loop: for each pid from impl._sub_pids/_chip_pids/_next_level_pids call
os.waitpid with WNOHANG and if it returns (0, 0) sleep briefly and retry until a
deadline (use time.time()), then if still alive attempt os.kill(pid,
signal.SIGTERM) and after another short deadline escalate to os.kill(pid,
signal.SIGKILL); keep handling ChildProcessError/OSError as before and ensure
the loop breaks when waitpid reports the child reaped or errors.
Summary
Worker(level=3)usingsubmit_next_levelINOUTfor prefill/decode, without compact KVVerification
python -m pytest tests/test_npu_runner_l3_worker.py tests/test_batching.py tests/test_cli.py -qruff check --config ruff.toml examples/model/qwen3_14b/runner/npu_runner.py python/runtime/worker.py tests/test_npu_runner_l3_worker.pygit diff --checktask-submit --device auto --max-time 0 --run "PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 python examples/model/qwen3_14b/npu_generate.py --model-dir /data/linyifan/models/Qwen3-14B --prompt 'Huawei is' --platform a2a3 --max-seq-len 512 --max-new-tokens 5"Generated text:
Notes
The smaller default ring settings still fail in prefill with AICPU
507018for this path: