Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
A level-3 simpler.worker.Worker used as a single-chip host worker can become unusable after a successful Worker.run() that submits a chip callable through orch.submit_next_level(...).
In the observed PyPTO Serving integration, the prefill kernel completes successfully and Worker.run() returns to Python, but the chip child process has already exited and is left defunct (an unreaped zombie) while the parent Worker still appears initialized. Reusing the same L3 worker for the next submitted chip task, or closing/switching the worker through the normal wrapper path, then hangs instead of cleanly reusing or tearing down the child. The local workaround is to treat this L3 worker as one-shot: after every submitted child task, write _SHUTDOWN to the child mailboxes, waitpid the children, unlink the shared-memory mailboxes, and discard the worker state before creating a new Worker for the next kernel.
This looks like a Worker lifecycle bug: either Worker.run() should keep the chip child alive for later runs, or Worker.close() / post-run cleanup should reliably reap the child and mark the worker unusable once the child has exited.
Related: #824
Steps to Reproduce
Using a PyPTO Serving branch that dispatches non-L3 Qwen3 kernels through an L3 Simpler worker:
cd /data/liuxu/pypto-serving
task-submit --device auto --max-time 0 --run \
"PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 5"
The serving-side dispatch shape is roughly:
worker = Worker(
    level=3,
    platform="a2a3",
    runtime="tensormap_and_ringbuffer",
    device_ids=[device_id],
    num_sub_workers=0,
)
cid = worker.register(chip_callable)
worker.init()

def orch_fn(orch, _args, _cfg):
    task_args = TaskArgs()
    # host tensors and/or child_memory ContinuousTensor args
    orch.submit_next_level(cid, task_args, call_config, worker=0)

worker.run(orch_fn)
# The chip child is observed as defunct here, while the Worker object is still considered initialized.
# A later worker.run(...) or normal close/switch path hangs.
Expected Behavior
After a successful Worker.run():
- the level-3 worker should remain reusable for a later Worker.run() on the same chip child, or
- if the child process exits, the Worker should detect/reap it and report a clear unusable/closed state, and
- Worker.close() should not hang after the child has already exited.
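The "detect/reap" behavior requested above can be illustrated with a non-blocking waitpid. This is a minimal sketch using only the Python standard library on a POSIX host; reap_if_exited is a hypothetical helper name and is not part of Simpler's API.

```python
import os
from typing import Optional, Tuple


def reap_if_exited(pid: int) -> Tuple[bool, Optional[int]]:
    """Check a child without blocking.

    Returns (exited, exit_code). exit_code is None when the child is still
    running, or when it was already reaped elsewhere and its status is lost.
    """
    try:
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # Not a child of this process, or already reaped by someone else.
        return True, None
    if reaped_pid == 0:
        # Child exists and has not changed state: still running.
        return False, None
    # Child has terminated and is now reaped (no longer defunct).
    return True, os.waitstatus_to_exitcode(status)
```

A parent that ran such a check after each run, and again on entry to close, could flag itself as closed instead of waiting on a child that will never answer.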
Actual Behavior
The first submitted prefill task completes:
[chip_process pid=574954 dev=4] ready
[timing] prefill: fused 40 layers, 9574.72 ms
Immediately afterward, process inspection shows the child process as defunct while the parent Python process remains alive with the Worker still in use:
554986 ... python examples/model/qwen3_14b/npu_generate.py ... --device 4
574954 554986 Z [python] <defunct>
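The same defunct state can be checked programmatically rather than by reading ps output. This is a minimal sketch, assuming a Linux host with /proc mounted; proc_state and is_defunct are hypothetical helper names, not part of Simpler.

```python
from typing import Optional


def proc_state(pid: int) -> Optional[str]:
    """Return the one-letter Linux process state ('R', 'S', 'Z', ...).

    Returns None if the PID no longer exists.
    """
    try:
        with open(f"/proc/{pid}/stat", "r") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Field 2 (the command name) is parenthesised and may itself contain
    # spaces or ')', so split on the last ')' before reading field 3.
    return stat.rsplit(")", 1)[1].split()[0]


def is_defunct(pid: int) -> bool:
    """True if the process has exited but has not yet been reaped."""
    return proc_state(pid) == "Z"
```

For the process listing above, is_defunct(574954) would have returned True for as long as the parent had not reaped the child.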
The parent then makes no progress into the next decode task. In repeated checks it had to be killed manually. Before the one-shot discard workaround, this blocked offline generation after prefill. With the manual one-shot discard/recreate workaround, the same generation completed:
text: a Chinese company. The
token_ids: [264, 8453, 2813, 13, 576]
finish_reason: length
A separate resource-related symptom was also observed with small ring settings (PTO2_RING_HEAP=536870912 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072): prefill can fail with AICPU 507018. The lifecycle bug above was reproduced with the larger ring settings where prefill itself succeeds.
Git Commit ID
293e88a
CANN Version
9.0.0 (Ascend-cann-toolkit, innerversion=V100R001C10SPC001B250)
Driver Version
npu-smi reports version 26.0.rc1.
Host Platform
Linux (aarch64)
Additional Context
The workaround currently used in PyPTO Serving adds a wrapper-level best-effort discard path for one-shot L3 workers:
- write _SHUTDOWN into _sub_shms, _chip_shms, and _next_level_shms
- waitpid child PIDs
- close/unlink mailbox shared memory
- clear _worker, _orch, child PID/shm lists, and initialized state
That avoids the hang but relies on Simpler private internals, so the lifecycle should be fixed or exposed as a supported API in Simpler.
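The first three steps of that discard sequence can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual wrapper code: the mailbox layout (a 32-bit command word at offset 0) and the SHUTDOWN value are hypothetical, since Simpler's real _SHUTDOWN constant and mailbox format are private and not given in this report. Clearing the wrapper's own references is left to the caller.

```python
import os
import struct
import time
from multiprocessing import shared_memory

# Hypothetical command word; Simpler's real _SHUTDOWN value is not shown here.
SHUTDOWN = 0xFFFFFFFF


def discard_children(mailboxes, pids, timeout_s=5.0):
    """Best-effort teardown: signal, reap, then unlink.

    Returns the sorted list of PIDs that could not be reaped before the
    deadline, so the caller never blocks forever on a stuck child.
    """
    # 1. Ask every child to shut down (assumed layout: command word at offset 0).
    for shm in mailboxes:
        struct.pack_into("<I", shm.buf, 0, SHUTDOWN)

    # 2. Reap children by polling waitpid until the deadline.
    pending = set(pids)
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        for pid in list(pending):
            try:
                reaped, _status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                reaped = pid  # already reaped, or not our child
            if reaped == pid:
                pending.discard(pid)
        if pending:
            time.sleep(0.01)

    # 3. Release the mailbox segments.
    for shm in mailboxes:
        shm.close()
        try:
            shm.unlink()
        except FileNotFoundError:
            pass  # already unlinked by another owner
    return sorted(pending)
```

Polling with os.WNOHANG rather than a blocking waitpid is a deliberate choice in this sketch: a child that ignores the shutdown word then shows up in the returned list instead of reproducing the hang.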