[Bug] L3 Worker leaves chip child defunct after submit_next_level run #980

Description

@ndleslx

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

A level-3 simpler.worker.Worker used as a single-chip host worker can become unusable after a successful Worker.run() that submits a chip callable through orch.submit_next_level(...).

In the observed PyPTO serving integration, the prefill kernel completes successfully and Worker.run() returns to Python, but by then the chip child process is already defunct (a zombie) while the parent Worker still appears initialized. Reusing the same L3 worker for the next submitted chip task, or closing/switching the worker through the normal wrapper path, hangs instead of cleanly reusing or tearing down the child.

The local workaround was to treat this L3 worker as one-shot: after every submitted child task, write _SHUTDOWN to the child mailboxes, waitpid the children, unlink the shared-memory mailboxes, and discard the worker state before creating a new Worker for the next kernel.

This looks like a Worker lifecycle bug: either Worker.run() should keep the chip child alive for later runs, or Worker.close() / post-run cleanup should reliably reap and mark the worker unusable when the child has exited.

Related: #824

Steps to Reproduce

Using a PyPTO Serving branch that dispatches non-L3 Qwen3 kernels through an L3 Simpler worker:

cd /data/liuxu/pypto-serving

task-submit --device auto --max-time 0 --run \
  "PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
   python examples/model/qwen3_14b/npu_generate.py \
     --model-dir /data/linyifan/models/Qwen3-14B \
     --prompt 'Huawei is' \
     --platform a2a3 \
     --max-seq-len 512 \
     --max-new-tokens 5"

The serving-side dispatch shape is roughly:

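from simpler.worker import Worker
# TaskArgs, chip_callable, call_config and device_id are provided by the serving code.

worker = Worker(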
    level=3,
    platform="a2a3",
    runtime="tensormap_and_ringbuffer",
    device_ids=[device_id],
    num_sub_workers=0,
)
cid = worker.register(chip_callable)
worker.init()

def orch_fn(orch, _args, _cfg):
    task_args = TaskArgs()
    # host tensors and/or child_memory ContinuousTensor args
    orch.submit_next_level(cid, task_args, call_config, worker=0)

worker.run(orch_fn)
# The chip child is observed as defunct here, while the Worker object is still considered initialized.
# A later worker.run(...) or normal close/switch path hangs.

Expected Behavior

After a successful Worker.run():

  • either the level-3 worker should remain reusable for a later Worker.run() on the same chip child,
  • or, if the child process exits, the Worker should detect and reap it and report a clear unusable/closed state;
  • in both cases, Worker.close() should not hang after the child has already exited.
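
The "detect/reap" behavior described above can be illustrated with a non-blocking waitpid. This is a minimal sketch using only the Python standard library, not Simpler's actual implementation; the helper name poll_child is hypothetical, and os.waitstatus_to_exitcode requires Python 3.9 or later.

```python
import os


def poll_child(pid):
    """Non-blocking check of a child PID.

    Returns ("running", None), ("exited", exit_code), or ("gone", None)
    when the PID is not (or is no longer) a child of this process.
    """
    try:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        return ("gone", None)
    if done_pid == 0:
        return ("running", None)
    # Reaped: a negative code means the child was killed by that signal.
    return ("exited", os.waitstatus_to_exitcode(status))
```

A check like this before each run() or close() would let a worker fail fast with a clear error instead of waiting on a mailbox that no child will ever answer.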

Actual Behavior

The first submitted prefill task completes:

[chip_process pid=574954 dev=4] ready
[timing] prefill: fused 40 layers, 9574.72 ms

Immediately afterward, process inspection shows the child process as defunct while the parent Python process remains alive with the Worker still in use:

554986 ... python examples/model/qwen3_14b/npu_generate.py ... --device 4
574954 554986 Z [python] <defunct>
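
The same Z state can be checked programmatically. The following is a hedged, Linux-only sketch that reads the state field from /proc/<pid>/stat; the helper name proc_state is hypothetical and is not part of Simpler or PyPTO.

```python
def proc_state(pid):
    """Return the one-letter process state (e.g. "R", "S", "Z") or None.

    Linux only. "Z" means the process has exited but has not been reaped.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The comm field is parenthesised and may contain spaces or ")",
    # so the state is the first token after the last ")".
    return stat.rsplit(")", 1)[1].split()[0]
```

For the output above, proc_state(574954) would be expected to return "Z" until the parent calls waitpid on that PID.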

The parent then makes no progress into the next decode task. In repeated checks it had to be killed manually. Before the one-shot discard workaround, this blocked offline generation after prefill. With the manual one-shot discard/recreate workaround, the same generation completed:

text:  a Chinese company. The
token_ids: [264, 8453, 2813, 13, 576]
finish_reason: length

A separate resource-related symptom was also observed with small ring settings (PTO2_RING_HEAP=536870912 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072): prefill can fail with AICPU 507018. The lifecycle bug above was reproduced with the larger ring settings where prefill itself succeeds.

Git Commit ID

293e88a

CANN Version

9.0.0 (Ascend-cann-toolkit, innerversion=V100R001C10SPC001B250)

Driver Version

npu-smi reports version 26.0.rc1.

Host Platform

Linux (aarch64)

Additional Context

The workaround currently used in PyPTO Serving adds a wrapper-level best-effort discard path for one-shot L3 workers:

  • write _SHUTDOWN into _sub_shms, _chip_shms, and _next_level_shms
  • waitpid child PIDs
  • close/unlink mailbox shared memory
  • clear _worker, _orch, child PID/shm lists, and initialized state

That avoids the hang but relies on Simpler private internals, so the lifecycle should be fixed or exposed as a supported API in Simpler.
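
The discard sequence listed above can be sketched with standard-library primitives. This is an illustration under stated assumptions, not the PyPTO Serving wrapper or Simpler's code: the mailbox layout (a 32-bit command word at offset 0), the SHUTDOWN value, and the helper names spawn_child and discard are all hypothetical stand-ins for Simpler's private _SHUTDOWN and mailbox internals.

```python
import os
import struct
import time
from multiprocessing import shared_memory

SHUTDOWN = 0xDEAD  # hypothetical sentinel standing in for Simpler's _SHUTDOWN


def spawn_child(shm):
    """Fork a child that idles until the mailbox command word is SHUTDOWN."""
    pid = os.fork()
    if pid == 0:
        # The forked child shares the parent's mapping of the mailbox.
        while struct.unpack_from("<I", shm.buf, 0)[0] != SHUTDOWN:
            time.sleep(0.001)
        os._exit(0)
    return pid


def discard(children):
    """Best-effort teardown of (pid, mailbox) pairs; returns exit codes."""
    # 1. Write the shutdown command into every mailbox.
    for _pid, shm in children:
        struct.pack_into("<I", shm.buf, 0, SHUTDOWN)
    # 2. Reap every child; one that was already reaped yields None.
    codes = []
    for pid, _shm in children:
        try:
            _, status = os.waitpid(pid, 0)
            codes.append(os.waitstatus_to_exitcode(status))
        except ChildProcessError:
            codes.append(None)
    # 3. Close and unlink the shared-memory mailboxes.
    for _pid, shm in children:
        shm.close()
        shm.unlink()
    # 4. Clear the bookkeeping so the old worker state cannot be reused.
    children.clear()
    return codes
```

In this sketch a child that is already defunct is reaped immediately by waitpid, which is the case the workaround relies on. A child that is alive but ignores the mailbox would still block step 2, so a supported API would probably want a timeout there.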
