[Bug] L3 Worker leaves chip child defunct after submit_next_level run #980

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

A level-3 `simpler.worker.Worker` used as a single-chip host worker can become unusable after a successful `Worker.run()` that submits a chip callable through `orch.submit_next_level(...)`.

In the observed PyPTO Serving integration, the prefill kernel completes successfully and `Worker.run()` returns to Python, but the chip child process has already exited and is left defunct (unreaped) while the parent `Worker` still appears initialized. Reusing the same L3 worker for the next submitted chip task, or closing/switching the worker through the normal wrapper path, hangs instead of cleanly reusing or tearing down the child.

The local workaround is to treat this L3 worker as one-shot: after every submitted child task, write `_SHUTDOWN` to the child mailboxes, `waitpid` the children, unlink the shared-memory mailboxes, and discard the worker state before creating a new `Worker` for the next kernel.

This looks like a `Worker` lifecycle bug: either `Worker.run()` should keep the chip child alive for later runs, or `Worker.close()` / post-run cleanup should reliably reap the child and mark the worker unusable once the child has exited.
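To make the second option concrete, here is a minimal sketch of a fail-fast liveness check that a worker could run before reusing its children. It is not Simpler's implementation: `ChildGuard` and `ChildExitedError` are hypothetical names, and the sketch only assumes that the chip children are direct child processes of the Python parent (Python 3.9+ for `os.waitstatus_to_exitcode`).

```python
import os


class ChildExitedError(RuntimeError):
    """Raised when a tracked child process has exited unexpectedly."""


class ChildGuard:
    """Tracks child PIDs and reaps any that have exited (hypothetical helper)."""

    def __init__(self, pids):
        self.live = set(pids)
        self.exited = {}  # pid -> exit code, or None if it was reaped elsewhere

    def poll(self):
        """Reap exited children without blocking; return True if none have exited."""
        for pid in list(self.live):
            try:
                reaped, status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                # Not our child any more: someone else already reaped it.
                reaped, status = pid, None
            if reaped == pid:
                self.live.discard(pid)
                self.exited[pid] = (
                    None if status is None else os.waitstatus_to_exitcode(status)
                )
        return not self.exited

    def require_alive(self):
        """Raise a clear error instead of letting a later run() hang."""
        if not self.poll():
            raise ChildExitedError(f"chip child exited: {self.exited}")
```

Calling `require_alive()` at the start of a run would turn the observed hang into an immediate, descriptive error, and the zombie is reaped as a side effect of the check.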

Related: #824

### Steps to Reproduce

Using a PyPTO Serving branch that dispatches non-L3 Qwen3 kernels through an L3 Simpler worker:

```bash
cd /data/liuxu/pypto-serving

task-submit --device auto --max-time 0 --run \
  "PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
   python examples/model/qwen3_14b/npu_generate.py \
     --model-dir /data/linyifan/models/Qwen3-14B \
     --prompt 'Huawei is' \
     --platform a2a3 \
     --max-seq-len 512 \
     --max-new-tokens 5"
```

The serving-side dispatch shape is roughly:

```python
worker = Worker(
    level=3,
    platform="a2a3",
    runtime="tensormap_and_ringbuffer",
    device_ids=[device_id],
    num_sub_workers=0,
)
cid = worker.register(chip_callable)
worker.init()

def orch_fn(orch, _args, _cfg):
    task_args = TaskArgs()
    # host tensors and/or child_memory ContinuousTensor args
    orch.submit_next_level(cid, task_args, call_config, worker=0)

worker.run(orch_fn)
# The chip child is observed as defunct here, while the Worker object is still considered initialized.
# A later worker.run(...) or normal close/switch path hangs.
```

### Expected Behavior

After a successful `Worker.run()`:

- the level-3 worker should remain reusable for a later `Worker.run()` on the same chip child, or
- if the child process exits, the `Worker` should detect and reap it and report a clear unusable/closed state.

In either case, `Worker.close()` should not hang after the child has already exited.
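One way a non-hanging teardown could look is a bounded wait that escalates to `SIGKILL`. This is only a sketch under assumptions: `reap_with_timeout` is a hypothetical helper, not part of Simpler, and it assumes the chip child is a direct child of the calling process (Python 3.9+).

```python
import os
import signal
import time


def reap_with_timeout(pid, timeout_s=5.0, poll_s=0.05):
    """Reap `pid`, waiting at most `timeout_s` before sending SIGKILL.

    Returns the exit code (negative signal number if killed by a signal),
    or None if the process was already reaped or is not our child.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            reaped, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            return None
        if reaped == pid:
            return os.waitstatus_to_exitcode(status)
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)

    # The child did not exit in time: force it down, then collect it.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    try:
        _, status = os.waitpid(pid, 0)
    except ChildProcessError:
        return None
    return os.waitstatus_to_exitcode(status)
```

For a child that is already defunct, as in this report, the first `waitpid` call returns immediately, so teardown never reaches the timeout path.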

### Actual Behavior

The first submitted prefill task completes:

```text
[chip_process pid=574954 dev=4] ready
[timing] prefill: fused 40 layers, 9574.72 ms
```

Immediately afterward, process inspection shows the child process as defunct while the parent Python process remains alive with the Worker still in use:

```text
554986 ... python examples/model/qwen3_14b/npu_generate.py ... --device 4
574954 554986 Z [python] <defunct>
```

The parent then makes no progress into the next decode task; in repeated checks it had to be killed manually. Before the one-shot discard workaround, this blocked offline generation after prefill. With the manual one-shot discard/recreate workaround, the same generation completes:

```text
text:  a Chinese company. The
token_ids: [264, 8453, 2813, 13, 576]
finish_reason: length
```

A separate, resource-related symptom was also observed with small ring settings (`PTO2_RING_HEAP=536870912 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072`): prefill can fail with AICPU `507018`. The lifecycle bug above was reproduced with the larger ring settings, where prefill itself succeeds.

### Git Commit ID

293e88a3277f7fab61b042a4762a29462af58b79

### CANN Version

9.0.0 (`Ascend-cann-toolkit`, `innerversion=V100R001C10SPC001B250`)

### Driver Version

`npu-smi` reports version `26.0.rc1`.

### Host Platform

Linux (aarch64)

### Additional Context

The workaround currently used in PyPTO Serving adds a wrapper-level best-effort discard path for one-shot L3 workers:

- write `_SHUTDOWN` into `_sub_shms`, `_chip_shms`, and `_next_level_shms`
- `waitpid` child PIDs
- close/unlink mailbox shared memory
- clear `_worker`, `_orch`, child PID/shm lists, and initialized state

That avoids the hang but relies on Simpler private internals, so the lifecycle should be fixed or exposed as a supported API in Simpler.
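The steps above can be sketched roughly as follows. This is a standalone illustration, not the actual wrapper code: the numeric value of `_SHUTDOWN` and the mailbox layout are Simpler internals not shown in this report, so `SHUTDOWN_WORD`, the 32-bit state word at offset 0, and `discard_children` are placeholders. The real wrapper passes the worker's private `_sub_shms`, `_chip_shms`, and `_next_level_shms` lists; here plain lists stand in for them.

```python
import os
import struct
from multiprocessing import shared_memory

SHUTDOWN_WORD = 1  # placeholder; the real _SHUTDOWN value is Simpler-internal


def discard_children(shms, pids):
    """Best-effort one-shot discard: signal, reap, unlink, clear."""
    # 1. Ask every child to shut down via its mailbox (assumed layout:
    #    a little-endian int32 state word at offset 0).
    for shm in shms:
        struct.pack_into("<i", shm.buf, 0, SHUTDOWN_WORD)

    # 2. Reap the children; an already-defunct child returns immediately.
    for pid in pids:
        try:
            os.waitpid(pid, 0)
        except ChildProcessError:
            pass  # already reaped elsewhere

    # 3. Close and unlink the mailbox shared memory.
    for shm in shms:
        shm.close()
        try:
            shm.unlink()
        except FileNotFoundError:
            pass  # already unlinked

    # 4. Drop all references so a fresh worker starts from clean state.
    shms.clear()
    pids.clear()
```

Note that step 2 blocks indefinitely if a live child ignores the shutdown word; a supported API in Simpler would presumably want to bound that wait.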

