[Code Health] Worker.close() L2 branch leaks _ChipWorker nanobind instance at interpreter shutdown #1221

@ChaoZheng109

Category

Technical Debt (cleanup, refactor)

Component

Host Runtime

Description

Worker.close() is contracted to release every resource the Worker holds. The
L2 branch violates this: it calls self._chip_worker.finalize() but never drops
the Python reference afterwards (there is no self._chip_worker = None). The
_ChipWorker nanobind instance therefore stays alive for as long as the closed
Worker object does.

This is inconsistent with the two sibling teardown paths, both of which do
drop the handle:

  • the L>=3 branch sets self._worker = None right after self._worker.close();
  • the error/abort path already does self._chip_worker = None after finalize().
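The difference between the two behaviours can be sketched with plain Python stand-ins. `FakeChipWorker`, `LeakyWorker`, `FixedWorker`, and `chip_worker_survives_close` are hypothetical names used only for illustration; they are not the real classes in `python/simpler/worker.py`, and no nanobind object is involved.

```python
import gc
import weakref


class FakeChipWorker:
    """Hypothetical stand-in for the nanobind-backed _ChipWorker."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True


class LeakyWorker:
    """Mimics the current L2 branch: finalize, but keep the reference."""

    def __init__(self):
        self._chip_worker = FakeChipWorker()

    def close(self):
        if self._chip_worker:
            self._chip_worker.finalize()


class FixedWorker(LeakyWorker):
    """Mimics the sibling paths: finalize, then drop the reference."""

    def close(self):
        if self._chip_worker:
            self._chip_worker.finalize()
            self._chip_worker = None


def chip_worker_survives_close(worker):
    """Return True if the chip worker object is still alive after close()."""
    ref = weakref.ref(worker._chip_worker)
    worker.close()
    gc.collect()
    # `worker` itself is still referenced here, like a closed Worker
    # that something else (for example a traceback) keeps alive.
    return ref() is not None


leaky_survives = chip_worker_survives_close(LeakyWorker())
fixed_survives = chip_worker_survives_close(FixedWorker())
print(leaky_survives, fixed_survives)  # True False
```

In the sketch, the finalized stand-in stays reachable through the closed `LeakyWorker`, while `FixedWorker` lets it be collected even though the worker object itself is still alive.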

Observed symptom (CI, intermittent): nanobind prints a reference-leak dump at
interpreter shutdown, e.g.

```
nanobind: leaked 1 instances!
 - leaked instance 0x... of type "_task_interface._ChipWorker"
nanobind: leaked 15 types!
nanobind: leaked 165 functions!
nanobind: this is likely caused by a reference counting issue in the binding code.
```

(The full types/functions list is dumped because nanobind cannot cleanly unload
the module while any one of its instances is still live.)

Why it is intermittent: when a pytest case fails or errors, pytest retains that
case's traceback for reporting. The traceback holds strong references to the
failing frame's locals, including the worker object, which in turn pins
_ChipWorker. Those references survive until interpreter exit, where nanobind's
leak check runs and reports them. Passing runs release the locals normally, so
no dump appears. tests/st/aicore_op_timeout/test_aicore_op_timeout.py is a
frequent trigger because it asserts on a timing-sensitive 507xxx code and on an
elapsed < 10 bound, which can fail on a busy or shared machine.
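The pinning mechanism can be reproduced without pytest. The sketch below is an assumption-laden illustration in plain Python: `FakeWorker` and `failing_case` are hypothetical, and storing the exception in `retained` only imitates a test runner holding on to a failure for its report.

```python
import gc
import weakref


class FakeWorker:
    """Hypothetical stand-in for a Worker that owns a chip handle."""


def failing_case(refs):
    worker = FakeWorker()              # a local, like a worker built in a test
    refs.append(weakref.ref(worker))   # weak reference: does not keep it alive
    raise AssertionError("simulated test failure")


refs = []
try:
    failing_case(refs)
except AssertionError as exc:
    # Keeping the exception keeps exc.__traceback__, which references the
    # frame of failing_case and therefore its local variable `worker`.
    retained = exc

pinned_while_retained = refs[0]() is not None

# Dropping the exception releases the traceback, the frame, and the local.
retained = None
gc.collect()
released_after_drop = refs[0]() is None

print(pinned_while_retained, released_after_drop)  # True True
```

As long as the failure record is held, the local worker cannot be collected; once it is dropped, the worker goes away with the frame.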

This is a benign teardown-ordering artifact (not a runtime C++ leak), but it is
noisy in CI logs and masks any future real nanobind refcount regression.

Location

  • `python/simpler/worker.py` — `Worker.close()`, L2 branch (`if self.level == 2:`), the `self._chip_worker.finalize()` line.

For reference, the consistent siblings:

  • `python/simpler/worker.py` — L>=3 branch: `self._worker = None` after `self._worker.close()`.
  • `python/simpler/worker.py` — error/abort path: `self._chip_worker = None` after `finalize()`.

Proposed Fix

Drop the handle in the L2 branch immediately after finalizing, mirroring the
other two paths:

```python
if self.level == 2:
    if self._chip_worker:
        self._chip_worker.finalize()
        self._chip_worker = None
```

This releases the _ChipWorker instance as soon as the Worker is closed, so it
no longer outlives the module even when a failing test's traceback pins the
Worker object. The change is one line and aligns close() with the L>=3 branch
and the error path.
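A regression check for this could look roughly like the following. It is only a sketch: `SketchWorker` is a hypothetical class that copies just the proposed L2 branch, and the chip worker is a `unittest.mock.MagicMock`, not the real `_ChipWorker`. A real test would construct the actual Worker, whose constructor arguments are not shown in this issue.

```python
from unittest.mock import MagicMock


class SketchWorker:
    """Hypothetical stand-in reproducing only the proposed L2 branch."""

    def __init__(self, chip_worker, level=2):
        self.level = level
        self._chip_worker = chip_worker

    def close(self):
        if self.level == 2:
            if self._chip_worker:
                self._chip_worker.finalize()
                self._chip_worker = None


def test_close_drops_chip_worker_handle():
    chip = MagicMock(name="_ChipWorker")
    worker = SketchWorker(chip)

    worker.close()
    # The handle must be gone once the worker is closed.
    assert worker._chip_worker is None

    # With the handle dropped, a second close() skips the guarded block,
    # so finalize() is not called again.
    worker.close()
    chip.finalize.assert_called_once_with()


test_close_drops_chip_worker_handle()
```

One side effect visible in the sketch: because the guard now sees `None` after the first call, a repeated close() no longer re-runs finalize().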

Priority

Low (no impact today, good to fix eventually)

Environment

  • Git commit: 11d03d9b81e1d29162eb30b7b39386842328559d
  • Host platform: Linux (aarch64)

Related: #1082, #980, #1018, #824 (Worker lifecycle / cleanup — distinct root causes).
