Category
Technical Debt (cleanup, refactor)
Component
Host Runtime
Description
Worker.close() is contracted to release every resource the Worker holds. The
L2 branch violates this: it calls self._chip_worker.finalize() but never drops
the Python reference (self._chip_worker = None). The _ChipWorker nanobind
instance therefore stays alive on the closed Worker object.
This is inconsistent with the two sibling teardown paths, both of which do
drop the handle:
- the L>=3 branch sets self._worker = None right after self._worker.close();
- the error/abort path already does self._chip_worker = None after finalize().
Observed symptom (CI, intermittent): nanobind prints a reference-leak dump at
interpreter shutdown, e.g.
nanobind: leaked 1 instances!
- leaked instance 0x... of type "_task_interface._ChipWorker"
nanobind: leaked 15 types!
nanobind: leaked 165 functions!
nanobind: this is likely caused by a reference counting issue in the binding code.
(The full types/functions list is dumped because nanobind cannot cleanly unload
the module while any one of its instances is still live.)
Why it is intermittent: when a pytest case fails or errors, pytest retains
that case's traceback for reporting, and the traceback strongly references the
failing frame's locals, including the worker object, which in turn pins
_ChipWorker. Those references survive until interpreter exit, where nanobind's
leak check runs and reports them. Passing runs release the locals normally, so
no dump appears. tests/st/aicore_op_timeout/test_aicore_op_timeout.py is a
frequent trigger because it asserts on a timing-sensitive 507xxx code and an
elapsed < 10 bound, either of which can fail on a busy/shared box.
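The pinning mechanism can be reproduced without pytest or nanobind. The sketch below is only an illustration under assumptions: FakeChipWorker, failing_case, capture_failure and probe are hypothetical names, and holding the exception object stands in for pytest keeping the failure's traceback. It relies on CPython's reference counting.

```python
import gc
import weakref


class FakeChipWorker:
    """Hypothetical stand-in for the nanobind _ChipWorker instance."""


probe = {}  # holds a weak reference so the object can be observed without pinning it


def failing_case():
    worker = FakeChipWorker()           # local in the frame that is about to fail
    probe["ref"] = weakref.ref(worker)
    raise AssertionError("elapsed < 10 violated")


def capture_failure():
    # Returning the exception keeps its __traceback__ alive, much as a test
    # runner does when it stores failure information for the final report.
    try:
        failing_case()
    except AssertionError as exc:
        return exc


retained = capture_failure()
# The traceback references failing_case's frame, whose locals include `worker`.
pinned_while_retained = probe["ref"]() is not None

del retained   # drop the exception, and with it the traceback and frames
gc.collect()   # also clear any reference cycles involving the frames
released_after_drop = probe["ref"]() is None
```

As long as the exception is retained, the local stays alive; once it is dropped, the object is collected. In a real run the retention lasts until interpreter exit, which is when the leak check fires.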
This is a benign teardown-ordering artifact (not a runtime C++ leak), but it is
noisy in CI logs and masks any future real nanobind refcount regression.
Location
- `python/simpler/worker.py` — `Worker.close()`, L2 branch (`if self.level == 2:`), the `self._chip_worker.finalize()` line.
For reference, the consistent siblings:
- `python/simpler/worker.py` — L>=3 branch: `self._worker = None` after `self._worker.close()`.
- `python/simpler/worker.py` — error/abort path: `self._chip_worker = None` after `finalize()`.
Proposed Fix
Drop the handle in the L2 branch immediately after finalizing, mirroring the
other two paths:
    if self.level == 2:
        if self._chip_worker:
            self._chip_worker.finalize()
            self._chip_worker = None
This releases the _ChipWorker instance as soon as the Worker is closed, so it
no longer outlives the module even when a failing test's traceback pins the
Worker object. One line; aligns close() with the L>=3 branch and the error
path.
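A possible regression check for this, sketched with stand-ins rather than the real classes: FakeChipWorker, the simplified Worker below, its drop_handle flag and handle_survives_close are all hypothetical and exist only to contrast the current and proposed behaviour. The real Worker.close() does more than this.

```python
import weakref


class FakeChipWorker:
    """Hypothetical stand-in for the nanobind _ChipWorker instance."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True


class Worker:
    """Simplified model of the L2 branch only; not the real implementation."""

    def __init__(self, level=2, drop_handle=True):
        self.level = level
        self._drop_handle = drop_handle   # False models the current behaviour
        self._chip_worker = FakeChipWorker()

    def close(self):
        if self.level == 2:
            if self._chip_worker:
                self._chip_worker.finalize()
                if self._drop_handle:
                    self._chip_worker = None  # the proposed one-line fix


def handle_survives_close(drop_handle):
    worker = Worker(drop_handle=drop_handle)
    ref = weakref.ref(worker._chip_worker)
    worker.close()
    # `worker` is still referenced here, as it would be from a pinned traceback.
    return ref() is not None


survives_without_fix = handle_survives_close(drop_handle=False)
survives_with_fix = handle_survives_close(drop_handle=True)
```

With the handle dropped in close(), the stand-in is released even though the Worker object itself is still referenced, which is the situation a retained traceback creates.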
Priority
Low (no impact today, good to fix eventually)
Environment
- Git commit: 11d03d9b81e1d29162eb30b7b39386842328559d
- Host platform: Linux (aarch64)
Related: #1082, #980, #1018, #824 (Worker lifecycle / cleanup — distinct root causes).