Skip to content

Fix: release _ChipWorker handle in Worker.close() L2 branch#1222

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/worker-l2-close-chipworker-leak
Jul 1, 2026
Merged

Fix: release _ChipWorker handle in Worker.close() L2 branch#1222
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/worker-l2-close-chipworker-leak

Conversation

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Summary

  • Worker.close() L2 branch called self._chip_worker.finalize() but never dropped the Python reference, leaving the _ChipWorker nanobind instance alive on the closed Worker.
  • Set self._chip_worker = None after finalize(), mirroring the L>=3 branch (self._worker = None) and the error path (which already does this).

Why

When a pytest case fails, pytest retains its traceback, which strongly references the failing frame's worker local and through it the _ChipWorker instance — surviving until interpreter exit, where nanobind's leak check fires:

nanobind: leaked instance ... of type "_task_interface._ChipWorker"
nanobind: leaked 15 types! / leaked 165 functions!

(nanobind dumps the whole types/functions list because it cannot cleanly unload the module while any one instance is live.) This is a benign teardown-ordering artifact, not a runtime C++ leak, but it is noisy in CI and masks future real refcount regressions. Dropping the handle on close releases the instance so it no longer outlives the module.

Testing

  • Simulation tests pass — examples/workers/l2/vector_add on a2a3sim (exercises L2 register/init/run/close), no leak dump.
  • Hardware tests pass (not run; teardown-only change, sim-covered).

Closes #1221

The L2 teardown branch called self._chip_worker.finalize() but never
dropped the Python reference, so the _ChipWorker nanobind instance stayed
alive on the closed Worker. This is inconsistent with the L>=3 branch
(self._worker = None) and the error path (self._chip_worker = None).

When a pytest case fails, pytest retains its traceback, which pins the
failing frame's worker local and through it the _ChipWorker instance until
interpreter exit — where nanobind's leak check fires and dumps the
"leaked instance ... _ChipWorker" / leaked types / leaked functions
warning. Dropping the handle on close releases the instance so it no
longer outlives the module.

Closes hw-native-sys#1221

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the close method in python/simpler/worker.py to set self._chip_worker to None after calling its finalize method when self.level is 2. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 46cf39fa-6e90-4c8b-9ceb-bfb2db92c318

📥 Commits

Reviewing files that changed from the base of the PR and between b0b1976 and 629e070.

📒 Files selected for processing (1)
  • python/simpler/worker.py

📝 Walkthrough

Walkthrough

In Worker.close(), the level == 2 branch now sets self._chip_worker = None immediately after calling self._chip_worker.finalize(), matching the behavior of the sibling L>=3 and error/abort teardown paths.

Changes

Stale reference fix in Worker.close()

Layer / File(s) Summary
Null _chip_worker after finalize in L2 branch
python/simpler/worker.py
Adds self._chip_worker = None after self._chip_worker.finalize() in the level == 2 branch, consistent with the L>=3 branch (self._worker = None) and the error/abort path.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐇 A reference left dangling, a ghost in the heap,
One line sets it null so nanobind can sleep.
No leaks in the logs, no noise in CI,
The L2 branch closes with a tidy goodbye!
Consistency wins — now all paths align. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: releasing the _ChipWorker handle in the L2 close path.
Description check ✅ Passed The description matches the code change and explains the cleanup fix and rationale.
Linked Issues check ✅ Passed The change satisfies #1221 by clearing self._chip_worker after finalize() in the L2 close branch.
Out of Scope Changes check ✅ Passed The PR contains a single targeted fix and no unrelated code changes.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ChaoWao ChaoWao merged commit c91deb0 into hw-native-sys:main Jul 1, 2026
16 checks passed
@yanghaoran29

Copy link
Copy Markdown
Contributor

How to Trigger a Genuine Local Dump
nanobind registers the leak checker routine internals_cleanup via Py_AtExit (nb_internals.cpp:562). This routine runs at the very end of interpreter finalization — after module dictionary cleanup and the mandatory garbage collection pass enforced by CPython during shutdown. This leads to the following behavior:

  • Instances referenced solely by ordinary global variables are released during module cleanup before the leak checker executes, so no leak dump is generated.
  • Self-referential cycles combined with gc.disable() are also ineffective, as the forced GC triggered at finalization ignores the gc.disable() flag.

To reproduce a real leak dump, the target instance must survive the entire finalization workflow. The tracebacks generated by failed test cases can trigger this leak because they pin stack frames and create persistent references that CPython cannot collect automatically. I use an equivalent method — ctypes.pythonapi.Py_IncRef(worker) — to "immortalize" the worker object: its reference count will never drop to zero, making it unreclaimable by CPython. This faithfully replicates the edge case where the worker outlives module teardown logic. The test script can be found at /tmp/leaktest/nb_dump.py.

Before the Fix (Buggy Version — self._chip_worker = None Removed)

[nb_dump] worker immortalized (Py_IncRef); exiting -> nanobind Py_AtExit leak check follows
nanobind: leaked 2 instances!
 - leaked instance 0xfffdaeed9f98 of type "_task_interface.ChipCallable"
 - leaked instance 0xfffdaf0fb648 of type "_task_interface._ChipWorker"
nanobind: leaked 15 types!
 - leaked type "_task_interface._Worker"
 - leaked type "ArgDirection"
 - leaked type "_task_interface.RuntimeEnv"
 - leaked type "_task_interface._ChipWorker"
 - leaked type "_task_interface.ChipCallable"
 - ... skipped remainder
nanobind: leaked 165 functions!
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.
See https://nanobind.readthedocs.io/en/latest/refleaks.html

This output exactly matches the leak dump attached to the issue ticket (leaked 15 types! / leaked 165 functions!), and explicitly includes the target instance _task_interface._ChipWorker.

After the Fix (Corrected Version — Exact Code from PR HEAD)

nanobind: leaked 1 instances!
 - leaked instance 0xfffd9b1cd128 of type "_task_interface.ChipCallable"
nanobind: leaked 15 types!
 - leaked type "_task_interface._ChipWorker"
nanobind: leaked 165 functions!
nanobind: this is likely caused by a reference counting issue in the binding code.

The _ChipWorker instance is removed entirely from the leak list (instance count reduced from 2 instances to 1 instances) — this is the exact intended outcome of the PR.

Note: Both states are functionally valid. The residual leaks are benign teardown artifacts and do not impact computational correctness:
[vector_add] golden check PASSED, max |host_out - expected| = 0.000e+00.

Further Experimental Test: Can Clearing Registries Eliminate the Dump Completely?
Building on the current PR, I ran an experimental test by adding extra logic to clear nanobind handle references held at the Worker layer within the L2 branch of close():

if self.level == 2:
    if self._chip_worker:
        self._chip_worker.finalize()
        self._chip_worker = None
        self._callable_registry.clear()   # Newly added line
        self._identity_registry.clear()   # Newly added line
        self._live_handles.clear()        # Newly added line

When re-running the identical nb_dump.py script (with the worker still immortalized via Py_IncRef), zero nanobind: leaked logs appear in stderr, while the golden result check golden check PASSED remains unaffected.

Three-Way Comparison (Official nanobind Exit Leak Check)

State Operations Performed in close() L2 Branch nanobind Exit Leak Check Result
Pre-fix baseline Only call finalize() leaked 2 instances!ChipCallable + _ChipWorker, plus 15 types / 165 functions leaked
Current PR implementation finalize() + set _chip_worker=None leaked 1 instances! → only ChipCallable remains; 15 types / 165 functions still leak (the _ChipWorker leak is resolved, residual noise persists)
Current PR + registry clearing Add _callable_registry/_identity_registry/_live_handles.clear() on top of PR logic Zero leak logs printed; the dump is fully eliminated 100%

In short: The module can only be unloaded cleanly with all leak noise eliminated if all nanobind handles retained by the Worker after close() are fully released. This PR addresses just one link in that chain: the _ChipWorker reference.

Merge Tradeoff Note: The semantic contract of close() is "release all held resources". The routines _cleanup_l3_l2_regions() and finalize() already execute before close() reaches this branch, so clearing these dictionary registries is logically consistent. For rigor, prior to merging this extended cleanup logic, we must verify there are no code paths that access _callable_registry / _identity_registry / _live_handles after close() completes — this can be captured as a dedicated regression test requirement for this modification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Code Health] Worker.close() L2 branch leaks _ChipWorker nanobind instance at interpreter shutdown

3 participants