fix: stable fingerprint for closures capturing non-deterministic state #8228

Open
kamaleshpanda wants to merge 1 commit into huggingface:main from kamaleshpanda:fix/fingerprint-closure-nondeterministic-state
@kamaleshpanda
Fixes #7986

When the function passed to .map() is a closure that captures self, and self holds
non-deterministic state (such as UUIDs or loggers), the fingerprint changes on every run,
which causes cache misses.
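
The failure mode can be reproduced without the library. The sketch below is only an illustration: `Preprocessor`, `run_id`, and `full_state_fingerprint` are hypothetical names, and `repr()` of the instance dict stands in for the serializer, so this is not the actual fingerprinting code in datasets.

```python
import hashlib
import uuid


class Preprocessor:
    """Hypothetical class; `run_id` stands in for non-deterministic state."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase  # the only state the map function reads
        self.run_id = uuid.uuid4()  # new value per instance, never read by the map function

    def make_map_fn(self):
        def process(example):
            text = example["text"]
            return {"text": text.lower() if self.lowercase else text}

        return process


def full_state_fingerprint(func):
    """Hash the bytecode plus the complete state of each captured object.

    Simplification (assumption): repr() of the instance dict replaces real serialization.
    """
    hasher = hashlib.sha256(func.__code__.co_code)
    for cell in func.__closure__ or ():
        state = sorted(vars(cell.cell_contents).items())
        hasher.update(repr(state).encode())
    return hasher.hexdigest()


fn_a = Preprocessor().make_map_fn()
fn_b = Preprocessor().make_map_fn()

same_output = fn_a({"text": "Hi"}) == fn_b({"text": "Hi"})
same_fingerprint = full_state_fingerprint(fn_a) == full_state_fingerprint(fn_b)
print(same_output, same_fingerprint)  # True False
```

Both closures compute the same result, yet the fingerprints differ because `run_id` is part of the hashed state.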

Fix: hash only the primitive attributes of captured objects instead of the full object.

Added a regression test in tests/test_fingerprint.py.

When a map function is a closure capturing 'self', dill serializes the
full object including non-deterministic state (UUIDs, loggers, object IDs).
This causes a cache miss on every new class instantiation even when the
actual computation is identical.

Fix: hash only primitive values from captured objects instead of the
full object, so unrelated internal state doesn't affect the fingerprint.
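
A minimal sketch of the idea, assuming "primitive" means `bool`, `int`, `float`, `str`, `bytes`, and `None`. The names `stable_fingerprint`, `PRIMITIVES`, and `Preprocessor` are hypothetical; this is not the code in this PR, only one way the described approach could look.

```python
import hashlib
import uuid

# Assumption: these are the types treated as "primitive".
PRIMITIVES = (bool, int, float, str, bytes, type(None))


def stable_fingerprint(func):
    """Hash the bytecode and only the primitive values reachable from the closure."""
    hasher = hashlib.sha256(func.__code__.co_code)
    for cell in func.__closure__ or ():
        value = cell.cell_contents
        if isinstance(value, PRIMITIVES):
            hasher.update(repr(value).encode())
            continue
        # Captured object: keep its type name and primitive attributes, skip the rest.
        attrs = getattr(value, "__dict__", {})
        kept = sorted(
            ((k, v) for k, v in attrs.items() if isinstance(v, PRIMITIVES)),
            key=lambda item: item[0],
        )
        hasher.update(type(value).__qualname__.encode())
        hasher.update(repr(kept).encode())
    return hasher.hexdigest()


class Preprocessor:
    """Hypothetical class used only to exercise the sketch."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase  # primitive: included in the hash
        self.run_id = uuid.uuid4()  # UUID object: skipped

    def make_map_fn(self):
        def process(example):
            text = example["text"]
            return {"text": text.lower() if self.lowercase else text}

        return process


fp_a = stable_fingerprint(Preprocessor(lowercase=True).make_map_fn())
fp_b = stable_fingerprint(Preprocessor(lowercase=True).make_map_fn())
fp_c = stable_fingerprint(Preprocessor(lowercase=False).make_map_fn())
print(fp_a == fp_b, fp_a == fp_c)  # True False
```

One limitation of this sketch: a random value stored as a plain string (for example `uuid.uuid4().hex`) is still primitive and would still change the fingerprint, and state held only in non-primitive attributes is ignored even when it affects the computation.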

Fixes huggingface#7986

Development

Successfully merging this pull request may close these issues.

Dataset.map() causes cache miss/fingerprint change when closure captures self containing non-deterministic state.
