On-disk cache bug starting from v0.3.0

I'm hitting a bug with the on-disk cache starting from `v0.3.0`.

Environment:
- GB300
- CUDA 13.1
- Torch 2.10


The repro is pretty simple. Run this twice:
`python -m pytest "quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0]"`

It passes the first time and fails the second time with:
`FAILED quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0] - RuntimeError: CUDA Error: cudaErrorInvalidDeviceFunction`
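
For convenience, the two-run repro can be scripted. This is only a sketch: `run_twice` is a hypothetical helper name, not part of Quack, and it simply runs the same command in two fresh processes so the second one sees whatever the first one left in the on-disk cache.

```python
import subprocess
import sys

# Test ID copied from the repro command above.
TEST_ID = "quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0]"


def run_twice(cmd):
    """Run the same command twice in fresh processes; return both exit codes."""
    return [subprocess.run(cmd).returncode for _ in range(2)]
```

Calling `run_twice([sys.executable, "-m", "pytest", TEST_ID])` from the repo root should return a zero exit code followed by a non-zero one if the bug reproduces.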

Quick ways to unblock myself (neither fixes the root issue):
1. Remove the cache after the first run (i.e. `rm -rf /tmp/root/quack_cache/`)
2. Set `QUACK_CACHE_ENABLED=0`
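
Both workarounds can also be applied from Python, e.g. in a test `conftest.py`. This is a rough sketch: `clear_cache` and `disable_cache` are hypothetical helper names, the cache path is the one from this report and may differ on other machines, and I'm assuming the environment variable has to be set before Quack is imported.

```python
import os
import shutil

# Path from this report; it may differ per user or machine.
CACHE_DIR = "/tmp/root/quack_cache/"


def clear_cache(cache_dir=CACHE_DIR):
    """Workaround 1: delete the on-disk cache so the next run recompiles."""
    shutil.rmtree(cache_dir, ignore_errors=True)


def disable_cache(env=os.environ):
    """Workaround 2: turn the on-disk cache off.

    Assumption: the variable is read when Quack is imported, so call this first.
    """
    env["QUACK_CACHE_ENABLED"] = "0"
```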

A quick analysis from Claude suggests:
>   Root Cause
> 
>   BinaryExecutionEngine (which loads CUDA kernel binaries from .o files) segfaults when initialized inside a torch.library.custom_op dispatch context. The torch dispatch state (device guards, autograd tracking) is thread-local and corrupts the CUDA driver state during binary loading.
> 
>   - First run: kernels are freshly compiled via cute.compile() → returns TVMFFIJitCompiledFunction → works fine
>   - Second run: disk cache hit → cute.runtime.load_module() creates BinaryExecutionEngine inside the custom_op body → CUDA binary init segfaults
> 
>   Fix
> 
>   In ~/quack/quack/cache_utils.py: load cached .o files in a separate thread (via ThreadPoolExecutor(1)), since torch dispatch state is thread-local and the worker thread won't have the custom_op context.

Though this could be solved at different layers, my guess is that a fix in Quack would be the least invasive.
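
To make the suggested fix concrete, here is a minimal sketch of the thread-offloading idea. It is not the actual `cache_utils.py` code: `call_in_worker_thread` is a hypothetical helper, and the real loader (e.g. `cute.runtime.load_module`) would be passed in as `fn`. The demo only uses plain Python thread-local state to show that a worker thread does not see the caller's thread-local values. Whether torch dispatch state is really the root cause here is still unverified.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One dedicated worker thread, reused for every cached-module load.
_loader_pool = ThreadPoolExecutor(max_workers=1)


def call_in_worker_thread(fn, *args, **kwargs):
    """Run fn on the worker thread and block until it returns.

    Exceptions raised by fn are re-raised in the caller by .result().
    """
    return _loader_pool.submit(fn, *args, **kwargs).result()


# Stand-in for thread-local dispatch state set by the calling thread.
_state = threading.local()
_state.in_dispatch = True


def read_state():
    # Returns False on any thread that never set the attribute.
    return getattr(_state, "in_dispatch", False)


print(read_state())                         # True (calling thread)
print(call_in_worker_thread(read_state))    # False (worker thread)
```

The call stays synchronous from the caller's point of view, so a change like this would be local to the cache-loading path.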
