Hitting a bug with the on-disk cache starting from v0.3.0
Environment:
- GB300
- CUDA 13.1
- Torch 2.10
The repro is pretty simple; run this twice:
python -m pytest "quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0]"
It passes the first time and fails the second time with:
FAILED quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0] - RuntimeError: CUDA Error: cudaErrorInvalidDeviceFunction
Quick ways to unblock myself (neither fixes the root issue):
- Remove the cache after the first run, i.e. rm -rf /tmp/root/quack_cache/
- Set QUACK_CACHE_ENABLED=0
A quick query to Claude gave the following recommendation:
Root Cause
BinaryExecutionEngine (which loads CUDA kernel binaries from .o files) segfaults when initialized inside a torch.library.custom_op dispatch context. The torch dispatch state (device guards, autograd tracking) is thread-local and corrupts the CUDA driver state during binary loading.
- First run: kernels are freshly compiled via cute.compile() → returns TVMFFIJitCompiledFunction → works fine
- Second run: disk cache hit → cute.runtime.load_module() creates BinaryExecutionEngine inside the custom_op body → CUDA binary init segfaults
Fix
In ~/quack/quack/cache_utils.py: load cached .o files in a separate thread (via ThreadPoolExecutor(1)), since torch dispatch state is thread-local and the worker thread won't have the custom_op context.
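A minimal sketch of the thread-hop pattern the fix describes, using only the standard library. The names _LOADER_EXECUTOR and load_in_worker_thread are hypothetical and not part of Quack's API, and whether this actually avoids the CUDA error depends on the root-cause analysis above being correct. The demo uses a threading.local flag as a stand-in for the thread-local torch dispatch state, to show that it is not visible on the worker thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, TypeVar

T = TypeVar("T")

# Hypothetical: one long-lived worker thread reserved for loading cached binaries
# (mirrors the ThreadPoolExecutor(1) suggestion; the name is illustrative only).
_LOADER_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="cache-loader")


def load_in_worker_thread(loader: Callable[..., T], *args: Any, **kwargs: Any) -> T:
    """Run loader(*args, **kwargs) on the dedicated worker thread and return its result.

    Thread-local state of the calling thread (e.g. an active custom_op dispatch
    context) is not visible inside the worker thread. The loader must not call
    back into this helper, or the single worker would deadlock waiting on itself.
    """
    future = _LOADER_EXECUTOR.submit(loader, *args, **kwargs)
    # Blocks the caller until loading finishes; re-raises any loader exception here.
    return future.result()


if __name__ == "__main__":
    ctx = threading.local()
    ctx.in_custom_op = True  # stand-in for thread-local dispatch state

    def fake_loader(path: str) -> tuple[str, bool]:
        # Reports whether the caller's thread-local flag is visible on this thread.
        return path, getattr(ctx, "in_custom_op", False)

    print(load_in_worker_thread(fake_loader, "kernel.o"))  # ('kernel.o', False)
```

In cache_utils.py this would wrap the disk-cache load, e.g. load_in_worker_thread(cute.runtime.load_module, path), assuming load_module takes the cached .o path as its argument (its real signature may differ). One caveat: the CUDA current device is also per host thread, so on multi-GPU machines the worker may need to select the caller's device explicitly before loading.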
Though this could be solved at different layers, my guess is that fixing it in Quack would be the least invasive.