On-disk cache bug starting from v0.3.0 #105

@muchulee8

I'm hitting a bug with the on-disk cache starting from v0.3.0.

Environment:
GB300
CUDA 13.1
Torch 2.10

The repro is simple: run this command twice in a row:
python -m pytest "quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0]"

It passes the first time and fails the second time with:
FAILED quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0] - RuntimeError: CUDA Error: cudaErrorInvalidDeviceFunction

Quick ways to unblock myself (neither fixes the root issue):

  1. Remove the cache after first run (i.e. rm -rf /tmp/root/quack_cache/)
  2. Set QUACK_CACHE_ENABLED=0
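
For anyone who wants to script the repro and the two workarounds, here is a minimal sketch. The test ID, the `QUACK_CACHE_ENABLED` variable, and the cache path come from this report; the helper names `build_env` and `run_once` are made up for illustration, and the cache directory may be somewhere else on your machine.

```python
import os
import shutil
import subprocess
import sys

# Test ID and cache path as given in this report; adjust for your setup.
TEST_ID = "quack/tests/test_topk.py::test_topk[False-False-1-64-16-input_dtype0]"
CACHE_DIR = "/tmp/root/quack_cache/"


def build_env(disable_cache: bool) -> dict:
    """Copy the current environment, optionally turning the disk cache off."""
    env = dict(os.environ)
    if disable_cache:
        env["QUACK_CACHE_ENABLED"] = "0"  # workaround 2
    return env


def run_once(disable_cache: bool = False, clear_cache: bool = False) -> int:
    """Run the single test in a fresh process and return pytest's exit code."""
    if clear_cache:
        # workaround 1: drop the on-disk cache before running
        shutil.rmtree(CACHE_DIR, ignore_errors=True)
    cmd = [sys.executable, "-m", "pytest", TEST_ID]
    return subprocess.run(cmd, env=build_env(disable_cache)).returncode
```

Calling `run_once()` twice back to back should match the repro above (second call fails on the cache hit). Calling `run_once(clear_cache=True)` or `run_once(disable_cache=True)` each time corresponds to workaround 1 or 2.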

A quick analysis from Claude suggests:

Root Cause

BinaryExecutionEngine (which loads CUDA kernel binaries from .o files) segfaults when initialized inside a torch.library.custom_op dispatch context. The torch dispatch state (device guards, autograd tracking) is thread-local and corrupts the CUDA driver state during binary loading.

  • First run: kernels are freshly compiled via cute.compile() → returns TVMFFIJitCompiledFunction → works fine
  • Second run: disk cache hit → cute.runtime.load_module() creates BinaryExecutionEngine inside the custom_op body → CUDA binary init segfaults

Fix

In ~/quack/quack/cache_utils.py: load cached .o files in a separate thread (via ThreadPoolExecutor(1)), since torch dispatch state is thread-local and the worker thread won't have the custom_op context.
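
A rough sketch of what that suggestion could look like, not the actual patch. `call_in_worker_thread` and `fake_load_module` are hypothetical names; `fake_load_module` only stands in for the real `cute.runtime.load_module()` call so the snippet runs without a GPU. It demonstrates that thread-local state set on the calling thread is not visible in the worker; whether that is enough to avoid the CUDA error is unverified.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Single long-lived worker, so every cached module is loaded on the same thread.
_loader_pool = ThreadPoolExecutor(max_workers=1)


def call_in_worker_thread(fn, *args, **kwargs):
    """Run fn on the worker thread and block until it finishes.

    .result() re-raises any exception from fn in the calling thread.
    """
    return _loader_pool.submit(fn, *args, **kwargs).result()


# Stand-in for thread-local dispatch state (hypothetical, for the demo only).
_state = threading.local()
_state.in_custom_op = True  # pretend the caller is inside a custom_op body


def fake_load_module(path):
    """Placeholder for the real loader; records which context it observed."""
    return {
        "path": path,
        "saw_custom_op_context": getattr(_state, "in_custom_op", False),
    }


direct = fake_load_module("kernel.o")                            # caller thread
offloaded = call_in_worker_thread(fake_load_module, "kernel.o")  # worker thread
```

Here `direct` sees the caller's thread-local flag and `offloaded` does not. In `cache_utils.py` the cache-hit path would wrap the real load call the same way.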

This could be fixed at different layers, but my guess is that fixing it in Quack would be the least invasive.
