
[BUG] _assign_kernel cache not invalidated after ctx.reset() #644

@shaunc

Summary

numba.cuda.cudadrv.devicearray._assign_kernel uses @lru_cache but is not
context-aware. After calling cuda.current_context().reset(), the cached
kernel holds a stale reference to an unloaded CUDA module, causing subsequent
device array assignments to fail with CUDA_ERROR_INVALID_HANDLE.

Minimal Reproduction

"""
Minimal reproduction of _assign_kernel cache invalidation bug.

Run with: python test_assign_kernel_bug.py
Expected: CUDA_ERROR_INVALID_HANDLE on second assignment
"""
from numba import cuda
import numpy as np

# Step 1: Trigger _assign_kernel compilation and caching
data = cuda.device_array(10, dtype=np.int32)
data[0] = 1  # This compiles and caches _assign_kernel(ndim=1)
cuda.synchronize()
print("First assignment: OK")

# Step 2: Reset context (invalidates all CUDA modules)
ctx = cuda.current_context()
ctx.reset()
print("Context reset")

# Step 3: Try another assignment - FAILS
data2 = cuda.device_array(10, dtype=np.int32)
try:
    data2[0] = 1  # Uses stale cached _assign_kernel
    print("Second assignment: OK")
except Exception as e:
    print(f"Second assignment FAILED: {type(e).__name__}: {e}")

Output

First assignment: OK
Context reset
Second assignment FAILED: CudaAPIError: [400] Call to cuOccupancyMaxPotentialBlockSize results in CUDA_ERROR_INVALID_HANDLE

Analysis

Root Cause

In numba/cuda/cudadrv/devicearray.py:

@lru_cache
def _assign_kernel(ndim):
    @cuda.jit
    def kernel(lhs, rhs):
        ...  # implementation elided
    return kernel

The cache key is only ndim, with no awareness of CUDA context state. When
ctx.reset() is called:

  1. ctx.modules.clear() unloads all compiled CUDA modules
  2. The _assign_kernel LRU cache still holds the old kernel dispatcher
  3. The dispatcher's internal _func.module reference points to an unloaded module
  4. Next arr[idx] = val call uses stale kernel → CUDA_ERROR_INVALID_HANDLE
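
The sequence above can be imitated without a GPU. The following is a minimal sketch, not numba code: FakeModule, FakeContext, assign_kernel and launch are hypothetical stand-ins that only model "a cache keyed on ndim outlives the modules it points to".

```python
from functools import lru_cache


class FakeModule:
    """Stand-in for a loaded CUDA module (hypothetical, not a numba class)."""

    def __init__(self):
        self.loaded = True


class FakeContext:
    """Stand-in for a CUDA context that tracks its loaded modules."""

    def __init__(self):
        self.modules = {}

    def load(self, name):
        self.modules[name] = FakeModule()
        return self.modules[name]

    def reset(self):
        # Mirror the reported behaviour: unload everything, clear in place.
        for module in self.modules.values():
            module.loaded = False
        self.modules.clear()


ctx = FakeContext()


@lru_cache
def assign_kernel(ndim):
    # Keyed on ndim only, like the cache described in this issue.
    return ctx.load(f"assign_{ndim}d")


def launch(ndim):
    module = assign_kernel(ndim)
    if not module.loaded:
        raise RuntimeError("invalid handle: module was unloaded")
    return "ok"


first = launch(1)       # cache miss: "compiles" and caches the module
ctx.reset()             # unloads the module; the lru_cache is untouched
try:
    second = launch(1)  # cache hit returns the unloaded module
except RuntimeError as exc:
    second = str(exc)
print(first, "|", second)
```

The second call never reaches ctx.load() because the cache hit short-circuits it, which is the same shape as the failure reported above.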

Why This Matters

This affects any code that:

  1. Uses device array assignment (arr[idx] = val)
  2. Calls ctx.reset() (common in test fixtures for isolation)
  3. Uses device array assignment again

This is particularly problematic under pytest, where fixtures commonly reset
the context between tests for isolation.

Workaround

from numba import cuda
from numba.cuda.cudadrv.devicearray import _assign_kernel

ctx = cuda.current_context()
ctx.reset()
_assign_kernel.cache_clear()  # Must be called after every reset
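
To avoid forgetting the second call, the two steps can be wrapped in one helper. This is only a sketch: reset_and_clear is a hypothetical name, and DummyContext and dummy_kernel are stand-ins so that the example runs without a GPU.

```python
from functools import lru_cache


def reset_and_clear(ctx, cached_functions):
    """Reset ``ctx``, then clear each given lru_cache-wrapped function."""
    ctx.reset()
    for func in cached_functions:
        func.cache_clear()


# Stand-ins for demonstration only; neither is part of numba.
class DummyContext:
    def __init__(self):
        self.resets = 0

    def reset(self):
        self.resets += 1


@lru_cache
def dummy_kernel(ndim):
    return object()


dummy_ctx = DummyContext()
before = dummy_kernel(1)
reset_and_clear(dummy_ctx, [dummy_kernel])
after = dummy_kernel(1)  # rebuilt, because the cache was cleared
print(before is after)
```

With numba, the call would presumably be reset_and_clear(cuda.current_context(), [_assign_kernel]), for example from a test fixture's teardown.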

Suggested Fix

Option A: Make cache context-aware

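def _assign_kernel(ndim):
    ctx = cuda.current_context()
    # The key must change on reset. Caveat: id(ctx.modules) stays the same if
    # reset() only calls ctx.modules.clear(), which empties the dict in place;
    # a per-context reset counter (hypothetical) would be needed instead.
    cache_key = (ndim, id(ctx.modules))
    # ...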

Option B: Clear cache in Context.reset()

def reset(self):
    self.memory_manager.reset()
    self.modules.clear()
    self.deallocations.clear()
    # Clear caches that may hold stale module references
    from numba.cuda.cudadrv.devicearray import _assign_kernel
    _assign_kernel.cache_clear()

Option C: Use weak references in dispatcher

The cached kernel's module reference could use weakrefs that become invalid
when the module is unloaded, triggering recompilation.
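
The idea can be sketched as follows. Module, WeakContext and Kernel are hypothetical stand-ins, not numba's dispatcher, and the sketch assumes the context holds the only strong reference to each module and that the interpreter frees objects as soon as their reference count drops to zero (as CPython does).

```python
import weakref


class Module:
    """Stand-in for a loaded CUDA module."""

    def __init__(self, name):
        self.name = name


class WeakContext:
    def __init__(self):
        self.modules = {}

    def load(self, name):
        module = Module(name)
        self.modules[name] = module  # the context owns the only strong ref
        return module

    def reset(self):
        self.modules.clear()  # modules become unreachable here


class Kernel:
    """Holds its module weakly and reloads it when the reference is dead."""

    def __init__(self, ctx, name):
        self._ctx = ctx
        self._name = name
        self._module_ref = None
        self.compiles = 0

    def _module(self):
        module = self._module_ref() if self._module_ref is not None else None
        if module is None:
            module = self._ctx.load(self._name)
            self._module_ref = weakref.ref(module)
            self.compiles += 1
        return module

    def launch(self):
        return self._module().name


weak_ctx = WeakContext()
kernel = Kernel(weak_ctx, "assign_1d")
kernel.launch()
kernel.launch()      # reuses the live module
weak_ctx.reset()
kernel.launch()      # weak reference is dead, so the module is reloaded
print(kernel.compiles)
```

On interpreters without immediate reference counting, the weak reference may stay alive for a while after reset(), so an explicit validity flag would be more robust there.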

Environment

  • numba-cuda: 0.22.1
  • numba: 0.62.1
  • CUDA: 12.x
  • Python: 3.13

Related

Other caches that may have similar issues:

  • numba.cuda.dispatcher.configure (also uses @lru_cache)
  • Any other context-agnostic kernel caches
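
A quick way to look for further candidates is to scan a module for attributes that expose the lru_cache interface. The helper below is a sketch; find_lru_cached is a hypothetical name, and the demo module is built on the fly so that the example runs without numba installed.

```python
import types
from functools import lru_cache


def find_lru_cached(module):
    """Return names of module-level attributes that look lru_cache-wrapped."""
    return sorted(
        name
        for name, obj in vars(module).items()
        if callable(getattr(obj, "cache_clear", None))
        and callable(getattr(obj, "cache_info", None))
    )


# Throwaway module with one cached and one plain function.
demo = types.ModuleType("demo")
demo.cached = lru_cache(lambda ndim: ndim)
demo.plain = lambda ndim: ndim

found = find_lru_cached(demo)
print(found)
```

This only inspects module-level names; cached methods defined on classes would need an additional walk over each class's attributes.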

Labels

bug
