Description
Bug
get_device_context() builds a new torch.tensor from self.heap_bases.tolist() on every call (see #466). Once #466 is fixed by precomputing the tensor in __init__, the context tensor will hold a snapshot of heap_bases at construction time.
If heap_bases were to change after init (e.g., via refresh_peer_access() after a new shmem.allocate() or as_symmetric() call with a future allocator), the precomputed context tensor would contain stale base addresses. Kernels using DeviceContext would translate pointers using wrong bases, causing silent data corruption or hangs.
Today this is not a bug — both the torch and vmem allocators produce stable heap_bases after the first refresh_peer_access(). But it will become one if an allocator ever remaps peer VA ranges.
Fix
After precomputing self._device_context in __init__, add an in-place update in refresh_peer_access():
    self._device_context[2:2+self.num_ranks] = self.heap_bases
No allocation, CUDAGraph safe, one line.
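A minimal sketch of the idea, using plain Python lists as a stand-in for the torch tensors so it runs without a GPU. This is not the Iris code: `HeapModel`, the `new_bases` parameter, and the assumption that slots 0 and 1 of the context hold the current rank and the rank count are all hypothetical. Only the slice `[2:2+num_ranks]` comes from the proposed fix.

```python
class HeapModel:
    """Toy model of a heap object with a precomputed device context.

    Hypothetical stand-in: lists replace torch tensors, and the layout
    [cur_rank, num_ranks, base_0, ..., base_{n-1}] is an assumption.
    """

    def __init__(self, cur_rank, heap_bases):
        self.cur_rank = cur_rank
        self.num_ranks = len(heap_bases)
        self.heap_bases = list(heap_bases)
        # Built once at construction time instead of on every call.
        self._device_context = [cur_rank, self.num_ranks, *self.heap_bases]

    def get_device_context(self):
        # Returns the same object every time; nothing is rebuilt here.
        return self._device_context

    def refresh_peer_access(self, new_bases):
        # new_bases is a hypothetical parameter that simulates an
        # allocator remapping peer VA ranges.
        self.heap_bases = list(new_bases)
        # In-place update of the base-address slots. The context object
        # keeps its identity, so existing references see the new bases.
        self._device_context[2:2 + self.num_ranks] = self.heap_bases


heap = HeapModel(cur_rank=0, heap_bases=[0x1000, 0x2000])
ctx = heap.get_device_context()  # reference captured before the refresh

heap.refresh_peer_access([0x1000, 0x9000])  # peer 1 moved

same_object = ctx is heap.get_device_context()
bases_seen = ctx[2:2 + heap.num_ranks]
```

Without the slice assignment in `refresh_peer_access()`, `ctx` would still hold the old base for peer 1, which is the stale-snapshot case described above. With a real tensor, the same slice assignment writes into existing storage, which is what keeps it safe for a captured CUDAGraph.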
Component
iris/iris.py, iris/symmetric_heap.py