Summary
get_child_allocators computes a child's shared/persistent sizes at call time, but the child may not yet have registered the buffers of its own children (the caller's grandchildren). As a result, the shared memory pool is undersized whenever a deeply nested component uses shared buffers.
Root Cause
The allocation chain relies on each parent calling get_child_allocators(self, child) to register the child's total shared/persistent sizes into the parent's buffer group. The problem is the timing of that call relative to the child's own build().
Example with FIRK → Newton-Krylov → Linear Solver:
- FIRK __init__ (e.g. generic_firk.py:238): Calls get_child_allocators(self, self.solver, name="solver"). This computes shared_buffer_size(newton_krylov), but at this point Newton-Krylov has registered only its own buffers (delta, residual, etc.), not the linear solver's child buffers. Result: 0 shared elements registered for the solver.
- Newton-Krylov build() (triggered later via the device_function property): Calls get_child_allocators(self, self.linear_solver), which registers the linear solver's shared size (e.g. 105 elements) with Newton-Krylov. But the registration FIRK made in __init__ is now stale: it still says 0.
- FIRK build() (generic_firk.py:372): Calls get_child_allocators(self, nonlinear_solver) again, but nonlinear_solver is the compiled device function (not the Newton-Krylov instance), which has no buffer group. Result: 0 shared elements again.
- Kernel launch: The shared memory pool is too small. Any linear solver buffer allocated from shared memory indexes past the end of the pool, which raises CUDA_ERROR_ILLEGAL_ADDRESS.
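The ordering problem above can be reproduced with a toy model. This is only a sketch: the Component class and its register_child method are hypothetical stand-ins for the real buffer group and get_child_allocators, and it assumes Newton-Krylov's own buffers are local. The 105-element figure is taken from the example above.

```python
# Toy model of the registration order. Every class and method name here is
# hypothetical and only mimics the behaviour described in this issue.

class Component:
    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared  # shared elements this component needs itself
        self.child_shared = {}        # child name -> size captured at registration

    def shared_buffer_size(self):
        # Own buffers plus whatever child sizes have been captured so far.
        return self.own_shared + sum(self.child_shared.values())

    def register_child(self, child):
        # Mimics get_child_allocators: the size is a snapshot taken at call time.
        self.child_shared[child.name] = child.shared_buffer_size()


linear = Component("linear_solver", own_shared=105)
newton = Component("newton_krylov")  # assumption: its own buffers are local
firk = Component("firk")

firk.register_child(newton)    # FIRK __init__: Newton-Krylov has no children yet
newton.register_child(linear)  # Newton-Krylov build(): runs later

registered = firk.shared_buffer_size()  # what FIRK would size the pool with
required = newton.shared_buffer_size()  # what the solver chain actually needs
print(registered, required)  # 0 105
```

The parent's snapshot is never refreshed, so the pool is sized from the stale value while the kernel indexes up to the required one.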
Current Impact
The bug is currently masked because all existing linear solver buffers (MR/SD) default to location="local", so shared_buffer_size(linear_solver) is always 0. It becomes visible as soon as any linear solver buffer uses location="shared".
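A small sketch of why the default configuration hides the problem. The buffer names and element counts are invented for illustration, and shared_size is a hypothetical helper rather than the library's shared_buffer_size.

```python
# Hypothetical buffer table: (name, element count, location). The names and
# sizes are invented for illustration and do not come from the code base.

def shared_size(buffers):
    """Sum the element counts of buffers placed in shared memory."""
    return sum(count for _name, count, location in buffers if location == "shared")


default_buffers = [("direction", 35, "local"), ("temp", 70, "local")]
moved_buffers = [("direction", 35, "shared"), ("temp", 70, "local")]

masked = shared_size(default_buffers)  # 0: a stale parent size of 0 is still right
exposed = shared_size(moved_buffers)   # 35: the stale parent size is now too small
print(masked, exposed)
```

With every buffer local, the stale value and the true value are both 0, so the ordering bug has no observable effect.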
Affected Components
The same __init__-time get_child_allocators pattern appears in:
- generic_firk.py:238
- generic_dirk.py:232
- backwards_euler.py:128
- crank_nicolson.py:129
The SingleIntegratorRunCore calls at lines 198/201 and 566/569 have a similar pattern but re-register during _setup_compiled_functions, which may partially mitigate the issue.
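One possible way to audit for further occurrences is to scan the source for get_child_allocators calls made inside __init__. The sketch below uses only the standard-library ast module; init_time_registrations and the sample class are hypothetical, and the approach assumes the function is always called under its own name rather than through an alias.

```python
import ast


def init_time_registrations(source, func_name="get_child_allocators"):
    """Return (class name, line number) for each matching call inside __init__."""
    hits = []
    for cls in ast.walk(ast.parse(source)):
        if not isinstance(cls, ast.ClassDef):
            continue
        for item in cls.body:
            if not (isinstance(item, ast.FunctionDef) and item.name == "__init__"):
                continue
            for node in ast.walk(item):
                if isinstance(node, ast.Call):
                    # Plain calls expose .id, method-style calls expose .attr.
                    target = node.func
                    name = getattr(target, "id", None) or getattr(target, "attr", None)
                    if name == func_name:
                        hits.append((cls.name, node.lineno))
    return hits


# Hypothetical sample mirroring the pattern described in this issue.
sample = '''
class Step:
    def __init__(self, solver):
        self.solver = solver
        get_child_allocators(self, self.solver, name="solver")

    def build(self):
        get_child_allocators(self, self.solver)
'''

hits = init_time_registrations(sample)
print(hits)  # [('Step', 5)]
```

Only the call in __init__ is reported; the one in build() is ignored because build-time registration is not the pattern in question.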
Possible Fixes
- Deferred sizing: Don't compute child sizes in __init__. Instead, call get_child_allocators only during build(), after all children have built.
- Two-pass build: Add a prepare() phase that forces all children to register their grandchild buffers before parents compute sizes.
- Use the actual instance in build(): Change FIRK's build() to pass self.solver (the Newton-Krylov instance) instead of nonlinear_solver (the device function) to get_child_allocators. Combined with re-registration, this would pick up the correct sizes.
- Lazy recomputation: Have get_child_allocators store a reference to the child and recompute sizes when get_allocator is called, rather than capturing sizes at registration time.
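As a rough illustration of the lazy recomputation option, the sketch below keeps references to children and walks the tree when the size is requested. LazyComponent and its methods are hypothetical and do not reflect the real buffer-group API. It assumes the component tree is acyclic and that all children are registered before the pool is sized.

```python
# Hypothetical sketch of lazy recomputation: store child references at
# registration time and compute sizes only when they are requested.

class LazyComponent:
    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared
        self.children = []  # references to children, not size snapshots

    def register_child(self, child):
        # No size is captured here, so registration order no longer matters.
        self.children.append(child)

    def shared_buffer_size(self):
        # Recursively sum own buffers and every descendant's buffers.
        return self.own_shared + sum(c.shared_buffer_size() for c in self.children)


linear = LazyComponent("linear_solver", own_shared=105)
newton = LazyComponent("newton_krylov")
firk = LazyComponent("firk")

firk.register_child(newton)    # parent registers first, as in FIRK __init__
newton.register_child(linear)  # grandchild is attached later

total = firk.shared_buffer_size()
print(total)  # 105
```

The trade-off is that the size is recomputed on every query, so a real implementation might cache the result and invalidate it when a child is added.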
Reproduction
Set any linear solver buffer location to "shared" (e.g. a BiCGSTAB witness vector) and run with a FIRK algorithm. The kernel will crash with CUDA_ERROR_ILLEGAL_ADDRESS on the first step.