Buffer registry: get_child_allocators timing causes undersized shared memory pool #520

@ccam80

Summary

get_child_allocators computes a child's shared/persistent sizes at call time, but the child may not yet have registered the buffers of its own children (the parent's grandchildren). As a result, the shared memory pool is undersized whenever a deeply nested component uses shared buffers.

Root Cause

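The allocation chain relies on each parent calling get_child_allocators(self, child) to register the child's total shared/persistent sizes into the parent's buffer group. The problem is the timing of this call relative to the child's own build(): if the parent registers the child before the child has registered its own children, the parent captures a size that omits them.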

Example with FIRK → Newton-Krylov → Linear Solver:

  1. FIRK __init__ (e.g. generic_firk.py:238): Calls get_child_allocators(self, self.solver, name="solver"). This computes shared_buffer_size(newton_krylov), but at this point Newton-Krylov has registered only its own buffers (delta, residual, etc.), not those of its child, the linear solver. Result: 0 shared elements registered for the solver.

  2. Newton-Krylov build() (triggered later via device_function property): Calls get_child_allocators(self, self.linear_solver), which registers the linear solver's shared size (e.g. 105 elements) with Newton-Krylov. But FIRK's registration from step 1 is now stale — it still says 0.

  3. FIRK build() (generic_firk.py:372): Calls get_child_allocators(self, nonlinear_solver) again, but nonlinear_solver is the compiled device function (not the Newton-Krylov instance), which has no buffer group. Result: 0 shared elements again.

  4. Kernel launch: Shared memory pool is too small. Any linear solver buffer allocated from shared memory reads past the end of the pool → CUDA_ERROR_ILLEGAL_ADDRESS.
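
The ordering problem in steps 1 and 2 can be illustrated with a toy model. This is only a sketch: ToyComponent, register_child and shared_size are hypothetical stand-ins, not the project's actual classes or the real get_child_allocators signature, and the 105-element figure is borrowed from the example above.

```python
class ToyComponent:
    """Hypothetical stand-in for a component that owns a buffer group."""

    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared  # shared elements this component needs itself
        self.child_sizes = {}         # sizes snapshotted at registration time

    def shared_size(self):
        # Own buffers plus whatever has been snapshotted for children so far.
        return self.own_shared + sum(self.child_sizes.values())

    def register_child(self, child, name):
        # Eager capture: the child's size is frozen at call time.
        self.child_sizes[name] = child.shared_size()


linear_solver = ToyComponent("linear_solver", own_shared=105)
newton = ToyComponent("newton_krylov")
firk = ToyComponent("firk")

# Step 1: the parent registers its child before the child knows its own child.
firk.register_child(newton, "solver")
# Step 2: the child registers the grandchild later.
newton.register_child(linear_solver, "linear_solver")

print(newton.shared_size())  # 105: the child now sees the grandchild
print(firk.shared_size())    # 0: the parent's snapshot is stale
```

Swapping the two register_child calls makes the toy parent report 105, which is the effect that fixes based on reordering aim for.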

Current Impact

The bug is currently masked because all existing linear solver buffers (MR/SD) default to location="local", so shared_buffer_size(linear_solver) is always 0. It becomes visible as soon as any linear solver buffer uses location="shared".

Affected Components

The same __init__-time get_child_allocators pattern appears in:

  • generic_firk.py:238
  • generic_dirk.py:232
  • backwards_euler.py:128
  • crank_nicolson.py:129

The SingleIntegratorRunCore calls at lines 198/201 and 566/569 have a similar pattern but re-register during _setup_compiled_functions, which may partially mitigate the issue.

Possible Fixes

  1. Deferred sizing: Don't compute child sizes in __init__. Instead, call get_child_allocators only during build(), after all children have built.

  2. Two-pass build: Add a prepare() phase that forces every child to register the buffers of its own children before any parent computes sizes.

  3. Use the actual instance in build(): Change FIRK's build() to pass self.solver (the Newton-Krylov instance) instead of nonlinear_solver (the device function) to get_child_allocators. Combined with re-registration, this would pick up the correct sizes.

  4. Lazy recomputation: Have get_child_allocators store a reference to the child and recompute sizes when get_allocator is called, rather than capturing sizes at registration time.
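
As a rough illustration of option 4, the toy model below stores a reference to each child and sums sizes on demand instead of capturing them at registration. The names LazyComponent, register_child and shared_size are hypothetical and do not correspond to the real buffer registry API; the sketch only shows that registration order stops mattering once sizes are computed lazily.

```python
class LazyComponent:
    """Hypothetical component that resolves child sizes on demand."""

    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared  # shared elements this component needs itself
        self.children = {}            # references to children, not captured sizes

    def register_child(self, child, name):
        # Keep the child itself so its size can be recomputed later.
        self.children[name] = child

    def shared_size(self):
        # Walk the tree at query time, so late registrations are included.
        return self.own_shared + sum(
            child.shared_size() for child in self.children.values()
        )


lazy_linear = LazyComponent("linear_solver", own_shared=105)
lazy_newton = LazyComponent("newton_krylov")
lazy_firk = LazyComponent("firk")

# Same problematic order as in the root cause: parent first, grandchild later.
lazy_firk.register_child(lazy_newton, "solver")
lazy_newton.register_child(lazy_linear, "linear_solver")

print(lazy_firk.shared_size())  # 105
```

The trade-off is that every size query walks the component tree, so a real implementation would probably want to resolve sizes once, just before the pool is sized for kernel launch.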

Reproduction

Set any linear solver buffer location to "shared" (e.g. a BiCGSTAB witness vector) and run with a FIRK algorithm. The kernel will crash with CUDA_ERROR_ILLEGAL_ADDRESS on the first step.
