Buffer registry: get_child_allocators timing causes undersized shared memory pool #520

## Summary

`get_child_allocators` computes the child's shared/persistent sizes at **call time**, but at that point the child may not yet have registered the buffers of its own children (the caller's grandchildren). As a result, the shared memory pool is undersized whenever a deeply nested component uses shared buffers.

## Root Cause

The allocation chain relies on each parent calling `get_child_allocators(self, child)` to register the child's total shared/persistent sizes into the parent's buffer group. The problem is **when** this happens relative to the child's own `build()`.

Example with FIRK → Newton-Krylov → Linear Solver:

1. **FIRK `__init__`** (e.g. `generic_firk.py:238`): Calls `get_child_allocators(self, self.solver, name="solver")`. This computes `shared_buffer_size(newton_krylov)` — but at this point, Newton-Krylov has only registered its **own** buffers (delta, residual, etc.), not the linear solver's child buffers. Result: **0 shared elements** registered for the solver.

2. **Newton-Krylov `build()`** (triggered later via `device_function` property): Calls `get_child_allocators(self, self.linear_solver)`, which registers the linear solver's shared size (e.g. 105 elements) with Newton-Krylov. But FIRK's registration from step 1 is now **stale** — it still says 0.

3. **FIRK `build()`** (`generic_firk.py:372`): Calls `get_child_allocators(self, nonlinear_solver)` again, but `nonlinear_solver` is the compiled **device function** (not the Newton-Krylov instance), which has no buffer group. Result: **0 shared elements** again.

4. **Kernel launch**: Shared memory pool is too small. Any linear solver buffer allocated from shared memory reads past the end of the pool → `CUDA_ERROR_ILLEGAL_ADDRESS`.
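
The ordering problem in steps 1 and 2 can be reproduced in a few lines of plain Python. This is a minimal sketch, not the real buffer registry: `Component`, `register_child`, and the `own_shared` attribute are hypothetical stand-ins, and `register_child` is assumed to snapshot the child's size the way `get_child_allocators` is described as doing above. The 105-element figure is the example value from step 2.

```python
class Component:
    """Hypothetical stand-in for a buffer-owning component (not the real registry)."""

    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared  # shared elements this component owns itself
        self.child_shared = {}        # child name -> size captured at registration

    def shared_buffer_size(self):
        # Own buffers plus whatever sizes were captured for children.
        return self.own_shared + sum(self.child_shared.values())

    def register_child(self, child):
        # Assumed eager behaviour: the child's size is a snapshot taken now.
        self.child_shared[child.name] = child.shared_buffer_size()


linear = Component("linear_solver", own_shared=105)
newton = Component("newton_krylov")
firk = Component("firk")

firk.register_child(newton)    # step 1: FIRK __init__, Newton-Krylov has no children yet
newton.register_child(linear)  # step 2: Newton-Krylov build() registers the linear solver

stale_pool = firk.shared_buffer_size()   # what FIRK would size the pool to
required = newton.shared_buffer_size()   # what the solver chain actually needs

print(stale_pool, required)
```

In this toy model the parent's snapshot never sees the registration made in step 2, which is the same staleness the steps above describe.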

## Current Impact

The bug is currently masked because all existing linear solver buffers (MR/SD) default to `location="local"`, so `shared_buffer_size(linear_solver)` is always 0. It surfaces as soon as any linear solver buffer uses `location="shared"`.

## Affected Components

The same `__init__`-time `get_child_allocators` pattern appears in:
- `generic_firk.py:238`
- `generic_dirk.py:232`
- `backwards_euler.py:128`
- `crank_nicolson.py:129`

The `SingleIntegratorRunCore` calls at lines 198/201 and 566/569 have a similar pattern but re-register during `_setup_compiled_functions`, which may partially mitigate the issue.

## Possible Fixes

1. **Deferred sizing**: Don't compute child sizes in `__init__`. Instead, call `get_child_allocators` only during `build()`, after all children have built.

2. **Two-pass build**: Add a `prepare()` phase that forces all children to register their grandchild buffers before parents compute sizes.

3. **Use the actual instance in `build()`**: Change FIRK's `build()` to pass `self.solver` (the Newton-Krylov instance) instead of `nonlinear_solver` (the device function) to `get_child_allocators`. Combined with re-registration, this would pick up the correct sizes.

4. **Lazy recomputation**: Have `get_child_allocators` store a reference to the child and recompute sizes when `get_allocator` is called, rather than capturing sizes at registration time.
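
As an illustration of option 4, the sketch below stores a reference to the child and walks the tree when the size is requested, so registration order no longer matters. It is a toy model under stated assumptions: `LazyComponent` and its methods are hypothetical names, not the registry's actual API, and the real `get_allocator` path may need caching or invalidation that is not shown here.

```python
class LazyComponent:
    """Hypothetical component that resolves child sizes on demand."""

    def __init__(self, name, own_shared=0):
        self.name = name
        self.own_shared = own_shared  # shared elements this component owns itself
        self.children = {}            # child name -> child instance (not a size)

    def register_child(self, child):
        # Keep the reference; do not capture a size at registration time.
        self.children[child.name] = child

    def shared_buffer_size(self):
        # Recomputed on every call, so late grandchild registrations are seen.
        return self.own_shared + sum(
            child.shared_buffer_size() for child in self.children.values()
        )


lazy_linear = LazyComponent("linear_solver", own_shared=105)
lazy_newton = LazyComponent("newton_krylov")
lazy_firk = LazyComponent("firk")

lazy_firk.register_child(lazy_newton)    # parent registers first, as in FIRK __init__
lazy_newton.register_child(lazy_linear)  # grandchild arrives later, as in build()

lazy_pool = lazy_firk.shared_buffer_size()
print(lazy_pool)
```

The trade-off in this design is that every size query traverses the component tree; for shallow hierarchies like FIRK → Newton-Krylov → Linear Solver that cost is likely negligible, but this has not been measured.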

## Reproduction

Set any linear solver buffer location to `"shared"` (e.g. a BiCGSTAB witness vector) and run with a FIRK algorithm. The kernel will crash with `CUDA_ERROR_ILLEGAL_ADDRESS` on the first step.
