fix: adaptive loops no longer hang when newton solver fails on a save boundary by ccam80 · Pull Request #519 · ccam80/cubie

ccam80 · 2026-02-07T02:37:14Z

On a failed step, ode_loop.py set every error entry to an arbitrarily large sentinel (1e16). That value was then divided by atol/rtol and squared, which overflowed to inf and set the gain to inf. The gain was clamped to gain_max, so the step size paradoxically grew and the loop could never advance past the step boundary. Because t_proposal != t in this case, the stagnation check was not triggered. This PR also contains some of the test refactor sweep, due to poor branch discipline on my part.

feat: more-sensible defaults are set when only a subset of dt_min, dt_max, dt are set
docs: add explanation of output timing, loop duration/start timing, and step timing to the user guide
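
The overflow described in the first paragraph can be reproduced on the host with a minimal sketch. It assumes float32 precision and a tolerance of 1e-6; the helpers `to_f32` and `squared_ratio_f32` are hypothetical names for illustration and are not part of cubie.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float to float32, overflowing to +/-inf as IEEE 754 does."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct refuses values beyond the float32 range; emulate the overflow.
        return math.copysign(math.inf, x)


def squared_ratio_f32(error: float, tol: float) -> float:
    """Compute (error / tol) ** 2 with every intermediate rounded to float32."""
    ratio = to_f32(to_f32(error) / to_f32(tol))
    return to_f32(ratio * ratio)


# Sentinel error from a failed step: (1e16 / 1e-6) ** 2 is about 1e44,
# which is far above the float32 maximum of roughly 3.4e38.
sentinel = squared_ratio_f32(1e16, 1e-6)

# An ordinary rejected step stays comfortably finite.
ordinary = squared_ratio_f32(1e-4, 1e-6)

print(math.isinf(sentinel), math.isfinite(ordinary))
```

Any norm accumulated from the sentinel value is therefore non-finite in float32, so whatever consumes it needs an explicit guard.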

Remove three vestigial properties from SingleIntegratorRun (algorithm_key, compiled_loop_function, threads_per_loop) and update BatchSolverKernel consumers to use the underlying properties directly (algorithm, device_function, threads_per_step).

Fix set("algorithm") and set("step_controller") in SingleIntegratorRunCore._switch_algos/_switch_controllers, which incorrectly returned sets of individual characters instead of single-element sets.

Add tests_plan.md (master functionality inventory) and progress.md (running session log) for the multi-session test sweep.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
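
The set("algorithm") bug mentioned in this commit message is a general Python pitfall rather than anything specific to cubie. A short illustration, using only built-ins:

```python
# set() iterates over its argument, so a string is split into characters.
chars = set("algorithm")

# A set literal (or set() of a one-element list) keeps the string whole.
single = {"algorithm"}
also_single = set(["algorithm"])

print("algorithm" in chars)   # False: only single characters are members
print("algorithm" in single)  # True
```

Membership tests against `chars` silently fail for the full key, which is why the bug does not raise an error on its own.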

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- _convert_inventory.py: one-shot parser (tests_plan.md -> JSON)
- query_inventory.py: agent-facing CLI for file/tag/method queries
- _inventory.json: 100 files, 3482 items with auto-inferred tags

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


greptile-apps · 2026-02-07T02:40:39Z

Greptile Overview

Greptile Summary

This PR updates the CUDA IVP loop to handle step-function failures without hanging at save boundaries, and refactors step controller configuration/plumbing (including renaming dt0 → dt) plus associated tests/docs.

The main loop change introduces step_failed handling and stagnation detection inside src/cubie/integrators/loops/ode_loop.py, alongside a new integrated numerical regression test intended to cover the previously observed hang.

One merge-blocking issue remains: on step failure the loop currently forces error to a huge value (1e16), which for the adaptive controllers causes gain to clamp to max_gain and increase the timestep, counter to the intended “shrink dt and move off the boundary” behavior. That can recreate the non-advancing boundary condition the PR aims to fix.

Confidence Score: 2/5

This PR is not safe to merge until step-failure dt adaptation is corrected.
The core regression fix is undermined by forcing error=1e16 on step failure, which drives the adaptive controller to increase dt (clamped to max_gain) and can reproduce the hang/non-advancing behavior at save boundaries. Other refactors appear coherent but depend on this loop behavior being correct.
src/cubie/integrators/loops/ode_loop.py

Important Files Changed

Filename	Overview
src/cubie/integrators/loops/ode_loop.py	Adds step_failed handling by forcing error to 1e16 and stagnation detection; current 1e16 error forces dt growth and can reintroduce non-advancing behavior.
src/cubie/integrators/step_control/adaptive_step_controller.py	Refactors adaptive controller config to use dt property with geometric-mean fallback and moves bounds validation into controller; no direct runtime issues found.
src/cubie/integrators/step_control/base_step_controller.py	Introduces common BaseStepController init/update plumbing and dt property rename; appears consistent with new adaptive config behavior.
src/cubie/integrators/loops/ode_loop_config.py	Renames dt0 to dt in ODELoopConfig; config now provides dt via _dt field with precision-cast.
tests/integrated_numerical_tests/test_dt_min_hang.py	Adds regression test to ensure solve_ivp completes under f32 save drift scenario; test lacks an explicit fast-fail mechanism if hang returns.

greptile-apps

6 files reviewed, 2 comments

greptile-apps

10 files reviewed, 4 comments

greptile-apps · 2026-02-07T03:04:10Z

Additional Comments (4)

src/cubie/integrators/step_control/adaptive_step_controller.py
Invalid optional dt_max

AdaptiveStepControlConfig._dt_max is typed as Optional[float] but uses validator=getype_validator(float, 0), which rejects None. This makes dt_max=None impossible even though the config/API suggests it’s allowed (and BaseAdaptiveStepController._resolve_step_params() relies on detecting missing bounds). Use the optional validator (or make the field non-optional).

    _dt_max: Optional[float] = field(
        default=1.0, validator=opt_getype_validator(float, 0)
    )

src/cubie/integrators/step_control/adaptive_step_controller.py
dt_max can be None

After the refactor, dt_max returns self.precision(self._dt_max). If _dt_max is still allowed to be None (per the Optional[float] type and doc defaults), this will crash at runtime when precision(None) is called. Either ensure _dt_max is always filled during config construction/derivation, or keep the previous fallback (dt_min * 100) in this property.

src/cubie/integrators/step_control/base_step_controller.py
Base config dt unset

BaseStepControllerConfig now owns _dt: Optional[float] = None, and adaptive controllers’ dt property is precision(self._dt). If the user provides only dt_min or only dt_max, _resolve_step_params() doesn’t set kwargs["dt"], so _dt stays None and later self.dt will call precision(None) in _ensure_sane_bounds().

This breaks the PR’s stated behavior (“sensible defaults when only a subset of dt_min, dt_max, dt are set”). You likely need derivation rules for the one-bound cases (e.g., if only dt_min provided, set dt = dt_min; if only dt_max, set dt = dt_max; plus derive the missing opposite bound), or disallow those partial inputs with a clear error.

src/cubie/integrators/loops/ode_loop.py
Negative dt_eff possible

At output-boundary adjustment, dt_eff = precision(next_event - t_prec). If next_save/next_update_summary has drifted behind t_prec (float32 accumulation vs float64 t), this becomes negative and is passed into step_function(...) (line ~751). That can trigger undefined behavior/hangs in the stepper and matches the PR’s described failure mode; the new regression test documents exactly this scenario.

Fix needs to ensure next_event >= t_prec before subtracting (e.g., clamp next_event = max(next_event, t_prec) or advance the event time when it falls behind), or explicitly treat dt_eff <= 0 as stagnation/irrecoverable to avoid looping forever.
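
The second option above (treating a non-positive distance to the next event as stagnation) can be sketched as follows. This is a host-side illustration only; `effective_step` is a hypothetical helper, not a function in ode_loop.py, and the real loop runs as a CUDA device function.

```python
from typing import Tuple


def effective_step(t: float, dt: float, next_event: float) -> Tuple[float, bool]:
    """Return (dt_eff, stagnant) for a step that must not overshoot next_event.

    If the event time has drifted to or behind the current time, no positive
    step can reach it, so the caller is told to stop instead of receiving a
    zero or negative step size.
    """
    remaining = next_event - t
    if remaining <= 0.0:
        return 0.0, True
    return min(dt, remaining), False


# Normal case: the step is truncated to land on the event.
dt_eff, stagnant = effective_step(t=1.0, dt=0.1, next_event=1.05)

# Drifted case: the event time sits just behind t after accumulated rounding.
dt_bad, stagnant_bad = effective_step(t=1.0000001, dt=0.1, next_event=1.0)
```

Returning an explicit flag keeps the decision (advance the event time, or raise an error status) with the caller instead of hiding it inside the clamp.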

greptile-apps

7 files reviewed, 1 comment

greptile-apps

5 files reviewed, 3 comments

greptile-apps · 2026-02-07T08:33:27Z

    @property
-    def dt0(self) -> float:
+    def dt(self) -> float:
        """Return the initial step size."""
-        return self.precision(sqrt(self.dt_min * self.dt_max))
+        return self.precision(self._dt)



dt can be None

AdaptiveStepControlConfig.dt returns self.precision(self._dt), but _dt is defined on BaseStepControllerConfig as Optional[float] and _resolve_step_params() only sets kwargs["dt"] when dt is provided or when both bounds are present. If a user provides only one of dt_min/dt_max (or neither), _dt can remain None, and this property will raise during precision-cast. This makes “more-sensible defaults when only a subset is set” fail with a runtime error.

Either ensure _resolve_step_params() always sets dt (derive from the single bound or set a default), or enforce dt as required at config validation time.

greptile-apps · 2026-02-07T08:33:29Z

+        self._resolve_step_params(dt, kwargs)
+        config = build_config(
+            self._config_class,
+            required={"precision": precision, "n": n},
+            **kwargs,
+        )
+        self.setup_compile_settings(config)
+        self._ensure_sane_bounds()
+        self.register_buffers()


Derived dt not passed to config

BaseStepController.__init__() calls self._resolve_step_params(dt, kwargs) but then ignores the dt argument when building the config: build_config(..., **kwargs). In BaseAdaptiveStepController._resolve_step_params(), you set kwargs["dt"] = dt only when dt is provided or both bounds are present; otherwise _dt can remain None (see other comment). More importantly, if a subclass chooses to derive the effective initial dt from the positional dt argument without writing it into kwargs, the config will never receive it.

Given the new “subset of dt_min/dt_max/dt set” behavior, this is a fragile footgun: dt should be plumbed into kwargs (or into required) so _dt is always populated consistently.

greptile-apps · 2026-02-07T08:33:31Z

Additional Comments (1)

src/cubie/integrators/step_control/adaptive_step_controller.py
Optional dt_max no longer optional

AdaptiveStepControlConfig._dt_max is annotated as Optional[float], but its validator is getype_validator(float, 0) and its default is 1.0 (not None). More importantly, dt_max now returns self.precision(self._dt_max) (no fallback), so passing dt_max=None (which is explicitly described in the PR description as a supported partial-specification case) will fail validation / precision-cast at runtime. This breaks the “subset of dt_min/dt_max/dt set” story and can crash controller construction when users omit dt_max.

Fix by either making _dt_max truly optional (validator opt_getype_validator, default None, and a derivation path), or by removing the Optional/None support and updating docs/tests accordingly.

greptile-apps

6 files reviewed, 2 comments

greptile-apps · 2026-02-07T09:55:28Z

+    def _resolve_step_params(self, dt: float, kwargs: dict) -> None:
+        """Derive bounds from dt and track user-provided values.

        Parameters
        ----------
-        config
-            Configuration for the controller.
+        dt
+            Initial step size, or None if not provided.
+        kwargs
+            Mutable dict of keyword arguments. Modified in place.
+        """
+        # Track user-provided values BEFORE derivation
+        if dt is not None:
+            self._user_step_params["dt"] = dt
+        if "dt_min" in kwargs:
+            self._user_step_params["dt_min"] = kwargs["dt_min"]
+        if "dt_max" in kwargs:
+            self._user_step_params["dt_max"] = kwargs["dt_max"]
+
+        # Derive missing values
+        if dt is not None:
+            kwargs.setdefault("dt_min", dt / 100)
+            kwargs.setdefault("dt_max", dt * 100)
+            kwargs["dt"] = dt
+        else:
+            # dt not provided; derive from bounds if both present
+            dt_min = kwargs.get("dt_min")
+            dt_max = kwargs.get("dt_max")
+            if dt_min is not None and dt_max is not None:
+                kwargs["dt"] = sqrt(dt_min * dt_max)


Logic error: when user provides no dt/dt_min/dt_max, kwargs["dt"] is never set even though _dt_min and _dt_max have defaults (1e-6, 1.0).

Sequence:

User calls AdaptiveController(precision=float32) with no step params

_resolve_step_params(None, {}) runs

Line 202-203: dt_min/dt_max = None (checking kwargs, not config defaults)

Line 204-205: condition false, kwargs["dt"] never set

Config built with _dt=None (base class default)

Line 116 crashes: self.precision(None)

Fix: check config defaults or always set kwargs["dt"] to geometric mean of defaults when both bounds present.

greptile-apps

20 files reviewed, 3 comments

greptile-apps · 2026-02-07T10:21:53Z

    def dt_max(self) -> float:
        """Return the maximum permissible step size."""
-        value = self._dt_max
-        if value is None:
-            value = self._dt_min * 100
-        return self.precision(value)
+        return self.precision(self._dt_max)


Optional dt_max can crash

_dt_max is declared as Optional[float], but dt_max unconditionally returns self.precision(self._dt_max). If _dt_max is ever None (e.g., via config construction/plumbing), this will raise at runtime. Either make _dt_max non-optional (since you have a default) or guard here and provide a deterministic fallback.

Suggested change

    @property
    def dt_max(self) -> float:
        """Return the maximum permissible step size."""
        assert self._dt_max is not None
        return self.precision(self._dt_max)

greptile-apps · 2026-02-07T10:21:54Z

+    def dt(self) -> float:
+        """Return the initial step size.
+
+        When the user has not provided an explicit dt, returns the
+        geometric mean of dt_min and dt_max.


Geometric-mean fallback unsafe

When _dt is None, dt falls back to sqrt(self._dt_min * self._dt_max) but doesn’t ensure _dt_max is non-None. If config plumbing allows _dt_max=None, this path raises TypeError (and settings_dict also triggers it). Consider ensuring _dt_min/_dt_max are always concrete floats by construction, or explicitly handling the None case here.

Suggested change

    def dt(self) -> float:
        """Return the initial step size.

        When the user has not provided an explicit dt, returns the
        geometric mean of dt_min and dt_max.
        """
        if self._dt is not None:
            return self.precision(self._dt)
        assert self._dt_min is not None and self._dt_max is not None
        return self.precision(sqrt(self._dt_min * self._dt_max))

@greptile
config plumbing doesn't allow this. dt_min and dt_max have defaults in the attrs class; if a user deliberately set one to None, it might break (if the config class didn't use validators), but if they were to do that I would presume that it should break.

greptile-apps · 2026-02-07T10:21:55Z

+        # Track user-provided values BEFORE derivation
+        if dt is not None:
+            self._user_step_params["dt"] = dt
+        if "dt_min" in kwargs:
+            self._user_step_params["dt_min"] = kwargs["dt_min"]
+        if "dt_max" in kwargs:
+            self._user_step_params["dt_max"] = kwargs["dt_max"]
+
+        # Derive missing values
+        if dt is not None:
+            kwargs.setdefault("dt_min", dt / 100)
+            kwargs.setdefault("dt_max", dt * 100)
+            kwargs["dt"] = dt
+        else:
+            # dt not provided; derive from bounds if both present
+            dt_min = kwargs.get("dt_min")
+            dt_max = kwargs.get("dt_max")


Single-bound dt remains unset

_resolve_step_params() only sets kwargs["dt"] when dt is provided or when both dt_min and dt_max are provided. If a user supplies only one bound (or neither), config can be built with _dt=None, which then relies on the dt property’s geometric-mean fallback (and can still crash if bounds are not concrete). To match the PR’s “more-sensible defaults” intent, _resolve_step_params() should always populate dt deterministically from the available information (single bound + default other bound, or config defaults).

Suggested change

        # Track user-provided values BEFORE derivation
        if dt is not None:
            self._user_step_params["dt"] = dt
        if "dt_min" in kwargs:
            self._user_step_params["dt_min"] = kwargs["dt_min"]
        if "dt_max" in kwargs:
            self._user_step_params["dt_max"] = kwargs["dt_max"]

        # Derive missing values
        if dt is not None:
            kwargs.setdefault("dt_min", dt / 100)
            kwargs.setdefault("dt_max", dt * 100)
            kwargs["dt"] = dt
        else:
            dt_min = kwargs.get("dt_min")
            dt_max = kwargs.get("dt_max")
            if dt_min is not None and dt_max is not None:
                kwargs["dt"] = sqrt(dt_min * dt_max)
            elif dt_min is not None:
                kwargs["dt"] = dt_min
            elif dt_max is not None:
                kwargs["dt"] = dt_max

@greptile reexamine this. dt is returned from compile settings via a property that returns the geometric mean of dt_min and dt_max if _dt is None. That achieves the behaviour set out in docs/timing.rst: if not set by the user, the starting dt is the geometric mean of dt_max and dt_min.

You're correct. Let me trace through the logic:

_resolve_step_params() populates kwargs["dt"] only when dt is provided OR both bounds are provided

kwargs is used to build the config, setting config._dt

The dt property returns self._dt if self._dt is not None else sqrt(self.dt_min * self.dt_max)

So when only one bound is provided:

_resolve_step_params() derives the missing bound from the provided one (e.g., dt_min → dt_max = dt_min * 100)

kwargs["dt"] remains unset, so config._dt = None

The property's fallback computes the geometric mean of the now-complete bounds

This correctly implements the documented behavior. The original concern about "unset dt" was misplaced - the property's fallback handles it cleanly, and both bounds are always concrete by the time config is built.

My initial comment was wrong. The implementation already achieves the intended "sensible defaults" behavior through the combination of bound derivation in _resolve_step_params() and the geometric mean fallback in the dt property.
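
The behaviour agreed on in this thread can be condensed into a minimal sketch. It assumes the defaults quoted earlier in the review (dt_min = 1e-6, dt_max = 1.0), uses a plain dataclass instead of the attrs config, and omits the precision cast; `StepConfigSketch` and `resolve_step_params` are illustrative stand-ins, not the cubie classes.

```python
from dataclasses import dataclass
from math import isclose, sqrt
from typing import Optional


@dataclass
class StepConfigSketch:
    """Stand-in config: both bounds always hold concrete defaults."""

    dt_min: float = 1e-6
    dt_max: float = 1.0
    _dt: Optional[float] = None

    @property
    def dt(self) -> float:
        # An explicit dt wins; otherwise use the geometric mean of the bounds.
        if self._dt is not None:
            return self._dt
        return sqrt(self.dt_min * self.dt_max)


def resolve_step_params(dt: Optional[float], kwargs: dict) -> dict:
    """Derive missing bounds from dt, as in the diff quoted above."""
    resolved = dict(kwargs)
    if dt is not None:
        resolved.setdefault("dt_min", dt / 100)
        resolved.setdefault("dt_max", dt * 100)
        resolved["_dt"] = dt
    return resolved


no_params = StepConfigSketch(**resolve_step_params(None, {}))
only_dt = StepConfigSketch(**resolve_step_params(0.01, {}))
only_min = StepConfigSketch(**resolve_step_params(None, {"dt_min": 1e-4}))
```

In this sketch a lone bound is paired with the other bound's default; whether the real controller instead derives the missing bound from the provided one is not shown in the quoted diff.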

greptile-apps

6 files reviewed, 1 comment

greptile-apps

5 files reviewed, 2 comments

greptile-apps · 2026-02-08T22:37:07Z

+"""Regression tests for f32 save-event drift hang.
+
+When ``save_every`` is not exactly representable in float32


Hanging regression not asserted

This test only asserts n_saves >= 80 after solve_ivp(...) returns. If the regression is a true hang, CI will just time out rather than producing an actionable assertion. Consider adding a deterministic termination condition that fails fast (e.g., limiting max loop iterations / max rejected steps and asserting the solver returns an error bit instead of hanging), so the test fails quickly and clearly when the bug reappears.
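
One way to read the fail-fast suggestion is an iteration budget that turns a would-be hang into a status code. The sketch below is a host-side toy under that assumption; `run_bounded`, the step callables, and the status constants are hypothetical and do not reflect cubie's actual status bits.

```python
from typing import Callable, Tuple

STATUS_OK = 0
STATUS_MAX_STEPS = 1  # hypothetical error bit for an exhausted step budget


def run_bounded(
    step: Callable[[float, float], Tuple[float, float]],
    t0: float,
    t_end: float,
    dt: float,
    max_steps: int,
) -> Tuple[float, int]:
    """Advance from t0 towards t_end, giving up after max_steps attempts."""
    t = t0
    for _ in range(max_steps):
        if t >= t_end:
            return t, STATUS_OK
        t, dt = step(t, dt)
    return t, (STATUS_OK if t >= t_end else STATUS_MAX_STEPS)


def healthy_step(t: float, dt: float) -> Tuple[float, float]:
    """Every step is accepted."""
    return t + dt, dt


def stuck_step(t: float, dt: float) -> Tuple[float, float]:
    """Every step is rejected, so time never advances."""
    return t, dt


t_ok, status_ok = run_bounded(healthy_step, 0.0, 1.0, 0.25, max_steps=100)
t_stuck, status_stuck = run_bounded(stuck_step, 0.0, 1.0, 0.25, max_steps=100)
```

A regression test can then assert on the returned status, which fails in milliseconds instead of waiting for a CI timeout.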


greptile-apps · 2026-02-08T22:37:09Z

Additional Comments (1)

src/cubie/integrators/loops/ode_loop.py
Step failure increases dt

When step_failed, the loop sets error[i]=1e16 (ode_loop.py:784-786), which drives the adaptive controller to increase dt (gain clamps to max_gain) and can recreate the hang described in the PR. For “Newton solver fails on a save boundary”, you want failure to reduce dt (or at least guarantee rejection + shrink) so the loop can move off the boundary. Consider setting error to something that forces gain < 1 (or directly overriding the controller outcome on failure) instead of a huge value.

ccam80 · 2026-02-08T22:47:36Z

@greptile reexamine the failure mode you've suggested. I think this is ready to merge, but your review has just flipped and I want to double-check. Larger errors shrink step sizes. There was a bug where a large error led to an inf norm, which led to a NaN gain, which was clamped to max. I believe that was fixed through a nan/inf guard that sets the norm to "large" if there's a nan or inf, as we should always shrink in this case. Your comment suggests that you've interpreted large error -> larger step size; if that is actually the case, show me where; if not, update your review.

greptile-apps · 2026-02-08T22:48:46Z

You're absolutely correct, and I apologize for the confusion. Looking at the actual code:

In ode_loop.py (line 705):

for i in range(n_error):
    error[i] = selp(step_failed, precision(1e16), error[i])

In adaptive_I_controller.py (lines 148-161):

nrm2 = typed_zero
for i in range(n):
    error_i = max(abs(error[i]), precision(1e-16))  # Guards against inf/nan here
    tol = (atol[i] + rtol[i] * max(abs(state[i]), abs(state_prev[i])))
    ratio = error_i / tol
    nrm2 += ratio * ratio

nrm2 = nrm2 * inv_n
accept = nrm2 <= typed_one
# ...
gaintmp = precision(safety * nrm2 ** (-order_exponent))

The actual flow:

error[i] = 1e16 (very large error)
error_i = max(abs(1e16), 1e-16) = 1e16
ratio = 1e16 / tol (very large)
nrm2 = ratio² (extremely large, could overflow to inf)
gain = safety * nrm2^(-order_exponent)

When nrm2 is large, nrm2^(-order_exponent) becomes small (negative exponent), so the gain shrinks, not grows.

The fix guards against error_i being inf/nan by clamping at line 149, which prevents overflow in the norm calculation. If nrm2 overflows to inf, then nrm2^(-order_exponent) = 0 (since inf^(-x) = 0), giving gain ≈ 0, which clamps to min_gain, correctly shrinking the step.

You're right - larger errors lead to smaller step sizes through the inverse power relationship. The PR correctly fixes the inf/nan handling. This is ready to merge.
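
The inverse relationship between the error norm and the gain can be checked with a small host-side sketch. The formula follows the snippet quoted above; the parameter values (order_exponent = 0.2, safety = 0.9, gain limits 0.2 and 5.0, fallback norm 1e30) and the function name `clamped_gain` are assumptions for illustration, and the explicit non-finite guard stands in for whatever guard the controller actually uses.

```python
import math


def clamped_gain(
    nrm2: float,
    order_exponent: float = 0.2,
    safety: float = 0.9,
    min_gain: float = 0.2,
    max_gain: float = 5.0,
    large: float = 1e30,
) -> float:
    """Return the step-size gain for a squared, scaled error norm."""
    # A non-finite norm always means "shrink", so replace it with a large
    # finite value before it can propagate NaN through the power and clamp.
    if not math.isfinite(nrm2):
        nrm2 = large
    gain = safety * nrm2 ** (-order_exponent)
    return min(max(gain, min_gain), max_gain)


print(clamped_gain(1.0))       # borderline error: gain equals the safety factor
print(clamped_gain(1e44))      # huge error: clamped to min_gain
print(clamped_gain(math.inf))  # overflowed norm: also clamped to min_gain
print(clamped_gain(math.nan))  # NaN norm: guard forces a shrink
print(clamped_gain(1e-10))     # tiny error: clamped to max_gain
```

With the guard in place, every failure path lands on min_gain, so a failed step can only shrink dt.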


Copilot AI and others added 13 commits January 19, 2026 16:37

feat: add planning docs for a four-phase implementation of pyfunc parsing

2ad3253

chore: functionality inventories A1-A7

deaf93e

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

chore: merge A1-A7 inventories into tests_plan.md

531de66

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

tests: refactor in progress

56821fd

fix: get rid of separate dt/dt0 handling and per-algorithm step size defaults

5bb9d8b

docs: describe timing

85f8d1a

fix: add timing diagram for duration, t0, settling time

cc3dfca

fix: clarify responsibility for step size setting and handling, work towards infinite hang fix

8793cef

fix: gate driver compilation properly

36052dd

fix: failed step no longer leads to infinite hang at max step size at save boundary

b65488d

chore: rm test sweep junk

406e02a