fix: adaptive loops no longer hang when newton solver fails on a save boundary#519

Merged
ccam80 merged 21 commits into main from dt_min_fix
Feb 8, 2026
Conversation

@ccam80
Owner

@ccam80 ccam80 commented Feb 7, 2026

When step_failed was set, ode_loop.py overwrote the error with an arbitrarily large value (1e16). That value was then divided by atol/rtol and squared, overflowing to inf, which set gain to inf; gain was then clamped to gain_max, so the step size paradoxically grew and the loop could never advance past the step boundary. t_proposal != t in this case, so the stagnation check was not triggered. This PR also contains some of the test refactor sweep, due to poor branch discipline on my part.

feat: more-sensible defaults are set when only a subset of dt_min, dt_max, dt are set
docs: add explanation of output timing, loop duration/start timing, and step timing to the user guide
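
The overflow described in the first paragraph can be reproduced on the host. The sketch below is a pure-Python stand-in, not the CUDA device code: `to_f32`, `squared_ratio_f32`, `guarded_norm` and the `LARGE_NORM` value are hypothetical names and choices made for illustration. It rounds each intermediate to float32 and shows that a 1e16 error divided by a 1e-6 tolerance survives the division but overflows once squared, then applies a nan/inf guard of the kind discussed later in this thread.

```python
import math
import struct

LARGE_NORM = 1e30  # hypothetical "large but finite" replacement value


def to_f32(x: float) -> float:
    """Round a Python float to float32, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def squared_ratio_f32(error: float, tol: float) -> float:
    """Return (error / tol)**2 with each intermediate rounded to float32."""
    ratio = to_f32(to_f32(error) / to_f32(tol))
    return to_f32(ratio * ratio)


def guarded_norm(nrm2: float) -> float:
    """Swap a nan/inf norm for a large finite one so the step can shrink."""
    return nrm2 if math.isfinite(nrm2) else LARGE_NORM


# The ratio (~1e22) fits in float32, but its square (~1e44) does not.
raw = squared_ratio_f32(1e16, 1e-6)
safe = guarded_norm(raw)
print(raw, safe)
```

The guard is applied to the norm rather than to the gain, so whatever formula the controller uses afterwards only ever sees a finite input.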

Copilot AI and others added 13 commits January 19, 2026 16:37
Remove three vestige properties from SingleIntegratorRun
(algorithm_key, compiled_loop_function, threads_per_loop) and
update BatchSolverKernel consumers to use the underlying properties
directly (algorithm, device_function, threads_per_step).

Fix set("algorithm") and set("step_controller") in
SingleIntegratorRunCore._switch_algos/_switch_controllers which
incorrectly returned sets of individual characters instead of
single-element sets.
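
The character-set bug mentioned in this commit message comes from Python's `set()` constructor iterating over its argument. A minimal illustration, independent of the cubie code:

```python
# set() iterates over its argument, so a string is split into characters.
chars = set("algorithm")

# A set literal, or set() over a one-element list, keeps the string whole.
single = {"algorithm"}
also_single = set(["algorithm"])

print(sorted(chars))
print(single == also_single)
```

Any membership check against `chars` for the full name fails, which is why the single-element form is the one wanted here.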

Add tests_plan.md (master functionality inventory) and progress.md
(running session log) for the multi-session test sweep.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- _convert_inventory.py: one-shot parser (tests_plan.md -> JSON)
- query_inventory.py: agent-facing CLI for file/tag/method queries
- _inventory.json: 100 files, 3482 items with auto-inferred tags

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Feb 7, 2026

Greptile Overview

Greptile Summary

This PR updates the CUDA IVP loop to handle step-function failures without hanging at save boundaries, and refactors step controller configuration/plumbing (including renaming dt0 to dt) plus associated tests/docs.

The main loop change introduces step_failed handling and stagnation detection inside src/cubie/integrators/loops/ode_loop.py, alongside a new integrated numerical regression test intended to cover the previously observed hang.

One merge-blocking issue remains: on step failure the loop currently forces error to a huge value (1e16), which for the adaptive controllers causes gain to clamp to max_gain and increase the timestep, counter to the intended “shrink dt and move off the boundary” behavior. That can recreate the non-advancing boundary condition the PR aims to fix.

Confidence Score: 2/5

  • This PR is not safe to merge until step-failure dt adaptation is corrected.
  • The core regression fix is undermined by forcing error=1e16 on step failure, which drives the adaptive controller to increase dt (clamped to max_gain) and can reproduce the hang/non-advancing behavior at save boundaries. Other refactors appear coherent but depend on this loop behavior being correct.
  • src/cubie/integrators/loops/ode_loop.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/cubie/integrators/loops/ode_loop.py | Adds step_failed handling by forcing error to 1e16 and stagnation detection; current 1e16 error forces dt growth and can reintroduce non-advancing behavior. |
| src/cubie/integrators/step_control/adaptive_step_controller.py | Refactors adaptive controller config to use dt property with geometric-mean fallback and moves bounds validation into controller; no direct runtime issues found. |
| src/cubie/integrators/step_control/base_step_controller.py | Introduces common BaseStepController init/update plumbing and dt property rename; appears consistent with new adaptive config behavior. |
| src/cubie/integrators/loops/ode_loop_config.py | Renames dt0 to dt in ODELoopConfig; config now provides dt via _dt field with precision-cast. |
| tests/integrated_numerical_tests/test_dt_min_hang.py | Adds regression test to ensure solve_ivp completes under f32 save drift scenario; test lacks an explicit fast-fail mechanism if hang returns. |

@greptile-apps greptile-apps Bot left a comment

6 files reviewed, 2 comments

Comment thread src/cubie/integrators/loops/ode_loop.py Outdated
Comment thread src/cubie/integrators/loops/ode_loop.py Outdated

@greptile-apps greptile-apps Bot left a comment

10 files reviewed, 4 comments

@greptile-apps

greptile-apps Bot commented Feb 7, 2026

Additional Comments (4)

src/cubie/integrators/step_control/adaptive_step_controller.py
Invalid optional dt_max

AdaptiveStepControlConfig._dt_max is typed as Optional[float] but uses validator=getype_validator(float, 0), which rejects None. This makes dt_max=None impossible even though the config/API suggests it’s allowed (and BaseAdaptiveStepController._resolve_step_params() relies on detecting missing bounds). Use the optional validator (or make the field non-optional).

    _dt_max: Optional[float] = field(
        default=1.0, validator=opt_getype_validator(float, 0)
    )

src/cubie/integrators/step_control/adaptive_step_controller.py
dt_max can be None

After the refactor, dt_max returns self.precision(self._dt_max). If _dt_max is still allowed to be None (per the Optional[float] type and doc defaults), this will crash at runtime when precision(None) is called. Either ensure _dt_max is always filled during config construction/derivation, or keep the previous fallback (dt_min * 100) in this property.


src/cubie/integrators/step_control/base_step_controller.py
Base config dt unset

BaseStepControllerConfig now owns _dt: Optional[float] = None, and adaptive controllers’ dt property is precision(self._dt). If the user provides only dt_min or only dt_max, _resolve_step_params() doesn’t set kwargs["dt"], so _dt stays None and later self.dt will call precision(None) in _ensure_sane_bounds().

This breaks the PR’s stated behavior (“sensible defaults when only a subset of dt_min, dt_max, dt are set”). You likely need derivation rules for the one-bound cases (e.g., if only dt_min provided, set dt = dt_min; if only dt_max, set dt = dt_max; plus derive the missing opposite bound), or disallow those partial inputs with a clear error.


src/cubie/integrators/loops/ode_loop.py
Negative dt_eff possible

At output-boundary adjustment, dt_eff = precision(next_event - t_prec). If next_save/next_update_summary has drifted behind t_prec (float32 accumulation vs float64 t), this becomes negative and is passed into step_function(...) (line ~751). That can trigger undefined behavior/hangs in the stepper and matches the PR’s described failure mode; the new regression test documents exactly this scenario.

Fix needs to ensure next_event >= t_prec before subtracting (e.g., clamp next_event = max(next_event, t_prec) or advance the event time when it falls behind), or explicitly treat dt_eff <= 0 as stagnation/irrecoverable to avoid looping forever.
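
A rough host-side sketch of the guard suggested above. The names `boundary_dt` and `stagnated` are hypothetical and this is not the code in ode_loop.py. The event time is clamped so it never falls behind the current time, and a non-positive effective step is reported to the caller instead of being handed to the stepper.

```python
def boundary_dt(t: float, next_event: float) -> tuple[float, bool]:
    """Return (dt_eff, stagnated) for a step that must land on next_event."""
    # Clamp the event so a drifted (stale) event time cannot yield dt_eff < 0.
    clamped_event = max(next_event, t)
    dt_eff = clamped_event - t
    # A zero-length step cannot advance time; flag it rather than loop on it.
    stagnated = dt_eff <= 0.0
    return dt_eff, stagnated


print(boundary_dt(1.0, 1.5))        # event ahead of t
print(boundary_dt(1.0, 0.9999999))  # event has drifted behind t
```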

@greptile-apps greptile-apps Bot left a comment

7 files reviewed, 1 comment

Comment thread src/cubie/integrators/loops/ode_loop.py Outdated

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 3 comments

Comment on lines 113 to 117

         @property
    -    def dt0(self) -> float:
    +    def dt(self) -> float:
             """Return the initial step size."""
    -        return self.precision(sqrt(self.dt_min * self.dt_max))
    +        return self.precision(self._dt)

dt can be None

AdaptiveStepControlConfig.dt returns self.precision(self._dt), but _dt is defined on BaseStepControllerConfig as Optional[float] and _resolve_step_params() only sets kwargs["dt"] when dt is provided or when both bounds are present. If a user provides only one of dt_min/dt_max (or neither), _dt can remain None, and this property will raise during precision-cast. This makes “more-sensible defaults when only a subset is set” fail with a runtime error.

Either ensure _resolve_step_params() always sets dt (derive from the single bound or set a default), or enforce dt as required at config validation time.

Comment on lines +246 to +254

    self._resolve_step_params(dt, kwargs)
    config = build_config(
        self._config_class,
        required={"precision": precision, "n": n},
        **kwargs,
    )
    self.setup_compile_settings(config)
    self._ensure_sane_bounds()
    self.register_buffers()

Derived dt not passed to config

BaseStepController.__init__() calls self._resolve_step_params(dt, kwargs) but then ignores the dt argument when building the config: build_config(..., **kwargs). In BaseAdaptiveStepController._resolve_step_params(), you set kwargs["dt"] = dt only when dt is provided or both bounds are present; otherwise _dt can remain None (see other comment). More importantly, if a subclass chooses to derive the effective initial dt from the positional dt argument without writing it into kwargs, the config will never receive it.

Given the new “subset of dt_min/dt_max/dt set” behavior, this is a fragile footgun: dt should be plumbed into kwargs (or into required) so _dt is always populated consistently.

@greptile-apps

greptile-apps Bot commented Feb 7, 2026

Additional Comments (1)

src/cubie/integrators/step_control/adaptive_step_controller.py
Optional dt_max no longer optional

AdaptiveStepControlConfig._dt_max is annotated as Optional[float], but its validator is getype_validator(float, 0) and its default is 1.0 (not None). More importantly, dt_max now returns self.precision(self._dt_max) (no fallback), so passing dt_max=None (which is explicitly described in the PR description as a supported partial-specification case) will fail validation / precision-cast at runtime. This breaks the “subset of dt_min/dt_max/dt set” story and can crash controller construction when users omit dt_max.

Fix by either making _dt_max truly optional (validator opt_getype_validator, default None, and a derivation path), or by removing the Optional/None support and updating docs/tests accordingly.

@greptile-apps greptile-apps Bot left a comment

6 files reviewed, 2 comments

Comment on lines +177 to +205

    def _resolve_step_params(self, dt: float, kwargs: dict) -> None:
        """Derive bounds from dt and track user-provided values.

        Parameters
        ----------
        config
            Configuration for the controller.
        dt
            Initial step size, or None if not provided.
        kwargs
            Mutable dict of keyword arguments. Modified in place.
        """
        # Track user-provided values BEFORE derivation
        if dt is not None:
            self._user_step_params["dt"] = dt
        if "dt_min" in kwargs:
            self._user_step_params["dt_min"] = kwargs["dt_min"]
        if "dt_max" in kwargs:
            self._user_step_params["dt_max"] = kwargs["dt_max"]

        # Derive missing values
        if dt is not None:
            kwargs.setdefault("dt_min", dt / 100)
            kwargs.setdefault("dt_max", dt * 100)
            kwargs["dt"] = dt
        else:
            # dt not provided; derive from bounds if both present
            dt_min = kwargs.get("dt_min")
            dt_max = kwargs.get("dt_max")
            if dt_min is not None and dt_max is not None:
                kwargs["dt"] = sqrt(dt_min * dt_max)

Logic error: when user provides no dt/dt_min/dt_max, kwargs["dt"] is never set even though _dt_min and _dt_max have defaults (1e-6, 1.0).

Sequence:

  1. User calls AdaptiveController(precision=float32) with no step params
  2. _resolve_step_params(None, {}) runs
  3. Line 202-203: dt_min/dt_max = None (checking kwargs, not config defaults)
  4. Line 204-205: condition false, kwargs["dt"] never set
  5. Config built with _dt=None (base class default)
  6. Line 116 crashes: self.precision(None)

Fix: check config defaults or always set kwargs["dt"] to geometric mean of defaults when both bounds present.

Comment thread src/cubie/integrators/loops/ode_loop.py Outdated

@greptile-apps greptile-apps Bot left a comment

20 files reviewed, 3 comments

Comment on lines 109 to +111

         def dt_max(self) -> float:
             """Return the maximum permissible step size."""
    -        value = self._dt_max
    -        if value is None:
    -            value = self._dt_min * 100
    -        return self.precision(value)
    +        return self.precision(self._dt_max)

Optional dt_max can crash

_dt_max is declared as Optional[float], but dt_max unconditionally returns self.precision(self._dt_max). If _dt_max is ever None (e.g., via config construction/plumbing), this will raise at runtime. Either make _dt_max non-optional (since you have a default) or guard here and provide a deterministic fallback.

Suggested change

    -    def dt_max(self) -> float:
    -        """Return the maximum permissible step size."""
    -        value = self._dt_max
    -        if value is None:
    -            value = self._dt_min * 100
    -        return self.precision(value)
    -        return self.precision(self._dt_max)
    +    @property
    +    def dt_max(self) -> float:
    +        """Return the maximum permissible step size."""
    +        assert self._dt_max is not None
    +        return self.precision(self._dt_max)

Comment on lines +114 to +118

    def dt(self) -> float:
        """Return the initial step size.

        When the user has not provided an explicit dt, returns the
        geometric mean of dt_min and dt_max.

Geometric-mean fallback unsafe

When _dt is None, dt falls back to sqrt(self._dt_min * self._dt_max) but doesn’t ensure _dt_max is non-None. If config plumbing allows _dt_max=None, this path raises TypeError (and settings_dict also triggers it). Consider ensuring _dt_min/_dt_max are always concrete floats by construction, or explicitly handling the None case here.

Suggested change

         def dt(self) -> float:
             """Return the initial step size.
             When the user has not provided an explicit dt, returns the
             geometric mean of dt_min and dt_max.
    +        if self._dt is not None:
    +            return self.precision(self._dt)
    +        assert self._dt_min is not None and self._dt_max is not None
    +        return self.precision(sqrt(self._dt_min * self._dt_max))

Owner Author

@greptile config plumbing doesn't allow this. dt_min and dt_max have defaults in the attrs class; if a user deliberately set one to None, it might break (if the config class didn't use validators), but if they were to do that I would presume that it should break.

Comment on lines +193 to +209

    # Track user-provided values BEFORE derivation
    if dt is not None:
        self._user_step_params["dt"] = dt
    if "dt_min" in kwargs:
        self._user_step_params["dt_min"] = kwargs["dt_min"]
    if "dt_max" in kwargs:
        self._user_step_params["dt_max"] = kwargs["dt_max"]

    # Derive missing values
    if dt is not None:
        kwargs.setdefault("dt_min", dt / 100)
        kwargs.setdefault("dt_max", dt * 100)
        kwargs["dt"] = dt
    else:
        # dt not provided; derive from bounds if both present
        dt_min = kwargs.get("dt_min")
        dt_max = kwargs.get("dt_max")

Single-bound dt remains unset

_resolve_step_params() only sets kwargs["dt"] when dt is provided or when both dt_min and dt_max are provided. If a user supplies only one bound (or neither), config can be built with _dt=None, which then relies on the dt property’s geometric-mean fallback (and can still crash if bounds are not concrete). To match the PR’s “more-sensible defaults” intent, _resolve_step_params() should always populate dt deterministically from the available information (single bound + default other bound, or config defaults).

Suggested change

         # Track user-provided values BEFORE derivation
         if dt is not None:
             self._user_step_params["dt"] = dt
         if "dt_min" in kwargs:
             self._user_step_params["dt_min"] = kwargs["dt_min"]
         if "dt_max" in kwargs:
             self._user_step_params["dt_max"] = kwargs["dt_max"]
         # Derive missing values
         if dt is not None:
             kwargs.setdefault("dt_min", dt / 100)
             kwargs.setdefault("dt_max", dt * 100)
             kwargs["dt"] = dt
    -    else:
    -        # dt not provided; derive from bounds if both present
    -        dt_min = kwargs.get("dt_min")
    -        dt_max = kwargs.get("dt_max")
    +    else:
    +        dt_min = kwargs.get("dt_min")
    +        dt_max = kwargs.get("dt_max")
    +        if dt_min is not None and dt_max is not None:
    +            kwargs["dt"] = sqrt(dt_min * dt_max)
    +        elif dt_min is not None:
    +            kwargs["dt"] = dt_min
    +        elif dt_max is not None:
    +            kwargs["dt"] = dt_max

Owner Author

@greptile re-examine this. dt is returned from compile settings via a property which returns the geometric mean of dt_min and dt_max if _dt is None. That achieves the behaviour set out in docs/timing.rst: if not set by the user, the starting dt is the geometric mean of max and min.

greptile-apps Bot replied:

You're correct. Let me trace through the logic:

  1. _resolve_step_params() populates kwargs["dt"] only when dt is provided OR both bounds are provided
  2. kwargs is used to build the config, setting config._dt
  3. The dt property returns self._dt if self._dt is not None else jnp.sqrt(self.dt_min * self.dt_max)

So when only one bound is provided:

  • _resolve_step_params() derives the missing bound from the provided one (e.g., dt_min → dt_max = dt_min * 100)
  • kwargs["dt"] remains unset, so config._dt = None
  • The property's fallback computes the geometric mean of the now-complete bounds

This correctly implements the documented behavior. The original concern about "unset dt" was misplaced - the property's fallback handles it cleanly, and both bounds are always concrete by the time config is built.

My initial comment was wrong. The implementation already achieves the intended "sensible defaults" behavior through the combination of bound derivation in _resolve_step_params() and the geometric mean fallback in the dt property.
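
The resolution rules traced in this reply can be summarised in a standalone sketch. This is an illustration under stated assumptions, not the cubie implementation: `resolve_step_params` is a hypothetical free function, the defaults are the values (1e-6, 1.0) quoted earlier in the review, and the `dt_max`-only rule (`dt_min = dt_max / 100`) is assumed by symmetry because the thread only spells out the `dt_min`-only case.

```python
from math import isclose, sqrt

DEFAULT_DT_MIN = 1e-6  # defaults quoted earlier in this review
DEFAULT_DT_MAX = 1.0


def resolve_step_params(dt=None, dt_min=None, dt_max=None):
    """Return (dt, dt_min, dt_max) with every value concrete."""
    if dt is not None:
        # An explicit dt wins; missing bounds sit two decades either side.
        dt_min = dt / 100 if dt_min is None else dt_min
        dt_max = dt * 100 if dt_max is None else dt_max
        return dt, dt_min, dt_max
    if dt_min is not None and dt_max is None:
        dt_max = dt_min * 100
    elif dt_max is not None and dt_min is None:
        dt_min = dt_max / 100  # assumption: mirror of the dt_min-only rule
    elif dt_min is None and dt_max is None:
        dt_min, dt_max = DEFAULT_DT_MIN, DEFAULT_DT_MAX
    # Fallback attributed to the dt property: geometric mean of the bounds.
    return sqrt(dt_min * dt_max), dt_min, dt_max


dt, lo, hi = resolve_step_params(dt_min=1e-4)
print(dt, lo, hi)
```

With only `dt_min=1e-4` supplied, the upper bound becomes 100 times larger and the starting step lands at the geometric mean, one decade above the lower bound.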


@greptile-apps greptile-apps Bot left a comment

6 files reviewed, 1 comment

Comment thread src/cubie/integrators/loops/ode_loop.py Outdated

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 2 comments

Comment on lines +1 to +3
"""Regression tests for f32 save-event drift hang.

When ``save_every`` is not exactly representable in float32

Hanging regression not asserted

This test only asserts n_saves >= 80 after solve_ivp(...) returns. If the regression is a true hang, CI will just time out rather than producing an actionable assertion. Consider adding a deterministic termination condition that fails fast (e.g., limiting max loop iterations / max rejected steps and asserting the solver returns an error bit instead of hanging), so the test fails quickly and clearly when the bug reappears.
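
One way to picture the fast-fail idea is a toy host-side loop with an attempt cap. Everything here is hypothetical (`integrate`, `STATUS_MAX_STEPS`, the stepper signature) and far simpler than the CUDA loop; it only shows how a capped loop turns a would-be hang into a status code that a test can assert on.

```python
STATUS_OK = 0
STATUS_MAX_STEPS = 1  # hypothetical error code, not a cubie status bit


def integrate(step_fn, t_end, dt, max_steps=1000):
    """Advance t from 0 to t_end, giving up after max_steps attempts.

    step_fn(t, dt) returns (accepted, next_dt).
    """
    t = 0.0
    for _ in range(max_steps):
        if t >= t_end:
            return t, STATUS_OK
        accepted, next_dt = step_fn(t, dt)
        if accepted:
            t += dt
        dt = next_dt
    # The attempt budget is spent; report whether the end was reached.
    status = STATUS_OK if t >= t_end else STATUS_MAX_STEPS
    return t, status


def always_reject(t, dt):
    """Toy stepper modelling a solver stuck on a boundary."""
    return False, dt


t_final, status = integrate(always_reject, t_end=1.0, dt=0.25)
assert status == STATUS_MAX_STEPS  # fails fast instead of hanging
```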

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@greptile-apps

greptile-apps Bot commented Feb 8, 2026

Additional Comments (1)

src/cubie/integrators/loops/ode_loop.py
Step failure increases dt

When step_failed, the loop sets error[i]=1e16 (ode_loop.py:784-786), which drives the adaptive controller to increase dt (gain clamps to max_gain) and can recreate the hang described in the PR. For “Newton solver fails on a save boundary”, you want failure to reduce dt (or at least guarantee rejection + shrink) so the loop can move off the boundary. Consider setting error to something that forces gain < 1 (or directly overriding the controller outcome on failure) instead of a huge value.

@ccam80
Owner Author

ccam80 commented Feb 8, 2026

@greptile re-examine the failure mode you've suggested. I think this is ready to merge, but your review has just flipped and I want to double-check. Larger errors shrink step sizes. There was a bug where a large error led to an inf norm, which led to a NaN gain, which was clamped to max. I believe that was fixed through a nan/inf guard which just sets the norm to "large" if there's a nan or inf, as we should always shrink in this case. Your comment suggests that you've interpreted large error -> larger step size; if that is actually the case, show me where; if not, update your review.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Feb 8, 2026

You're absolutely correct, and I apologize for the confusion. Looking at the actual code:

In ode_loop.py (line 705):

for i in range(n_error):
    error[i] = selp(step_failed, precision(1e16), error[i])

In adaptive_I_controller.py (lines 148-161):

nrm2 = typed_zero
for i in range(n):
    error_i = max(abs(error[i]), precision(1e-16))  # Guards against inf/nan here
    tol = (atol[i] + rtol[i] * max(abs(state[i]), abs(state_prev[i])))
    ratio = error_i / tol
    nrm2 += ratio * ratio

nrm2 = nrm2 * inv_n
accept = nrm2 <= typed_one
# ...
gaintmp = precision(safety * nrm2 ** (-order_exponent))

The actual flow:

  1. error[i] = 1e16 (very large error)
  2. error_i = max(abs(1e16), 1e-16) = 1e16
  3. ratio = 1e16 / tol (very large)
  4. nrm2 = ratio² (extremely large, could overflow to inf)
  5. gain = safety * nrm2^(-order_exponent)

When nrm2 is large, nrm2^(-order_exponent) becomes small (negative exponent), so the gain shrinks, not grows.

The fix guards against error_i being inf/nan by clamping at line 149, which prevents overflow in the norm calculation. If nrm2 overflows to inf, then nrm2^(-order_exponent) = 0 (since inf^(-x) = 0), giving gain ≈ 0, which clamps to min_gain, correctly shrinking the step.

You're right - larger errors lead to smaller step sizes through the inverse power relationship. The PR correctly fixes the inf/nan handling. This is ready to merge.
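
The inverse relationship can be checked numerically with a small host-side sketch. The function name `clamped_gain` and the parameter values (`safety=0.9`, `order_exponent=0.2`, gain limits of 0.2 and 5.0) are assumptions for illustration, not the controller's actual settings, and the sketch assumes `nrm2 > 0`.

```python
import math


def clamped_gain(nrm2, order_exponent=0.2, safety=0.9,
                 min_gain=0.2, max_gain=5.0):
    """Step-size gain for a squared error norm nrm2 > 0, clamped to limits."""
    # Negative exponent: the larger the error norm, the smaller the gain.
    gain = safety * nrm2 ** (-order_exponent)
    return min(max(gain, min_gain), max_gain)


for nrm2 in (1e-4, 1.0, 4.0, 1e4, math.inf):
    print(nrm2, clamped_gain(nrm2))
```

In Python, `math.inf ** (-0.2)` evaluates to 0.0, so an infinite norm falls through to `min_gain` in this sketch; a NaN norm is a different case that this simple clamp does not handle.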

@ccam80
Owner Author

ccam80 commented Feb 8, 2026

When step_failed was set, ode_loop.py overwrote the error with an arbitrarily large value (1e16). That value was then divided by atol/rtol and squared, overflowing to inf, which set gain to inf; gain was then clamped to gain_max, so the step size paradoxically grew and the loop could never advance past the step boundary. t_proposal != t in this case, so the stagnation check was not triggered. This PR also contains some of the test refactor sweep, due to poor branch discipline on my part.

feat: more-sensible defaults are set when only a subset of dt_min, dt_max, dt are set
docs: add explanation of output timing, loop duration/start timing, and step timing to the user guide

@ccam80 ccam80 merged commit f552dda into main Feb 8, 2026
5 checks passed
@ccam80 ccam80 deleted the dt_min_fix branch February 8, 2026 23:03
2 participants