fix: adaptive loops no longer hang when newton solver fails on a save boundary#519
Conversation
Remove three vestige properties from SingleIntegratorRun
(algorithm_key, compiled_loop_function, threads_per_loop) and
update BatchSolverKernel consumers to use the underlying properties
directly (algorithm, device_function, threads_per_step).
Fix set("algorithm") and set("step_controller") in
SingleIntegratorRunCore._switch_algos/_switch_controllers which
incorrectly returned sets of individual characters instead of
single-element sets.
Add tests_plan.md (master functionality inventory) and progress.md
(running session log) for the multi-session test sweep.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- _convert_inventory.py: one-shot parser (tests_plan.md -> JSON) - query_inventory.py: agent-facing CLI for file/tag/method queries - _inventory.json: 100 files, 3482 items with auto-inferred tags Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…towards infinite hang fix
Greptile OverviewGreptile SummaryThis PR updates the CUDA IVP loop to handle step-function failures without hanging at save boundaries, and refactors step controller configuration/plumbing (including renaming The main loop change introduces One merge-blocking issue remains: on step failure the loop currently forces Confidence Score: 2/5
Important Files Changed
|
Additional Comments (4)
After the refactor,
This breaks the PR’s stated behavior (“sensible defaults when only a subset of dt_min, dt_max, dt are set”). You likely need derivation rules for the one-bound cases (e.g., if only
At output-boundary adjustment, Fix needs to ensure |
| @property | ||
| def dt0(self) -> float: | ||
| def dt(self) -> float: | ||
| """Return the initial step size.""" | ||
| return self.precision(sqrt(self.dt_min * self.dt_max)) | ||
| return self.precision(self._dt) | ||
|
|
There was a problem hiding this comment.
dt can be None
AdaptiveStepControlConfig.dt returns self.precision(self._dt), but _dt is defined on BaseStepControllerConfig as Optional[float] and _resolve_step_params() only sets kwargs["dt"] when dt is provided or when both bounds are present. If a user provides only one of dt_min/dt_max (or neither), _dt can remain None, and this property will raise during precision-cast. This makes “more-sensible defaults when only a subset is set” fail with a runtime error.
Either ensure _resolve_step_params() always sets dt (derive from the single bound or set a default), or enforce dt as required at config validation time.
| self._resolve_step_params(dt, kwargs) | ||
| config = build_config( | ||
| self._config_class, | ||
| required={"precision": precision, "n": n}, | ||
| **kwargs, | ||
| ) | ||
| self.setup_compile_settings(config) | ||
| self._ensure_sane_bounds() | ||
| self.register_buffers() |
There was a problem hiding this comment.
Derived dt not passed to config
BaseStepController.__init__() calls self._resolve_step_params(dt, kwargs) but then ignores the dt argument when building the config: build_config(..., **kwargs). In BaseAdaptiveStepController._resolve_step_params(), you set kwargs["dt"] = dt only when dt is provided or both bounds are present; otherwise _dt can remain None (see other comment). More importantly, if a subclass chooses to derive the effective initial dt from the positional dt argument without writing it into kwargs, the config will never receive it.
Given the new “subset of dt_min/dt_max/dt set” behavior, this is a fragile footgun: dt should be plumbed into kwargs (or into required) so _dt is always populated consistently.
Additional Comments (1)
Fix by either making |
| def _resolve_step_params(self, dt: float, kwargs: dict) -> None: | ||
| """Derive bounds from dt and track user-provided values. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| config | ||
| Configuration for the controller. | ||
| dt | ||
| Initial step size, or None if not provided. | ||
| kwargs | ||
| Mutable dict of keyword arguments. Modified in place. | ||
| """ | ||
| # Track user-provided values BEFORE derivation | ||
| if dt is not None: | ||
| self._user_step_params["dt"] = dt | ||
| if "dt_min" in kwargs: | ||
| self._user_step_params["dt_min"] = kwargs["dt_min"] | ||
| if "dt_max" in kwargs: | ||
| self._user_step_params["dt_max"] = kwargs["dt_max"] | ||
|
|
||
| # Derive missing values | ||
| if dt is not None: | ||
| kwargs.setdefault("dt_min", dt / 100) | ||
| kwargs.setdefault("dt_max", dt * 100) | ||
| kwargs["dt"] = dt | ||
| else: | ||
| # dt not provided; derive from bounds if both present | ||
| dt_min = kwargs.get("dt_min") | ||
| dt_max = kwargs.get("dt_max") | ||
| if dt_min is not None and dt_max is not None: | ||
| kwargs["dt"] = sqrt(dt_min * dt_max) |
There was a problem hiding this comment.
Logic error: when user provides no dt/dt_min/dt_max, kwargs["dt"] is never set even though _dt_min and _dt_max have defaults (1e-6, 1.0).
Sequence:
- User calls
AdaptiveController(precision=float32)with no step params _resolve_step_params(None, {})runs- Line 202-203:
dt_min/dt_max = None(checking kwargs, not config defaults) - Line 204-205: condition false,
kwargs["dt"]never set - Config built with
_dt=None(base class default) - Line 116 crashes:
self.precision(None)
Fix: check config defaults or always set kwargs["dt"] to geometric mean of defaults when both bounds present.
| def dt_max(self) -> float: | ||
| """Return the maximum permissible step size.""" | ||
| value = self._dt_max | ||
| if value is None: | ||
| value = self._dt_min * 100 | ||
| return self.precision(value) | ||
| return self.precision(self._dt_max) |
There was a problem hiding this comment.
Optional dt_max can crash
_dt_max is declared as Optional[float], but dt_max unconditionally returns self.precision(self._dt_max). If _dt_max is ever None (e.g., via config construction/plumbing), this will raise at runtime. Either make _dt_max non-optional (since you have a default) or guard here and provide a deterministic fallback.
| def dt_max(self) -> float: | |
| """Return the maximum permissible step size.""" | |
| value = self._dt_max | |
| if value is None: | |
| value = self._dt_min * 100 | |
| return self.precision(value) | |
| return self.precision(self._dt_max) | |
| @property | |
| def dt_max(self) -> float: | |
| """Return the maximum permissible step size.""" | |
| assert self._dt_max is not None | |
| return self.precision(self._dt_max) |
| def dt(self) -> float: | ||
| """Return the initial step size. | ||
|
|
||
| When the user has not provided an explicit dt, returns the | ||
| geometric mean of dt_min and dt_max. |
There was a problem hiding this comment.
Geometric-mean fallback unsafe
When _dt is None, dt falls back to sqrt(self._dt_min * self._dt_max) but doesn’t ensure _dt_max is non-None. If config plumbing allows _dt_max=None, this path raises TypeError (and settings_dict also triggers it). Consider ensuring _dt_min/_dt_max are always concrete floats by construction, or explicitly handling the None case here.
| def dt(self) -> float: | |
| """Return the initial step size. | |
| When the user has not provided an explicit dt, returns the | |
| geometric mean of dt_min and dt_max. | |
| if self._dt is not None: | |
| return self.precision(self._dt) | |
| assert self._dt_min is not None and self._dt_max is not None | |
| return self.precision(sqrt(self._dt_min * self._dt_max)) |
There was a problem hiding this comment.
@greptile
config plumbing doesn't allow this. dt_min and dt_max have defaults i nthe attrs class; if a user set it to None deliberately, it might break (if the config class didn't use validators), but if they were to do that I would presume that it should break
| # Track user-provided values BEFORE derivation | ||
| if dt is not None: | ||
| self._user_step_params["dt"] = dt | ||
| if "dt_min" in kwargs: | ||
| self._user_step_params["dt_min"] = kwargs["dt_min"] | ||
| if "dt_max" in kwargs: | ||
| self._user_step_params["dt_max"] = kwargs["dt_max"] | ||
|
|
||
| # Derive missing values | ||
| if dt is not None: | ||
| kwargs.setdefault("dt_min", dt / 100) | ||
| kwargs.setdefault("dt_max", dt * 100) | ||
| kwargs["dt"] = dt | ||
| else: | ||
| # dt not provided; derive from bounds if both present | ||
| dt_min = kwargs.get("dt_min") | ||
| dt_max = kwargs.get("dt_max") |
There was a problem hiding this comment.
Single-bound dt remains unset
_resolve_step_params() only sets kwargs["dt"] when dt is provided or when both dt_min and dt_max are provided. If a user supplies only one bound (or neither), config can be built with _dt=None, which then relies on the dt property’s geometric-mean fallback (and can still crash if bounds are not concrete). To match the PR’s “more-sensible defaults” intent, _resolve_step_params() should always populate dt deterministically from the available information (single bound + default other bound, or config defaults).
| # Track user-provided values BEFORE derivation | |
| if dt is not None: | |
| self._user_step_params["dt"] = dt | |
| if "dt_min" in kwargs: | |
| self._user_step_params["dt_min"] = kwargs["dt_min"] | |
| if "dt_max" in kwargs: | |
| self._user_step_params["dt_max"] = kwargs["dt_max"] | |
| # Derive missing values | |
| if dt is not None: | |
| kwargs.setdefault("dt_min", dt / 100) | |
| kwargs.setdefault("dt_max", dt * 100) | |
| kwargs["dt"] = dt | |
| else: | |
| # dt not provided; derive from bounds if both present | |
| dt_min = kwargs.get("dt_min") | |
| dt_max = kwargs.get("dt_max") | |
| else: | |
| dt_min = kwargs.get("dt_min") | |
| dt_max = kwargs.get("dt_max") | |
| if dt_min is not None and dt_max is not None: | |
| kwargs["dt"] = sqrt(dt_min * dt_max) | |
| elif dt_min is not None: | |
| kwargs["dt"] = dt_min | |
| elif dt_max is not None: | |
| kwargs["dt"] = dt_max |
There was a problem hiding this comment.
@greptile reexmaine this. dt is returned from compile settings via a property which returns the geometric mean of dt_min and dt_max if _dt is None. That achieves the behaviour set out in docs/timing.rst - if not set by the user, starting dt is the geometric mean of max and min.
There was a problem hiding this comment.
You're correct. Let me trace through the logic:
_resolve_step_params()populateskwargs["dt"]only whendtis provided OR both bounds are providedkwargsis used to build the config, settingconfig._dt- The
dtproperty returnsself._dt if self._dt is not None else jnp.sqrt(self.dt_min * self.dt_max)
So when only one bound is provided:
_resolve_step_params()derives the missing bound from the provided one (e.g.,dt_min→dt_max = dt_min * 100)kwargs["dt"]remains unset, soconfig._dt = None- The property's fallback computes the geometric mean of the now-complete bounds
This correctly implements the documented behavior. The original concern about "unset dt" was misplaced - the property's fallback handles it cleanly, and both bounds are always concrete by the time config is built.
My initial comment was wrong. The implementation already achieves the intended "sensible defaults" behavior through the combination of bound derivation in _resolve_step_params() and the geometric mean fallback in the dt property.
| """Regression tests for f32 save-event drift hang. | ||
|
|
||
| When ``save_every`` is not exactly representable in float32 |
There was a problem hiding this comment.
Hanging regression not asserted
This test only asserts n_saves >= 80 after solve_ivp(...) returns. If the regression is a true hang, CI will just time out rather than producing an actionable assertion. Consider adding a deterministic termination condition that fails fast (e.g., limiting max loop iterations / max rejected steps and asserting the solver returns an error bit instead of hanging), so the test fails quickly and clearly when the bug reappears.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Additional Comments (1)
When |
|
@greptile reexamine the failure mode you've suggested, I think this is ready to merge but your review has just flipped and I want to double check. Larger errors shrink step sizes. There was a bug where a large error led to an inf norm, which led to a NaN gain, which was clamped to max. I believe that was fixed through a nan/inf guard which just sets the norm to "large" if there's a nan or inf, as we should always shrink in this case. Your comment suggests taht you've interpreted large error -> larger step size; if that is actually the case, show me where, if not, update your review |
|
You're absolutely correct, and I apologize for the confusion. Looking at the actual code: In for i in range(n_error):
error[i] = selp(step_failed, precision(1e16), error[i])In nrm2 = typed_zero
for i in range(n):
error_i = max(abs(error[i]), precision(1e-16)) # Guards against inf/nan here
tol = (atol[i] + rtol[i] * max(abs(state[i]), abs(state_prev[i])))
ratio = error_i / tol
nrm2 += ratio * ratio
nrm2 = nrm2 * inv_n
accept = nrm2 <= typed_one
# ...
gaintmp = precision(safety * nrm2 ** (-order_exponent))The actual flow:
When The fix guards against You're right - larger errors lead to smaller step sizes through the inverse power relationship. The PR correctly fixes the inf/nan handling. This is ready to merge. |
|
step_failed was leading to ode_loop.py setting error at an arbitrarily large (1e16) error - this then got divided by atol/rtol and squared, overflowing into inf, which set gain to inf, which was clamped to gain_max, so the step size grew paradoxically and the loop could never advance past the step boundary. t_proposal != t in this case, so the stagnation check was not triggered. PR Also contains some of the test refactor sweep due to poor branch discipline on my part. feat: more-sensible defaults are set when only a subset of dt_min, dt_max, dt are set |
step_failed was leading to ode_loop.py setting error at an arbitrarily large (1e16) error - this then got divided by atol/rtol and squared, overflowing into inf, which set gain to inf, which was clamped to gain_max, so the step size grew paradoxically and the loop could never advance past the step boundary. t_proposal != t in this case, so the stagnation check was not triggered. PR Also contains some of the test refactor sweep due to poor branch discipline on my part.
feat: more-sensible defaults are set when only a subset of dt_min, dt_max, dt are set
docs: add explanation of output timing, loop duration/start timing, and step timing to the user guide