Fix master CI: expv zero-input NaN, JET-on-1.12 QA, GPU-in-All #229

Draft
ChrisRackauckas-Claude wants to merge 1 commit into
SciML:master from
ChrisRackauckas-Claude:fix-master-ci-1.12-nan-jet-gpu

@ChrisRackauckas-Claude

Contributor

Fixes three independent failures on the master grouped-tests CI.

1. Core: NaN == 0.0 at basictests.jl:307 (zero-input expv)

The real `expv!(w, t::Real, Ks)` method was missing the `iszero(beta)` guard that the complex method already has. For a zero input vector, `firststep!` skips initializing the Krylov basis `V` (it only fills `V[:,1]` when `beta != 0`), so the final `lmul!(beta, mul!(w, @view(V[:,1:m]), expHe))` computes `0 * <uninitialized memory>`, which is NaN whenever `V` happens to hold NaN or Inf garbage. That explains why the failure was flaky: it depends on heap contents, so it was green on some OS/runs and NaN on others. Added the same early-return guard so `expv` of a zero vector is exactly zero.

Verified locally: full GROUP=Core Pkg.test passes on Julia 1.10 and 1.12 (it reliably produced NaN on 1.10 before).
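
A minimal sketch of the failure mode and the guard, for illustration only. The functions `unguarded!` and `guarded!` are hypothetical stand-ins, not the package's `expv!`; `V` is filled with NaN to mimic uninitialized memory deterministically.

```julia
# Toy stand-in for the last step of expv!: w = beta * (V * expHe).
# Hypothetical helper names; this is not the ExponentialUtilities code.
function unguarded!(w, beta, V, expHe)
    w .= beta .* (V * expHe)      # 0.0 * NaN is NaN under IEEE 754
    return w
end

function guarded!(w, beta, V, expHe)
    if iszero(beta)
        fill!(w, zero(eltype(w))) # zero input: result is exactly zero
        return w
    end
    w .= beta .* (V * expHe)
    return w
end

V = fill(NaN, 3, 2)               # simulates a never-initialized Krylov basis
expHe = [1.0, 0.0]
beta = 0.0                        # norm of a zero input vector

w_bad = unguarded!(zeros(3), beta, V, expHe)
w_ok = guarded!(zeros(3), beta, V, expHe)
```

With the guard in place the zero-input result no longer depends on whatever `V` contains.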

2. QA: 6 JET failures on the 1 (= Julia 1.12) channel

`lts` (1.10) was green; only `1` (1.12) failed. On 1.12 JET traces into LinearAlgebra/Base internals (`norm(::Vector)` -> `norm_recursive_check` -> `iterate(::Nothing)`, and the broadcast `unalias`/`copyto_unaliased!` path over `Adjoint{T, Union{}}`) and reports abstract-interpretation artifacts there that this package does not control. Scoped the QA `report_call`s to `target_modules = (ExponentialUtilities,)` (the standard JET-as-package-QA configuration), which keeps full coverage of this package's own code.

That scoping surfaced two genuine `may be undefined` findings, which are fixed here so the scoped analysis is clean (not silenced):

  • `si` in `exponential!` (exp_baseexp.jl): conditionally assigned inside `if s > 0`, used inside a separate `if s > 0`; now initialized to 0 unconditionally.
  • `order` / `kest` in `kiops` (kiops.jl): carried across loop iterations via the `orderold`/`kestold` "reuse" flags but only conditionally assigned; now seeded with their first-iteration defaults.
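
The `si` pattern can be sketched as follows. This is a simplified illustration with hypothetical function names, not the body of `exponential!`: the "before" form is correct at runtime because both branches share the same condition, but a static analyzer cannot prove the two conditions are linked and so reports the variable as possibly undefined.

```julia
# Before: `si` is assigned only when s > 0, then read under a second,
# separate `if s > 0`. Correct at runtime, but flagged by static analysis.
function steps_before(s::Int)
    if s > 0
        si = s
    end
    total = 0
    if s > 0
        total += si
    end
    return total
end

# After: `si` is initialized unconditionally, so every read is provably defined.
function steps_after(s::Int)
    si = 0
    if s > 0
        si = s
    end
    total = 0
    if s > 0
        total += si
    end
    return total
end
```

Both versions return the same values; only the second makes the definedness of `si` visible without reasoning about the branch conditions.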

Verified locally: QA passes 17/17 on Julia 1.10 and 1.12.

3. Core (windows): "CUDA driver not functional"

On Windows the Core job runs the run_tests "All" aggregate, which pulled in the GPU group, and using CUDA errored on the non-GPU runner. Marked the GPU group in_all = false so it only ever runs under an explicit GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All now runs only Core/basictests.jl, never GPU/gputests.jl.
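
The effect of the flag can be sketched with a toy group table. Everything below (`GROUPS`, `selected_files`, the field layout) is a hypothetical illustration of the selection rule described above, not the actual test harness API.

```julia
# Hypothetical group table; the real harness may store this differently.
const GROUPS = [
    (name = "Core", files = ["Core/basictests.jl"], in_all = true),
    (name = "GPU",  files = ["GPU/gputests.jl"],    in_all = false),
]

# A group runs when it is requested by name, or when "All" is requested
# and the group has opted in via `in_all`.
function selected_files(group::AbstractString)
    files = String[]
    for g in GROUPS
        if g.name == group || (group == "All" && g.in_all)
            append!(files, g.files)
        end
    end
    return files
end

all_files = selected_files("All")
gpu_files = selected_files("GPU")
```

Under this rule the GPU tests are reachable only through an explicit `GROUP=GPU`, which matches the behavior verified above.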

Not addressed (reported separately)

  • Core (julia pre, macos-latest): Static Arrays tolerance failure at basictests.jl:265 (expv(t,A,b) ≈ exp(t*A)*b). On linux Julia 1.13-rc1 the worst relative error is 1.25e-15; the macOS-pre failure shows ~1e-7. This is a macOS/1.13-rc-specific accuracy difference I could not reproduce or correctly fix on linux, and I will not loosen the tolerance without being able to prove the macOS deviation is benign.
  • GPU (self-hosted): requires CUDA hardware (infra), out of scope here.

Please ignore until reviewed by @ChrisRackauckas.

Three independent master-CI failures on the grouped-tests workflow:

1. Core (NaN == 0.0 at basictests.jl:307, flaky across OS/version).
   The real `expv!(w, t::Real, Ks)` method lacked the `iszero(beta)`
   guard that the complex method already has. For a zero input vector
   `firststep!` skips initializing the Krylov basis V (it only fills it
   when beta != 0), so `lmul!(beta, mul!(w, V, expHe))` computes
   `0 * <uninitialized memory>`, which is NaN whenever V holds garbage.
   Add the same early-return guard, making expv of a zero vector exactly
   zero (matching the complex method). Verified: full Core suite now
   passes on Julia 1.10 and 1.12 (was reliably NaN on 1.10).

2. QA (6 JET failures on the Julia "1" = 1.12 channel; lts/1.10 was
   green). On 1.12 JET traces into LinearAlgebra/Base internals
   (`norm(::Vector)` -> `norm_recursive_check` -> `iterate(::Nothing)`,
   and the broadcast `unalias`/`copyto_unaliased!` path over
   `Adjoint{T, Union{}}`) and reports artifacts there that this package
   does not control. Scope the QA `report_call`s to
   `target_modules = (ExponentialUtilities,)` — the standard JET-as-QA
   configuration — which keeps full coverage of this package's own code.
   That scoping surfaced two genuine `may be undefined` findings, fixed
   here so the scoped analysis is clean: `si` in `exponential!` and
   `order`/`kest` in `kiops` are now unconditionally initialized before
   use. Verified: QA passes 17/17 on Julia 1.10 and 1.12.

3. Core (windows, all versions: "CUDA driver not functional"). On
   Windows the Core job runs the run_tests "All" aggregate, which pulled
   in the GPU group and `using CUDA` errored on the non-GPU runner. Mark
   the GPU group `in_all = false` so it only runs under an explicit
   GROUP=GPU on the self-hosted CUDA runner. Verified locally: GROUP=All
   now runs only Core/basictests.jl, never GPU/gputests.jl.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>