Fix flaky Enzyme test_forward/test_reverse tolerance (RNG-dependent vs fastpower approximation) #58

Draft
ChrisRackauckas-Claude wants to merge 1 commit into SciML:main from ChrisRackauckas-Claude:fix-enzyme-tolerance-rng-flake

@ChrisRackauckas-Claude

Contributor

Please ignore until reviewed by @ChrisRackauckas.

Problem

tests / Enzyme (julia 1) on main went red on the v1.3.2 run (was green 8 days earlier, with identical test source). The failure:

test_forward: fastpower with return activity Duplicated on (::Float64, Duplicated), (::Float64, Const): Test Failed
  Expression: isapprox(x, y; kwargs...)
   Evaluated: isapprox(0.155, 0.15524589105604497; atol = 0.0001, rtol = 0.001)

Root cause

FastPower's Enzyme @easy_rule returns the exact ^ derivative (y*fastpower(x,y-1), Ω*log(x)). EnzymeTestUtils' test_forward/test_reverse compare that rule against finite differences of the deliberately-approximate fastpower primal. So the measured gap is exactly fastpower's own primal approximation error (~1e-3 relative — the same envelope asserted in test/fast_pow_tests.jl), which sat right on top of the old atol=1e-4, rtol=1e-3.
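To make the "sat right on top of the old tolerance" point concrete, here is a minimal sketch that redoes the failing comparison by hand. It uses only the two numbers printed in the CI log and Base `isapprox`; the variable names are illustrative and not taken from the test suite.

```julia
# For finite scalars, isapprox(a, b; atol, rtol) checks
#   abs(a - b) <= max(atol, rtol * max(abs(a), abs(b)))
a = 0.155                  # first value printed in the CI failure
b = 0.15524589105604497    # second value printed in the CI failure

gap = abs(a - b)
scale = max(abs(a), abs(b))

old_bound = max(1.0e-4, 1.0e-3 * scale)   # old atol=1e-4, rtol=1e-3
new_bound = max(1.0e-3, 1.0e-2 * scale)   # new atol=1e-3, rtol=1e-2

old_ok = isapprox(a, b; atol = 1.0e-4, rtol = 1.0e-3)
new_ok = isapprox(a, b; atol = 1.0e-3, rtol = 1.0e-2)

println("gap = ", gap)
println("old bound = ", old_bound, ", passes: ", old_ok)
println("new bound = ", new_bound, ", passes: ", new_ok)
```

The gap is larger than the old bound and smaller than the new one, so the same pair of values fails at the old tolerance and passes at the new tolerance.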

Whether the lane passed depended on the random perturbation test_forward drew from the global RNG. An analytic sweep over the tangent grid the test samples (tangents in -9:0.01:9, central FD-5, at x=1.0, y=0.5) shows:

| config | worst abs gap | worst rel gap | fail @ old tol (1e-4, 1e-3) | fail @ new tol (1e-3, 1e-2) |
|---|---|---|---|---|
| Tx=Dup, Ty=Const | 1.17e-3 | 2.4% | 144080/3241800 (4.44%) | 0 |
| Tx=Dup, Ty=Dup | 2.04e-3 | | 111058/3243600 (3.42%) | 0 |
| Tx=Const, Ty=Dup | 6e-14 | | 0 | 0 |

The CI failing value 0.15524589105604497 is reproduced exactly at tangent dx=0.3105 (= the exact-^ derivative 0.5·dx), with the FD-of-fastpower reference at 0.1555 — i.e. it is fastpower's primal error, not a wrong rule.

Fix

  • Seed the RNG (Random.Xoshiro(0)) so the randomized test is reproducible.
  • Raise the tolerance to atol=1e-3, rtol=1e-2, consistent with fastpower's documented accuracy. This has zero failures across all ~6.5M tangent draws in the grid, while a genuinely wrong rule would still be off by O(1) relative and is not masked.
  • Add Random to the test [extras]/[targets] and a Random = "1" [compat] entry.

This is the principled fix, not a blanket tolerance loosening: the rule's true error is zero; the only thing being measured is the approximation built into fastpower itself.
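As a rough sanity check on the claim that a genuinely wrong rule is not masked, the sketch below compares the CI reference value against a hypothetical derivative that is off by a factor of 2. The factor of 2 is an invented stand-in for an O(1) relative error, not something observed in the repository.

```julia
reference = 0.155                      # value printed in the CI log
wrong = 2 * 0.15524589105604497        # hypothetical rule output, off by a factor of 2

# Same tolerance the PR proposes for the Enzyme tests.
masked = isapprox(reference, wrong; atol = 1.0e-3, rtol = 1.0e-2)

println("wrong rule accepted at new tolerance: ", masked)
```

The comparison is rejected, because the difference (about 0.155) is far above the bound `max(1e-3, 1e-2 * 0.31)`.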

Verification (run locally)

Deps resolved match CI: Enzyme 0.13.164, EnzymeTestUtils 0.2.8, FiniteDifferences 0.12.34.

  • Reproduced the failure through the real test_forward with rng=Xoshiro(16) (first tangent dx=0.31): old tolerance → 6 pass / 1 fail (matches CI); new tolerance → 7 pass / 0 fail.
  • Fixed Enzyme group via Pkg.test, julia 1.11: enzyme_forward_tests 52/52, enzyme_reverse_tests 36/36, tests passed.
  • Fixed Enzyme group via Pkg.test, julia lts (1.10): 52/52, 36/36, tests passed.
  • Seeded forward test is deterministic: 52/52 across 3 repeats.
  • Runic: clean (no diff) on both edited files.

Note on the other red lanes in the same run

tests / Core (julia 1) and tests / Core (julia lts) were red in the same run but are not code failures: both ran on self-hosted-4vcpu-8gb (smcsd) runners squatting on the ubuntu-latest label; the "Run tests" step emitted zero log output and never recorded a conclusion (runner OOM/lost-communication while precompiling the Mooncake+Enzyme+ReverseDiff stack in 8 GB). Locally the Core group passes cleanly (fast_log2 1200/1200, fast_pow 5/5, other_ad_engines 4/4, all AD-engine derivative comparisons rel=0.0) on both julia 1.11 and lts. That is a runner-capacity infra issue, out of scope for this PR.

🤖 Generated with Claude Code

…hed tolerance

The Enzyme `@easy_rule` returns the *exact* `^` derivative, but EnzymeTestUtils
`test_forward`/`test_reverse` compare it against finite differences of the
*approximate* `fastpower` primal. Because `fastpower` routes through a Float32
`fastlog2` polynomial, the *slope* of its primal differs from the exact slope by
~1e-2 relative near x=1 (measured: exact d/dx = 0.5 vs FD-of-fastpower = 0.5066,
i.e. 1.3e-2 relative), even at points like (1.0, 0.5) where the primal *value* is
exact. So the FD reference is off from the exact rule by `fastpower`'s inherent
approximation error, not by any rule bug. The old atol=1e-4, rtol=1e-3 sat below
that gap, so whether the lane passed depended on the random tangents drawn from
the global RNG and it went red intermittently (~4% of draws).

Two-part fix:

1. Determinism via StableRNG, not Xoshiro. Seeding the global RNG / `Xoshiro`
   does not actually pin the test, because those streams can change across Julia
   versions, so the flake could reappear on a new Julia. `StableRNGs.StableRNG`
   yields a stream guaranteed identical across Julia versions, passed as the
   `rng=` keyword that EnzymeTestUtils accepts.

2. Tolerance matched to fastpower's documented accuracy (atol=1e-3, rtol=1e-2),
   not reverted to the tight 1e-4/1e-3. Empirically the inherent gap is real: with
   the tight tolerance, 8/10 candidate StableRNG seeds pass the forward grid 52/52
   but seeds 123 and 31415 fail, and the all-seeds failure boundary is rtol~2e-3.
   Reverting to rtol=1e-3 would only "pass" by cherry-picking a lucky seed, which
   would hide the genuine (benign, expected) primal-approximation error. The chosen
   rtol=1e-2 sits ~5x above the measured worst-case relative discrepancy yet far
   below the O(1) relative error a genuinely wrong derivative rule would produce,
   so real regressions are still caught. Verified deterministic forward 52/52 +
   reverse 36/36 across 3 repeats on both Julia 1 and lts, and 52/52 for all 10
   candidate seeds on lts (so the tolerance is seed-independent, not seed-luck).

Swap the test dep Random -> StableRNGs in [extras]/[targets].test/[compat].

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ChrisRackauckas-Claude force-pushed the fix-enzyme-tolerance-rng-flake branch from 345fcc7 to 9328417 (June 24, 2026 12:26)
@ChrisRackauckas-Claude

Contributor Author

Updated to use StableRNGs.jl instead of Random.Xoshiro (per maintainer directive). Force-pushed 345fcc7 -> 9328417. Only our own commit was replaced, and force-with-lease confirmed the remote tip was unchanged.

Why StableRNG, not Xoshiro: seeding the global RNG / Xoshiro(0) does not actually pin the test, because those streams can change across Julia versions, so the flake could silently reappear on a new Julia. StableRNGs.StableRNG(seed) yields a stream guaranteed identical across Julia versions; it is passed as the rng= keyword that EnzymeTestUtils.test_forward/test_reverse accept. Seed used: StableRNG(123).
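For readers who have not used the `rng=` keyword, the call shape described above might look roughly like the sketch below. The argument layout is inferred from the testset name in the CI log and the point (1.0, 0.5) mentioned earlier; it is an assumption about the test file, not a copy of it, and it needs Enzyme, EnzymeTestUtils, FastPower, and StableRNGs installed.

```julia
using Enzyme, EnzymeTestUtils, FastPower, StableRNGs

# A fresh StableRNG with a fixed seed yields the same tangent draws on every run.
rng = StableRNG(123)

# Assumed layout: Duplicated return, x active (Duplicated), y constant (Const).
test_forward(
    fastpower, Duplicated, (1.0, Duplicated), (0.5, Const);
    rng = rng, atol = 1.0e-3, rtol = 1.0e-2,
)
```

Constructing the generator at the call site (rather than sharing one across tests) keeps each test's draws independent of test ordering.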

Tolerance decision — kept atol=1e-3, rtol=1e-2, did NOT revert to the tight 1e-4/1e-3. I measured whether a stable seed lets the tight tolerance pass, and it does not robustly:

  • The exact rule vs FD-of-approximate-primal gap is inherent: at (1.0, 0.5), exact d/dx = 0.5 but central FD of the fastpower primal gives 0.5066 — a 1.3e-2 relative slope error — even though fastpower(1.0,0.5) is value-exact. This is the Float32 fastlog2 polynomial's slope error, not a rule bug.
  • Forward grid over 10 candidate StableRNG seeds at the tight atol=1e-4, rtol=1e-3: 8/10 pass 52/52, but seeds 123 and 31415 fail. The all-seeds-pass boundary is at rtol ~= 2e-3.
  • Reverting to rtol=1e-3 would only "pass" by cherry-picking a lucky seed, which hides the genuine (benign, expected) approximation error. The chosen rtol=1e-2 sits ~5x above the measured worst-case relative discrepancy yet far below the O(1) relative error a genuinely wrong rule would show, so real regressions are still caught.

Verification (actual test files, not a harness): deterministic across 3 repeats each on Julia 1 (1.12.5, Enzyme 0.13.164) and lts (1.10.11, Enzyme 0.13.166): FORWARD 52/52, REVERSE 36/36. On lts the chosen tolerance also passes 52/52 for all 10 candidate seeds, confirming it is seed-independent rather than seed-luck.

Ignore until reviewed by @ChrisRackauckas.

🤖 Generated with Claude Code
