Fix flaky Enzyme test_forward/test_reverse tolerance (RNG-dependent vs fastpower approximation) #58
…hed tolerance

The Enzyme `@easy_rule` returns the *exact* `^` derivative, but EnzymeTestUtils `test_forward`/`test_reverse` compare it against finite differences of the *approximate* `fastpower` primal. Because `fastpower` routes through a Float32 `fastlog2` polynomial, the *slope* of its primal differs from the exact slope by ~1e-2 relative near x=1 (measured: exact d/dx = 0.5 vs FD-of-fastpower = 0.5066, i.e. 1.3e-2 relative), even at points like (1.0, 0.5) where the primal *value* is exact. So the FD reference is off from the exact rule by `fastpower`'s inherent approximation error, not by any rule bug.

The old atol=1e-4, rtol=1e-3 sat below that gap, so whether the lane passed depended on the random tangents drawn from the global RNG and it went red intermittently (~4% of draws).

Two-part fix:

1. Determinism via StableRNG, not Xoshiro. Seeding the global RNG / `Xoshiro` does not actually pin the test, because those streams can change across Julia versions, so the flake could reappear on a new Julia. `StableRNGs.StableRNG` yields a stream guaranteed identical across Julia versions, passed as the `rng=` keyword that EnzymeTestUtils accepts.
2. Tolerance matched to fastpower's documented accuracy (atol=1e-3, rtol=1e-2), not reverted to the tight 1e-4/1e-3. Empirically the inherent gap is real: with the tight tolerance, 8/10 candidate StableRNG seeds pass the forward grid 52/52 but seeds 123 and 31415 fail, and the all-seeds failure boundary is rtol~2e-3. Reverting to rtol=1e-3 would only "pass" by cherry-picking a lucky seed, which would hide the genuine (benign, expected) primal-approximation error. The chosen rtol=1e-2 sits ~5x above the measured worst-case relative discrepancy yet far below the O(1) relative error a genuinely wrong derivative rule would produce, so real regressions are still caught.

Verified deterministic forward 52/52 + reverse 36/36 across 3 repeats on both Julia 1 and lts, and 52/52 for all 10 candidate seeds on lts (so the tolerance is seed-independent, not seed-luck). Swap the test dep Random -> StableRNGs in [extras]/[targets].test/[compat].

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
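The two-part fix described in the commit message could look roughly like the sketch below. It is not the repository's actual test file: the seed value `123`, the evaluation point `(1.0, 0.5)`, and the choice of activities are illustrative assumptions, and the only details taken from the text are the `rng=` keyword, `StableRNG`, and the `atol=1e-3, rtol=1e-2` tolerance.

```julia
using Enzyme, EnzymeTestUtils, FastPower, StableRNGs

# Assumption: any fixed seed works here; the commit message does not say
# which seed the real tests use.
rng = StableRNG(123)

# Illustrative evaluation point (the one discussed in the commit message).
x, y = 1.0, 0.5

# Forward mode: the exact-derivative rule is compared against finite
# differences of the approximate fastpower primal, so the tolerance is
# set to fastpower's accuracy rather than to the rule's (zero) error.
test_forward(FastPower.fastpower, Duplicated, (x, Duplicated), (y, Duplicated);
    rng = rng, atol = 1e-3, rtol = 1e-2)

# Reverse mode with the same RNG object and tolerance.
test_reverse(FastPower.fastpower, Active, (x, Active), (y, Active);
    rng = rng, atol = 1e-3, rtol = 1e-2)
```

Passing one `StableRNG` object through every call keeps the sequence of random tangents fixed for the whole file, which is what makes a pass or failure reproducible from run to run.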
345fcc7 to
9328417
Updated to use StableRNGs.jl instead of

Why StableRNG, not Xoshiro: seeding the global RNG /

Tolerance decision: kept
Verification (actual test files, not a harness): deterministic across 3 repeats each on Julia 1 (1.12.5, Enzyme 0.13.164) and lts (1.10.11, Enzyme 0.13.166): FORWARD 52/52, REVERSE 36/36. On lts the chosen tolerance also passes 52/52 for all 10 candidate seeds, confirming it is seed-independent rather than seed-luck.

Ignore until reviewed by @ChrisRackauckas.

🤖 Generated with Claude Code
Please ignore until reviewed by @ChrisRackauckas.
Problem
`tests / Enzyme (julia 1)` on `main` went red on the v1.3.2 run (it was green 8 days earlier, with identical test source). The failure:

Root cause

FastPower's Enzyme `@easy_rule` returns the exact `^` derivative (`y*fastpower(x,y-1)`, `Ω*log(x)`). EnzymeTestUtils' `test_forward`/`test_reverse` compare that rule against finite differences of the deliberately approximate `fastpower` primal. So the measured gap is exactly `fastpower`'s own primal approximation error (~1e-3 relative, the same envelope asserted in `test/fast_pow_tests.jl`), which sat right on top of the old `atol=1e-4, rtol=1e-3`.

Whether the lane passed depended on the random perturbation `test_forward` drew from the global RNG. An analytic sweep over the tangent grid the test samples (tangents in `-9:0.01:9`, central FD-5, at x=1.0, y=0.5) shows:

The CI failing value `0.15524589105604497` is reproduced exactly at tangent `dx=0.3105` (= the exact-`^` derivative `0.5·dx`), with the FD-of-`fastpower` reference at `0.1555`; i.e. it is `fastpower`'s primal error, not a wrong rule.

Fix
- Seed the RNG (`Random.Xoshiro(0)`) so the randomized test is reproducible.
- Use `atol=1e-3, rtol=1e-2`, consistent with `fastpower`'s documented accuracy. This has zero failures across all ~6.5M tangent draws in the grid, while a genuinely wrong rule would still be off by O(1) relative and is not masked.
- Add `Random` to the test `[extras]`/`[targets]` and a `Random = "1"` `[compat]` entry.

This is the principled fix, not a blanket tolerance loosening: the rule's true error is zero; the only thing being measured is the approximation built into `fastpower` itself.

Verification (run locally)
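As a hand check on the tolerance change, the failing CI comparison can be replayed with Julia's `isapprox`. This is a sketch under two assumptions: that the test's comparison behaves like `Base.isapprox` with the given `atol`/`rtol`, and that the rounded FD reference `0.1555` quoted in the root-cause text is close enough to the real one, so the computed gap is only approximate.

```julia
# Hand check of the old vs. new tolerance on the failing CI value.
# Assumption: the comparison behaves like Base.isapprox, i.e.
#   |a - b| <= max(atol, rtol * max(|a|, |b|)).
rule_value = 0.15524589105604497  # exact-rule result reported by CI
fd_value = 0.1555                 # FD-of-fastpower reference (rounded in the text)

gap = abs(rule_value - fd_value)
rel_gap = gap / max(abs(rule_value), abs(fd_value))

# Old tolerance: the threshold is max(1e-4, 1e-3 * 0.1555), which is below the gap.
old_ok = isapprox(rule_value, fd_value; atol = 1e-4, rtol = 1e-3)

# New tolerance: the threshold is max(1e-3, 1e-2 * 0.1555), which is above the gap.
new_ok = isapprox(rule_value, fd_value; atol = 1e-3, rtol = 1e-2)

println("gap = ", gap, ", relative gap = ", rel_gap)
println("old tolerance passes: ", old_ok)
println("new tolerance passes: ", new_ok)
```

With these inputs the relative gap lands between 1e-3 and 2e-3, so the old tolerance rejects the pair and the new one accepts it, while a derivative that was wrong by O(1) relative would still fail under both.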
- Deps resolved match CI: Enzyme 0.13.164, EnzymeTestUtils 0.2.8, FiniteDifferences 0.12.34.
- `test_forward` with `rng=Xoshiro(16)` (first tangent dx=0.31): old tolerance → 6 pass / 1 fail (matches CI); new tolerance → 7 pass / 0 fail.
- `Pkg.test`, julia 1.11: `enzyme_forward_tests` 52/52, `enzyme_reverse_tests` 36/36, tests passed.
- `Pkg.test`, julia lts (1.10): 52/52, 36/36, tests passed.

Note on the other red lanes in the same run
`tests / Core (julia 1)` and `tests / Core (julia lts)` were red in the same run but are not code failures: both ran on `self-hosted-4vcpu-8gb` (smcsd) runners squatting on the `ubuntu-latest` label; the "Run tests" step emitted zero log output and never recorded a conclusion (runner OOM/lost-communication while precompiling the Mooncake+Enzyme+ReverseDiff stack in 8 GB). Locally the Core group passes cleanly (`fast_log2` 1200/1200, `fast_pow` 5/5, `other_ad_engines` 4/4, all AD-engine derivative comparisons rel=0.0) on both julia 1.11 and lts. That is a runner-capacity infra issue, out of scope for this PR.

🤖 Generated with Claude Code